Evaluation & Deployment
Common Issues & Fixes
Fine-tuning doesn't always go smoothly. Let's diagnose and fix the most common problems.
Loss Issues
Loss Explodes (NaN or Inf)
Symptoms: Loss becomes NaN or infinity early in training
Causes & Fixes:
# Problem: Learning rate too high
# Fix: Lower learning rate
learning_rate = 5e-6 # Start low, increase if needed
# Problem: Gradient overflow
# Fix: Enable gradient clipping
max_grad_norm = 1.0
# Problem: Mixed precision issues
# Fix: Use bf16 instead of fp16
bf16 = True # More stable than fp16
fp16 = False
# Problem: Bad data (empty strings, special characters)
# Fix: Validate dataset
def validate_example(example):
    if not example.get("text") or len(example["text"]) < 10:
        return False
    if "NaN" in example["text"] or "\x00" in example["text"]:
        return False
    return True

dataset = dataset.filter(validate_example)
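Put together, the stability fixes above map onto a handful of trainer arguments. A minimal sketch, assuming you are using TRL's SFTConfig (the values are starting points, not gospel):

from trl import SFTConfig

# Conservative settings for when loss is blowing up
training_args = SFTConfig(
    output_dir="outputs",
    learning_rate=5e-6,   # start low, raise once training is stable
    max_grad_norm=1.0,    # clip exploding gradients
    bf16=True,            # more numerically stable than fp16
    fp16=False,
    warmup_ratio=0.03,    # a short warmup also helps avoid early loss spikes
    logging_steps=10,     # log often so NaNs are caught quickly
)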
Loss Doesn't Decrease
Symptoms: Loss stays flat or oscillates
Fixes:
# Problem: Learning rate too low
learning_rate = 2e-4 # Try increasing
# Problem: LoRA rank too low
r = 32 # Increase from 8 or 16
# Problem: Not enough target modules
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]  # Include all linear layers
# Problem: Dataset too small or homogeneous
# Fix: Add more diverse examples
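If the loss stays flat, the adapter may simply lack capacity. A minimal sketch of a higher-capacity adapter configuration, assuming PEFT's LoraConfig (the exact values are illustrative):

from peft import LoraConfig

peft_config = LoraConfig(
    r=32,              # higher rank = more trainable capacity
    lora_alpha=64,     # keep alpha in the neighborhood of 1-2x the rank
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)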
Loss Decreases Then Increases
Symptoms: Training loss keeps falling while evaluation loss bottoms out and starts rising, the classic overfitting pattern
Fixes:
# Solution 1: Early stopping
from transformers import EarlyStoppingCallback
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
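Note that EarlyStoppingCallback only triggers if the trainer is evaluating periodically and tracking a best checkpoint. A sketch of the fields the training_args above would need, using transformers.TrainingArguments names (eval_strategy is spelled evaluation_strategy on older versions):

# Early stopping needs periodic evaluation plus a tracked "best" metric
training_args = SFTConfig(
    output_dir="outputs",
    eval_strategy="steps",            # "evaluation_strategy" on older transformers
    eval_steps=50,
    save_strategy="steps",            # must match eval_strategy for load_best_model_at_end
    save_steps=50,
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)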
# Solution 2: More regularization
lora_dropout = 0.1 # Add dropout
weight_decay = 0.01 # Add weight decay
# Solution 3: Reduce epochs
num_train_epochs = 1 # Often 1-2 is enough
# Solution 4: More training data
# Increase dataset size or use data augmentation
Memory Issues
Out of Memory (OOM)
Symptoms: CUDA out of memory error
Quick Fixes:
# Reduce batch size
per_device_train_batch_size = 1 # Start with 1
# Increase gradient accumulation to maintain effective batch size
gradient_accumulation_steps = 8 # Effective batch = 1 * 8 = 8
# Reduce sequence length
max_seq_length = 512 # Shorter sequences use less memory
# Enable gradient checkpointing
use_gradient_checkpointing = "unsloth" # Trades compute for memory
# Use 4-bit quantization
load_in_4bit = True
# Clear CUDA cache before training
import torch
torch.cuda.empty_cache()
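To know how much headroom you actually have, it helps to watch peak GPU memory directly; a small helper using plain PyTorch:

import torch

# Report current and peak GPU memory in GB
def print_gpu_memory():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    peak = torch.cuda.max_memory_allocated() / 1e9
    print(f"allocated={allocated:.2f} GB, reserved={reserved:.2f} GB, peak={peak:.2f} GB")

print_gpu_memory()  # call before and after trainer.train() to compare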
Memory Grows During Training
Symptoms: Memory usage increases each step
# Fix: Disable caching during training
model.config.use_cache = False
# Fix: Clear gradients properly (usually handled by trainer)
optimizer.zero_grad(set_to_none=True)
# Fix: Check for memory leaks in callbacks
# Avoid storing tensors in lists
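The most common callback leak is appending live tensors (which keep the computation graph alive) instead of plain Python numbers. A sketch of the safe pattern, using a hypothetical LossHistoryCallback:

from transformers import TrainerCallback

# Hypothetical callback: store plain floats, never tensors
class LossHistoryCallback(TrainerCallback):
    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            # logs["loss"] is already a float; if you ever handle tensors here,
            # call .detach().item() before storing them
            self.losses.append(float(logs["loss"]))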
Output Quality Issues
Repetitive Outputs
Symptoms: Model repeats phrases or sentences
# During inference, adjust generation parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,          # required for temperature/top_p to take effect
    repetition_penalty=1.2,  # Penalize repetition
    no_repeat_ngram_size=3,  # Don't repeat 3-grams
    temperature=0.7,         # Add randomness
    top_p=0.9,               # Nucleus sampling
)
# During training, check for repetitive training data
from collections import Counter
def check_repetition(dataset):
    texts = [ex["text"] for ex in dataset]
    duplicates = [t for t, count in Counter(texts).items() if count > 1]
    print(f"Found {len(duplicates)} duplicate examples")
    return duplicates
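If duplicates show up, one way to drop them (a sketch, assuming your examples have a "text" field) is to round-trip through pandas:

from datasets import Dataset

# Deduplicate on the raw text, keeping the first occurrence of each example
df = dataset.to_pandas()
dataset = Dataset.from_pandas(df.drop_duplicates(subset="text"), preserve_index=False)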
Catastrophic Forgetting
Symptoms: Model loses general capabilities
# Fix 1: Lower learning rate
learning_rate = 1e-6 # Very conservative
# Fix 2: Keep the adapter's influence modest
# (the LoRA update is scaled by alpha/r, so a large alpha relative to rank
# amplifies deviation from the base model)
r = 16
lora_alpha = 16 # alpha roughly equal to rank is the conservative choice
# Fix 3: Mix in general data
from datasets import concatenate_datasets, load_dataset
# Add some general instruction data
general_dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")
combined = concatenate_datasets([domain_dataset, general_dataset])
combined = combined.shuffle(seed=42)
# Fix 4: Use higher beta in DPO (stay closer to reference)
beta = 0.5 # Higher = less deviation
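A cheap way to catch forgetting early is to spot-check the fine-tuned model on a few general prompts it should still handle. A minimal sketch (the prompts are just examples; model and tokenizer are assumed to be loaded as elsewhere in this section):

# Spot check: the model should still answer general questions sensibly
general_prompts = [
    "What is the capital of France?",
    "Explain what a for loop does in one sentence.",
]

for prompt in general_prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))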
Wrong Format/Style
Symptoms: Model doesn't follow expected format
# Fix: Ensure training data matches expected format exactly
def format_example(example):
    # Use the EXACT chat template your model expects
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
# Verify format is correct
sample = dataset[0]
print(format_example(sample)["text"])
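Beyond eyeballing one example, it is worth asserting a couple of basic properties of the formatted text; a small sketch (which markers to check depends on your model's chat template):

formatted = format_example(dataset[0])["text"]

# Both turns should survive formatting, and most templates append an EOS token
assert "You are a helpful assistant." in formatted
assert dataset[0]["output"] in formatted
if tokenizer.eos_token:
    print("Ends with EOS:", formatted.rstrip().endswith(tokenizer.eos_token))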
Training Instability
Gradient Issues
# Monitor gradients
def log_gradients(model):
    total_norm = 0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5
    print(f"Gradient norm: {total_norm}")
# If gradients are too large
max_grad_norm = 0.5 # Clip more aggressively
# If gradients are too small (vanishing)
learning_rate = 5e-5 # Increase learning rate
Unstable DPO Training
# DPO-specific issues
dpo_config = DPOConfig(
    # Start with conservative beta
    beta=0.1,
    # Lower learning rate than SFT
    learning_rate=5e-7,
    # Ensure reference model is frozen
    # (handled automatically by DPOTrainer)
    # Monitor reward margins
    logging_steps=10,
)
# If margins go negative, check data quality
# chosen should clearly be better than rejected
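One cheap data check before blaming the optimizer: make sure no preference pairs are degenerate. A sketch assuming the usual chosen/rejected columns and a dataset named dpo_dataset:

# Drop pairs where chosen and rejected are identical or empty -- they teach nothing
def is_valid_pair(example):
    chosen, rejected = example["chosen"], example["rejected"]
    if not chosen or not rejected:
        return False
    return chosen.strip() != rejected.strip()

dpo_dataset = dpo_dataset.filter(is_valid_pair)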
Debugging Checklist
When training fails:
- Check data first
  - No empty examples
  - Correct format
  - Reasonable lengths
  - No special characters causing issues
- Verify configuration
  - Learning rate appropriate for task
  - Batch size fits in memory
  - Correct target modules for model
- Monitor metrics
  - Loss trending downward
  - No NaN values
  - Memory usage stable
- Test incrementally
  - Train on 100 examples first
  - Verify outputs before full training
  - Save checkpoints frequently
Tip: When debugging, reduce everything to the minimum (1 epoch, 100 examples, batch size 1) and verify the pipeline works before scaling up, as in the sketch below.
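A minimal smoke-test setup along those lines, assuming the datasets/TRL objects used earlier in this section (names and values are illustrative):

# Tiny smoke test: 100 examples, a handful of steps, batch size 1
small_train = train_dataset.select(range(min(100, len(train_dataset))))

debug_args = SFTConfig(
    output_dir="debug-run",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=20,     # just enough to confirm the loss moves and nothing crashes
    logging_steps=1,
)

trainer = SFTTrainer(model=model, args=debug_args, train_dataset=small_train)
trainer.train()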
Next, let's learn how to deploy your fine-tuned model to Ollama.