Evaluation & Deployment
Common Issues & Fixes
Fine-tuning doesn't always go smoothly. Let's diagnose and fix the most common problems.
Loss Issues
Loss Explodes (NaN or Inf)
Symptoms: Loss becomes NaN or infinity early in training
Causes & Fixes:
# Problem: Learning rate too high
# Fix: Lower learning rate
learning_rate = 5e-6 # Start low, increase if needed
# Problem: Gradient overflow
# Fix: Enable gradient clipping
max_grad_norm = 1.0
# Problem: Mixed precision issues
# Fix: Use bf16 instead of fp16
bf16 = True # More stable than fp16
fp16 = False
# Problem: Bad data (empty strings, special characters)
# Fix: Validate dataset
def validate_example(example):
    if not example.get("text") or len(example["text"]) < 10:
        return False
    if "NaN" in example["text"] or "\x00" in example["text"]:
        return False
    return True

dataset = dataset.filter(validate_example)
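Put together, the stability fixes above map onto a handful of trainer arguments. A minimal sketch, assuming you are using TRL's SFTConfig (the values are starting points, not gospel):

from trl import SFTConfig

# Conservative settings for when loss is blowing up
training_args = SFTConfig(
    output_dir="outputs",
    learning_rate=5e-6,   # start low, raise once training is stable
    max_grad_norm=1.0,    # clip exploding gradients
    bf16=True,            # more numerically stable than fp16
    fp16=False,
    warmup_ratio=0.03,    # a short warmup also helps avoid early loss spikes
    logging_steps=10,     # log often so NaNs are caught quickly
)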
Loss Doesn't Decrease
Symptoms: Loss stays flat or oscillates
Fixes:
# Problem: Learning rate too low
learning_rate = 2e-4 # Try increasing
# Problem: LoRA rank too low
r = 32 # Increase from 8 or 16
# Problem: Not enough target modules
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]  # Include all linear layers
# Problem: Dataset too small or homogeneous
# Fix: Add more diverse examples
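If the loss stays flat, the adapter may simply lack capacity. A minimal sketch of a higher-capacity adapter configuration, assuming PEFT's LoraConfig (the exact values are illustrative):

from peft import LoraConfig

peft_config = LoraConfig(
    r=32,              # higher rank = more trainable capacity
    lora_alpha=64,     # keep alpha in the neighborhood of 1-2x the rank
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)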
Loss Decreases Then Increases
Symptoms: Training loss keeps falling while evaluation loss bottoms out and starts rising, the classic overfitting pattern
Fixes:
# Solution 1: Early stopping
from transformers import EarlyStoppingCallback
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
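Note that EarlyStoppingCallback only triggers if the trainer is evaluating periodically and tracking a best checkpoint. A sketch of the fields the training_args above would need, using transformers.TrainingArguments names (eval_strategy is spelled evaluation_strategy on older versions):

# Early stopping needs periodic evaluation plus a tracked "best" metric
training_args = SFTConfig(
    output_dir="outputs",
    eval_strategy="steps",            # "evaluation_strategy" on older transformers
    eval_steps=50,
    save_strategy="steps",            # must match eval_strategy for load_best_model_at_end
    save_steps=50,
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)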
# Solution 2: More regularization
lora_dropout = 0.1 # Add dropout
weight_decay = 0.01 # Add weight decay
# Solution 3: Reduce epochs
num_train_epochs = 1 # Often 1-2 is enough
# Solution 4: More training data
# Increase dataset size or use data augmentation
Memory Issues
Out of Memory (OOM)
Symptoms: CUDA out of memory error
Quick Fixes:
# Reduce batch size
per_device_train_batch_size = 1 # Start with 1
# Increase gradient accumulation to maintain effective batch size
gradient_accumulation_steps = 8 # Effective batch = 1 * 8 = 8
# Reduce sequence length
max_seq_length = 512 # Shorter sequences use less memory
# Enable gradient checkpointing
use_gradient_checkpointing = "unsloth" # Trades compute for memory
# Use 4-bit quantization
load_in_4bit = True
# Clear CUDA cache before training
import torch
torch.cuda.empty_cache()
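To know how much headroom you actually have, it helps to watch peak GPU memory directly; a small helper using plain PyTorch:

import torch

# Report current and peak GPU memory in GB
def print_gpu_memory():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    peak = torch.cuda.max_memory_allocated() / 1e9
    print(f"allocated={allocated:.2f} GB, reserved={reserved:.2f} GB, peak={peak:.2f} GB")

print_gpu_memory()  # call before and after trainer.train() to compare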
Memory Grows During Training
Symptoms: Memory usage increases each step
# Fix: Disable caching during training
model.config.use_cache = False
# Fix: Clear gradients properly (usually handled by trainer)
optimizer.zero_grad(set_to_none=True)
# Fix: Check for memory leaks in callbacks
# Avoid storing tensors in lists
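The most common callback leak is appending live tensors (which keep the computation graph alive) instead of plain Python numbers. A sketch of the safe pattern, using a hypothetical LossHistoryCallback:

from transformers import TrainerCallback

# Hypothetical callback: store plain floats, never tensors
class LossHistoryCallback(TrainerCallback):
    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            # logs["loss"] is already a float; if you ever handle tensors here,
            # call .detach().item() before storing them
            self.losses.append(float(logs["loss"]))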
Output Quality Issues
Repetitive Outputs
Symptoms: Model repeats phrases or sentences
# During inference, adjust generation parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,          # required for temperature/top_p to take effect
    repetition_penalty=1.2,  # Penalize repetition
    no_repeat_ngram_size=3,  # Don't repeat 3-grams
    temperature=0.7,         # Add randomness
    top_p=0.9,               # Nucleus sampling
)
# During training, check for repetitive training data
from collections import Counter
def check_repetition(dataset):
    texts = [ex["text"] for ex in dataset]
    duplicates = [t for t, count in Counter(texts).items() if count > 1]
    print(f"Found {len(duplicates)} duplicate examples")
    return duplicates
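If duplicates show up, one way to drop them (a sketch, assuming your examples have a "text" field) is to round-trip through pandas:

from datasets import Dataset

# Deduplicate on the raw text, keeping the first occurrence of each example
df = dataset.to_pandas()
dataset = Dataset.from_pandas(df.drop_duplicates(subset="text"), preserve_index=False)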
Catastrophic Forgetting
Symptoms: Model loses general capabilities
# Fix 1: Lower learning rate
learning_rate = 1e-6 # Very conservative
# Fix 2: Keep the adapter's influence modest
# (the LoRA update is scaled by alpha/r, so a large alpha relative to rank
# amplifies deviation from the base model)
r = 16
lora_alpha = 16 # alpha roughly equal to rank is the conservative choice
# Fix 3: Mix in general data
from datasets import concatenate_datasets, load_dataset
# Add some general instruction data
general_dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")
combined = concatenate_datasets([domain_dataset, general_dataset])
combined = combined.shuffle(seed=42)
# Fix 4: Use higher beta in DPO (stay closer to reference)
beta = 0.5 # Higher = less deviation
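A cheap way to catch forgetting early is to spot-check the fine-tuned model on a few general prompts it should still handle. A minimal sketch (the prompts are just examples; model and tokenizer are assumed to be loaded as elsewhere in this section):

# Spot check: the model should still answer general questions sensibly
general_prompts = [
    "What is the capital of France?",
    "Explain what a for loop does in one sentence.",
]

for prompt in general_prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))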
Wrong Format/Style
Symptoms: Model doesn't follow expected format
# Fix: Ensure training data matches expected format exactly
def format_example(example):
    # Use the EXACT chat template your model expects
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
# Verify format is correct
sample = dataset[0]
print(format_example(sample)["text"])
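Beyond eyeballing one example, it is worth asserting a couple of basic properties of the formatted text; a small sketch (which markers to check depends on your model's chat template):

formatted = format_example(dataset[0])["text"]

# Both turns should survive formatting, and most templates append an EOS token
assert "You are a helpful assistant." in formatted
assert dataset[0]["output"] in formatted
if tokenizer.eos_token:
    print("Ends with EOS:", formatted.rstrip().endswith(tokenizer.eos_token))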
Training Instability
Gradient Issues
# Monitor gradients
def log_gradients(model):
    total_norm = 0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5
    print(f"Gradient norm: {total_norm}")
# If gradients are too large
max_grad_norm = 0.5 # Clip more aggressively
# If gradients are too small (vanishing)
learning_rate = 5e-5 # Increase learning rate
Unstable DPO Training
# DPO-specific issues
dpo_config = DPOConfig(
    # Start with conservative beta
    beta=0.1,
    # Lower learning rate than SFT
    learning_rate=5e-7,
    # Ensure reference model is frozen
    # (handled automatically by DPOTrainer)
    # Monitor reward margins
    logging_steps=10,
)
# If margins go negative, check data quality
# chosen should clearly be better than rejected
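One cheap data check before blaming the optimizer: make sure no preference pairs are degenerate. A sketch assuming the usual chosen/rejected columns and a dataset named dpo_dataset:

# Drop pairs where chosen and rejected are identical or empty -- they teach nothing
def is_valid_pair(example):
    chosen, rejected = example["chosen"], example["rejected"]
    if not chosen or not rejected:
        return False
    return chosen.strip() != rejected.strip()

dpo_dataset = dpo_dataset.filter(is_valid_pair)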
Debugging Checklist
When training fails:
- Check data first
  - No empty examples
  - Correct format
  - Reasonable lengths
  - No special characters causing issues
- Verify configuration
  - Learning rate appropriate for task
  - Batch size fits in memory
  - Correct target modules for model
- Monitor metrics
  - Loss trending downward
  - No NaN values
  - Memory usage stable
- Test incrementally
  - Train on 100 examples first
  - Verify outputs before full training
  - Save checkpoints frequently
Tip: When debugging, reduce everything to the minimum (1 epoch, 100 examples, batch size 1) and verify the pipeline works before scaling up, as in the sketch below.
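A minimal smoke-test setup along those lines, assuming the datasets/TRL objects used earlier in this section (names and values are illustrative):

# Tiny smoke test: 100 examples, a handful of steps, batch size 1
small_train = train_dataset.select(range(min(100, len(train_dataset))))

debug_args = SFTConfig(
    output_dir="debug-run",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=20,     # just enough to confirm the loss moves and nothing crashes
    logging_steps=1,
)

trainer = SFTTrainer(model=model, args=debug_args, train_dataset=small_train)
trainer.train()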
Next, let's learn how to deploy your fine-tuned model to Ollama.