Understanding Fine-tuning
Types of Fine-tuning
Not all fine-tuning is created equal. Let's explore the different approaches and when to use each.
Full Fine-tuning vs PEFT
Full Fine-tuning
Updates all model parameters during training.
| Pros | Cons |
|---|---|
| Maximum potential improvement | Requires massive GPU memory |
| Complete model customization | Risk of catastrophic forgetting |
| Works for any task | Expensive and slow |
Hardware needed: 80GB+ of VRAM for a 7B model; multiple A100-class GPUs for a 70B model
Parameter-Efficient Fine-Tuning (PEFT)
Updates only a small subset of parameters (typically <1%).
| Pros | Cons |
|---|---|
| 10-100x less memory | Slightly lower ceiling than full fine-tuning |
| Fast training | Requires understanding of adapters |
| Base weights stay frozen, so far less catastrophic forgetting | |
| Easy to swap adapters | |
Hardware needed: 8-24GB VRAM for most models with QLoRA
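Why the memory footprint drops so much is easiest to see in code. Below is a minimal sketch of loading a base model in 4-bit for QLoRA-style training, assuming the Hugging Face transformers and bitsandbytes libraries; the checkpoint name and quantization settings are illustrative placeholders, not recommendations.

```python
# Minimal QLoRA-style loading sketch (checkpoint name and settings are illustrative)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",              # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```

With the base weights quantized and frozen, only the small adapter matrices added on top are trained, which is what brings a 7B-class model into the 8-24GB range.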
Training Objectives: SFT vs DPO vs RLHF
Supervised Fine-Tuning (SFT)
The simplest approach: train on (instruction, response) pairs.
```python
# SFT training data: (instruction, response) pairs
dataset = [
    {"instruction": "Write a poem about AI", "response": "Silicon dreams..."},
    {"instruction": "Explain quantum computing", "response": "Quantum computing uses..."},
]
```
Best for: Teaching new skills, domain knowledge, output formats
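In practice, each (instruction, response) pair is rendered into a single training string before tokenization. Here is a minimal plain-Python sketch of that step; the Alpaca-style template is an assumption, and real projects usually apply the tokenizer's chat template instead.

```python
# Render (instruction, response) pairs into training text (template is illustrative)
dataset = [
    {"instruction": "Write a poem about AI", "response": "Silicon dreams..."},
    {"instruction": "Explain quantum computing", "response": "Quantum computing uses..."},
]

def format_example(example: dict) -> str:
    # Simple Alpaca-style prompt; swap in your tokenizer's chat template in practice
    return (
        "### Instruction:\n"
        f"{example['instruction']}\n\n"
        "### Response:\n"
        f"{example['response']}"
    )

texts = [format_example(ex) for ex in dataset]
print(texts[0])
```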
Direct Preference Optimization (DPO)
Train on preference pairs: which response is better?
```python
# DPO training data: preference pairs for the same prompt
dataset = [
    {
        "prompt": "Explain recursion",
        "chosen": "Recursion is when a function calls itself...",  # Good response
        "rejected": "Recursion is complicated. Look it up.",       # Bad response
    }
]
```
Best for: Improving response quality, alignment, tone
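For intuition on what DPO actually optimizes: it applies a logistic loss to how much more the policy prefers the chosen response over the rejected one, relative to a frozen reference model. A minimal PyTorch sketch, assuming the per-response log-probabilities have already been summed over tokens:

```python
# Minimal DPO loss sketch (inputs are per-example summed log-probs of each response)
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more the policy prefers "chosen" over "rejected", relative to the reference
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # Logistic loss: push the policy's preference margin above the reference's
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)
```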
RLHF (Reinforcement Learning from Human Feedback)
The original alignment technique, more complex than DPO.
Pipeline: SFT Model → Reward Model Training → PPO Optimization
Best for: Complex alignment requirements (typically used by labs)
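To make the middle stage concrete: the reward model is usually trained with a Bradley-Terry pairwise loss on the same kind of preference pairs used for DPO. A minimal PyTorch sketch, assuming a scalar reward score per response:

```python
# Bradley-Terry pairwise loss used to train the reward model (sketch)
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    # Maximize the probability that the chosen response outscores the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar reward scores for a batch of three preference pairs
loss = reward_model_loss(torch.tensor([1.2, 0.4, 2.0]),
                         torch.tensor([0.3, 0.5, 1.1]))
print(loss)
```

The trained reward model then scores the policy's generations during the PPO stage.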
Comparison: When to Use Each
| Approach | Complexity | Hardware | Use Case |
|---|---|---|---|
| SFT only | Low | 8GB+ | Teaching new tasks |
| SFT + DPO | Medium | 16GB+ | Quality + alignment |
| Full RLHF | High | 80GB+ | Research, complex alignment |
The Modern Pipeline (2025)
Most practitioners follow this pipeline:
Base Model (e.g., Llama 3.2) → SFT on a custom dataset (teach skills) → DPO on preference data (improve quality)
Pro Tip: Start with SFT. Only add DPO if you need to improve response quality or alignment. Skip RLHF unless you're doing research.
Key Terminology
| Term | Definition |
|---|---|
| Adapter | Small trainable module added to frozen model |
| LoRA | Low-Rank Adaptation - most popular PEFT method |
| Rank (r) | Size of LoRA matrices (higher = more capacity) |
| Alpha | Scaling factor for LoRA updates |
| Target modules | Which layers to add adapters to |
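These terms map directly onto the configuration object in the Hugging Face peft library. A minimal sketch, with illustrative values and a placeholder checkpoint name rather than recommendations:

```python
# Mapping the terminology above onto a peft LoraConfig (values are illustrative)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=16,                                 # rank: size of the low-rank matrices
    lora_alpha=32,                        # alpha: scaling factor for the LoRA update
    target_modules=["q_proj", "v_proj"],  # target modules: attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")  # placeholder
peft_model = get_peft_model(base_model, lora_config)  # wraps the frozen base with adapters
peft_model.print_trainable_parameters()               # typically well under 1% of parameters
```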
Next, we'll dive deep into PEFT methods and understand how LoRA works.