Understanding Fine-tuning
Types of Fine-tuning
Not all fine-tuning is created equal. Let's explore the different approaches and when to use each.
Full Fine-tuning vs PEFT
Full Fine-tuning
Updates all model parameters during training.
| Pros | Cons |
|---|---|
| Maximum potential improvement | Requires massive GPU memory |
| Complete model customization | Risk of catastrophic forgetting |
| Works for any task | Expensive and slow |
Hardware needed: 80GB+ of VRAM for a 7B model; multiple A100-class GPUs for a 70B model
Parameter-Efficient Fine-Tuning (PEFT)
Updates only a small subset of parameters (typically <1%).
| Pros | Cons |
|---|---|
| 10-100x less memory | Slightly lower ceiling than full fine-tuning |
| Fast training | Requires understanding of adapters |
| Base weights stay frozen, so far less catastrophic forgetting | |
| Easy to swap adapters | |
Hardware needed: 8-24GB VRAM for most models with QLoRA
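Why the memory footprint drops so much is easiest to see in code. Below is a minimal sketch of loading a base model in 4-bit for QLoRA-style training, assuming the Hugging Face transformers and bitsandbytes libraries; the checkpoint name and quantization settings are illustrative placeholders, not recommendations.

```python
# Minimal QLoRA-style loading sketch (checkpoint name and settings are illustrative)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",              # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```

With the base weights quantized and frozen, only the small adapter matrices added on top are trained, which is what brings a 7B-class model into the 8-24GB range.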
Training Objectives: SFT vs DPO vs RLHF
Supervised Fine-Tuning (SFT)
The simplest approach: train on (instruction, response) pairs.
```python
# SFT training data: (instruction, response) pairs
dataset = [
    {"instruction": "Write a poem about AI", "response": "Silicon dreams..."},
    {"instruction": "Explain quantum computing", "response": "Quantum computing uses..."},
]
```
Best for: Teaching new skills, domain knowledge, output formats
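In practice, each (instruction, response) pair is rendered into a single training string before tokenization. Here is a minimal plain-Python sketch of that step; the Alpaca-style template is an assumption, and real projects usually apply the tokenizer's chat template instead.

```python
# Render (instruction, response) pairs into training text (template is illustrative)
dataset = [
    {"instruction": "Write a poem about AI", "response": "Silicon dreams..."},
    {"instruction": "Explain quantum computing", "response": "Quantum computing uses..."},
]

def format_example(example: dict) -> str:
    # Simple Alpaca-style prompt; swap in your tokenizer's chat template in practice
    return (
        "### Instruction:\n"
        f"{example['instruction']}\n\n"
        "### Response:\n"
        f"{example['response']}"
    )

texts = [format_example(ex) for ex in dataset]
print(texts[0])
```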
Direct Preference Optimization (DPO)
Train on preference pairs: which response is better?
```python
# DPO training data: preference pairs for the same prompt
dataset = [
    {
        "prompt": "Explain recursion",
        "chosen": "Recursion is when a function calls itself...",  # Good response
        "rejected": "Recursion is complicated. Look it up.",       # Bad response
    }
]
```
Best for: Improving response quality, alignment, tone
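For intuition on what DPO actually optimizes: it applies a logistic loss to how much more the policy prefers the chosen response over the rejected one, relative to a frozen reference model. A minimal PyTorch sketch, assuming the per-response log-probabilities have already been summed over tokens:

```python
# Minimal DPO loss sketch (inputs are per-example summed log-probs of each response)
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more the policy prefers "chosen" over "rejected", relative to the reference
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # Logistic loss: push the policy's preference margin above the reference's
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)
```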
RLHF (Reinforcement Learning from Human Feedback)
The original alignment technique, more complex than DPO.
Pipeline: SFT Model → Reward Model Training → PPO Optimization
Best for: Complex alignment requirements (typically used by labs)
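To make the middle stage concrete: the reward model is usually trained with a Bradley-Terry pairwise loss on the same kind of preference pairs used for DPO. A minimal PyTorch sketch, assuming a scalar reward score per response:

```python
# Bradley-Terry pairwise loss used to train the reward model (sketch)
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    # Maximize the probability that the chosen response outscores the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar reward scores for a batch of three preference pairs
loss = reward_model_loss(torch.tensor([1.2, 0.4, 2.0]),
                         torch.tensor([0.3, 0.5, 1.1]))
print(loss)
```

The trained reward model then scores the policy's generations during the PPO stage.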
Comparison: When to Use Each
| Approach | Complexity | Hardware | Use Case |
|---|---|---|---|
| SFT only | Low | 8GB+ | Teaching new tasks |
| SFT + DPO | Medium | 16GB+ | Quality + alignment |
| Full RLHF | High | 80GB+ | Research, complex alignment |
The Modern Pipeline (2025)
Most practitioners follow this pipeline:
Base Model (e.g., Llama 3.2) → SFT on a custom dataset (teach skills) → DPO on preference data (improve quality)
Pro Tip: Start with SFT. Only add DPO if you need to improve response quality or alignment. Skip RLHF unless you're doing research.
Key Terminology
| Term | Definition |
|---|---|
| Adapter | Small trainable module added to frozen model |
| LoRA | Low-Rank Adaptation - most popular PEFT method |
| Rank (r) | Size of LoRA matrices (higher = more capacity) |
| Alpha | Scaling factor for LoRA updates |
| Target modules | Which layers to add adapters to |
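These terms map directly onto the configuration object in the Hugging Face peft library. A minimal sketch, with illustrative values and a placeholder checkpoint name rather than recommendations:

```python
# Mapping the terminology above onto a peft LoraConfig (values are illustrative)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=16,                                 # rank: size of the low-rank matrices
    lora_alpha=32,                        # alpha: scaling factor for the LoRA update
    target_modules=["q_proj", "v_proj"],  # target modules: attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")  # placeholder
peft_model = get_peft_model(base_model, lora_config)  # wraps the frozen base with adapters
peft_model.print_trainable_parameters()               # typically well under 1% of parameters
```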
Next, we'll dive deep into PEFT methods and understand how LoRA works.