Understanding Fine-tuning

Types of Fine-tuning

Not all fine-tuning is created equal. Let's explore the different approaches and when to use each.

Full Fine-tuning vs PEFT

Full Fine-tuning

Updates all model parameters during training.

| Pros | Cons |
| --- | --- |
| Maximum potential improvement | Requires massive GPU memory |
| Complete model customization | Risk of catastrophic forgetting |
| Works for any task | Expensive and slow |

Hardware needed: 80GB+ VRAM for a 7B model, multiple A100s for a 70B model
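
That hardware figure comes mostly from optimizer state. Here is a rough back-of-envelope sketch using the common mixed-precision rule of thumb; it ignores activations and framework overhead, so treat it as an estimate rather than an exact requirement:

```python
# Rough VRAM needed just to hold a 7B model's training state with AdamW in mixed precision.
# Byte counts follow the usual rule of thumb: bf16 weights + bf16 gradients + fp32 optimizer state.
params = 7e9
bytes_per_param = 2 + 2 + 12
print(f"~{params * bytes_per_param / 1e9:.0f} GB")  # ≈ 112 GB before activations
```

Activations and overhead come on top of this, which is why even a 7B model usually spills past a single 80GB card unless you add tricks like gradient checkpointing or sharded optimizers.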

Parameter-Efficient Fine-Tuning (PEFT)

Updates only a small subset of parameters (typically <1%).

| Pros | Cons |
| --- | --- |
| 10-100x less memory | Slightly lower ceiling than full fine-tuning |
| Fast training | Requires understanding of adapters |
| Little risk of catastrophic forgetting (base weights stay frozen) | |
| Easy to swap adapters | |

Hardware needed: 8-24GB VRAM for most models with QLoRA
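
As a concrete sketch, a QLoRA-style setup with the transformers, bitsandbytes, and peft libraries looks roughly like this. The model name and hyperparameters are placeholders, and exact arguments can vary between library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit to save memory (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",  # example base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach a small LoRA adapter; only these weights are trained
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Because the base weights never change, you can train several adapters against the same base model and swap them in at load time.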

Training Objectives: SFT vs DPO vs RLHF

Supervised Fine-Tuning (SFT)

The simplest approach: train on (instruction, response) pairs.

```python
# SFT Training Example
dataset = [
    {"instruction": "Write a poem about AI", "response": "Silicon dreams..."},
    {"instruction": "Explain quantum computing", "response": "Quantum computing uses..."}
]
```

Best for: Teaching new skills, domain knowledge, output formats
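
To make this concrete, here is a minimal training sketch using trl's SFTTrainer. The prompt template, model name, and output path are illustrative, and argument names shift a bit between trl versions:

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

pairs = [
    {"instruction": "Write a poem about AI", "response": "Silicon dreams..."},
    {"instruction": "Explain quantum computing", "response": "Quantum computing uses..."},
]

# Render each pair into a single "text" field with a simple (illustrative) template
def to_text(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"}

train_dataset = Dataset.from_list(pairs).map(to_text)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B",   # example base model; a loaded model object also works
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="sft-output"),
)
trainer.train()
```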

Direct Preference Optimization (DPO)

Train on preference pairs: which response is better?

```python
# DPO Training Example
dataset = [
    {
        "prompt": "Explain recursion",
        "chosen": "Recursion is when a function calls itself...",  # Good response
        "rejected": "Recursion is complicated. Look it up."  # Bad response
    }
]
```

Best for: Improving response quality, alignment, tone
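
A minimal DPO run with trl's DPOTrainer could look like the sketch below. The checkpoint path stands in for whatever model you SFT-tuned first, and the keyword for passing the tokenizer differs across trl versions:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

checkpoint = "sft-output"  # placeholder: the model you already SFT-tuned
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

preferences = Dataset.from_list([
    {
        "prompt": "Explain recursion",
        "chosen": "Recursion is when a function calls itself...",
        "rejected": "Recursion is complicated. Look it up.",
    },
])

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # None: trl keeps a frozen copy of the model as the reference
    args=DPOConfig(output_dir="dpo-output", beta=0.1),
    train_dataset=preferences,
    processing_class=tokenizer,  # called `tokenizer=` in older trl versions
)
trainer.train()
```

beta controls how far the tuned policy may drift from the reference model; 0.1 is a common starting point.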

RLHF (Reinforcement Learning from Human Feedback)

The original alignment technique, more complex than DPO.

Pipeline: SFT Model → Reward Model Training → PPO Optimization

Best for: Complex alignment requirements (typically used by labs)

Comparison: When to Use Each

| Approach | Complexity | Hardware (VRAM) | Use Case |
| --- | --- | --- | --- |
| SFT only | Low | 8GB+ | Teaching new tasks |
| SFT + DPO | Medium | 16GB+ | Quality + alignment |
| Full RLHF | High | 80GB+ | Research, complex alignment |

The Modern Pipeline (2025)

Most practitioners follow this pipeline:

```
Base Model → SFT (teach skills) → DPO (improve quality)
    ↓              ↓                     ↓
 Llama 3.2    Custom dataset      Preference data
```

Pro Tip: Start with SFT. Only add DPO if you need to improve response quality or alignment. Skip RLHF unless you're doing research.

Key Terminology

| Term | Definition |
| --- | --- |
| Adapter | Small trainable module added to a frozen model |
| LoRA | Low-Rank Adaptation, the most popular PEFT method |
| Rank (r) | Size of the LoRA matrices (higher = more capacity) |
| Alpha | Scaling factor for LoRA updates |
| Target modules | Which layers to add adapters to |
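
To give rank some intuition, here is a quick back-of-envelope count of the parameters a LoRA adapter adds to a single projection matrix; the 4096x4096 size is illustrative, not tied to a specific model:

```python
# A rank-r LoRA adapter on a d_out x d_in weight matrix adds A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) trainable values alongside d_in * d_out frozen ones.
d_in, d_out = 4096, 4096
for r in (8, 16, 64):
    lora_params = r * (d_in + d_out)
    full_params = d_in * d_out
    print(f"r={r:>2}: {lora_params:,} adapter params vs {full_params:,} frozen ({lora_params / full_params:.2%})")
```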

Next, we'll dive deep into PEFT methods and understand how LoRA works.
