Lesson 2 of 24

Understanding Fine-tuning

Types of Fine-tuning

Not all fine-tuning is created equal. Let's explore the different approaches and when to use each.

Full Fine-tuning vs PEFT

Full Fine-tuning

Updates all model parameters during training.

| Pros | Cons |
| --- | --- |
| Maximum potential improvement | Requires massive GPU memory |
| Complete model customization | Risk of catastrophic forgetting |
| Works for any task | Expensive and slow |

Hardware needed: 80GB+ VRAM for a 7B model; multiple A100s for a 70B model
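The 80GB+ figure follows from a common rule of thumb: full fine-tuning with Adam in mixed precision needs roughly 16 bytes per parameter, before activations. A back-of-the-envelope sketch (the 16-bytes figure is an approximation, not an exact requirement):

```python
# Rough VRAM estimate for full fine-tuning with Adam in mixed precision.
# Common approximation: ~16 bytes per parameter
#   2 B fp16 weights + 2 B fp16 gradients + 4 B fp32 master weights
#   + 8 B fp32 Adam optimizer states (momentum + variance)
# Activations and batch size add more on top.
def full_ft_memory_gb(n_params, bytes_per_param=16):
    return n_params * bytes_per_param / 1e9

print(full_ft_memory_gb(7e9))   # ~112 GB for a 7B model, before activations
print(full_ft_memory_gb(70e9))  # ~1120 GB for a 70B model
```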

Parameter-Efficient Fine-Tuning (PEFT)

Updates only a small subset of parameters (typically <1%).

| Pros | Cons |
| --- | --- |
| 10-100x less memory | Slightly lower ceiling than full fine-tuning |
| Fast training | Requires understanding of adapters |
| No catastrophic forgetting | |
| Easy to swap adapters | |

Hardware needed: 8-24GB VRAM for most models with QLoRA
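The "<1%" claim is easy to sanity-check with arithmetic. The sketch below counts LoRA's trainable parameters for an illustrative 7B-class model; the hidden size (4096), layer count (32), and four adapted attention matrices per layer are assumptions for illustration, not fixed values:

```python
# Count LoRA's trainable parameters: each adapted d x d weight matrix W
# gets two small matrices, A (r x d) and B (d x r), so 2 * d * r
# trainable values per matrix. Everything else stays frozen.
def lora_trainable_params(d_model, r, n_layers, matrices_per_layer=4):
    return 2 * d_model * r * matrices_per_layer * n_layers

lora = lora_trainable_params(d_model=4096, r=16, n_layers=32)
print(lora)              # 16,777,216 trainable parameters (~16.8M)
print(lora / 7e9 * 100)  # ~0.24% of a 7B model
```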

Training Objectives: SFT vs DPO vs RLHF

Supervised Fine-Tuning (SFT)

The simplest approach: train on (instruction, response) pairs.

# SFT Training Example
dataset = [
    {"instruction": "Write a poem about AI", "response": "Silicon dreams..."},
    {"instruction": "Explain quantum computing", "response": "Quantum computing uses..."}
]

Best for: Teaching new skills, domain knowledge, output formats
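In practice, each (instruction, response) pair is serialized into a single training string using a prompt template. A minimal sketch, assuming an Alpaca-style template (the exact format is a project convention, not something SFT itself prescribes):

```python
# Serialize each (instruction, response) pair into one training string.
# The template below is Alpaca-style; your model or framework may expect
# a different chat format.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_example(example):
    return TEMPLATE.format(**example)

print(format_example(
    {"instruction": "Write a poem about AI", "response": "Silicon dreams..."}
))
```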

Direct Preference Optimization (DPO)

Train on preference pairs: which response is better?

# DPO Training Example
dataset = [
    {
        "prompt": "Explain recursion",
        "chosen": "Recursion is when a function calls itself...",  # Good response
        "rejected": "Recursion is complicated. Look it up."  # Bad response
    }
]

Best for: Improving response quality, alignment, tone
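DPO turns those preference pairs directly into a loss: it pushes the policy to widen the log-probability margin of the chosen response over the rejected one, relative to a frozen reference model. A minimal sketch of the per-example loss (beta and the log-probability values are illustrative, not from a real model):

```python
import math

# Per-example DPO loss: -log sigmoid(beta * margin), where the margin is
# how much the policy prefers "chosen" over "rejected" relative to a
# frozen reference model. Inputs are sequence log-probabilities.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# No preference learned yet: margin is 0, loss is log(2) ~ 0.693
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# Policy now favors the chosen response: loss drops
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```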

RLHF (Reinforcement Learning from Human Feedback)

The original alignment technique, more complex than DPO.

Pipeline: SFT Model → Reward Model Training → PPO Optimization

Best for: Complex alignment requirements (typically used by labs)

Comparison: When to Use Each

| Approach | Complexity | Hardware | Use Case |
| --- | --- | --- | --- |
| SFT only | Low | 8GB+ | Teaching new tasks |
| SFT + DPO | Medium | 16GB+ | Quality + alignment |
| Full RLHF | High | 80GB+ | Research, complex alignment |

The Modern Pipeline (2026)

Most practitioners follow this pipeline:

Base Model → SFT (teach skills) → DPO (improve quality)
    ↓              ↓                     ↓
 Llama 3.2    Custom dataset      Preference data

Pro Tip: Start with SFT. Only add DPO if you need to improve response quality or alignment. Skip RLHF unless you're doing research.

Key Terminology

| Term | Definition |
| --- | --- |
| Adapter | Small trainable module added to a frozen model |
| LoRA | Low-Rank Adaptation - the most popular PEFT method |
| Rank (r) | Size of the LoRA matrices (higher = more capacity) |
| Alpha | Scaling factor for LoRA updates |
| Target modules | Which layers to add adapters to |
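Rank and alpha combine in LoRA's weight update: the frozen weight W is augmented by a low-rank delta, ΔW = (alpha / r) · B·A. A toy sketch with rank r = 1, in pure Python so the shapes are explicit (no framework assumed):

```python
# LoRA's weight update: the frozen weight W gets a low-rank delta,
# delta_W = (alpha / r) * B @ A. Toy example with rank r = 1.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_delta(A, B, alpha, r):
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

A = [[1.0, 2.0]]     # r x d  (here 1 x 2)
B = [[3.0], [4.0]]   # d x r  (here 2 x 1)
print(lora_delta(A, B, alpha=2, r=1))  # [[6.0, 12.0], [8.0, 16.0]]
```

Note that B·A has the full d x d shape of W, so the delta can be merged into the frozen weight after training, which is why adapters add no inference latency once merged.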

Next, we'll dive deep into PEFT methods and understand how LoRA works.
