Lesson 17 of 24

Alignment with DPO

DPO vs RLHF

3 min read

Direct Preference Optimization (DPO) has changed how language models are aligned. Let's understand why it is now often preferred over traditional RLHF.

What is Alignment?

Alignment ensures a model's outputs match human preferences:

  • Helpful and accurate responses
  • Safe and harmless behavior
  • Following instructions properly
  • Appropriate tone and style

Traditional RLHF Pipeline

Reinforcement Learning from Human Feedback (RLHF) was the original approach (the reward-model step is sketched in code after the steps below):

Step 1: Train SFT model
Step 2: Collect human preference data
Step 3: Train reward model on preferences
Step 4: Fine-tune with PPO using reward model
Result: Aligned model
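
Step 3 is where most of the extra machinery lives: a separate reward model is fit to the human preference data. Below is a minimal sketch of that pairwise (Bradley-Terry style) objective, using toy scalar scores in place of real reward-model outputs:

# Reward-model loss on one preference pair (toy scores stand in for model outputs)
import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.7])    # score for the human-preferred response
reward_rejected = torch.tensor([0.4])  # score for the rejected response

# Train the reward model to score the chosen response above the rejected one
rm_loss = -F.logsigmoid(reward_chosen - reward_rejected)
print(rm_loss)  # ~0.24; the loss shrinks as the score gap widens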

RLHF Challenges

Challenge         Description
Complexity        4-step pipeline, multiple models
Instability       PPO training is notoriously unstable
Resources         Needs a separate reward model
Hyperparameters   Many parameters to tune

DPO: A Simpler Approach

DPO skips the separate reward model entirely (an example preference record follows the steps below):

Step 1: Train SFT model
Step 2: Collect preference data (chosen/rejected pairs)
Step 3: Train directly with DPO loss
Result: Aligned model
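
For concreteness, here is what a single record from Step 2 can look like. The prompt/chosen/rejected field names follow the common convention for preference datasets, and the content is purely illustrative:

# One preference record; a real dataset contains thousands of such pairs
preference_example = {
    "prompt": "Explain what overfitting means in one sentence.",
    "chosen": "Overfitting is when a model memorizes its training data and fails to generalize to new examples.",
    "rejected": "Overfitting means the model is too small to learn the task.",
}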

DPO Advantages

Advantage    Explanation
Simpler      No reward model needed
Stable       Standard supervised training
Efficient    Single training stage
Effective    Matches or beats RLHF results

How DPO Works

DPO treats alignment as a classification problem:

Given: prompt, chosen_response, rejected_response
Goal: Make model prefer chosen over rejected

Loss = -log(sigmoid(β * [(log_prob_policy(chosen) - log_prob_ref(chosen))
                         - (log_prob_policy(rejected) - log_prob_ref(rejected))]))

The model learns to increase the probability of chosen responses and decrease the probability of rejected ones, relative to a frozen reference (SFT) model that keeps the policy from drifting too far from its starting behavior.

Mathematical Insight

In code, with per-sequence log-probabilities already computed under each model:

# DPO loss in PyTorch. Each argument is the summed token log-probability of a
# response under the current policy or the frozen reference (SFT) model,
# one value per sequence in the batch.
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    # Implicit rewards: how far the policy has moved away from the reference model
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)

    # Push the chosen reward above the rejected reward
    loss = -F.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()
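
A quick sanity check with made-up per-sequence log-probabilities for two preference pairs:

import torch

policy_chosen = torch.tensor([-12.0, -30.0])
policy_rejected = torch.tensor([-15.0, -28.0])
ref_chosen = torch.tensor([-13.0, -31.0])
ref_rejected = torch.tensor([-14.0, -27.0])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1)
print(loss)  # ~0.60, below ln(2) ≈ 0.69 because the policy already leans toward the chosen responses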

Comparison Summary

Aspect              RLHF         DPO
Pipeline steps      4            3
Reward model        Required     Not needed
Training stability  Low (PPO)    High
Memory usage        High         Medium
Hyperparameters     Many         Few
Results             Excellent    Excellent

When to Use DPO

Use DPO when:

  • Improving response quality after SFT
  • Aligning model with specific preferences
  • Reducing harmful outputs
  • Improving instruction following
  • You want simpler training

Consider RLHF when:

  • You need online learning with live feedback
  • Research requiring reward model access
  • Complex multi-objective optimization

DPO in Practice

The typical workflow (a minimal training sketch follows the two phases):

1. SFT Phase
   Train on instruction-response pairs
   Model learns basic task skills

2. DPO Phase
   Train on preference pairs (chosen/rejected)
   Model learns which responses are better
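
With the TRL library, the DPO phase can be a short training script. The sketch below is an assumption-laden example rather than a canonical recipe: the model name and dataset contents are placeholders, and argument names (e.g. processing_class vs. tokenizer) have changed across TRL releases, so check the version you have installed.

# Minimal DPO training sketch with TRL (placeholders marked in comments)
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "your-sft-model"  # placeholder: the SFT checkpoint from phase 1
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tiny illustrative preference dataset with prompt/chosen/rejected columns
train_dataset = Dataset.from_list([
    {"prompt": "Explain overfitting briefly.",
     "chosen": "Overfitting is memorizing training data and generalizing poorly.",
     "rejected": "Overfitting means the model is too small."},
])

args = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                # regularization strength (see the table below)
    learning_rate=5e-7,
    num_train_epochs=1,
)

# No ref_model passed: recent TRL versions create a frozen reference copy internally
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL releases use tokenizer= instead
)
trainer.train()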

Key Parameters

Parameter       Description                                          Typical Value
β (beta)        Regularization strength toward the reference model   0.1 - 0.5
learning_rate   Learning rate for the DPO phase                      5e-7 to 5e-6
epochs          Passes over the preference data                      1-3

Key Insight: DPO achieves similar results to RLHF with a fraction of the complexity. For most practitioners, DPO is the right choice for alignment in 2025.

Next, let's learn how to prepare preference data for DPO training.
