Alignment with DPO
DPO vs RLHF
Direct Preference Optimization (DPO) has become one of the most widely used methods for aligning language models. Let's look at why it is often preferred over traditional RLHF.
What is Alignment?
Alignment ensures a model's outputs match human preferences:
- Helpful and accurate responses
- Safe and harmless behavior
- Following instructions properly
- Appropriate tone and style
Traditional RLHF Pipeline
Reinforcement Learning from Human Feedback (RLHF) was the original approach:
Step 1: Train SFT model
↓
Step 2: Collect human preference data
↓
Step 3: Train reward model on preferences
↓
Step 4: Fine-tune with PPO using reward model
↓
Result: Aligned model
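To make step 3 concrete, the reward model is usually trained with a pairwise (Bradley-Terry style) loss on preference data. Below is a minimal sketch, assuming a hypothetical `reward_model` that returns a scalar score per sequence:

```python
# Sketch of the pairwise reward-model loss from step 3 of the RLHF pipeline.
# Assumes reward_model(sequence) returns a scalar score for that sequence.
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen, rejected):
    r_chosen = reward_model(chosen)      # score for the preferred response
    r_rejected = reward_model(rejected)  # score for the rejected response
    # Push the preferred response to score higher than the rejected one
    return -F.logsigmoid(r_chosen - r_rejected)
```

This separately trained model is exactly the component that DPO removes, as described next.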
RLHF Challenges
| Challenge | Description |
|---|---|
| Complexity | 4-step pipeline, multiple models |
| Instability | PPO training is notoriously unstable |
| Resources | Need separate reward model |
| Hyperparameters | Many parameters to tune |
DPO: A Simpler Approach
DPO skips the reward model entirely:
Step 1: Train SFT model
↓
Step 2: Collect preference data (chosen/rejected pairs)
↓
Step 3: Train directly with DPO loss
↓
Result: Aligned model
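A single record in the step 2 preference dataset typically looks like the following. The content is made up for illustration; the prompt/chosen/rejected field names follow the convention used by most DPO tooling:

```python
# One illustrative preference record (chosen/rejected pair)
preference_example = {
    "prompt": "Explain what DPO is in one sentence.",
    "chosen": (
        "DPO aligns a language model by training directly on preference "
        "pairs, with no separate reward model."
    ),
    "rejected": "DPO is a database optimization technique.",
}
```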
DPO Advantages
| Advantage | Explanation |
|---|---|
| Simpler | No reward model needed |
| Stable | Standard supervised training |
| Efficient | Single training stage |
| Effective | Matches RLHF results in many settings |
How DPO Works
DPO treats alignment as a classification problem:
Given: prompt, chosen_response, rejected_response
Goal: Make the model prefer chosen over rejected

Loss = -log(sigmoid(β * ((log_prob_policy(chosen) - log_prob_ref(chosen))
                       - (log_prob_policy(rejected) - log_prob_ref(rejected)))))

The model learns to increase the probability of chosen responses and decrease the probability of rejected ones, measured relative to a frozen reference copy of the SFT model. The β term controls how strongly the policy is kept close to that reference.
Mathematical Insight
# DPO loss for one preference pair (a sketch: log_prob is a placeholder
# helper assumed to return the summed token log-probability of a sequence)
import torch
import torch.nn.functional as F

def dpo_loss(policy_model, reference_model, chosen, rejected, beta):
    # Log probabilities from the current policy
    policy_chosen = log_prob(policy_model, chosen)
    policy_rejected = log_prob(policy_model, rejected)

    # Log probabilities from the frozen reference (SFT) model; no gradients
    with torch.no_grad():
        ref_chosen = log_prob(reference_model, chosen)
        ref_rejected = log_prob(reference_model, rejected)

    # Implicit rewards: scaled log-probability ratios against the reference
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)

    # Maximize the margin between chosen and rejected implicit rewards
    loss = -F.logsigmoid(chosen_reward - rejected_reward)
    return loss
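As a quick sanity check with made-up numbers: if the policy raises the chosen response's log-probability by 0.5 relative to the reference and lowers the rejected one's by 0.5, then with beta = 0.1 the reward margin is 0.1 and the loss is about 0.64; as the margin grows, the loss falls toward 0.

```python
import math

beta = 0.1
chosen_reward = beta * 0.5        # policy_chosen - ref_chosen = +0.5
rejected_reward = beta * -0.5     # policy_rejected - ref_rejected = -0.5
margin = chosen_reward - rejected_reward       # 0.10
loss = -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin)) ≈ 0.64
print(round(loss, 2))
```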
Comparison Summary
| Aspect | RLHF | DPO |
|---|---|---|
| Pipeline stages | 4 | 3 |
| Reward model | Required | Not needed |
| Training stability | Low (PPO) | High |
| Memory usage | High | Medium |
| Hyperparameters | Many | Few |
| Results | Excellent | Excellent |
When to Use DPO
Use DPO when:
- Improving response quality after SFT
- Aligning model with specific preferences
- Reducing harmful outputs
- Improving instruction following
- You want simpler training
Consider RLHF when:
- You need online learning with live feedback
- Research requiring reward model access
- Complex multi-objective optimization
DPO in Practice
The typical workflow:
1. SFT Phase
   - Train on instruction-response pairs
   - Model learns basic task skills
2. DPO Phase
   - Train on preference pairs (chosen/rejected)
   - Model learns which responses are better
Key Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| β (beta) | Regularization strength | 0.1 - 0.5 |
| learning_rate | Training learning rate | 5e-7 to 5e-6 |
| epochs | Training epochs | 1-3 |
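Putting the workflow and parameters together, here is a hedged sketch of a DPO run using the Hugging Face TRL library. Argument names vary across TRL versions, and the model and dataset names below are placeholders, so treat this as a template rather than a drop-in script:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholders: substitute your own SFT checkpoint and preference dataset
model = AutoModelForCausalLM.from_pretrained("my-sft-model")
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")
dataset = load_dataset("my-preference-data", split="train")  # prompt/chosen/rejected columns

config = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,               # regularization strength (see table above)
    learning_rate=5e-7,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,              # TRL copies the policy as the frozen reference
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions take tokenizer= instead
)
trainer.train()
```

With `ref_model=None`, TRL builds the frozen reference from the policy checkpoint, which matches the SFT-then-DPO workflow above.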
Key Insight: DPO achieves results similar to RLHF with a fraction of the complexity. For most practitioners, DPO is the right choice for alignment in 2025.
Next, let's learn how to prepare preference data for DPO training.