Alignment with DPO
DPOTrainer in Practice
3 min read
Let's implement DPO training using TRL's DPOTrainer. We'll walk through a complete training script.
Basic DPO Training
from unsloth import FastLanguageModel
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
# ============================================
# 1. Load Model
# ============================================
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Add LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",
)
# ============================================
# 2. Load Dataset
# ============================================
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
dataset = dataset.select(range(1000)) # Subset for testing
# ============================================
# 3. DPO Configuration
# ============================================
dpo_config = DPOConfig(
    output_dir="./outputs/dpo-model",
    beta=0.1,                      # KL penalty coefficient
    learning_rate=5e-6,            # Lower than SFT
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    warmup_ratio=0.1,
    logging_steps=10,
    save_steps=500,
    bf16=True,
    optim="adamw_8bit",
    max_length=1024,
    max_prompt_length=512,
)
# ============================================
# 4. Initialize Trainer
# ============================================
trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
# ============================================
# 5. Train
# ============================================
trainer.train()
# ============================================
# 6. Save
# ============================================
trainer.save_model("./outputs/dpo-model/final")
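Once training finishes, it is worth a quick generation sanity check on the saved adapter. A minimal sketch, assuming the output path from the script above and Unsloth's inference helper (adjust paths and prompt to your setup):
from unsloth import FastLanguageModel
# Reload the trained adapter for a quick smoke test
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./outputs/dpo-model/final",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to fast inference mode
messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))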
Key DPO Parameters
Beta (β)
Controls how strongly the policy is tied to the reference model. Higher values keep the model closer to the reference; lower values allow more deviation:
# Allow more deviation from the reference (weaker constraint)
beta = 0.05
# Balanced (common default)
beta = 0.1
# Stay close to the reference (stronger constraint)
beta = 0.5
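To see exactly where beta enters, here is a minimal sketch of the sigmoid DPO loss computed from per-sequence log-probabilities. The function and argument names are illustrative; TRL computes this internally from the policy and reference models:
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-prob ratios of the policy vs. the frozen reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # beta scales the implicit reward: higher beta penalizes
    # divergence from the reference more strongly
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()
    # The "rewards" reported in training logs are the scaled log-ratios
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return loss, chosen_rewards, rejected_rewards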
Learning Rate
DPO typically uses lower learning rates than SFT:
# SFT learning rate
sft_lr = 2e-4
# DPO learning rate (10-100x lower)
dpo_lr = 5e-6 # to 5e-7
Max Lengths
# Max total tokens per example (prompt + chosen or rejected response)
max_length = 1024
# Max tokens reserved for the prompt portion
max_prompt_length = 512
Advanced Configuration
dpo_config = DPOConfig(
    output_dir="./outputs/dpo-advanced",
    # DPO specific
    beta=0.1,
    loss_type="sigmoid",  # or "hinge", "ipo"
    # Training
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    # Lengths
    max_length=1024,
    max_prompt_length=512,
    truncation_mode="keep_start",
    # Optimization
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    optim="adamw_8bit",
    bf16=True,
    # Logging
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_steps=500,
    # Gradient clipping
    max_grad_norm=1.0,
)
Loss Types
DPO supports different loss functions:
# Standard sigmoid loss (default)
loss_type = "sigmoid"
# Hinge loss
loss_type = "hinge"
# IPO (Identity Preference Optimization)
loss_type = "ipo"
With Reference Model
DPO always needs a reference policy. When you train LoRA adapters (as in the examples above), you can omit `ref_model`: TRL recovers the reference by running the base model with the adapters disabled. If you fine-tune the full model, or want a reference that differs from the training base, pass an explicit frozen copy of the SFT model:
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import DPOTrainer, DPOConfig

# Load the SFT model as the reference (frozen, never updated)
ref_model = AutoModelForCausalLM.from_pretrained(
    "path/to/sft-model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load the model to train (same SFT checkpoint, LoRA adapters on top)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/sft-model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # Frozen reference
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
Monitoring Training
Key Metrics
# Training logs will show:
# - loss: DPO loss (should decrease)
# - rewards/chosen: implicit reward for chosen responses, i.e. the
#   beta-scaled log-prob ratio vs. the reference (stable or increasing)
# - rewards/rejected: implicit reward for rejected responses (should decrease)
# - rewards/margins: chosen minus rejected (should increase)
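One way to check these numbers after (or during) training is the trainer's log history. A small sketch, assuming the `trainer` object from the scripts above and TRL's logged metric names (adjust keys if your TRL version differs):
# Print DPO metrics recorded at each logging step (illustrative)
for entry in trainer.state.log_history:
    if "rewards/margins" in entry:
        print(
            f"step {entry['step']:>5} | "
            f"loss {entry['loss']:.4f} | "
            f"margin {entry['rewards/margins']:.3f} | "
            f"chosen {entry['rewards/chosen']:.3f} | "
            f"rejected {entry['rewards/rejected']:.3f}"
        )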
Good Training Signs
| Metric | Good Trend |
|---|---|
| loss | Decreasing |
| rewards/margins | Increasing |
| rewards/chosen | Stable or increasing |
| rewards/rejected | Decreasing |
Bad Training Signs
- Loss explodes → Lower learning rate
- Margins go negative → Check data quality
- No improvement → Lower beta (allow more deviation from the reference) or train longer
Complete Example with Evaluation
from unsloth import FastLanguageModel
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
"unsloth/Llama-3.2-3B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    use_gradient_checkpointing="unsloth",
)
# Load and split dataset
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized")
train_dataset = dataset["train_prefs"].select(range(5000))
eval_dataset = dataset["test_prefs"].select(range(500))
# Configure
dpo_config = DPOConfig(
    output_dir="./outputs/dpo-eval",
    beta=0.1,
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=200,
    logging_steps=20,
    bf16=True,
    optim="adamw_8bit",
)
# Train with evaluation
trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model("./outputs/dpo-eval/final")
Tip: Start with the default beta of 0.1. If the model isn't learning preferences (reward margins stay flat), try lowering beta so it can move further from the reference; if outputs degrade or drift too far from the SFT model, raise it. Monitor the reward margins to ensure the model is actually preferring chosen over rejected responses.
Next, let's learn how to combine SFT and DPO for a complete training pipeline.