Dataset Preparation
Preference Data for DPO
Direct Preference Optimization (DPO) requires a specific dataset format: for each prompt, a pair of responses in which one is preferred over the other. Let's walk through how to create this data.
DPO Dataset Format
Standard Format
```json
{
  "prompt": "Explain quantum computing in simple terms",
  "chosen": "Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously. Unlike regular bits that are either 0 or 1, qubits can be both at once through 'superposition'. This allows quantum computers to process many possibilities at the same time, making them incredibly powerful for specific problems like cryptography and drug discovery.",
  "rejected": "Quantum computing is very complex and hard to explain. It uses physics stuff. You probably won't understand it without a PhD."
}
```
With System Prompt
```json
{
  "system": "You are a helpful, patient teacher who explains complex topics clearly.",
  "prompt": "What is machine learning?",
  "chosen": "Machine learning is a type of AI where computers learn patterns from data instead of being explicitly programmed...",
  "rejected": "ML is when computers learn stuff. Google it for more info."
}
```
Creating Preference Pairs
Method 1: Model-Generated + Human Ranking
Generate multiple responses and have humans rank them:
```python
from openai import OpenAI

client = OpenAI()

def generate_candidates(prompt, n=4):
    """Generate multiple candidate responses."""
    responses = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.9,  # High temperature for variety
        )
        responses.append(response.choices[0].message.content)
    return responses

# Generate candidates
prompt = "Explain recursion"
candidates = generate_candidates(prompt)

# Human ranks these as: [2, 0, 3, 1] (best to worst index)
# Create pairs from rankings
```
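One simple way to turn a human ranking into training pairs is to pair every higher-ranked response (as "chosen") with every lower-ranked one (as "rejected"). The `pairs_from_ranking` helper below is an illustrative sketch, not a standard API:

```python
def pairs_from_ranking(prompt, responses, ranking):
    """Turn a best-to-worst ranking of candidate responses into preference pairs.

    `ranking` lists response indices from best to worst, e.g. [2, 0, 3, 1].
    Every higher-ranked response becomes "chosen" against every lower-ranked one.
    """
    pairs = []
    for better_pos in range(len(ranking)):
        for worse_pos in range(better_pos + 1, len(ranking)):
            pairs.append({
                "prompt": prompt,
                "chosen": responses[ranking[better_pos]],
                "rejected": responses[ranking[worse_pos]],
            })
    return pairs

# With 4 candidates this yields 6 pairs
pairs = pairs_from_ranking(prompt, candidates, ranking=[2, 0, 3, 1])
```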
Method 2: Model Comparison
Use a stronger model to compare responses:
```python
def compare_responses(prompt, response_a, response_b):
    """Use an LLM to judge which response is better."""
    judge_prompt = f"""Compare these two responses to the prompt: "{prompt}"

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Answer with just "A" or "B"."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}]
    )
    winner = response.choices[0].message.content.strip()
    return winner

def create_preference_pair(prompt, response_a, response_b):
    winner = compare_responses(prompt, response_a, response_b)
    if winner == "A":
        return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
    else:
        return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
```
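LLM judges often show position bias, favoring whichever response appears first. One common mitigation, sketched here as an optional variant of `create_preference_pair` rather than a required step, is to judge both orderings and keep the pair only when the verdicts agree:

```python
def create_preference_pair_debiased(prompt, response_a, response_b):
    """Judge in both orders and keep the pair only if the verdicts agree."""
    first = compare_responses(prompt, response_a, response_b)
    second = compare_responses(prompt, response_b, response_a)  # order swapped

    # If response_a is truly better, it wins slot A in the first pass
    # and slot B in the second pass (and vice versa for response_b).
    if first == "A" and second == "B":
        return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
    if first == "B" and second == "A":
        return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
    return None  # Inconsistent verdicts: discard the pair
```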
Method 3: Rule-Based Selection
Create pairs based on specific criteria:
```python
def create_quality_pairs(prompt, responses):
    """Create pairs based on measurable quality criteria."""
    scored_responses = []
    for response in responses:
        score = 0
        # Length score (prefer detailed responses)
        if 100 < len(response) < 1000:
            score += 2
        # Structure score (prefer organized responses)
        if any(marker in response for marker in ["1.", "First,", "•"]):
            score += 1
        # Completeness score
        if response.endswith((".", "!", "?", "\"")):
            score += 1
        # No hedging
        if not response.startswith(("I think", "Maybe", "I'm not sure")):
            score += 1
        scored_responses.append((response, score))

    # Sort by score
    scored_responses.sort(key=lambda x: x[1], reverse=True)

    # Create pairs from best vs worst
    pairs = []
    for i in range(len(scored_responses) // 2):
        pairs.append({
            "prompt": prompt,
            "chosen": scored_responses[i][0],
            "rejected": scored_responses[-(i + 1)][0]
        })
    return pairs
```
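For example, scoring the candidates from Method 1 might look like this (reusing the `generate_candidates` helper from above):

```python
candidates = generate_candidates("Explain recursion", n=4)
pairs = create_quality_pairs("Explain recursion", candidates)

# With 4 candidates this pairs the highest-scoring response against the lowest,
# and the second best against the second worst.
print(len(pairs))  # 2
```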
Preference Criteria
What makes a response "better"? Define clear criteria:
| Criterion | Chosen Example | Rejected Example |
|---|---|---|
| Helpfulness | Provides complete answer | Says "Google it" |
| Safety | Refuses harmful requests appropriately | Complies with harmful requests |
| Accuracy | Factually correct | Contains errors |
| Clarity | Well-organized, easy to follow | Confusing, rambling |
| Tone | Appropriate for context | Rude or dismissive |
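These criteria can also be folded into the LLM judge from Method 2 as an explicit rubric. The prompt wording below is an assumption you should adapt, not a fixed recipe:

```python
RUBRIC = """Judge the responses on these criteria, in order of importance:
1. Helpfulness: does it fully answer the question?
2. Safety: does it handle harmful or sensitive requests appropriately?
3. Accuracy: is it factually correct?
4. Clarity: is it well organized and easy to follow?
5. Tone: is it appropriate for the context?"""

def compare_with_rubric(prompt, response_a, response_b):
    """Like compare_responses, but the judge is given an explicit rubric."""
    judge_prompt = f"""{RUBRIC}

Prompt: "{prompt}"

Response A:
{response_a}

Response B:
{response_b}

Which response is better overall? Answer with just "A" or "B"."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}]
    )
    return response.choices[0].message.content.strip()
```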
Scaling Preference Data
Bootstrapping Pipeline
```python
def bootstrap_preference_dataset(seed_prompts, target_size=1000):
    """Scale up preference data from seed prompts."""
    dataset = []
    for prompt in seed_prompts:
        # Generate multiple responses
        responses = generate_candidates(prompt, n=4)
        # Create all possible pairs
        for i in range(len(responses)):
            for j in range(i + 1, len(responses)):
                pair = create_preference_pair(prompt, responses[i], responses[j])
                dataset.append(pair)
        if len(dataset) >= target_size:
            break
    return dataset
```
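A minimal way to run the pipeline and persist the output as JSON Lines (the seed prompts and file name here are just placeholders):

```python
import json

seed_prompts = [
    "Explain recursion",
    "What is machine learning?",
    "Explain quantum computing in simple terms",
]

pairs = bootstrap_preference_dataset(seed_prompts, target_size=100)

# Save as JSON Lines so it can be reloaded later, e.g. with load_dataset("json", ...)
with open("preference_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```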
Using Existing Datasets
Many preference datasets are available on Hugging Face:
```python
from datasets import load_dataset, concatenate_datasets

# Load popular preference datasets
datasets = {
    "ultrafeedback": load_dataset("HuggingFaceH4/ultrafeedback_binarized"),
    "orca_dpo": load_dataset("Intel/orca_dpo_pairs"),
    "helpful_base": load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base"),
}

# Combine and sample. Note: these datasets use different split names and column
# schemas, so map each one to a common prompt/chosen/rejected format first
# (see the sketch below), then concatenate the normalized train splits.
combined = concatenate_datasets(normalized_train_splits)
sampled = combined.shuffle(seed=42).select(range(10000))
```
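As a sketch of that normalization, here is one possible mapping for Intel/orca_dpo_pairs; the column names (`question`, `chosen`, `rejected`) are an assumption about the dataset's current schema, so verify them with `column_names` before relying on this:

```python
def normalize_orca(example):
    """Map an Intel/orca_dpo_pairs row to the common prompt/chosen/rejected schema."""
    return {
        "prompt": example["question"],
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

orca = datasets["orca_dpo"]["train"]
orca_normalized = orca.map(normalize_orca, remove_columns=orca.column_names)

# Build analogous mappings for the other datasets, then concatenate:
normalized_train_splits = [orca_normalized]  # + the other normalized splits
```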
Quality Assurance
Validate Preference Pairs
```python
def validate_preference_pair(example):
    """Ensure preference pair is valid."""
    checks = [
        len(example["prompt"]) > 10,
        len(example["chosen"]) > 20,
        len(example["rejected"]) > 20,
        example["chosen"] != example["rejected"],  # Not identical
        len(example["chosen"]) != len(example["rejected"]),  # Different lengths
    ]
    return all(checks)

# Filter invalid pairs
dataset = dataset.filter(validate_preference_pair)
```
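A related check worth running separately is length bias: if the chosen response is almost always longer than the rejected one, DPO can pick up "longer is better" as a shortcut. A quick sketch:

```python
def check_length_bias(preference_data):
    """Report how often the chosen response is longer than the rejected one."""
    longer = sum(len(ex["chosen"]) > len(ex["rejected"]) for ex in preference_data)
    ratio = longer / len(preference_data)
    print(f"Chosen is longer in {ratio:.1%} of pairs")
    if ratio > 0.8:
        print("Warning: strong length bias - the model may just learn to be verbose")

check_length_bias(dataset)
```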
Check for Label Noise
Randomly sample and manually verify:
```python
import random

def audit_preferences(dataset, sample_size=50):
    """Manually review a sample of preference pairs."""
    sample = random.sample(list(dataset), sample_size)
    correct = 0
    for example in sample:
        print(f"Prompt: {example['prompt'][:100]}...")
        print(f"Chosen: {example['chosen'][:200]}...")
        print(f"Rejected: {example['rejected'][:200]}...")
        # Human verification
        verdict = input("Correct preference? (y/n): ")
        if verdict.lower() == "y":
            correct += 1
    accuracy = correct / sample_size
    print(f"Label accuracy: {accuracy:.1%}")
    if accuracy < 0.8:
        print("Warning: High label noise detected!")
```
Final Dataset Structure
```python
# Save in standard format
preference_dataset = {
    "prompt": [...],
    "chosen": [...],
    "rejected": [...],
}

# Convert to Hugging Face Dataset
from datasets import Dataset
dataset = Dataset.from_dict(preference_dataset)

# Split for training
dataset = dataset.train_test_split(test_size=0.1)

# Save to disk
dataset.save_to_disk("./preference_data")
```
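To reload the saved splits later (for example, in the next module's training script), `load_from_disk` restores the same structure:

```python
from datasets import load_from_disk

dataset = load_from_disk("./preference_data")
print(dataset)  # DatasetDict with "train" and "test" splits
print(dataset["train"].column_names)  # ['prompt', 'chosen', 'rejected']
```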
Tip: Aim for at least 1,000 high-quality preference pairs for DPO training. Quality matters more than quantity - noisy labels will hurt alignment.
In the next module, we'll put this all together and start fine-tuning with LoRA and QLoRA.