# Dataset Preparation

## Creating Training Data
You have three main approaches to creating training data: manual curation, synthetic generation, and data augmentation. Let's explore each.
### 1. Manual Curation

The gold standard for quality. Best for specialized domains.
#### When to Use
- Domain expertise is critical
- High accuracy is required
- You have subject matter experts available
- Dataset size is manageable (<5,000 examples)
#### Best Practices

```python
# Example: creating a manual dataset for customer support
training_data = [
    {
        "instruction": "Customer asks about return policy",
        "input": "I bought this laptop 3 weeks ago and it's not working. Can I return it?",
        "output": "I'm sorry to hear about the issue with your laptop. Yes, you can return it within our 30-day return window. Since it's been 3 weeks, you're still eligible. Would you like me to initiate the return process for you?"
    },
    # Add more examples...
]
```
#### Quality Guidelines
| Do | Don't |
|---|---|
| Use real examples from your domain | Copy-paste generic responses |
| Include edge cases | Only include easy examples |
| Vary response length and style | Use identical phrasing |
| Have experts review outputs | Skip quality review |
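Some of these guidelines can be checked mechanically before human review. A minimal audit sketch (the helper name and thresholds are illustrative assumptions, not a standard tool):

```python
from collections import Counter

def audit_dataset(examples):
    """Flag duplicate instructions and summarize output-length variety."""
    instructions = [ex["instruction"] for ex in examples]
    dupes = [text for text, n in Counter(instructions).items() if n > 1]
    lengths = [len(ex["output"]) for ex in examples]
    return {
        "num_examples": len(examples),
        "duplicate_instructions": dupes,
        "unique_ratio": len(set(instructions)) / max(len(instructions), 1),
        # A spread near zero suggests identical phrasing/length everywhere
        "output_length_spread": max(lengths) - min(lengths) if lengths else 0,
    }
```

Running this on each batch catches copy-paste duplicates and uniformly sized outputs early, before an expert ever reads the data.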
### 2. Synthetic Data Generation

Use LLMs to generate training data. Fast and scalable.
#### Basic Approach

```python
from openai import OpenAI

client = OpenAI()

def generate_training_example(topic, style):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""Generate a training example for fine-tuning.
Topic: {topic}
Style: {style}
Return JSON with 'instruction' and 'output' fields."""},
            {"role": "user", "content": "Generate one high-quality example."}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

# Generate an example
example = generate_training_example(
    topic="Python debugging",
    style="patient teacher explaining to a beginner"
)
```
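Note that `response_format={"type": "json_object"}` only guarantees syntactically valid JSON, not that the expected fields are present. A small validation step helps; this helper is a sketch, with field names mirroring the prompt above:

```python
import json

REQUIRED_FIELDS = ("instruction", "output")

def parse_generated_example(raw):
    """Parse the model's JSON string and verify the expected fields exist."""
    example = json.loads(raw)
    missing = [f for f in REQUIRED_FIELDS if not example.get(f)]
    if missing:
        raise ValueError(f"generated example missing fields: {missing}")
    return example
```

Wrapping every generation call in a parse-and-validate step lets you drop malformed examples immediately instead of discovering them during training.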
#### Seed-Based Generation

Start with a few manual examples, then expand:
```python
seed_examples = [
    {"instruction": "Explain recursion", "output": "Recursion is when..."},
    {"instruction": "What is a hash table?", "output": "A hash table is..."}
]

def expand_dataset(seeds, num_new=100):
    prompt = f"""Here are example Q&A pairs:
{seeds}

Generate {num_new} NEW, diverse examples in the same style.
Cover different programming topics.
Return as JSON array."""
    # Call the LLM with `prompt` and parse the returned JSON array
    # ...
```
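Seed-based expansion tends to produce near-duplicates, so deduplicate before training. A simple word-overlap filter is often enough as a first pass; the 0.8 Jaccard threshold here is an assumption to tune for your data:

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def dedup_examples(examples, threshold=0.8):
    """Keep an example only if its instruction differs enough from kept ones."""
    kept = []
    for ex in examples:
        if all(jaccard(ex["instruction"], k["instruction"]) < threshold for k in kept):
            kept.append(ex)
    return kept
```

For large datasets you would swap this O(n²) loop for embedding-based near-duplicate detection, but the principle is the same.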
#### Self-Instruct Method

The model generates its own training data:
```python
import random

def self_instruct(base_model, num_examples=1000):
    """
    1. Start with seed tasks
    2. Generate new instructions
    3. Generate responses
    4. Filter low-quality examples
    """
    tasks = load_seed_tasks()  # e.g. 175 seed tasks, as in the Self-Instruct paper
    for _ in range(num_examples):
        # Sample existing tasks for context
        context_tasks = random.sample(tasks, 3)
        # Generate a new instruction
        new_instruction = base_model.generate(
            f"Given these tasks: {context_tasks}\n"
            "Generate a new, different task:"
        )
        # Generate a response to it
        response = base_model.generate(new_instruction)
        # Filter, and add surviving pairs to the dataset
        if passes_quality_check(new_instruction, response):
            tasks.append({"instruction": new_instruction, "output": response})
    return tasks
```
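The `passes_quality_check` filter is left undefined above. The original Self-Instruct paper filters new instructions by ROUGE-L overlap with existing tasks, among other heuristics; a simplified stand-in using length and echo checks might look like this (thresholds are illustrative assumptions):

```python
def passes_quality_check(instruction, response,
                         min_instruction_words=3, min_response_chars=20):
    """Reject trivially short or degenerate generations."""
    if len(instruction.split()) < min_instruction_words:
        return False
    if len(response.strip()) < min_response_chars:
        return False
    # Reject responses that merely echo the instruction back
    if response.strip().lower() == instruction.strip().lower():
        return False
    return True
```

Cheap heuristics like these remove the worst generations; a second LLM-as-judge pass can then score what remains.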
### 3. Data Augmentation

Expand existing datasets through transformation.
#### Techniques

```python
# 1. Paraphrasing
original = "Explain how photosynthesis works"
augmented = [
    "How does photosynthesis work?",
    "Can you describe the process of photosynthesis?",
    "What happens during photosynthesis?"
]

# 2. Back-translation
# English → French → English (creates natural variations)

# 3. Complexity variation
simple = "What is AI?"
medium = "Explain artificial intelligence and its main branches"
complex_q = "Analyze the relationship between machine learning, deep learning, and artificial general intelligence"  # named complex_q to avoid shadowing Python's built-in `complex`
```
#### Augmentation Pipeline

```python
def augment_dataset(dataset, augmentation_factor=3):
    augmented = []
    for example in dataset:
        augmented.append(example)  # Keep the original
        # Generate paraphrases of the instruction
        paraphrases = generate_paraphrases(
            example["instruction"], n=augmentation_factor - 1
        )
        for para in paraphrases:
            augmented.append({
                "instruction": para,
                "output": example["output"]  # Same output
            })
    return augmented
```
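`generate_paraphrases` is left to the reader; in practice it would call an LLM. For testing the pipeline offline, a deterministic template-based stand-in works (the templates are illustrative, and real paraphrasing should be far more varied):

```python
def generate_paraphrases(instruction, n=2):
    """Cheap template paraphrases; swap in an LLM call for real variety."""
    templates = [
        "Can you {}?",
        "Please {}.",
        "I'd like you to {}.",
    ]
    # Lowercase the first character so the instruction reads naturally mid-sentence
    base = instruction[0].lower() + instruction[1:]
    return [t.format(base.rstrip(".?!")) for t in templates[:n]]
```

Keeping the paraphraser behind a function boundary like this means the augmentation pipeline is unchanged when you upgrade from templates to an LLM.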
## Combining Approaches

The best datasets often combine all three methods:

```
Manual Examples (100-500)
        ↓ Seed synthetic generation
Synthetic Examples (1,000-5,000)
        ↓ Augment for variety
Augmented Dataset (5,000-20,000)
        ↓ Human review & filtering
Final Training Set
```
## Quality vs. Quantity
| Dataset Size | Quality Needed | Typical Approach |
|---|---|---|
| <500 | Very High | Manual only |
| 500-2,000 | High | Manual + careful synthetic |
| 2,000-10,000 | Medium-High | Synthetic + augmentation + filtering |
| 10,000+ | Medium | Large-scale synthetic + sampling |
**Key insight:** A smaller, high-quality dataset almost always outperforms a larger, noisy one. Start small and expand only if needed.
Next, we'll learn how to clean and validate your training data.

:::