# Dataset Preparation

## Creating Training Data
You have three main approaches to creating training data: manual curation, synthetic generation, and data augmentation. Let's explore each.
### 1. Manual Curation

The gold standard for quality. Best for specialized domains.
#### When to Use
- Domain expertise is critical
- High accuracy is required
- You have subject matter experts available
- Dataset size is manageable (<5,000 examples)
#### Best Practices

```python
# Example: creating a manual dataset for customer support
training_data = [
    {
        "instruction": "Customer asks about return policy",
        "input": "I bought this laptop 3 weeks ago and it's not working. Can I return it?",
        "output": "I'm sorry to hear about the issue with your laptop. Yes, you can return it within our 30-day return window. Since it's been 3 weeks, you're still eligible. Would you like me to initiate the return process for you?"
    },
    # Add more examples...
]
```
#### Quality Guidelines
| Do | Don't |
|---|---|
| Use real examples from your domain | Copy-paste generic responses |
| Include edge cases | Only include easy examples |
| Vary response length and style | Use identical phrasing |
| Have experts review outputs | Skip quality review |
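Some of these guidelines can be checked mechanically before human review. A minimal audit sketch (the helper name and thresholds are illustrative assumptions, not a standard tool):

```python
from collections import Counter

def audit_dataset(examples):
    """Flag duplicate instructions and summarize output-length variety."""
    instructions = [ex["instruction"] for ex in examples]
    dupes = [text for text, n in Counter(instructions).items() if n > 1]
    lengths = [len(ex["output"]) for ex in examples]
    return {
        "num_examples": len(examples),
        "duplicate_instructions": dupes,
        "unique_ratio": len(set(instructions)) / max(len(instructions), 1),
        # A spread near zero suggests identical phrasing/length everywhere
        "output_length_spread": max(lengths) - min(lengths) if lengths else 0,
    }
```

Running this on each batch catches copy-paste duplicates and uniformly sized outputs early, before an expert ever reads the data.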
### 2. Synthetic Data Generation

Use LLMs to generate training data. Fast and scalable.
#### Basic Approach

```python
from openai import OpenAI

client = OpenAI()

def generate_training_example(topic, style):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""Generate a training example for fine-tuning.
Topic: {topic}
Style: {style}
Return JSON with 'instruction' and 'output' fields."""},
            {"role": "user", "content": "Generate one high-quality example."}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

# Generate an example
example = generate_training_example(
    topic="Python debugging",
    style="patient teacher explaining to a beginner"
)
```
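Note that `response_format={"type": "json_object"}` only guarantees syntactically valid JSON, not that the expected fields are present. A small validation step helps; this helper is a sketch, with field names mirroring the prompt above:

```python
import json

REQUIRED_FIELDS = ("instruction", "output")

def parse_generated_example(raw):
    """Parse the model's JSON string and verify the expected fields exist."""
    example = json.loads(raw)
    missing = [f for f in REQUIRED_FIELDS if not example.get(f)]
    if missing:
        raise ValueError(f"generated example missing fields: {missing}")
    return example
```

Wrapping every generation call in a parse-and-validate step lets you drop malformed examples immediately instead of discovering them during training.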
#### Seed-Based Generation

Start with a few manual examples, then expand:
```python
seed_examples = [
    {"instruction": "Explain recursion", "output": "Recursion is when..."},
    {"instruction": "What is a hash table?", "output": "A hash table is..."}
]

def expand_dataset(seeds, num_new=100):
    prompt = f"""Here are example Q&A pairs:
{seeds}

Generate {num_new} NEW, diverse examples in the same style.
Cover different programming topics.
Return as JSON array."""
    # Call the LLM with `prompt` and parse the returned JSON array
    # ...
```
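Seed-based expansion tends to produce near-duplicates, so deduplicate before training. A simple word-overlap filter is often enough as a first pass; the 0.8 Jaccard threshold here is an assumption to tune for your data:

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def dedup_examples(examples, threshold=0.8):
    """Keep an example only if its instruction differs enough from kept ones."""
    kept = []
    for ex in examples:
        if all(jaccard(ex["instruction"], k["instruction"]) < threshold for k in kept):
            kept.append(ex)
    return kept
```

For large datasets you would swap this O(n²) loop for embedding-based near-duplicate detection, but the principle is the same.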
#### Self-Instruct Method

The model generates its own training data:
```python
import random

def self_instruct(base_model, num_examples=1000):
    """
    1. Start with seed tasks
    2. Generate new instructions
    3. Generate responses
    4. Filter low-quality examples
    """
    tasks = load_seed_tasks()  # e.g. 175 seed tasks, as in the Self-Instruct paper
    for _ in range(num_examples):
        # Sample existing tasks for context
        context_tasks = random.sample(tasks, 3)
        # Generate a new instruction
        new_instruction = base_model.generate(
            f"Given these tasks: {context_tasks}\n"
            "Generate a new, different task:"
        )
        # Generate a response to it
        response = base_model.generate(new_instruction)
        # Filter, and add surviving pairs to the dataset
        if passes_quality_check(new_instruction, response):
            tasks.append({"instruction": new_instruction, "output": response})
    return tasks
```
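The `passes_quality_check` filter is left undefined above. The original Self-Instruct paper filters new instructions by ROUGE-L overlap with existing tasks, among other heuristics; a simplified stand-in using length and echo checks might look like this (thresholds are illustrative assumptions):

```python
def passes_quality_check(instruction, response,
                         min_instruction_words=3, min_response_chars=20):
    """Reject trivially short or degenerate generations."""
    if len(instruction.split()) < min_instruction_words:
        return False
    if len(response.strip()) < min_response_chars:
        return False
    # Reject responses that merely echo the instruction back
    if response.strip().lower() == instruction.strip().lower():
        return False
    return True
```

Cheap heuristics like these remove the worst generations; a second LLM-as-judge pass can then score what remains.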
### 3. Data Augmentation

Expand existing datasets through transformation.
#### Techniques

```python
# 1. Paraphrasing
original = "Explain how photosynthesis works"
augmented = [
    "How does photosynthesis work?",
    "Can you describe the process of photosynthesis?",
    "What happens during photosynthesis?"
]

# 2. Back-translation
# English → French → English (creates natural variations)

# 3. Complexity variation
simple = "What is AI?"
medium = "Explain artificial intelligence and its main branches"
complex_q = "Analyze the relationship between machine learning, deep learning, and artificial general intelligence"  # named complex_q to avoid shadowing Python's built-in `complex`
```
#### Augmentation Pipeline

```python
def augment_dataset(dataset, augmentation_factor=3):
    augmented = []
    for example in dataset:
        augmented.append(example)  # Keep the original
        # Generate paraphrases of the instruction
        paraphrases = generate_paraphrases(
            example["instruction"], n=augmentation_factor - 1
        )
        for para in paraphrases:
            augmented.append({
                "instruction": para,
                "output": example["output"]  # Same output
            })
    return augmented
```
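`generate_paraphrases` is left to the reader; in practice it would call an LLM. For testing the pipeline offline, a deterministic template-based stand-in works (the templates are illustrative, and real paraphrasing should be far more varied):

```python
def generate_paraphrases(instruction, n=2):
    """Cheap template paraphrases; swap in an LLM call for real variety."""
    templates = [
        "Can you {}?",
        "Please {}.",
        "I'd like you to {}.",
    ]
    # Lowercase the first character so the instruction reads naturally mid-sentence
    base = instruction[0].lower() + instruction[1:]
    return [t.format(base.rstrip(".?!")) for t in templates[:n]]
```

Keeping the paraphraser behind a function boundary like this means the augmentation pipeline is unchanged when you upgrade from templates to an LLM.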
## Combining Approaches

The best datasets often combine all three methods:

```
Manual Examples (100-500)
        ↓ Seed synthetic generation
Synthetic Examples (1,000-5,000)
        ↓ Augment for variety
Augmented Dataset (5,000-20,000)
        ↓ Human review & filtering
Final Training Set
```
## Quality vs. Quantity
| Dataset Size | Quality Needed | Typical Approach |
|---|---|---|
| <500 | Very High | Manual only |
| 500-2,000 | High | Manual + careful synthetic |
| 2,000-10,000 | Medium-High | Synthetic + augmentation + filtering |
| 10,000+ | Medium | Large-scale synthetic + sampling |
**Key insight:** A smaller, high-quality dataset almost always outperforms a larger, noisy one. Start small and expand only if needed.
Next, we'll learn how to clean and validate your training data.

:::