Lesson 6 of 24

Dataset Preparation

Creating Training Data

3 min read

You have three main approaches to creating training data: manual curation, synthetic generation, and data augmentation. Let's explore each.

1. Manual Curation

The gold standard for quality. Best for specialized domains.

When to Use

  • Domain expertise is critical
  • High accuracy is required
  • You have subject matter experts available
  • Dataset size is manageable (<5,000 examples)

Best Practices

# Example: Creating a manual dataset for customer support
training_data = [
    {
        "instruction": "Customer asks about return policy",
        "input": "I bought this laptop 3 weeks ago and it's not working. Can I return it?",
        "output": "I'm sorry to hear about the issue with your laptop. Yes, you can return it within our 30-day return window. Since it's been 3 weeks, you're still eligible. Would you like me to initiate the return process for you?"
    },
    # Add more examples...
]
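The list-of-dicts format above is convenient for authoring, but most fine-tuning APIs expect JSONL in a chat-message format. A minimal conversion sketch; the exact field names vary by provider, so treat this schema as an assumption and check your platform's docs:

```python
import json

def to_chat_jsonl(examples, path):
    """Write instruction/input/output examples as chat-format JSONL.
    The "messages" schema here is an assumption -- verify it against
    your fine-tuning provider's documentation."""
    with open(path, "w") as f:
        for ex in examples:
            record = {"messages": [
                {"role": "system", "content": ex["instruction"]},
                {"role": "user", "content": ex["input"]},
                {"role": "assistant", "content": ex["output"]},
            ]}
            f.write(json.dumps(record) + "\n")
```

One record per line keeps the file streamable, which matters once your dataset grows past a few thousand examples.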

Quality Guidelines

Do                                  Don't
Use real examples from your domain  Copy-paste generic responses
Include edge cases                  Only include easy examples
Vary response length and style      Use identical phrasing
Have experts review outputs         Skip quality review
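Some of these guidelines can be checked mechanically. A heuristic audit sketch that flags exact-duplicate instructions and measures output-length spread (`audit_dataset` is a hypothetical helper, not a replacement for expert review):

```python
from collections import Counter

def audit_dataset(examples):
    """Flag guideline violations: duplicate instructions and overly
    uniform output lengths. A heuristic sketch only -- it cannot judge
    whether an answer is actually correct."""
    report = {}

    # "Use identical phrasing" check: exact-duplicate instructions
    counts = Counter(ex["instruction"] for ex in examples)
    report["duplicates"] = [text for text, n in counts.items() if n > 1]

    # "Vary response length" check: spread of output word counts
    lengths = [len(ex["output"].split()) for ex in examples]
    mean = sum(lengths) / len(lengths)
    variance = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    report["mean_output_words"] = mean
    report["output_length_stddev"] = variance ** 0.5
    return report
```

A standard deviation near zero is a warning sign that every answer has the same shape.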

2. Synthetic Data Generation

Use LLMs to generate training data. Fast and scalable.

Basic Approach

from openai import OpenAI

client = OpenAI()

def generate_training_example(topic, style):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""Generate a training example for fine-tuning.
Topic: {topic}
Style: {style}

Return JSON with 'instruction' and 'output' fields."""},
            {"role": "user", "content": "Generate one high-quality example."}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

# Generate examples
example = generate_training_example(
    topic="Python debugging",
    style="patient teacher explaining to a beginner"
)
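Note that `generate_training_example` returns a JSON string, not a dict, and models occasionally produce malformed or incomplete JSON. A validation sketch to run before anything enters the dataset:

```python
import json

REQUIRED_FIELDS = {"instruction", "output"}

def parse_generated_example(raw):
    """Parse and validate a model's JSON reply. Returns the example
    dict, or None if it is malformed, missing fields, or has empty
    values -- generated data should never enter the dataset unchecked."""
    try:
        example = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(example, dict) or not REQUIRED_FIELDS.issubset(example):
        return None
    if not all(isinstance(example[f], str) and example[f].strip()
               for f in REQUIRED_FIELDS):
        return None
    return example
```

Dropping a bad example is cheap; training on one is not.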

Seed-Based Generation

Start with a few manual examples, then expand:

seed_examples = [
    {"instruction": "Explain recursion", "output": "Recursion is when..."},
    {"instruction": "What is a hash table?", "output": "A hash table is..."}
]

import json

def expand_dataset(seeds, num_new=100):
    prompt = f"""Here are example Q&A pairs:
{json.dumps(seeds, indent=2)}

Generate {num_new} NEW, diverse examples in the same style.
Cover different programming topics.
Return as JSON array."""

    # Call the LLM (client from the earlier example) to generate more.
    # Assumes the model returns a bare JSON array -- validate in practice.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

Self-Instruct Method

The model generates its own training data:

import random

def self_instruct(base_model, num_examples=1000):
    """
    1. Start with seed tasks
    2. Generate new instructions
    3. Generate responses
    4. Filter low-quality examples
    """
    tasks = load_seed_tasks()  # 175 seed tasks

    for _ in range(num_examples):
        # Sample existing tasks for context
        context_tasks = random.sample(tasks, 3)

        # Generate new instruction
        new_instruction = base_model.generate(
            f"Given these tasks: {context_tasks}\n"
            "Generate a new, different task:"
        )

        # Generate response
        response = base_model.generate(new_instruction)

        # Filter and add to dataset
        if passes_quality_check(new_instruction, response):
            tasks.append({"instruction": new_instruction, "output": response})

    return tasks
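The `passes_quality_check` step above is left undefined. The original Self-Instruct work filters out new instructions that are too similar to existing ones (it uses ROUGE-L); a simpler word-overlap stand-in might look like this:

```python
def passes_quality_check(instruction, response, existing=(), max_overlap=0.7):
    """Heuristic filter in the spirit of Self-Instruct: reject very
    short pairs, and reject instructions whose word overlap (Jaccard
    similarity) with any existing instruction exceeds max_overlap.
    Self-Instruct itself uses ROUGE-L; this is a simpler stand-in."""
    if len(instruction.split()) < 3 or len(response.split()) < 3:
        return False
    new_words = set(instruction.lower().split())
    for prior in existing:
        prior_words = set(prior.lower().split())
        if not prior_words:
            continue
        overlap = len(new_words & prior_words) / len(new_words | prior_words)
        if overlap > max_overlap:
            return False
    return True
```

The similarity filter is what keeps the generated set diverse; without it, the model drifts toward rephrasing its own seed tasks.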

3. Data Augmentation

Expand existing datasets through transformation.

Techniques

# 1. Paraphrasing
original = "Explain how photosynthesis works"
augmented = [
    "How does photosynthesis work?",
    "Can you describe the process of photosynthesis?",
    "What happens during photosynthesis?"
]

# 2. Back-translation
# English → French → English (creates natural variations)

# 3. Complexity variation
simple = "What is AI?"
medium = "Explain artificial intelligence and its main branches"
complex = "Analyze the relationship between machine learning, deep learning, and artificial general intelligence"

Augmentation Pipeline

def augment_dataset(dataset, augmentation_factor=3):
    augmented = []

    for example in dataset:
        augmented.append(example)  # Keep original

        # Generate paraphrases
        paraphrases = generate_paraphrases(example["instruction"], n=augmentation_factor-1)

        for para in paraphrases:
            augmented.append({
                "instruction": para,
                "output": example["output"]  # Same output
            })

    return augmented
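`augment_dataset` assumes a `generate_paraphrases` helper. One way to sketch it is to inject the model call as a plain callable; `complete` here is a hypothetical prompt-to-text function (for example, a thin wrapper around the client from earlier), which keeps the function testable without network access:

```python
def generate_paraphrases(instruction, n=2, complete=None):
    """Ask the model for n paraphrases, one per line. `complete` is any
    callable mapping a prompt string to reply text (hypothetical -- wire
    it up to your LLM client in practice)."""
    prompt = (
        f"Rewrite the following instruction {n} different ways, "
        f"one per line, preserving the meaning:\n{instruction}"
    )
    reply = complete(prompt)
    # Keep non-empty lines, capped at n in case the model over-delivers
    return [line.strip() for line in reply.splitlines() if line.strip()][:n]
```

Passing the completion function in, rather than hard-coding a client, also makes it trivial to swap providers later.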

Combining Approaches

The best datasets often combine all three methods:

Manual Examples (100-500)
    ↓ Seed synthetic generation
Synthetic Examples (1,000-5,000)
    ↓ Augment for variety
Augmented Dataset (5,000-20,000)
    ↓ Human review & filtering
Final Training Set
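The pipeline above can be sketched as a small orchestrator that takes the stage functions as arguments; `expand`, `augment`, and `review` are hypothetical stand-ins for the code in the previous sections:

```python
def build_dataset(manual_examples, expand, augment, review):
    """Chain the stages from the diagram: manual seeds -> synthetic
    expansion -> augmentation -> review. The three stage functions are
    plain callables (hypothetical stand-ins for the earlier sections)."""
    synthetic = expand(manual_examples)               # seed-based generation
    augmented = augment(manual_examples + synthetic)  # paraphrasing etc.
    return [ex for ex in augmented if review(ex)]     # filtering pass
```

Keeping each stage as a separate function makes it easy to inspect (and version) the intermediate datasets.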

Quality vs Quantity

Dataset Size    Quality Needed    Typical Approach
<500            Very High         Manual only
500-2,000       High              Manual + careful synthetic
2,000-10,000    Medium-High       Synthetic + augmentation + filtering
10,000+         Medium            Large-scale synthetic + sampling

Key Insight: A smaller, high-quality dataset almost always outperforms a larger, noisy one. Start small and expand only if needed.

Next, we'll learn how to clean and validate your training data.
