Lesson 5 of 24

Dataset Preparation

Instruction Tuning Datasets

3 min read

The quality of your training data is the single most important factor in fine-tuning success. Let's look at the common formats and structures used for instruction tuning.

Dataset Formats

Alpaca Format

The most common format for instruction tuning:

{
  "instruction": "Write a haiku about programming",
  "input": "",
  "output": "Code flows like water\nBugs hide in shadowed corners\nDebug, repeat, win"
}

With optional input:

{
  "instruction": "Translate the following to French",
  "input": "Hello, how are you?",
  "output": "Bonjour, comment allez-vous?"
}
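
The instruction and optional input are typically rendered into a single prompt string before training. Here is a minimal sketch using the same "### Instruction / ### Response" headers that appear later in this lesson; build_prompt is just an illustrative helper name, not part of the Alpaca release:

def build_prompt(example: dict) -> str:
    """Render an Alpaca-style record into one training string."""
    if example.get("input"):
        return (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

record = {
    "instruction": "Translate the following to French",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous?",
}
print(build_prompt(record))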

ShareGPT/Conversation Format

For multi-turn conversations:

{
  "conversations": [
    {"from": "human", "value": "What is machine learning?"},
    {"from": "gpt", "value": "Machine learning is a subset of AI..."},
    {"from": "human", "value": "Can you give an example?"},
    {"from": "gpt", "value": "Sure! A common example is email spam filtering..."}
  ]
}
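
Training libraries often expect the role/content messages format shown in the next section, so ShareGPT-style records are commonly converted first. A minimal sketch assuming the "from"/"value" keys above; role_map and to_messages are illustrative names:

# Map ShareGPT speaker labels onto standard chat roles.
role_map = {"system": "system", "human": "user", "gpt": "assistant"}

def to_messages(example: dict) -> dict:
    return {
        "messages": [
            {"role": role_map[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }

example = {
    "conversations": [
        {"from": "human", "value": "What is machine learning?"},
        {"from": "gpt", "value": "Machine learning is a subset of AI..."},
    ]
}
print(to_messages(example))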

ChatML Format

A role-based messages format used by many modern chat models:

{
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string"},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"}
  ]
}

Chat Templates

Modern models use chat templates to structure conversations. The tokenizer handles this automatically:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you today?"}
]

# Apply chat template
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)

Output for Llama 3.2 (simplified):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there! How can I help you today?<|eot_id|>
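
The template above is what you would train on. At inference time you usually pass add_generation_prompt=True so the formatted string ends with an open assistant header for the model to complete. A short sketch reusing the tokenizer loaded above:

# Leave out the assistant turn and let the template append the assistant header.
prompt_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = tokenizer.apply_chat_template(
    prompt_messages,
    tokenize=False,
    add_generation_prompt=True,  # ends with an open assistant header
)
print(prompt)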

Popular Instruction Datasets

Dataset         Size    Use Case
Alpaca          52K     General instruction following
ShareGPT        90K     Conversational AI
OpenAssistant   160K    Helpful assistant behavior
Dolly           15K     Open-source, commercially usable
FLAN            1M+     Multi-task instruction tuning

Loading Datasets

from datasets import load_dataset

# Load from Hugging Face Hub
dataset = load_dataset("tatsu-lab/alpaca")

# Load from local JSON file
dataset = load_dataset("json", data_files="my_data.json")

# Load from CSV
dataset = load_dataset("csv", data_files="my_data.csv")

# Split into train/validation
dataset = dataset["train"].train_test_split(test_size=0.1)

Dataset Structure Requirements

For SFTTrainer, your dataset needs one of the following columns:

  1. A text column (if using a simple prompt format):
dataset = dataset.map(lambda x: {
    "text": f"### Instruction:\n{x['instruction']}\n\n### Response:\n{x['output']}"
})

  2. A messages column (if using the chat format):
dataset = dataset.map(lambda x: {
    "messages": [
        {"role": "user", "content": x["instruction"]},
        {"role": "assistant", "content": x["output"]}
    ]
})
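
Either column can then be handed to SFTTrainer. A minimal sketch, assuming a recent version of trl (which renders a messages column with the model's chat template and uses a text column as-is); output_dir is just an illustrative path:

from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Load the base model to fine-tune.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],           # the split created earlier
    eval_dataset=dataset["test"],
    args=SFTConfig(output_dir="sft-output"),  # training hyperparameters go here
)
trainer.train()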

Best Practices

  1. Consistency: Use the same format throughout your dataset
  2. Completeness: Include system prompts if your use case needs them
  3. Variety: Mix different types of instructions
  4. Quality over quantity: 1,000 excellent examples > 10,000 mediocre ones
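
As a small example of the last point, a quick filtering pass with datasets can strip obviously weak examples before training; the 20-character threshold below is arbitrary and only for illustration:

# Drop examples with a missing instruction or a very short response.
def looks_reasonable(example):
    return bool(example.get("instruction")) and len(example.get("output", "")) >= 20

filtered = dataset["train"].filter(looks_reasonable)
print(f"Kept {len(filtered)} of {len(dataset['train'])} examples")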

Next, we'll learn how to create your own high-quality training data.
