Dataset Preparation
Instruction Tuning Datasets
The quality of your training data is the single most important factor in fine-tuning success. Let's walk through the formats and structures used in instruction tuning.
Dataset Formats
Alpaca Format
The most common format for instruction tuning:
{
  "instruction": "Write a haiku about programming",
  "input": "",
  "output": "Code flows like water\nBugs hide in shadowed corners\nDebug, repeat, win"
}
With optional input:
{
  "instruction": "Translate the following to French",
  "input": "Hello, how are you?",
  "output": "Bonjour, comment allez-vous?"
}
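Before training, Alpaca-style records are typically flattened into a single prompt string, with the Input section included only when it is non-empty. A minimal sketch, assuming the ### Instruction / ### Response wording used later in this lesson (the format_alpaca name is illustrative):
def format_alpaca(example):
    """Flatten one Alpaca-style record into a single training string."""
    if example["input"]:
        return {
            "text": (
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}"
            )
        }
    return {
        "text": (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    }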
ShareGPT/Conversation Format
For multi-turn conversations:
{
  "conversations": [
    {"from": "human", "value": "What is machine learning?"},
    {"from": "gpt", "value": "Machine learning is a subset of AI..."},
    {"from": "human", "value": "Can you give an example?"},
    {"from": "gpt", "value": "Sure! A common example is email spam filtering..."}
  ]
}
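ShareGPT's from/value keys map directly onto the role/content convention used in the next format, so a small conversion step is often all you need. A sketch (the sharegpt_to_messages name and role mapping are assumptions, not part of any library):
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(example):
    """Convert ShareGPT-style turns into role/content messages."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }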
ChatML Format
Used by many modern models:
{
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string"},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"}
  ]
}
Chat Templates
Modern models use chat templates to structure conversations. The tokenizer handles this automatically:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you today?"}
]
# Apply chat template
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
Output for Llama 3.2:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there! How can I help you today?<|eot_id|>
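For inference (as opposed to training), you usually want the formatted prompt to end with an open assistant turn so the model knows it should respond. apply_chat_template supports this with add_generation_prompt; the exact header it appends depends on the model's template:
# Format only the system and user turns, then open a new assistant turn
prompt = tokenizer.apply_chat_template(
    messages[:-1],               # drop the existing assistant reply
    tokenize=False,
    add_generation_prompt=True   # append the assistant header for generation
)
print(prompt)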
Popular Training Datasets
| Dataset | Size | Use Case |
|---|---|---|
| Alpaca | 52K | General instruction following |
| ShareGPT | 90K | Conversational AI |
| OpenAssistant | 160K | Helpful assistant behavior |
| Dolly | 15K | Open-source, commercially usable |
| FLAN | 1M+ | Multi-task instruction tuning |
Loading Datasets
from datasets import load_dataset
# Load from Hugging Face Hub
dataset = load_dataset("tatsu-lab/alpaca")
# Load from local JSON file
dataset = load_dataset("json", data_files="my_data.json")
# Load from CSV
dataset = load_dataset("csv", data_files="my_data.csv")
# Split into train/validation
dataset = dataset["train"].train_test_split(test_size=0.1)
Dataset Structure Requirements
For SFTTrainer, your dataset needs one of the following columns (a minimal trainer setup is sketched after this list):
- Text column (if using simple format):
dataset = dataset.map(lambda x: {
    "text": f"### Instruction:\n{x['instruction']}\n\n### Response:\n{x['output']}"
})
- Messages column (if using chat format):
dataset = dataset.map(lambda x: {
    "messages": [
        {"role": "user", "content": x["instruction"]},
        {"role": "assistant", "content": x["output"]}
    ]
})
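With either column in place, the dataset can be passed to the trainer directly. A minimal sketch using TRL (the model name, output directory, and config are placeholders; check your TRL version's SFTConfig options):
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-3B-Instruct",   # placeholder model ID
    args=SFTConfig(output_dir="./sft-output"),  # placeholder output directory
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()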
Best Practices
- Consistency: Use the same format throughout your dataset
- Completeness: Include system prompts if your use case needs them
- Variety: Mix different types of instructions
- Quality over quantity: 1,000 excellent examples > 10,000 mediocre ones
Next, we'll learn how to create your own high-quality training data.