LLM Evaluation Fundamentals
Evaluation Dataset Design
3 min read
Your evaluation is only as good as your test data. A well-designed evaluation dataset catches real issues before they reach production.
What Makes a Good Evaluation Dataset?
| Characteristic | Description | Why It Matters |
|---|---|---|
| Representative | Covers actual production use cases | Catches real-world failures |
| Diverse | Includes edge cases and variations | Reveals hidden weaknesses |
| Labeled | Has expected outputs when possible | Enables automated scoring |
| Versioned | Tracks changes over time | Allows regression testing |
| Balanced | Proportional to production distribution | Accurate performance estimates |
Dataset Structure
Each record in a typical evaluation dataset includes fields like these:
```json
{
  "id": "qa-001",
  "input": "What is the refund policy?",
  "expected_output": "You can request a refund within 30 days of purchase.",
  "context": "Customer asking about returns",
  "category": "policy_questions",
  "difficulty": "easy",
  "tags": ["refund", "policy", "customer_service"]
}
```
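In code, each record maps naturally onto a small typed structure. The sketch below is one minimal way to do this in Python; the `EvalExample` class and `load_dataset` helper are illustrative names, and it assumes the dataset is stored as JSON Lines with one record per line.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class EvalExample:
    """One row of the evaluation dataset, mirroring the JSON record above."""
    id: str
    input: str
    expected_output: str
    context: str = ""
    category: str = "uncategorized"
    difficulty: str = "medium"
    tags: list[str] = field(default_factory=list)

def load_dataset(path: str) -> list[EvalExample]:
    """Load a JSON Lines file where each non-empty line is one record."""
    examples = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            examples.append(EvalExample(**json.loads(line)))
    return examples
```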
Building Your Dataset
Step 1: Collect Production Samples
Start with real user queries (a minimal sampling sketch follows this list):
- Sample from production logs
- Include successful and failed interactions
- Capture the full distribution of use cases
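As a starting point, something like the following can pull a random sample from a production log without loading the whole file. It assumes JSONL-formatted logs and uses a hypothetical `sample_production_queries` helper; adapt the parsing to whatever your logging pipeline actually emits.

```python
import json
import random

def sample_production_queries(log_path: str, n: int = 200, seed: int = 42) -> list[dict]:
    """Reservoir-sample n records from a JSONL production log in one pass."""
    random.seed(seed)
    sample: list[dict] = []
    with open(log_path) as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            if i < n:
                sample.append(record)
            else:
                # Replace an existing pick with probability n / (i + 1)
                j = random.randint(0, i)
                if j < n:
                    sample[j] = record
    return sample
```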
Step 2: Define Categories
Organize by use case type (a mix-check sketch follows this breakdown):
```text
├── Happy Path (60%)
│   ├── Common questions
│   ├── Standard requests
│   └── Typical workflows
├── Edge Cases (25%)
│   ├── Ambiguous queries
│   ├── Multi-part questions
│   └── Unusual formats
└── Adversarial (15%)
    ├── Prompt injection attempts
    ├── Out-of-scope requests
    └── Malformed inputs
```
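To keep the actual dataset close to these targets, a quick proportion check helps. The sketch below assumes each example carries a `category` field mapped onto the three buckets above; `TARGET_MIX` and `check_category_mix` are illustrative names, not a library API.

```python
from collections import Counter

# Target mix from the breakdown above (illustrative bucket names)
TARGET_MIX = {"happy_path": 0.60, "edge_cases": 0.25, "adversarial": 0.15}

def check_category_mix(examples: list[dict], tolerance: float = 0.05) -> dict[str, float]:
    """Compare the dataset's actual category proportions against the targets."""
    counts = Counter(ex["category"] for ex in examples)
    total = sum(counts.values())
    drift = {}
    for category, target in TARGET_MIX.items():
        actual = counts.get(category, 0) / total if total else 0.0
        if abs(actual - target) > tolerance:
            drift[category] = round(actual - target, 3)
    return drift  # empty dict means the mix is within tolerance
```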
Step 3: Add Expected Outputs
For reference-based evaluation (a scoring sketch follows the table):
| Input Type | Expected Output Approach |
|---|---|
| Factual questions | Exact or paraphrased answers |
| Open-ended | Key points that must be covered |
| Classification | Correct label |
| Extraction | Required entities |
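How the expected output gets used depends on the input type. The sketch below is a naive illustration, assuming per-example fields like `type`, `key_points`, and `required_entities` (none of which are standard); real pipelines typically swap in fuzzier matching or an LLM-as-judge for the open-ended cases.

```python
def score_example(example: dict, output: str) -> float:
    """Naive reference-based scoring keyed off the expected-output approach per input type."""
    expected = example["expected_output"]
    input_type = example.get("type", "factual")

    if input_type == "classification":
        # Correct label: exact match on the normalized label
        return float(output.strip().lower() == expected.strip().lower())
    if input_type == "open_ended":
        # Key points: fraction of required points mentioned in the output
        points = example.get("key_points", [])
        covered = sum(1 for p in points if p.lower() in output.lower())
        return covered / len(points) if points else 0.0
    if input_type == "extraction":
        # Required entities: all must appear in the output
        entities = example.get("required_entities", [])
        return float(all(e.lower() in output.lower() for e in entities))
    # Factual: substring match as a crude stand-in for exact/paraphrase checks
    return float(expected.strip().lower() in output.lower())
```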
Step 4: Version and Maintain
Track dataset evolution (a fingerprinting sketch follows this list):
- Tag versions (v1.0, v1.1, v2.0)
- Document changes between versions
- Keep historical versions for regression testing
- Review and update quarterly
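One lightweight way to make version tags trustworthy is to pin each tag to a content hash of the examples, so a score can always be traced back to the exact dataset that produced it. The manifest fields and helper names below are illustrative, not from any particular tool.

```python
import hashlib
import json

def dataset_fingerprint(examples: list[dict]) -> str:
    """Deterministic content hash tying a version tag to the exact examples."""
    canonical = json.dumps(sorted(examples, key=lambda ex: ex["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def build_manifest(examples: list[dict], version: str, changes: str) -> dict:
    """Small manifest to store alongside the dataset for regression testing."""
    return {
        "version": version,            # e.g. "1.1"
        "fingerprint": dataset_fingerprint(examples),
        "num_examples": len(examples),
        "changes": changes,            # human-readable changelog entry
    }
```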
Dataset Size Guidelines
| Stage | Minimum Size | Purpose |
|---|---|---|
| Development | 20-50 examples | Quick iteration |
| Validation | 100-200 examples | Reliable metrics |
| Production | 500+ examples | Comprehensive coverage |
Tip: Quality matters more than quantity. 100 well-curated examples beat 1000 noisy ones.
Common Pitfalls
- Overfitting to test data: Don't tune prompts on your eval set (see the split sketch after this list)
- Stale datasets: Production shifts; datasets should too
- Missing edge cases: Easy cases don't reveal weaknesses
- Inconsistent labeling: Multiple annotators need guidelines
- Leaking training data: Evaluation data must be separate
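A simple guard against the overfitting pitfall is to hold out part of the dataset: tune prompts on a dev split and report only on a test split you never look at during iteration. The sketch below is one way to split deterministically by ID, so the test set stays stable as the dataset grows; `split_dev_test` is a hypothetical helper.

```python
import hashlib

def split_dev_test(examples: list[dict], test_fraction: float = 0.5) -> tuple[list[dict], list[dict]]:
    """Deterministically split examples by hashing their IDs (stable across runs and dataset growth)."""
    dev, test = [], []
    for ex in examples:
        bucket = int(hashlib.sha256(ex["id"].encode("utf-8")).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else dev).append(ex)
    return dev, test
```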
Practical Example: Building a Support Bot Dataset
```yaml
# evaluation_dataset.yaml
version: "1.2"
created: "2024-01-15"
categories:
  - name: "billing_questions"
    count: 45
    examples:
      - input: "How do I update my credit card?"
        expected: "Go to Settings > Billing > Update Payment Method"
        difficulty: easy
      - input: "Why was I charged twice?"
        expected: "Check for duplicate transactions in billing history"
        difficulty: medium
  - name: "edge_cases"
    count: 20
    examples:
      - input: "refund plz"
        expected: "Understand as refund request despite informal language"
        difficulty: hard
      - input: ""
        expected: "Handle empty input gracefully"
        difficulty: hard
```
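A small validation script can catch drift between the declared metadata and the listed examples, such as missing fields or a declared count smaller than what the file actually contains. The sketch below assumes the YAML layout above (which shows only an excerpt of each category) and uses PyYAML; the function name is illustrative.

```python
import yaml  # PyYAML

def validate_dataset(path: str) -> list[str]:
    """Return a list of human-readable problems found in the dataset file (empty means OK)."""
    with open(path) as f:
        data = yaml.safe_load(f)

    problems = []
    for category in data.get("categories", []):
        examples = category.get("examples", [])
        # Declared count should not be smaller than the examples actually present
        if category.get("count", 0) < len(examples):
            problems.append(f"{category['name']}: count {category['count']} < {len(examples)} examples")
        for i, ex in enumerate(examples):
            for key in ("input", "expected", "difficulty"):
                if key not in ex:
                    problems.append(f"{category['name']}[{i}]: missing '{key}'")
    return problems
```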
Key Takeaways
- Start with production data: Real queries reveal real issues
- Cover the distribution: Match your evaluation to production traffic
- Include hard cases: Edge cases expose model weaknesses
- Version everything: Track changes for reproducibility
- Maintain regularly: Datasets need ongoing curation
With evaluation fundamentals covered, we'll now dive deep into LangSmith, a powerful platform for tracing and evaluating LLM applications.