LLM Evaluation Fundamentals

Evaluation Dataset Design


Your evaluation is only as good as your test data. A well-designed evaluation dataset catches real issues before they reach production.

What Makes a Good Evaluation Dataset?

Characteristic   Description                               Why It Matters
Representative   Covers actual production use cases        Catches real-world failures
Diverse          Includes edge cases and variations        Reveals hidden weaknesses
Labeled          Has expected outputs when possible        Enables automated scoring
Versioned        Tracks changes over time                  Allows regression testing
Balanced         Proportional to production distribution   Accurate performance estimates

Dataset Structure

Each record in a typical evaluation dataset includes fields like these:

{
  "id": "qa-001",
  "input": "What is the refund policy?",
  "expected_output": "You can request a refund within 30 days of purchase.",
  "context": "Customer asking about returns",
  "category": "policy_questions",
  "difficulty": "easy",
  "tags": ["refund", "policy", "customer_service"]
}
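
If you manage examples in code, a lightweight schema keeps records consistent across contributors. Below is a minimal sketch using a Python dataclass whose fields mirror the JSON record above; the file name and loader function are illustrative, not part of any particular framework.

# eval_example.py -- a minimal schema sketch; fields mirror the JSON record above
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalExample:
    id: str
    input: str
    expected_output: Optional[str] = None  # may be absent for unlabeled examples
    context: str = ""
    category: str = "uncategorized"
    difficulty: str = "medium"
    tags: list[str] = field(default_factory=list)

def load_examples(path: str) -> list[EvalExample]:
    """Load a JSON array of example records into typed objects."""
    with open(path) as f:
        records = json.load(f)
    return [EvalExample(**record) for record in records]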

Building Your Dataset

Step 1: Collect Production Samples

Start with real user queries:

  • Sample from production logs (see the sampling sketch after this list)
  • Include successful and failed interactions
  • Capture the full distribution of use cases
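
One way to capture the full distribution without loading every log into memory is reservoir sampling over the log file. This is a sketch that assumes logs are stored as JSON lines with a "query" field; adjust the field names and file layout to your own logging setup.

# sample_logs.py -- sketch: draw a uniform random sample from production logs
import json
import random

def sample_queries(log_path: str, n: int = 200, seed: int = 42) -> list[dict]:
    """Reservoir-sample n log records so the sample reflects the full distribution."""
    random.seed(seed)
    sample: list[dict] = []
    with open(log_path) as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            if i < n:
                sample.append(record)
            else:
                j = random.randint(0, i)
                if j < n:
                    sample[j] = record
    return sample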

Step 2: Define Categories

Organize by use case type:

├── Happy Path (60%)
│   ├── Common questions
│   ├── Standard requests
│   └── Typical workflows
├── Edge Cases (25%)
│   ├── Ambiguous queries
│   ├── Multi-part questions
│   └── Unusual formats
└── Adversarial (15%)
    ├── Prompt injection attempts
    ├── Out-of-scope requests
    └── Malformed inputs
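
A quick way to keep this split honest is to check the category mix against the target proportions whenever the dataset changes. The sketch below assumes the categories are stored under the slugs "happy_path", "edge_cases", and "adversarial"; rename them to match your own labels.

# category_balance.py -- sketch: compare the dataset's category mix to the target split
from collections import Counter

TARGET = {"happy_path": 0.60, "edge_cases": 0.25, "adversarial": 0.15}

def check_balance(examples: list[dict], tolerance: float = 0.05) -> dict[str, float]:
    """Return the actual share per category and warn when it drifts from the target."""
    if not examples:
        return {}
    counts = Counter(ex["category"] for ex in examples)
    total = sum(counts.values())
    shares = {cat: counts.get(cat, 0) / total for cat in TARGET}
    for cat, share in shares.items():
        if abs(share - TARGET[cat]) > tolerance:
            print(f"warning: {cat} is {share:.0%}, target is {TARGET[cat]:.0%}")
    return shares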

Step 3: Add Expected Outputs

For reference-based evaluation:

Input Type          Expected Output Approach
Factual questions   Exact or paraphrased answers
Open-ended          Key points that must be covered
Classification      Correct label
Extraction          Required entities
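
These approaches translate directly into simple scoring functions. The sketch below uses deliberately naive string checks to illustrate the idea; real evaluations often layer semantic similarity or an LLM judge on top of checks like these.

# scoring.py -- sketch of reference-based checks matching the table above
def exact_or_paraphrase_match(output: str, expected: str) -> bool:
    """Loose check for factual questions: exact match after normalization."""
    return output.strip().lower() == expected.strip().lower()

def covers_key_points(output: str, key_points: list[str]) -> float:
    """For open-ended answers: fraction of required key points mentioned."""
    text = output.lower()
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points) if key_points else 1.0

def label_match(output: str, expected_label: str) -> bool:
    """For classification tasks: the predicted label must equal the reference label."""
    return output.strip().lower() == expected_label.strip().lower()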

Step 4: Version and Maintain

Track dataset evolution:

  • Tag versions (v1.0, v1.1, v2.0)
  • Document changes between versions
  • Keep historical versions for regression testing
  • Review and update quarterly
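
Version tags are most useful when paired with a content fingerprint, so a file still labeled "v1.1" that silently changed is easy to spot. Here is a minimal sketch; the manifest file name and layout are assumptions, not an established convention.

# dataset_version.py -- sketch: record a version tag alongside a content hash
import hashlib
import json
from datetime import date

def fingerprint(path: str) -> str:
    """Hash the raw file bytes; any edit changes the fingerprint."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]

def write_manifest(dataset_path: str, version: str,
                   manifest_path: str = "dataset_manifest.json") -> None:
    entry = {
        "version": version,
        "date": date.today().isoformat(),
        "content_hash": fingerprint(dataset_path),
    }
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = []
    manifest.append(entry)
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)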

Dataset Size Guidelines

Stage         Minimum Size       Purpose
Development   20-50 examples     Quick iteration
Validation    100-200 examples   Reliable metrics
Production    500+ examples      Comprehensive coverage

Tip: Quality matters more than quantity. 100 well-curated examples beat 1000 noisy ones.

Common Pitfalls

  1. Overfitting to test data: Don't tune prompts on your eval set
  2. Stale datasets: Production shifts; datasets should too
  3. Missing edge cases: Easy cases don't reveal weaknesses
  4. Inconsistent labeling: Multiple annotators need guidelines
  5. Leaking training data: Evaluation data must stay separate from anything used to train or prompt the model (see the overlap check below)
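
Pitfall 5 lends itself to an automated first-pass check: compare normalized evaluation inputs against whatever went into fine-tuning or few-shot prompts. This sketch only catches exact duplicates after normalization; near-duplicates need fuzzy or embedding-based matching.

# leakage_check.py -- sketch: flag evaluation inputs that also appear in training data
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def find_overlap(eval_inputs: list[str], training_inputs: list[str]) -> set[str]:
    """Return evaluation inputs that also appear (after normalization) in training data."""
    train_set = {normalize(t) for t in training_inputs}
    return {q for q in eval_inputs if normalize(q) in train_set}

leaks = find_overlap(
    eval_inputs=["What is the refund policy?"],
    training_inputs=["what is the refund policy?  "],
)
assert leaks  # exact duplicates are caught; paraphrases are not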

Practical Example: Building a Support Bot Dataset

# evaluation_dataset.yaml
version: "1.2"
created: "2024-01-15"
categories:
  - name: "billing_questions"
    count: 45
    examples:
      - input: "How do I update my credit card?"
        expected: "Go to Settings > Billing > Update Payment Method"
        difficulty: easy
      - input: "Why was I charged twice?"
        expected: "Check for duplicate transactions in billing history"
        difficulty: medium

  - name: "edge_cases"
    count: 20
    examples:
      - input: "refund plz"
        expected: "Understand as refund request despite informal language"
        difficulty: hard
      - input: ""
        expected: "Handle empty input gracefully"
        difficulty: hard
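
A minimal sketch for driving an evaluation run from this file is shown below. It assumes PyYAML is installed, and ask_bot is a placeholder for whatever function actually calls your support bot.

# run_eval.py -- sketch: load evaluation_dataset.yaml and walk its examples
import yaml

def ask_bot(question: str) -> str:
    # placeholder for the system under test; swap in your real call
    return "stub answer"

with open("evaluation_dataset.yaml") as f:
    dataset = yaml.safe_load(f)

print(f"dataset version {dataset['version']}")
for category in dataset["categories"]:
    for example in category["examples"]:
        answer = ask_bot(example["input"])
        print(category["name"], example["difficulty"], example["input"], "->", answer)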

Key Takeaways

  1. Start with production data: Real queries reveal real issues
  2. Cover the distribution: Match your evaluation to production traffic
  3. Include hard cases: Edge cases expose model weaknesses
  4. Version everything: Track changes for reproducibility
  5. Maintain regularly: Datasets need ongoing curation

With evaluation fundamentals covered, we'll now dive into LangSmith, a platform for tracing and evaluating LLM applications.
