LLM Evaluation Fundamentals

Evaluation Dataset Design


Your evaluation is only as good as your test data. A well-designed evaluation dataset catches real issues before they reach production.

What Makes a Good Evaluation Dataset?

Characteristic   Description                               Why It Matters
Representative   Covers actual production use cases        Catches real-world failures
Diverse          Includes edge cases and variations        Reveals hidden weaknesses
Labeled          Has expected outputs when possible        Enables automated scoring
Versioned        Tracks changes over time                  Allows regression testing
Balanced         Proportional to production distribution   Accurate performance estimates

Dataset Structure

Each record in a typical evaluation dataset includes fields like these:

{
  "id": "qa-001",
  "input": "What is the refund policy?",
  "expected_output": "You can request a refund within 30 days of purchase.",
  "context": "Customer asking about returns",
  "category": "policy_questions",
  "difficulty": "easy",
  "tags": ["refund", "policy", "customer_service"]
}
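
If you manage examples in code, a lightweight schema keeps records consistent across contributors. Below is a minimal sketch using a Python dataclass whose fields mirror the JSON record above; the file name and loader function are illustrative, not part of any particular framework.

# eval_example.py -- a minimal schema sketch; fields mirror the JSON record above
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalExample:
    id: str
    input: str
    expected_output: Optional[str] = None  # may be absent for unlabeled examples
    context: str = ""
    category: str = "uncategorized"
    difficulty: str = "medium"
    tags: list[str] = field(default_factory=list)

def load_examples(path: str) -> list[EvalExample]:
    """Load a JSON array of example records into typed objects."""
    with open(path) as f:
        records = json.load(f)
    return [EvalExample(**record) for record in records]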

Building Your Dataset

Step 1: Collect Production Samples

Start with real user queries:

  • Sample from production logs (see the sampling sketch after this list)
  • Include successful and failed interactions
  • Capture the full distribution of use cases
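
One way to capture the full distribution without loading every log into memory is reservoir sampling over the log file. This is a sketch that assumes logs are stored as JSON lines with a "query" field; adjust the field names and file layout to your own logging setup.

# sample_logs.py -- sketch: draw a uniform random sample from production logs
import json
import random

def sample_queries(log_path: str, n: int = 200, seed: int = 42) -> list[dict]:
    """Reservoir-sample n log records so the sample reflects the full distribution."""
    random.seed(seed)
    sample: list[dict] = []
    with open(log_path) as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            if i < n:
                sample.append(record)
            else:
                j = random.randint(0, i)
                if j < n:
                    sample[j] = record
    return sample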

Step 2: Define Categories

Organize by use case type:

├── Happy Path (60%)
│   ├── Common questions
│   ├── Standard requests
│   └── Typical workflows
├── Edge Cases (25%)
│   ├── Ambiguous queries
│   ├── Multi-part questions
│   └── Unusual formats
└── Adversarial (15%)
    ├── Prompt injection attempts
    ├── Out-of-scope requests
    └── Malformed inputs
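
A quick way to keep this split honest is to check the category mix against the target proportions whenever the dataset changes. The sketch below assumes the categories are stored under the slugs "happy_path", "edge_cases", and "adversarial"; rename them to match your own labels.

# category_balance.py -- sketch: compare the dataset's category mix to the target split
from collections import Counter

TARGET = {"happy_path": 0.60, "edge_cases": 0.25, "adversarial": 0.15}

def check_balance(examples: list[dict], tolerance: float = 0.05) -> dict[str, float]:
    """Return the actual share per category and warn when it drifts from the target."""
    if not examples:
        return {}
    counts = Counter(ex["category"] for ex in examples)
    total = sum(counts.values())
    shares = {cat: counts.get(cat, 0) / total for cat in TARGET}
    for cat, share in shares.items():
        if abs(share - TARGET[cat]) > tolerance:
            print(f"warning: {cat} is {share:.0%}, target is {TARGET[cat]:.0%}")
    return shares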

Step 3: Add Expected Outputs

For reference-based evaluation:

Input Type          Expected Output Approach
Factual questions   Exact or paraphrased answers
Open-ended          Key points that must be covered
Classification      Correct label
Extraction          Required entities
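
These approaches translate directly into simple scoring functions. The sketch below uses deliberately naive string checks to illustrate the idea; real evaluations often layer semantic similarity or an LLM judge on top of checks like these.

# scoring.py -- sketch of reference-based checks matching the table above
def exact_or_paraphrase_match(output: str, expected: str) -> bool:
    """Loose check for factual questions: exact match after normalization."""
    return output.strip().lower() == expected.strip().lower()

def covers_key_points(output: str, key_points: list[str]) -> float:
    """For open-ended answers: fraction of required key points mentioned."""
    text = output.lower()
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points) if key_points else 1.0

def label_match(output: str, expected_label: str) -> bool:
    """For classification tasks: the predicted label must equal the reference label."""
    return output.strip().lower() == expected_label.strip().lower()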

Step 4: Version and Maintain

Track dataset evolution:

  • Tag versions (v1.0, v1.1, v2.0)
  • Document changes between versions
  • Keep historical versions for regression testing
  • Review and update quarterly
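
Version tags are most useful when paired with a content fingerprint, so a file still labeled "v1.1" that silently changed is easy to spot. Here is a minimal sketch; the manifest file name and layout are assumptions, not an established convention.

# dataset_version.py -- sketch: record a version tag alongside a content hash
import hashlib
import json
from datetime import date

def fingerprint(path: str) -> str:
    """Hash the raw file bytes; any edit changes the fingerprint."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]

def write_manifest(dataset_path: str, version: str,
                   manifest_path: str = "dataset_manifest.json") -> None:
    entry = {
        "version": version,
        "date": date.today().isoformat(),
        "content_hash": fingerprint(dataset_path),
    }
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = []
    manifest.append(entry)
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)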

Dataset Size Guidelines

Stage         Minimum Size       Purpose
Development   20-50 examples     Quick iteration
Validation    100-200 examples   Reliable metrics
Production    500+ examples      Comprehensive coverage

Tip: Quality matters more than quantity. 100 well-curated examples beat 1000 noisy ones.

Common Pitfalls

  1. Overfitting to test data: Don't tune prompts on your eval set
  2. Stale datasets: Production shifts; datasets should too
  3. Missing edge cases: Easy cases don't reveal weaknesses
  4. Inconsistent labeling: Multiple annotators need guidelines
  5. Leaking training data: Evaluation data must stay separate from anything used to train or prompt the model (see the overlap check below)
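
Pitfall 5 lends itself to an automated first-pass check: compare normalized evaluation inputs against whatever went into fine-tuning or few-shot prompts. This sketch only catches exact duplicates after normalization; near-duplicates need fuzzy or embedding-based matching.

# leakage_check.py -- sketch: flag evaluation inputs that also appear in training data
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def find_overlap(eval_inputs: list[str], training_inputs: list[str]) -> set[str]:
    """Return evaluation inputs that also appear (after normalization) in training data."""
    train_set = {normalize(t) for t in training_inputs}
    return {q for q in eval_inputs if normalize(q) in train_set}

leaks = find_overlap(
    eval_inputs=["What is the refund policy?"],
    training_inputs=["what is the refund policy?  "],
)
assert leaks  # exact duplicates are caught; paraphrases are not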

Practical Example: Building a Support Bot Dataset

# evaluation_dataset.yaml
version: "1.2"
created: "2024-01-15"
categories:
  - name: "billing_questions"
    count: 45
    examples:
      - input: "How do I update my credit card?"
        expected: "Go to Settings > Billing > Update Payment Method"
        difficulty: easy
      - input: "Why was I charged twice?"
        expected: "Check for duplicate transactions in billing history"
        difficulty: medium

  - name: "edge_cases"
    count: 20
    examples:
      - input: "refund plz"
        expected: "Understand as refund request despite informal language"
        difficulty: hard
      - input: ""
        expected: "Handle empty input gracefully"
        difficulty: hard
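
A minimal sketch for driving an evaluation run from this file is shown below. It assumes PyYAML is installed, and ask_bot is a placeholder for whatever function actually calls your support bot.

# run_eval.py -- sketch: load evaluation_dataset.yaml and walk its examples
import yaml

def ask_bot(question: str) -> str:
    # placeholder for the system under test; swap in your real call
    return "stub answer"

with open("evaluation_dataset.yaml") as f:
    dataset = yaml.safe_load(f)

print(f"dataset version {dataset['version']}")
for category in dataset["categories"]:
    for example in category["examples"]:
        answer = ask_bot(example["input"])
        print(category["name"], example["difficulty"], example["input"], "->", answer)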

Key Takeaways

  1. Start with production data: Real queries reveal real issues
  2. Cover the distribution: Match your evaluation to production traffic
  3. Include hard cases: Edge cases expose model weaknesses
  4. Version everything: Track changes for reproducibility
  5. Maintain regularly: Datasets need ongoing curation

With evaluation fundamentals covered, we'll now dive into LangSmith, a platform for tracing and evaluating LLM applications.
