LLM Evaluation Fundamentals

Evaluation Dataset Design


Your evaluation is only as good as your test data. A well-designed evaluation dataset catches real issues before they reach production.

What Makes a Good Evaluation Dataset?

| Characteristic | Description | Why It Matters |
|---|---|---|
| Representative | Covers actual production use cases | Catches real-world failures |
| Diverse | Includes edge cases and variations | Reveals hidden weaknesses |
| Labeled | Has expected outputs when possible | Enables automated scoring |
| Versioned | Tracks changes over time | Allows regression testing |
| Balanced | Proportional to production distribution | Accurate performance estimates |

Dataset Structure

A typical entry in an evaluation dataset looks like this:

{
  "id": "qa-001",
  "input": "What is the refund policy?",
  "expected_output": "You can request a refund within 30 days of purchase.",
  "context": "Customer asking about returns",
  "category": "policy_questions",
  "difficulty": "easy",
  "tags": ["refund", "policy", "customer_service"]
}
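
The fields above map naturally onto a small data class. A minimal sketch, assuming records are stored one per line as JSONL (the EvalRecord class and load_dataset helper are illustrative, not from any specific library):

from dataclasses import dataclass, field
from typing import List, Optional
import json

@dataclass
class EvalRecord:
    id: str
    input: str
    expected_output: str
    context: Optional[str] = None
    category: Optional[str] = None
    difficulty: str = "easy"
    tags: List[str] = field(default_factory=list)

def load_dataset(path: str) -> List[EvalRecord]:
    # One JSON object per line, matching the record structure shown above
    with open(path) as f:
        return [EvalRecord(**json.loads(line)) for line in f if line.strip()]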

Building Your Dataset

Step 1: Collect Production Samples

Start with real user queries:

  • Sample from production logs
  • Include successful and failed interactions
  • Capture the full distribution of use cases
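
A minimal sketch of this step, assuming interactions are logged as JSONL with a hypothetical resolved flag marking whether the user's issue was handled:

import json
import random

def sample_production_logs(path: str, n: int = 200, seed: int = 42) -> list:
    # Load logged interactions, one JSON object per line
    with open(path) as f:
        logs = [json.loads(line) for line in f if line.strip()]

    # A uniform random sample preserves the natural mix of use cases,
    # including both successful and failed interactions
    random.seed(seed)
    random.shuffle(logs)
    sample = logs[:n]

    failed = [entry for entry in sample if not entry.get("resolved", True)]
    print(f"Sampled {len(sample)} interactions, {len(failed)} of them failed")
    return sample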

Step 2: Define Categories

Organize by use case type:

├── Happy Path (60%)
│   ├── Common questions
│   ├── Standard requests
│   └── Typical workflows
├── Edge Cases (25%)
│   ├── Ambiguous queries
│   ├── Multi-part questions
│   └── Unusual formats
└── Adversarial (15%)
    ├── Prompt injection attempts
    ├── Out-of-scope requests
    └── Malformed inputs
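
One way to enforce that mix when assembling the dataset, as a sketch (the proportions come from the breakdown above; the pool names and build_dataset helper are assumptions):

import random

TARGET_MIX = {"happy_path": 0.60, "edge_cases": 0.25, "adversarial": 0.15}

def build_dataset(pools: dict, total: int, seed: int = 0) -> list:
    # pools maps a category name to its candidate examples;
    # each category contributes its share of the total
    random.seed(seed)
    dataset = []
    for category, share in TARGET_MIX.items():
        candidates = pools.get(category, [])
        k = min(len(candidates), round(total * share))
        dataset.extend(random.sample(candidates, k))
    return dataset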

Step 3: Add Expected Outputs

For reference-based evaluation:

| Input Type | Expected Output Approach |
|---|---|
| Factual questions | Exact or paraphrased answers |
| Open-ended | Key points that must be covered |
| Classification | Correct label |
| Extraction | Required entities |
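
A sketch of how those approaches can translate into simple reference-based checks (the function names are illustrative; paraphrase and key-point matching often need an LLM judge rather than plain string matching):

def exact_match(output: str, expected: str) -> bool:
    # Factual questions: strict comparison; paraphrases need a semantic judge
    return output.strip().lower() == expected.strip().lower()

def covers_key_points(output: str, key_points: list) -> float:
    # Open-ended answers: fraction of required key points present in the output
    hits = sum(1 for point in key_points if point.lower() in output.lower())
    return hits / len(key_points) if key_points else 1.0

def correct_label(output: str, label: str) -> bool:
    # Classification: the predicted label matches the expected one
    return output.strip().lower() == label.strip().lower()

def has_required_entities(output: str, entities: list) -> bool:
    # Extraction: every required entity appears in the output
    return all(entity.lower() in output.lower() for entity in entities)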

Step 4: Version and Maintain

Track dataset evolution:

  • Tag versions (v1.0, v1.1, v2.0)
  • Document changes between versions
  • Keep historical versions for regression testing
  • Review and update quarterly
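
A lightweight way to make versions traceable, as a sketch (the metadata fields are an assumption, not a standard schema):

import hashlib
import json
from datetime import date

def write_version_metadata(dataset_path: str, version: str, notes: str) -> dict:
    # A content hash ties eval results to the exact dataset file they were run on
    with open(dataset_path, "rb") as f:
        content_hash = hashlib.sha256(f.read()).hexdigest()[:12]

    metadata = {
        "version": version,                  # e.g. "1.1"
        "date": date.today().isoformat(),
        "sha256": content_hash,
        "notes": notes,                      # what changed since the last version
    }
    with open(f"{dataset_path}.{version}.meta.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata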

Dataset Size Guidelines

| Stage | Minimum Size | Purpose |
|---|---|---|
| Development | 20-50 examples | Quick iteration |
| Validation | 100-200 examples | Reliable metrics |
| Production | 500+ examples | Comprehensive coverage |

Tip: Quality matters more than quantity. 100 well-curated examples beat 1000 noisy ones.

Common Pitfalls

  1. Overfitting to test data: Don't tune prompts on your eval set
  2. Stale datasets: Production shifts; datasets should too
  3. Missing edge cases: Easy cases don't reveal weaknesses
  4. Inconsistent labeling: Multiple annotators need guidelines
  5. Leaking training data: Evaluation data must be separate
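
To guard against the first and last pitfalls, a deterministic split keeps the held-out set out of prompt tuning entirely. A sketch, assuming each record has an id field as in the structure above:

import hashlib

def split_records(records: list, holdout_fraction: float = 0.5):
    # Hashing the id means a record always lands in the same split,
    # even as the dataset grows over time
    dev, holdout = [], []
    for record in records:
        bucket = int(hashlib.md5(record["id"].encode()).hexdigest(), 16) % 100
        if bucket < holdout_fraction * 100:
            holdout.append(record)
        else:
            dev.append(record)
    return dev, holdout

# Tune prompts against `dev`; report final numbers only on `holdout`.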

Practical Example: Building a Support Bot Dataset

# evaluation_dataset.yaml
version: "1.2"
created: "2024-01-15"
categories:
  - name: "billing_questions"
    count: 45
    examples:
      - input: "How do I update my credit card?"
        expected: "Go to Settings > Billing > Update Payment Method"
        difficulty: easy
      - input: "Why was I charged twice?"
        expected: "Check for duplicate transactions in billing history"
        difficulty: medium

  - name: "edge_cases"
    count: 20
    examples:
      - input: "refund plz"
        expected: "Understand as refund request despite informal language"
        difficulty: hard
      - input: ""
        expected: "Handle empty input gracefully"
        difficulty: hard
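
A sketch of consuming that file, assuming the PyYAML package is installed (the keys follow the YAML above):

import yaml  # PyYAML

def load_cases(path: str = "evaluation_dataset.yaml") -> list:
    with open(path) as f:
        spec = yaml.safe_load(f)

    # Flatten categories into a single list of runnable test cases
    cases = []
    for category in spec["categories"]:
        for example in category["examples"]:
            cases.append({
                "category": category["name"],
                "input": example["input"],
                "expected": example["expected"],
                "difficulty": example["difficulty"],
            })
    return cases

if __name__ == "__main__":
    cases = load_cases()
    print(f"Loaded {len(cases)} cases")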

Key Takeaways

  1. Start with production data: Real queries reveal real issues
  2. Cover the distribution: Match your evaluation to production traffic
  3. Include hard cases: Edge cases expose model weaknesses
  4. Version everything: Track changes for reproducibility
  5. Maintain regularly: Datasets need ongoing curation

With evaluation fundamentals covered, we'll now dive deep into LangSmith, a powerful platform for tracing and evaluating LLM applications.
