Human Evaluation & Annotation

LLM-as-Judge is powerful, but it's not perfect. Human evaluation remains the gold standard for establishing ground truth and validating automated evaluators.

Why Human Evaluation Matters

| Automated Evaluation | Human Evaluation |
| --- | --- |
| Fast and scalable | Slow but accurate |
| Consistent but potentially biased | Catches nuanced issues |
| Good for known patterns | Discovers unknown failure modes |
| Can hallucinate reasoning | Provides genuine understanding |

Key insight: Use humans to validate your automated evaluators, then scale with automation.
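
For example, before trusting an automated judge at scale, spot-check it against a small human-labeled sample. A minimal sketch in Python, assuming two hypothetical lists of parallel pass/fail verdicts:

```python
# Minimal sketch: checking an LLM judge against a small human-labeled sample
# before relying on it at scale. The two lists below are hypothetical parallel
# verdicts for the same sampled responses.
human_labels = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]

agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
print(f"Judge vs. human agreement: {agreement:.0%}")
```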

Human Evaluation Approaches

1. Direct Rating

Humans score responses on predefined criteria:

Rate this response on a scale of 1-5:

Question: "How do I reset my password?"
Response: "Click on 'Forgot Password' on the login page..."

Criteria:
- Helpfulness: [1] [2] [3] [4] [5]
- Accuracy: [1] [2] [3] [4] [5]
- Clarity: [1] [2] [3] [4] [5]
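
Each submitted form can be stored as a structured record so scores are easy to aggregate later. A minimal sketch, where the record fields simply mirror the criteria above and are not a fixed schema:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class DirectRating:
    example_id: str
    annotator_id: str
    helpfulness: int  # 1-5
    accuracy: int     # 1-5
    clarity: int      # 1-5

# Two hypothetical annotators rating the same response
ratings = [
    DirectRating("q-001", "ann-1", helpfulness=5, accuracy=4, clarity=5),
    DirectRating("q-001", "ann-2", helpfulness=4, accuracy=4, clarity=5),
]

# Average each criterion across annotators for this example
for criterion in ("helpfulness", "accuracy", "clarity"):
    avg = mean(getattr(r, criterion) for r in ratings)
    print(f"{criterion}: {avg:.1f}")
```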

2. Pairwise Comparison

Humans choose which response is better:

Which response better answers the question?

Question: "Explain machine learning"

Response A: [Technical explanation]
Response B: [Simple analogy]

[ ] A is better
[ ] B is better
[ ] About the same
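
Pairwise votes are typically aggregated into a win rate (or fed into a rating model such as Bradley-Terry). A minimal sketch of the win-rate version, assuming a hypothetical list of votes:

```python
from collections import Counter

# Hypothetical human votes for the A/B comparison above
votes = ["A", "B", "B", "tie", "B", "A", "B"]

counts = Counter(votes)
decisive = counts["A"] + counts["B"]  # ignore ties for the win rate

print(f"B win rate: {counts['B'] / decisive:.0%}")
print(f"Tie rate:   {counts['tie'] / len(votes):.0%}")
```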

3. Annotation Queues

Systematic review of production samples:

  • Sample random production requests
  • Route to human reviewers
  • Collect structured feedback
  • Feed back into training data
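
A minimal sketch of the sampling and routing steps, assuming a hypothetical `production_logs` list and an illustrative 2% review rate:

```python
import random

# Hypothetical production traffic: each item is one request/response pair
production_logs = [
    {"id": i, "question": f"question {i}", "response": f"response {i}"}
    for i in range(1000)
]

SAMPLE_RATE = 0.02  # review 2% of traffic (an illustrative choice)
queue = random.sample(production_logs, k=int(len(production_logs) * SAMPLE_RATE))

# Hand reviewers small batches from the queue
batch_size = 5
batches = [queue[i:i + batch_size] for i in range(0, len(queue), batch_size)]
print(f"{len(queue)} sampled items split into {len(batches)} review batches")
```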

Building Annotation Guidelines

Clear guidelines reduce inconsistency:

## Annotation Guidelines for Customer Support Responses

### Score 5 (Excellent)
- Fully answers the question
- Polite and professional tone
- Includes relevant next steps

### Score 4 (Good)
- Answers the question adequately
- Professional tone
- Minor omissions

### Score 3 (Acceptable)
- Partially answers the question
- Acceptable tone
- Notable omissions

### Score 2 (Poor)
- Barely addresses the question
- Tone issues
- Major omissions

### Score 1 (Unacceptable)
- Does not answer the question
- Inappropriate content
- Factually incorrect
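
One way to keep guidelines and tooling consistent is to encode the rubric as data that the annotation interface renders. A minimal sketch condensing the rubric above; the structure is an assumption, not a standard:

```python
# Rubric encoded as data so the annotation UI and the written guidelines stay in sync
RUBRIC = {
    5: "Excellent: fully answers, polite and professional, relevant next steps",
    4: "Good: adequate answer, professional tone, minor omissions",
    3: "Acceptable: partial answer, acceptable tone, notable omissions",
    2: "Poor: barely addresses the question, tone issues, major omissions",
    1: "Unacceptable: does not answer, inappropriate or factually incorrect",
}

for score in sorted(RUBRIC, reverse=True):
    print(f"[{score}] {RUBRIC[score]}")
```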

Measuring Annotator Agreement

Multiple annotators reviewing the same examples should reach similar judgments:

| Metric | Description | Target |
| --- | --- | --- |
| Cohen's Kappa | Agreement between 2 annotators | > 0.6 |
| Fleiss' Kappa | Agreement among 3+ annotators | > 0.6 |
| Krippendorff's Alpha | Works with missing data | > 0.67 |
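
For two annotators, Cohen's Kappa is simple to compute. A minimal sketch using scikit-learn's `cohen_kappa_score`, with two hypothetical sets of 1-5 ratings:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 ratings from two annotators on the same eight examples
annotator_1 = [5, 4, 3, 4, 2, 5, 1, 3]
annotator_2 = [5, 4, 4, 4, 2, 4, 1, 3]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # target: > 0.6
```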

Low agreement signals:

  • Unclear guidelines
  • Subjective criteria
  • Training needed

Practical Tips

  1. Start small: 50-100 examples to validate your evaluation approach
  2. Use multiple annotators: At least 2-3 per example for important decisions
  3. Track disagreements: They reveal edge cases and ambiguity
  4. Iterate on guidelines: Refine based on annotator feedback
  5. Build calibration sets: Use agreed examples to train new annotators

Next, we'll explore how to build evaluation datasets that cover the full range of scenarios your LLM will encounter.

Take Quiz