LLM Evaluation Fundamentals
Human Evaluation & Annotation
LLM-as-Judge is powerful, but it's not perfect. Human evaluation remains the gold standard for establishing ground truth and validating automated evaluators.
Why Human Evaluation Matters
| Automated Evaluation | Human Evaluation |
|---|---|
| Fast and scalable | Slow but accurate |
| Consistent but potentially biased | Catches nuanced issues |
| Good for known patterns | Discovers unknown failure modes |
| Can hallucinate reasoning | Provides genuine understanding |
Key insight: Use humans to validate your automated evaluators, then scale with automation.
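One way to act on this: before trusting an LLM judge at scale, compare its verdicts against human labels on a shared sample. A minimal sketch, where `judge_labels` and `human_labels` are hypothetical lists you would collect from your own judge and reviewers:

```python
# Compare LLM-judge verdicts against human labels on the same examples.
# Both label lists are illustrative placeholders.

judge_labels = ["pass", "fail", "pass", "pass", "fail"]   # from the LLM judge
human_labels = ["pass", "fail", "fail", "pass", "fail"]   # from human reviewers

agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)
print(f"Judge/human agreement: {agreement:.0%}")

# Disagreements are the examples worth inspecting to refine the judge prompt.
disagreements = [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]
print("Disagreement indices:", disagreements)
```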
Human Evaluation Approaches
1. Direct Rating
Humans score responses on predefined criteria:
Rate this response on a scale of 1-5:
Question: "How do I reset my password?"
Response: "Click on 'Forgot Password' on the login page..."
Criteria:
- Helpfulness: [1] [2] [3] [4] [5]
- Accuracy: [1] [2] [3] [4] [5]
- Clarity: [1] [2] [3] [4] [5]
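A minimal sketch of aggregating direct ratings, assuming each annotator's scores arrive as a dictionary keyed by criterion (the data here is made up):

```python
from statistics import mean

# Hypothetical ratings from three annotators for one response.
ratings = [
    {"helpfulness": 4, "accuracy": 5, "clarity": 4},
    {"helpfulness": 5, "accuracy": 4, "clarity": 5},
    {"helpfulness": 4, "accuracy": 4, "clarity": 3},
]

# Average each criterion across annotators.
for criterion in ("helpfulness", "accuracy", "clarity"):
    scores = [r[criterion] for r in ratings]
    print(f"{criterion}: mean={mean(scores):.2f} (n={len(scores)})")
```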
2. Pairwise Comparison
Humans choose which response is better:
Which response better answers the question?
Question: "Explain machine learning"
Response A: [Technical explanation]
Response B: [Simple analogy]
[ ] A is better
[ ] B is better
[ ] About the same
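Pairwise votes are usually summarized as win rates. A minimal sketch, assuming votes are recorded as "A", "B", or "tie":

```python
from collections import Counter

# Hypothetical votes collected from annotators.
votes = ["A", "B", "B", "tie", "B", "A", "B"]
counts = Counter(votes)
decisive = counts["A"] + counts["B"]

print(f"B win rate (ties excluded): {counts['B'] / decisive:.0%}")
print(f"Tie rate: {counts['tie'] / len(votes):.0%}")
```

For larger comparisons across many model variants, the same votes can feed a rating model such as Bradley-Terry or Elo, but a plain win rate is often enough to start.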
3. Annotation Queues
Systematic review of production samples:
- Sample random production requests
- Route to human reviewers
- Collect structured feedback
- Feed back into training data
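A minimal sketch of the sampling step, with `send_to_review_queue` as a hypothetical stand-in for whatever annotation tool you use (Label Studio, Argilla, or even a spreadsheet):

```python
import random

def sample_for_review(production_logs, sample_rate=0.05, max_items=100):
    """Randomly sample a fraction of production requests for human review."""
    sampled = [log for log in production_logs if random.random() < sample_rate]
    return sampled[:max_items]

def send_to_review_queue(item):
    # Placeholder: replace with a call to your annotation tool's API.
    print(f"queued for review: {item['question']}")

# Hypothetical production logs.
production_logs = [
    {"question": "How do I reset my password?", "response": "Click 'Forgot Password'..."},
    {"question": "Explain machine learning", "response": "Machine learning is..."},
]

for item in sample_for_review(production_logs, sample_rate=1.0):
    send_to_review_queue(item)
```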
Building Annotation Guidelines
Clear guidelines reduce inconsistency:
## Annotation Guidelines for Customer Support Responses
### Score 5 (Excellent)
- Fully answers the question
- Polite and professional tone
- Includes relevant next steps
### Score 4 (Good)
- Answers the question adequately
- Professional tone
- Minor omissions
### Score 3 (Acceptable)
- Partially answers the question
- Acceptable tone
- Notable omissions
### Score 2 (Poor)
- Barely addresses the question
- Tone issues
- Major omissions
### Score 1 (Unacceptable)
- Does not answer the question
- Inappropriate content
- Factually incorrect
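If your annotation tool supports it, encoding the rubric as data keeps the form and the written guidelines in sync. A minimal sketch (the structure is an assumption, not a requirement of any particular tool):

```python
# Rubric encoded as data: score -> short description shown to annotators.
RUBRIC = {
    5: "Excellent: fully answers, professional tone, relevant next steps",
    4: "Good: adequate answer, professional tone, minor omissions",
    3: "Acceptable: partial answer, acceptable tone, notable omissions",
    2: "Poor: barely addresses the question, tone issues, major omissions",
    1: "Unacceptable: no answer, inappropriate content, or factually incorrect",
}

def validate_score(score: int) -> str:
    """Reject scores outside the rubric and return the matching description."""
    if score not in RUBRIC:
        raise ValueError(f"Score must be one of {sorted(RUBRIC)}")
    return RUBRIC[score]

print(validate_score(4))
```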
Measuring Annotator Agreement
When several annotators rate the same examples, their scores should largely agree. Common agreement metrics and rough targets:
| Metric | Description | Target |
|---|---|---|
| Cohen's Kappa | Agreement between 2 annotators | > 0.6 |
| Fleiss' Kappa | Agreement among 3+ annotators | > 0.6 |
| Krippendorff's Alpha | Works with missing data | > 0.67 |
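A minimal sketch of computing two of these metrics, assuming scikit-learn and statsmodels are installed (the ratings are illustrative):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical 1-5 ratings from three annotators over six examples.
annotator_1 = [5, 4, 3, 5, 2, 4]
annotator_2 = [5, 4, 2, 5, 2, 3]
annotator_3 = [4, 4, 3, 5, 2, 4]

# Cohen's kappa: agreement between two annotators.
print("Cohen's kappa:", cohen_kappa_score(annotator_1, annotator_2))

# Fleiss' kappa: agreement among three or more annotators.
ratings = np.array([annotator_1, annotator_2, annotator_3]).T  # shape: (items, raters)
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table))
```

The `krippendorff` package computes Krippendorff's Alpha and tolerates missing ratings, which helps when not every annotator sees every example.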
Low agreement usually signals:
- Unclear or ambiguous guidelines
- Overly subjective criteria
- Annotators who need more training or calibration
Practical Tips
- Start small: 50-100 examples to validate your evaluation approach
- Use multiple annotators: At least 2-3 per example for important decisions
- Track disagreements: They reveal edge cases and ambiguity
- Iterate on guidelines: Refine based on annotator feedback
- Build calibration sets: Use agreed examples to train new annotators
Next, we'll explore how to build evaluation datasets that cover the full range of scenarios your LLM will encounter.