LLM Evaluation Fundamentals
Human Evaluation & Annotation
LLM-as-Judge is powerful, but it's not perfect. Human evaluation remains the gold standard for establishing ground truth and validating automated evaluators.
Why Human Evaluation Matters
| Automated Evaluation | Human Evaluation |
|---|---|
| Fast and scalable | Slow but accurate |
| Consistent but potentially biased | Catches nuanced issues |
| Good for known patterns | Discovers unknown failure modes |
| Can hallucinate reasoning | Provides genuine understanding |
Key insight: Use humans to validate your automated evaluators, then scale with automation.
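One way to act on this: before trusting an LLM judge at scale, compare its verdicts against human labels on a shared sample. A minimal sketch, where `judge_labels` and `human_labels` are hypothetical lists you would collect from your own judge and reviewers:

```python
# Compare LLM-judge verdicts against human labels on the same examples.
# Both label lists are illustrative placeholders.

judge_labels = ["pass", "fail", "pass", "pass", "fail"]   # from the LLM judge
human_labels = ["pass", "fail", "fail", "pass", "fail"]   # from human reviewers

agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)
print(f"Judge/human agreement: {agreement:.0%}")

# Disagreements are the examples worth inspecting to refine the judge prompt.
disagreements = [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]
print("Disagreement indices:", disagreements)
```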
Human Evaluation Approaches
1. Direct Rating
Humans score responses on predefined criteria:
Rate this response on a scale of 1-5:
Question: "How do I reset my password?"
Response: "Click on 'Forgot Password' on the login page..."
Criteria:
- Helpfulness: [1] [2] [3] [4] [5]
- Accuracy: [1] [2] [3] [4] [5]
- Clarity: [1] [2] [3] [4] [5]
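A minimal sketch of aggregating direct ratings, assuming each annotator's scores arrive as a dictionary keyed by criterion (the data here is made up):

```python
from statistics import mean

# Hypothetical ratings from three annotators for one response.
ratings = [
    {"helpfulness": 4, "accuracy": 5, "clarity": 4},
    {"helpfulness": 5, "accuracy": 4, "clarity": 5},
    {"helpfulness": 4, "accuracy": 4, "clarity": 3},
]

# Average each criterion across annotators.
for criterion in ("helpfulness", "accuracy", "clarity"):
    scores = [r[criterion] for r in ratings]
    print(f"{criterion}: mean={mean(scores):.2f} (n={len(scores)})")
```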
2. Pairwise Comparison
Humans choose which response is better:
Which response better answers the question?
Question: "Explain machine learning"
Response A: [Technical explanation]
Response B: [Simple analogy]
[ ] A is better
[ ] B is better
[ ] About the same
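Pairwise votes are usually summarized as win rates. A minimal sketch, assuming votes are recorded as "A", "B", or "tie":

```python
from collections import Counter

# Hypothetical votes collected from annotators.
votes = ["A", "B", "B", "tie", "B", "A", "B"]
counts = Counter(votes)
decisive = counts["A"] + counts["B"]

print(f"B win rate (ties excluded): {counts['B'] / decisive:.0%}")
print(f"Tie rate: {counts['tie'] / len(votes):.0%}")
```

For larger comparisons across many model variants, the same votes can feed a rating model such as Bradley-Terry or Elo, but a plain win rate is often enough to start.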
3. Annotation Queues
Systematic review of production samples:
- Sample random production requests
- Route to human reviewers
- Collect structured feedback
- Feed back into training data
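A minimal sketch of the sampling step, with `send_to_review_queue` as a hypothetical stand-in for whatever annotation tool you use (Label Studio, Argilla, or even a spreadsheet):

```python
import random

def sample_for_review(production_logs, sample_rate=0.05, max_items=100):
    """Randomly sample a fraction of production requests for human review."""
    sampled = [log for log in production_logs if random.random() < sample_rate]
    return sampled[:max_items]

def send_to_review_queue(item):
    # Placeholder: replace with a call to your annotation tool's API.
    print(f"queued for review: {item['question']}")

# Hypothetical production logs.
production_logs = [
    {"question": "How do I reset my password?", "response": "Click 'Forgot Password'..."},
    {"question": "Explain machine learning", "response": "Machine learning is..."},
]

for item in sample_for_review(production_logs, sample_rate=1.0):
    send_to_review_queue(item)
```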
Building Annotation Guidelines
Clear guidelines reduce inconsistency:
## Annotation Guidelines for Customer Support Responses
### Score 5 (Excellent)
- Fully answers the question
- Polite and professional tone
- Includes relevant next steps
### Score 4 (Good)
- Answers the question adequately
- Professional tone
- Minor omissions
### Score 3 (Acceptable)
- Partially answers the question
- Acceptable tone
- Notable omissions
### Score 2 (Poor)
- Barely addresses the question
- Tone issues
- Major omissions
### Score 1 (Unacceptable)
- Does not answer the question
- Inappropriate content
- Factually incorrect
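If your annotation tool supports it, encoding the rubric as data keeps the form and the written guidelines in sync. A minimal sketch (the structure is an assumption, not a requirement of any particular tool):

```python
# Rubric encoded as data: score -> short description shown to annotators.
RUBRIC = {
    5: "Excellent: fully answers, professional tone, relevant next steps",
    4: "Good: adequate answer, professional tone, minor omissions",
    3: "Acceptable: partial answer, acceptable tone, notable omissions",
    2: "Poor: barely addresses the question, tone issues, major omissions",
    1: "Unacceptable: no answer, inappropriate content, or factually incorrect",
}

def validate_score(score: int) -> str:
    """Reject scores outside the rubric and return the matching description."""
    if score not in RUBRIC:
        raise ValueError(f"Score must be one of {sorted(RUBRIC)}")
    return RUBRIC[score]

print(validate_score(4))
```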
Measuring Annotator Agreement
When several annotators rate the same examples, their scores should largely agree. Common agreement metrics and rough targets:
| Metric | Description | Target |
|---|---|---|
| Cohen's Kappa | Agreement between 2 annotators | > 0.6 |
| Fleiss' Kappa | Agreement among 3+ annotators | > 0.6 |
| Krippendorff's Alpha | Works with missing data | > 0.67 |
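A minimal sketch of computing two of these metrics, assuming scikit-learn and statsmodels are installed (the ratings are illustrative):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical 1-5 ratings from three annotators over six examples.
annotator_1 = [5, 4, 3, 5, 2, 4]
annotator_2 = [5, 4, 2, 5, 2, 3]
annotator_3 = [4, 4, 3, 5, 2, 4]

# Cohen's kappa: agreement between two annotators.
print("Cohen's kappa:", cohen_kappa_score(annotator_1, annotator_2))

# Fleiss' kappa: agreement among three or more annotators.
ratings = np.array([annotator_1, annotator_2, annotator_3]).T  # shape: (items, raters)
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table))
```

The `krippendorff` package computes Krippendorff's Alpha and tolerates missing ratings, which helps when not every annotator sees every example.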
Low agreement usually signals:
- Unclear or ambiguous guidelines
- Overly subjective criteria
- Annotators who need more training or calibration
Practical Tips
- Start small: 50-100 examples to validate your evaluation approach
- Use multiple annotators: At least 2-3 per example for important decisions
- Track disagreements: They reveal edge cases and ambiguity
- Iterate on guidelines: Refine based on annotator feedback
- Build calibration sets: Use agreed examples to train new annotators
Next, we'll explore how to build evaluation datasets that cover the full range of scenarios your LLM will encounter.