LLM Evaluation Fundamentals
LLM-as-Judge Evaluation
3 min read
How do you evaluate the quality of text generated by an LLM? You can use... another LLM! This technique is called LLM-as-Judge.
The Core Idea
Instead of manually reviewing every response, you prompt a powerful LLM to evaluate outputs based on specific criteria:
evaluation_prompt = """
You are an expert evaluator. Score the following response on a scale of 1-5.
Question: {question}
Response: {response}
Criteria:
- Accuracy: Is the information correct?
- Relevance: Does it answer the question?
- Completeness: Does it cover all aspects?
Return a JSON with scores for each criterion.
"""
Why LLM-as-Judge Works
| Advantage | Explanation |
|---|---|
| Scalable | Evaluate thousands of responses automatically |
| Consistent | Same criteria applied uniformly |
| Fast | Results in seconds, not hours |
| Customizable | Define any criteria you need |
Common Judge Patterns
1. Scoring (1-5 scale)
```python
# Judge returns a numeric score
{
    "score": 4,
    "reasoning": "Response is accurate but slightly verbose"
}
```
2. Binary Classification
```python
# Judge returns pass/fail
{
    "pass": True,
    "reasoning": "Response correctly answers the question"
}
```
3. Categorical Assessment
```python
# Judge returns a category
{
    "category": "excellent",  # excellent, good, acceptable, poor
    "reasoning": "Comprehensive and well-structured response"
}
```
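Whichever pattern you pick, it's worth validating the judge's raw reply before using it, since models occasionally return malformed JSON or values outside the expected range. A minimal sketch of that check (the `parse_judge_output` helper is illustrative and mirrors the formats above):

```python
# Sketch of validating raw judge output before using it (helper name is illustrative).
import json

ALLOWED_CATEGORIES = {"excellent", "good", "acceptable", "poor"}

def parse_judge_output(raw: str) -> dict:
    """Parse the judge's reply and reject malformed or out-of-range results."""
    result = json.loads(raw)  # raises ValueError if the judge didn't return JSON
    if "score" in result and not 1 <= result["score"] <= 5:
        raise ValueError(f"Score out of range: {result['score']}")
    if "pass" in result and not isinstance(result["pass"], bool):
        raise ValueError(f"Expected a boolean verdict, got: {result['pass']!r}")
    if "category" in result and result["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"Unknown category: {result['category']}")
    return result

print(parse_judge_output('{"score": 4, "reasoning": "Accurate but slightly verbose"}'))
```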
Limitations to Consider
LLM-as-Judge isn't perfect:
- Judge bias: The judge LLM may have its own biases, such as favoring longer answers or responses that resemble its own style
- Cost: Each evaluation costs tokens
- Consistency: Different judge models, or even repeated runs of the same judge, may score the same response differently
- Hallucination: Judges can hallucinate explanations
Best practice: Validate judge accuracy against human annotations before relying on it fully.
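For instance, you can score a small human-labeled sample with the judge and measure how often the two agree. A quick sketch (the sample verdicts and the 80% bar are illustrative assumptions):

```python
# Sketch: check judge/human agreement on a small labeled sample.
# The sample verdicts and the 80% threshold are illustrative assumptions.
human_labels = [True, False, True, True, False]   # human pass/fail annotations
judge_labels = [True, False, True, False, False]  # judge verdicts on the same items

agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
print(f"Judge/human agreement: {agreement:.0%}")  # 80%

if agreement < 0.8:
    print("Judge is not reliable enough yet: refine the prompt or try another model.")
```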
Choosing a Judge Model
| Scenario | Recommended Judge |
|---|---|
| High accuracy needed | GPT-4, Claude 3.5 Sonnet |
| Cost-sensitive | GPT-4o-mini, Claude 3.5 Haiku |
| Self-hosted required | Llama 3.2, Mistral |
Next, we'll explore the difference between reference-based and reference-free evaluation.