LLM Evaluation Fundamentals
LLM-as-Judge Evaluation
3 min read
How do you evaluate the quality of text generated by an LLM? You can use... another LLM! This technique is called LLM-as-Judge.
The Core Idea
Instead of manually reviewing every response, you prompt a powerful LLM to evaluate outputs based on specific criteria:
evaluation_prompt = """
You are an expert evaluator. Score the following response on a scale of 1-5.
Question: {question}
Response: {response}
Criteria:
- Accuracy: Is the information correct?
- Relevance: Does it answer the question?
- Completeness: Does it cover all aspects?
Return a JSON with scores for each criterion.
"""
Why LLM-as-Judge Works
| Advantage | Explanation |
|---|---|
| Scalable | Evaluate thousands of responses automatically |
| Consistent | Same criteria applied uniformly |
| Fast | Results in seconds, not hours |
| Customizable | Define any criteria you need |
Common Judge Patterns
1. Scoring (1-5 scale)
```python
# Judge returns a numeric score
{
    "score": 4,
    "reasoning": "Response is accurate but slightly verbose"
}
```
2. Binary Classification
```python
# Judge returns pass/fail
{
    "pass": True,
    "reasoning": "Response correctly answers the question"
}
```
3. Categorical Assessment
```python
# Judge returns a category
{
    "category": "excellent",  # excellent, good, acceptable, poor
    "reasoning": "Comprehensive and well-structured response"
}
```
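Whichever pattern you pick, it's worth validating the judge's raw reply before using it, since models occasionally return malformed JSON or values outside the expected range. A minimal sketch of that check (the `parse_judge_output` helper is illustrative and mirrors the formats above):

```python
# Sketch of validating raw judge output before using it (helper name is illustrative).
import json

ALLOWED_CATEGORIES = {"excellent", "good", "acceptable", "poor"}

def parse_judge_output(raw: str) -> dict:
    """Parse the judge's reply and reject malformed or out-of-range results."""
    result = json.loads(raw)  # raises ValueError if the judge didn't return JSON
    if "score" in result and not 1 <= result["score"] <= 5:
        raise ValueError(f"Score out of range: {result['score']}")
    if "pass" in result and not isinstance(result["pass"], bool):
        raise ValueError(f"Expected a boolean verdict, got: {result['pass']!r}")
    if "category" in result and result["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"Unknown category: {result['category']}")
    return result

print(parse_judge_output('{"score": 4, "reasoning": "Accurate but slightly verbose"}'))
```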
Limitations to Consider
LLM-as-Judge isn't perfect:
- Judge bias: The judge LLM may have its own biases, such as favoring longer answers or responses that resemble its own style
- Cost: Each evaluation costs tokens
- Consistency: Different judge models, or even repeated runs of the same judge, may score the same response differently
- Hallucination: Judges can hallucinate explanations
Best practice: Validate judge accuracy against human annotations before relying on it fully.
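For instance, you can score a small human-labeled sample with the judge and measure how often the two agree. A quick sketch (the sample verdicts and the 80% bar are illustrative assumptions):

```python
# Sketch: check judge/human agreement on a small labeled sample.
# The sample verdicts and the 80% threshold are illustrative assumptions.
human_labels = [True, False, True, True, False]   # human pass/fail annotations
judge_labels = [True, False, True, False, False]  # judge verdicts on the same items

agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
print(f"Judge/human agreement: {agreement:.0%}")  # 80%

if agreement < 0.8:
    print("Judge is not reliable enough yet: refine the prompt or try another model.")
```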
Choosing a Judge Model
| Scenario | Recommended Judge |
|---|---|
| High accuracy needed | GPT-4, Claude 3.5 Sonnet |
| Cost-sensitive | GPT-4o-mini, Claude 3.5 Haiku |
| Self-hosted required | Llama 3.2, Mistral |
Next, we'll explore the difference between reference-based and reference-free evaluation.