LLM Evaluation Fundamentals

LLM-as-Judge Evaluation


How do you evaluate the quality of text generated by an LLM? You can use... another LLM! This technique is called LLM-as-Judge.

The Core Idea

Instead of manually reviewing every response, you prompt a powerful LLM to evaluate outputs based on specific criteria:

evaluation_prompt = """
You are an expert evaluator. Score the following response on a scale of 1-5.

Question: {question}
Response: {response}

Criteria:
- Accuracy: Is the information correct?
- Relevance: Does it answer the question?
- Completeness: Does it cover all aspects?

Return a JSON object with a score for each criterion.
"""
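A minimal sketch of how a prompt like this might be used end to end: fill in the template, send it to the judge, and parse the reply. `call_judge` is a hypothetical stand-in for whatever client talks to your judge model; here it returns a canned reply so the example runs offline. The JSON key names (`accuracy`, `relevance`, `completeness`) are also an assumption, which is why this version of the prompt spells them out.

```python
import json

EVALUATION_PROMPT = """\
You are an expert evaluator. Score the following response on a scale of 1-5.

Question: {question}
Response: {response}

Criteria:
- Accuracy: Is the information correct?
- Relevance: Does it answer the question?
- Completeness: Does it cover all aspects?

Return a JSON object with integer scores under the keys
"accuracy", "relevance", and "completeness".
"""

# Assumed key names; they must match what the prompt asks the judge to return.
CRITERIA = ("accuracy", "relevance", "completeness")


def call_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return '{"accuracy": 5, "relevance": 4, "completeness": 3}'


def evaluate(question: str, response: str) -> dict[str, int]:
    """Fill the template, query the judge, and validate the parsed scores."""
    prompt = EVALUATION_PROMPT.format(question=question, response=response)
    raw = call_judge(prompt)
    scores = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    for name in CRITERIA:
        value = scores.get(name)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"missing or out-of-range score for {name!r}: {value!r}")
    return {name: scores[name] for name in CRITERIA}


result = evaluate("What is the capital of France?", "Paris.")
print(result)
```

Checking the parsed scores before using them matters because a judge can return malformed JSON or values outside the requested range.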

Why LLM-as-Judge Works

| Advantage | Explanation |
| --- | --- |
| Scalable | Evaluate thousands of responses automatically |
| Consistent | Same criteria applied uniformly |
| Fast | Results in seconds, not hours |
| Customizable | Define any criteria you need |

Common Judge Patterns

1. Scoring (1-5 scale)

# Judge returns a numeric score
{
    "score": 4,
    "reasoning": "Response is accurate but slightly verbose"
}

2. Binary Classification

# Judge returns pass/fail
{
    "pass": True,
    "reasoning": "Response correctly answers the question"
}

3. Categorical Assessment

# Judge returns a category
{
    "category": "excellent",  # excellent, good, acceptable, poor
    "reasoning": "Comprehensive and well-structured response"
}
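Whichever pattern you choose, it can help to check the judge's output against the expected shape before trusting it. The sketch below is one possible validator for the three patterns above; the function name `validate_verdict` and the rule that every verdict must carry a `reasoning` string are assumptions made for illustration.

```python
# Categories taken from the categorical example above.
ALLOWED_CATEGORIES = ("excellent", "good", "acceptable", "poor")


def validate_verdict(verdict: dict) -> str:
    """Return the pattern a verdict follows, or raise ValueError if it is malformed."""
    if not isinstance(verdict.get("reasoning"), str):
        raise ValueError("verdict must include a 'reasoning' string")

    if "score" in verdict:
        score = verdict["score"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
        return "scoring"

    if "pass" in verdict:
        if not isinstance(verdict["pass"], bool):
            raise ValueError("pass must be a boolean")
        return "binary"

    if "category" in verdict:
        if verdict["category"] not in ALLOWED_CATEGORIES:
            raise ValueError(f"unknown category: {verdict['category']!r}")
        return "categorical"

    raise ValueError("verdict has no 'score', 'pass', or 'category' field")


examples = [
    {"score": 4, "reasoning": "Response is accurate but slightly verbose"},
    {"pass": True, "reasoning": "Response correctly answers the question"},
    {"category": "excellent", "reasoning": "Comprehensive and well-structured response"},
]
kinds = [validate_verdict(v) for v in examples]
print(kinds)
```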

Limitations to Consider

LLM-as-Judge isn't perfect:

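  • Judge bias: The judge LLM has its own biases, for example favoring longer answers
  • Cost: Every evaluation is an extra model call, so each one costs tokens
  • Inconsistency: Different judge models, or repeated runs of the same model, may score the same response differently
  • Hallucination: Judges can produce plausible-sounding reasoning that does not match the response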

Best practice: Validate judge accuracy against human annotations before relying on it fully.
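One way to run this check, sketched below, is to have humans and the judge label the same sample and then compare the labels. The code computes the raw agreement rate and Cohen's kappa, which corrects agreement for chance. The label lists are made-up illustration data, not real results, and the function names are hypothetical.

```python
from collections import Counter


def agreement_rate(human: list, judge: list) -> float:
    """Fraction of items where the judge's label matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list, judge: list) -> float:
    """Chance-corrected agreement between two raters."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Agreement expected if both raters labeled at random with their own label frequencies.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Made-up labels for illustration only.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

rate = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"agreement={rate:.2f} kappa={kappa:.2f}")
```

Raw agreement can look high simply because one label dominates the sample; kappa is the more conservative number to watch in that case.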

Choosing a Judge Model

| Scenario | Recommended Judge |
| --- | --- |
| High accuracy needed | GPT-4, Claude 3.5 Sonnet |
| Cost-sensitive | GPT-4o-mini, Claude 3.5 Haiku |
| Self-hosted required | Llama 3.2, Mistral |

Next, we'll explore the difference between reference-based and reference-free evaluation.
