LLM Evaluation Fundamentals

LLM-as-Judge Evaluation


How do you evaluate the quality of text generated by an LLM? You can use... another LLM! This technique is called LLM-as-Judge.

The Core Idea

Instead of manually reviewing every response, you prompt a powerful LLM to evaluate outputs based on specific criteria:

evaluation_prompt = """
You are an expert evaluator. Score the following response on a scale of 1-5.

Question: {question}
Response: {response}

Criteria:
- Accuracy: Is the information correct?
- Relevance: Does it answer the question?
- Completeness: Does it cover all aspects?

Return a JSON with scores for each criterion.
"""
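
As a rough sketch of how a template like this might be used, the snippet below fills in the placeholders and parses the judge's reply. `call_judge` is a hypothetical stand-in for whatever client sends the prompt to your judge model; here it returns a canned reply so the example runs offline. The JSON key names (`accuracy`, `relevance`, `completeness`) are also assumptions, since the prompt does not fix them.

```python
import json

# The lesson's template, repeated so this snippet runs on its own.
evaluation_prompt = """
You are an expert evaluator. Score the following response on a scale of 1-5.

Question: {question}
Response: {response}

Criteria:
- Accuracy: Is the information correct?
- Relevance: Does it answer the question?
- Completeness: Does it cover all aspects?

Return a JSON with scores for each criterion.
"""

CRITERIA = ("accuracy", "relevance", "completeness")  # assumed key names


def call_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real judge-model call; returns a canned reply."""
    return '{"accuracy": 5, "relevance": 4, "completeness": 3}'


def parse_scores(raw_reply: str) -> dict[str, int]:
    """Parse the judge's JSON reply and check each criterion is an int in 1-5."""
    data = json.loads(raw_reply)
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name!r}: {value!r}")
        scores[name] = value
    return scores


prompt = evaluation_prompt.format(
    question="What is the capital of France?",
    response="Paris is the capital of France.",
)
scores = parse_scores(call_judge(prompt))
print(scores)
```

Validating the reply before using it matters because a judge model can return malformed JSON or out-of-range values; failing loudly is usually safer than silently recording a bad score.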

Why LLM-as-Judge Works

Advantage | Explanation
Scalable | Evaluate thousands of responses automatically
Consistent | Same criteria applied uniformly
Fast | Results in seconds, not hours
Customizable | Define any criteria you need

Common Judge Patterns

1. Scoring (1-5 scale)

# Judge returns a numeric score
{
    "score": 4,
    "reasoning": "Response is accurate but slightly verbose"
}

2. Binary Classification

# Judge returns pass/fail
{
    "pass": True,
    "reasoning": "Response correctly answers the question"
}

3. Categorical Assessment

# Judge returns a category
{
    "category": "excellent",  # excellent, good, acceptable, poor
    "reasoning": "Comprehensive and well-structured response"
}
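
One way to handle these three shapes in code is a small validator that checks a parsed verdict and reports which pattern it follows. This is a minimal sketch: the function name `validate_verdict` and the rule that every verdict must carry a `reasoning` string are assumptions made for illustration, based on the examples above.

```python
from typing import Any

ALLOWED_CATEGORIES = {"excellent", "good", "acceptable", "poor"}


def validate_verdict(verdict: dict[str, Any]) -> str:
    """Return the pattern name for a judge verdict, or raise ValueError."""
    if not isinstance(verdict.get("reasoning"), str):
        raise ValueError("verdict needs a 'reasoning' string")

    if "score" in verdict:
        score = verdict["score"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
        return "scoring"

    if "pass" in verdict:
        if not isinstance(verdict["pass"], bool):
            raise ValueError("'pass' must be True or False")
        return "binary"

    if "category" in verdict:
        category = verdict["category"]
        if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        return "categorical"

    raise ValueError("verdict matches none of the three patterns")


examples = [
    {"score": 4, "reasoning": "Response is accurate but slightly verbose"},
    {"pass": True, "reasoning": "Response correctly answers the question"},
    {"category": "excellent", "reasoning": "Comprehensive and well-structured response"},
]
patterns = [validate_verdict(v) for v in examples]
print(patterns)
```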

Limitations to Consider

LLM-as-Judge isn't perfect:

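  • Judge bias: The judge LLM has biases of its own, such as favoring longer responses or text that resembles its own output
  • Cost: Each evaluation consumes input and output tokens, so cost grows with the number of responses you judge
  • Consistency: Different judge models may score the same response differently
  • Hallucination: Judges can produce plausible-sounding reasoning that doesn't match the response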
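
To make the cost point concrete, here is a back-of-the-envelope estimate. Every number in it (token counts per evaluation and per-million-token prices) is a made-up placeholder, not real pricing for any model; substitute the figures from your provider.

```python
def estimate_cost(
    num_evals: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate total judge cost, given average tokens per evaluation."""
    cost_units_per_eval = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    # Prices are quoted per million tokens, so divide once at the end.
    return num_evals * cost_units_per_eval / 1_000_000


# Placeholder values: 10,000 evaluations, 800 prompt tokens and
# 150 reply tokens each, at hypothetical prices of 2.0 and 8.0 per million.
cost = estimate_cost(10_000, 800, 150, 2.0, 8.0)
print(f"Estimated cost: ${cost:.2f}")
```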

Best practice: Validate judge accuracy against human annotations before relying on it fully.
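
A simple way to do this check is to have humans label a sample of responses, run the judge on the same sample, and measure how often they agree. The sketch below computes the raw agreement rate and Cohen's kappa, which corrects for agreement expected by chance. The labels are toy data invented for illustration, and the function names are not from any particular library.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Fraction of items where the judge's label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy pass/fail labels for eight responses (invented for illustration).
human_labels = [True, True, False, True, False, False, True, True]
judge_labels = [True, True, False, False, False, True, True, True]

rate = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"agreement={rate:.2f}, kappa={kappa:.2f}")
```

Raw agreement can look high when one label dominates, so kappa is often reported alongside it. What counts as "good enough" depends on the task and is a judgment call for your team.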

Choosing a Judge Model

Scenario | Recommended Judge
High accuracy needed | GPT-5.4, claude-sonnet-4-6
Cost-sensitive | GPT-5.4 Mini, claude-haiku-4-5-20251001
Self-hosted required | Llama 3.2, Mistral

Next, we'll explore the difference between reference-based and reference-free evaluation.
