LLM Evaluation Fundamentals

Reference-Based vs Reference-Free Evaluation


When evaluating LLM outputs, you have two fundamental approaches: compare against a known-good answer, or evaluate the output on its own merits.

Reference-Based Evaluation

You have: a question and a correct/expected answer (ground truth).
You measure: how well the generated response matches the reference.

# Reference-based evaluation example
test_case = {
    "question": "What is the capital of France?",
    "reference": "Paris is the capital of France.",
    "generated": "The capital of France is Paris."
}

# Metrics compare generated vs reference
similarity_score = evaluate_similarity(
    generated=test_case["generated"],
    reference=test_case["reference"]
)
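
The evaluate_similarity call above is a placeholder for whatever metric you choose. Below is a minimal sketch of one common implementation, cosine similarity over sentence embeddings; the sentence-transformers package and the model name are assumptions for illustration, not part of the original example.

# Sketch of a semantic-similarity scorer (assumed implementation).
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def evaluate_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between the generated and reference texts."""
    embeddings = model.encode([generated, reference], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))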

Common Reference-Based Metrics

Metric | What It Measures
Exact Match | Is the answer identical?
Semantic Similarity | Are the meanings equivalent?
BLEU/ROUGE | N-gram overlap with the reference
Correctness | Does it convey the same facts?
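
Exact match and n-gram overlap are simple enough to implement without a library. The sketch below shows an exact-match check and a ROUGE-1-recall-style overlap score; the function names and normalization choices are illustrative, not a standard API.

import re

def _tokens(text: str) -> list[str]:
    # Lowercase word tokens; punctuation is dropped before comparison
    return re.findall(r"\w+", text.lower())

def exact_match(generated: str, reference: str) -> bool:
    # True only if the normalized token sequences are identical
    return _tokens(generated) == _tokens(reference)

def rouge1_recall(generated: str, reference: str) -> float:
    # Fraction of reference words that also appear in the generated text
    generated_words = set(_tokens(generated))
    reference_words = _tokens(reference)
    if not reference_words:
        return 0.0
    return sum(w in generated_words for w in reference_words) / len(reference_words)

On the test case above, exact_match returns False because the word order differs, while rouge1_recall returns 1.0 since every reference word appears in the generated answer; this gap is why strict matching is usually paired with semantic similarity or an LLM judge.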

When to Use Reference-Based

  • Factual Q&A with known answers
  • Classification tasks with defined labels
  • Extraction tasks with expected outputs
  • Regression testing against baseline responses

Reference-Free Evaluation

You have: a question and a generated response (no ground truth).
You measure: the intrinsic quality of the response.

# Reference-free evaluation example
test_case = {
    "question": "Write a poem about spring.",
    "generated": "Flowers bloom in gentle light..."
}

# Metrics evaluate response quality directly
quality_score = evaluate_quality(
    question=test_case["question"],
    response=test_case["generated"],
    criteria=["coherence", "creativity", "relevance"]
)
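
Reference-free criteria such as coherence or relevance are typically scored with an LLM acting as a judge. Here is a minimal sketch of how evaluate_quality might be implemented with the OpenAI Python client; the judge prompt, model name, and 1-5 scale are assumptions for illustration, not a fixed API.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def evaluate_quality(question: str, response: str, criteria: list[str]) -> dict:
    """Ask a judge model to rate the response 1 (poor) to 5 (excellent) per criterion."""
    prompt = (
        "Rate the response to the question on each criterion, 1 (poor) to 5 (excellent).\n"
        f"Criteria: {', '.join(criteria)}\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        'Answer with JSON only, e.g. {"coherence": 4, "creativity": 5, "relevance": 5}'
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # The judge is expected to return a JSON object mapping criterion -> score
    return json.loads(completion.choices[0].message.content)

In practice you would add parsing safeguards or structured output, since judge models occasionally wrap the JSON in extra text.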

Common Reference-Free Metrics

Metric | What It Measures
Relevancy | Does it address the question?
Coherence | Is it logically structured?
Fluency | Is the language natural?
Safety | Is it free from harmful content?
Faithfulness | Does it stick to the provided context? (RAG)
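
Faithfulness deserves a closer look because it behaves like a reference-based check in disguise: the retrieved context stands in for the reference. Below is a rough, library-free sketch of a heuristic; production evaluators usually rely on an LLM judge or an NLI model instead, and the 0.5 threshold here is an arbitrary assumption.

import re

def unsupported_sentences(response: str, context: str, threshold: float = 0.5) -> list[str]:
    """Flag response sentences whose words are mostly absent from the retrieved context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = re.findall(r"\w+", sentence.lower())
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sentence)  # likely not grounded in the context
    return flagged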

When to Use Reference-Free

  • Creative writing tasks
  • Open-ended conversations
  • Production traffic monitoring (no expected outputs)
  • Subjective quality assessment

Combining Both Approaches

Real-world evaluation often uses both:

evaluation_results = {
    # Reference-based (offline testing)
    "correctness": compare_to_reference(response, expected),

    # Reference-free (always applicable)
    "coherence": evaluate_coherence(response),
    "safety": check_safety(response),
    "relevancy": evaluate_relevancy(question, response)
}
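
A convenient pattern is a single helper that always runs the reference-free checks and adds the reference-based comparison only when a ground-truth answer is available, so the same code path serves offline test sets and live traffic. The sketch below assumes the placeholder metric functions above exist.

def evaluate(question: str, response: str, expected: str | None = None) -> dict:
    """Run reference-free metrics always; add reference-based ones only when labeled."""
    results = {
        "coherence": evaluate_coherence(response),
        "safety": check_safety(response),
        "relevancy": evaluate_relevancy(question, response),
    }
    if expected is not None:  # only possible when a ground-truth answer exists
        results["correctness"] = compare_to_reference(response, expected)
    return results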

Decision Guide

Scenario | Approach
You have labeled test data | Reference-based
Evaluating production traffic | Reference-free
RAG faithfulness checking | Reference-free (compare to context)
A/B testing new prompts | Both
Regression testing | Reference-based

Next, we'll explore human evaluation and annotation, the gold standard for establishing ground truth.
