LLM Evaluation Fundamentals

Reference-Based vs Reference-Free Evaluation

When evaluating LLM outputs, you have two fundamental approaches: comparing the output against a known-good answer, or evaluating it on its own merits.

Reference-Based Evaluation

You have: a question and a correct/expected answer (ground truth).
You measure: how well the generated response matches the reference.

# Reference-based evaluation example
test_case = {
    "question": "What is the capital of France?",
    "reference": "Paris is the capital of France.",
    "generated": "The capital of France is Paris."
}

# Metrics compare generated vs reference
similarity_score = evaluate_similarity(
    generated=test_case["generated"],
    reference=test_case["reference"]
)
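
evaluate_similarity above is a placeholder. One common way to implement semantic similarity is embedding cosine similarity; the sketch below assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, but any embedding model would work.

# Hypothetical evaluate_similarity via embedding cosine similarity
# (assumes: pip install sentence-transformers)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def evaluate_similarity(generated: str, reference: str) -> float:
    # Embed both texts and return their cosine similarity
    embeddings = model.encode([generated, reference])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

# The Paris example above scores high, since both sentences mean the same thing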

Common Reference-Based Metrics

Metric               What It Measures
Exact Match          Is the answer identical?
Semantic Similarity  Are the meanings equivalent?
BLEU/ROUGE           N-gram overlap with reference
Correctness          Does it convey the same facts?
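
To make the table concrete, here is a minimal sketch of exact match (with light normalization) and ROUGE-L, assuming the rouge-score package; BLEU is available in packages such as sacrebleu.

# Exact match after light normalization (lowercase, strip punctuation and whitespace)
import re

def _normalize(s: str) -> str:
    return re.sub(r"[^\w\s]", "", s).lower().strip()

def exact_match(generated: str, reference: str) -> bool:
    return _normalize(generated) == _normalize(reference)

# ROUGE-L via the rouge-score package (assumes: pip install rouge-score)
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score(
    target=test_case["reference"],
    prediction=test_case["generated"],
)
print(exact_match(test_case["generated"], test_case["reference"]))  # False: wording differs
print(rouge["rougeL"].fmeasure)  # LCS-based F1; below 1.0 even though the meaning matches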

When to Use Reference-Based

  • Factual Q&A with known answers
  • Classification tasks with defined labels
  • Extraction tasks with expected outputs
  • Regression testing against baseline responses

Reference-Free Evaluation

You have: a question and a generated response (no ground truth).
You measure: the intrinsic quality of the response.

# Reference-free evaluation example
test_case = {
    "question": "Write a poem about spring.",
    "generated": "Flowers bloom in gentle light..."
}

# Metrics evaluate response quality directly
quality_score = evaluate_quality(
    question=test_case["question"],
    response=test_case["generated"],
    criteria=["coherence", "creativity", "relevance"]
)
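
Reference-free criteria such as coherence or relevance are often scored with an LLM as judge. The sketch below is one possible shape for the hypothetical evaluate_quality helper, assuming the openai package, a gpt-4o-mini judge, and an illustrative 1-5 scoring prompt.

# Hypothetical evaluate_quality using an LLM as judge
# (assumes: pip install openai and OPENAI_API_KEY set in the environment)
import json
from openai import OpenAI

client = OpenAI()

def evaluate_quality(question: str, response: str, criteria: list[str]) -> dict:
    prompt = (
        "Rate the response to the question on each criterion from 1 (poor) to 5 (excellent). "
        "Reply with a JSON object mapping each criterion to its score.\n\n"
        f"Question: {question}\nResponse: {response}\nCriteria: {', '.join(criteria)}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask the judge for parseable output
    )
    return json.loads(completion.choices[0].message.content)

# e.g. {"coherence": 5, "creativity": 4, "relevance": 5}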

Common Reference-Free Metrics

Metric        What It Measures
Relevancy     Does it address the question?
Coherence     Is it logically structured?
Fluency       Is the language natural?
Safety        Is it free from harmful content?
Faithfulness  Does it stick to provided context? (RAG)
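
Faithfulness is the reference-free check most specific to RAG: the retrieved context stands in for a reference. A minimal sketch, again using an LLM judge (the prompt wording and YES/NO protocol are assumptions):

# Hypothetical faithfulness check: is every claim in the response supported by the context?
from openai import OpenAI

client = OpenAI()

def evaluate_faithfulness(context: str, response: str) -> bool:
    prompt = (
        "Does the response contain only claims supported by the context? "
        "Answer with exactly YES or NO.\n\n"
        f"Context: {context}\nResponse: {response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip().upper().startswith("YES")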

When to Use Reference-Free

  • Creative writing tasks
  • Open-ended conversations
  • Production traffic monitoring (no expected outputs)
  • Subjective quality assessment

Combining Both Approaches

Real-world evaluation often uses both:

evaluation_results = {
    # Reference-based (offline testing)
    "correctness": compare_to_reference(response, expected),

    # Reference-free (always applicable)
    "coherence": evaluate_coherence(response),
    "safety": check_safety(response),
    "relevancy": evaluate_relevancy(question, response)
}
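
In practice these scores feed a pass/fail gate, for example in a nightly regression run. A minimal sketch, assuming the helpers above return scores in the 0-1 range and the safety check returns a boolean (thresholds are illustrative):

# Illustrative quality gate over the combined results (thresholds are arbitrary choices)
def assert_response_quality(evaluation_results: dict) -> None:
    thresholds = {
        "correctness": 0.8,  # reference-based: only meaningful when a reference exists
        "coherence": 0.7,    # reference-free: applies to any response
        "relevancy": 0.7,
    }
    failures = {
        metric: score
        for metric, score in evaluation_results.items()
        if metric in thresholds and score < thresholds[metric]
    }
    assert evaluation_results.get("safety", True), "safety check failed"
    assert not failures, f"metrics below threshold: {failures}"

assert_response_quality(evaluation_results)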

Decision Guide

Scenario                       Approach
You have labeled test data     Reference-based
Evaluating production traffic  Reference-free
RAG faithfulness checking      Reference-free (compare to context)
A/B testing new prompts        Both
Regression testing             Reference-based

Next, we'll explore human evaluation and annotation, the gold standard for establishing ground truth.
