LLM Evaluation Fundamentals
Reference-Based vs Reference-Free Evaluation
3 min read
When evaluating LLM outputs, you have two fundamental approaches: compare against a known-good answer, or evaluate the output on its own merits.
Reference-Based Evaluation
You have: A question AND a correct/expected answer (ground truth)
You measure: How well the generated response matches the reference
# Reference-based evaluation example
test_case = {
    "question": "What is the capital of France?",
    "reference": "Paris is the capital of France.",
    "generated": "The capital of France is Paris."
}

# Metrics compare generated vs reference
similarity_score = evaluate_similarity(
    generated=test_case["generated"],
    reference=test_case["reference"]
)
Common Reference-Based Metrics
| Metric | What It Measures |
|---|---|
| Exact Match | Is the answer identical? |
| Semantic Similarity | Are the meanings equivalent? |
| BLEU/ROUGE | N-gram overlap with reference |
| Correctness | Does it convey the same facts? |
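To make these concrete, here is a minimal sketch of two reference-based checks using only the Python standard library: a normalized exact match and a rough unigram-overlap score in the spirit of ROUGE-1. The helper names and normalization rules are illustrative, not any library's API; in practice you would typically use an established BLEU/ROUGE implementation or an embedding model for semantic similarity.
# Minimal reference-based checks (illustrative, standard library only)
import re

def _tokens(text: str) -> list:
    # Lowercase, strip punctuation, split on whitespace
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(generated: str, reference: str) -> bool:
    # True only if the normalized token sequences are identical
    return _tokens(generated) == _tokens(reference)

def unigram_overlap(generated: str, reference: str) -> float:
    # Fraction of reference tokens that appear in the generated text
    # (a rough, recall-oriented stand-in for ROUGE-1)
    ref_tokens = set(_tokens(reference))
    gen_tokens = set(_tokens(generated))
    return len(ref_tokens & gen_tokens) / len(ref_tokens) if ref_tokens else 0.0

print(exact_match("The capital of France is Paris.",
                  "Paris is the capital of France."))     # False: different word order
print(unigram_overlap("The capital of France is Paris.",
                      "Paris is the capital of France."))  # 1.0: same tokens
Note how the same answer pair fails exact match but scores perfectly on overlap; this gap is exactly why semantic similarity metrics exist.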
When to Use Reference-Based
- Factual Q&A with known answers
- Classification tasks with defined labels
- Extraction tasks with expected outputs
- Regression testing against baseline responses (see the sketch below)
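For regression testing, one common pattern is to store baseline question/reference pairs and assert that new responses stay close to them. The sketch below assumes pytest, a hypothetical baseline.json file, a generate_response function that calls your model, and the unigram_overlap helper from the previous sketch; the 0.6 threshold is arbitrary.
# Regression test sketch (file name, helper, and threshold are illustrative)
import json
import pytest

with open("baseline.json") as f:  # hypothetical file of {"question": ..., "reference": ...} records
    BASELINE = json.load(f)

@pytest.mark.parametrize("case", BASELINE)
def test_matches_baseline(case):
    response = generate_response(case["question"])        # generate_response: your model call (hypothetical)
    score = unigram_overlap(response, case["reference"])  # helper from the sketch above
    assert score >= 0.6, f"Low overlap ({score:.2f}) for: {case['question']}"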
Reference-Free Evaluation
You have: A question and a generated response (no ground truth)
You measure: Intrinsic quality of the response
# Reference-free evaluation example
test_case = {
    "question": "Write a poem about spring.",
    "generated": "Flowers bloom in gentle light..."
}

# Metrics evaluate response quality directly
quality_score = evaluate_quality(
    question=test_case["question"],
    response=test_case["generated"],
    criteria=["coherence", "creativity", "relevance"]
)
Common Reference-Free Metrics
| Metric | What It Measures |
|---|---|
| Relevancy | Does it address the question? |
| Coherence | Is it logically structured? |
| Fluency | Is the language natural? |
| Safety | Is it free from harmful content? |
| Faithfulness | Does it stick to provided context? (RAG) |
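Reference-free metrics are often scored by an LLM acting as a judge. Here is a minimal sketch assuming the OpenAI Python SDK (openai>=1.0); the model name, prompt wording, 1-to-5 scale, and naive integer parsing are illustrative choices, and the same pattern works with any chat-completion API.
# LLM-as-judge sketch for reference-free scoring (assumes the OpenAI Python SDK)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Rate the response on {criterion} from 1 (poor) to 5 (excellent).
Question: {question}
Response: {response}
Reply with a single integer."""

def judge(question: str, response: str, criterion: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            criterion=criterion, question=question, response=response)}],
    )
    # Naive parsing: assumes the judge follows the "single integer" instruction
    return int(completion.choices[0].message.content.strip())

score = judge("Write a poem about spring.", "Flowers bloom in gentle light...", "coherence")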
When to Use Reference-Free
- Creative writing tasks
- Open-ended conversations
- Production traffic monitoring (no expected outputs)
- Subjective quality assessment
Combining Both Approaches
Real-world evaluation often uses both:
evaluation_results = {
    # Reference-based (offline testing)
    "correctness": compare_to_reference(response, expected),

    # Reference-free (always applicable)
    "coherence": evaluate_coherence(response),
    "safety": check_safety(response),
    "relevancy": evaluate_relevancy(question, response)
}
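Once you have a mixed score dictionary like this, a common pattern is to gate a release or flag a response by applying per-metric thresholds. The threshold values and the all-must-pass rule below are illustrative defaults, not a recommendation from any particular tool.
# Gate on combined scores (thresholds are illustrative)
THRESHOLDS = {
    "correctness": 0.8,  # reference-based metrics only apply when a reference exists
    "coherence": 0.7,
    "relevancy": 0.7,
    "safety": 1.0,       # hard requirement: any unsafe content fails the check
}

def passes(evaluation_results: dict) -> bool:
    # Every metric that was computed must meet its threshold
    return all(
        evaluation_results[name] >= minimum
        for name, minimum in THRESHOLDS.items()
        if name in evaluation_results
    )

print(passes({"correctness": 0.9, "coherence": 0.8, "relevancy": 0.75, "safety": 1.0}))  # True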
Decision Guide
| Scenario | Approach |
|---|---|
| You have labeled test data | Reference-based |
| Evaluating production traffic | Reference-free |
| RAG faithfulness checking | Reference-free (compare to context) |
| A/B testing new prompts | Both |
| Regression testing | Reference-based |
Next, we'll explore human evaluation and annotation, the gold standard for establishing ground truth.