LLM Evaluation Fundamentals
Reference-Based vs Reference-Free Evaluation
3 min read
When evaluating LLM outputs, you have two fundamental approaches: compare against a known-good answer, or evaluate the output on its own merits.
Reference-Based Evaluation
You have: A question AND a correct/expected answer (ground truth)
You measure: How well the generated response matches the reference
# Reference-based evaluation example
test_case = {
    "question": "What is the capital of France?",
    "reference": "Paris is the capital of France.",
    "generated": "The capital of France is Paris."
}

# Metrics compare generated vs reference
similarity_score = evaluate_similarity(
    generated=test_case["generated"],
    reference=test_case["reference"]
)
Common Reference-Based Metrics
| Metric | What It Measures |
|---|---|
| Exact Match | Is the answer identical? |
| Semantic Similarity | Are the meanings equivalent? |
| BLEU/ROUGE | N-gram overlap with reference |
| Correctness | Does it convey the same facts? |
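To make these concrete, here is a minimal sketch of two reference-based checks using only the Python standard library: a normalized exact match and a rough unigram-overlap score in the spirit of ROUGE-1. The helper names and normalization rules are illustrative, not any library's API; in practice you would typically use an established BLEU/ROUGE implementation or an embedding model for semantic similarity.
# Minimal reference-based checks (illustrative, standard library only)
import re

def _tokens(text: str) -> list:
    # Lowercase, strip punctuation, split on whitespace
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(generated: str, reference: str) -> bool:
    # True only if the normalized token sequences are identical
    return _tokens(generated) == _tokens(reference)

def unigram_overlap(generated: str, reference: str) -> float:
    # Fraction of reference tokens that appear in the generated text
    # (a rough, recall-oriented stand-in for ROUGE-1)
    ref_tokens = set(_tokens(reference))
    gen_tokens = set(_tokens(generated))
    return len(ref_tokens & gen_tokens) / len(ref_tokens) if ref_tokens else 0.0

print(exact_match("The capital of France is Paris.",
                  "Paris is the capital of France."))     # False: different word order
print(unigram_overlap("The capital of France is Paris.",
                      "Paris is the capital of France."))  # 1.0: same tokens
Note how the same answer pair fails exact match but scores perfectly on overlap; this gap is exactly why semantic similarity metrics exist.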
When to Use Reference-Based
- Factual Q&A with known answers
- Classification tasks with defined labels
- Extraction tasks with expected outputs
- Regression testing against baseline responses (see the sketch below)
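For regression testing, one common pattern is to store baseline question/reference pairs and assert that new responses stay close to them. The sketch below assumes pytest, a hypothetical baseline.json file, a generate_response function that calls your model, and the unigram_overlap helper from the previous sketch; the 0.6 threshold is arbitrary.
# Regression test sketch (file name, helper, and threshold are illustrative)
import json
import pytest

with open("baseline.json") as f:  # hypothetical file of {"question": ..., "reference": ...} records
    BASELINE = json.load(f)

@pytest.mark.parametrize("case", BASELINE)
def test_matches_baseline(case):
    response = generate_response(case["question"])        # generate_response: your model call (hypothetical)
    score = unigram_overlap(response, case["reference"])  # helper from the sketch above
    assert score >= 0.6, f"Low overlap ({score:.2f}) for: {case['question']}"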
Reference-Free Evaluation
You have: A question and a generated response (no ground truth)
You measure: Intrinsic quality of the response
# Reference-free evaluation example
test_case = {
    "question": "Write a poem about spring.",
    "generated": "Flowers bloom in gentle light..."
}

# Metrics evaluate response quality directly
quality_score = evaluate_quality(
    question=test_case["question"],
    response=test_case["generated"],
    criteria=["coherence", "creativity", "relevance"]
)
Common Reference-Free Metrics
| Metric | What It Measures |
|---|---|
| Relevancy | Does it address the question? |
| Coherence | Is it logically structured? |
| Fluency | Is the language natural? |
| Safety | Is it free from harmful content? |
| Faithfulness | Does it stick to provided context? (RAG) |
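Reference-free metrics are often scored by an LLM acting as a judge. Here is a minimal sketch assuming the OpenAI Python SDK (openai>=1.0); the model name, prompt wording, 1-to-5 scale, and naive integer parsing are illustrative choices, and the same pattern works with any chat-completion API.
# LLM-as-judge sketch for reference-free scoring (assumes the OpenAI Python SDK)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Rate the response on {criterion} from 1 (poor) to 5 (excellent).
Question: {question}
Response: {response}
Reply with a single integer."""

def judge(question: str, response: str, criterion: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            criterion=criterion, question=question, response=response)}],
    )
    # Naive parsing: assumes the judge follows the "single integer" instruction
    return int(completion.choices[0].message.content.strip())

score = judge("Write a poem about spring.", "Flowers bloom in gentle light...", "coherence")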
When to Use Reference-Free
- Creative writing tasks
- Open-ended conversations
- Production traffic monitoring (no expected outputs)
- Subjective quality assessment
Combining Both Approaches
Real-world evaluation often uses both:
evaluation_results = {
    # Reference-based (offline testing)
    "correctness": compare_to_reference(response, expected),

    # Reference-free (always applicable)
    "coherence": evaluate_coherence(response),
    "safety": check_safety(response),
    "relevancy": evaluate_relevancy(question, response)
}
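Once you have a mixed score dictionary like this, a common pattern is to gate a release or flag a response by applying per-metric thresholds. The threshold values and the all-must-pass rule below are illustrative defaults, not a recommendation from any particular tool.
# Gate on combined scores (thresholds are illustrative)
THRESHOLDS = {
    "correctness": 0.8,  # reference-based metrics only apply when a reference exists
    "coherence": 0.7,
    "relevancy": 0.7,
    "safety": 1.0,       # hard requirement: any unsafe content fails the check
}

def passes(evaluation_results: dict) -> bool:
    # Every metric that was computed must meet its threshold
    return all(
        evaluation_results[name] >= minimum
        for name, minimum in THRESHOLDS.items()
        if name in evaluation_results
    )

print(passes({"correctness": 0.9, "coherence": 0.8, "relevancy": 0.75, "safety": 1.0}))  # True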
Decision Guide
| Scenario | Approach |
|---|---|
| You have labeled test data | Reference-based |
| Evaluating production traffic | Reference-free |
| RAG faithfulness checking | Reference-free (compare to context) |
| A/B testing new prompts | Both |
| Regression testing | Reference-based |
Next, we'll explore human evaluation and annotation, the gold standard for establishing ground truth.