RAG Evaluation & Testing
RAG Evaluation Metrics
Evaluating RAG systems requires metrics beyond traditional ML accuracy. You need to measure both retrieval quality and generation faithfulness.
The RAG Evaluation Challenge
Traditional metrics don't capture RAG-specific failures:
# Traditional metrics miss critical issues:
# Scenario 1: High BLEU score, but hallucinated facts
generated = "The company was founded in 2015 by John Smith"
reference = "The company was founded in 2015 by Jane Smith"
# BLEU: 0.85 (looks good!)
# Reality: Wrong founder name (critical error)
# Scenario 2: Low BLEU score, but factually correct
generated = "Jane Smith established the business in 2015"
reference = "The company was founded in 2015 by Jane Smith"
# BLEU: 0.42 (looks bad!)
# Reality: Same facts, different wording (acceptable)
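The BLEU numbers above are illustrative. If you want to reproduce the effect yourself, one option is NLTK's sentence-level BLEU; the exact scores depend on tokenization and smoothing, but the pattern holds: the near-verbatim answer with the wrong founder scores high, while the correct paraphrase scores low.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(reference: str, hypothesis: str) -> float:
    # Sentence-level BLEU with smoothing so short sentences don't
    # collapse to zero when a higher-order n-gram has no match.
    smooth = SmoothingFunction().method1
    return sentence_bleu(
        [reference.split()],   # BLEU expects a list of tokenized references
        hypothesis.split(),
        smoothing_function=smooth,
    )

reference = "The company was founded in 2015 by Jane Smith"
print(bleu(reference, "The company was founded in 2015 by John Smith"))  # high, despite the wrong name
print(bleu(reference, "Jane Smith established the business in 2015"))    # low, despite being correct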
Component-Based Evaluation
RAG systems have three components to evaluate:
┌─────────────────────────────────────────────────────────────┐
│ RAG Evaluation Framework │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ RETRIEVAL │───▶│ CONTEXT │───▶│ GENERATION │ │
│ │ QUALITY │ │ QUALITY │ │ QUALITY │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ • Context Recall • Relevance • Faithfulness │
│ • Context Precision • Noise Ratio • Answer Relevancy │
│ • MRR, NDCG • Coverage • Correctness │
│ │
└─────────────────────────────────────────────────────────────┘
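Before diving into individual metrics, it helps to have one place to collect per-component results. A minimal container sketch (the class and field names are illustrative, not from any framework):

from dataclasses import dataclass, field

@dataclass
class RAGEvalResult:
    # Retrieval quality
    context_precision: float = 0.0
    context_recall: float = 0.0
    mrr: float = 0.0
    # Generation quality
    faithfulness: float = 0.0
    answer_relevancy: float = 0.0
    # Claims the judge could not verify against the retrieved context
    unsupported_claims: list[str] = field(default_factory=list)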
Retrieval Metrics
Context Precision
Measures if retrieved documents are relevant:
def context_precision(retrieved_contexts: list, relevant_contexts: list) -> float:
"""
What proportion of retrieved contexts are actually relevant?
High precision = Few irrelevant documents retrieved
Low precision = Many irrelevant documents polluting context
"""
relevant_retrieved = set(retrieved_contexts) & set(relevant_contexts)
if not retrieved_contexts:
return 0.0
return len(relevant_retrieved) / len(retrieved_contexts)
# Example
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc1", "doc3", "doc7"]
precision = context_precision(retrieved, relevant)
# Result: 2/5 = 0.4 (Only 2 of 5 retrieved docs are relevant)
Context Recall
Measures if all relevant information was retrieved:
def context_recall(retrieved_contexts: list, relevant_contexts: list) -> float:
"""
What proportion of relevant contexts were retrieved?
High recall = All relevant information found
Low recall = Missing important context
"""
relevant_retrieved = set(retrieved_contexts) & set(relevant_contexts)
if not relevant_contexts:
return 1.0
return len(relevant_retrieved) / len(relevant_contexts)
# Example
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc1", "doc3", "doc7"]
recall = context_recall(retrieved, relevant)
# Result: 2/3 = 0.67 (Retrieved 2 of 3 relevant docs, missed doc7)
Mean Reciprocal Rank (MRR)
Measures how high the first relevant result ranks:
def mean_reciprocal_rank(queries_results: list[list], relevant_docs: list[set]) -> float:
"""
Average of 1/rank for first relevant result per query.
MRR = 1.0 means first result is always relevant
MRR = 0.5 means first relevant result is typically rank 2
"""
reciprocal_ranks = []
for results, relevant in zip(queries_results, relevant_docs):
for rank, doc in enumerate(results, 1):
if doc in relevant:
reciprocal_ranks.append(1 / rank)
break
else:
reciprocal_ranks.append(0)
return sum(reciprocal_ranks) / len(reciprocal_ranks)
# Example
query_results = [
["doc2", "doc1", "doc3"], # Query 1: relevant doc1 at rank 2
["doc5", "doc6", "doc7"], # Query 2: no relevant docs
["doc8", "doc9", "doc4"], # Query 3: relevant doc4 at rank 3
]
relevant = [{"doc1"}, {"doc10"}, {"doc4"}]
mrr = mean_reciprocal_rank(query_results, relevant)
# Result: (1/2 + 0 + 1/3) / 3 = 0.278
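The framework diagram also lists NDCG, which, unlike MRR, credits every relevant document in the ranking and discounts hits that appear lower down. A minimal binary-relevance sketch (the function name and the @k cutoff are illustrative):

import math

def ndcg_at_k(results: list, relevant: set, k: int = 10) -> float:
    """
    Normalized Discounted Cumulative Gain with binary relevance:
    a relevant doc at rank i contributes 1 / log2(i + 1),
    normalized by the best achievable ordering (all relevant docs first).
    """
    dcg = sum(
        1 / math.log2(rank + 1)
        for rank, doc in enumerate(results[:k], 1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single relevant doc sits at rank 2
# ndcg_at_k(["doc2", "doc1", "doc3"], {"doc1"}) -> 1 / log2(3) ≈ 0.63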
Generation Metrics
Faithfulness
Measures if the answer is grounded in retrieved context:
def assess_faithfulness(answer: str, context: str, llm) -> dict:
    """
    Faithfulness checks if every claim in the answer
    can be verified from the retrieved context.
    Uses an LLM-as-judge approach: `llm` is assumed to be any
    caller-supplied callable that maps a prompt string to the
    model's text response.
    """
    # Step 1: Extract claims from the answer
    claims_prompt = f"""
    Extract all factual claims from this answer:
    Answer: {answer}
    List each claim on a new line.
    """
    claims = [c.strip() for c in llm(claims_prompt).splitlines() if c.strip()]
    if not claims:
        # Nothing to verify, so nothing can be unfaithful
        return {"score": 1.0, "unsupported_claims": []}

    # Step 2: Verify each claim against context
    claims_text = "\n".join(claims)
    verify_prompt = f"""
    For each claim, determine if it can be verified from the context.
    Context: {context}
    Claims: {claims_text}
    For each claim, respond on its own line with:
    - SUPPORTED: Claim is directly supported by context
    - NOT_SUPPORTED: Claim cannot be verified from context
    """
    # Naive line-based parsing of the judge's verdicts; production code
    # should parse more defensively (e.g. structured output).
    verdicts = [v.strip().lstrip("- ") for v in llm(verify_prompt).splitlines() if v.strip()]

    # Step 3: Calculate faithfulness score
    # Faithfulness = supported_claims / total_claims
    supported_claims = sum(1 for v in verdicts if v.startswith("SUPPORTED"))
    unsupported_list = [
        claim for claim, verdict in zip(claims, verdicts)
        if not verdict.startswith("SUPPORTED")
    ]
    return {
        "score": supported_claims / len(claims),
        "unsupported_claims": unsupported_list,
    }
# Example output
# {
# "score": 0.75, # 3 of 4 claims supported
# "unsupported_claims": ["The company has 500 employees"]
# }
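A usage sketch for the function above: `my_llm` stands for whatever client wrapper you use (it only needs to map a prompt string to a response string), and the sample texts are made up.

def my_llm(prompt: str) -> str:
    # Wire this to your LLM client of choice (hosted API or local model).
    raise NotImplementedError

context = "Acme Corp was founded in 2015 by Jane Smith and is headquartered in Austin."
answer = "Acme Corp was founded in 2015 by Jane Smith. It has 500 employees."

result = assess_faithfulness(answer, context, llm=my_llm)
# Roughly: the claim about 500 employees should come back NOT_SUPPORTED,
# pulling the score below 1.0.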
Answer Relevancy
Measures if the answer addresses the question:
def assess_answer_relevancy(question: str, answer: str) -> float:
    """
    Answer relevancy checks whether the answer actually
    addresses what was asked.
    Approach: generate questions that the answer would address,
    then compare their semantic similarity to the original question.
    Assumes three helpers (generate_questions_from_answer, embed,
    cosine_similarity), sketched after the example below.
    """
    # Generate questions that the answer would address
    generated_questions = generate_questions_from_answer(answer, n=3)
    # Compare each generated question to the original question
    similarities = []
    for gen_q in generated_questions:
        sim = cosine_similarity(embed(question), embed(gen_q))
        similarities.append(sim)
    return sum(similarities) / len(similarities)
# Example
question = "What is the capital of France?"
answer = "Paris is the capital of France, located on the Seine River."
# Generated questions from answer:
# - "What is the capital of France?"
# - "Where is Paris located?"
# - "Which river runs through Paris?"
# Similarity to original: [0.95, 0.3, 0.2]
# Relevancy score: 0.48
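The helpers assumed above are not defined in this section; here is one possible sketch, assuming the sentence-transformers library for embeddings and a generic llm_complete wrapper for question generation (the model name and the wrapper are illustrative choices, not requirements).

import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def llm_complete(prompt: str) -> str:
    # Placeholder: wire this to whichever LLM client you use.
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    # Encode text into a dense vector
    return _embedder.encode(text)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def generate_questions_from_answer(answer: str, n: int = 3) -> list[str]:
    # Ask the LLM to propose questions the answer would address,
    # one per line, then keep the first n non-empty lines.
    prompt = (
        f"Write {n} distinct questions that the following answer addresses, "
        f"one question per line:\n{answer}"
    )
    return [q.strip() for q in llm_complete(prompt).splitlines() if q.strip()][:n]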
Metric Selection Guide
| Metric | Measures | Use When |
|---|---|---|
| Context Precision | Retrieval accuracy | You have ground truth labels |
| Context Recall | Retrieval coverage | Missing info causes failures |
| MRR | Ranking quality | Top results matter most |
| Faithfulness | Hallucination prevention | Accuracy is critical |
| Answer Relevancy | Response quality | Answers seem off-topic |
Combined Scoring
def rag_quality_score(
context_precision: float,
context_recall: float,
faithfulness: float,
answer_relevancy: float,
weights: dict = None
) -> float:
"""
Weighted combination of RAG metrics.
Adjust weights based on your priorities.
"""
weights = weights or {
"context_precision": 0.2,
"context_recall": 0.2,
"faithfulness": 0.4, # Usually most important
"answer_relevancy": 0.2
}
score = (
weights["context_precision"] * context_precision +
weights["context_recall"] * context_recall +
weights["faithfulness"] * faithfulness +
weights["answer_relevancy"] * answer_relevancy
)
return score
# Example
quality = rag_quality_score(
context_precision=0.8,
context_recall=0.7,
faithfulness=0.9,
answer_relevancy=0.85
)
# Result: 0.2*0.8 + 0.2*0.7 + 0.4*0.9 + 0.2*0.85 = 0.83
Key Insight: Faithfulness is typically the most critical metric for production RAG systems. Users can tolerate slightly off-topic answers, but hallucinated facts destroy trust.
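In a test suite, that insight usually turns into a hard gate on faithfulness plus a softer threshold on the combined score. A minimal sketch reusing rag_quality_score from above (the threshold values are illustrative):

def check_rag_quality(metrics: dict,
                      faithfulness_floor: float = 0.9,
                      overall_floor: float = 0.75) -> None:
    # Hard gate: hallucinated facts are unacceptable regardless of other scores
    assert metrics["faithfulness"] >= faithfulness_floor, (
        f"Faithfulness {metrics['faithfulness']:.2f} is below {faithfulness_floor}"
    )
    # Softer gate on the weighted combination of all four metrics
    overall = rag_quality_score(
        context_precision=metrics["context_precision"],
        context_recall=metrics["context_recall"],
        faithfulness=metrics["faithfulness"],
        answer_relevancy=metrics["answer_relevancy"],
    )
    assert overall >= overall_floor, (
        f"Overall RAG quality {overall:.2f} is below {overall_floor}"
    )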
Next, let's implement these metrics using the RAGAS framework.