# RAG Architecture Deep Dive

## Common Failure Modes
Understanding why RAG systems fail is crucial for building robust solutions. Most failures occur in retrieval, not generation.
### Failure Categories
| Category | Cause | Impact |
|---|---|---|
| Retrieval Failures | Wrong or missing chunks | Incomplete/wrong answers |
| Context Poisoning | Irrelevant content in context | Confused generation |
| Lost in the Middle | Key info buried in context | Missed information |
| Hallucination | Over-reliance on parametric knowledge | False statements |
### Retrieval Failures
#### 1. Query-Document Mismatch

User queries often don't match the wording used in the documents:
```python
# User asks: "How do I cancel my subscription?"
# Document contains: "Subscription termination procedure..."

# Solution: Query expansion
def expand_query(query: str, llm) -> str:
    prompt = f"""Rewrite this user question to match formal documentation style:

User question: {query}
Documentation-style query:"""
    return llm.invoke(prompt)
```
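The expanded query is then used for retrieval in place of the raw question. A minimal usage sketch, assuming `llm` and `vectorstore` already exist elsewhere in the pipeline:

```python
# Sketch only: `llm` and `vectorstore` are assumed to be pre-existing objects.
expanded = expand_query("How do I cancel my subscription?", llm)

# Chat models typically return a message object rather than a plain string,
# so unwrap the text before searching if needed.
query_text = getattr(expanded, "content", expanded)
results = vectorstore.similarity_search(query_text, k=4)
```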
#### 2. Insufficient Retrieval

Not enough relevant documents are retrieved:
```python
# Problem: Top-4 misses critical information
results = vectorstore.similarity_search(query, k=4)  # Too few

# Solution: Retrieve more, then rerank
results = vectorstore.similarity_search(query, k=20)
top_results = reranker.rerank(query, results, top_k=4)
```
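The `reranker` object above isn't defined in this snippet. One common way to build it is with a cross-encoder; here is a rough sketch using the `sentence-transformers` library (the class, model name, and method signature are illustrative, not part of the original pipeline):

```python
from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    """Illustrative reranker: scores each (query, chunk) pair with a cross-encoder."""

    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, docs, top_k: int = 4):
        # Score every retrieved chunk against the query.
        scores = self.model.predict([(query, d.page_content) for d in docs])
        # Keep only the highest-scoring chunks.
        ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]

reranker = CrossEncoderReranker()
```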
#### 3. Chunking Artifacts

Important context is split across chunk boundaries:

```text
Chunk 1: "The return policy allows customers to..."
Chunk 2: "...return items within 30 days of purchase."
```

Solution: Use overlapping chunks or parent-child retrieval, as sketched below.
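A minimal sketch of the overlap approach using LangChain's `RecursiveCharacterTextSplitter` (the chunk sizes are illustrative, and `documents` is assumed to be your loaded corpus):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Overlapping chunks: each chunk repeats the tail of the previous one,
# so a sentence cut at a boundary still appears intact in at least one chunk.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # illustrative values; tune for your corpus
    chunk_overlap=100,
)
chunks = splitter.split_documents(documents)
```

For the parent-child variant, LangChain's `ParentDocumentRetriever` indexes small chunks for search but returns the larger parent documents they came from, so the generator sees the surrounding context.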
### Context Poisoning

Irrelevant but semantically similar content ends up in the context:
```python
# Query: "Python list methods"
# Retrieved: Article about "Python the snake" diet lists

# Solution: Metadata filtering
results = vectorstore.similarity_search(
    query,
    k=10,
    filter={"category": "programming"},
)
```
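Filtering only works if the metadata was attached at ingestion time, and the exact `filter` syntax varies between vector stores. A hedged sketch of tagging documents up front (the field names are illustrative):

```python
from langchain_core.documents import Document

# Attach a category when documents are ingested so it can be filtered on later.
docs = [
    Document(
        page_content="list.append(x) adds an item to the end of the list.",
        metadata={"category": "programming", "source": "python-docs"},
    ),
    Document(
        page_content="Ball pythons eat a diet of small rodents.",
        metadata={"category": "wildlife", "source": "reptile-care"},
    ),
]
vectorstore.add_documents(docs)
```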
### Lost in the Middle

LLMs pay less attention to the middle of their context window:

| Position | Attention |
|---|---|
| Beginning | High |
| Middle | Low ← critical info is often lost here |
| End | High |
Solutions:
```python
# 1. Limit context to the most relevant chunks
context_docs = reranked_docs[:3]  # Don't overload

# 2. Reorder by importance
def reorder_for_attention(docs):
    """Place the most important docs at the start and end."""
    if len(docs) <= 2:
        return docs
    # Most relevant first, second-most relevant at the end
    return [docs[0]] + docs[2:] + [docs[1]]

# 3. Summarize long contexts
if total_tokens(docs) > 2000:
    docs = [summarize(doc) for doc in docs]
```
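`total_tokens` and `summarize` are left undefined above. A rough sketch of the token counter using `tiktoken` (the encoding name is an assumption; `summarize` would typically be one more LLM call per over-long chunk):

```python
import tiktoken

def total_tokens(docs, encoding_name: str = "cl100k_base") -> int:
    """Rough token count across all chunks; the encoding name is an assumption."""
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(d.page_content)) for d in docs)
```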
### Hallucination Sources

#### Parametric Override

The model's training knowledge overrides the retrieved context:
```python
# Context: "Our company was founded in 2019"
# Model output: "Founded in 2015" (from training data)

# Solution: Explicit grounding instructions
prompt = """Answer ONLY using the provided context.
If the context doesn't contain the answer, say "I don't have this information."

Context: {context}
Question: {question}"""
```
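One way to wire this template into the pipeline, sketched with LangChain's `ChatPromptTemplate` (the chain composition and the `llm`, `context`, and `question` names are assumptions about the surrounding code):

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

grounded_prompt = ChatPromptTemplate.from_template(
    """Answer ONLY using the provided context.
If the context doesn't contain the answer, say "I don't have this information."

Context: {context}
Question: {question}"""
)

# The chain fills the template, calls the model, and returns the reply text.
chain = grounded_prompt | llm | StrOutputParser()
answer = chain.invoke({"context": context, "question": question})
```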
#### Context Gaps

Missing information leads to fabrication:
```python
# Detect and handle gaps
def generate_with_confidence(query: str, context: str, llm):
    response = llm.invoke(f"""
Based on the context, answer the question.
Rate your confidence (high/medium/low) based on context coverage.

Context: {context}
Question: {query}

Format:
Answer: [your answer]
Confidence: [high/medium/low]
""")
    if "Confidence: low" in response:
        return "I don't have enough information to answer this accurately."
    return response
```
### Debugging Checklist

When RAG quality is poor, work through these checks:
1. **Retrieval Quality**
   - □ Are relevant documents being retrieved?
   - □ Is the embedding model appropriate for your domain?
   - □ Are chunks the right size?
2. **Context Quality**
   - □ Is irrelevant content polluting the context?
   - □ Are metadata filters correctly applied?
   - □ Is the context too long (lost in the middle)?
3. **Generation Quality**
   - □ Is the prompt grounding the model in the context?
   - □ Are confidence thresholds appropriate?
   - □ Does the model admit uncertainty?
### Quick Diagnostics
```python
def diagnose_rag(query: str, pipeline):
    """Diagnose RAG pipeline issues."""
    # Check retrieval
    docs = pipeline.retrieve(query, k=10)
    print(f"Retrieved {len(docs)} documents")

    # Check relevance scores
    for i, doc in enumerate(docs[:5]):
        print(f"{i+1}. Score: {doc.metadata.get('score', 'N/A')}")
        print(f"   Content: {doc.page_content[:100]}...")

    # Check for duplicates
    contents = [d.page_content for d in docs]
    duplicates = len(contents) - len(set(contents))
    print(f"Duplicate chunks: {duplicates}")

    # Check context length
    total_chars = sum(len(d.page_content) for d in docs[:4])
    print(f"Total context length: {total_chars} chars")
```
**Debugging Principle:** When RAG fails, check retrieval first. 80% of quality issues originate in retrieval, not generation.

In the next module, we'll dive deep into embedding models and vector databases, the foundation of effective retrieval.