RAG Architecture Deep Dive

Common Failure Modes

Understanding why RAG systems fail is crucial for building robust solutions. Most failures occur in retrieval, not generation.

Failure Categories

Category            | Cause                                  | Impact
--------------------|----------------------------------------|--------------------------
Retrieval Failures  | Wrong or missing chunks                | Incomplete/wrong answers
Context Poisoning   | Irrelevant content in context          | Confused generation
Lost in the Middle  | Key info buried in context             | Missed information
Hallucination       | Over-reliance on parametric knowledge  | False statements

Retrieval Failures

1. Query-Document Mismatch

Users phrase questions in everyday language that rarely matches the wording of the documents:

# User asks: "How do I cancel my subscription?"
# Document contains: "Subscription termination procedure..."

# Solution: Query expansion
def expand_query(query: str, llm) -> str:
    prompt = f"""Rewrite this user question to match formal documentation style:
    User question: {query}
    Documentation-style query:"""
    return llm.invoke(prompt)
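
For example, the rewritten query (not the raw user question) is what gets embedded and searched. A minimal usage sketch, assuming the llm and vectorstore objects used elsewhere in this lesson:

# Hypothetical usage: retrieve with the documentation-style rewrite
expanded = expand_query("How do I cancel my subscription?", llm)
results = vectorstore.similarity_search(expanded, k=4)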

2. Insufficient Retrieval

Not enough relevant documents retrieved:

# Problem: Top-4 misses critical information
results = vectorstore.similarity_search(query, k=4)  # Too few

# Solution: Retrieve more, then rerank
results = vectorstore.similarity_search(query, k=20)
top_results = reranker.rerank(query, results, top_k=4)
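
The reranker object above is left abstract. One common implementation is a cross-encoder that scores each (query, chunk) pair directly; here is a minimal standalone sketch assuming the sentence-transformers package, with the model name only as an example:

from sentence_transformers import CrossEncoder

# Example cross-encoder checkpoint; swap in whatever reranking model you use
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs, top_k: int = 4):
    """Score each (query, chunk) pair and keep the highest-scoring chunks."""
    scores = cross_encoder.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]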

3. Chunking Artifacts

Important context split across chunks:

Chunk 1: "The return policy allows customers to..."
Chunk 2: "...return items within 30 days of purchase."

Solution: Use overlapping chunks or parent-child retrieval.
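
A minimal sketch of the overlap approach, assuming LangChain's text splitter (the package path varies slightly between versions) and a hypothetical policy_text string:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Overlap keeps sentences that straddle a boundary present in both chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_text(policy_text)

With parent-child retrieval, the small chunks are what get embedded and matched, but the larger parent section they came from is what gets passed to the model.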

Context Poisoning

Irrelevant but semantically similar content:

# Query: "Python list methods"
# Retrieved: Article about "Python the snake" diet lists

# Solution: Metadata filtering
results = vectorstore.similarity_search(
    query,
    k=10,
    filter={"category": "programming"}
)

Lost in the Middle

LLMs pay less attention to middle context:

Position    | Attention
------------|----------
Beginning   | High
Middle      | Low  ← Critical info often lost here
End         | High

Solutions:

# 1. Limit context to most relevant chunks
context_docs = reranked_docs[:3]  # Don't overload

# 2. Reorder by importance
def reorder_for_attention(docs):
    """Place most important at start and end."""
    if len(docs) <= 2:
        return docs
    # Most relevant first, second-most at end
    return [docs[0]] + docs[2:] + [docs[1]]

# 3. Summarize long contexts
if total_tokens(docs) > 2000:
    docs = [summarize(doc) for doc in docs]
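
The total_tokens and summarize helpers above are placeholders. A minimal token counter using tiktoken might look like this (the encoding name is an example; pick the one matching your model):

import tiktoken

# cl100k_base is an example encoding; choose the one that matches your model
_ENCODING = tiktoken.get_encoding("cl100k_base")

def total_tokens(docs) -> int:
    """Rough token count across all retrieved chunks."""
    return sum(len(_ENCODING.encode(d.page_content)) for d in docs)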

Hallucination Sources

Parametric Override

The model's training (parametric) knowledge overrides the retrieved context:

# Context: "Our company was founded in 2019"
# Model output: "Founded in 2015" (from training data)

# Solution: Explicit grounding instructions
prompt = """Answer ONLY using the provided context.
If the context doesn't contain the answer, say "I don't have this information."

Context: {context}

Question: {question}"""
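
At query time the template is filled in with the retrieved chunks and the user's question. A minimal sketch, assuming the context_docs, query, and llm objects from the earlier snippets:

# Join the reranked chunks into a single context string, then ground the model
context = "\n\n".join(d.page_content for d in context_docs)
answer = llm.invoke(prompt.format(context=context, question=query))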

Context Gaps

When the context doesn't cover the question, the model tends to fabricate an answer:

# Detect and handle gaps
def generate_with_confidence(query: str, context: str, llm) -> str:
    """Ask the model to rate its own coverage and refuse on low confidence."""
    response = llm.invoke(f"""
    Based on the context, answer the question.
    Rate your confidence (high/medium/low) based on context coverage.

    Context: {context}
    Question: {query}

    Format:
    Answer: [your answer]
    Confidence: [high/medium/low]
    """)

    # Case-insensitive check; assumes llm.invoke returns plain text
    # (use response.content first if your client returns message objects)
    if "confidence: low" in response.lower():
        return "I don't have enough information to answer this accurately."
    return response

Debugging Checklist

When RAG quality is poor:

1. Retrieval Quality
   □ Are relevant documents being retrieved?
   □ Is the embedding model appropriate for your domain?
   □ Are chunks the right size?

2. Context Quality
   □ Is irrelevant content polluting context?
   □ Are metadata filters correctly applied?
   □ Is context too long (lost in middle)?

3. Generation Quality
   □ Is the prompt grounding the model to context?
   □ Are confidence thresholds appropriate?
   □ Does the model admit uncertainty?

Quick Diagnostics

def diagnose_rag(query: str, pipeline):
    """Diagnose RAG pipeline issues."""
    # Check retrieval
    docs = pipeline.retrieve(query, k=10)
    print(f"Retrieved {len(docs)} documents")

    # Check relevance scores
    for i, doc in enumerate(docs[:5]):
        print(f"{i+1}. Score: {doc.metadata.get('score', 'N/A')}")
        print(f"   Content: {doc.page_content[:100]}...")

    # Check for duplicates
    contents = [d.page_content for d in docs]
    duplicates = len(contents) - len(set(contents))
    print(f"Duplicate chunks: {duplicates}")

    # Check context length
    total_chars = sum(len(d.page_content) for d in docs[:4])
    print(f"Total context length: {total_chars} chars")

Debugging Principle: When RAG fails, check retrieval first. 80% of quality issues originate in retrieval, not generation.
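
One way to act on this principle is a retrieval-only check: take a handful of questions whose source document you already know and measure how often that document appears in the top-k results, before involving the generator at all. A minimal sketch, assuming the same pipeline.retrieve interface as above, a hypothetical eval_set, and a doc_id metadata field:

# Hypothetical labeled examples: (question, id of the document that answers it)
eval_set = [
    ("How do I cancel my subscription?", "billing-faq"),
    ("What is the return window?", "returns-policy"),
]

def retrieval_hit_rate(pipeline, eval_set, k: int = 4) -> float:
    """Fraction of questions whose known source shows up in the top-k results."""
    hits = 0
    for question, expected_id in eval_set:
        docs = pipeline.retrieve(question, k=k)
        if any(d.metadata.get("doc_id") == expected_id for d in docs):
            hits += 1
    return hits / len(eval_set)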

In the next module, we'll dive deep into embedding models and vector databases—the foundation of effective retrieval.
