Lesson 4 of 24

RAG Architecture Deep Dive

Common Failure Modes


Understanding why RAG systems fail is crucial for building robust solutions. Most failures occur in retrieval, not generation.

Failure Categories

Category            | Cause                                  | Impact
--------------------|----------------------------------------|-------------------------
Retrieval Failures  | Wrong or missing chunks                | Incomplete/wrong answers
Context Poisoning   | Irrelevant content in context          | Confused generation
Lost in the Middle  | Key info buried in context             | Missed information
Hallucination       | Over-reliance on parametric knowledge  | False statements

Retrieval Failures

1. Query-Document Mismatch

User queries often don't use the same wording as the documents:

# User asks: "How do I cancel my subscription?"
# Document contains: "Subscription termination procedure..."

# Solution: Query expansion
def expand_query(query: str, llm) -> str:
    prompt = f"""Rewrite this user question to match formal documentation style:
    User question: {query}
    Documentation-style query:"""
    return llm.invoke(prompt)
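
For instance, the rewritten query can simply replace the original before search (the vectorstore and llm objects here stand in for whatever store and model your pipeline already uses):

# Hypothetical usage: search with the documentation-style rewrite
expanded = expand_query("How do I cancel my subscription?", llm)
results = vectorstore.similarity_search(expanded, k=4)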

2. Insufficient Retrieval

Not enough relevant documents retrieved:

# Problem: Top-4 misses critical information
results = vectorstore.similarity_search(query, k=4)  # Too few

# Solution: Retrieve more, then rerank
results = vectorstore.similarity_search(query, k=20)
top_results = reranker.rerank(query, results, top_k=4)
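
The reranker above is left abstract; one minimal sketch uses a sentence-transformers cross-encoder (the model name and wrapper class are illustrative, not part of this lesson's pipeline):

from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    """Scores each (query, document) pair and keeps the highest-scoring ones."""
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, docs, top_k: int = 4):
        # Score every candidate chunk against the query
        scores = self.model.predict([(query, d.page_content) for d in docs])
        # Keep the top_k highest-scoring documents
        ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]

reranker = CrossEncoderReranker()

Cross-encoders are slower than embedding similarity because they process the query together with each document, which is why they run over a small candidate set (20 here) rather than the whole corpus.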

3. Chunking Artifacts

Important context split across chunks:

Chunk 1: "The return policy allows customers to..."
Chunk 2: "...return items within 30 days of purchase."

Solution: Use overlapping chunks or parent-child retrieval.
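
A minimal sketch of the overlap approach, assuming LangChain's recursive splitter (the chunk sizes are illustrative and should be tuned to your corpus):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Overlap keeps sentences that straddle a boundary present in both chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,  # ~20% overlap is a common starting point
)
chunks = splitter.split_text(document_text)  # document_text: your raw document string

Parent-child retrieval takes the other route: index small chunks for precise matching, but pass the larger parent section they came from to the LLM (LangChain's ParentDocumentRetriever is one implementation).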

Context Poisoning

Irrelevant but semantically similar content:

# Query: "Python list methods"
# Retrieved: Article about "Python the snake" diet lists

# Solution: Metadata filtering
results = vectorstore.similarity_search(
    query,
    k=10,
    filter={"category": "programming"}
)

Lost in the Middle

LLMs pay less attention to content in the middle of the context window:

Position    | Attention
------------|----------
Beginning   | High
Middle      | Low  ← Critical info often lost here
End         | High

Solutions:

# 1. Limit context to most relevant chunks
context_docs = reranked_docs[:3]  # Don't overload

# 2. Reorder by importance
def reorder_for_attention(docs):
    """Place most important at start and end."""
    if len(docs) <= 2:
        return docs
    # Most relevant first, second-most at end
    return [docs[0]] + docs[2:] + [docs[1]]

# 3. Summarize long contexts
if total_tokens(docs) > 2000:
    docs = [summarize(doc) for doc in docs]

Hallucination Sources

Parametric Override

The model's training knowledge overrides the retrieved context:

# Context: "Our company was founded in 2019"
# Model output: "Founded in 2015" (from training data)

# Solution: Explicit grounding instructions
prompt = """Answer ONLY using the provided context.
If the context doesn't contain the answer, say "I don't have this information."

Context: {context}

Question: {question}"""

Context Gaps

Missing information leads to fabrication:

# Detect and handle gaps
def generate_with_confidence(query: str, context: str, llm):
    response = llm.invoke(f"""
    Based on the context, answer the question.
    Rate your confidence (high/medium/low) based on context coverage.

    Context: {context}
    Question: {query}

    Format:
    Answer: [your answer]
    Confidence: [high/medium/low]
    """)

    if "Confidence: low" in response:
        return "I don't have enough information to answer this accurately."
    return response

Debugging Checklist

When RAG quality is poor:

1. Retrieval Quality
   □ Are relevant documents being retrieved?
   □ Is the embedding model appropriate for your domain?
   □ Are chunks the right size?

2. Context Quality
   □ Is irrelevant content polluting context?
   □ Are metadata filters correctly applied?
   □ Is context too long (lost in middle)?

3. Generation Quality
   □ Is the prompt grounding the model to context?
   □ Are confidence thresholds appropriate?
   □ Does the model admit uncertainty?

Quick Diagnostics

def diagnose_rag(query: str, pipeline):
    """Diagnose RAG pipeline issues."""
    # Check retrieval
    docs = pipeline.retrieve(query, k=10)
    print(f"Retrieved {len(docs)} documents")

    # Check relevance scores
    for i, doc in enumerate(docs[:5]):
        print(f"{i+1}. Score: {doc.metadata.get('score', 'N/A')}")
        print(f"   Content: {doc.page_content[:100]}...")

    # Check for duplicates
    contents = [d.page_content for d in docs]
    duplicates = len(contents) - len(set(contents))
    print(f"Duplicate chunks: {duplicates}")

    # Check context length
    total_chars = sum(len(d.page_content) for d in docs[:4])
    print(f"Total context length: {total_chars} chars")

Debugging Principle: When RAG fails, check retrieval first. 80% of quality issues originate in retrieval, not generation.

In the next module, we'll dive deep into embedding models and vector databases, the foundation of effective retrieval.
