RAG Architecture Deep Dive
Beyond Basic RAG
The outcome first — what you'll build in the capstone
Before we talk theory, here's what the end of this course looks like:
```console
$ curl -X POST http://localhost:8000/ask \
    -H "Content-Type: application/json" \
    -d '{"q": "What did I decide about Q3 in my March planning notes?"}'
{
  "answer": "Per your 2026-03-14 notes, Q3 priorities are: (1) ship MCP integration [cite:1], (2) launch the Arabic site [cite:2], (3) cut infra 30% [cite:1].",
  "citations": [
    {"id": 1, "source": "meetings/2026-03-14-planning.md"},
    {"id": 2, "source": "meetings/2026-03-14-planning.md"}
  ]
}
```
Every improvement you learn across the next six modules — better chunking, hybrid search, reranking, RAGAS evaluation, monitoring — makes this service answer more accurately over your actual documents. Not a toy.
Why "basic RAG" fails in production
The naive recipe looks elegant:
```python
def naive_rag(query: str):
    # `vectorstore` and `llm` are assumed to be initialized elsewhere
    # (e.g. a LangChain vector store and chat model).
    docs = vectorstore.similarity_search(query, k=4)
    context = "\n".join(d.page_content for d in docs)
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")
```
Run this on 200 real documents and you'll hit all four canonical failures within the first hour:
- Query–document mismatch. Users phrase questions in natural language ("What did I decide about Q3?"). Documents are written in declarative prose ("Q3 priorities: MCP integration, Arabic launch, 30% infra cut"). Cosine similarity between the two can be surprisingly low (see the sketch after this list).
- Irrelevant chunks pollute context. If 2 of your 4 retrieved chunks are about Q2 (not Q3), the LLM often prefers the majority and confabulates.
- No verification of retrieval quality. Basic RAG returns the top-k whether or not any of them are actually relevant. "What's the capital of Brazil?" over a docs folder about your company still returns 4 company docs — the LLM then either hallucinates or awkwardly says "I don't know."
- Fixed retrieval regardless of complexity. "Who owns the payments service?" needs 1 chunk. "Compare our Q2 and Q3 priorities across engineering and marketing" needs 8 chunks from 3 different docs. Top-k=4 is wrong for both.
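You can watch the first two failures happen with nothing but an embedding model. A minimal sketch, assuming the sentence-transformers package; the model choice and example strings are illustrative, not course requirements:

```python
# Query–document mismatch in miniature: score a natural-language question
# against the declarative chunk that answers it and an off-topic chunk.
# Assumes `pip install sentence-transformers`; the model choice is arbitrary.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "What did I decide about Q3?"
relevant = "Q3 priorities: MCP integration, Arabic launch, 30% infra cut"
off_topic = "Q2 retrospective: hiring pipeline and onboarding improvements"

q_vec = model.encode(query)
print(float(util.cos_sim(q_vec, model.encode(relevant))))   # the right chunk
print(float(util.cos_sim(q_vec, model.encode(off_topic))))  # the wrong chunk
# The two scores are often closer than you'd hope, which is exactly how
# Q2 chunks end up in the context window for a Q3 question.
```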
The three generations of RAG
| Generation | What it adds | When to use it |
|---|---|---|
| Naive RAG | Embedding search + LLM call | Demos, prototypes, < 50 documents |
| Advanced RAG | Query rewriting, hybrid search, reranking, grounded generation | Production with 100–100K documents — this course |
| Agentic RAG | Multi-step retrieval, self-correction, tool use, adaptive k | Complex analytical queries, cross-domain research agents |
The honest truth most tutorials skip: Agentic RAG usually isn't what you need. Most business questions ("what's our refund policy," "what did the customer say about X") are answered well by properly tuned Advanced RAG at a fraction of the latency and cost.
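One of those Advanced RAG additions, grounded generation, is simple enough to sketch now. The helper below is an illustration, not a library API or the exact prompt this course will use: it numbers each retrieved chunk so the model can only cite sources it was actually given, which is the same shape of prompt that produces the [cite:N] markers in the capstone response above.

```python
# Grounded generation sketch: number the retrieved chunks and instruct the
# model to cite them, so every claim is traceable to a source file.
# `build_grounded_prompt` and the prompt wording are illustrative assumptions.
def build_grounded_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    sources = "\n".join(
        f"[{i}] ({path}) {text}" for i, (path, text) in enumerate(chunks, 1)
    )
    return (
        "Answer using ONLY the numbered sources below, citing them as [cite:N]. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

chunks = [("meetings/2026-03-14-planning.md",
           "Q3 priorities: MCP integration, Arabic launch, 30% infra cut")]
print(build_grounded_prompt("What did I decide about Q3?", chunks))
```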
This course's focus
You're going to build Advanced RAG — the production-ready middle. Module by module:
| Module | What you learn | What you ship at the end of the module |
|---|---|---|
| 1 (this one) | Why naive fails, RAG vs fine-tune, pipeline, failure modes | A mental model of where quality actually lives |
| 2 | Embedding choice + vector DB choice | Indexed corpus on Supabase pgvector with 3072-dim embeddings |
| 3 | Chunking that preserves meaning | Your corpus rechunked so answers land on clean boundaries |
| 4 | Hybrid search + reranking | A hybrid retriever with BM25 + vector fusion + LLM rerank (fusion step sketched below) |
| 5 | RAGAS evaluation + test datasets | A test set + numeric scores for your system |
| 6 | Production hardening + capstone | The full FastAPI RAG service, deployed, with citations |
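Before you reach Module 4, it's worth seeing that the "fusion" step is tiny. Here's a minimal sketch of reciprocal rank fusion (RRF), one common way to merge BM25 and vector rankings; the function name and toy doc IDs are made up for illustration, and the course's fusion method may differ:

```python
# Reciprocal rank fusion: merge ranked lists by summing 1/(k + rank).
# Documents that rank well in BOTH lists float to the top.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]       # keyword ranking
vector_hits = ["doc1", "doc9", "doc3"]     # embedding ranking
print(rrf_fuse([bm25_hits, vector_hits]))  # ['doc1', 'doc3', ...]
```

The constant k=60 comes from the original RRF paper and usually works untouched; the effect is that a document ranked decently in both lists beats one that tops only a single list.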
Key insight
Most RAG failures aren't model problems — they're retrieval problems. Master retrieval, and answer quality follows.
Swap Claude Sonnet 4.6 for GPT-5 and your answer quality barely moves. Swap your naive top-k retrieval for hybrid search with reranking, and your answers can go from roughly 60% helpful to 90%.
Build checkpoint — do this before the next lesson
You don't need to code yet, but you DO need to pick:
- What corpus will you RAG over? Your Notion export? Your company wiki? A folder of meeting notes? Pick now, not later. Roughly 20–200 documents is the sweet spot for the first build.
- Where do those documents live? If they're scattered across Slack/email/etc., spend 15 minutes exporting a clean folder of `.md`/`.pdf` files. The quality of your RAG is bounded by what you feed it.
- Write down 5 questions you'd actually want answered from that corpus. You'll use these to evaluate every module's improvement; having a ground truth matters more than any technique you'll learn. (A starter sketch follows this list.)
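A hypothetical starter for that third item, so your five questions live somewhere every later module can be scored against; the file name and schema are suggestions, not course requirements:

```python
# Save your ground-truth questions now; Modules 2-6 will each be judged
# against this same file. Path and schema are illustrative.
import json

eval_set = [
    {"question": "What did I decide about Q3 in my March planning notes?",
     "expected_source": "meetings/2026-03-14-planning.md"},
    # ...add your other four questions and expected sources here
]

with open("eval_questions.json", "w") as f:
    json.dump(eval_set, f, indent=2, ensure_ascii=False)
```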
Next: RAG vs Fine-tuning — when each wins, and when the right answer is both.