Production RAG Systems
Capstone: Build Your Own RAG Over Your Personal Documents
Outcome: By the end of this lesson you will have a deployed RAG service — a FastAPI app that ingests your personal documents (PDFs, markdown, text), retrieves relevant passages via hybrid search with RRF fusion, and answers questions with inline citations back to the source files.
This capstone ties together everything from the last six modules: embedding strategy (Module 2), chunking (Module 3), hybrid search + reranking (Module 4), evaluation (Module 5), and production hardening (Module 6).
What you'll actually ship:
$ curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"q": "What did I say about the Q3 roadmap in my meeting notes?"}'
{
"answer": "According to your notes from 2026-03-14, the Q3 roadmap priorities are: (1) ship MCP integration [cite:1], (2) launch Arabic site [cite:2], and (3) cut infrastructure costs by 30% [cite:1].",
"citations": [
{"id": 1, "source": "meetings/2026-03-14-planning.md", "span": "..."},
{"id": 2, "source": "meetings/2026-03-14-planning.md", "span": "..."}
]
}
The architecture you're building
{
"type": "architecture",
"title": "RAG Service — Data Flow",
"direction": "top-down",
"layers": [
{
"label": "Ingestion (one-time)",
"color": "blue",
"components": [
{ "label": "Documents", "description": "Your PDFs, markdown, and text files" },
{ "label": "Chunker", "description": "Sliding window, 500 tokens, 75 overlap" },
{ "label": "Embedder", "description": "OpenAI text-embedding-3-large (3072d)" },
{ "label": "Supabase chunks table", "description": "HNSW index + GIN tsvector" }
]
},
{
"label": "Query time (every request)",
"color": "amber",
"components": [
{ "label": "FastAPI /ask", "description": "Receives {q, top_k}" },
{ "label": "Hybrid search (SQL)", "description": "Dense cosine + sparse BM25 → RRF fused top-30" },
{ "label": "LLM reranker (Haiku)", "description": "Scores 30 → keeps top-5" },
{ "label": "Grounded generator (Sonnet)", "description": "Answer with [cite:N] inline markers" }
]
},
{
"label": "Output",
"color": "green",
"components": [
{ "label": "JSON response", "description": "{answer, citations[]} — citations map back to source files" }
]
}
]
}
Part 0 — Decide your document set (5 min)
Pick something you actually care about:
- Your Notion/Obsidian export
- Your company's internal wiki (markdown dump)
- A folder of meeting notes
- A personal knowledge garden
- Your blog archive
Aim for 20–200 documents the first time. Too few and RAG isn't interesting; too many and your first ingestion takes too long to debug.
Part 1 — Supabase setup with pgvector (10 min)
1.1 Create a project
- Go to https://supabase.com, sign up (free tier is enough for up to ~50 MB of text content + embeddings)
- Create a new project. Save the Project URL and the service role key for the next step.
1.2 Apply the schema
In Supabase → SQL Editor, paste and run this. It creates a single chunks table with both a vector column (for semantic search) and a tsvector column (for keyword search) — the two halves of hybrid retrieval you learned in Module 4.
-- schema.sql
create extension if not exists vector;

create table chunks (
  id uuid primary key default gen_random_uuid(),
  source_path text not null,        -- e.g. "meetings/2026-03-14-planning.md"
  chunk_index int not null,
  content text not null,
  embedding vector(3072) not null,  -- text-embedding-3-large
  tsv tsvector generated always as (to_tsvector('english', content)) stored,
  created_at timestamptz default now()
);

-- pgvector's HNSW index caps the plain vector type at 2,000 dimensions, so for
-- 3072-dim embeddings index a halfvec cast instead (halfvec HNSW supports up to
-- 4,000 dims in pgvector 0.7+, which Supabase ships).
create index on chunks using hnsw ((embedding::halfvec(3072)) halfvec_cosine_ops);
create index on chunks using gin (tsv);
create index on chunks (source_path);
-- Hybrid search via Reciprocal Rank Fusion (RRF)
create or replace function hybrid_search(
  query_embedding vector(3072),
  query_text text,
  match_count int default 10,
  rrf_k int default 60
)
returns table (id uuid, source_path text, content text, score float)
language sql stable as $$
  with dense as (
    -- cast both sides to halfvec so the comparison can use the halfvec HNSW
    -- index (the plain vector type caps HNSW at 2,000 dims)
    select id, row_number() over (
      order by embedding::halfvec(3072) <=> query_embedding::halfvec(3072)
    ) as rank
    from chunks
    order by embedding::halfvec(3072) <=> query_embedding::halfvec(3072)
    limit match_count * 3
  ),
  sparse as (
    select id, row_number() over (
      order by ts_rank(tsv, plainto_tsquery('english', query_text)) desc
    ) as rank
    from chunks
    where tsv @@ plainto_tsquery('english', query_text)
    -- without this ORDER BY, LIMIT would truncate in arbitrary order
    order by ts_rank(tsv, plainto_tsquery('english', query_text)) desc
    limit match_count * 3
  ),
  fused as (
    select coalesce(d.id, s.id) as id,
           coalesce(1.0 / (rrf_k + d.rank), 0) + coalesce(1.0 / (rrf_k + s.rank), 0) as rrf_score
    from dense d full outer join sparse s on d.id = s.id
  )
  select c.id, c.source_path, c.content, f.rrf_score as score
  from fused f join chunks c on c.id = f.id
  order by f.rrf_score desc
  limit match_count;
$$;
Why this schema: HNSW for fast approximate nearest neighbour, GIN for BM25-style keyword matching, RRF to fuse both rankings into one score. Exactly the pattern from Module 4 Lesson 2.
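The RRF arithmetic inside that SQL is easy to verify in isolation. Here is a standalone sketch of the same fusion rule — 1/(k + rank) summed across rankers, with k = 60 as in the function's `rrf_k` default (illustrative helper, not part of the service):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one scored list via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # each ranker contributes 1/(k + rank); absent docs contribute 0
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense = ["a", "b", "c"]   # e.g. cosine-similarity order
sparse = ["b", "d", "a"]  # e.g. keyword-match order
fused = rrf_fuse([dense, sparse])
# "a" and "b" appear in both lists, so they outrank the single-list hits
```

Note why k = 60 matters: it damps the gap between rank 1 and rank 2, so a document that is merely good in *both* rankings beats one that is excellent in only one.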
Part 2 — Ingestion pipeline (15 min)
Project layout:
my-rag/
├── schema.sql (already applied above)
├── ingest.py
├── retriever.py
├── generator.py
├── main.py
├── requirements.txt
└── .env.example
2.1 requirements.txt
anthropic==0.42.0
openai==1.58.0
fastapi==0.115.6
uvicorn[standard]==0.34.0
supabase==2.10.0
python-dotenv==1.0.1
pypdf==5.1.0
markdown-it-py==3.0.0
tiktoken==0.8.0
2.2 ingest.py — load, chunk, embed, upsert
import glob
import os
import sys
from pathlib import Path

from dotenv import load_dotenv
from openai import OpenAI
from pypdf import PdfReader
from supabase import create_client
import tiktoken

load_dotenv()
openai = OpenAI()
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])
enc = tiktoken.get_encoding("cl100k_base")

CHUNK_TOKENS = 500   # Module 3 Lesson 3: 300-800 works best for most docs
CHUNK_OVERLAP = 75   # 15% overlap keeps context across boundaries


def read_document(path: Path) -> str:
    if path.suffix.lower() == ".pdf":
        return "\n\n".join(p.extract_text() or "" for p in PdfReader(path).pages)
    return path.read_text(encoding="utf-8", errors="ignore")


def chunk(text: str, chunk_tokens: int = CHUNK_TOKENS, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Sliding-window token chunking — simpler than semantic but robust."""
    tokens = enc.encode(text)
    out = []
    i = 0
    while i < len(tokens):
        out.append(enc.decode(tokens[i : i + chunk_tokens]))
        i += chunk_tokens - overlap
    return out


def embed_batch(texts: list[str]) -> list[list[float]]:
    """text-embedding-3-large returns 3072-dim vectors."""
    resp = openai.embeddings.create(model="text-embedding-3-large", input=texts)
    return [d.embedding for d in resp.data]


def ingest(docs_dir: str):
    paths = [
        Path(p) for p in glob.glob(f"{docs_dir}/**/*", recursive=True)
        if Path(p).is_file() and Path(p).suffix.lower() in {".md", ".txt", ".pdf"}
    ]
    print(f"Found {len(paths)} documents")
    for path in paths:
        # One corrupted file shouldn't tank a 200-doc ingest — skip and continue.
        try:
            text = read_document(path)
        except Exception as e:
            print(f"  ✗ {path.name}: skipped ({e})")
            continue
        chunks = chunk(text)
        if not chunks:
            continue
        # Embed in batches of 100 — OpenAI's per-request limit is safely higher
        # but batching reduces latency and keeps token accounting simple.
        for batch_start in range(0, len(chunks), 100):
            batch = chunks[batch_start : batch_start + 100]
            vectors = embed_batch(batch)
            rows = [{
                "source_path": str(path.relative_to(docs_dir)),
                "chunk_index": batch_start + i,
                "content": content,
                "embedding": vec,
            } for i, (content, vec) in enumerate(zip(batch, vectors))]
            supabase.table("chunks").insert(rows).execute()
        print(f"  ✓ {path.relative_to(docs_dir)}: {len(chunks)} chunks")


if __name__ == "__main__":
    ingest(sys.argv[1] if len(sys.argv) > 1 else "./docs")
Run it:
pip install -r requirements.txt
python ingest.py ./path/to/your/docs
You'll see output like ✓ meetings/2026-03-14-planning.md: 8 chunks. If a document fails to parse (corrupted PDF, etc.), the loop continues — one bad file shouldn't tank a 200-doc ingest.
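The stride math in chunk() is worth sanity-checking before you pay for embeddings. The same sliding-window logic on a plain token count — no tiktoken needed for the check (an illustrative helper, not part of the pipeline):

```python
def window_starts(n_tokens: int, chunk_tokens: int = 500, overlap: int = 75) -> list[int]:
    """Start offsets the sliding-window chunker will produce for a document."""
    starts, i = [], 0
    while i < n_tokens:
        starts.append(i)
        i += chunk_tokens - overlap  # stride = 425
    return starts

window_starts(1200)
# → [0, 425, 850]: 3 chunks, each sharing its last 75 tokens with the next
```

So a 1,200-token document becomes three chunks, and any sentence that straddles a boundary appears whole in at least one of them — that is what the overlap buys you.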
Part 3 — Retrieval with reranking (15 min)
3.1 retriever.py — hybrid search + optional rerank
import json
import os
import re

from anthropic import Anthropic
from dotenv import load_dotenv
from openai import OpenAI
from supabase import create_client

load_dotenv()
openai = OpenAI()
anthropic = Anthropic()
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])


def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """
    Hybrid search (dense + sparse) → top-30 fused candidates → LLM rerank → top-K.
    Returns [{id, source_path, content, score, rank}].
    """
    # 1. Embed the query once
    query_vec = openai.embeddings.create(
        model="text-embedding-3-large",
        input=query,
    ).data[0].embedding

    # 2. Hybrid search via our SQL function
    result = supabase.rpc("hybrid_search", {
        "query_embedding": query_vec,
        "query_text": query,
        "match_count": 30,  # oversample so rerank has candidates to choose from
    }).execute()
    candidates = result.data or []
    if not candidates:
        return []

    # 3. Lightweight rerank with a small LLM pass. This is the Module 4 Lesson 3
    #    technique — prompt a cheap model to score relevance 0-10 for each chunk.
    prompt = f"""Score each passage 0-10 for how well it answers the query.
Return JSON only: {{"scores": [int, ...]}} in the same order as input.

QUERY: {query}

PASSAGES:
""" + "\n\n".join(
        f"[{i}] {c['content'][:500]}" for i, c in enumerate(candidates)
    )
    reply = anthropic.messages.create(
        model="claude-haiku-4-6",  # fast and cheap for scoring
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        scores = json.loads(re.search(r"\{.*\}", reply.content[0].text, re.DOTALL).group(0))["scores"]
        if len(scores) != len(candidates):
            raise ValueError("score count doesn't match candidate count")
    except Exception:
        scores = [c.get("score", 0) for c in candidates]  # fall back to RRF order

    # 4. Sort by LLM score, keep top-K
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:top_k]
    return [{**c, "rank": s} for c, s in ranked]
Why rerank when hybrid search is already good? Because Module 4 Lesson 3 showed that LLM reranking meaningfully lifts final-answer quality on ambiguous queries. Haiku costs ~$1/M input tokens — reranking 30 chunks of 500 chars is pennies per query.
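That "pennies per query" claim is simple arithmetic to reproduce, using the rough ~4-chars-per-token heuristic and the ~$1/M input-token price quoted above (both are estimates, not exact figures):

```python
chunks, chars_per_chunk = 30, 500
prompt_tokens = chunks * chars_per_chunk / 4        # ≈ 3,750 tokens of passages
cost_per_query = prompt_tokens / 1_000_000 * 1.00   # at ~$1 per million input tokens
# well under a cent per query, before the (small) query and instruction overhead
```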
Part 4 — Generation with inline citations (10 min)
4.1 generator.py — grounded answers
import os
import re

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

SYSTEM = """You are a retrieval-augmented assistant. Answer ONLY using the provided CONTEXT passages.

Rules:
1. Every factual statement must cite a passage using [cite:N] where N is the passage number.
2. If the context doesn't contain the answer, reply: "I don't have that information in your documents."
3. Do NOT use outside knowledge. If it isn't in the CONTEXT, it isn't true for this user.
4. Copy numbers, dates, and quotes verbatim — never paraphrase them.
"""


def answer(query: str, chunks: list[dict]) -> dict:
    if not chunks:
        return {"answer": "I don't have that information in your documents.", "citations": []}
    context = "\n\n".join(
        f"[{i + 1}] ({c['source_path']}) {c['content']}"
        for i, c in enumerate(chunks)
    )
    reply = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=SYSTEM,
        messages=[{
            "role": "user",
            "content": f"CONTEXT:\n{context}\n\nQUERY: {query}",
        }],
    )
    text = reply.content[0].text
    # Build citations list — only include passages the model actually cited
    cited_ids = {int(n) for n in re.findall(r"\[cite:(\d+)\]", text)}
    citations = [
        {"id": i + 1, "source": c["source_path"], "span": c["content"][:200] + "…"}
        for i, c in enumerate(chunks)
        if (i + 1) in cited_ids
    ]
    return {"answer": text, "citations": citations}
The system prompt is the guardrail. "Do NOT use outside knowledge" plus enforced [cite:N] markers is how Module 5 Lesson 1's "groundedness" metric gets satisfied at runtime. If you want the Module 5 groundedness verifier on top, add a second LLM pass that checks every sentence has a supporting citation.
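Before reaching for a second LLM pass, a regex check catches the cheapest groundedness failures: sentences with no [cite:N] marker at all. A minimal sketch (the sentence splitter is naive about abbreviations, so treat hits as candidates for review, not verdicts):

```python
import re

def uncited_sentences(answer: str) -> list[str]:
    """Return sentences that carry no [cite:N] marker."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if s and not re.search(r"\[cite:\d+\]", s)]

uncited_sentences("Revenue grew 12% [cite:1]. The team doubled.")
# → ["The team doubled."]
```

An empty list doesn't prove groundedness — a sentence can cite the wrong passage — but a non-empty list is a guaranteed violation of rule 1, caught for free.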
Part 5 — FastAPI app (5 min)
5.1 main.py
from fastapi import FastAPI
from pydantic import BaseModel

from generator import answer
from retriever import retrieve

app = FastAPI()


class AskBody(BaseModel):
    q: str
    top_k: int = 5


@app.post("/ask")
def ask(body: AskBody):
    chunks = retrieve(body.q, top_k=body.top_k)
    return answer(body.q, chunks)


@app.get("/health")
def health():
    return {"ok": True}
5.2 .env.example
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
SUPABASE_URL=https://YOURPROJECT.supabase.co
SUPABASE_SERVICE_KEY=eyJ...
5.3 Run it
cp .env.example .env
# Fill in your keys, then:
uvicorn main:app --reload
Hit it:
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"q": "What are the key takeaways from my March meeting notes?"}'
You should get a JSON response with an answer containing [cite:1], [cite:2] markers plus a citations array mapping those markers to the exact source files.
Part 6 — Deploy (optional, 10 min)
The simplest free-tier path is Railway:
- Push your `my-rag/` folder to a GitHub repo
- Go to https://railway.app → New Project → Deploy from GitHub
- Add environment variables (same as `.env`)
- Railway auto-detects `requirements.txt` and FastAPI, runs `uvicorn main:app --host 0.0.0.0 --port $PORT`
Alternative: Fly.io, Render, or Vercel Python serverless. Whatever's in your comfort zone.
Part 7 — Troubleshooting matrix
| Symptom | First check | Typical cause |
|---|---|---|
| `supabase.exceptions.APIError: relation "chunks" does not exist` | SQL editor | Schema didn't apply — rerun schema.sql |
| Ingestion runs but hybrid_search returns nothing | `select count(*) from chunks;` | Ingestion silently failed mid-batch — check OpenAI quota |
| Same passage cited as [cite:3] repeatedly | retriever top_k vs rerank order | Rerank is working but all 5 chunks are from the same source doc — raise top_k |
| LLM returns "I don't have that information" for queries you know are covered | SQL: `select content from chunks where content ilike '%keyword%' limit 3;` | Chunk boundary split the answer mid-sentence — tune CHUNK_TOKENS down or raise CHUNK_OVERLAP |
| 429 from Anthropic during rerank | API dashboard | You're on free tier — rerank fewer chunks or add `time.sleep(0.3)` between batches |
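For the 429 row, a blanket sleep works, but exponential backoff recovers faster once the limiter clears. A small retry wrapper you could put around the rerank call (a sketch — catching the SDK's specific rate-limit exception would be tighter than bare `Exception`):

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 0.5):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries — surface the original error
            # 0.5s, 1s, 2s, 4s... with a little jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Usage: `reply = with_backoff(lambda: anthropic.messages.create(...))`.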
Build checkpoint — finish this before claiming the certificate
- Ship ingestion. Run `python ingest.py ./your-docs`, verify `select count(*) from chunks;` matches expectations.
- Ship retrieval. POST `/ask` with a question where you already know the right answer — confirm the citations point to the correct source files.
- Ship a wrong-on-purpose test. Ask something clearly NOT in your documents (e.g., "What's the capital of Brazil?"). Confirm the model says "I don't have that information."
- Deploy. Get the service reachable from a public URL.
- Screenshot the answer + citation JSON as your proof of work.
You now have a RAG system you built, running on your own documents, with citations you can audit. Everything from the last six modules just landed in production.
Next: 06-next-steps — the real next steps: extending this system with evaluation (Module 5 RAGAS), observability, and multi-user access control.