Production RAG Systems
Capstone: Build Your Own RAG Over Your Personal Documents
Outcome: By the end of this lesson you will have a deployed RAG service — a FastAPI app that ingests your personal documents (PDFs, markdown, text), retrieves relevant passages via hybrid search with RRF fusion, and answers questions with inline citations back to the source files.
This capstone ties together everything from the last six modules: embedding strategy (Module 2), chunking (Module 3), hybrid search + reranking (Module 4), evaluation (Module 5), and production hardening (Module 6).
What you'll actually ship:
$ curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"q": "What did I say about the Q3 roadmap in my meeting notes?"}'
{
"answer": "According to your notes from 2026-03-14, the Q3 roadmap priorities are: (1) ship MCP integration [cite:1], (2) launch Arabic site [cite:2], and (3) cut infrastructure costs by 30% [cite:1].",
"citations": [
{"id": 1, "source": "meetings/2026-03-14-planning.md", "span": "..."},
{"id": 2, "source": "meetings/2026-03-14-planning.md", "span": "..."}
]
}
The architecture you're building
{
"type": "architecture",
"title": "RAG Service — Data Flow",
"direction": "top-down",
"layers": [
{
"label": "Ingestion (one-time)",
"color": "blue",
"components": [
{ "label": "Documents", "description": "Your PDFs, markdown, and text files" },
{ "label": "Chunker", "description": "Sliding window, 500 tokens, 75 overlap" },
{ "label": "Embedder", "description": "OpenAI text-embedding-3-large (3072d)" },
{ "label": "Supabase chunks table", "description": "HNSW index + GIN tsvector" }
]
},
{
"label": "Query time (every request)",
"color": "amber",
"components": [
{ "label": "FastAPI /ask", "description": "Receives {q, top_k}" },
{ "label": "Hybrid search (SQL)", "description": "Dense cosine + sparse BM25 → RRF fused top-30" },
{ "label": "LLM reranker (Haiku)", "description": "Scores 30 → keeps top-5" },
{ "label": "Grounded generator (Sonnet)", "description": "Answer with [cite:N] inline markers" }
]
},
{
"label": "Output",
"color": "green",
"components": [
{ "label": "JSON response", "description": "{answer, citations[]} — citations map back to source files" }
]
}
]
}
Part 0 — Decide your document set (5 min)
Pick something you actually care about:
- Your Notion/Obsidian export
- Your company's internal wiki (markdown dump)
- A folder of meeting notes
- A personal knowledge garden
- Your blog archive
Aim for 20–200 documents the first time. Too few and RAG isn't interesting; too many and your first ingestion takes too long to debug.
Part 1 — Supabase setup with pgvector (10 min)
1.1 Create a project
- Go to https://supabase.com, sign up (free tier is enough for up to ~50 MB of text content + embeddings)
- Create a new project. Save the Project URL and the service role key for the next step.
1.2 Apply the schema
In Supabase → SQL Editor, paste and run this. It creates a single chunks table with both a vector column (for semantic search) and a tsvector column (for keyword search) — the two halves of hybrid retrieval you learned in Module 4.
-- schema.sql
create extension if not exists vector;

create table chunks (
  id uuid primary key default gen_random_uuid(),
  source_path text not null,        -- e.g. "meetings/2026-03-14-planning.md"
  chunk_index int not null,
  content text not null,
  embedding vector(3072) not null,  -- text-embedding-3-large
  tsv tsvector generated always as (to_tsvector('english', content)) stored,
  created_at timestamptz default now()
);

-- pgvector's HNSW index caps the plain vector type at 2,000 dimensions, so for
-- 3072-dim embeddings index a halfvec cast instead (halfvec HNSW supports up to
-- 4,000 dims in pgvector 0.7+, which Supabase ships).
create index on chunks using hnsw ((embedding::halfvec(3072)) halfvec_cosine_ops);
create index on chunks using gin (tsv);
create index on chunks (source_path);
-- Hybrid search via Reciprocal Rank Fusion (RRF)
create or replace function hybrid_search(
  query_embedding vector(3072),
  query_text text,
  match_count int default 10,
  rrf_k int default 60
)
returns table (id uuid, source_path text, content text, score float)
language sql stable as $$
  with dense as (
    -- cast both sides to halfvec so the comparison can use the halfvec HNSW
    -- index (the plain vector type caps HNSW at 2,000 dims)
    select id, row_number() over (
      order by embedding::halfvec(3072) <=> query_embedding::halfvec(3072)
    ) as rank
    from chunks
    order by embedding::halfvec(3072) <=> query_embedding::halfvec(3072)
    limit match_count * 3
  ),
  sparse as (
    select id, row_number() over (
      order by ts_rank(tsv, plainto_tsquery('english', query_text)) desc
    ) as rank
    from chunks
    where tsv @@ plainto_tsquery('english', query_text)
    -- without this ORDER BY, LIMIT would truncate in arbitrary order
    order by ts_rank(tsv, plainto_tsquery('english', query_text)) desc
    limit match_count * 3
  ),
  fused as (
    select coalesce(d.id, s.id) as id,
           coalesce(1.0 / (rrf_k + d.rank), 0) + coalesce(1.0 / (rrf_k + s.rank), 0) as rrf_score
    from dense d full outer join sparse s on d.id = s.id
  )
  select c.id, c.source_path, c.content, f.rrf_score as score
  from fused f join chunks c on c.id = f.id
  order by f.rrf_score desc
  limit match_count;
$$;
Why this schema: HNSW for fast approximate nearest neighbour, GIN for BM25-style keyword matching, RRF to fuse both rankings into one score. Exactly the pattern from Module 4 Lesson 2.
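The RRF arithmetic inside that SQL is easy to verify in isolation. Here is a standalone sketch of the same fusion rule — 1/(k + rank) summed across rankers, with k = 60 as in the function's `rrf_k` default (illustrative helper, not part of the service):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one scored list via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # each ranker contributes 1/(k + rank); absent docs contribute 0
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense = ["a", "b", "c"]   # e.g. cosine-similarity order
sparse = ["b", "d", "a"]  # e.g. keyword-match order
fused = rrf_fuse([dense, sparse])
# "a" and "b" appear in both lists, so they outrank the single-list hits
```

Note why k = 60 matters: it damps the gap between rank 1 and rank 2, so a document that is merely good in *both* rankings beats one that is excellent in only one.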
Part 2 — Ingestion pipeline (15 min)
Project layout:
my-rag/
├── schema.sql (already applied above)
├── ingest.py
├── retriever.py
├── generator.py
├── main.py
├── requirements.txt
└── .env.example
2.1 requirements.txt
anthropic==0.42.0
openai==1.58.0
fastapi==0.115.6
uvicorn[standard]==0.34.0
supabase==2.10.0
python-dotenv==1.0.1
pypdf==5.1.0
markdown-it-py==3.0.0
tiktoken==0.8.0
2.2 ingest.py — load, chunk, embed, upsert
import glob
import os
import sys
from pathlib import Path

from dotenv import load_dotenv
from openai import OpenAI
from pypdf import PdfReader
from supabase import create_client
import tiktoken

load_dotenv()
openai = OpenAI()
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])
enc = tiktoken.get_encoding("cl100k_base")

CHUNK_TOKENS = 500   # Module 3 Lesson 3: 300-800 works best for most docs
CHUNK_OVERLAP = 75   # 15% overlap keeps context across boundaries


def read_document(path: Path) -> str:
    if path.suffix.lower() == ".pdf":
        return "\n\n".join(p.extract_text() or "" for p in PdfReader(path).pages)
    return path.read_text(encoding="utf-8", errors="ignore")


def chunk(text: str, chunk_tokens: int = CHUNK_TOKENS, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Sliding-window token chunking — simpler than semantic but robust."""
    tokens = enc.encode(text)
    out = []
    i = 0
    while i < len(tokens):
        out.append(enc.decode(tokens[i : i + chunk_tokens]))
        i += chunk_tokens - overlap
    return out


def embed_batch(texts: list[str]) -> list[list[float]]:
    """text-embedding-3-large returns 3072-dim vectors."""
    resp = openai.embeddings.create(model="text-embedding-3-large", input=texts)
    return [d.embedding for d in resp.data]


def ingest(docs_dir: str):
    paths = [
        Path(p) for p in glob.glob(f"{docs_dir}/**/*", recursive=True)
        if Path(p).is_file() and Path(p).suffix.lower() in {".md", ".txt", ".pdf"}
    ]
    print(f"Found {len(paths)} documents")
    for path in paths:
        # One corrupted file shouldn't tank a 200-doc ingest — skip and continue.
        try:
            text = read_document(path)
        except Exception as e:
            print(f"  ✗ {path.name}: skipped ({e})")
            continue
        chunks = chunk(text)
        if not chunks:
            continue
        # Embed in batches of 100 — OpenAI's per-request limit is safely higher
        # but batching reduces latency and keeps token accounting simple.
        for batch_start in range(0, len(chunks), 100):
            batch = chunks[batch_start : batch_start + 100]
            vectors = embed_batch(batch)
            rows = [{
                "source_path": str(path.relative_to(docs_dir)),
                "chunk_index": batch_start + i,
                "content": content,
                "embedding": vec,
            } for i, (content, vec) in enumerate(zip(batch, vectors))]
            supabase.table("chunks").insert(rows).execute()
        print(f"  ✓ {path.relative_to(docs_dir)}: {len(chunks)} chunks")


if __name__ == "__main__":
    ingest(sys.argv[1] if len(sys.argv) > 1 else "./docs")
Run it:
pip install -r requirements.txt
python ingest.py ./path/to/your/docs
You'll see output like ✓ meetings/2026-03-14-planning.md: 8 chunks. If a document fails to parse (corrupted PDF, etc.), the loop continues — one bad file shouldn't tank a 200-doc ingest.
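The stride math in chunk() is worth sanity-checking before you pay for embeddings. The same sliding-window logic on a plain token count — no tiktoken needed for the check (an illustrative helper, not part of the pipeline):

```python
def window_starts(n_tokens: int, chunk_tokens: int = 500, overlap: int = 75) -> list[int]:
    """Start offsets the sliding-window chunker will produce for a document."""
    starts, i = [], 0
    while i < n_tokens:
        starts.append(i)
        i += chunk_tokens - overlap  # stride = 425
    return starts

window_starts(1200)
# → [0, 425, 850]: 3 chunks, each sharing its last 75 tokens with the next
```

So a 1,200-token document becomes three chunks, and any sentence that straddles a boundary appears whole in at least one of them — that is what the overlap buys you.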
Part 3 — Retrieval with reranking (15 min)
3.1 retriever.py — hybrid search + optional rerank
import json
import os
import re

from anthropic import Anthropic
from dotenv import load_dotenv
from openai import OpenAI
from supabase import create_client

load_dotenv()
openai = OpenAI()
anthropic = Anthropic()
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])


def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """
    Hybrid search (dense + sparse) → top-30 fused candidates → LLM rerank → top-K.
    Returns [{id, source_path, content, score, rank}].
    """
    # 1. Embed the query once
    query_vec = openai.embeddings.create(
        model="text-embedding-3-large",
        input=query,
    ).data[0].embedding

    # 2. Hybrid search via our SQL function
    result = supabase.rpc("hybrid_search", {
        "query_embedding": query_vec,
        "query_text": query,
        "match_count": 30,  # oversample so rerank has candidates to choose from
    }).execute()
    candidates = result.data or []
    if not candidates:
        return []

    # 3. Lightweight rerank with a small LLM pass. This is the Module 4 Lesson 3
    #    technique — prompt a cheap model to score relevance 0-10 for each chunk.
    prompt = f"""Score each passage 0-10 for how well it answers the query.
Return JSON only: {{"scores": [int, ...]}} in the same order as input.

QUERY: {query}

PASSAGES:
""" + "\n\n".join(
        f"[{i}] {c['content'][:500]}" for i, c in enumerate(candidates)
    )
    reply = anthropic.messages.create(
        model="claude-haiku-4-6",  # fast and cheap for scoring
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        scores = json.loads(re.search(r"\{.*\}", reply.content[0].text, re.DOTALL).group(0))["scores"]
        if len(scores) != len(candidates):
            raise ValueError("score count doesn't match candidate count")
    except Exception:
        scores = [c.get("score", 0) for c in candidates]  # fall back to RRF order

    # 4. Sort by LLM score, keep top-K
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:top_k]
    return [{**c, "rank": s} for c, s in ranked]
Why rerank when hybrid search is already good? Because Module 4 Lesson 3 showed that LLM reranking meaningfully lifts final-answer quality on ambiguous queries. Haiku costs ~$1/M input tokens — reranking 30 chunks of 500 chars is pennies per query.
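That "pennies per query" claim is simple arithmetic to reproduce, using the rough ~4-chars-per-token heuristic and the ~$1/M input-token price quoted above (both are estimates, not exact figures):

```python
chunks, chars_per_chunk = 30, 500
prompt_tokens = chunks * chars_per_chunk / 4        # ≈ 3,750 tokens of passages
cost_per_query = prompt_tokens / 1_000_000 * 1.00   # at ~$1 per million input tokens
# well under a cent per query, before the (small) query and instruction overhead
```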
Part 4 — Generation with inline citations (10 min)
4.1 generator.py — grounded answers
import os
import re

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

SYSTEM = """You are a retrieval-augmented assistant. Answer ONLY using the provided CONTEXT passages.

Rules:
1. Every factual statement must cite a passage using [cite:N] where N is the passage number.
2. If the context doesn't contain the answer, reply: "I don't have that information in your documents."
3. Do NOT use outside knowledge. If it isn't in the CONTEXT, it isn't true for this user.
4. Copy numbers, dates, and quotes verbatim — never paraphrase them.
"""


def answer(query: str, chunks: list[dict]) -> dict:
    if not chunks:
        return {"answer": "I don't have that information in your documents.", "citations": []}
    context = "\n\n".join(
        f"[{i + 1}] ({c['source_path']}) {c['content']}"
        for i, c in enumerate(chunks)
    )
    reply = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=SYSTEM,
        messages=[{
            "role": "user",
            "content": f"CONTEXT:\n{context}\n\nQUERY: {query}",
        }],
    )
    text = reply.content[0].text
    # Build citations list — only include passages the model actually cited
    cited_ids = {int(n) for n in re.findall(r"\[cite:(\d+)\]", text)}
    citations = [
        {"id": i + 1, "source": c["source_path"], "span": c["content"][:200] + "…"}
        for i, c in enumerate(chunks)
        if (i + 1) in cited_ids
    ]
    return {"answer": text, "citations": citations}
The system prompt is the guardrail. "Do NOT use outside knowledge" plus enforced [cite:N] markers is how Module 5 Lesson 1's "groundedness" metric gets satisfied at runtime. If you want the Module 5 groundedness verifier on top, add a second LLM pass that checks every sentence has a supporting citation.
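Before reaching for a second LLM pass, a regex check catches the cheapest groundedness failures: sentences with no [cite:N] marker at all. A minimal sketch (the sentence splitter is naive about abbreviations, so treat hits as candidates for review, not verdicts):

```python
import re

def uncited_sentences(answer: str) -> list[str]:
    """Return sentences that carry no [cite:N] marker."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if s and not re.search(r"\[cite:\d+\]", s)]

uncited_sentences("Revenue grew 12% [cite:1]. The team doubled.")
# → ["The team doubled."]
```

An empty list doesn't prove groundedness — a sentence can cite the wrong passage — but a non-empty list is a guaranteed violation of rule 1, caught for free.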
Part 5 — FastAPI app (5 min)
5.1 main.py
from fastapi import FastAPI
from pydantic import BaseModel

from generator import answer
from retriever import retrieve

app = FastAPI()


class AskBody(BaseModel):
    q: str
    top_k: int = 5


@app.post("/ask")
def ask(body: AskBody):
    chunks = retrieve(body.q, top_k=body.top_k)
    return answer(body.q, chunks)


@app.get("/health")
def health():
    return {"ok": True}
5.2 .env.example
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
SUPABASE_URL=https://YOURPROJECT.supabase.co
SUPABASE_SERVICE_KEY=eyJ...
5.3 Run it
cp .env.example .env
# Fill in your keys, then:
uvicorn main:app --reload
Hit it:
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"q": "What are the key takeaways from my March meeting notes?"}'
You should get a JSON response with an answer containing [cite:1], [cite:2] markers plus a citations array mapping those markers to the exact source files.
Part 6 — Deploy (optional, 10 min)
The simplest free-tier path is Railway:
- Push your `my-rag/` folder to a GitHub repo
- Go to https://railway.app → New Project → Deploy from GitHub
- Add environment variables (same as `.env`)
- Railway auto-detects `requirements.txt` and FastAPI, runs `uvicorn main:app --host 0.0.0.0 --port $PORT`
Alternative: Fly.io, Render, or Vercel Python serverless. Whatever's in your comfort zone.
Part 7 — Troubleshooting matrix
| Symptom | First check | Typical cause |
|---|---|---|
| `supabase.exceptions.APIError: relation "chunks" does not exist` | SQL editor | Schema didn't apply — rerun schema.sql |
| Ingestion runs but hybrid_search returns nothing | `select count(*) from chunks;` | Ingestion silently failed mid-batch — check OpenAI quota |
| Same passage cited as [cite:3] repeatedly | retriever top_k vs rerank order | Rerank is working but all 5 chunks are from the same source doc — raise top_k |
| LLM returns "I don't have that information" for queries you know are covered | SQL: `select content from chunks where content ilike '%keyword%' limit 3;` | Chunk boundary split the answer mid-sentence — tune CHUNK_TOKENS down or raise CHUNK_OVERLAP |
| 429 from Anthropic during rerank | API dashboard | You're on free tier — rerank fewer chunks or add `time.sleep(0.3)` between batches |
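For the 429 row, a blanket sleep works, but exponential backoff recovers faster once the limiter clears. A small retry wrapper you could put around the rerank call (a sketch — catching the SDK's specific rate-limit exception would be tighter than bare `Exception`):

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 0.5):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries — surface the original error
            # 0.5s, 1s, 2s, 4s... with a little jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Usage: `reply = with_backoff(lambda: anthropic.messages.create(...))`.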
Build checkpoint — finish this before claiming the certificate
- Ship ingestion. Run `python ingest.py ./your-docs`, verify `select count(*) from chunks;` matches expectations.
- Ship retrieval. POST `/ask` with a question where you already know the right answer — confirm the citations point to the correct source files.
- Ship a wrong-on-purpose test. Ask something clearly NOT in your documents (e.g., "What's the capital of Brazil?"). Confirm the model says "I don't have that information."
- Deploy. Get the service reachable from a public URL.
- Screenshot the answer + citation JSON as your proof of work.
You now have a RAG system you built, running on your own documents, with citations you can audit. Everything from the last six modules just landed in production.
Next: 06-next-steps — the real next steps: extending this system with evaluation (Module 5 RAGAS), observability, and multi-user access control.