Build a RAG-Powered Conversational Agent
Instructions
In this lab, you'll build a complete RAG-powered conversational agent in Python. Your agent will ingest documents with configurable chunking, perform vector similarity search, decompose complex queries, manage conversation memory, allocate context window budgets, attribute claims to sources, and guard against hallucinations.
This covers the full lifecycle of a production RAG agent — the kind of system you'd be asked to design or implement in an agent engineer interview.
Architecture Overview
Documents → Chunking → Embeddings → Vector Store
↑
User Query → Memory Context |
↓ |
Query Decomposition (Agentic RAG) |
↓ |
Vector Search ←─────────────────────────┘
↓
Context Budget Allocation
↓
LLM Generation (with source attribution)
↓
Hallucination Guard
↓
Response to User (with citations)
Step 1: Document Store (document_store.py)
Build a DocumentStore class that ingests text documents and splits them into chunks using configurable strategies:
- Fixed-size chunking: Split text every N characters with configurable overlap
- Semantic chunking: Group consecutive sentences by embedding similarity — start a new chunk when the similarity between consecutive sentences drops below a threshold
- Recursive chunking: Split by paragraphs first, then by sentences if a paragraph exceeds the max chunk size, then by character count as a last resort
Each chunk should carry metadata: chunk_id, document_id, source_filename, chunk_index, strategy_used.
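As a starting point, the fixed-size strategy can be sketched as follows. The Chunk dataclass and fixed_size_chunks names are illustrative, not required by the rubric; your DocumentStore can organize this however you like.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    document_id: str
    source_filename: str
    chunk_index: int
    strategy_used: str
    text: str

def fixed_size_chunks(text, document_id, source_filename, size=200, overlap=50):
    """Split text every `size` characters, overlapping consecutive chunks."""
    chunks = []
    step = size - overlap  # how far the window advances each iteration
    for i, start in enumerate(range(0, len(text), step)):
        chunks.append(Chunk(
            chunk_id=f"{document_id}-{i}",
            document_id=document_id,
            source_filename=source_filename,
            chunk_index=i,
            strategy_used="fixed",
            text=text[start:start + size],
        ))
    return chunks
```

Note that each chunk's last `overlap` characters reappear at the start of the next chunk, which helps retrieval when a relevant passage straddles a boundary.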
Step 2: Vector Search (vector_search.py)
Build a VectorSearch class that provides similarity search over document chunks:
- Accept an embedding function (callable that takes text and returns a list of floats)
- Index chunks by computing and storing their embeddings
- Search by computing the query embedding and finding the top-K most similar chunks using cosine similarity
- Support metadata filtering (e.g., filter by source_filename or document_id)
- Return results with similarity scores
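A minimal pure-Python version of the similarity search (no numpy) might look like this; the (text, embedding) pair format for the index is just one possible shape:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, indexed, k=3):
    """indexed: list of (chunk_text, embedding) pairs; returns (score, text) pairs."""
    scored = [(cosine(query_vec, emb), text) for text, emb in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Metadata filtering fits naturally as a pre-filter over the indexed list before scoring.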
Step 3: Agentic RAG (agentic_rag.py)
Build an AgenticRAG class that goes beyond naive retrieve-and-generate:
- Query decomposition: Analyze a complex question and break it into simpler sub-queries that can each be searched independently
- Multi-query retrieval: Execute each sub-query against the vector store and merge results, removing duplicates
- Self-reflection: After retrieval, evaluate whether the retrieved chunks contain sufficient information to answer the original question. Return a quality assessment (sufficient / partial / insufficient) and optionally suggest a reformulated query
Step 4: Memory Manager (memory_manager.py)
Build a MemoryManager class that handles conversation memory:
- Short-term buffer: Store the last N conversation turns (user messages + agent responses) in a list
- Long-term summarization: When the buffer exceeds a configurable limit, compress older turns into a running summary using an LLM call (accept a summarization function as a parameter)
- Memory retrieval: Return the combined context — summary (if any) plus recent turns — formatted for injection into the LLM prompt
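A minimal sketch of the buffer-plus-summary pattern follows. The summarize_fn signature shown here (old summary plus evicted turns in, new summary out) is an assumption; any callable that folds evicted turns into a running summary works.

```python
class MemoryBuffer:
    def __init__(self, max_turns, summarize_fn):
        self.max_turns = max_turns
        self.summarize_fn = summarize_fn  # (old_summary, evicted_turns) -> str
        self.turns = []    # list of (user_message, agent_response)
        self.summary = ""

    def add_turn(self, user, agent):
        self.turns.append((user, agent))
        if len(self.turns) > self.max_turns:
            # Compress the overflow into the running summary
            evicted = self.turns[:-self.max_turns]
            self.turns = self.turns[-self.max_turns:]
            self.summary = self.summarize_fn(self.summary, evicted)

    def context(self):
        """Summary (if any) plus recent turns, formatted for the LLM prompt."""
        lines = []
        if self.summary:
            lines.append(f"Summary: {self.summary}")
        for user, agent in self.turns:
            lines.append(f"User: {user}")
            lines.append(f"Agent: {agent}")
        return "\n".join(lines)
```

In tests you can pass a trivial summarize_fn; in production it would be an LLM call.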
Step 5: Context Budget Manager (context_budget.py)
Build a ContextBudgetManager class that allocates a finite token budget across competing components:
- Accept a total token budget and allocation percentages for: system prompt, conversation memory, retrieved documents, and output reserve
- Accept a token counting function (callable that takes text and returns an integer)
- An allocate() method: given the system prompt, memory context, and retrieved chunks, trim each component to fit within its allocated budget. Trim retrieved chunks by removing the lowest-scored ones first; trim memory by summarizing or truncating from the oldest turns
- Return the final assembled context with each component within budget
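The chunk-trimming rule (drop the lowest-scored chunks until the budget fits) can be sketched as:

```python
def trim_chunks(scored_chunks, budget, count_tokens):
    """scored_chunks: list of (score, text) pairs. Drop lowest-scored chunks
    until the total token count fits within budget."""
    kept = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    while kept and sum(count_tokens(text) for _, text in kept) > budget:
        kept.pop()  # lowest-scored chunk is last after the sort
    return kept
```

The same shape applies to the memory component, except trimming proceeds from the oldest turns instead of the lowest scores.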
Step 6: Source Attribution (source_attribution.py)
Build a SourceAttributor class that links claims in the agent's response to source chunks:
- Accept the agent's response text and the list of source chunks that were used
- Split the response into individual claims (sentence-level)
- For each claim, compute similarity against all source chunks and assign the best-matching chunk as the source
- Assign a confidence score (the similarity value) to each attribution
- Return a list of Attribution objects: claim, source_chunk_id, source_document, confidence, relevant_excerpt
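A sentence-level attribution pass might look like the sketch below. The dict result and the pluggable similarity callable are illustrative choices; a real implementation would typically use embedding-based similarity rather than the token-overlap function used in the test.

```python
def attribute_claims(response, chunks, similarity):
    """Split a response into sentence-level claims and link each to its best chunk.

    chunks: list of (chunk_id, text) pairs.
    similarity: (claim, chunk_text) -> float, e.g. embedding cosine similarity.
    """
    claims = [s.strip() for s in response.split(". ") if s.strip()]
    attributions = []
    for claim in claims:
        best_id, best_score = None, -1.0
        for chunk_id, text in chunks:
            score = similarity(claim, text)
            if score > best_score:
                best_id, best_score = chunk_id, score
        attributions.append({"claim": claim,
                             "source_chunk_id": best_id,
                             "confidence": best_score})
    return attributions
```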
Step 7: Hallucination Guard (hallucination_guard.py)
Build a HallucinationGuard class that cross-references the agent's answer against retrieved content:
- Accept the agent's response and the retrieved source chunks
- Split the response into individual claims
- For each claim, check if it is supported by any source chunk (similarity above a configurable threshold)
- Classify each claim as: supported, partially_supported, or unsupported
- Return a GuardResult with: the list of classified claims, an overall trust score (ratio of supported claims), and a boolean is_safe flag (true if the trust score exceeds a configurable safety threshold)
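A sketch of the guard logic, using the same pluggable similarity callable as earlier steps. The cutoff for partially_supported (half the support threshold) is an arbitrary choice for illustration; pick and document your own.

```python
def guard(response, chunk_texts, similarity,
          support_threshold=0.7, safety_threshold=0.8):
    """Classify each sentence-level claim and compute an overall trust score."""
    claims = [s.strip() for s in response.split(". ") if s.strip()]
    classified = []
    supported = 0
    for claim in claims:
        # Best support found across all retrieved chunks
        best = max((similarity(claim, text) for text in chunk_texts), default=0.0)
        if best >= support_threshold:
            label = "supported"
            supported += 1
        elif best >= support_threshold / 2:  # illustrative midpoint cutoff
            label = "partially_supported"
        else:
            label = "unsupported"
        classified.append((claim, label))
    trust_score = supported / len(claims) if claims else 0.0
    return {"claims": classified, "trust_score": trust_score,
            "is_safe": trust_score >= safety_threshold}
```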
What to Submit
The editor has 7 file sections with TODO comments. Replace each TODO with your Python code. The AI grader will evaluate each section against the rubric.
Hints
- For cosine similarity, use the dot product divided by the product of magnitudes, or use numpy if you prefer
- For sentence splitting, splitting on ". " (period + space) is a reasonable starting point
- For the embedding function parameter, design your code to accept any callable with signature (str) -> List[float] — this makes it testable with mock embeddings
- For token counting, accept any callable with signature (str) -> int — a simple approximation is len(text.split())
- The hallucination guard threshold is a design choice — 0.7 is a reasonable default for cosine similarity