RAG System Design
RAG Architecture Overview
4 min read
Retrieval-Augmented Generation (RAG) is one of the most common patterns in production LLM systems. It grounds model responses in your own data, reducing hallucinations and enabling domain-specific answers.
Why RAG?
| Problem | How RAG Solves It |
|---|---|
| LLM hallucinations | Ground responses in retrieved facts |
| Outdated knowledge | Use current data from your sources |
| Generic responses | Provide domain-specific context |
| Token limits | Retrieve only relevant information |
RAG Pipeline Architecture
┌─────────────────────────────────────────────────────────────────┐
│ RAG Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────── INDEXING PHASE ─────────────────────┐ │
│ │ │ │
│ │ Documents ──▶ Chunking ──▶ Embedding ──▶ Vector DB │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────── QUERY PHASE ────────────────────────┐ │
│ │ │ │
│ │ Query ──▶ Embedding ──▶ Retrieval ──▶ Reranking │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Top K Documents │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Context + Query ──▶ LLM ──▶ Response │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Phase 1: Document Indexing
Chunking Strategies
class ChunkingStrategy:
    """Different approaches to splitting documents."""

    @staticmethod
    def fixed_size(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
        """Simple fixed-size chunks with overlap between neighbours."""
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        chunks = []
        start = 0
        while start < len(text):
            end = start + chunk_size
            chunks.append(text[start:end])
            if end >= len(text):
                break  # avoid a trailing chunk that is nothing but overlap
            start = end - overlap
        return chunks

    @staticmethod
    def semantic(text: str, max_tokens: int = 500) -> list:
        """Split on semantic boundaries (paragraphs, sections)."""
        # Split on double newlines (paragraphs)
        paragraphs = text.split("\n\n")
        chunks = []
        current_chunk = []
        current_tokens = 0
        for para in paragraphs:
            para_tokens = len(para) // 4  # Rough estimate (~4 chars per token)
            if current_chunk and current_tokens + para_tokens > max_tokens:
                # Flush the current chunk before starting a new one
                chunks.append("\n\n".join(current_chunk))
                current_chunk = [para]
                current_tokens = para_tokens
            else:
                current_chunk.append(para)
                current_tokens += para_tokens
        if current_chunk:
            chunks.append("\n\n".join(current_chunk))
        return chunks
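As a quick illustration, a hypothetical caller might compare both strategies on a raw document string (the file path below is only a placeholder):

# Hypothetical usage of the strategies above
document_text = open("handbook.md", encoding="utf-8").read()  # placeholder source file

fixed_chunks = ChunkingStrategy.fixed_size(document_text, chunk_size=500, overlap=50)
semantic_chunks = ChunkingStrategy.semantic(document_text, max_tokens=500)

print(f"fixed: {len(fixed_chunks)} chunks, semantic: {len(semantic_chunks)} chunks")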
Chunk Size Trade-offs
| Chunk Size | Pros | Cons |
|---|---|---|
| Small (100-200 tokens) | Precise retrieval | May lose context |
| Medium (300-500 tokens) | Balanced precision and context; a good default choice | May still split closely related content |
| Large (500-1000 tokens) | Full context | May include irrelevant info |
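The semantic splitter above approximates tokens as len(para) // 4. When tuning chunk size against the ranges in this table, an exact tokenizer gives more reliable numbers; the sketch below assumes OpenAI's tiktoken library, but any tokenizer that matches your embedding model works:

import tiktoken  # assumption: tokenizer library matching OpenAI embedding models

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Exact token count, replacing the len(text) // 4 estimate."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

# Check how the semantic chunks from above line up with the table's ranges
for chunk in semantic_chunks[:5]:
    print(count_tokens(chunk))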
Embedding Generation
from openai import AsyncOpenAI

class EmbeddingPipeline:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.client = AsyncOpenAI()  # async client so the awaits below are real
        self.model = model

    async def embed_chunks(self, chunks: list) -> list:
        """Embed multiple chunks efficiently."""
        # Batch embedding: one API call for the whole list of chunks
        response = await self.client.embeddings.create(
            model=self.model,
            input=chunks
        )
        return [item.embedding for item in response.data]

    async def embed_query(self, query: str) -> list:
        """Embed a single query."""
        response = await self.client.embeddings.create(
            model=self.model,
            input=query
        )
        return response.data[0].embedding
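The query pipeline in the next section expects a vector store exposing an async search method that returns objects with content and metadata attributes. Production systems use a dedicated vector database (covered in the next article), but a minimal in-memory stand-in, sketched here purely to make that assumed interface concrete, could look like this:

import numpy as np
from dataclasses import dataclass

@dataclass
class SearchResult:
    content: str
    metadata: dict
    score: float

class InMemoryVectorStore:
    """Toy stand-in for a real vector database with the same search interface."""

    def __init__(self):
        self.embeddings = []   # list of np.ndarray vectors
        self.documents = []    # parallel list of (content, metadata) tuples

    def add(self, embedding: list, content: str, metadata: dict) -> None:
        self.embeddings.append(np.asarray(embedding, dtype=float))
        self.documents.append((content, metadata))

    async def search(self, embedding: list, top_k: int = 5) -> list:
        query = np.asarray(embedding, dtype=float)
        # Cosine similarity between the query and every stored chunk
        scores = [
            float(query @ emb / (np.linalg.norm(query) * np.linalg.norm(emb)))
            for emb in self.embeddings
        ]
        best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [
            SearchResult(content=self.documents[i][0],
                         metadata=self.documents[i][1],
                         score=scores[i])
            for i in best
        ]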
Phase 2: Query Processing
Retrieval Flow
class RAGPipeline:
    def __init__(self, vector_store, llm, embedding_model):
        self.vector_store = vector_store
        self.llm = llm
        self.embedder = embedding_model

    async def query(self, user_query: str, top_k: int = 5) -> dict:
        # Step 1: Embed query
        query_embedding = await self.embedder.embed_query(user_query)

        # Step 2: Retrieve similar documents
        results = await self.vector_store.search(
            embedding=query_embedding,
            top_k=top_k
        )

        # Step 3: Build context
        context = "\n\n---\n\n".join([
            f"Source: {r.metadata.get('source', 'Unknown')}\n{r.content}"
            for r in results
        ])

        # Step 4: Generate response
        prompt = f"""Answer based on the provided context only.
If the answer is not in the context, say "I don't have information about that."
Context:
{context}
Question: {user_query}
Answer:"""
        response = await self.llm.complete(prompt)

        return {
            "answer": response,
            "sources": [r.metadata for r in results],
            "context_used": context
        }
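The pipeline above goes straight from retrieval to generation, while the architecture diagram also shows a reranking stage. A common pattern is to over-retrieve (for example top_k=20) and rescore the candidates before building the context. The sketch below assumes the sentence-transformers library and a public ms-marco cross-encoder; any equivalent reranker slots in the same way:

from sentence_transformers import CrossEncoder  # assumption: sentence-transformers is installed

class CrossEncoderReranker:
    """Rescores retrieved chunks against the query with a cross-encoder."""

    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, results: list, top_k: int = 5) -> list:
        # Score each (query, chunk) pair jointly, then keep the best top_k
        pairs = [(query, r.content) for r in results]
        scores = self.model.predict(pairs)
        ranked = sorted(zip(results, scores), key=lambda pair: pair[1], reverse=True)
        return [result for result, _ in ranked[:top_k]]

Inside RAGPipeline.query, this would sit between steps 2 and 3: retrieve a larger candidate set, call rerank(user_query, results, top_k=5), and build the context only from the survivors.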
Key Metrics for RAG Systems
| Metric | What It Measures | Target |
|---|---|---|
| Retrieval Precision | % of retrieved docs that are relevant | > 80% |
| Retrieval Recall | % of relevant docs that were retrieved | > 70% |
| Answer Accuracy | Correctness of final response | > 90% |
| Latency | Time from query to response | < 3s |
| Faithfulness | Answer grounded in retrieved docs | > 95% |
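Retrieval precision and recall are typically measured offline against a small labeled evaluation set that maps each test query to the chunk IDs it should retrieve. A minimal sketch, assuming each retrieved result carries an identifier in its metadata:

def retrieval_metrics(retrieved_ids: list, relevant_ids: set) -> dict:
    """Precision@k and recall@k for a single query."""
    retrieved = set(retrieved_ids)
    hits = len(retrieved & relevant_ids)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant_ids) if relevant_ids else 0.0,
    }

# Example: 5 chunks retrieved, 3 of them relevant, 4 relevant chunks exist overall
print(retrieval_metrics(["c1", "c2", "c3", "c4", "c5"], {"c1", "c2", "c3", "c9"}))
# -> {'precision': 0.6, 'recall': 0.75}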
Next, we'll dive deep into vector database selection and trade-offs.