RAG System Design
RAG Architecture Overview
4 min read
Retrieval-Augmented Generation (RAG) is one of the most common patterns in production LLM systems. It grounds model responses in your own data, reducing hallucinations and enabling domain-specific answers.
Why RAG?
| Problem | How RAG Solves It |
|---|---|
| LLM hallucinations | Ground responses in retrieved facts |
| Outdated knowledge | Use current data from your sources |
| Generic responses | Provide domain-specific context |
| Token limits | Retrieve only relevant information |
RAG Pipeline Architecture
┌─────────────────────────────────────────────────────────────────┐
│ RAG Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────── INDEXING PHASE ─────────────────────┐ │
│ │ │ │
│ │ Documents ──▶ Chunking ──▶ Embedding ──▶ Vector DB │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────── QUERY PHASE ────────────────────────┐ │
│ │ │ │
│ │ Query ──▶ Embedding ──▶ Retrieval ──▶ Reranking │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Top K Documents │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Context + Query ──▶ LLM ──▶ Response │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Phase 1: Document Indexing
Chunking Strategies
class ChunkingStrategy:
    """Different approaches to splitting documents."""

    @staticmethod
    def fixed_size(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
        """Simple fixed-size chunks with overlap between neighbours."""
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        chunks = []
        start = 0
        while start < len(text):
            end = start + chunk_size
            chunks.append(text[start:end])
            if end >= len(text):
                break  # avoid a trailing chunk that is nothing but overlap
            start = end - overlap
        return chunks

    @staticmethod
    def semantic(text: str, max_tokens: int = 500) -> list:
        """Split on semantic boundaries (paragraphs, sections)."""
        # Split on double newlines (paragraphs)
        paragraphs = text.split("\n\n")
        chunks = []
        current_chunk = []
        current_tokens = 0
        for para in paragraphs:
            para_tokens = len(para) // 4  # Rough estimate (~4 chars per token)
            if current_chunk and current_tokens + para_tokens > max_tokens:
                # Flush the current chunk before starting a new one
                chunks.append("\n\n".join(current_chunk))
                current_chunk = [para]
                current_tokens = para_tokens
            else:
                current_chunk.append(para)
                current_tokens += para_tokens
        if current_chunk:
            chunks.append("\n\n".join(current_chunk))
        return chunks
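As a quick illustration, a hypothetical caller might compare both strategies on a raw document string (the file path below is only a placeholder):

# Hypothetical usage of the strategies above
document_text = open("handbook.md", encoding="utf-8").read()  # placeholder source file

fixed_chunks = ChunkingStrategy.fixed_size(document_text, chunk_size=500, overlap=50)
semantic_chunks = ChunkingStrategy.semantic(document_text, max_tokens=500)

print(f"fixed: {len(fixed_chunks)} chunks, semantic: {len(semantic_chunks)} chunks")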
Chunk Size Trade-offs
| Chunk Size | Pros | Cons |
|---|---|---|
| Small (100-200 tokens) | Precise retrieval | May lose context |
| Medium (300-500 tokens) | Balanced precision and context; a good default choice | May still split closely related content |
| Large (500-1000 tokens) | Full context | May include irrelevant info |
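The semantic splitter above approximates tokens as len(para) // 4. When tuning chunk size against the ranges in this table, an exact tokenizer gives more reliable numbers; the sketch below assumes OpenAI's tiktoken library, but any tokenizer that matches your embedding model works:

import tiktoken  # assumption: tokenizer library matching OpenAI embedding models

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Exact token count, replacing the len(text) // 4 estimate."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

# Check how the semantic chunks from above line up with the table's ranges
for chunk in semantic_chunks[:5]:
    print(count_tokens(chunk))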
Embedding Generation
from openai import AsyncOpenAI

class EmbeddingPipeline:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.client = AsyncOpenAI()  # async client so the awaits below are real
        self.model = model

    async def embed_chunks(self, chunks: list) -> list:
        """Embed multiple chunks efficiently."""
        # Batch embedding: one API call for the whole list of chunks
        response = await self.client.embeddings.create(
            model=self.model,
            input=chunks
        )
        return [item.embedding for item in response.data]

    async def embed_query(self, query: str) -> list:
        """Embed a single query."""
        response = await self.client.embeddings.create(
            model=self.model,
            input=query
        )
        return response.data[0].embedding
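The query pipeline in the next section expects a vector store exposing an async search method that returns objects with content and metadata attributes. Production systems use a dedicated vector database (covered in the next article), but a minimal in-memory stand-in, sketched here purely to make that assumed interface concrete, could look like this:

import numpy as np
from dataclasses import dataclass

@dataclass
class SearchResult:
    content: str
    metadata: dict
    score: float

class InMemoryVectorStore:
    """Toy stand-in for a real vector database with the same search interface."""

    def __init__(self):
        self.embeddings = []   # list of np.ndarray vectors
        self.documents = []    # parallel list of (content, metadata) tuples

    def add(self, embedding: list, content: str, metadata: dict) -> None:
        self.embeddings.append(np.asarray(embedding, dtype=float))
        self.documents.append((content, metadata))

    async def search(self, embedding: list, top_k: int = 5) -> list:
        query = np.asarray(embedding, dtype=float)
        # Cosine similarity between the query and every stored chunk
        scores = [
            float(query @ emb / (np.linalg.norm(query) * np.linalg.norm(emb)))
            for emb in self.embeddings
        ]
        best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [
            SearchResult(content=self.documents[i][0],
                         metadata=self.documents[i][1],
                         score=scores[i])
            for i in best
        ]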
Phase 2: Query Processing
Retrieval Flow
class RAGPipeline:
    def __init__(self, vector_store, llm, embedding_model):
        self.vector_store = vector_store
        self.llm = llm
        self.embedder = embedding_model

    async def query(self, user_query: str, top_k: int = 5) -> dict:
        # Step 1: Embed query
        query_embedding = await self.embedder.embed_query(user_query)

        # Step 2: Retrieve similar documents
        results = await self.vector_store.search(
            embedding=query_embedding,
            top_k=top_k
        )

        # Step 3: Build context
        context = "\n\n---\n\n".join([
            f"Source: {r.metadata.get('source', 'Unknown')}\n{r.content}"
            for r in results
        ])

        # Step 4: Generate response
        prompt = f"""Answer based on the provided context only.
If the answer is not in the context, say "I don't have information about that."
Context:
{context}
Question: {user_query}
Answer:"""
        response = await self.llm.complete(prompt)

        return {
            "answer": response,
            "sources": [r.metadata for r in results],
            "context_used": context
        }
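The pipeline above goes straight from retrieval to generation, while the architecture diagram also shows a reranking stage. A common pattern is to over-retrieve (for example top_k=20) and rescore the candidates before building the context. The sketch below assumes the sentence-transformers library and a public ms-marco cross-encoder; any equivalent reranker slots in the same way:

from sentence_transformers import CrossEncoder  # assumption: sentence-transformers is installed

class CrossEncoderReranker:
    """Rescores retrieved chunks against the query with a cross-encoder."""

    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, results: list, top_k: int = 5) -> list:
        # Score each (query, chunk) pair jointly, then keep the best top_k
        pairs = [(query, r.content) for r in results]
        scores = self.model.predict(pairs)
        ranked = sorted(zip(results, scores), key=lambda pair: pair[1], reverse=True)
        return [result for result, _ in ranked[:top_k]]

Inside RAGPipeline.query, this would sit between steps 2 and 3: retrieve a larger candidate set, call rerank(user_query, results, top_k=5), and build the context only from the survivors.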
Key Metrics for RAG Systems
| Metric | What It Measures | Target |
|---|---|---|
| Retrieval Precision | % of retrieved docs that are relevant | > 80% |
| Retrieval Recall | % of relevant docs that were retrieved | > 70% |
| Answer Accuracy | Correctness of final response | > 90% |
| Latency | Time from query to response | < 3s |
| Faithfulness | Answer grounded in retrieved docs | > 95% |
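Retrieval precision and recall are typically measured offline against a small labeled evaluation set that maps each test query to the chunk IDs it should retrieve. A minimal sketch, assuming each retrieved result carries an identifier in its metadata:

def retrieval_metrics(retrieved_ids: list, relevant_ids: set) -> dict:
    """Precision@k and recall@k for a single query."""
    retrieved = set(retrieved_ids)
    hits = len(retrieved & relevant_ids)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant_ids) if relevant_ids else 0.0,
    }

# Example: 5 chunks retrieved, 3 of them relevant, 4 relevant chunks exist overall
print(retrieval_metrics(["c1", "c2", "c3", "c4", "c5"], {"c1", "c2", "c3", "c9"}))
# -> {'precision': 0.6, 'recall': 0.75}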
Next, we'll dive deep into vector database selection and trade-offs.