
The Complete Guide to RAG: Building Retrieval-Augmented Generation Systems

Master RAG system development from architecture to production. Learn embedding models, vector databases, chunking strategies, hybrid search, reranking, evaluation with RAGAS, and best practices for building reliable retrieval-augmented generation pipelines.

February 10, 2026
NerdLevelTech

Note: Code examples in this guide use LangChain 0.3+, LlamaIndex 0.11+, and OpenAI SDK 1.x+. Vector database examples cover Chroma, Pinecone, and pgvector. Always check official documentation for the latest API changes.

Retrieval-Augmented Generation (RAG) has become the most practical architecture pattern for building AI applications that need to reason over your own data. Instead of expensive fine-tuning or hoping the model memorized the right facts, RAG lets you connect any LLM to your documents, databases, and knowledge bases. This guide covers the entire RAG pipeline, from core concepts to production-ready systems.

What Is RAG and Why It Matters

RAG (Retrieval-Augmented Generation) is an architecture where an LLM's response is enhanced by first retrieving relevant information from an external knowledge source. The retrieved context is passed alongside the user's question, grounding the model's answer in your actual data.

The Problem RAG Solves

LLMs have three fundamental limitations that RAG addresses:

  1. Knowledge cutoff: Models only know what they were trained on. Your internal docs, recent data, and proprietary knowledge are invisible to them.
  2. Hallucinations: Without grounding, models confidently generate plausible but incorrect answers.
  3. Lack of citations: Base LLMs can't point to sources. RAG enables traceable, verifiable responses.

RAG vs Fine-Tuning vs Prompt Engineering

| Approach | Best For | Data Needs | Cost | Update Speed |
|---|---|---|---|---|
| Prompt Engineering | Behavior tweaks, formatting | None | Low | Instant |
| RAG | Knowledge grounding, dynamic data | Documents/DB | Medium | Minutes |
| Fine-Tuning | Style, domain reasoning, behavior | Training pairs | High | Hours/Days |
| RAG + Fine-Tuning | Production systems needing both | Both | Highest | Varies |

Rule of thumb: If you need the model to know specific facts, use RAG. If you need it to behave differently, fine-tune. Most production systems start with RAG and add fine-tuning later if needed.

RAG Architecture: The Complete Pipeline

Every RAG system has two phases: indexing (offline) and retrieval + generation (runtime).

Indexing Pipeline (Offline)

Documents → Load → Split/Chunk → Embed → Store in Vector DB
  1. Load: Ingest documents from various sources (PDFs, web pages, databases, APIs)
  2. Split: Break documents into smaller chunks appropriate for embedding
  3. Embed: Convert each chunk into a dense vector using an embedding model
  4. Store: Save vectors + metadata in a vector database for fast similarity search

Retrieval + Generation Pipeline (Runtime)

User Query → Embed Query → Search Vector DB → Retrieve Top-K → Rerank → Generate with LLM
  1. Embed query: Convert the user's question into a vector using the same embedding model
  2. Search: Find the most similar document chunks via vector similarity
  3. Rerank (optional): Re-score results with a cross-encoder for better precision
  4. Generate: Pass the retrieved context + question to the LLM for answer generation

Basic RAG with LangChain

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# 1. Load documents
loader = PyPDFLoader("technical_manual.pdf")
docs = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# 4. Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 5. Build RAG chain
template = """Answer the question based only on the following context.
If you cannot answer from the context, say "I don't have enough information."

Context: {context}

Question: {question}"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Join retrieved Documents into a single context string for the prompt
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# 6. Query
answer = chain.invoke("How do I configure the backup system?")

Embedding Models: Turning Text Into Vectors

Embeddings are the foundation of RAG. They convert text into dense numerical vectors where semantic similarity maps to geometric proximity.

How Embeddings Work

An embedding model maps text to a fixed-dimensional vector (e.g., 1536 dimensions for OpenAI's text-embedding-3-small). Texts with similar meaning produce vectors that are close together in this space, enabling similarity search.
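
As a concrete example with the OpenAI SDK 1.x (a minimal sketch; it assumes OPENAI_API_KEY is set in your environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How does photosynthesis work?",
)

vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions for text-embedding-3-small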

| Model | Dimensions | Context | Best For | Pricing |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8,191 tokens | General purpose, cost-effective | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | 8,191 tokens | Higher accuracy needs | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | 512 tokens | Multilingual, search-optimized | $0.10/1M tokens |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Open-source, self-hosted | Free |
| E5-mistral-7b-instruct | 4096 | 32,768 tokens | Long-context, open-source | Free |
| all-MiniLM-L6-v2 | 384 | 256 tokens | Fast, lightweight, local | Free |

Choosing the Right Model

  • Prototyping: text-embedding-3-small (cheap, good quality, easy API)
  • Production (cloud): text-embedding-3-large or Cohere embed-v3
  • Self-hosted/privacy: BGE or E5 models via sentence-transformers
  • Multilingual: Cohere embed-v3 or multilingual-e5-large

# Using sentence-transformers for local embeddings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

texts = ["How does photosynthesis work?", "Plants convert sunlight to energy"]
embeddings = model.encode(texts)

# Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Similarity: {similarity[0][0]:.3f}")  # ~0.85

Embedding Best Practices

  1. Use the same model for indexing and querying — different models produce incompatible vector spaces
  2. Normalize vectors if your database doesn't do it automatically: on unit-length vectors, cosine similarity and dot product give the same ranking, and some indexes assume normalized inputs
  3. Benchmark on your data — MTEB leaderboard rankings don't always predict performance on domain-specific content
  4. Consider dimensionality reduction: OpenAI's text-embedding-3 models accept a dimensions parameter to reduce storage costs (see the sketch below)
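
A minimal sketch of points 1, 2, and 4 together, using LangChain's OpenAIEmbeddings (the dimensions value and sample texts are illustrative):

import numpy as np
from langchain_openai import OpenAIEmbeddings

# One model instance used for both documents and queries, with reduced dimensionality
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)

doc_vec = np.array(embeddings.embed_documents(["Plants convert sunlight to energy"])[0])
query_vec = np.array(embeddings.embed_query("How does photosynthesis work?"))

# Normalize to unit length so dot product and cosine similarity agree
doc_vec /= np.linalg.norm(doc_vec)
query_vec /= np.linalg.norm(query_vec)

print(float(doc_vec @ query_vec))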

Vector Databases: Choosing the Right Store

Vector databases are purpose-built for storing, indexing, and querying high-dimensional vectors efficiently.

Comparison Table

| Database | Type | Hosting | Filtering | Hybrid Search | Best For |
|---|---|---|---|---|---|
| Chroma | Embedded | Local/Docker | Basic | No | Prototyping, small datasets |
| Pinecone | Managed | Cloud only | Advanced | Yes | Production, zero-ops |
| Weaviate | Self-hosted/Cloud | Both | Advanced | Yes | Full control, GraphQL API |
| Qdrant | Self-hosted/Cloud | Both | Advanced | Yes | Performance, Rust-based |
| pgvector | Extension | PostgreSQL | Full SQL | Yes (with extensions) | Existing Postgres users |
| Milvus | Self-hosted/Cloud | Both | Advanced | Yes | Large-scale, enterprise |

Pinecone Example (Managed)

from pinecone import Pinecone, ServerlessSpec

# Initialize
pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="rag-docs",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("rag-docs")

# Upsert vectors with metadata
index.upsert(vectors=[
    {
        "id": "doc-1-chunk-0",
        "values": embedding_vector,
        "metadata": {
            "source": "manual.pdf",
            "page": 5,
            "section": "Installation",
            "text": "To install the software..."
        }
    }
])

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"source": {"$eq": "manual.pdf"}}
)

pgvector Example (PostgreSQL)

-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB,
    embedding vector(1536)
);

-- Create HNSW index for fast search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Query: find 5 most similar documents
SELECT id, content, metadata,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5;

pgvector is ideal when you already use PostgreSQL — no new infrastructure needed, full SQL power for metadata filtering, and transactional consistency with your application data.
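
From application code, the same query can be issued with psycopg and the pgvector Python adapter. A rough sketch, assuming a database named ragdb and a query_embedding that is a 1536-dimension NumPy array:

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=ragdb")
register_vector(conn)  # registers the vector type so NumPy arrays round-trip

rows = conn.execute(
    """
    SELECT content, metadata, 1 - (embedding <=> %s) AS similarity
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
    """,
    (query_embedding, query_embedding),
).fetchall()

for content, metadata, similarity in rows:
    print(f"{similarity:.3f}  {content[:80]}")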

Chunking Strategies: How to Split Your Documents

Chunking is one of the most impactful decisions in RAG. Bad chunking leads to irrelevant retrievals, split context, and poor answers.

Chunking Methods

| Method | How It Works | Best For |
|---|---|---|
| Fixed-size | Split by token/character count | Simple documents, baseline |
| Recursive | Split by separators (paragraphs, sentences) | General purpose, most common |
| Semantic | Split when embedding similarity drops | Natural topic boundaries |
| Document-aware | Split respecting structure (headers, sections) | Markdown, HTML, structured docs |
| Agentic/AST | Parse code by functions/classes | Code repositories |

Recursive Character Splitting (Most Common)

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # Target chunk size in characters
    chunk_overlap=100,     # Overlap between chunks to preserve context
    separators=[
        "\n\n",  # Paragraph breaks first
        "\n",    # Line breaks
        ". ",    # Sentence boundaries
        " ",     # Word boundaries (last resort)
        ""       # Character level (absolute last resort)
    ],
    length_function=len,
)

chunks = splitter.split_documents(documents)

Semantic Chunking

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Split based on embedding similarity between consecutive sentences
splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)

chunks = splitter.split_documents(documents)
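
The table above also lists document-aware splitting. For Markdown sources, a structure-aware splitter keeps each chunk tied to the headings it sits under; here is a minimal sketch with LangChain's MarkdownHeaderTextSplitter (the file path and header labels are illustrative):

from langchain_text_splitters import MarkdownHeaderTextSplitter

# Split on heading levels; each chunk records its headers as metadata
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section"), ("###", "subsection")]
)

markdown_text = open("docs/user_guide.md").read()
md_chunks = md_splitter.split_text(markdown_text)

# md_chunks[i].metadata might look like {"title": "Installation", "section": "Requirements"}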

Chunk Size Guidelines

  • Too small (< 200 tokens): Loses context, retrieves fragments
  • Too large (> 2000 tokens): Dilutes relevance, wastes context window
  • Sweet spot (400-800 tokens): Enough context to be useful, specific enough to be relevant (a quick token-count check is sketched after this list)
  • Always include overlap (50-100 tokens): Prevents cutting sentences and losing information at boundaries
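
These guidelines are in tokens, while RecursiveCharacterTextSplitter above counts characters by default. A quick sanity check with tiktoken shows where your chunks actually land (a rough sketch; cl100k_base is the encoding used by OpenAI's text-embedding-3 models):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Measure each chunk in tokens rather than characters
token_counts = [len(encoding.encode(chunk.page_content)) for chunk in chunks]
print(f"min={min(token_counts)}, max={max(token_counts)}, "
      f"avg={sum(token_counts) / len(token_counts):.0f}")

# Flag chunks well outside the recommended range
outliers = [n for n in token_counts if n < 200 or n > 2000]
print(f"{len(outliers)} chunks outside the 200-2000 token bounds")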

Metadata Enrichment

Add metadata to each chunk for better retrieval and filtering:

for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "chunk_index": i,
        "source_file": "manual.pdf",
        "section_title": extract_section_title(chunk),
        "doc_type": "technical",
        "created_at": "2026-01-15",
    })
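
With metadata in place, retrieval can be narrowed at query time. A sketch using the Chroma store built earlier (the field values are the illustrative ones added above):

# Restrict search to chunks tagged as technical documentation
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"doc_type": "technical"},
    }
)

docs = filtered_retriever.invoke("How do I configure the backup system?")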

Retrieval & Reranking: Finding the Best Context

Retrieval quality directly determines answer quality. Poor retrieval means the LLM gets irrelevant context and produces bad answers.

Retrieval Strategies

1. Similarity Search (Top-K)

The simplest approach is to return the K vectors most similar to the query:

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

2. Maximum Marginal Relevance (MMR)

Balances relevance with diversity to avoid redundant results:

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,
        "fetch_k": 20,      # Fetch 20 candidates
        "lambda_mult": 0.7,  # 0=max diversity, 1=max relevance
    }
)

3. Hybrid Search (Semantic + Keyword)

Combines vector similarity with BM25 keyword matching for the best of both worlds:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with weighted scores (keyword vs. vector)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # Adjust based on your use case
)

Reranking

Reranking uses a cross-encoder model to re-score retrieved results for better precision. Cross-encoders are more accurate than bi-encoders (embedding models) because they process the query and document together.

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Use Cohere reranker
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=3  # Return top 3 after reranking
)

# Wrap the base retriever with reranking
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

Query Transformation

Sometimes the user's query doesn't match the language of the documents. Transform queries to improve retrieval:

# Multi-query: generate multiple search queries from one question
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.3),
)

# The retriever generates 3 query variations and combines results
docs = multi_retriever.invoke("How does the authentication system work?")
# Might search for:
# 1. "authentication system architecture"
# 2. "login and auth flow implementation"
# 3. "user authentication mechanism"

Evaluating RAG Systems

You can't improve what you don't measure. RAG evaluation tells you where your pipeline is failing and guides optimization.

RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is the standard evaluation framework for RAG systems. It provides four key metrics:

| Metric | What It Measures | Range |
|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | 0-1 |
| Answer Relevancy | Does the answer address the question? | 0-1 |
| Context Precision | Are the top-ranked retrieved docs relevant? | 0-1 |
| Context Recall | Were all necessary docs retrieved? | 0-1 |

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["How do I reset my password?"],
    "answer": ["To reset your password, go to Settings > Security > Reset Password..."],
    "contexts": [["The password reset feature is in Settings > Security..."]],
    "ground_truth": ["Navigate to Settings, then Security, click Reset Password..."],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)
# {'faithfulness': 0.95, 'answer_relevancy': 0.90,
#  'context_precision': 0.85, 'context_recall': 0.80}

Building an Evaluation Dataset

A good eval dataset needs:

  1. Questions: 50-100 representative questions your users would ask
  2. Ground truth answers: The correct/expected answers
  3. Source documents: The documents that contain the answers

Start with manual curation, then expand with synthetic question generation:

from ragas.testset.generator import TestsetGenerator
from langchain_openai import ChatOpenAI

generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o"),
    critic_llm=ChatOpenAI(model="gpt-4o"),
)

testset = generator.generate_with_langchain_docs(
    documents=chunks,
    test_size=50,
)

What to Optimize Based on Metrics

| Low Metric | Root Cause | Fix |
|---|---|---|
| Low Faithfulness | LLM ignoring context or hallucinating | Stronger prompt instructions, lower temperature |
| Low Answer Relevancy | Answer off-topic | Better prompt template, check retrieved context |
| Low Context Precision | Irrelevant docs ranked high | Add reranking, improve chunking |
| Low Context Recall | Missing relevant docs | Increase k, try hybrid search, improve embeddings |

Production Patterns & Best Practices

Moving from prototype to production requires attention to performance, reliability, and cost.

Caching

Cache frequently asked questions and their retrieved context to reduce latency and cost:

import hashlib

def get_cache_key(query: str) -> str:
    return hashlib.sha256(query.lower().strip().encode()).hexdigest()

# Simple cache pattern
cache = {}

def cached_rag(query: str):
    key = get_cache_key(query)
    if key in cache:
        return cache[key]

    result = rag_chain.invoke(query)
    cache[key] = result
    return result

For production, use Redis or a similar distributed cache with TTL expiration.
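
A sketch of the same pattern on Redis with a TTL (assumes a local Redis instance and reuses get_cache_key and rag_chain from above):

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_SECONDS = 3600  # expire cached answers after one hour

def cached_rag_redis(query: str) -> str:
    key = f"rag:{get_cache_key(query)}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    result = rag_chain.invoke(query)
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result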

Streaming Responses

Stream LLM output for better user experience:

from langchain_core.runnables import RunnablePassthrough

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)

# Stream tokens as they're generated
for chunk in chain.stream("How do I configure backups?"):
    print(chunk.content, end="", flush=True)

Monitoring and Observability

Track these metrics in production:

  • Retrieval latency: Time to search the vector database
  • Generation latency: Time for the LLM to respond
  • Retrieval relevance scores: Are similarity scores trending down?
  • User feedback: Thumbs up/down on answers
  • Token usage: Cost per query

Use LangSmith, Langfuse, or custom logging to capture traces of the full pipeline.
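
If you roll your own logging instead, a minimal sketch that times retrieval and generation separately might look like this (retriever, prompt, and llm are the objects built earlier; the log fields are illustrative):

import time
import logging

logger = logging.getLogger("rag")

def answer_with_timing(question: str) -> str:
    t0 = time.perf_counter()
    docs = retriever.invoke(question)
    retrieval_ms = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    context = "\n\n".join(doc.page_content for doc in docs)
    response = llm.invoke(prompt.invoke({"context": context, "question": question}))
    generation_ms = (time.perf_counter() - t1) * 1000

    logger.info("retrieval_ms=%.0f generation_ms=%.0f docs=%d",
                retrieval_ms, generation_ms, len(docs))
    return response.content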

Cost Optimization

| Strategy | Impact | Implementation |
|---|---|---|
| Smaller embedding model | 6x cheaper (3-small vs 3-large) | Switch model, re-embed |
| Response caching | 90%+ cost reduction for repeat queries | Redis/in-memory cache |
| Tiered retrieval | Reduce LLM calls for simple queries | Route simple queries to cache |
| Chunk deduplication | Fewer embeddings to store | Deduplicate before indexing |
| Batch embedding | Lower API costs | Embed in batches of 100+ |
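
For the batch embedding row, the OpenAI embeddings endpoint accepts a list of inputs per request; a rough sketch (the batch size is illustrative, and very large batches can hit per-request token limits):

from openai import OpenAI

client = OpenAI()
texts = [chunk.page_content for chunk in chunks]

BATCH_SIZE = 100
all_vectors = []

for i in range(0, len(texts), BATCH_SIZE):
    batch = texts[i:i + BATCH_SIZE]
    response = client.embeddings.create(model="text-embedding-3-small", input=batch)
    all_vectors.extend(item.embedding for item in response.data)  # returned in input order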

Security Considerations

  1. Input sanitization: Prevent prompt injection through user queries
  2. Access control: Ensure users only retrieve documents they're authorized to see (a retrieval-side filter is sketched after this list)
  3. PII filtering: Strip sensitive data from retrieved context before passing to LLMs
  4. Audit logging: Log queries and retrieved documents for compliance
  5. Rate limiting: Prevent abuse of your RAG endpoint
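
For access control (point 2), enforce permissions inside the retrieval call rather than only in the UI. A sketch that assumes each chunk was indexed with a scalar group metadata field and uses Chroma's $in filter operator:

def retriever_for_user(user_groups: list[str]):
    # Only return chunks whose group tag matches one of the user's groups
    return vectorstore.as_retriever(
        search_kwargs={
            "k": 5,
            "filter": {"group": {"$in": user_groups}},
        }
    )

docs = retriever_for_user(["support", "engineering"]).invoke(
    "How do I configure the backup system?"
)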

Common Failure Modes

| Failure | Symptom | Solution |
|---|---|---|
| Stale index | Answers reference outdated info | Scheduled re-indexing pipeline |
| Context overflow | LLM truncates retrieved context | Reduce k or chunk size |
| Embedding drift | Quality degrades after model update | Version embeddings, re-index on model change |
| Filter bypass | Users access unauthorized content | Enforce access controls in retrieval, not just UI |

Getting Started

Ready to build your first RAG system? Here's a recommended learning path:

  1. Start with a prototype: Use Chroma + OpenAI embeddings + a small document set
  2. Add evaluation: Create a test set of 20 questions and measure RAGAS scores
  3. Optimize chunking: Experiment with different chunk sizes and strategies
  4. Add hybrid search: Combine vector search with BM25 for better retrieval
  5. Add reranking: Use Cohere or a cross-encoder to re-score results
  6. Go to production: Add caching, monitoring, and access controls
  7. Scale: Move to a managed vector database and optimize costs

The RAG ecosystem is maturing rapidly. New techniques like Agentic RAG (where an agent decides when and how to retrieve) and Graph RAG (using knowledge graphs alongside vectors) continue to push the boundaries of what's possible.


