Note: Code examples in this guide use LangChain 0.3+, LlamaIndex 0.11+, and OpenAI SDK 1.x+. Vector database examples cover Chroma, Pinecone, and pgvector. Always check official documentation for the latest API changes.
Retrieval-Augmented Generation (RAG) has become the most practical architecture pattern for building AI applications that need to reason over your own data. Instead of expensive fine-tuning or hoping the model memorized the right facts, RAG lets you connect any LLM to your documents, databases, and knowledge bases. This guide covers the entire RAG pipeline, from core concepts to production-ready systems.
What Is RAG and Why It Matters
RAG (Retrieval-Augmented Generation) is an architecture where an LLM's response is enhanced by first retrieving relevant information from an external knowledge source. The retrieved context is passed alongside the user's question, grounding the model's answer in your actual data.
The Problem RAG Solves
LLMs have three fundamental limitations that RAG addresses:
- Knowledge cutoff: Models only know what they were trained on. Your internal docs, recent data, and proprietary knowledge are invisible to them.
- Hallucinations: Without grounding, models confidently generate plausible but incorrect answers.
- Lack of citations: Base LLMs can't point to sources. RAG enables traceable, verifiable responses.
RAG vs Fine-Tuning vs Prompt Engineering
| Approach | Best For | Data Needs | Cost | Update Speed |
|---|---|---|---|---|
| Prompt Engineering | Behavior tweaks, formatting | None | Low | Instant |
| RAG | Knowledge grounding, dynamic data | Documents/DB | Medium | Minutes |
| Fine-Tuning | Style, domain reasoning, behavior | Training pairs | High | Hours/Days |
| RAG + Fine-Tuning | Production systems needing both | Both | Highest | Varies |
Rule of thumb: If you need the model to know specific facts, use RAG. If you need it to behave differently, fine-tune. Most production systems start with RAG and add fine-tuning later if needed.
RAG Architecture: The Complete Pipeline
Every RAG system has two phases: indexing (offline) and retrieval + generation (runtime).
Indexing Pipeline (Offline)
Documents → Load → Split/Chunk → Embed → Store in Vector DB
- Load: Ingest documents from various sources (PDFs, web pages, databases, APIs)
- Split: Break documents into smaller chunks appropriate for embedding
- Embed: Convert each chunk into a dense vector using an embedding model
- Store: Save vectors + metadata in a vector database for fast similarity search
Retrieval + Generation Pipeline (Runtime)
User Query → Embed Query → Search Vector DB → Retrieve Top-K → Rerank → Generate with LLM
- Embed query: Convert the user's question into a vector using the same embedding model
- Search: Find the most similar document chunks via vector similarity
- Rerank (optional): Re-score results with a cross-encoder for better precision
- Generate: Pass the retrieved context + question to the LLM for answer generation
Basic RAG with LangChain
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# 1. Load documents
loader = PyPDFLoader("technical_manual.pdf")
docs = loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=100,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(docs)
# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
# 4. Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# 5. Build RAG chain
template = """Answer the question based only on the following context.
If you cannot answer from the context, say "I don't have enough information."
Context: {context}
Question: {question}"""
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
def format_docs(docs):
    # Join the retrieved chunks into one context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# 6. Query
answer = chain.invoke("How do I configure the backup system?")
Embedding Models: Turning Text Into Vectors
Embeddings are the foundation of RAG. They convert text into dense numerical vectors where semantic similarity maps to geometric proximity.
How Embeddings Work
An embedding model maps text to a fixed-dimensional vector (e.g., 1536 dimensions for OpenAI's text-embedding-3-small). Texts with similar meaning produce vectors that are close together in this space, enabling similarity search.
Popular Embedding Models
| Model | Dimensions | Context | Best For | Pricing |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8,191 tokens | General purpose, cost-effective | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | 8,191 tokens | Higher accuracy needs | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | 512 tokens | Multilingual, search-optimized | $0.10/1M tokens |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Open-source, self-hosted | Free |
| E5-mistral-7b-instruct | 4096 | 32,768 tokens | Long-context, open-source | Free |
| all-MiniLM-L6-v2 | 384 | 256 tokens | Fast, lightweight, local | Free |
Choosing the Right Model
- Prototyping: text-embedding-3-small (cheap, good quality, easy API)
- Production (cloud): text-embedding-3-large or Cohere embed-v3
- Self-hosted/privacy: BGE or E5 models via sentence-transformers
- Multilingual: Cohere embed-v3 or multilingual-e5-large
# Using sentence-transformers for local embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
texts = ["How does photosynthesis work?", "Plants convert sunlight to energy"]
embeddings = model.encode(texts)
# Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Similarity: {similarity[0][0]:.3f}") # ~0.85
Embedding Best Practices
- Use the same model for indexing and querying — different models produce incompatible vector spaces
- Normalize vectors if your database doesn't do it automatically (dot-product indexes only match cosine-similarity rankings when vectors are unit length)
- Benchmark on your data — MTEB leaderboard rankings don't always predict performance on domain-specific content
- Consider dimensionality reduction — OpenAI's text-embedding-3 models accept a dimensions parameter to reduce storage costs (see the sketch below)
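As a concrete illustration of the last two points, here is a minimal sketch using the OpenAI SDK directly. The 512-dimension setting and the defensive normalization step are choices for this example, not requirements:

from openai import OpenAI
import numpy as np

client = OpenAI()

# Request a shortened embedding via the dimensions parameter (text-embedding-3 models only)
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I configure the backup system?"],
    dimensions=512,  # down from 1536, cutting vector storage roughly 3x
)
vec = np.array(resp.data[0].embedding)

# Normalize defensively: a no-op for unit-length vectors, and it guarantees
# dot-product search ranks results the same way as cosine similarity
vec = vec / np.linalg.norm(vec)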
Vector Databases: Choosing the Right Store
Vector databases are purpose-built for storing, indexing, and querying high-dimensional vectors efficiently.
Comparison Table
| Database | Type | Hosting | Filtering | Hybrid Search | Best For |
|---|---|---|---|---|---|
| Chroma | Embedded | Local/Docker | Basic | No | Prototyping, small datasets |
| Pinecone | Managed | Cloud only | Advanced | Yes | Production, zero-ops |
| Weaviate | Self-hosted/Cloud | Both | Advanced | Yes | Full control, GraphQL API |
| Qdrant | Self-hosted/Cloud | Both | Advanced | Yes | Performance, Rust-based |
| pgvector | Extension | PostgreSQL | Full SQL | Yes (with extensions) | Existing Postgres users |
| Milvus | Self-hosted/Cloud | Both | Advanced | Yes | Large-scale, enterprise |
Pinecone Example (Managed)
from pinecone import Pinecone, ServerlessSpec
# Initialize
pc = Pinecone(api_key="your-api-key")
# Create index
pc.create_index(
name="rag-docs",
dimension=1536,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("rag-docs")
# Upsert vectors with metadata
index.upsert(vectors=[
{
"id": "doc-1-chunk-0",
"values": embedding_vector,
"metadata": {
"source": "manual.pdf",
"page": 5,
"section": "Installation",
"text": "To install the software..."
}
}
])
# Query with metadata filter
results = index.query(
vector=query_embedding,
top_k=5,
include_metadata=True,
filter={"source": {"$eq": "manual.pdf"}}
)
pgvector Example (PostgreSQL)
-- Enable extension
CREATE EXTENSION vector;
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
metadata JSONB,
embedding vector(1536)
);
-- Create HNSW index for fast search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Query: find 5 most similar documents
SELECT id, content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5;
pgvector is ideal when you already use PostgreSQL — no new infrastructure needed, full SQL power for metadata filtering, and transactional consistency with your application data.
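For completeness, here is a minimal Python sketch against the schema above, assuming psycopg 3 and the pgvector Python package; the connection string and the chunk_text, chunk_embedding, and query_embedding variables are illustrative placeholders:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from psycopg.types.json import Jsonb

conn = psycopg.connect("postgresql://localhost/ragdb")
register_vector(conn)  # teaches psycopg how to send/receive vector values

with conn.cursor() as cur:
    # Insert one chunk: content, JSONB metadata, and its embedding
    cur.execute(
        "INSERT INTO documents (content, metadata, embedding) VALUES (%s, %s, %s)",
        (chunk_text, Jsonb({"source": "manual.pdf"}), np.array(chunk_embedding)),
    )
    conn.commit()

    # Retrieve the 5 most similar chunks for a query embedding
    cur.execute(
        """SELECT content, 1 - (embedding <=> %s) AS similarity
           FROM documents ORDER BY embedding <=> %s LIMIT 5""",
        (np.array(query_embedding), np.array(query_embedding)),
    )
    results = cur.fetchall()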
Chunking Strategies: How to Split Your Documents
Chunking is one of the most impactful decisions in RAG. Bad chunking leads to irrelevant retrievals, split context, and poor answers.
Chunking Methods
| Method | How It Works | Best For |
|---|---|---|
| Fixed-size | Split by token/character count | Simple documents, baseline |
| Recursive | Split by separators (paragraphs, sentences) | General purpose, most common |
| Semantic | Split when embedding similarity drops | Natural topic boundaries |
| Document-aware | Split respecting structure (headers, sections) | Markdown, HTML, structured docs |
| Agentic/AST | Parse code by functions/classes | Code repositories |
Recursive Character Splitting (Most Common)
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=800, # Target chunk size in characters
chunk_overlap=100, # Overlap between chunks to preserve context
separators=[
"\n\n", # Paragraph breaks first
"\n", # Line breaks
". ", # Sentence boundaries
" ", # Word boundaries (last resort)
"" # Character level (absolute last resort)
],
length_function=len,
)
chunks = splitter.split_documents(documents)
Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Split based on embedding similarity between consecutive sentences
splitter = SemanticChunker(
OpenAIEmbeddings(model="text-embedding-3-small"),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=90,
)
chunks = splitter.split_documents(documents)
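Document-Aware Splitting (Markdown)
For the document-aware row in the table above, structured formats can be split along their headings so a chunk never straddles two sections. A minimal sketch with LangChain's MarkdownHeaderTextSplitter; the header mapping and the follow-up size-based split are choices for this example:

from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# Split on headings first, keeping the heading text as chunk metadata
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
section_docs = md_splitter.split_text(markdown_text)  # markdown_text: your raw .md string

# Then cap section size so long sections still fit the embedding model
size_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = size_splitter.split_documents(section_docs)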
Chunk Size Guidelines
- Too small (< 200 tokens): Loses context, retrieves fragments
- Too large (> 2000 tokens): Dilutes relevance, wastes context window
- Sweet spot (400-800 tokens): Enough context to be useful, specific enough to be relevant
- Always include overlap (50-100 tokens): Prevents cutting sentences and losing information at boundaries
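Note that the sizes above are in tokens, while the earlier splitters measured characters (roughly four characters per token for English text). To enforce token-based limits directly, one option is LangChain's tiktoken-aware constructor; the 600/80 values below are illustrative:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are counted in tokens, not characters
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer family used by recent OpenAI models
    chunk_size=600,               # inside the 400-800 token sweet spot
    chunk_overlap=80,
)
chunks = token_splitter.split_documents(documents)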
Metadata Enrichment
Add metadata to each chunk for better retrieval and filtering:
for i, chunk in enumerate(chunks):
chunk.metadata.update({
"chunk_index": i,
"source_file": "manual.pdf",
"section_title": extract_section_title(chunk),
"doc_type": "technical",
"created_at": "2026-01-15",
})
Retrieval & Reranking: Finding the Best Context
Retrieval quality directly determines answer quality. Poor retrieval means the LLM gets irrelevant context and produces bad answers.
Retrieval Strategies
1. Basic Similarity Search
The simplest approach — find the K most similar vectors:
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5}
)
2. Maximum Marginal Relevance (MMR)
Balances relevance with diversity to avoid redundant results:
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={
"k": 5,
"fetch_k": 20, # Fetch 20 candidates
"lambda_mult": 0.7, # 0=max diversity, 1=max relevance
}
)
3. Hybrid Search (Semantic + Keyword)
Combines vector similarity with BM25 keyword matching for the best of both worlds:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Combine with equal weights
hybrid_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.4, 0.6] # Adjust based on your use case
)
Reranking
Reranking uses a cross-encoder model to re-score retrieved results for better precision. Cross-encoders are more accurate than bi-encoders (embedding models) because they process the query and document together.
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
# Use Cohere reranker
reranker = CohereRerank(
model="rerank-english-v3.0",
top_n=3 # Return top 3 after reranking
)
# Wrap the base retriever with reranking
retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)
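If an external reranking API isn't an option, a cross-encoder can also run locally with sentence-transformers. A minimal sketch, assuming the open ms-marco MiniLM checkpoint and the vector store built earlier:

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I configure the backup system?"
retrieved = vectorstore.as_retriever(search_kwargs={"k": 10}).invoke(query)
candidates = [doc.page_content for doc in retrieved]

# Score each (query, chunk) pair jointly, then keep the 3 highest-scoring chunks
scores = cross_encoder.predict([(query, text) for text in candidates])
top_chunks = [text for _, text in sorted(zip(scores, candidates), reverse=True)[:3]]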
Query Transformation
Sometimes the user's query doesn't match the language of the documents. Transform queries to improve retrieval:
# Multi-query: generate multiple search queries from one question
from langchain.retrievers.multi_query import MultiQueryRetriever
multi_retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.3),
)
# The retriever generates 3 query variations and combines results
docs = multi_retriever.invoke("How does the authentication system work?")
# Might search for:
# 1. "authentication system architecture"
# 2. "login and auth flow implementation"
# 3. "user authentication mechanism"
Evaluating RAG Systems
You can't improve what you don't measure. RAG evaluation tells you where your pipeline is failing and guides optimization.
RAGAS Framework
RAGAS (Retrieval Augmented Generation Assessment) is the standard evaluation framework for RAG systems. It provides four key metrics:
| Metric | What It Measures | Range |
|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | 0-1 |
| Answer Relevancy | Does the answer address the question? | 0-1 |
| Context Precision | Are the top-ranked retrieved docs relevant? | 0-1 |
| Context Recall | Were all necessary docs retrieved? | 0-1 |
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from datasets import Dataset
# Prepare evaluation dataset
eval_data = {
"question": ["How do I reset my password?"],
"answer": ["To reset your password, go to Settings > Security > Reset Password..."],
"contexts": [["The password reset feature is in Settings > Security..."]],
"ground_truth": ["Navigate to Settings, then Security, click Reset Password..."],
}
dataset = Dataset.from_dict(eval_data)
# Run evaluation
results = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.95, 'answer_relevancy': 0.90,
# 'context_precision': 0.85, 'context_recall': 0.80}
Building an Evaluation Dataset
A good eval dataset needs:
- Questions: 50-100 representative questions your users would ask
- Ground truth answers: The correct/expected answers
- Source documents: The documents that contain the answers
Start with manual curation, then expand with synthetic question generation:
from ragas.testset.generator import TestsetGenerator
from langchain_openai import ChatOpenAI
generator = TestsetGenerator.from_langchain(
generator_llm=ChatOpenAI(model="gpt-4o"),
critic_llm=ChatOpenAI(model="gpt-4o"),
)
testset = generator.generate_with_langchain_docs(
documents=chunks,
test_size=50,
)
What to Optimize Based on Metrics
| Low Metric | Root Cause | Fix |
|---|---|---|
| Low Faithfulness | LLM ignoring context or hallucinating | Stronger prompt instructions, lower temperature |
| Low Answer Relevancy | Answer off-topic | Better prompt template, check retrieved context |
| Low Context Precision | Irrelevant docs ranked high | Add reranking, improve chunking |
| Low Context Recall | Missing relevant docs | Increase k, try hybrid search, improve embeddings |
Production Patterns & Best Practices
Moving from prototype to production requires attention to performance, reliability, and cost.
Caching
Cache frequently asked questions and their retrieved context to reduce latency and cost:
import hashlib
def get_cache_key(query: str) -> str:
return hashlib.sha256(query.lower().strip().encode()).hexdigest()
# Simple cache pattern
cache = {}
def cached_rag(query: str):
key = get_cache_key(query)
if key in cache:
return cache[key]
result = rag_chain.invoke(query)
cache[key] = result
return result
For production, use Redis or a similar distributed cache with TTL expiration.
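A sketch of that pattern with redis-py, reusing get_cache_key and rag_chain from above; the key prefix and one-hour TTL are illustrative choices:

import redis

# Assumes a local Redis instance; adjust host/port for your deployment
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_rag_redis(query: str, ttl_seconds: int = 3600) -> str:
    key = "rag:" + get_cache_key(query)
    cached = r.get(key)
    if cached is not None:
        return cached
    answer = rag_chain.invoke(query)
    r.setex(key, ttl_seconds, answer)  # expire stale answers after the TTL
    return answer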
Streaming Responses
Stream LLM output for better user experience:
from langchain_core.runnables import RunnablePassthrough
chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
)
# Stream tokens as they're generated
for chunk in chain.stream("How do I configure backups?"):
print(chunk.content, end="", flush=True)
Monitoring and Observability
Track these metrics in production:
- Retrieval latency: Time to search the vector database
- Generation latency: Time for the LLM to respond
- Retrieval relevance scores: Are similarity scores trending down?
- User feedback: Thumbs up/down on answers
- Token usage: Cost per query
Use LangSmith, Langfuse, or custom logging to capture traces of the full pipeline.
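Even without a tracing platform, a lightweight logging wrapper captures the first two latency metrics above. A sketch assuming the retriever and chain built earlier (the chain re-runs retrieval internally, so the second number is end-to-end latency):

import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag")

def answer_with_trace(query: str) -> str:
    t0 = time.perf_counter()
    docs = retriever.invoke(query)           # retrieval only
    retrieval_ms = (time.perf_counter() - t0) * 1000

    answer = chain.invoke(query)             # full pipeline
    total_ms = (time.perf_counter() - t0) * 1000

    logger.info(
        "query=%r retrieval_ms=%.0f total_ms=%.0f chunks_retrieved=%d",
        query, retrieval_ms, total_ms, len(docs),
    )
    return answer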
Cost Optimization
| Strategy | Impact | Implementation |
|---|---|---|
| Smaller embedding model | 6x cheaper (3-small vs 3-large) | Switch model, re-embed |
| Response caching | 90%+ cost reduction for repeat queries | Redis/in-memory cache |
| Tiered retrieval | Reduce LLM calls for simple queries | Route simple queries to cache |
| Chunk deduplication | Fewer embeddings to store | Deduplicate before indexing |
| Batch embedding | Lower API costs | Embed in batches of 100+ |
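For the batch embedding and chunk deduplication rows, a minimal sketch with the OpenAI SDK; the helper name and batch size of 100 are illustrative:

from openai import OpenAI

client = OpenAI()

def embed_in_batches(texts, model="text-embedding-3-small", batch_size=100):
    # Deduplicate first, then send up to batch_size inputs per API call
    unique_texts = list(dict.fromkeys(texts))
    vectors = []
    for i in range(0, len(unique_texts), batch_size):
        batch = unique_texts[i:i + batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in resp.data)
    return dict(zip(unique_texts, vectors))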
Security Considerations
- Input sanitization: Prevent prompt injection through user queries
- Access control: Ensure users only retrieve documents they're authorized to see
- PII filtering: Strip sensitive data from retrieved context before passing to LLMs
- Audit logging: Log queries and retrieved documents for compliance
- Rate limiting: Prevent abuse of your RAG endpoint
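Access control in particular belongs in the retrieval layer, not just the UI (see the failure-mode table below). One way is a per-request metadata filter; a sketch against the Chroma store from earlier, assuming each chunk was indexed with a hypothetical access_group metadata field (filter syntax differs between vector databases):

def retriever_for_user(user_groups: list[str]):
    # Only surface chunks whose access_group matches one of the caller's groups
    return vectorstore.as_retriever(
        search_kwargs={
            "k": 5,
            "filter": {"access_group": {"$in": user_groups}},
        }
    )

docs = retriever_for_user(["engineering"]).invoke("How do I rotate the API keys?")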
Common Failure Modes
| Failure | Symptom | Solution |
|---|---|---|
| Stale index | Answers reference outdated info | Scheduled re-indexing pipeline |
| Context overflow | LLM truncates retrieved context | Reduce k or chunk size |
| Embedding drift | Quality degrades after model update | Version embeddings, re-index on model change |
| Filter bypass | Users access unauthorized content | Enforce access controls in retrieval, not just UI |
Getting Started
Ready to build your first RAG system? Here's a recommended learning path:
- Start with a prototype: Use Chroma + OpenAI embeddings + a small document set
- Add evaluation: Create a test set of 20 questions and measure RAGAS scores
- Optimize chunking: Experiment with different chunk sizes and strategies
- Add hybrid search: Combine vector search with BM25 for better retrieval
- Add reranking: Use Cohere or a cross-encoder to re-score results
- Go to production: Add caching, monitoring, and access controls
- Scale: Move to a managed vector database and optimize costs
The RAG ecosystem is maturing rapidly. New techniques like Agentic RAG (where an agent decides when and how to retrieve) and Graph RAG (using knowledge graphs alongside vectors) continue to push the boundaries of what's possible.