Note: Code examples in this guide use LangChain 0.3+, LlamaIndex 0.11+, and OpenAI SDK 1.x+. Vector database examples cover Chroma, Pinecone, and pgvector. Always check official documentation for the latest API changes.
Retrieval-Augmented Generation (RAG) has become the most practical architecture pattern for building AI applications that need to reason over your own data. Instead of expensive fine-tuning or hoping the model memorized the right facts, RAG lets you connect any LLM to your documents, databases, and knowledge bases. This guide covers the entire RAG pipeline, from core concepts to production-ready systems.
What Is RAG and Why It Matters
RAG (Retrieval-Augmented Generation) is an architecture where an LLM's response is enhanced by first retrieving relevant information from an external knowledge source. The retrieved context is passed alongside the user's question, grounding the model's answer in your actual data.
The Problem RAG Solves
LLMs have three fundamental limitations that RAG addresses:
- Knowledge cutoff: Models only know what they were trained on. Your internal docs, recent data, and proprietary knowledge are invisible to them.
- Hallucinations: Without grounding, models confidently generate plausible but incorrect answers.
- Lack of citations: Base LLMs can't point to sources. RAG enables traceable, verifiable responses.
RAG vs Fine-Tuning vs Prompt Engineering
| Approach | Best For | Data Needs | Cost | Update Speed |
|---|---|---|---|---|
| Prompt Engineering | Behavior tweaks, formatting | None | Low | Instant |
| RAG | Knowledge grounding, dynamic data | Documents/DB | Medium | Minutes |
| Fine-Tuning | Style, domain reasoning, behavior | Training pairs | High | Hours/Days |
| RAG + Fine-Tuning | Production systems needing both | Both | Highest | Varies |
Rule of thumb: If you need the model to know specific facts, use RAG. If you need it to behave differently, fine-tune. Most production systems start with RAG and add fine-tuning later if needed.
RAG Architecture: The Complete Pipeline
Every RAG system has two phases: indexing (offline) and retrieval + generation (runtime).
Indexing Pipeline (Offline)
Documents → Load → Split/Chunk → Embed → Store in Vector DB
- Load: Ingest documents from various sources (PDFs, web pages, databases, APIs)
- Split: Break documents into smaller chunks appropriate for embedding
- Embed: Convert each chunk into a dense vector using an embedding model
- Store: Save vectors + metadata in a vector database for fast similarity search
Retrieval + Generation Pipeline (Runtime)
User Query → Embed Query → Search Vector DB → Retrieve Top-K → Rerank → Generate with LLM
- Embed query: Convert the user's question into a vector using the same embedding model
- Search: Find the most similar document chunks via vector similarity
- Rerank (optional): Re-score results with a cross-encoder for better precision
- Generate: Pass the retrieved context + question to the LLM for answer generation
Basic RAG with LangChain
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# 1. Load documents
loader = PyPDFLoader("technical_manual.pdf")
docs = loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=100,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(docs)
# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
# 4. Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# 5. Build RAG chain
template = """Answer the question based only on the following context.
If you cannot answer from the context, say "I don't have enough information."
Context: {context}
Question: {question}"""
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
def format_docs(docs):
    # Join the retrieved chunks into one context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# 6. Query
answer = chain.invoke("How do I configure the backup system?")
Embedding Models: Turning Text Into Vectors
Embeddings are the foundation of RAG. They convert text into dense numerical vectors where semantic similarity maps to geometric proximity.
How Embeddings Work
An embedding model maps text to a fixed-dimensional vector (e.g., 1536 dimensions for OpenAI's text-embedding-3-small). Texts with similar meaning produce vectors that are close together in this space, enabling similarity search.
Popular Embedding Models
| Model | Dimensions | Context | Best For | Pricing |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8,191 tokens | General purpose, cost-effective | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | 8,191 tokens | Higher accuracy needs | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | 512 tokens | Multilingual, search-optimized | $0.10/1M tokens |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Open-source, self-hosted | Free |
| E5-mistral-7b-instruct | 4096 | 32,768 tokens | Long-context, open-source | Free |
| all-MiniLM-L6-v2 | 384 | 256 tokens | Fast, lightweight, local | Free |
Choosing the Right Model
- Prototyping: text-embedding-3-small (cheap, good quality, easy API)
- Production (cloud): text-embedding-3-large or Cohere embed-v3
- Self-hosted/privacy: BGE or E5 models via sentence-transformers
- Multilingual: Cohere embed-v3 or multilingual-e5-large
# Using sentence-transformers for local embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
texts = ["How does photosynthesis work?", "Plants convert sunlight to energy"]
embeddings = model.encode(texts)
# Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Similarity: {similarity[0][0]:.3f}") # ~0.85
Embedding Best Practices
- Use the same model for indexing and querying — different models produce incompatible vector spaces
- Normalize vectors if your database doesn't do it automatically (dot-product indexes only match cosine-similarity rankings when vectors are unit length)
- Benchmark on your data — MTEB leaderboard rankings don't always predict performance on domain-specific content
- Consider dimensionality reduction — OpenAI's text-embedding-3 models accept a dimensions parameter to reduce storage costs (see the sketch below)
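As a concrete illustration of the last two points, here is a minimal sketch using the OpenAI SDK directly. The 512-dimension setting and the defensive normalization step are choices for this example, not requirements:

from openai import OpenAI
import numpy as np

client = OpenAI()

# Request a shortened embedding via the dimensions parameter (text-embedding-3 models only)
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I configure the backup system?"],
    dimensions=512,  # down from 1536, cutting vector storage roughly 3x
)
vec = np.array(resp.data[0].embedding)

# Normalize defensively: a no-op for unit-length vectors, and it guarantees
# dot-product search ranks results the same way as cosine similarity
vec = vec / np.linalg.norm(vec)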
Vector Databases: Choosing the Right Store
Vector databases are purpose-built for storing, indexing, and querying high-dimensional vectors efficiently.
Comparison Table
| Database | Type | Hosting | Filtering | Hybrid Search | Best For |
|---|---|---|---|---|---|
| Chroma | Embedded | Local/Docker | Basic | No | Prototyping, small datasets |
| Pinecone | Managed | Cloud only | Advanced | Yes | Production, zero-ops |
| Weaviate | Self-hosted/Cloud | Both | Advanced | Yes | Full control, GraphQL API |
| Qdrant | Self-hosted/Cloud | Both | Advanced | Yes | Performance, Rust-based |
| pgvector | Extension | PostgreSQL | Full SQL | Yes (with extensions) | Existing Postgres users |
| Milvus | Self-hosted/Cloud | Both | Advanced | Yes | Large-scale, enterprise |
Pinecone Example (Managed)
from pinecone import Pinecone, ServerlessSpec
# Initialize
pc = Pinecone(api_key="your-api-key")
# Create index
pc.create_index(
name="rag-docs",
dimension=1536,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("rag-docs")
# Upsert vectors with metadata
index.upsert(vectors=[
{
"id": "doc-1-chunk-0",
"values": embedding_vector,
"metadata": {
"source": "manual.pdf",
"page": 5,
"section": "Installation",
"text": "To install the software..."
}
}
])
# Query with metadata filter
results = index.query(
vector=query_embedding,
top_k=5,
include_metadata=True,
filter={"source": {"$eq": "manual.pdf"}}
)
pgvector Example (PostgreSQL)
-- Enable extension
CREATE EXTENSION vector;
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
metadata JSONB,
embedding vector(1536)
);
-- Create HNSW index for fast search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Query: find 5 most similar documents
SELECT id, content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5;
pgvector is ideal when you already use PostgreSQL — no new infrastructure needed, full SQL power for metadata filtering, and transactional consistency with your application data.
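For completeness, here is a minimal Python sketch against the schema above, assuming psycopg 3 and the pgvector Python package; the connection string and the chunk_text, chunk_embedding, and query_embedding variables are illustrative placeholders:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from psycopg.types.json import Jsonb

conn = psycopg.connect("postgresql://localhost/ragdb")
register_vector(conn)  # teaches psycopg how to send/receive vector values

with conn.cursor() as cur:
    # Insert one chunk: content, JSONB metadata, and its embedding
    cur.execute(
        "INSERT INTO documents (content, metadata, embedding) VALUES (%s, %s, %s)",
        (chunk_text, Jsonb({"source": "manual.pdf"}), np.array(chunk_embedding)),
    )
    conn.commit()

    # Retrieve the 5 most similar chunks for a query embedding
    cur.execute(
        """SELECT content, 1 - (embedding <=> %s) AS similarity
           FROM documents ORDER BY embedding <=> %s LIMIT 5""",
        (np.array(query_embedding), np.array(query_embedding)),
    )
    results = cur.fetchall()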
Chunking Strategies: How to Split Your Documents
Chunking is one of the most impactful decisions in RAG. Bad chunking leads to irrelevant retrievals, split context, and poor answers.
Chunking Methods
| Method | How It Works | Best For |
|---|---|---|
| Fixed-size | Split by token/character count | Simple documents, baseline |
| Recursive | Split by separators (paragraphs, sentences) | General purpose, most common |
| Semantic | Split when embedding similarity drops | Natural topic boundaries |
| Document-aware | Split respecting structure (headers, sections) | Markdown, HTML, structured docs |
| Agentic/AST | Parse code by functions/classes | Code repositories |
Recursive Character Splitting (Most Common)
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=800, # Target chunk size in characters
chunk_overlap=100, # Overlap between chunks to preserve context
separators=[
"\n\n", # Paragraph breaks first
"\n", # Line breaks
". ", # Sentence boundaries
" ", # Word boundaries (last resort)
"" # Character level (absolute last resort)
],
length_function=len,
)
chunks = splitter.split_documents(documents)
Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Split based on embedding similarity between consecutive sentences
splitter = SemanticChunker(
OpenAIEmbeddings(model="text-embedding-3-small"),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=90,
)
chunks = splitter.split_documents(documents)
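Document-Aware Splitting (Markdown)
For the document-aware row in the table above, structured formats can be split along their headings so a chunk never straddles two sections. A minimal sketch with LangChain's MarkdownHeaderTextSplitter; the header mapping and the follow-up size-based split are choices for this example:

from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# Split on headings first, keeping the heading text as chunk metadata
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
section_docs = md_splitter.split_text(markdown_text)  # markdown_text: your raw .md string

# Then cap section size so long sections still fit the embedding model
size_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = size_splitter.split_documents(section_docs)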
Chunk Size Guidelines
- Too small (< 200 tokens): Loses context, retrieves fragments
- Too large (> 2000 tokens): Dilutes relevance, wastes context window
- Sweet spot (400-800 tokens): Enough context to be useful, specific enough to be relevant
- Always include overlap (50-100 tokens): Prevents cutting sentences and losing information at boundaries
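Note that the sizes above are in tokens, while the earlier splitters measured characters (roughly four characters per token for English text). To enforce token-based limits directly, one option is LangChain's tiktoken-aware constructor; the 600/80 values below are illustrative:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are counted in tokens, not characters
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer family used by recent OpenAI models
    chunk_size=600,               # inside the 400-800 token sweet spot
    chunk_overlap=80,
)
chunks = token_splitter.split_documents(documents)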
Metadata Enrichment
Add metadata to each chunk for better retrieval and filtering:
for i, chunk in enumerate(chunks):
chunk.metadata.update({
"chunk_index": i,
"source_file": "manual.pdf",
"section_title": extract_section_title(chunk),
"doc_type": "technical",
"created_at": "2026-01-15",
})
Retrieval & Reranking: Finding the Best Context
Retrieval quality directly determines answer quality. Poor retrieval means the LLM gets irrelevant context and produces bad answers.
Retrieval Strategies
1. Basic Similarity Search
The simplest approach — find the K most similar vectors:
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5}
)
2. Maximum Marginal Relevance (MMR)
Balances relevance with diversity to avoid redundant results:
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={
"k": 5,
"fetch_k": 20, # Fetch 20 candidates
"lambda_mult": 0.7, # 0=max diversity, 1=max relevance
}
)
3. Hybrid Search (Semantic + Keyword)
Combines vector similarity with BM25 keyword matching for the best of both worlds:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Combine with equal weights
hybrid_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.4, 0.6] # Adjust based on your use case
)
Reranking
Reranking uses a cross-encoder model to re-score retrieved results for better precision. Cross-encoders are more accurate than bi-encoders (embedding models) because they process the query and document together.
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
# Use Cohere reranker
reranker = CohereRerank(
model="rerank-english-v3.0",
top_n=3 # Return top 3 after reranking
)
# Wrap the base retriever with reranking
retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)
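If an external reranking API isn't an option, a cross-encoder can also run locally with sentence-transformers. A minimal sketch, assuming the open ms-marco MiniLM checkpoint and the vector store built earlier:

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I configure the backup system?"
retrieved = vectorstore.as_retriever(search_kwargs={"k": 10}).invoke(query)
candidates = [doc.page_content for doc in retrieved]

# Score each (query, chunk) pair jointly, then keep the 3 highest-scoring chunks
scores = cross_encoder.predict([(query, text) for text in candidates])
top_chunks = [text for _, text in sorted(zip(scores, candidates), reverse=True)[:3]]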
Query Transformation
Sometimes the user's query doesn't match the language of the documents. Transform queries to improve retrieval:
# Multi-query: generate multiple search queries from one question
from langchain.retrievers.multi_query import MultiQueryRetriever
multi_retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.3),
)
# The retriever generates 3 query variations and combines results
docs = multi_retriever.invoke("How does the authentication system work?")
# Might search for:
# 1. "authentication system architecture"
# 2. "login and auth flow implementation"
# 3. "user authentication mechanism"
Evaluating RAG Systems
You can't improve what you don't measure. RAG evaluation tells you where your pipeline is failing and guides optimization.
RAGAS Framework
RAGAS (Retrieval Augmented Generation Assessment) is the standard evaluation framework for RAG systems. It provides four key metrics:
| Metric | What It Measures | Range |
|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | 0-1 |
| Answer Relevancy | Does the answer address the question? | 0-1 |
| Context Precision | Are the top-ranked retrieved docs relevant? | 0-1 |
| Context Recall | Were all necessary docs retrieved? | 0-1 |
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from datasets import Dataset
# Prepare evaluation dataset
eval_data = {
"question": ["How do I reset my password?"],
"answer": ["To reset your password, go to Settings > Security > Reset Password..."],
"contexts": [["The password reset feature is in Settings > Security..."]],
"ground_truth": ["Navigate to Settings, then Security, click Reset Password..."],
}
dataset = Dataset.from_dict(eval_data)
# Run evaluation
results = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.95, 'answer_relevancy': 0.90,
# 'context_precision': 0.85, 'context_recall': 0.80}
Building an Evaluation Dataset
A good eval dataset needs:
- Questions: 50-100 representative questions your users would ask
- Ground truth answers: The correct/expected answers
- Source documents: The documents that contain the answers
Start with manual curation, then expand with synthetic question generation:
from ragas.testset.generator import TestsetGenerator
from langchain_openai import ChatOpenAI
generator = TestsetGenerator.from_langchain(
generator_llm=ChatOpenAI(model="gpt-4o"),
critic_llm=ChatOpenAI(model="gpt-4o"),
)
testset = generator.generate_with_langchain_docs(
documents=chunks,
test_size=50,
)
What to Optimize Based on Metrics
| Low Metric | Root Cause | Fix |
|---|---|---|
| Low Faithfulness | LLM ignoring context or hallucinating | Stronger prompt instructions, lower temperature |
| Low Answer Relevancy | Answer off-topic | Better prompt template, check retrieved context |
| Low Context Precision | Irrelevant docs ranked high | Add reranking, improve chunking |
| Low Context Recall | Missing relevant docs | Increase k, try hybrid search, improve embeddings |
Production Patterns & Best Practices
Moving from prototype to production requires attention to performance, reliability, and cost.
Caching
Cache frequently asked questions and their retrieved context to reduce latency and cost:
import hashlib
def get_cache_key(query: str) -> str:
return hashlib.sha256(query.lower().strip().encode()).hexdigest()
# Simple cache pattern
cache = {}
def cached_rag(query: str):
key = get_cache_key(query)
if key in cache:
return cache[key]
result = rag_chain.invoke(query)
cache[key] = result
return result
For production, use Redis or a similar distributed cache with TTL expiration.
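A sketch of that pattern with redis-py, reusing get_cache_key and rag_chain from above; the key prefix and one-hour TTL are illustrative choices:

import redis

# Assumes a local Redis instance; adjust host/port for your deployment
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_rag_redis(query: str, ttl_seconds: int = 3600) -> str:
    key = "rag:" + get_cache_key(query)
    cached = r.get(key)
    if cached is not None:
        return cached
    answer = rag_chain.invoke(query)
    r.setex(key, ttl_seconds, answer)  # expire stale answers after the TTL
    return answer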
Streaming Responses
Stream LLM output for better user experience:
from langchain_core.runnables import RunnablePassthrough
chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
)
# Stream tokens as they're generated
for chunk in chain.stream("How do I configure backups?"):
print(chunk.content, end="", flush=True)
Monitoring and Observability
Track these metrics in production:
- Retrieval latency: Time to search the vector database
- Generation latency: Time for the LLM to respond
- Retrieval relevance scores: Are similarity scores trending down?
- User feedback: Thumbs up/down on answers
- Token usage: Cost per query
Use LangSmith, Langfuse, or custom logging to capture traces of the full pipeline.
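Even without a tracing platform, a lightweight logging wrapper captures the first two latency metrics above. A sketch assuming the retriever and chain built earlier (the chain re-runs retrieval internally, so the second number is end-to-end latency):

import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag")

def answer_with_trace(query: str) -> str:
    t0 = time.perf_counter()
    docs = retriever.invoke(query)           # retrieval only
    retrieval_ms = (time.perf_counter() - t0) * 1000

    answer = chain.invoke(query)             # full pipeline
    total_ms = (time.perf_counter() - t0) * 1000

    logger.info(
        "query=%r retrieval_ms=%.0f total_ms=%.0f chunks_retrieved=%d",
        query, retrieval_ms, total_ms, len(docs),
    )
    return answer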
Cost Optimization
| Strategy | Impact | Implementation |
|---|---|---|
| Smaller embedding model | 6x cheaper (3-small vs 3-large) | Switch model, re-embed |
| Response caching | 90%+ cost reduction for repeat queries | Redis/in-memory cache |
| Tiered retrieval | Reduce LLM calls for simple queries | Route simple queries to cache |
| Chunk deduplication | Fewer embeddings to store | Deduplicate before indexing |
| Batch embedding | Lower API costs | Embed in batches of 100+ |
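For the batch embedding and chunk deduplication rows, a minimal sketch with the OpenAI SDK; the helper name and batch size of 100 are illustrative:

from openai import OpenAI

client = OpenAI()

def embed_in_batches(texts, model="text-embedding-3-small", batch_size=100):
    # Deduplicate first, then send up to batch_size inputs per API call
    unique_texts = list(dict.fromkeys(texts))
    vectors = []
    for i in range(0, len(unique_texts), batch_size):
        batch = unique_texts[i:i + batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in resp.data)
    return dict(zip(unique_texts, vectors))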
Security Considerations
- Input sanitization: Prevent prompt injection through user queries
- Access control: Ensure users only retrieve documents they're authorized to see
- PII filtering: Strip sensitive data from retrieved context before passing to LLMs
- Audit logging: Log queries and retrieved documents for compliance
- Rate limiting: Prevent abuse of your RAG endpoint
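Access control in particular belongs in the retrieval layer, not just the UI (see the failure-mode table below). One way is a per-request metadata filter; a sketch against the Chroma store from earlier, assuming each chunk was indexed with a hypothetical access_group metadata field (filter syntax differs between vector databases):

def retriever_for_user(user_groups: list[str]):
    # Only surface chunks whose access_group matches one of the caller's groups
    return vectorstore.as_retriever(
        search_kwargs={
            "k": 5,
            "filter": {"access_group": {"$in": user_groups}},
        }
    )

docs = retriever_for_user(["engineering"]).invoke("How do I rotate the API keys?")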
Common Failure Modes
| Failure | Symptom | Solution |
|---|---|---|
| Stale index | Answers reference outdated info | Scheduled re-indexing pipeline |
| Context overflow | LLM truncates retrieved context | Reduce k or chunk size |
| Embedding drift | Quality degrades after model update | Version embeddings, re-index on model change |
| Filter bypass | Users access unauthorized content | Enforce access controls in retrieval, not just UI |
Getting Started
Ready to build your first RAG system? Here's a recommended learning path:
- Start with a prototype: Use Chroma + OpenAI embeddings + a small document set
- Add evaluation: Create a test set of 20 questions and measure RAGAS scores
- Optimize chunking: Experiment with different chunk sizes and strategies
- Add hybrid search: Combine vector search with BM25 for better retrieval
- Add reranking: Use Cohere or a cross-encoder to re-score results
- Go to production: Add caching, monitoring, and access controls
- Scale: Move to a managed vector database and optimize costs
The RAG ecosystem is maturing rapidly. New techniques like Agentic RAG (where an agent decides when and how to retrieve) and Graph RAG (using knowledge graphs alongside vectors) continue to push the boundaries of what's possible.