RAG Architecture Deep Dive
RAG vs Fine-tuning
3 min read
Two approaches dominate when you need to give an LLM knowledge it wasn't trained on: retrieval-augmented generation (RAG) and fine-tuning. Understanding when to use each is crucial for production systems.
Fundamental Differences
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Knowledge source | External retrieval | Embedded in weights |
| Update frequency | Real-time possible | Requires retraining |
| Cost structure | Per-query retrieval | Upfront training cost |
| Hallucination control | Grounded in sources | Still possible |
| Domain adaptation | Works immediately | Needs training data |
When to Choose RAG
RAG excels when:
1. Knowledge changes frequently
```python
# RAG: update documents, and the new knowledge is immediately available.
# Assumes chunker, embed_model, and vectorstore are initialized elsewhere.
def add_new_knowledge(document: str):
    chunks = chunker.split(document)
    embeddings = embed_model.encode(chunks)
    vectorstore.add(chunks, embeddings)
    # Immediately queryable - no retraining
```
2. You need source attribution
```python
# RAG provides traceable answers
response = {
    "answer": "The policy allows 30-day returns.",
    "sources": [
        {"doc": "return_policy.pdf", "page": 3},
        {"doc": "faq.md", "section": "returns"},
    ],
}
```
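In practice, the attribution falls out of retrieval itself: each chunk keeps the metadata it was indexed with. A minimal sketch, assuming a hypothetical `retriever.search` that returns scored hits with `.text` and `.metadata` fields:

```python
def answer_with_sources(query: str) -> dict:
    hits = retriever.search(query, top_k=3)  # assumed API: returns scored chunks
    answer = llm.generate(query=query, context=[h.text for h in hits])
    return {
        "answer": answer,
        # Chunks carry their source metadata, so citations come for free.
        "sources": [h.metadata for h in hits],
    }
```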
3. Knowledge base is large or diverse
- Legal documents across jurisdictions
- Technical documentation spanning products
- Historical data with temporal context
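With a corpus like this, retrieval is usually scoped by metadata rather than searched flat. A minimal sketch, assuming documents were indexed with `jurisdiction` and `doc_type` fields and that the vector store accepts a `filter` argument (both hypothetical):

```python
# Narrow the search space before similarity ranking.
query_vec = embed_model.encode(["What is the statutory notice period?"])[0]
hits = vectorstore.search(
    query_embedding=query_vec,
    filter={"jurisdiction": "DE", "doc_type": "contract"},  # assumed filter syntax
    top_k=5,
)
```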
4. Accuracy is critical
- Medical information retrieval
- Financial compliance queries
- Legal research applications
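In these settings a common guardrail is to abstain when retrieval confidence is low, rather than let the model guess. A minimal sketch, with an assumed similarity-score threshold:

```python
MIN_SCORE = 0.75  # assumed threshold; tune on held-out queries

def safe_answer(query: str) -> str:
    hits = retriever.search(query, top_k=3)
    if not hits or hits[0].score < MIN_SCORE:
        # Abstaining beats hallucinating in medical, financial, or legal use.
        return "I can't answer that from the indexed sources."
    return llm.generate(query=query, context=[h.text for h in hits])
```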
When to Choose Fine-tuning
Fine-tuning excels when:
1. Consistent style/format needed
```python
# A fine-tuned model learns your voice from paired examples.
# Training data:
#   Input:  "Summarize this email"
#   Output: "[Company-style summary format]"
```
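Concretely, that training data is usually a JSONL file of input/output pairs. A minimal sketch in the common chat format; exact field names vary by provider, so check your fine-tuning docs:

```python
import json

# Each example pairs a generic instruction with a house-style completion.
examples = [
    {
        "messages": [
            {"role": "user", "content": "Summarize this email: ..."},
            {"role": "assistant", "content": "TL;DR: ...\nAction items: ..."},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```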
2. Task-specific behavior
- Custom classification schemas
- Domain-specific entity extraction
- Proprietary reasoning patterns
3. Latency is critical
- No retrieval overhead
- Single model inference
- Predictable response times
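The overhead is easy to budget. An illustrative back-of-envelope with assumed per-stage latencies (yours will differ):

```python
# Assumed, illustrative latencies in milliseconds.
EMBED_QUERY = 15     # embed the incoming query
VECTOR_SEARCH = 25   # nearest-neighbor lookup
GENERATION = 60      # LLM inference

rag_latency = EMBED_QUERY + VECTOR_SEARCH + GENERATION  # 100 ms
ft_latency = GENERATION                                 # 60 ms
print(f"RAG: {rag_latency} ms, fine-tuned: {ft_latency} ms")
```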
4. Knowledge is stable
- Core domain concepts don't change
- Training data represents complete knowledge
The Hybrid Approach
Production systems often combine both:
```python
class HybridKnowledgeSystem:
    def __init__(self):
        # Fine-tuned for domain understanding
        self.llm = load_fine_tuned_model("domain-expert-v2")
        # RAG for current facts
        self.retriever = VectorRetriever("knowledge_base")

    def answer(self, query: str):
        # Retrieve current facts
        context = self.retriever.search(query)
        # The fine-tuned model supplies domain terminology and style
        return self.llm.generate(query=query, context=context)
```
Cost Comparison
| Factor | RAG | Fine-tuning |
|---|---|---|
| Initial setup | Index creation (~$0.10/1M tokens) | Training (~$10-100+ per run) |
| Per-query cost | Embedding + retrieval + generation | Generation only |
| Update cost | Re-embed changed docs | Full retraining |
| Infrastructure | Vector DB hosting | Model hosting |
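A rough break-even calculation makes the trade-off concrete. Illustrative numbers only, loosely following the table above:

```python
# Illustrative assumptions - substitute your own prices and volumes.
RAG_EXTRA_PER_QUERY = 0.00006  # embedding + vector lookup, per query
FINE_TUNE_RUN = 50.0           # one training run, mid-range from the table

# Queries before one training run costs less than RAG's per-query
# retrieval overhead (ignores re-training whenever knowledge changes).
break_even = FINE_TUNE_RUN / RAG_EXTRA_PER_QUERY
print(f"Break-even at ~{break_even:,.0f} queries")  # ~833,333
```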
Decision Framework
```
START
  │
  ▼
Does knowledge change frequently? ──YES→ RAG
  │ NO
  ▼
Need source attribution? ──YES→ RAG
  │ NO
  ▼
Is consistent style critical? ──YES→ Fine-tuning (+ optional RAG)
  │ NO
  ▼
Latency under 100 ms required? ──YES→ Fine-tuning
  │ NO
  ▼
Default → RAG (more flexible, easier to update)
```
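The same flow works as an in-code checklist for design reviews. A minimal sketch mirroring the diagram:

```python
def choose_approach(
    knowledge_changes_frequently: bool,
    needs_source_attribution: bool,
    consistent_style_critical: bool,
    latency_under_100ms: bool,
) -> str:
    # Questions are evaluated in diagram order; the first "yes" decides.
    if knowledge_changes_frequently:
        return "RAG"
    if needs_source_attribution:
        return "RAG"
    if consistent_style_critical:
        return "Fine-tuning (+ optional RAG)"
    if latency_under_100ms:
        return "Fine-tuning"
    return "RAG"  # default: more flexible, easier to update

print(choose_approach(False, True, False, False))  # -> RAG
```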
Production Reality: Most enterprise applications start with RAG because it's faster to deploy, easier to update, and provides the auditability that compliance requires.
Next, let's examine the complete RAG pipeline architecture.