
RAG Architecture Deep Dive

RAG vs Fine-tuning


Two approaches dominate knowledge enhancement for large language models: retrieval-augmented generation (RAG) and fine-tuning. Knowing when to use each is crucial for production systems.

Fundamental Differences

Aspect                | RAG                 | Fine-tuning
Knowledge source      | External retrieval  | Embedded in weights
Update frequency      | Real-time possible  | Requires retraining
Cost structure        | Per-query retrieval | Upfront training cost
Hallucination control | Grounded in sources | Still possible
Domain adaptation     | Works immediately   | Needs training data

When to Choose RAG

RAG excels when:

1. Knowledge changes frequently

# RAG: update documents and they are immediately available.
# chunker, embed_model, and vectorstore stand in for your text
# splitter, embedding model, and vector database client.
def add_new_knowledge(document: str) -> None:
    chunks = chunker.split(document)
    embeddings = embed_model.encode(chunks)
    vectorstore.add(chunks, embeddings)
    # Immediately queryable -- no retraining step
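
For instance, a usage sketch reusing the same assumed objects (the `search` signature here is hypothetical):

# A document added this way is queryable in the same session
add_new_knowledge("Returns are accepted within 30 days of purchase.")

query_vec = embed_model.encode(["What is the return window?"])[0]
hits = vectorstore.search(query_vec, top_k=3)  # search API is assumed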

2. You need source attribution

# RAG provides traceable answers
response = {
    "answer": "The policy allows 30-day returns.",
    "sources": [
        {"doc": "return_policy.pdf", "page": 3},
        {"doc": "faq.md", "section": "returns"}
    ]
}
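
The sources list is not generated by the model; it falls out of retrieval metadata. A minimal sketch, assuming each retrieved hit exposes the `text` and `metadata` it was indexed with (and an `llm` client like the one used later in this lesson):

def answer_with_sources(query: str) -> dict:
    hits = vectorstore.search(embed_model.encode([query])[0], top_k=3)
    answer = llm.generate(query=query, context=[h.text for h in hits])
    # Citations come straight from indexing-time metadata,
    # not from the model's output
    return {"answer": answer, "sources": [h.metadata for h in hits]}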

3. Knowledge base is large or diverse

  • Legal documents across jurisdictions
  • Technical documentation spanning products
  • Historical data with temporal context
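
Diverse corpora are usually handled with metadata filters at query time rather than separate indexes. A sketch; the `filter` argument and field names are assumptions, and the exact syntax varies by vector database:

# Scope retrieval to one slice of a diverse corpus via metadata
hits = vectorstore.search(
    embed_model.encode(["statute of limitations for fraud"])[0],
    top_k=5,
    filter={"jurisdiction": "California", "doc_type": "statute"},  # hypothetical schema
)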

4. Accuracy is critical

  • Medical information retrieval
  • Financial compliance queries
  • Legal research applications
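
In these domains, grounding is typically enforced in the prompt itself. One common pattern (a sketch, not a fixed recipe) restricts the model to the retrieved context:

GROUNDED_PROMPT = """Answer using ONLY the context below.
If the context does not contain the answer, reply "I don't know."
Cite the source document for every claim.

Context:
{context}

Question: {question}"""

# chunk_texts and query come from the retrieval step
prompt = GROUNDED_PROMPT.format(context="\n\n".join(chunk_texts), question=query)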

When to Choose Fine-tuning

Fine-tuning excels when:

1. Consistent style/format needed

# Fine-tuning teaches the model your voice through repeated examples.
# Each training pair maps an instruction to the desired house-style output:
training_examples = [
    {"input": "Summarize this email",
     "output": "[Summary in company style: one-line TL;DR, then bullets]"},
    # ... repeated across hundreds to thousands of pairs
]

2. Task-specific behavior

  • Custom classification schemas
  • Domain-specific entity extraction
  • Proprietary reasoning patterns
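
Concretely, a custom schema becomes supervised input/output pairs. A hypothetical ticket-routing example (the labels are invented for illustration):

# Hypothetical training pairs for a proprietary ticket-routing schema;
# retrieval cannot supply labels that exist only in your taxonomy
classification_data = [
    {"input": "My invoice shows a duplicate charge", "output": "BILLING/DISPUTE"},
    {"input": "The app crashes when I upload a CSV", "output": "BUG/INGESTION"},
    {"input": "Please add dark mode", "output": "FEATURE_REQUEST"},
]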

3. Latency is critical

  • No retrieval overhead
  • Single model inference
  • Predictable response times
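
The difference is easy to measure: retrieval is an extra hop that a fine-tuned-only system never makes. A sketch, reusing the assumed `retriever` and `llm` objects:

import time

def timed_answer(query: str):
    t0 = time.perf_counter()
    context = retriever.search(query)  # the hop fine-tuned-only systems skip
    retrieval_ms = (time.perf_counter() - t0) * 1000
    answer = llm.generate(query=query, context=context)
    return answer, retrieval_ms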

4. Knowledge is stable

  • Core domain concepts don't change
  • Training data represents complete knowledge

The Hybrid Approach

Production systems often combine both:

class HybridKnowledgeSystem:
    def __init__(self):
        # Fine-tuned model for domain understanding
        # (load_fine_tuned_model and VectorRetriever are stand-ins
        # for your model loader and vector DB client)
        self.llm = load_fine_tuned_model("domain-expert-v2")
        # RAG for current facts
        self.retriever = VectorRetriever("knowledge_base")

    def answer(self, query: str) -> str:
        # Retrieve current facts at query time
        context = self.retriever.search(query)

        # The fine-tuned model already knows the domain's terminology
        # and style, so the prompt only needs to carry fresh facts
        return self.llm.generate(query=query, context=context)
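
Usage is then a single call (the query here is made up for illustration), with freshness and style coming from different places:

system = HybridKnowledgeSystem()
print(system.answer("What changed in the Q3 pricing policy?"))
# Facts come from the freshly indexed documents;
# tone and terminology come from the fine-tuned weights.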

Cost Comparison

Factor         | RAG                                            | Fine-tuning
Initial setup  | Index creation (~$0.10 per 1M tokens embedded) | Training run (~$10-$100+ each)
Per-query cost | Embedding + retrieval + generation             | Generation only
Update cost    | Re-embed changed documents                     | Full retraining
Infrastructure | Vector DB hosting                              | Model hosting
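
A back-of-envelope comparison using the table's own ballpark figures (illustrative only):

corpus_tokens = 10_000_000               # example corpus size
index_cost = corpus_tokens / 1e6 * 0.10  # ~$1.00 at $0.10 per 1M tokens
training_cost = 50.0                     # midpoint of the ~$10-$100+ range
# RAG setup is far cheaper up front, but every query then pays
# an embedding + retrieval surcharge that a fine-tuned model avoids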

Decision Framework

START
  Does knowledge change frequently?
    ├─ YES → RAG
    └─ NO ↓
  Need source attribution?
    ├─ YES → RAG
    └─ NO ↓
  Is consistent style critical?
    ├─ YES → Fine-tuning (+ optional RAG)
    └─ NO ↓
  Is latency under 100 ms required?
    ├─ YES → Fine-tuning
    └─ NO → Default: RAG (more flexible, easier to update)
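
The same framework, transcribed directly into code:

def choose_approach(changes_frequently: bool,
                    needs_attribution: bool,
                    style_critical: bool,
                    needs_sub_100ms_latency: bool) -> str:
    if changes_frequently or needs_attribution:
        return "RAG"
    if style_critical:
        return "Fine-tuning (+ optional RAG)"
    if needs_sub_100ms_latency:
        return "Fine-tuning"
    return "RAG"  # default: more flexible, easier to update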

Production Reality: Most enterprise applications start with RAG because it's faster to deploy, easier to update, and provides the auditability that compliance requires.

Next, let's examine the complete RAG pipeline architecture.
