RAG-Powered Agents & Memory Systems

Agent Memory & Retrieval Architectures

5 min read

Why Agents Need Memory

A stateless LLM forgets everything between API calls. Without memory, an agent cannot maintain conversation context, recall factual knowledge from a company's documents, or adapt to a user's preferences over time. Memory transforms a stateless language model into a persistent, context-aware agent.

Agent memory serves three distinct purposes:

| Purpose | Example | What Breaks Without It |
| --- | --- | --- |
| Conversation context | Remembering the user said "my project uses PostgreSQL" five messages ago | Agent asks the same clarifying questions repeatedly |
| Factual knowledge | Retrieving the company's refund policy from internal docs | Agent hallucinates policies or gives generic answers |
| Learned preferences | Knowing this user prefers concise code examples over verbose explanations | Agent cannot personalize interactions |

Memory Taxonomy

Agent memory systems map to a well-established taxonomy. Understanding these categories helps you design the right memory architecture for a given use case.

Working Memory (Context Window)

The LLM's context window is the agent's working memory. Everything the model can "see" at inference time — the system prompt, conversation history, retrieved documents, tool results — must fit within this window.

  • Capacity: Varies by model (e.g., 128K tokens for GPT-4o, 200K tokens for Claude)
  • Trade-off: Larger context windows increase cost and latency per request
  • Key constraint: Context is finite, so you must decide what to include and what to leave out
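
To respect this limit in code, count tokens before you send the request. A minimal sketch using the tiktoken tokenizer; the encoding name, window size, and output reserve below are illustrative assumptions, not tied to a specific model:

# Sketch: check whether assembled context fits the model's window.
# Encoding name, window size, and output reserve are illustrative assumptions.
import tiktoken

CONTEXT_LIMIT = 128_000   # assumed model window, in tokens
OUTPUT_RESERVE = 4_000    # leave room for the model's answer

encoding = tiktoken.get_encoding("cl100k_base")

def fits_in_window(parts: list[str]) -> bool:
    """Return True if the combined prompt parts leave room for output."""
    used = sum(len(encoding.encode(p)) for p in parts)
    return used + OUTPUT_RESERVE <= CONTEXT_LIMIT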

Episodic Memory (Conversation History)

Episodic memory stores the sequence of past interactions — what the user said, what the agent did, what tools were called, and what results came back.

# Simple episodic memory structure
episodic_memory = [
    {"role": "user", "content": "Find all orders over $500"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"name": "query_orders", "args": {"min_amount": 500}}
    ]},
    {"role": "tool", "content": "Found 23 orders totaling $18,450"},
    {"role": "assistant", "content": "I found 23 orders over $500..."},
]

As conversations grow long, episodic memory must be managed — you cannot keep injecting the full history into every request indefinitely.
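
A minimal management policy is to walk the history from newest to oldest and keep only what fits a budget. A rough sketch, approximating token counts with word counts purely for illustration:

# Sketch: trim episodic memory to a budget, dropping the oldest turns first.
# Token counts are approximated by word counts for illustration only.
def trim_history(messages: list[dict], budget: int = 2000) -> list[dict]:
    kept, used = [], 0
    for msg in reversed(messages):            # walk from newest to oldest
        cost = len(str(msg.get("content") or "").split())
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))               # restore chronological order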

Semantic Memory (Knowledge Base)

Semantic memory stores factual knowledge the agent can retrieve on demand. This is where RAG (Retrieval-Augmented Generation) comes in: instead of stuffing all knowledge into the prompt, you retrieve only the relevant pieces for each query.

  • Storage: Vector databases (e.g., Pinecone, Weaviate, Chroma, pgvector), document stores
  • Retrieval: Embedding-based similarity search, keyword search, or hybrid approaches
  • Scope: Company docs, product catalogs, knowledge bases, code repositories
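
Under the hood, embedding-based retrieval is nearest-neighbor search over vectors. A minimal in-memory sketch with cosine similarity, where embed_fn stands in for whatever embedding model you use; a vector database replaces this brute-force loop at scale:

# Sketch: in-memory semantic retrieval via cosine similarity.
# embed_fn is a placeholder for your embedding model.
import numpy as np

def top_k(query: str, chunks: list[str], embed_fn, k: int = 3) -> list[str]:
    q = embed_fn(query)
    docs = np.array([embed_fn(c) for c in chunks])
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(sims)[::-1][:k]          # indices of the k most similar chunks
    return [chunks[i] for i in best]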

Procedural Memory (Learned Patterns)

Procedural memory captures how the agent should behave — patterns, workflows, and strategies it has learned. This can be implemented as:

  • System prompts with instructions and examples
  • Few-shot examples stored and retrieved dynamically
  • Fine-tuned model weights encoding specific behaviors
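
A common way to combine the first two options is to assemble the system prompt from fixed behavior rules plus a few dynamically retrieved examples. A hedged sketch, where retrieve_examples is an assumed similarity-search helper:

# Sketch: procedural memory as a system prompt built from stored rules
# plus dynamically retrieved few-shot examples (retrieve_examples is assumed).
BEHAVIOR_RULES = "Prefer concise code examples. Always cite sources."

def build_system_prompt(user_query: str, retrieve_examples) -> str:
    examples = retrieve_examples(user_query, k=2)   # assumed retrieval helper
    shots = "\n\n".join(f"Example:\n{e}" for e in examples)
    return f"{BEHAVIOR_RULES}\n\n{shots}"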

RAG Architectures for Agents

RAG is the primary mechanism for connecting agents to external knowledge. The architecture you choose has a significant impact on answer quality.

Naive RAG

The simplest approach: embed the query, retrieve the top-K most similar chunks, stuff them into the prompt.

User Query → Embed → Vector Search (top-K) → Stuff into Prompt → LLM → Answer
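
In code, the whole pipeline is only a few lines. A minimal sketch, where embed_fn, vector_search, and llm are placeholders for your embedding model, vector store, and chat model:

# Sketch of naive RAG; embed_fn, vector_search, and llm are placeholders.
def naive_rag(query: str, embed_fn, vector_search, llm, k: int = 4) -> str:
    chunks = vector_search(embed_fn(query), top_k=k)   # single retrieval pass
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)                                  # stuff and generate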

Limitations of naive RAG:

  • Single retrieval pass — if the first search misses relevant content, quality degrades
  • No query reformulation — the user's raw question may not match how information is stored
  • No verification — the agent cannot check whether retrieved content actually answers the question

Agentic RAG

Agentic RAG treats retrieval as a multi-step reasoning process. The agent actively plans, retrieves, evaluates, and iterates.

Query Planning: Before searching, the agent analyzes the question and may decompose it into sub-queries.

# Complex question requiring decomposition
user_query = "Compare our Q3 and Q4 revenue and explain the main drivers"

# Agent decomposes into sub-queries
sub_queries = [
    "Q3 revenue figures and breakdown",
    "Q4 revenue figures and breakdown",
    "Key business events between Q3 and Q4",
]
# Each sub-query is searched independently, results are combined

Self-Reflection: After retrieving documents, the agent evaluates whether the results are sufficient to answer the question.

# Self-reflection prompt (conceptual)
reflection_prompt = """
Given these retrieved documents: {retrieved_docs}
And the original question: {user_query}

Do these documents contain enough information to answer the question?
- If YES: proceed to generate the answer
- If NO: what additional information is needed? Reformulate the query.
"""

Multi-Hop Retrieval: Some questions require chaining multiple retrievals — the answer to the first search informs what to search for next.

Query: "Who manages the team that built feature X?"
  → Search 1: "feature X" → finds "Built by Team Alpha"
  → Search 2: "Team Alpha manager" → finds "Managed by Sarah Chen"
  → Answer: "Sarah Chen manages the team that built feature X"
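
A hedged sketch of the full agentic loop, combining retrieval, self-reflection, and follow-up hops. Here llm and search are placeholder callables, and the reflection step is assumed to reply with either an answer or a reformulated query:

# Sketch of an agentic retrieval loop; llm and search are placeholder callables.
# The reflection step is assumed to return "ANSWER: ..." or "SEARCH: <new query>".
def agentic_rag(question: str, search, llm, max_hops: int = 3) -> str:
    evidence, query = [], question
    for _ in range(max_hops):
        evidence += search(query)                        # retrieve for current query
        verdict = llm(
            f"Question: {question}\nEvidence: {evidence}\n"
            "Reply 'ANSWER: <answer>' if the evidence is sufficient, "
            "otherwise 'SEARCH: <follow-up query>'."
        )
        if verdict.startswith("ANSWER:"):
            return verdict.removeprefix("ANSWER:").strip()
        query = verdict.removeprefix("SEARCH:").strip()  # next hop
    return "I don't have enough information to answer that."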

Choosing the Right Architecture

| Scenario | Recommended Approach | Why |
| --- | --- | --- |
| Simple FAQ lookup | Naive RAG | Questions map directly to stored answers |
| Complex analytical questions | Agentic RAG with query decomposition | Multiple pieces of information needed |
| Questions requiring reasoning across documents | Multi-hop retrieval | Answer depends on chaining facts |
| High-stakes applications (legal, medical) | Agentic RAG with self-reflection | Must verify retrieval quality before answering |

Chunking Strategies

How you split documents into chunks directly affects retrieval quality. The right strategy depends on your content type and query patterns.

| Strategy | How It Works | Best For | Drawback |
| --- | --- | --- | --- |
| Fixed-size | Split every N tokens with overlap | Uniform content (logs, transcripts) | Cuts mid-sentence, breaks context |
| Recursive | Split by paragraphs, then sentences, then tokens | Structured documents (articles, docs) | Requires tuning separators per content type |
| Semantic | Group sentences by embedding similarity | Mixed-topic documents | Computationally expensive at ingestion time |
| Document-aware | Split by headings, sections, or logical boundaries | Structured formats (Markdown, HTML, code) | Requires content-specific parsing logic |

Key principle: Chunks should be self-contained enough that they make sense without surrounding context, but small enough to be precise. A common starting point is 256-512 tokens with 10-20% overlap.
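
A minimal sketch of fixed-size chunking with overlap in that range, approximating tokens with whitespace-split words for illustration:

# Sketch: fixed-size chunking with overlap; "tokens" are whitespace-split words
# here purely for illustration. A real pipeline would use the model's tokenizer.
def chunk_text(text: str, size: int = 384, overlap: int = 64) -> list[str]:
    words = text.split()
    step = size - overlap                      # ~17% overlap with these defaults
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]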

Context Window Management

When an agent has conversation history, retrieved documents, system instructions, and tool results, the context window fills up fast. You need strategies to manage what goes in.

Summarization

Compress older conversation turns into a summary, keeping recent messages verbatim.

# Context management strategy
context = []
context.append(system_prompt)              # Fixed: ~500 tokens
context.append(conversation_summary)        # Compressed: ~200 tokens
context.extend(recent_messages[-5:])        # Verbatim: ~1000 tokens
context.extend(retrieved_documents[:3])     # Top-3 chunks: ~1500 tokens
# Total: ~3200 tokens — fits comfortably within budget

Sliding Window

Keep only the last N messages in the context, dropping older ones.

  • Advantage: Simple to implement, predictable token usage
  • Disadvantage: Loses early context that may still be relevant
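
If you keep history in a bounded deque, the sliding window is essentially free. A small sketch (maxlen counts messages, not tokens):

# Sketch: sliding-window memory that keeps only the last N messages.
from collections import deque

history = deque(maxlen=20)                 # N = 20 messages; older ones fall off
history.append({"role": "user", "content": "Find all orders over $500"})
context_messages = list(history)           # what gets sent on the next request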

Importance-Based Pruning

Score each piece of context by relevance to the current query and drop low-scoring items first.

  • Messages where the user stated key requirements: high importance
  • Small-talk or acknowledgment messages: low importance
  • Tool results from earlier steps: medium importance (summarize if needed)
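
A hedged sketch of importance-based pruning, assuming a score_fn that rates each item's relevance to the current query (embedding similarity or an LLM judge, for example):

# Sketch: drop the lowest-scoring context items until the budget is met.
# score_fn is an assumed relevance scorer (embedding similarity, LLM judge, ...).
def prune_by_importance(items: list[dict], query: str, score_fn, budget: int) -> list[dict]:
    scored = sorted(items, key=lambda m: score_fn(m, query), reverse=True)
    kept, used = [], 0
    for item in scored:
        cost = len(str(item.get("content", "")).split())   # rough token proxy
        if used + cost <= budget:
            kept.append(item)
            used += cost
    return kept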

Token Budget Allocation

Allocate your context window into explicit budgets:

| Component | Budget | Example (8K total) |
| --- | --- | --- |
| System prompt | 10-15% | ~1000 tokens |
| Conversation memory | 20-30% | ~2000 tokens |
| Retrieved documents | 40-50% | ~3500 tokens |
| Reserved for output | 15-20% | ~1500 tokens |
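
In code, the split can be a small dictionary of fractions applied to the window size. The shares below are chosen to reproduce the example column in the table above:

# Sketch: turn the percentage split above into concrete token budgets.
WINDOW = 8_000
BUDGET_SHARES = {
    "system_prompt": 0.125,        # 10-15% of the window
    "conversation_memory": 0.25,   # 20-30%
    "retrieved_documents": 0.4375, # 40-50%
    "output_reserve": 0.1875,      # 15-20%
}
budgets = {name: int(WINDOW * share) for name, share in BUDGET_SHARES.items()}
# {'system_prompt': 1000, 'conversation_memory': 2000,
#  'retrieved_documents': 3500, 'output_reserve': 1500}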

Source Attribution and Hallucination Prevention

Agents that retrieve knowledge must attribute their answers to specific sources. Without attribution, users cannot verify claims and trust erodes.

Source Attribution

Each claim in the agent's response should link back to a specific chunk:

# Attribution structure
attribution = {
    "claim": "The refund policy allows returns within 30 days",
    "source_chunk_id": "policy-doc-chunk-42",
    "source_document": "refund-policy-v3.pdf",
    "confidence": 0.92,
    "relevant_excerpt": "Customers may return items within 30 calendar days..."
}

Hallucination Prevention

The agent should only make claims supported by retrieved content. Common strategies:

  • Grounding check: Compare each sentence in the response against retrieved chunks
  • Abstain when unsure: If no retrieved content addresses the question, say "I don't have information about that" rather than guessing
  • Quote directly: Include relevant excerpts from source documents
  • Confidence scoring: Assign confidence scores and flag low-confidence claims for human review
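
A hedged sketch of the grounding check, where supported_by is an assumed helper (an NLI model, an LLM judge, or even string overlap) that decides whether a chunk supports a sentence:

# Sketch: flag response sentences that no retrieved chunk supports.
# supported_by is an assumed helper (NLI model, LLM judge, or string overlap).
import re

def grounding_check(answer: str, chunks: list[str], supported_by) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not any(supported_by(s, c) for c in chunks)]

# If grounding_check(...) returns anything, revise the answer or abstain.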

Evaluation Metrics

RAG-powered agents need systematic evaluation. Three core metrics matter:

| Metric | What It Measures | How to Assess |
| --- | --- | --- |
| Faithfulness | Does the answer only contain information from retrieved sources? | Check each claim against source chunks — penalize unsupported claims |
| Relevance | Are the retrieved documents relevant to the question? | Score how well retrieved chunks address the query |
| Completeness | Does the answer address all parts of the question? | Compare answer coverage against the full set of relevant information |
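
Faithfulness is often assessed with an LLM judge. A hedged sketch of such a check, where judge_llm is a placeholder callable assumed to reply with a bare number:

# Sketch: LLM-as-judge faithfulness check; judge_llm is a placeholder callable
# assumed to reply with only a number between 0 and 1.
FAITHFULNESS_PROMPT = """Given the source chunks and the answer below,
rate from 0 to 1 how much of the answer is supported by the sources.
Reply with only the number.

Sources:
{sources}

Answer:
{answer}"""

def faithfulness_score(answer: str, chunks: list[str], judge_llm) -> float:
    prompt = FAITHFULNESS_PROMPT.format(sources="\n\n".join(chunks), answer=answer)
    return float(judge_llm(prompt).strip())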

Additional evaluation dimensions for production systems:

  • Latency: How long does the full retrieve-and-generate pipeline take?
  • Cost: How many tokens are consumed per query (retrieval + generation)?
  • Robustness: Does quality degrade with ambiguous queries, adversarial inputs, or out-of-scope questions?

Interview tip: When discussing RAG system design, always mention evaluation. Interviewers want to see that you think about how to measure quality, not just how to build the system.

In the lab, you'll build a complete RAG-powered conversational agent with configurable chunking, agentic retrieval, memory management, and hallucination guards.
