RAG-Powered Agents & Memory Systems

Agent Memory & Retrieval Architectures

Why Agents Need Memory

A stateless LLM forgets everything between API calls. Without memory, an agent cannot maintain conversation context, recall factual knowledge from a company's documents, or adapt to a user's preferences over time. Memory transforms a stateless language model into a persistent, context-aware agent.

Agent memory serves three distinct purposes:

| Purpose | Example | What Breaks Without It |
|---|---|---|
| Conversation context | Remembering the user said "my project uses PostgreSQL" five messages ago | Agent asks the same clarifying questions repeatedly |
| Factual knowledge | Retrieving the company's refund policy from internal docs | Agent hallucinates policies or gives generic answers |
| Learned preferences | Knowing this user prefers concise code examples over verbose explanations | Agent cannot personalize interactions |

Memory Taxonomy

Agent memory systems are commonly described with a taxonomy borrowed from cognitive science: working, episodic, semantic, and procedural memory. Understanding these categories helps you design the right memory architecture for a given use case.

Working Memory (Context Window)

The LLM's context window is the agent's working memory. Everything the model can "see" at inference time — the system prompt, conversation history, retrieved documents, tool results — must fit within this window.

  • Capacity: Varies by model (e.g., 1M tokens for GPT-5.4, 1M tokens for Claude Sonnet 4.6, 1M tokens for Gemini 3.1 Pro)
  • Trade-off: The more of the window you fill, the higher the cost and latency per request
  • Key constraint: Context is finite, so you must decide what to include and what to leave out

Episodic Memory (Conversation History)

Episodic memory stores the sequence of past interactions — what the user said, what the agent did, what tools were called, and what results came back.

# Simple episodic memory structure
episodic_memory = [
    {"role": "user", "content": "Find all orders over $500"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"name": "query_orders", "args": {"min_amount": 500}}
    ]},
    {"role": "tool", "content": "Found 23 orders totaling $18,450"},
    {"role": "assistant", "content": "I found 23 orders over $500..."},
]

As conversations grow long, episodic memory must be managed — you cannot keep injecting the full history into every request indefinitely.

Semantic Memory (Knowledge Base)

Semantic memory stores factual knowledge the agent can retrieve on demand. This is where RAG (Retrieval-Augmented Generation) comes in: instead of stuffing all knowledge into the prompt, you retrieve only the relevant pieces for each query.

  • Storage: Vector databases (e.g., Pinecone, Weaviate, Chroma, pgvector), document stores
  • Retrieval: Embedding-based similarity search, keyword search, or hybrid approaches
  • Scope: Company docs, product catalogs, knowledge bases, code repositories
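
As a rough illustration of embedding-based similarity search, the sketch below ranks documents by cosine similarity to a query. It assumes a toy word-count "embedding" and an in-memory list in place of a real embedding model and vector database; `embed`, `cosine`, and `search` are hypothetical helper names, not part of any library.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

# Illustrative documents only
docs = [
    "Refund policy: customers may return items within 30 days",
    "Shipping takes 3 to 5 business days",
    "Our office closes on public holidays",
]
top = search("What is the refund policy?", docs, k=1)
```

A production system would swap `embed` for a learned embedding model and `search` for a vector-database query, but the ranking idea is the same.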

Procedural Memory (Learned Patterns)

Procedural memory captures how the agent should behave — patterns, workflows, and strategies it has learned. This can be implemented as:

  • System prompts with instructions and examples
  • Few-shot examples stored and retrieved dynamically
  • Fine-tuned model weights encoding specific behaviors

RAG Architectures for Agents

RAG is the primary mechanism for connecting agents to external knowledge. The architecture you choose has a significant impact on answer quality.

Naive RAG

The simplest approach: embed the query, retrieve the top-K most similar chunks, stuff them into the prompt.

User Query → Embed → Vector Search (top-K) → Stuff into Prompt → LLM → Answer

Limitations of naive RAG:

  • Single retrieval pass — if the first search misses relevant content, quality degrades
  • No query reformulation — the user's raw question may not match how information is stored
  • No verification — the agent cannot check whether retrieved content actually answers the question
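
The pipeline above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `retrieve` and `llm` are caller-supplied callables, and `fake_retrieve` and `fake_llm` are hypothetical stand-ins so the example runs without a vector store or model API.

```python
from typing import Callable

def build_prompt(query: str, chunks: list[str]) -> str:
    """Stuff numbered chunks into a single prompt string."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

def naive_rag(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    llm: Callable[[str], str],
    k: int = 3,
) -> str:
    """One retrieval pass and one generation call: no reformulation, no verification."""
    chunks = retrieve(query, k)
    return llm(build_prompt(query, chunks))

# Stand-ins (assumptions) so the sketch runs without external services.
corpus = ["Returns are accepted within 30 days.", "Shipping takes 3 to 5 days."]

def fake_retrieve(query: str, k: int) -> list[str]:
    """Rank by shared words; a real system would use vector search here."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus, key=lambda c: len(words & set(c.lower().split())), reverse=True
    )
    return ranked[:k]

def fake_llm(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"

answer = naive_rag("Are returns accepted?", fake_retrieve, fake_llm, k=1)
```

Note that nothing in `naive_rag` checks whether the retrieved chunks are useful, which is exactly the limitation listed above.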

Agentic RAG

Agentic RAG treats retrieval as a multi-step reasoning process. The agent actively plans, retrieves, evaluates, and iterates.

Query Planning: Before searching, the agent analyzes the question and may decompose it into sub-queries.

# Complex question requiring decomposition
user_query = "Compare our Q3 and Q4 revenue and explain the main drivers"

# Agent decomposes into sub-queries
sub_queries = [
    "Q3 revenue figures and breakdown",
    "Q4 revenue figures and breakdown",
    "Key business events between Q3 and Q4",
]
# Each sub-query is searched independently, results are combined

Self-Reflection: After retrieving documents, the agent evaluates whether the results are sufficient to answer the question.

# Self-reflection prompt (conceptual)
reflection_prompt = """
Given these retrieved documents: {retrieved_docs}
And the original question: {user_query}

Do these documents contain enough information to answer the question?
- If YES: proceed to generate the answer
- If NO: what additional information is needed? Reformulate the query.
"""
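
One way to wire a reflection step into a retrieval loop is sketched below. It assumes a `judge` callable that returns a (sufficient, reformulated query) pair, which in practice would be an LLM call using a prompt like the one above; `fake_retrieve`, `fake_judge`, and the `index` contents are hypothetical stand-ins.

```python
from typing import Callable

def retrieve_with_reflection(
    query: str,
    retrieve: Callable[[str], list[str]],
    judge: Callable[[str, list[str]], tuple[bool, str]],
    max_rounds: int = 3,
) -> tuple[list[str], bool]:
    """Retrieve, ask the judge whether the evidence suffices, and retry if not."""
    docs: list[str] = []
    current_query = query
    for _ in range(max_rounds):
        for doc in retrieve(current_query):
            if doc not in docs:  # Avoid accumulating duplicates across rounds
                docs.append(doc)
        sufficient, reformulated = judge(query, docs)
        if sufficient:
            return docs, True
        current_query = reformulated
    return docs, False  # Round budget exhausted; the caller should abstain or escalate

# Stand-ins (assumptions): a dict as the index, a simple rule as the judge.
index = {
    "revenue comparison": ["Q3 revenue summary"],
    "Q4 revenue": ["Q4 revenue summary"],
}

def fake_retrieve(q: str) -> list[str]:
    return index.get(q, [])

def fake_judge(question: str, docs: list[str]) -> tuple[bool, str]:
    has_q4 = any("Q4" in d for d in docs)
    return has_q4, "Q4 revenue"

docs, ok = retrieve_with_reflection("revenue comparison", fake_retrieve, fake_judge)
```

The `max_rounds` cap matters: without it, a judge that is never satisfied would loop indefinitely and run up cost.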

Multi-Hop Retrieval: Some questions require chaining multiple retrievals — the answer to the first search informs what to search for next.

Query: "Who manages the team that built feature X?"
  → Search 1: "feature X" → finds "Built by Team Alpha"
  → Search 2: "Team Alpha manager" → finds "Managed by Sarah Chen"
  → Answer: "Sarah Chen manages the team that built feature X"
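
The chain above can be expressed as a loop in which each result determines the next query. In this sketch, `facts` is a toy lookup table standing in for a search index, and `next_query` is a hypothetical rule standing in for the LLM step that reads a result and decides what to search for next.

```python
from typing import Optional

facts = {
    "feature X": "Built by Team Alpha",
    "Team Alpha manager": "Managed by Sarah Chen",
}

def search(query: str) -> str:
    """Toy exact-match lookup; a real system would run a retrieval query."""
    return facts.get(query, "")

def next_query(result: str) -> Optional[str]:
    """Derive a follow-up query from a result, or None when the chain is done."""
    prefix = "Built by "
    if result.startswith(prefix):
        return f"{result[len(prefix):]} manager"
    return None

def multi_hop(first_query: str, max_hops: int = 3) -> list[tuple[str, str]]:
    """Run chained searches and return the (query, result) trail."""
    trail: list[tuple[str, str]] = []
    query: Optional[str] = first_query
    for _ in range(max_hops):
        result = search(query)
        trail.append((query, result))
        query = next_query(result)
        if query is None:
            break
    return trail

trail = multi_hop("feature X")
```

Returning the full trail, rather than only the final result, keeps the intermediate evidence available for source attribution.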

Choosing the Right Architecture

| Scenario | Recommended Approach | Why |
|---|---|---|
| Simple FAQ lookup | Naive RAG | Questions map directly to stored answers |
| Complex analytical questions | Agentic RAG with query decomposition | Multiple pieces of information needed |
| Questions requiring reasoning across documents | Multi-hop retrieval | Answer depends on chaining facts |
| High-stakes applications (legal, medical) | Agentic RAG with self-reflection | Must verify retrieval quality before answering |

Chunking Strategies

How you split documents into chunks directly affects retrieval quality. The right strategy depends on your content type and query patterns.

| Strategy | How It Works | Best For | Drawback |
|---|---|---|---|
| Fixed-size | Split every N tokens with overlap | Uniform content (logs, transcripts) | Cuts mid-sentence, breaks context |
| Recursive | Split by paragraphs, then sentences, then tokens | Structured documents (articles, docs) | Requires tuning separators per content type |
| Semantic | Group sentences by embedding similarity | Mixed-topic documents | Computationally expensive at ingestion time |
| Document-aware | Split by headings, sections, or logical boundaries | Structured formats (Markdown, HTML, code) | Requires content-specific parsing logic |

Key principle: Chunks should be self-contained enough that they make sense without surrounding context, but small enough to be precise. A common starting point is 256-512 tokens with 10-20% overlap.
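
A minimal sketch of fixed-size chunking with overlap follows. It assumes whitespace-separated words as a stand-in for model tokens; a real pipeline would count tokens with the tokenizer of the embedding model in use. `chunk_fixed` is a hypothetical helper name.

```python
def chunk_fixed(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words
    with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # The last chunk already reaches the end of the text
    return chunks

sample = " ".join(str(i) for i in range(10))
chunks = chunk_fixed(sample, chunk_size=4, overlap=1)
# chunks == ["0 1 2 3", "3 4 5 6", "6 7 8 9"]
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary is at least partly present in both chunks.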

Context Window Management

When an agent has conversation history, retrieved documents, system instructions, and tool results, the context window fills up fast. You need strategies to manage what goes in.

Summarization

Compress older conversation turns into a summary, keeping recent messages verbatim.

# Context management strategy
# Assumes system_prompt and conversation_summary are single messages,
# and recent_messages and retrieved_documents are lists.
context = []
context.append(system_prompt)              # Fixed: ~500 tokens
context.append(conversation_summary)       # Compressed: ~200 tokens
context.extend(recent_messages[-5:])       # Verbatim: ~1000 tokens
context.extend(retrieved_documents[:3])    # Top-3 chunks: ~1500 tokens
# Total: ~3200 tokens, which fits comfortably within budget
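
Producing the summary itself might look like the sketch below. It assumes a `summarize` callable, which in practice would be an LLM call; `fake_summarize` is a hypothetical stand-in so the example runs on its own.

```python
from typing import Callable

def compact_history(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 5,
) -> tuple[str, list[str]]:
    """Summarize everything except the last `keep_recent` messages."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(messages) <= keep_recent:
        return "", list(messages)  # Nothing old enough to compress
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return summarize(older), recent

def fake_summarize(older: list[str]) -> str:
    """Stand-in for an LLM summarization call."""
    return f"Summary of {len(older)} earlier messages"

history = [f"message {i}" for i in range(8)]
summary, recent = compact_history(history, fake_summarize, keep_recent=5)
```

Summaries are lossy, so details the user stated early (such as hard requirements) may need to be preserved separately.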

Sliding Window

Keep only the last N messages in the context, dropping older ones.

  • Advantage: Simple to implement, predictable token usage
  • Disadvantage: Loses early context that may still be relevant
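
A sliding window takes only a few lines, as in this sketch. Pinning system messages so they are never dropped is an assumption added here, not something the strategy requires, and `sliding_window` is a hypothetical helper name.

```python
def sliding_window(messages: list[dict], max_messages: int = 4) -> list[dict]:
    """Keep system messages plus the last `max_messages` non-system messages."""
    if max_messages < 0:
        raise ValueError("max_messages must be non-negative")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages else []
    return system + recent

history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(6)
]
trimmed = sliding_window(history, max_messages=2)
```

Here `trimmed` holds the system message followed by turns 4 and 5; turns 0 through 3 are gone, which is the disadvantage noted above.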

Importance-Based Pruning

Score each piece of context by relevance to the current query and drop low-scoring items first.

  • Messages where the user stated key requirements: high importance
  • Small-talk or acknowledgment messages: low importance
  • Tool results from earlier steps: medium importance (summarize if needed)
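
The pruning rule can be sketched as follows, assuming each context item already carries an importance `score` and a `tokens` count. How those scores are produced (an LLM rating, embedding similarity to the current query, or heuristics) is left open; the values below are illustrative.

```python
def prune(items: list[dict], budget: int) -> list[dict]:
    """Drop the lowest-scoring items until the total token count fits the budget.
    The original ordering of the surviving items is preserved."""
    kept = list(items)
    while kept and sum(item["tokens"] for item in kept) > budget:
        kept.remove(min(kept, key=lambda item: item["score"]))
    return kept

items = [
    {"text": "We must support PostgreSQL 14", "score": 0.9, "tokens": 50},
    {"text": "Thanks, sounds good!", "score": 0.1, "tokens": 20},
    {"text": "Tool result: 23 orders found", "score": 0.5, "tokens": 40},
]
kept = prune(items, budget=90)
```

With a budget of 90 tokens, only the small-talk message is dropped; the requirement and the tool result stay in their original order.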

Token Budget Allocation

Allocate your context window into explicit budgets:

| Component | Budget | Example (8K total) |
|---|---|---|
| System prompt | 10-15% | ~1000 tokens |
| Conversation memory | 20-30% | ~2000 tokens |
| Retrieved documents | 40-50% | ~3500 tokens |
| Reserved for output | 15-20% | ~1500 tokens |
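
The example column can be reproduced with a small helper. This sketch assumes an 8,000-token total and shares chosen to match the table; `allocate_budget` and `fits` are hypothetical names.

```python
def allocate_budget(total_tokens: int, shares: dict[str, float]) -> dict[str, int]:
    """Convert fractional shares into per-component token caps."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1.0")
    return {name: int(total_tokens * share) for name, share in shares.items()}

def fits(used: dict[str, int], budget: dict[str, int]) -> bool:
    """Check that every component stays within its cap."""
    return all(used.get(name, 0) <= cap for name, cap in budget.items())

shares = {
    "system_prompt": 0.125,
    "conversation_memory": 0.25,
    "retrieved_documents": 0.4375,
    "reserved_for_output": 0.1875,
}
budget = allocate_budget(8000, shares)
# budget == {"system_prompt": 1000, "conversation_memory": 2000,
#            "retrieved_documents": 3500, "reserved_for_output": 1500}
```

Checking `fits` before each request makes it explicit which component has to be trimmed when the context grows.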

Source Attribution and Hallucination Prevention

Agents that retrieve knowledge must attribute their answers to specific sources. Without attribution, users cannot verify claims and trust erodes.

Source Attribution

Each claim in the agent's response should link back to a specific chunk:

# Attribution structure
attribution = {
    "claim": "The refund policy allows returns within 30 days",
    "source_chunk_id": "policy-doc-chunk-42",
    "source_document": "refund-policy-v3.pdf",
    "confidence": 0.92,
    "relevant_excerpt": "Customers may return items within 30 calendar days..."
}

Hallucination Prevention

The agent should only make claims supported by retrieved content. Common strategies:

  • Grounding check: Compare each sentence in the response against retrieved chunks
  • Abstain when unsure: If no retrieved content addresses the question, say "I don't have information about that" rather than guessing
  • Quote directly: Include relevant excerpts from source documents
  • Confidence scoring: Assign confidence scores and flag low-confidence claims for human review
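
A crude grounding check is sketched below. It assumes word overlap as a proxy for "supported by the sources", which is a simplification: production systems more often use an entailment model or an LLM judge. The `threshold` value and helper names are illustrative assumptions.

```python
import re

def words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unsupported_sentences(
    answer: str, chunks: list[str], threshold: float = 0.6
) -> list[str]:
    """Return answer sentences whose best word overlap with any chunk is below threshold."""
    chunk_words = [words(c) for c in chunks]
    flagged: list[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        sentence_words = words(sentence)
        if not sentence_words:
            continue
        best = max(
            (len(sentence_words & cw) / len(sentence_words) for cw in chunk_words),
            default=0.0,
        )
        if best < threshold:
            flagged.append(sentence)
    return flagged

chunks = ["Customers may return items within 30 calendar days of purchase."]
answer = (
    "Customers may return items within 30 days. "
    "Refunds are issued in store credit only."
)
flagged = unsupported_sentences(answer, chunks)
```

The second sentence is flagged because none of its words appear in the retrieved chunk. Flagged sentences can be removed, rewritten, or routed to human review.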

Evaluation Metrics

RAG-powered agents need systematic evaluation. Three core metrics matter:

| Metric | What It Measures | How to Assess |
|---|---|---|
| Faithfulness | Does the answer only contain information from retrieved sources? | Check each claim against source chunks; penalize unsupported claims |
| Relevance | Are the retrieved documents relevant to the question? | Score how well retrieved chunks address the query |
| Completeness | Does the answer address all parts of the question? | Compare answer coverage against the full set of relevant information |
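
Once a human reviewer or an LLM judge has labeled an example, the three metrics reduce to simple ratios. The sketch below assumes boolean judgments per claim, per chunk, and per question part; `JudgedExample` and `score` are hypothetical names, and scoring an empty list as 0.0 is a design choice rather than a standard.

```python
from dataclasses import dataclass

@dataclass
class JudgedExample:
    claims_supported: list[bool]  # One flag per claim in the answer
    chunks_relevant: list[bool]   # One flag per retrieved chunk
    parts_addressed: list[bool]   # One flag per part of the question

def rate(flags: list[bool]) -> float:
    """Fraction of True flags; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0

def score(example: JudgedExample) -> dict[str, float]:
    """Compute the three core metrics for one judged example."""
    return {
        "faithfulness": rate(example.claims_supported),
        "relevance": rate(example.chunks_relevant),
        "completeness": rate(example.parts_addressed),
    }

example = JudgedExample(
    claims_supported=[True, True, False, True],
    chunks_relevant=[True, False],
    parts_addressed=[True, True],
)
metrics = score(example)
# metrics == {"faithfulness": 0.75, "relevance": 0.5, "completeness": 1.0}
```

Averaging these scores over a fixed evaluation set gives a baseline for comparing chunking, retrieval, or prompt changes.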

Additional evaluation dimensions for production systems:

  • Latency: How long does the full retrieve-and-generate pipeline take?
  • Cost: How many tokens are consumed per query (retrieval + generation)?
  • Robustness: Does quality degrade with ambiguous queries, adversarial inputs, or out-of-scope questions?

Interview tip: When discussing RAG system design, always mention evaluation. Interviewers want to see that you think about how to measure quality, not just how to build the system.

In the lab, you'll build a complete RAG-powered conversational agent with configurable chunking, agentic retrieval, memory management, and hallucination guards.
