Agent Memory & Retrieval Architectures

Why Agents Need Memory

A stateless LLM forgets everything between API calls. Without memory, an agent cannot maintain conversation context, recall factual knowledge from a company's documents, or adapt to a user's preferences over time. Memory transforms a stateless language model into a persistent, context-aware agent.

Agent memory serves three distinct purposes:

Purpose	Example	What Breaks Without It
Conversation context	Remembering the user said "my project uses PostgreSQL" five messages ago	Agent asks the same clarifying questions repeatedly
Factual knowledge	Retrieving the company's refund policy from internal docs	Agent hallucinates policies or gives generic answers
Learned preferences	Knowing this user prefers concise code examples over verbose explanations	Agent cannot personalize interactions

Memory Taxonomy

Agent memory is commonly described with a taxonomy borrowed from cognitive science: working, episodic, semantic, and procedural memory. Understanding these categories helps you design the right memory architecture for a given use case.

Working Memory (Context Window)

The LLM's context window is the agent's working memory. Everything the model can "see" at inference time — the system prompt, conversation history, retrieved documents, tool results — must fit within this window.

Capacity: Varies by model (e.g., 128K tokens for GPT-4o, 200K tokens for Claude)
Trade-off: Larger context windows increase cost and latency per request
Key constraint: Context is finite, so you must decide what to include and what to leave out
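
A quick way to reason about the "does it fit" question is to estimate token counts before sending a request. The sketch below is illustrative only: `estimate_tokens` uses a rough 4-characters-per-token heuristic for English text (an assumption, not a real tokenizer), and `fits_in_window` is a hypothetical helper name.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def fits_in_window(parts: list[str], window_tokens: int, reserved_for_output: int) -> bool:
    """Return True if all prompt parts fit while leaving room for the reply."""
    used = sum(estimate_tokens(part) for part in parts)
    return used + reserved_for_output <= window_tokens


parts = [
    "You are a helpful assistant.",   # system prompt
    "My project uses PostgreSQL.",    # conversation history
    "x" * 4000,                       # stand-in for retrieved documents
]
print(fits_in_window(parts, window_tokens=8000, reserved_for_output=1500))
print(fits_in_window(parts, window_tokens=2000, reserved_for_output=1500))
```

In practice you would swap the heuristic for the tokenizer that matches your model, since token counts differ between model families.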

Episodic Memory (Conversation History)

Episodic memory stores the sequence of past interactions — what the user said, what the agent did, what tools were called, and what results came back.

# Simple episodic memory structure
episodic_memory = [
    {"role": "user", "content": "Find all orders over $500"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"name": "query_orders", "args": {"min_amount": 500}}
    ]},
    {"role": "tool", "content": "Found 23 orders totaling $18,450"},
    {"role": "assistant", "content": "I found 23 orders over $500..."},
]

As conversations grow long, episodic memory must be managed — you cannot keep injecting the full history into every request indefinitely.

Semantic Memory (Knowledge Base)

Semantic memory stores factual knowledge the agent can retrieve on demand. This is where RAG (Retrieval-Augmented Generation) comes in: instead of stuffing all knowledge into the prompt, you retrieve only the relevant pieces for each query.

Storage: Vector databases (e.g., Pinecone, Weaviate, Chroma, pgvector), document stores
Retrieval: Embedding-based similarity search, keyword search, or hybrid approaches
Scope: Company docs, product catalogs, knowledge bases, code repositories
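
To make embedding-based similarity search concrete, here is a minimal in-memory sketch. It assumes a toy `embed` function based on word counts; a real system would call an embedding model and store vectors in one of the databases listed above. The class name `SemanticMemory` and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. Stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticMemory:
    """Stores (text, vector) pairs and returns the k most similar texts."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = SemanticMemory()
memory.add("Refund policy: customers may return items within 30 days.")
memory.add("Shipping: orders ship within 2 business days.")
memory.add("Support hours are 9am to 5pm on weekdays.")
top = memory.search("What is the refund policy for returned items?", k=1)
print(top)
```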

Procedural Memory (Learned Patterns)

Procedural memory captures how the agent should behave — patterns, workflows, and strategies it has learned. This can be implemented as:

System prompts with instructions and examples
Few-shot examples stored and retrieved dynamically
Fine-tuned model weights encoding specific behaviors
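
The second option, dynamically retrieved few-shot examples, might look like the sketch below. It is a simplified illustration: examples are ranked by Jaccard word overlap rather than embeddings, and `FEW_SHOT_STORE`, `select_examples`, and `build_prompt` are hypothetical names.

```python
import re

# Hypothetical store of worked examples, each tagged with the task it demonstrates.
FEW_SHOT_STORE = [
    {"task": "write a sql query to count orders",
     "example": "Q: Count orders\nA: SELECT COUNT(*) FROM orders;"},
    {"task": "summarize a support ticket",
     "example": "Q: Summarize ticket\nA: Customer reports login failure."},
    {"task": "write a sql query to list customers",
     "example": "Q: List customers\nA: SELECT * FROM customers;"},
]


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two word sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(request: str, k: int = 2) -> list[str]:
    """Pick the k stored examples whose task description best matches the request."""
    req = words(request)
    ranked = sorted(FEW_SHOT_STORE, key=lambda e: jaccard(req, words(e["task"])), reverse=True)
    return [e["example"] for e in ranked[:k]]


def build_prompt(instructions: str, request: str, k: int = 2) -> str:
    """Combine fixed instructions with the selected examples and the new request."""
    shots = "\n\n".join(select_examples(request, k))
    return f"{instructions}\n\n{shots}\n\nQ: {request}\nA:"


prompt = build_prompt("You write SQL.", "Write a SQL query to list orders")
print(prompt)
```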

RAG Architectures for Agents

RAG is the primary mechanism for connecting agents to external knowledge. The architecture you choose has a significant impact on answer quality.

Naive RAG

The simplest approach: embed the query, retrieve the top-K most similar chunks, and stuff them into the prompt.

User Query → Embed → Vector Search (top-K) → Stuff into Prompt → LLM → Answer

Limitations of naive RAG:

Single retrieval pass — if the first search misses relevant content, quality degrades
No query reformulation — the user's raw question may not match how information is stored
No verification — the agent cannot check whether retrieved content actually answers the question
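
The single-pass pipeline can be sketched in a few lines. This is an illustration under stated assumptions: `retrieve` and `generate` are caller-supplied callables standing in for a vector store and an LLM client, and the stub implementations exist only so the example runs.

```python
from typing import Callable


def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Stuff numbered chunks into a prompt ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def naive_rag(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    chunks = retrieve(query, k)          # one retrieval pass with the raw query
    prompt = build_rag_prompt(query, chunks)
    return generate(prompt)              # no reformulation, no verification


# Stubs for illustration only; they ignore relevance entirely.
def fake_retrieve(query: str, k: int) -> list[str]:
    corpus = ["Returns are accepted within 30 days.", "Orders ship in 2 business days."]
    return corpus[:k]


def fake_generate(prompt: str) -> str:
    return f"(model reply to a {len(prompt)}-character prompt)"


print(naive_rag("What is the refund policy?", fake_retrieve, fake_generate, k=2))
```

Each limitation listed above corresponds to a step this function does not have: there is no loop, no query rewriting, and no check on `chunks` before generation.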

Agentic RAG

Agentic RAG treats retrieval as a multi-step reasoning process. The agent actively plans, retrieves, evaluates, and iterates.

Query Planning: Before searching, the agent analyzes the question and may decompose it into sub-queries.

# Complex question requiring decomposition
user_query = "Compare our Q3 and Q4 revenue and explain the main drivers"

# Agent decomposes into sub-queries
sub_queries = [
    "Q3 revenue figures and breakdown",
    "Q4 revenue figures and breakdown",
    "Key business events between Q3 and Q4",
]
# Each sub-query is searched independently, results are combined
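
Searching each sub-query and combining the results could be done as in the following sketch. The `search` callable and the `FAKE_INDEX` contents are placeholders for a real retriever; the only logic shown is the merge with de-duplication.

```python
from typing import Callable


def retrieve_for_sub_queries(
    sub_queries: list[str],
    search: Callable[[str, int], list[str]],
    k: int = 2,
) -> list[str]:
    """Run each sub-query independently and merge results, dropping duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for sub_query in sub_queries:
        for chunk in search(sub_query, k):
            if chunk not in seen:       # the same chunk may match several sub-queries
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Hypothetical index standing in for a real vector store.
FAKE_INDEX = {
    "Q3 revenue figures and breakdown": ["Q3 revenue summary", "FY overview"],
    "Q4 revenue figures and breakdown": ["Q4 revenue summary", "FY overview"],
    "Key business events between Q3 and Q4": ["Product launch notes"],
}


def fake_search(query: str, k: int) -> list[str]:
    return FAKE_INDEX.get(query, [])[:k]


combined = retrieve_for_sub_queries(list(FAKE_INDEX), fake_search)
print(combined)
```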

Self-Reflection: After retrieving documents, the agent evaluates whether the results are sufficient to answer the question.
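
The control loop around this evaluation step can be sketched as follows. It is a minimal illustration, not a reference implementation: `judge` stands in for an LLM call that uses a reflection prompt and returns a verdict plus an optional reformulated query, and `max_rounds` is an assumed safety limit to prevent endless retries.

```python
from typing import Callable, Optional


def retrieve_with_reflection(
    query: str,
    search: Callable[[str], list[str]],
    judge: Callable[[str, list[str]], tuple[bool, Optional[str]]],
    max_rounds: int = 3,
) -> tuple[list[str], bool]:
    """Retrieve, ask the judge if the docs suffice, and retry with a new query if not."""
    docs: list[str] = []
    current = query
    for _ in range(max_rounds):
        for doc in search(current):
            if doc not in docs:
                docs.append(doc)
        sufficient, reformulated = judge(query, docs)  # always judged against the original question
        if sufficient:
            return docs, True
        if not reformulated:
            break
        current = reformulated
    return docs, False


# Stubs for illustration only.
def fake_search(q: str) -> list[str]:
    index = {
        "revenue drivers": ["Q3 revenue summary"],
        "Q4 revenue summary": ["Q4 revenue summary"],
    }
    return index.get(q, [])


def fake_judge(question: str, docs: list[str]) -> tuple[bool, Optional[str]]:
    if any("Q4" in d for d in docs):
        return True, None
    return False, "Q4 revenue summary"


docs, ok = retrieve_with_reflection("revenue drivers", fake_search, fake_judge)
print(docs, ok)
```

Returning the boolean alongside the documents lets the caller abstain when the loop ends without sufficient evidence.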

# Self-reflection prompt (conceptual)
reflection_prompt = """
Given these retrieved documents: {retrieved_docs}
And the original question: {user_query}

Do these documents contain enough information to answer the question?
- If YES: proceed to generate the answer
- If NO: what additional information is needed? Reformulate the query.
"""

Multi-Hop Retrieval: Some questions require chaining multiple retrievals — the answer to the first search informs what to search for next.

Query: "Who manages the team that built feature X?"
  → Search 1: "feature X" → finds "Built by Team Alpha"
  → Search 2: "Team Alpha manager" → finds "Managed by Sarah Chen"
  → Answer: "Sarah Chen manages the team that built feature X"
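
The chain above can be expressed as a small loop in which each retrieved fact produces the next query. In this sketch `next_query` is a hypothetical stand-in for the LLM step that reads a fact and decides what to look up next, and `FAKE_FACTS` mirrors the example rather than any real data source.

```python
from typing import Callable, Optional


def multi_hop(
    first_query: str,
    search: Callable[[str], Optional[str]],
    next_query: Callable[[str], Optional[str]],
    max_hops: int = 3,
) -> list[str]:
    """Follow a chain of lookups, collecting the fact found at each hop."""
    facts: list[str] = []
    query: Optional[str] = first_query
    for _ in range(max_hops):
        if query is None:
            break
        fact = search(query)
        if fact is None:            # dead end: nothing retrieved for this hop
            break
        facts.append(fact)
        query = next_query(fact)    # None means no further hop is needed
    return facts


FAKE_FACTS = {
    "feature X": "Built by Team Alpha",
    "Team Alpha manager": "Managed by Sarah Chen",
}


def fake_next_query(fact: str) -> Optional[str]:
    prefix = "Built by "
    if fact.startswith(prefix):
        return f"{fact[len(prefix):]} manager"
    return None


trail = multi_hop("feature X", FAKE_FACTS.get, fake_next_query)
print(trail)
```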

Choosing the Right Architecture

Scenario	Recommended Approach	Why
Simple FAQ lookup	Naive RAG	Questions map directly to stored answers
Complex analytical questions	Agentic RAG with query decomposition	Multiple pieces of information needed
Questions requiring reasoning across documents	Multi-hop retrieval	Answer depends on chaining facts
High-stakes applications (legal, medical)	Agentic RAG with self-reflection	Must verify retrieval quality before answering

Chunking Strategies

How you split documents into chunks directly affects retrieval quality. The right strategy depends on your content type and query patterns.

Strategy	How It Works	Best For	Drawback
Fixed-size	Split every N tokens with overlap	Uniform content (logs, transcripts)	Cuts mid-sentence, breaks context
Recursive	Split by paragraphs, then sentences, then tokens	Structured documents (articles, docs)	Requires tuning separators per content type
Semantic	Group sentences by embedding similarity	Mixed-topic documents	Computationally expensive at ingestion time
Document-aware	Split by headings, sections, or logical boundaries	Structured formats (Markdown, HTML, code)	Requires content-specific parsing logic

Key principle: Chunks should be self-contained enough that they make sense without surrounding context, but small enough to be precise. A common starting point is 256-512 tokens with 10-20% overlap.
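
As an illustration of the fixed-size strategy with overlap, the sketch below splits a token list into windows that share a few tokens with their neighbors. It assumes the text is already tokenized (whitespace-split words serve as a stand-in for model tokens), and `chunk_fixed` is a hypothetical helper name.

```python
def chunk_fixed(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # last window already reaches the end
            break
    return chunks


tokens = [f"w{i}" for i in range(10)]
for chunk in chunk_fixed(tokens, size=4, overlap=1):
    print(chunk)
```

With a 512-token chunk, a 10-20% overlap corresponds to roughly 51-102 shared tokens between neighbors.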

Context Window Management

When an agent has conversation history, retrieved documents, system instructions, and tool results, the context window fills up fast. You need strategies to manage what goes in.

Summarization

Compress older conversation turns into a summary, keeping recent messages verbatim.

# Context management strategy: assemble the prompt from budgeted parts
context = []
context.append(system_prompt)               # Fixed: ~500 tokens
context.append(conversation_summary)        # Compressed: ~200 tokens
context.extend(recent_messages[-5:])        # Verbatim, last 5 messages: ~1000 tokens
context.extend(retrieved_documents[:3])     # Top-3 chunks: ~1500 tokens
# Total: ~3200 tokens, which fits comfortably within budget
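
The step that produces the summary can be sketched as below. This is an assumption-laden illustration: `summarize` is a caller-supplied callable (in practice an LLM call), `compact_history` is a hypothetical name, and the stub summarizer only reports how many messages it received.

```python
from typing import Callable

Message = dict[str, str]  # e.g. {"role": "user", "content": "..."}


def compact_history(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 5,
) -> tuple[str, list[Message]]:
    """Return (summary of older turns, recent turns kept verbatim)."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(messages) <= keep_recent:
        return "", list(messages)          # nothing old enough to compress
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return summarize(older), recent


def fake_summarize(older: list[Message]) -> str:
    # Stand-in for an LLM summarization call.
    return f"Summary of {len(older)} earlier messages"


history = [{"role": "user", "content": f"message {i}"} for i in range(8)]
summary, recent = compact_history(history, fake_summarize, keep_recent=5)
print(summary, len(recent))
```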

Sliding Window

Keep only the last N messages in the context, dropping older ones.

Advantage: Simple to implement, predictable token usage
Disadvantage: Loses early context that may still be relevant
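
A sliding window maps naturally onto a bounded queue. The sketch below uses Python's `collections.deque` with `maxlen`, which discards the oldest item automatically when a new one is appended; the class name `SlidingWindowMemory` is illustrative.

```python
from collections import deque


class SlidingWindowMemory:
    """Keeps only the most recent `max_messages` messages."""

    def __init__(self, max_messages: int) -> None:
        self._messages: deque[dict[str, str]] = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        # When the deque is full, appending drops the oldest message.
        self._messages.append({"role": role, "content": content})

    def window(self) -> list[dict[str, str]]:
        return list(self._messages)


memory = SlidingWindowMemory(max_messages=3)
for i in range(5):
    memory.add("user", f"message {i}")
print([m["content"] for m in memory.window()])
```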

Importance-Based Pruning

Score each piece of context by relevance to the current query and drop low-scoring items first.

Messages where the user stated key requirements: high importance
Small-talk or acknowledgment messages: low importance
Tool results from earlier steps: medium importance (summarize if needed)
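
One possible shape for pruning is shown below: each context item carries a precomputed importance score and token count, and the lowest-scoring items are dropped until the total fits. How scores are produced (for example by an LLM or an embedding similarity) is outside this sketch, and the field names and sample values are assumptions.

```python
def prune_by_importance(items: list[dict], budget: int) -> list[dict]:
    """Drop lowest-scoring items until total tokens <= budget, preserving original order."""
    total = sum(item["tokens"] for item in items)
    keep = set(range(len(items)))
    for idx in sorted(keep, key=lambda i: items[i]["score"]):  # lowest score first
        if total <= budget:
            break
        keep.discard(idx)
        total -= items[idx]["tokens"]
    return [items[i] for i in sorted(keep)]


context_items = [
    {"text": "User requires PostgreSQL 15 support", "score": 0.9, "tokens": 300},
    {"text": "Thanks, sounds good!", "score": 0.1, "tokens": 50},
    {"text": "Tool result: schema dump", "score": 0.5, "tokens": 400},
]
kept = prune_by_importance(context_items, budget=700)
print([item["text"] for item in kept])
```

A variant would summarize medium-importance items instead of dropping them, as suggested for tool results above.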

Token Budget Allocation

Allocate your context window into explicit budgets:

Component	Budget	Example (8K total)
System prompt	10-15%	~1000 tokens
Conversation memory	20-30%	~2000 tokens
Retrieved documents	40-50%	~3500 tokens
Reserved for output	15-20%	~1500 tokens
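
The table's example column can be reproduced with a small helper. The specific shares below are assumptions chosen from within each range so that they sum to 100% and match the 8K example; `allocate_budget` is a hypothetical function name.

```python
def allocate_budget(total_tokens: int, shares: dict[str, float]) -> dict[str, int]:
    """Split a context window into per-component token budgets."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1.0")
    return {name: int(total_tokens * share) for name, share in shares.items()}


# Assumed shares, each inside the ranges from the table above.
SHARES = {
    "system_prompt": 0.125,         # 10-15%
    "conversation_memory": 0.25,    # 20-30%
    "retrieved_documents": 0.4375,  # 40-50%
    "reserved_for_output": 0.1875,  # 15-20%
}

budget = allocate_budget(8000, SHARES)
print(budget)
```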

Source Attribution and Hallucination Prevention

Agents that retrieve knowledge must attribute their answers to specific sources. Without attribution, users cannot verify claims and trust erodes.

Source Attribution

Each claim in the agent's response should link back to a specific chunk:

# Attribution structure
attribution = {
    "claim": "The refund policy allows returns within 30 days",
    "source_chunk_id": "policy-doc-chunk-42",
    "source_document": "refund-policy-v3.pdf",
    "confidence": 0.92,
    "relevant_excerpt": "Customers may return items within 30 calendar days..."
}

Hallucination Prevention

The agent should only make claims supported by retrieved content. Common strategies:

Grounding check: Compare each sentence in the response against retrieved chunks
Abstain when unsure: If no retrieved content addresses the question, say "I don't have information about that" rather than guessing
Quote directly: Include relevant excerpts from source documents
Confidence scoring: Assign confidence scores and flag low-confidence claims for human review
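
A very simple grounding check can be sketched with lexical overlap. This is a crude illustration, not a production method: it flags any answer sentence whose content words mostly do not appear in a single retrieved chunk. The 0.5 threshold and the "words longer than three characters" filter are arbitrary assumptions; real systems often use an entailment model or an LLM judge instead.

```python
import re


def content_words(text: str) -> set[str]:
    """Lowercased words longer than 3 characters, as a rough proxy for content words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def unsupported_sentences(answer: str, chunks: list[str], threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose best overlap with any chunk falls below the threshold."""
    chunk_words = [content_words(chunk) for chunk in chunks]
    flagged: list[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        best = max((len(words & cw) / len(words) for cw in chunk_words), default=0.0)
        if best < threshold:
            flagged.append(sentence)
    return flagged


chunks = ["Customers may return items within 30 calendar days of purchase."]
answer = "Customers may return items within 30 days. Refunds are issued in store credit only."
print(unsupported_sentences(answer, chunks))
```

Flagged sentences can then be removed, rewritten with a direct quote, or routed to human review.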

Evaluation Metrics

RAG-powered agents need systematic evaluation. Three core metrics matter:

Metric	What It Measures	How to Assess
Faithfulness	Does the answer only contain information from retrieved sources?	Check each claim against source chunks — penalize unsupported claims
Relevance	Are the retrieved documents relevant to the question?	Score how well retrieved chunks address the query
Completeness	Does the answer address all parts of the question?	Compare answer coverage against the full set of relevant information
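
Once each claim, chunk, and question part has been judged (by a human or an LLM grader), the three metrics reduce to simple fractions. The sketch below assumes those per-item judgments already exist as booleans; the `EvalRecord` structure and field names are illustrative, not a standard schema.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    claims_supported: list[bool]   # one flag per claim in the answer
    chunks_relevant: list[bool]    # one flag per retrieved chunk
    parts_answered: list[bool]     # one flag per part of the question


def fraction_true(flags: list[bool]) -> float:
    return sum(flags) / len(flags) if flags else 0.0


def score(record: EvalRecord) -> dict[str, float]:
    return {
        "faithfulness": fraction_true(record.claims_supported),
        "relevance": fraction_true(record.chunks_relevant),
        "completeness": fraction_true(record.parts_answered),
    }


record = EvalRecord(
    claims_supported=[True, True, False, True],
    chunks_relevant=[True, False],
    parts_answered=[True, True],
)
print(score(record))
```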

Additional evaluation dimensions for production systems:

Latency: How long does the full retrieve-and-generate pipeline take?
Cost: How many tokens are consumed per query (retrieval + generation)?
Robustness: Does quality degrade with ambiguous queries, adversarial inputs, or out-of-scope questions?

Interview tip: When discussing RAG system design, always mention evaluation. Interviewers want to see that you think about how to measure quality, not just how to build the system.

In the lab, you'll build a complete RAG-powered conversational agent with configurable chunking, agentic retrieval, memory management, and hallucination guards.