# Context Window Management
Every LLM has a finite context window—the maximum number of tokens it can process at once. Managing this window is crucial for building effective agents.
## Understanding Context Windows
| Model | Context Window | ~Word Equivalent |
|---|---|---|
| GPT-4 Turbo | 128K tokens | ~96,000 words |
| Claude 3.5 | 200K tokens | ~150,000 words |
| GPT-4o | 128K tokens | ~96,000 words |
| Gemini 1.5 Pro | 2M tokens | ~1.5M words |
**Note:** A longer context window does not automatically mean better performance. Models often struggle to recall information buried in the middle of long contexts (the "lost in the middle" problem).
## Token Economics
Every API call costs tokens. A typical agent conversation includes:

- System prompt: ~500-2000 tokens
- Conversation history: variable
- Tool definitions: ~100-500 tokens per tool
- Tool results: variable
- Response: ~200-2000 tokens
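The snippets below assume a `count_tokens` helper. Here is a minimal sketch using OpenAI's `tiktoken` library; the per-message overhead is an approximation, since the exact formatting tokens vary by model:

```python
import tiktoken

def count_tokens(messages, encoding_name="cl100k_base"):
    """Roughly estimate the token count of a chat history."""
    enc = tiktoken.get_encoding(encoding_name)
    total = 0
    for msg in messages:
        total += 4  # approximate per-message formatting overhead (assumption)
        total += len(enc.encode(msg["content"]))
    return total
```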
## Managing Context Efficiently
### 1. Summarization
```python
def manage_context(messages, max_tokens=50000):
    """Summarize older messages once the conversation exceeds a token budget."""
    total_tokens = count_tokens(messages)
    if total_tokens > max_tokens:
        # Summarize everything except the 10 most recent messages
        old_messages = messages[:-10]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
        summary = llm.generate(f"Summarize this conversation:\n{transcript}")
        # Replace the older messages with a single summary message
        return [
            {"role": "system", "content": f"Previous context: {summary}"},
            *messages[-10:],
        ]
    return messages
```
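Two design choices are worth noting: the 10 most recent messages are kept verbatim because they are most likely to matter for the next turn, and summarization is lossy, so anything the agent must recall exactly (IDs, file paths, constraints) belongs in the system prompt rather than in the summary. The summarization call itself is a good candidate for a smaller, cheaper model (see Best Practices below).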
### 2. Sliding Window
Keep only the most recent messages:
```python
def sliding_window(messages, window_size=20):
    """Keep only the most recent messages, always preserving the system prompt."""
    if len(messages) > window_size:
        # Never drop the system message, even when it falls outside the window
        system = messages[0] if messages[0]["role"] == "system" else None
        recent = messages[-window_size:]
        return [system, *recent] if system else recent
    return messages
```
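The sliding window is the cheapest strategy: no extra LLM calls and a predictable context size. The tradeoff is that everything outside the window is forgotten outright, so it works best when only recent turns matter, or in combination with the summarization approach above.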
### 3. Selective Retrieval
Only include relevant past context:
```python
def selective_context(messages, current_query, threshold=0.7):
    """Include only past messages semantically related to the current query."""
    # Embed the current query once
    query_embedding = embed(current_query)
    # Keep messages whose content is similar enough to the query
    relevant = []
    for msg in messages:
        similarity = cosine_similarity(query_embedding, embed(msg["content"]))
        if similarity > threshold:
            relevant.append(msg)
    return relevant
```
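The sketch above assumes `embed` and `cosine_similarity` helpers and naively re-embeds every message on every query. A minimal way to fill both gaps, assuming `embed` returns a fixed-length vector for a string, is to memoize the embedding calls and compute cosine similarity with NumPy:

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=4096)
def embed_cached(text: str):
    """Memoize embeddings so each unique string is embedded only once."""
    return tuple(embed(text))  # tuple so the cached value is immutable

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With this in place, `selective_context` can call `embed_cached` instead of `embed`, which implements the "cache embeddings" practice from the table below.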
## Best Practices
| Practice | Benefit |
|---|---|
| Monitor token usage | Stay within limits, control costs |
| Summarize proactively | Preserve key information |
| Prioritize recent context | Most relevant for current task |
| Cache embeddings | Faster retrieval |
| Use smaller models for summarization | Cost efficient |
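For the first practice, most provider SDKs report per-call usage directly on the response object. A sketch with the OpenAI Python SDK (assuming the `openai` v1+ package and an API key in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize our discussion so far."}],
)

# Track these numbers per call to stay within limits and control costs
print(f"prompt tokens:     {response.usage.prompt_tokens}")
print(f"completion tokens: {response.usage.completion_tokens}")
print(f"total tokens:      {response.usage.total_tokens}")
```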
## Common Pitfalls
- ❌ Ignoring context limits → Truncated important info
- ❌ Including everything → Slow, expensive, noisy
- ❌ Aggressive truncation → Loss of key context
- ✅ Smart summarization → Best of both worlds
Next, we'll explore RAG, a powerful technique for giving agents access to external knowledge.