Lesson 9 of 20

Memory & Knowledge

Context Window Management

3 min read

Every LLM has a finite context window—the maximum number of tokens it can process at once. Managing this window is crucial for building effective agents.

Understanding Context Windows

Model               Context Window   ~Word Equivalent
GPT-5.4             1M tokens        ~750,000 words
Claude Sonnet 4.6   1M tokens        ~750,000 words
Gemini 3.1 Pro      1M tokens        ~750,000 words

Note: Longer context ≠ better performance. Models often struggle with information in the "middle" of long contexts.

Token Economics

Every API call costs tokens. A typical agent conversation includes:

System prompt:     ~500-2000 tokens
Conversation:      Variable
Tool definitions:  ~100-500 per tool
Tool results:      Variable
Response:          ~200-2000 tokens
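To budget against these numbers before making a call, you can estimate token counts. The sketch below uses the common 4-characters-per-token rule of thumb for English text; it is an approximation, and the per-message overhead constant is an assumption — production code should use the provider's actual tokenizer.

```python
# Rough token budgeting. The 4-chars-per-token ratio is a common English-text
# approximation, not an exact tokenizer.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def estimate_conversation_tokens(messages: list[dict]) -> int:
    # Small per-message overhead for role markers and formatting (an assumption)
    return sum(estimate_tokens(m["content"]) + 4 for m in messages)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the plot of Hamlet in one sentence."},
]
print(estimate_conversation_tokens(messages))
```

Even a crude estimate like this is enough to decide when to trigger the context-management strategies below.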

Managing Context Efficiently

1. Summarization

def manage_context(messages, max_tokens=50000):
    # count_tokens and llm are assumed helpers: a tokenizer-backed counter
    # and a model client with a generate() method
    total_tokens = count_tokens(messages)

    if total_tokens > max_tokens:
        # Summarize everything except the 10 most recent messages
        old_messages = messages[:-10]
        summary = llm.generate(f"Summarize this conversation: {old_messages}")

        # Replace the old messages with a single summary message
        return [
            {"role": "system", "content": f"Previous context: {summary}"},
            *messages[-10:],
        ]

    return messages
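To see the summarization path fire end to end, here is a self-contained variant with the model client injected and stubbed out. `StubLLM` and the character-based `count_tokens` are illustrative stand-ins, not real APIs, and the low `max_tokens` is chosen just to trigger compression.

```python
# Self-contained sketch of the summarization strategy, with stand-in helpers.
class StubLLM:
    def generate(self, prompt: str) -> str:
        # A real client would return an actual summary of the prompt
        return "Earlier turns: user explored context-window strategies."

def count_tokens(messages) -> int:
    # Crude 4-chars-per-token estimate (an assumption)
    return sum(len(m["content"]) // 4 for m in messages)

def manage_context(messages, llm, max_tokens=50):
    if count_tokens(messages) <= max_tokens:
        return messages
    old, recent = messages[:-10], messages[-10:]
    summary = llm.generate(f"Summarize this conversation: {old}")
    return [{"role": "system", "content": f"Previous context: {summary}"}, *recent]

history = [{"role": "user", "content": f"Message number {i} " * 10} for i in range(30)]
compressed = manage_context(history, StubLLM())
print(len(compressed))  # 1 summary message + the 10 most recent
```

Thirty long messages collapse into eleven: one synthetic system message carrying the summary, plus the untouched recent turns.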

2. Sliding Window

Keep only the most recent messages:

def sliding_window(messages, window_size=20):
    if len(messages) <= window_size:
        return messages
    # Always preserve the system message, if present
    system = messages[0] if messages and messages[0]["role"] == "system" else None
    recent = messages[-window_size:]
    return [system, *recent] if system else recent
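A quick demonstration of the windowing behavior (the function is repeated here so the snippet runs on its own; the conversation contents are made up):

```python
def sliding_window(messages, window_size=20):
    if len(messages) <= window_size:
        return messages
    # Always preserve the system message, if present
    system = messages[0] if messages and messages[0]["role"] == "system" else None
    recent = messages[-window_size:]
    return [system, *recent] if system else recent

history = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user", "content": f"turn {i}"} for i in range(50)
]
trimmed = sliding_window(history, window_size=5)
print(len(trimmed))  # system message + the 5 most recent turns = 6
```

Note the trade-off: anything older than the window is gone for good, which is why this strategy is often paired with summarization.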

3. Selective Retrieval

Only include relevant past context:

def selective_context(messages, current_query, threshold=0.7):
    # embed and cosine_similarity are assumed helpers; embed the query once
    query_embedding = embed(current_query)

    # Keep only past messages semantically similar to the current query
    relevant = []
    for msg in messages:
        similarity = cosine_similarity(query_embedding, embed(msg["content"]))
        if similarity > threshold:
            relevant.append(msg)

    return relevant
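The snippet above assumes a `cosine_similarity` helper. For reference, here is a minimal pure-Python version for two equal-length vectors (in practice you would use NumPy or your vector store's built-in scoring):

```python
import math

# Cosine similarity: the dot product of two vectors divided by the product
# of their magnitudes. Ranges from -1 (opposite) to 1 (same direction).
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid division by zero for empty/zero vectors
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```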

Best Practices

Practice                               Benefit
Monitor token usage                    Stay within limits, control costs
Summarize proactively                  Preserve key information
Prioritize recent context              Most relevant for the current task
Cache embeddings                       Faster retrieval
Use smaller models for summarization   Cost-efficient
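The "cache embeddings" practice can be as simple as memoizing the embedding call, so repeated texts are embedded only once. A sketch using Python's standard `functools.lru_cache`; the `fake_embed` logic is a deterministic stand-in for a real embedding model:

```python
from functools import lru_cache

call_count = 0  # tracks how often the underlying "model" actually runs

@lru_cache(maxsize=10_000)
def cached_embed(text: str) -> tuple[float, ...]:
    global call_count
    call_count += 1
    # Stand-in "embedding": average character ordinal (not semantically meaningful)
    return (sum(map(ord, text)) / max(len(text), 1),)

cached_embed("hello")
cached_embed("hello")   # served from the cache; the function body runs once
print(call_count)       # 1
```

For `lru_cache` to work, arguments and return values must be hashable, which is why the embedding is returned as a tuple rather than a list.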

Common Pitfalls

  • Ignoring context limits → important information gets silently truncated
  • Including everything → slow, expensive, and noisy
  • Truncating too aggressively → loss of key context
  • Smart summarization → the best of both worlds

Next, we'll explore RAG, a powerful technique for giving agents access to external knowledge.
