Lesson 9 of 20

Memory & Knowledge

Context Window Management

3 min read

Every LLM has a finite context window—the maximum number of tokens it can process at once. Managing this window is crucial for building effective agents.

Understanding Context Windows

Model             Context Window   ~Word Equivalent
GPT-4 Turbo       128K tokens      ~96,000 words
Claude 3.5        200K tokens      ~150,000 words
GPT-4o            128K tokens      ~96,000 words
Gemini 1.5 Pro    2M tokens        ~1.5M words

Note: Longer context ≠ better performance. Models often struggle with information in the "middle" of long contexts.

Token Economics

Every API call is billed per token, for both input and output. A typical agent conversation includes:

System prompt:     ~500-2000 tokens
Conversation:      Variable
Tool definitions:  ~100-500 per tool
Tool results:      Variable
Response:          ~200-2000 tokens
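
The snippets below assume a count_tokens helper. Here is a minimal sketch, assuming the tiktoken tokenizer library and OpenAI-style message dicts; the 4-token per-message overhead is a rough estimate, not an exact figure:

import tiktoken

def count_tokens(messages, model="gpt-4o"):
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a common encoding
        encoding = tiktoken.get_encoding("cl100k_base")

    total = 0
    for msg in messages:
        # Content tokens plus a few tokens of per-message formatting overhead
        total += len(encoding.encode(msg["content"])) + 4
    return total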

Managing Context Efficiently

1. Summarization

def manage_context(messages, max_tokens=50000):
    # count_tokens and llm.generate are assumed helpers
    # (a count_tokens sketch appears earlier in this lesson)
    total_tokens = count_tokens(messages)

    if total_tokens > max_tokens:
        # Summarize everything except the 10 most recent messages
        old_messages = messages[:-10]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
        summary = llm.generate(f"Summarize this conversation:\n{transcript}")

        # Replace the old history with a single summary message
        return [
            {"role": "system", "content": f"Previous context: {summary}"},
            *messages[-10:]
        ]

    return messages

2. Sliding Window

Keep only the most recent messages:

def sliding_window(messages, window_size=20):
    if len(messages) > window_size:
        # Always keep system message
        system = messages[0] if messages[0]["role"] == "system" else None
        recent = messages[-window_size:]
        return [system, *recent] if system else recent
    return messages
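
For example, with a system prompt and 30 alternating turns (illustrative data), the window keeps the system message plus the 20 most recent turns:

# Illustrative conversation: a system prompt followed by 30 alternating turns
messages = [{"role": "system", "content": "You are a helpful agent."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(30)
]

trimmed = sliding_window(messages, window_size=20)
print(len(trimmed))  # 21: the system message plus the 20 most recent turns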

3. Selective Retrieval

Only include relevant past context:

def selective_context(messages, current_query):
    # embed and cosine_similarity are assumed helpers (sketched below)
    query_embedding = embed(current_query)

    # Keep only past messages that are semantically similar to the query
    relevant = []
    for msg in messages:
        similarity = cosine_similarity(query_embedding, embed(msg["content"]))
        if similarity > 0.7:  # illustrative threshold
            relevant.append(msg)

    return relevant
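
The embed and cosine_similarity helpers are left undefined above. A minimal sketch, assuming the OpenAI embeddings API and NumPy (the model name is an illustrative choice):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text):
    # Return the embedding vector for a piece of text
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))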

Best Practices

Practice                               Benefit
Monitor token usage                    Stay within limits, control costs
Summarize proactively                  Preserve key information
Prioritize recent context              Most relevant for current task
Cache embeddings                       Faster retrieval (see the sketch below)
Use smaller models for summarization   Cost efficient
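
Caching embeddings can be as simple as memoizing the embed helper so the same text is never embedded twice; here is a sketch building on the embed function above:

# In-memory cache: identical texts are embedded only once per process
_embedding_cache = {}

def embed_cached(text):
    if text not in _embedding_cache:
        _embedding_cache[text] = embed(text)
    return _embedding_cache[text]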

Common Pitfalls

  • Ignoring context limits → important information gets silently truncated
  • Including everything → slow, expensive, and noisy
  • Aggressive truncation → loss of key context

Smart summarization avoids all three: it keeps you within limits without discarding the key context.

Next, we'll explore RAG (retrieval-augmented generation), a powerful technique for giving agents access to external knowledge.
