Long-Running Agents

The Session Challenge

Most AI agent examples show quick, single-turn interactions. Real-world tasks such as refactoring a codebase, migrating a database, or writing documentation can take hours or days, not seconds.

The Problem

LLM APIs impose limits that shape what a single session can do:

Constraint        Impact
----------------  ---------------------------------
Context window    Can't remember everything at once
Request timeout   API calls can't run forever
Rate limits       Throttled after too many requests
Cost per token    Long sessions get expensive

A 200K-token context window sounds large until your agent processes 50 files, runs 30 commands, and tries to remember what it learned two hours ago.
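
A rough back-of-the-envelope estimate shows how quickly the window fills. The per-item token counts below are illustrative assumptions, not measurements; real numbers depend on the tokenizer, file sizes, and tool output.

```python
# Rough context-budget estimate. All per-item token counts are assumed
# for illustration only.
CONTEXT_WINDOW = 200_000      # tokens, matching the 200K figure in the text
TOKENS_PER_FILE = 3_000       # assumption: average source file
TOKENS_PER_COMMAND = 1_000    # assumption: a command plus its output


def estimate_usage(files: int, commands: int) -> int:
    """Return the estimated tokens consumed by file reads and command runs."""
    return files * TOKENS_PER_FILE + commands * TOKENS_PER_COMMAND


used = estimate_usage(files=50, commands=30)
remaining = CONTEXT_WINDOW - used
print(f"used={used}, remaining={remaining}, fraction={used / CONTEXT_WINDOW:.0%}")
```

Under these assumptions, 90% of the window is consumed before the system prompt, the conversation history, or the agent's own reasoning are counted.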

What Happens When Context Fills Up?

Turn 1: Agent reads 10 files, understands codebase
Turn 2: Agent makes 5 changes, tracks state
Turn 3: Agent runs tests, analyzes errors
Turn 4: ...context full, earlier information lost

The agent forgets why it made those changes. It starts repeating mistakes. It hallucinates file contents.
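
One common way earlier information gets lost is a sliding window that drops the oldest messages once a token budget is exceeded. The sketch below is a simplified illustration: `trim_to_budget` is a hypothetical helper, the turn texts are made up, and the four-characters-per-token estimate is a crude assumption rather than how any particular tokenizer works.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): roughly four characters per token."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit within the token budget."""
    kept: list[str] = []
    total = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if total + cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        total += cost
    return list(reversed(kept))


history = [
    "Turn 1: read 10 files, auth module uses session cookies",
    "Turn 2: made 5 changes to token handling",
    "Turn 3: ran tests, 2 failures in login flow",
    "Turn 4: investigating failures",
]
trimmed = trim_to_budget(history, budget=30)
print(trimmed)
```

With this budget, Turn 1 no longer fits, so the agent keeps the changes and test failures but loses the understanding of the codebase that motivated them.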

State Persistence: The Solution

Long-running agents need external memory:

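import json
import os
from datetime import datetime

# Bad: relying only on context (`agent` stands in for any context-only agent)
# agent.run("Refactor the auth module")  # Forgets after session

# Good: persisting state externally
class PersistentAgent:
    def __init__(self, state_file: str):
        self.state_file = state_file
        self.state = self.load_state()

    def load_state(self) -> dict:
        """Load saved state from disk, or start fresh if no file exists."""
        if os.path.exists(self.state_file):
            with open(self.state_file) as f:  # close the handle promptly
                return json.load(f)
        return {"completed": [], "current_task": None, "notes": []}

    def save_state(self) -> None:
        """Write the full state to disk as JSON."""
        with open(self.state_file, "w") as f:
            json.dump(self.state, f, indent=2)

    def checkpoint(self, task: str, result: dict) -> None:
        """Record a finished task and persist it immediately."""
        self.state["completed"].append({
            "task": task,
            "result": result,
            "timestamp": datetime.now().isoformat()
        })
        self.save_state()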
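
A checkpoint only helps if the state file survives a crash during the write itself. One common approach, shown here as a sketch rather than as part of the class above, is to write to a temporary file and then rename it over the original. The `save_state_atomic` name is hypothetical.

```python
import json
import os
import tempfile


def save_state_atomic(state: dict, state_file: str) -> None:
    """Write state to a temp file, then swap it into place with os.replace."""
    directory = os.path.dirname(os.path.abspath(state_file))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # push data to disk before the rename
        os.replace(tmp_path, state_file)  # readers see the old or the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the partial temp file
        raise
```

If the process dies before `os.replace` runs, the previous state file is left untouched, so the next session can still resume from the last good checkpoint.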

Types of State to Persist

  1. Progress state: What's done, what's next
  2. Knowledge state: What the agent learned
  3. Environment state: File changes, tool outputs
  4. Decision state: Why choices were made
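
The four kinds of state above could be grouped into a single serializable record. This is a minimal sketch: the `AgentState` class and its field names are hypothetical, and the sample values are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentState:
    """Hypothetical schema covering the four kinds of state listed above."""
    completed: list[str] = field(default_factory=list)      # progress: what's done
    pending: list[str] = field(default_factory=list)        # progress: what's next
    notes: list[str] = field(default_factory=list)          # knowledge: what was learned
    changed_files: list[str] = field(default_factory=list)  # environment: files touched
    decisions: list[dict] = field(default_factory=list)     # decision: choice and reason


state = AgentState(pending=["update login handler"])
state.notes.append("auth module stores sessions in cookies")
state.changed_files.append("auth/session.py")
state.decisions.append(
    {"choice": "keep cookie sessions", "reason": "existing tests depend on them"}
)

# Round-trip through JSON, as a state file on disk would.
serialized = json.dumps(asdict(state), indent=2)
restored = AgentState(**json.loads(serialized))
```

Storing the reason next to each choice is what lets a later session, or another person, understand a change without replaying the whole conversation.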

When Sessions Truly Need to Be Long

Not every task needs long-running architecture. Use it when:

  • Task requires multiple API calls over extended time
  • Work must survive connection failures
  • Results need to be reproducible
  • Multiple humans might continue the work
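
Surviving a connection failure usually means a rerun must skip work that already finished. The sketch below assumes a hypothetical `run_resumable` helper and a caller-supplied `do_task` function; it records each completed task in a JSON file so a later run picks up where the previous one stopped.

```python
import json
import os


def run_resumable(tasks: list[str], state_file: str, do_task) -> list[str]:
    """Run tasks in order, skipping any already recorded in state_file."""
    completed: list[str] = []
    if os.path.exists(state_file):
        with open(state_file) as f:
            completed = json.load(f)["completed"]

    for task in tasks:
        if task in completed:
            continue  # finished in an earlier session
        do_task(task)  # may raise, for example on a dropped connection
        completed.append(task)
        # Persist after every task so a failure loses at most one task's work.
        with open(state_file, "w") as f:
            json.dump({"completed": completed}, f, indent=2)
    return completed
```

If `do_task` raises partway through, calling `run_resumable` again with the same `state_file` continues from the first unfinished task instead of starting over.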

Nerd Note: If you can finish in under 50 API calls with comfortable context room, keep it simple. Over-engineering state management is a real trap.

Next: A proven architecture for handling long sessions.
