Autonomous Task Execution & Planning

Autonomous Agents: Planning and Self-Correction


What Makes an Agent "Autonomous"

An autonomous agent goes beyond simple tool calling. It operates in a continuous loop of four stages:

  1. Planning — Decompose a high-level goal into a sequence of actionable steps
  2. Execution — Carry out each step by calling tools, writing code, or querying APIs
  3. Observation — Inspect the result of each action to determine what happened
  4. Self-Correction — Detect when the plan is failing and adjust course

This Plan-Execute-Observe-Correct loop is the fundamental pattern behind every autonomous agent, from coding assistants to research agents. The key insight is that the agent does not just follow instructions — it reasons about whether its actions are achieving the intended goal.

Goal: "Find the top 3 competitors and summarize their pricing"
  |
Plan: [search_web, extract_pages, compare_pricing, write_summary]
  |
Execute step 1: search_web("competitor pricing SaaS")
  |
Observe: Got 10 results, but 3 are irrelevant blog posts
  |
Self-Correct: Filter to product pages only, refine search query
  |
Execute step 2 (revised): search_web("competitor pricing page site:competitor.com")
  |
... continue loop ...
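
In code, this loop is just a while-loop over a mutable plan. Below is a minimal sketch of the shape; plan, execute_step, and needs_correction are stubs standing in for your LLM planner, tool layer, and failure checks, not a real library:

def plan(goal: str, history: list[str] | None = None) -> list[str]:
    # Stub planner: a real one asks the LLM to decompose the goal,
    # conditioned on prior observations when replanning.
    return [f"search for: {goal}", f"summarize findings on: {goal}"]

def execute_step(step: str) -> str:
    # Stub executor: a real one dispatches to tools, code, or APIs.
    return f"ok: {step}"

def needs_correction(goal: str, step: str, result: str) -> bool:
    # Stub check: real agents use heuristics or an LLM evaluator here.
    return result.startswith("error")

def run_agent(goal: str, max_steps: int = 20) -> list[str]:
    steps = plan(goal)                        # Plan
    observations: list[str] = []
    while steps and len(observations) < max_steps:
        step = steps.pop(0)
        result = execute_step(step)           # Execute
        observations.append(result)           # Observe
        if needs_correction(goal, step, result):
            steps = plan(goal, observations)  # Self-Correct: replan
    return observations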

Planning Strategies

Different planning approaches offer different trade-offs between thoroughness and speed.

ReAct: Reasoning + Acting

The ReAct pattern, introduced by Yao et al. (ICLR 2023), interleaves reasoning traces with actions. At each step, the agent produces:

  • Thought — What do I know? What should I do next?
  • Action — The specific tool call or operation
  • Observation — The result of that action

A trace for a simple stock question looks like this:

Thought: I need to find the current price of AAPL stock.
         I have a stock_price tool available.
Action:  stock_price(symbol="AAPL")
Observation: $187.42
Thought: Now I have the price. The user also asked for
         the 52-week high. Let me search for that.
Action:  stock_info(symbol="AAPL", field="52_week_high")
Observation: $199.62
Thought: I have both data points. I can now respond.
Action:  respond("AAPL is currently at $187.42, with a 52-week high of $199.62")

ReAct's strength is its step-by-step transparency — every decision is traceable. Its weakness is that it plans only one step ahead, which can lead to inefficient paths for complex tasks.
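
One way to drive this loop programmatically is to parse each Action out of the model's output; llm_call and run_tool below are hypothetical stand-ins for your model client and tool dispatcher:

import re

def llm_call(prompt: str) -> str:
    # Stub: replace with a real LLM client call.
    return 'I have what I need. Action: respond("All done.")'

def run_tool(name: str, args: str) -> str:
    # Stub: replace with your tool dispatcher.
    return "tool output"

def react_loop(question: str, max_turns: int = 8) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm_call(transcript + "Thought: ")  # model emits Thought + Action
        transcript += f"Thought: {step}\n"
        match = re.search(r'Action:\s*(\w+)\((.*)\)', step)
        if match is None:
            continue                               # no action parsed; think again
        tool, args = match.groups()
        if tool == "respond":
            return args.strip().strip('"')         # final answer ends the loop
        observation = run_tool(tool, args)         # act, then observe
        transcript += f"Observation: {observation}\n"
    return "Stopped after max_turns without a final answer."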

Plan-and-Execute

The Plan-and-Execute strategy, explored by Wang et al. (2023) in the Plan-and-Solve approach, separates planning from execution into two distinct phases:

  1. Planning phase: The LLM generates a complete plan (a list of steps) before executing anything
  2. Execution phase: A separate executor runs each step, feeding results back to the planner if replanning is needed

This works well for structured tasks where the steps are predictable. It struggles when early steps produce unexpected results that invalidate later steps.
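
A sketch of the two phases, using the same kind of hypothetical stubs as before (make_plan, execute_one, and invalidates_plan are placeholders):

def make_plan(goal: str, done: list[str] | None = None) -> list[str]:
    # Stub planner: a real one prompts the LLM for a complete step list,
    # including prior results when replanning.
    return [f"gather data on {goal}", f"analyze data on {goal}"]

def execute_one(step: str) -> str:
    # Stub executor for a single step.
    return f"done: {step}"

def invalidates_plan(result: str, remaining: list[str]) -> bool:
    # Stub: does this result make the remaining steps moot?
    return result.startswith("error")

def plan_and_execute(goal: str) -> list[str]:
    steps = make_plan(goal)                 # Phase 1: plan everything up front
    results: list[str] = []
    while steps:
        result = execute_one(steps.pop(0))  # Phase 2: execute step by step
        results.append(result)
        if invalidates_plan(result, steps):
            steps = make_plan(goal, done=results)  # feed results back; replan
    return results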

Tree of Thoughts

Tree of Thoughts (ToT), introduced by Yao et al. (2023), extends chain-of-thought reasoning by exploring multiple reasoning paths simultaneously:

  • Generate several candidate next steps
  • Evaluate each candidate (using the LLM as an evaluator or using heuristics)
  • Pursue the most promising paths, prune the rest
  • Optionally backtrack if a path leads to a dead end

ToT is powerful for tasks with clear evaluation criteria (math problems, puzzles, code generation with test cases) but expensive — it requires multiple LLM calls per decision point.
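
A breadth-first ToT sketch with a fixed beam width; propose and score are hypothetical stubs for an LLM proposer and evaluator:

def propose(state: str, k: int = 3) -> list[str]:
    # Stub proposer: a real one asks the LLM for k candidate next steps.
    return [f"{state} -> option {i}" for i in range(k)]

def score(state: str) -> float:
    # Stub evaluator: a real one uses the LLM as a judge or a task
    # heuristic (e.g., does this partial solution pass the tests?).
    return float(len(state) % 7)

def tree_of_thoughts(start: str, depth: int = 3, beam_width: int = 2) -> str:
    frontier = [start]
    for _ in range(depth):
        candidates = [c for s in frontier for c in propose(s)]  # generate
        candidates.sort(key=score, reverse=True)                # evaluate
        frontier = candidates[:beam_width]                      # prune
        # A fuller version would backtrack here when a path dead-ends.
    return max(frontier, key=score)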

Reflexion

Reflexion, introduced by Shinn et al. (2023), adds an explicit self-reflection step after task completion:

  1. Attempt the task
  2. Evaluate the result (did it succeed? what went wrong?)
  3. Generate a verbal reflection on what to do differently
  4. Retry with the reflection added to context

The key idea is that the agent's own verbal analysis of its failures becomes part of its "experience" for the next attempt. This mimics how humans learn from mistakes through deliberate review.
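
A sketch of the loop; attempt, evaluate, and reflect are hypothetical stubs around your LLM and evaluator:

def attempt(task: str, reflections: list[str]) -> str:
    # Stub: a real attempt prompts the LLM with the task plus prior
    # reflections prepended as lessons learned.
    return f"attempt at {task} (informed by {len(reflections)} reflections)"

def evaluate(result: str) -> tuple[bool, str]:
    # Stub evaluator: unit tests, a rubric, or an LLM judge.
    return True, ""

def reflect(task: str, result: str, feedback: str) -> str:
    # Stub: a real reflection asks the LLM what to do differently.
    return f"Next time, address: {feedback}"

def reflexion(task: str, max_attempts: int = 3) -> str:
    reflections: list[str] = []
    result = ""
    for _ in range(max_attempts):
        result = attempt(task, reflections)
        ok, feedback = evaluate(result)
        if ok:
            break
        reflections.append(reflect(task, result, feedback))  # verbal "experience"
    return result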

How the four strategies compare:

| Strategy | Plans Ahead | Explores Alternatives | Self-Reflects | Cost |
| --- | --- | --- | --- | --- |
| ReAct | 1 step | No | No | Low |
| Plan-and-Execute | Full plan | No | On replan only | Medium |
| Tree of Thoughts | Multiple paths | Yes | Via evaluation | High |
| Reflexion | 1 attempt | Via retry | Yes, explicitly | Medium-High |

Self-Correction: When the Plan Fails

Autonomous agents must detect failure and adapt. Common failure signals include:

  • Repeated identical actions — The agent is stuck in a loop
  • Tool errors — An API returned an error or unexpected format
  • Goal drift — The agent's outputs are diverging from the original objective
  • Resource exhaustion — Approaching token limits, time limits, or cost budgets
  • Diminishing progress — Each step adds less value than the previous one

A well-designed self-correction system:

  1. Detects the failure condition (via heuristics or LLM-based evaluation)
  2. Diagnoses the root cause (wrong tool? bad input? impossible subtask?)
  3. Replans with the diagnosis in context (generate a new plan that avoids the failed approach)
  4. Tracks what was tried before to avoid repeating the same mistake

Detection and replanning might look like this (llm_call is a stand-in for your LLM client):

def check_stuck_state(history: list[str]) -> bool:
    """Detect if the agent is repeating the same action."""
    if len(history) < 3:
        return False
    last_three = history[-3:]
    return last_three[0] == last_three[1] == last_three[2]

def replan(goal: str, failed_steps: list[str], error: str) -> str:
    """Generate a new plan given what failed."""
    prompt = f"""Original goal: {goal}
Steps that failed: {failed_steps}
Error encountered: {error}
Generate a new plan that avoids the failed approach."""
    return llm_call(prompt)  # llm_call wraps your LLM client
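
Step 4, tracking what was tried, can be a small memory object whose contents get appended to the replan prompt above; a hypothetical sketch:

class AttemptTracker:
    """Remember failed approaches so replanning does not repeat them."""

    def __init__(self) -> None:
        self.tried: list[str] = []

    def record(self, approach: str) -> None:
        if approach not in self.tried:
            self.tried.append(approach)

    def to_prompt(self) -> str:
        # Append this to the replan prompt so the planner sees past failures.
        lines = "\n".join(f"- {a}" for a in self.tried)
        return f"Approaches already tried (do not repeat):\n{lines}"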

Human-in-the-Loop: Keeping Humans in Control

Fully autonomous agents are risky. Production systems use checkpoints where the agent pauses for human approval.

Confidence Thresholds

The agent assigns a confidence score to its planned actions. Below a threshold, it asks for human input:

  • High confidence (> 0.9): Execute automatically
  • Medium confidence (0.5 - 0.9): Execute but flag for review
  • Low confidence (< 0.5): Pause and ask the human
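
A routing function over these thresholds; execute, flag_for_review, and ask_human are hypothetical hooks into your executor, review queue, and UI:

def execute(action: str) -> str:
    return f"executed: {action}"              # stub executor

def flag_for_review(action: str, result: str) -> None:
    print(f"flagged for review: {action}")    # stub review queue

def ask_human(action: str) -> str:
    return f"awaiting approval: {action}"     # stub human prompt

def route_by_confidence(action: str, confidence: float) -> str:
    if confidence > 0.9:
        return execute(action)        # high: run automatically
    if confidence >= 0.5:
        result = execute(action)      # medium: run, but flag it
        flag_for_review(action, result)
        return result
    return ask_human(action)          # low: pause and ask the human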

Approval Gates

Certain action categories always require human approval, regardless of confidence:

  • Irreversible actions — Deleting data, sending emails, making purchases
  • High-cost actions — API calls that incur significant charges
  • External communications — Messages sent to customers or partners
  • Scope expansion — When the agent wants to take actions outside its original task
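
A gate check might combine these categories with the confidence thresholds above; the category names and cost cutoff below are made-up examples:

IRREVERSIBLE = {"delete_data", "send_email", "make_purchase"}
EXTERNAL = {"message_customer", "message_partner"}

def requires_approval(category: str, confidence: float,
                      est_cost_usd: float, in_scope: bool) -> bool:
    if category in IRREVERSIBLE or category in EXTERNAL:
        return True                  # always gated, regardless of confidence
    if est_cost_usd > 1.00:
        return True                  # high-cost actions are gated
    if not in_scope:
        return True                  # scope expansion is gated
    return confidence < 0.5          # otherwise fall back to the threshold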

Checkpoint Pattern

Agent: I have found 3 competitor pricing pages. My plan is:
       1. Extract pricing from each page
       2. Build a comparison table
       3. Write a summary with recommendations
       Confidence: 0.85
       Should I proceed? [Yes / Modify plan / Stop]
Human: Yes, proceed.
Agent: [executes plan]

Sandboxing Autonomous Execution

When agents act alone, the blast radius of a mistake grows. Sandboxing strategies mitigate this:

  • Read-only by default — The agent can read data freely but needs explicit permission to write, delete, or modify
  • Containerized execution — Run agent-generated code in isolated containers (Docker, sandboxed interpreters) with no access to host resources
  • Resource limits — Cap the number of tool calls, tokens consumed, wall-clock time, and API spend per task
  • Dry-run mode — The agent generates the plan and shows what it would do, without executing. Human reviews, then approves execution
  • Rollback capability — For write operations, maintain a log of changes so they can be reversed if the agent makes a mistake

The principle is defense in depth: no single safeguard is sufficient, but layered protections reduce risk to acceptable levels.
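
Resource limits in particular are cheap to enforce: keep a budget object and check it before every tool call. A sketch with made-up default caps:

from dataclasses import dataclass, field
import time

@dataclass
class Budget:
    """Per-task caps on tool calls, tokens, wall-clock time, and spend."""
    max_tool_calls: int = 50
    max_tokens: int = 200_000
    max_seconds: float = 600.0
    max_usd: float = 5.00
    tool_calls: int = 0
    tokens: int = 0
    usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, usd: float = 0.0) -> None:
        # Call once per tool invocation.
        self.tool_calls += 1
        self.tokens += tokens
        self.usd += usd

    def exhausted(self) -> bool:
        return (self.tool_calls >= self.max_tool_calls
                or self.tokens >= self.max_tokens
                or self.usd >= self.max_usd
                or time.monotonic() - self.started >= self.max_seconds)

When exhausted() turns true, the agent should stop and summarize what it has so far rather than fail silently.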

Real-World: How Coding Agents Work

Coding agents like Devin and Claude Code demonstrate autonomous task execution in practice. Their typical workflow follows the same loop:

  1. Plan: Read the issue or task description. Identify which files to examine. Outline an approach
  2. Execute: Edit files, write code, create tests
  3. Observe: Run the test suite, read compiler/linter output, check if the build passes
  4. Self-Correct: If tests fail, read the error message, diagnose the issue, modify the code, re-run

What makes these agents effective is tight feedback loops — they can run code and observe results within seconds, enabling rapid iteration. They also maintain a working memory of what they have tried, preventing them from repeating failed approaches.

Key design decisions in production coding agents:

| Decision | Trade-off |
| --- | --- |
| How many files to read upfront | More context = better plan, but costs more tokens |
| When to run tests | After every edit (safe but slow) vs. batch edits then test (fast but harder to debug) |
| When to ask for help | Too early = annoying; too late = wasted work |
| Max autonomous steps | Too few = agent gives up early; too many = runaway cost |

Interview Angle: Designing an Autonomous Research Agent

A common interview question is: "Design an autonomous agent that can research a topic and produce a report."

Apply the planning strategies from this lesson:

  1. Planning: Use Plan-and-Execute — generate a research plan with subtasks (define scope, search sources, extract key findings, synthesize, write report)
  2. Execution: Each subtask calls different tools (web search, document reader, summarizer)
  3. Self-Correction: After extracting findings, evaluate whether they cover the topic adequately. If gaps exist, add new search subtasks
  4. Human-in-the-Loop: Present the outline to the human before writing the full report. Let them redirect if the scope drifted
  5. Budget: Set limits on total searches, total tokens, and wall-clock time. If the budget runs low, summarize what was found so far rather than failing silently

The interviewer wants to see that you think about failure modes (what if search returns nothing useful?), quality control (how do you verify the findings?), and cost management (how do you prevent the agent from burning through API credits?).

In the lab, you will build an autonomous task execution agent with planning, ReAct execution, self-reflection, replanning, human checkpoints, and budget management.

No spam. Unsubscribe anytime.