Lesson 13 of 20

Error Handling & Recovery

Failure Modes in Agents

AI agents can fail in many ways. Understanding these failure modes is the first step to building robust systems.

Common Failure Categories

| Category | Example | Impact |
|----------|---------|--------|
| API Failures | Rate limits, timeouts, service outages | Agent stops working |
| Tool Failures | Invalid parameters, permission errors | Incomplete tasks |
| Reasoning Failures | Hallucinations, loops, wrong conclusions | Incorrect outputs |
| Context Failures | Token overflow, lost context | Degraded performance |
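
One way to make these categories usable in code is to tag each caught exception with a category. The sketch below is a minimal illustration: `FailureCategory`, `EXCEPTION_CATEGORIES`, and `categorize` are hypothetical names, and the mapping uses only built-in exception types as stand-ins for whatever your own stack raises. Reasoning and context failures usually surface through explicit checks rather than exceptions, so they are left out of the mapping.

```python
from enum import Enum


class FailureCategory(Enum):
    API = "api"
    TOOL = "tool"
    REASONING = "reasoning"
    CONTEXT = "context"


# Hypothetical mapping from built-in exception types to categories.
# Replace these with the exceptions your client and tools actually raise.
EXCEPTION_CATEGORIES = {
    TimeoutError: FailureCategory.API,
    ConnectionError: FailureCategory.API,
    PermissionError: FailureCategory.TOOL,
    TypeError: FailureCategory.TOOL,
}


def categorize(exc):
    """Return the category for an exception, or None if it is not mapped."""
    for exc_type, category in EXCEPTION_CATEGORIES.items():
        if isinstance(exc, exc_type):
            return category
    return None
```

Because the lookup uses `isinstance`, subclasses such as `ConnectionResetError` fall into the same category as their parent.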

API and Network Failures

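from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# `llm`, RateLimitError, and ServiceUnavailableError are assumed to come
# from your LLM client library; they are not defined in this snippet.
@retry(
    # Retry only on transient errors; anything else propagates immediately
    retry=retry_if_exception_type((RateLimitError, ServiceUnavailableError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)
def call_llm_with_retry(prompt):
    """Retry LLM calls with exponential backoff"""
    # After the third failed attempt, tenacity raises RetryError
    return llm.generate(prompt)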
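
To see what the backoff settings above amount to, the following standard-library sketch computes a clamped exponential schedule and applies it in a plain retry loop. It is a simplified model for illustration, not tenacity's implementation; `backoff_delays` and `retry_call` are hypothetical helpers, and the built-in `TimeoutError` and `ConnectionError` stand in for your client's transient errors.

```python
import time


def backoff_delays(retries, multiplier=1.0, min_wait=4.0, max_wait=60.0):
    """Wait before each retry: multiplier * 2**n, clamped to [min_wait, max_wait]."""
    return [
        max(min_wait, min(max_wait, multiplier * 2 ** n))
        for n in range(retries)
    ]


def retry_call(func, attempts=3, sleep=time.sleep,
               retry_on=(TimeoutError, ConnectionError)):
    """Call func up to `attempts` times, sleeping between failed attempts."""
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(delays[attempt])
```

Passing `sleep` as a parameter keeps the loop testable, since a test can record the waits instead of actually sleeping.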

Tool Execution Failures

class SafeToolExecutor:
    def __init__(self, tools):
        # Maps tool name -> object exposing run(**params)
        self.tools = tools

    def execute(self, tool_name, params):
        # Look up the tool first so a KeyError raised inside run()
        # is not misreported as an unknown tool
        tool = self.tools.get(tool_name)
        if tool is None:
            return {"success": False, "error": f"Unknown tool: {tool_name}"}
        try:
            result = tool.run(**params)
            return {"success": True, "result": result}
        except TypeError as e:
            return {"success": False, "error": f"Invalid parameters: {e}"}
        except PermissionError:
            return {"success": False, "error": "Permission denied"}
        except Exception as e:
            return {"success": False, "error": f"Unexpected error: {e}"}
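
Invalid parameters can also be caught before the tool runs. The sketch below checks a parameter dict against a function's signature using the standard `inspect` module; `validate_params` and the `search` tool are hypothetical names used only for illustration.

```python
import inspect


def validate_params(func, params):
    """Return None if params fit func's signature, else an error message."""
    try:
        # bind() raises TypeError for missing or unexpected arguments
        inspect.signature(func).bind(**params)
    except TypeError as e:
        return f"Invalid parameters: {e}"
    return None


def search(query, limit=10):
    """Hypothetical tool used only to demonstrate validation."""
    return [f"{query}-{i}" for i in range(limit)]
```

Checking up front separates a bad call from a `TypeError` raised inside the tool itself, which a blanket `except TypeError` cannot tell apart.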

Reasoning Failures

Infinite Loops

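class LoopDetectedError(Exception):
    """Raised when the agent appears to be stuck"""

class LoopDetector:
    def __init__(self, max_iterations=10, window=5):
        self.max_iterations = max_iterations
        self.window = window
        self.action_history = []

    def check_and_record(self, action):
        # Actions must be hashable (e.g. strings or tuples)
        self.action_history.append(action)

        # Hard cap on the total number of steps
        if len(self.action_history) > self.max_iterations:
            raise LoopDetectedError("Agent exceeded max iterations")

        # Same action repeated for the whole recent window
        recent = self.action_history[-self.window:]
        if len(recent) == self.window and len(set(recent)) == 1:
            raise LoopDetectedError("Agent stuck in loop")

        return True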
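
Agents can also loop over a short cycle of different actions, such as alternating between two tools, which a same-action check will miss. The following sketch looks for a repeating pattern at the end of the history; `find_cycle` and its `max_period` and `repeats` parameters are hypothetical, and the defaults are illustrative rather than tuned values.

```python
def find_cycle(history, max_period=3, repeats=3):
    """Return the period of a pattern repeated `repeats` times at the
    end of history, or None if no such pattern is found."""
    for period in range(1, max_period + 1):
        needed = period * repeats
        if len(history) < needed:
            continue
        tail = history[-needed:]
        pattern = tail[:period]
        # Every element of the tail must match the pattern at its offset
        if all(tail[i] == pattern[i % period] for i in range(needed)):
            return period
    return None
```

For example, a history of `search, read` repeated three times is reported as a cycle of period 2, while the same pair repeated only twice is not flagged under the default `repeats=3`.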

Hallucination Detection

def verify_tool_output(claimed_result, actual_result):
    """Check if agent's claimed result matches actual tool output"""
    if claimed_result != actual_result:
        return {
            "verified": False,
            "discrepancy": f"Agent claimed '{claimed_result}' but tool returned '{actual_result}'"
        }
    return {"verified": True}
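
A strict equality check can flag harmless formatting differences, such as extra whitespace or reordered JSON keys. One possible refinement, sketched here under the assumption that tool outputs are strings or JSON-serializable values, is to normalize both sides before comparing. `normalize` and `results_match` are hypothetical helpers.

```python
import json


def normalize(value):
    """Canonicalize a value so formatting differences are ignored."""
    if isinstance(value, str):
        text = value.strip()
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            return text  # Not JSON: compare as trimmed text
    # sort_keys gives a stable form; default=str avoids errors on odd types
    return json.dumps(value, sort_keys=True, default=str)


def results_match(claimed_result, actual_result):
    """True if the two results are equal after normalization."""
    return normalize(claimed_result) == normalize(actual_result)
```

This only removes formatting noise. It does not judge whether a paraphrased claim is faithful to the tool output, which needs a separate check.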

Failure Monitoring

Track failures to identify patterns:

from collections import defaultdict
from datetime import datetime, timedelta

class FailureTracker:
    def __init__(self):
        # Maps category -> list of failure records
        self.failures = defaultdict(list)

    def log_failure(self, category, details):
        self.failures[category].append({
            "timestamp": datetime.now(),
            "details": details
        })

    def get_failure_rate(self, category, hours=24):
        """Return the number of failures logged in the last `hours` hours"""
        cutoff = datetime.now() - timedelta(hours=hours)
        recent = [f for f in self.failures[category]
                  if f["timestamp"] > cutoff]
        return len(recent)
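
To spot patterns across categories rather than one category at a time, the same cutoff logic can be applied to a flat list of events. This is a minimal sketch: `failures_per_category` is a hypothetical helper, and it takes `now` as an argument so the result is reproducible.

```python
from collections import Counter
from datetime import datetime, timedelta


def failures_per_category(events, now, hours=24):
    """Count events newer than the cutoff, grouped by category.

    `events` is an iterable of (category, timestamp) pairs.
    """
    cutoff = now - timedelta(hours=hours)
    return Counter(category for category, ts in events if ts > cutoff)


now = datetime(2024, 1, 2, 12, 0)
events = [
    ("api", now - timedelta(hours=1)),
    ("api", now - timedelta(hours=30)),   # Outside the 24-hour window
    ("tool", now - timedelta(hours=2)),
    ("tool", now - timedelta(hours=3)),
]
summary = failures_per_category(events, now)
```

Calling `summary.most_common(1)` then gives the category with the most recent failures.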

Key Failure Indicators

Watch for these warning signs:

  • Repeated retries → Underlying service issue
  • Increasing latency → Resource constraints
  • Tool errors spike → Schema or API changes
  • Context truncation → Need better memory management
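
An indicator such as increasing latency can be turned into a simple automated check. The sketch below compares the mean of the most recent samples with the mean of the earlier ones; `latency_is_increasing`, the window size, and the 1.5x factor are hypothetical choices for illustration, not recommended thresholds.

```python
from statistics import mean


def latency_is_increasing(samples, window=3, factor=1.5):
    """True if the mean of the last `window` samples exceeds
    `factor` times the mean of the earlier samples."""
    if len(samples) < 2 * window:
        return False  # Not enough data to compare
    earlier = samples[:-window]
    recent = samples[-window:]
    return mean(recent) > factor * mean(earlier)
```

The same shape of check, a recent window compared against a baseline, could be applied to retry counts or tool error counts.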

Next: Learn how to handle failures gracefully without breaking the user experience.
