Error Handling & Recovery
Failure Modes in Agents
AI agents can fail in many ways—understanding these failure modes is the first step to building robust systems.
Common Failure Categories
| Category | Example | Impact |
|---|---|---|
| API Failures | Rate limits, timeouts, service outages | Agent stops working |
| Tool Failures | Invalid parameters, permission errors | Incomplete tasks |
| Reasoning Failures | Hallucinations, loops, wrong conclusions | Incorrect outputs |
| Context Failures | Token overflow, lost context | Degraded performance |
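Each category calls for a different response, so it helps to classify exceptions as they surface. A minimal sketch, using built-in exception types as stand-ins for the errors your stack actually raises:

```python
# Illustrative mapping from exception types to the categories above.
# Replace the keys with the concrete errors your LLM client and tools raise.
FAILURE_CATEGORIES = {
    TimeoutError: "api",
    ConnectionError: "api",
    PermissionError: "tool",
    TypeError: "tool",
}

def classify_failure(exc: Exception) -> str:
    for exc_type, category in FAILURE_CATEGORIES.items():
        if isinstance(exc, exc_type):
            return category
    return "unknown"
```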
API and Network Failures
```python
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# RateLimitError and ServiceUnavailableError stand in for whatever
# transient error classes your LLM client raises (e.g. openai.RateLimitError).

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type((RateLimitError, ServiceUnavailableError)),
)
def call_llm_with_retry(prompt):
    """Retry transient LLM failures with exponential backoff."""
    return llm.generate(prompt)
```

Restricting retries with `retry_if_exception_type` means only transient errors are retried; anything else (say, an authentication error) propagates immediately instead of being retried pointlessly.
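When every attempt fails, tenacity raises `RetryError`; callers can catch it and degrade gracefully rather than crash:

```python
from tenacity import RetryError

try:
    answer = call_llm_with_retry("Summarize today's incidents.")
except RetryError:
    # All retries exhausted; fall back instead of propagating the failure.
    answer = "The model is temporarily unavailable. Please try again shortly."
```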
Tool Execution Failures
```python
class SafeToolExecutor:
    def __init__(self, tools):
        # tools: mapping of tool name -> object exposing .run(**params)
        self.tools = tools

    def execute(self, tool_name, params):
        try:
            result = self.tools[tool_name].run(**params)
            return {"success": True, "result": result}
        except KeyError:
            return {"success": False, "error": f"Unknown tool: {tool_name}"}
        except TypeError as e:
            return {"success": False, "error": f"Invalid parameters: {e}"}
        except PermissionError:
            return {"success": False, "error": "Permission denied"}
        except Exception as e:
            return {"success": False, "error": f"Unexpected error: {e}"}
```
Reasoning Failures
Infinite Loops
```python
class LoopDetectedError(Exception):
    """Raised when the agent keeps repeating the same action."""

class LoopDetector:
    def __init__(self, max_iterations=10):
        self.max_iterations = max_iterations
        self.action_history = []

    def check_and_record(self, action):
        self.action_history.append(action)
        # Once past the iteration budget, check for a repeated pattern
        if len(self.action_history) > self.max_iterations:
            recent = self.action_history[-5:]
            if len(set(recent)) == 1:  # Same action repeated five times
                raise LoopDetectedError("Agent stuck in loop")
        return True
```
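In practice the detector sits inside the agent loop. A sketch, assuming a hypothetical `agent.next_action()` planning step:

```python
detector = LoopDetector(max_iterations=10)

try:
    while True:
        action = agent.next_action()  # hypothetical: returns the next action name
        detector.check_and_record(action)
        if action == "finish":
            break
except LoopDetectedError:
    # Stop with a partial answer rather than burning tokens indefinitely.
    print("Stopping: the agent repeated the same action too many times.")
```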
Hallucination Detection
```python
def verify_tool_output(claimed_result, actual_result):
    """Check if the agent's claimed result matches the actual tool output."""
    if claimed_result != actual_result:
        return {
            "verified": False,
            "discrepancy": (f"Agent claimed '{claimed_result}' "
                            f"but tool returned '{actual_result}'"),
        }
    return {"verified": True}
```
Failure Monitoring
Track failures to identify patterns:
```python
from collections import defaultdict
from datetime import datetime, timedelta

class FailureTracker:
    def __init__(self):
        self.failures = defaultdict(list)

    def log_failure(self, category, details):
        self.failures[category].append({
            "timestamp": datetime.now(),
            "details": details,
        })

    def get_failure_rate(self, category, hours=24):
        """Failures per hour for a category over the given window."""
        cutoff = datetime.now() - timedelta(hours=hours)
        recent = [f for f in self.failures[category]
                  if f["timestamp"] > cutoff]
        return len(recent) / hours
```
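A single tracker can be shared across the failure paths above. For example:

```python
tracker = FailureTracker()
tracker.log_failure("tool", "Unknown tool: serach")
tracker.log_failure("api", "Rate limited on attempt 2")

# More than 5 API failures per hour suggests a service-level problem.
if tracker.get_failure_rate("api", hours=1) > 5:
    print("API failure rate is elevated; consider backing off.")
```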
Key Failure Indicators
Watch for these warning signs (a simple alerting sketch follows the list):
- Repeated retries → Underlying service issue
- Increasing latency → Resource constraints
- Tool errors spike → Schema or API changes
- Context truncation → Need better memory management
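These signals can be checked mechanically. A minimal alerting sketch layered on the `FailureTracker` above (the per-hour thresholds are illustrative):

```python
# Illustrative per-hour alert thresholds for each failure category.
ALERT_THRESHOLDS = {"api": 5.0, "tool": 3.0, "reasoning": 1.0, "context": 1.0}

def check_warning_signs(tracker: FailureTracker) -> list[str]:
    """Return a warning for every category whose rate exceeds its threshold."""
    warnings = []
    for category, threshold in ALERT_THRESHOLDS.items():
        rate = tracker.get_failure_rate(category, hours=1)
        if rate > threshold:
            warnings.append(f"{category}: {rate:.1f} failures/hour (threshold {threshold})")
    return warnings
```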
Next: Learn how to handle failures gracefully without breaking the user experience.