Failure Modes in Agents

AI agents can fail in many ways. Understanding these failure modes is the first step toward building robust systems.

Common Failure Categories

Category	Example	Impact
API Failures	Rate limits, timeouts, service outages	Agent stops working
Tool Failures	Invalid parameters, permission errors	Incomplete tasks
Reasoning Failures	Hallucinations, loops, wrong conclusions	Incorrect outputs
Context Failures	Token overflow, lost context	Degraded performance
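One way to make these categories actionable is to map raised exceptions onto them, so that each failure can be counted and handled by type. The following is a minimal sketch, not a definitive taxonomy: `ContextOverflowError` and `LoopDetectedError` are hypothetical exception types defined here only for illustration, and treating every unrecognized exception as a tool failure is an assumption.

```python
from enum import Enum


class FailureCategory(Enum):
    API = "api"
    TOOL = "tool"
    REASONING = "reasoning"
    CONTEXT = "context"


# Hypothetical exception types, defined here only for illustration.
class ContextOverflowError(Exception):
    """Raised when a prompt exceeds the model's token limit."""


class LoopDetectedError(Exception):
    """Raised when the agent repeats itself without making progress."""


def classify_failure(exc):
    """Map an exception to one of the four categories in the table."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureCategory.API
    if isinstance(exc, ContextOverflowError):
        return FailureCategory.CONTEXT
    if isinstance(exc, LoopDetectedError):
        return FailureCategory.REASONING
    # Assumption: anything else originated in a tool call.
    return FailureCategory.TOOL
```

In a real agent, the API branch would also list the rate-limit and outage exceptions of whichever LLM client is in use.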

API and Network Failures

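from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# `llm`, RateLimitError, and ServiceUnavailableError are assumed to come
# from your LLM client library.
@retry(
    # Retry only transient errors; anything else fails immediately
    retry=retry_if_exception_type((RateLimitError, ServiceUnavailableError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    reraise=True,  # Surface the original error once attempts run out
)
def call_llm_with_retry(prompt):
    """Retry transient LLM errors with exponential backoff"""
    return llm.generate(prompt)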
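To show what exponential backoff does under the hood, here is a standard-library sketch of the same idea. It is an illustration rather than a reimplementation of tenacity: the helper names `backoff_delays` and `retry_with_backoff` are hypothetical, the delay formula (`base * 2**n`, capped) is a common textbook form and may not match tenacity's exact schedule, and the random jitter is an addition that tenacity's `wait_exponential` does not apply.

```python
import random
import time


def backoff_delays(retries, base=1.0, cap=60.0):
    """Return the maximum wait in seconds before each retry: base * 2**n, capped."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


def retry_with_backoff(func, attempts=3, base=1.0, cap=60.0,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call func(); on a retryable error, wait and try again."""
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the original error
            # Jitter spreads out clients that all failed at the same moment.
            sleep(random.uniform(0, delays[attempt]))
```

The `sleep` parameter exists so that tests can pass a no-op instead of actually waiting.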

Tool Execution Failures

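class SafeToolExecutor:
    def __init__(self, tools):
        # Maps tool name -> tool object exposing a run(**params) method
        self.tools = tools

    def execute(self, tool_name, params):
        # Look up the tool outside the try block so that a KeyError raised
        # inside the tool is not misreported as an unknown tool.
        tool = self.tools.get(tool_name)
        if tool is None:
            return {"success": False, "error": f"Unknown tool: {tool_name}"}
        try:
            result = tool.run(**params)
            return {"success": True, "result": result}
        except TypeError as e:
            # Typically a missing or unexpected argument
            return {"success": False, "error": f"Invalid parameters: {e}"}
        except PermissionError:
            return {"success": False, "error": "Permission denied"}
        except Exception as e:
            return {"success": False, "error": f"Unexpected error: {e}"}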
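Catching `TypeError` after the call cannot distinguish bad arguments from a `TypeError` raised deep inside the tool. One possible refinement is to check the arguments against the tool's signature before running it. This sketch uses the standard library's `inspect.signature`; the `validate_params` helper and the `search` tool are hypothetical names introduced for the example.

```python
import inspect


def validate_params(func, params):
    """Return an error message if params do not fit func's signature, else None."""
    try:
        # bind() raises TypeError for missing or unexpected arguments
        inspect.signature(func).bind(**params)
    except TypeError as e:
        return f"Invalid parameters: {e}"
    return None


# Hypothetical tool used only for this example.
def search(query, limit=10):
    return [f"result for {query}"][:limit]


print(validate_params(search, {"query": "agents"}))  # None: arguments fit
print(validate_params(search, {"q": "agents"}))      # An error message
```

This checks only argument names and counts, not types or values; schema validation would be a separate step.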

Reasoning Failures

Infinite Loops

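class LoopDetectedError(Exception):
    """Raised when the agent appears to be stuck."""


class LoopDetector:
    def __init__(self, max_iterations=10, repeat_window=5):
        self.max_iterations = max_iterations
        self.repeat_window = repeat_window
        self.action_history = []

    def check_and_record(self, action):
        # Actions must be hashable (e.g. strings or tuples)
        self.action_history.append(action)

        # Hard cap on the total number of steps
        if len(self.action_history) > self.max_iterations:
            raise LoopDetectedError("Max iterations exceeded")

        # Same action repeated for the whole window
        recent = self.action_history[-self.repeat_window:]
        if len(recent) == self.repeat_window and len(set(recent)) == 1:
            raise LoopDetectedError("Agent stuck in loop")

        return True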
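Checking for a single repeated action misses loops that alternate between steps, such as search, read, search, read. A possible extension, sketched below, looks for a short pattern that repeats at the end of the history. The function name `find_repeating_cycle` and its default parameters are illustrative assumptions, not part of the detector above.

```python
def find_repeating_cycle(history, max_period=3, min_repeats=3):
    """Return the period of a cycle at the end of history, or None.

    A cycle of period p is reported when the last p actions repeat
    at least min_repeats times in a row.
    """
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(history) < span:
            continue  # Not enough history to see this many repeats
        tail = history[-span:]
        pattern = tail[:period]
        # Compare with == so that unhashable actions (e.g. dicts) also work
        if all(tail[i] == pattern[i % period] for i in range(span)):
            return period
    return None


history = ["search", "read", "search", "read", "search", "read"]
print(find_repeating_cycle(history))  # 2
```

Raising `min_repeats` reduces false positives for agents that legitimately repeat a step a few times.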

Hallucination Detection

def verify_tool_output(claimed_result, actual_result):
    """Check if agent's claimed result matches actual tool output"""
    if claimed_result != actual_result:
        return {
            "verified": False,
            "discrepancy": f"Agent claimed '{claimed_result}' but tool returned '{actual_result}'"
        }
    return {"verified": True}
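Strict equality can flag harmless differences, for example `"42"` versus `42.0` or extra whitespace, as hallucinations. A more forgiving comparison might normalize both values first. The sketch below is one possible approach; the helper name `results_match` and the default tolerance are assumptions chosen for illustration.

```python
import math


def results_match(claimed, actual, rel_tol=1e-6):
    """Compare a claimed value with a tool result, tolerating formatting noise."""
    # Numbers: compare with a relative tolerance instead of exact equality.
    try:
        return math.isclose(float(claimed), float(actual), rel_tol=rel_tol)
    except (TypeError, ValueError):
        pass  # At least one value is not numeric; fall back to text

    # Everything else: case-insensitive, whitespace-normalized text.
    def normalize(value):
        return " ".join(str(value).split()).casefold()

    return normalize(claimed) == normalize(actual)


print(results_match("42", 42.0))          # True
print(results_match(" Paris ", "paris"))  # True
print(results_match("41", 42))            # False
```

This only catches misreported tool outputs; claims that never came from a tool need a different check.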

Failure Monitoring

Track failures to identify patterns:

from collections import defaultdict
from datetime import datetime, timedelta

class FailureTracker:
    def __init__(self):
        # Maps category name -> list of failure records
        self.failures = defaultdict(list)

    def log_failure(self, category, details):
        self.failures[category].append({
            "timestamp": datetime.now(),
            "details": details
        })

    def get_failure_rate(self, category, hours=24):
        """Return failures per hour for a category over the last `hours` hours"""
        cutoff = datetime.now() - timedelta(hours=hours)
        recent = [f for f in self.failures[category]
                  if f["timestamp"] > cutoff]
        return len(recent) / hours

Key Failure Indicators

Watch for these warning signs:

Repeated retries → Underlying service issue
Increasing latency → Resource constraints
Tool errors spike → Schema or API changes
Context truncation → Need better memory management
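These indicators can be turned into automated checks. The sketch below compares one monitoring window's counters against thresholds and returns a warning for each indicator that fires. The `WindowMetrics` fields and the default thresholds are hypothetical; sensible values depend on your own baseline.

```python
from dataclasses import dataclass


@dataclass
class WindowMetrics:
    """Counters for one monitoring window (hypothetical field names)."""
    retries: int
    latency_s: float
    baseline_latency_s: float
    tool_calls: int
    tool_errors: int
    truncations: int


def check_indicators(m, max_retries=5, latency_factor=2.0,
                     max_tool_error_rate=0.2):
    """Return a warning string for each indicator that crosses its threshold."""
    warnings = []
    if m.retries > max_retries:
        warnings.append("Repeated retries: possible underlying service issue")
    if m.latency_s > latency_factor * m.baseline_latency_s:
        warnings.append("Increasing latency: possible resource constraints")
    # Guard against division by zero when no tools were called
    if m.tool_calls and m.tool_errors / m.tool_calls > max_tool_error_rate:
        warnings.append("Tool error spike: possible schema or API change")
    if m.truncations > 0:
        warnings.append("Context truncation: review memory management")
    return warnings
```

Each warning names a likely cause rather than a confirmed one, so it is best treated as a prompt to investigate.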

Next: Learn how to handle failures gracefully without breaking the user experience.