Error Handling & Recovery
Graceful Degradation
When parts of your agent fail, the system should degrade gracefully, keeping partial functionality available rather than failing completely.
Degradation Strategies
| Strategy | When to Use | Example |
|---|---|---|
| Fallback responses | API failures | Return cached or default response |
| Reduced functionality | Tool unavailable | Skip optional features |
| Alternative paths | Primary method fails | Try backup approach |
| Human handoff | Critical failures | Escalate to human support |
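One way to read the table is as a mapping from failure type to strategy. The sketch below is illustrative only: the `Strategy` enum and the three exception classes are hypothetical names invented for this example, and treating every unrecognized error as critical is an assumption rather than a rule from this lesson.

```python
from enum import Enum


class Strategy(Enum):
    FALLBACK_RESPONSE = "fallback_response"
    REDUCED_FUNCTIONALITY = "reduced_functionality"
    ALTERNATIVE_PATH = "alternative_path"
    HUMAN_HANDOFF = "human_handoff"


# Hypothetical exception types standing in for real failure signals.
class ApiFailure(Exception):
    pass


class ToolUnavailable(Exception):
    pass


class PrimaryMethodFailed(Exception):
    pass


STRATEGY_BY_ERROR = {
    ApiFailure: Strategy.FALLBACK_RESPONSE,
    ToolUnavailable: Strategy.REDUCED_FUNCTIONALITY,
    PrimaryMethodFailed: Strategy.ALTERNATIVE_PATH,
}


def choose_strategy(error: Exception) -> Strategy:
    """Pick a degradation strategy for an error.

    Assumption: anything not listed in STRATEGY_BY_ERROR is treated as
    critical and escalated to a human.
    """
    for error_type, strategy in STRATEGY_BY_ERROR.items():
        if isinstance(error, error_type):
            return strategy
    return Strategy.HUMAN_HANDOFF
```

Keeping the mapping in one place makes it easier to review which failures are handled automatically and which are escalated.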
Implementing Fallbacks
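import logging

logger = logging.getLogger(__name__)

# RateLimitError, ServiceError, and ResponseCache are assumed to be defined
# elsewhere in your codebase or provided by your LLM client library.
class ResilientAgent:
    def __init__(self, primary_llm, fallback_llm):
        self.primary = primary_llm
        self.fallback = fallback_llm
        self.cache = ResponseCache()

    async def generate(self, prompt, context):
        # Note: context is not used in this sketch.
        # Try primary LLM
        try:
            return await self.primary.generate(prompt)
        except (RateLimitError, ServiceError) as e:
            # Log instead of switching silently
            logger.warning("Primary LLM failed, trying fallback: %s", e)

        # Try fallback LLM
        try:
            return await self.fallback.generate(prompt)
        except Exception as e:
            logger.warning("Fallback LLM failed, trying cache: %s", e)

        # Try cache
        cached = self.cache.get_similar(prompt)
        if cached:
            return f"[Cached response] {cached}"

        # Last resort: honest failure message
        return "I'm experiencing technical difficulties. Please try again shortly."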
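The agent above relies on a `ResponseCache` with a `get_similar` method, which the lesson does not define. A minimal in-memory sketch is shown here, assuming plain string similarity from the standard library's `difflib` is good enough. The `put` method and the `min_similarity` parameter are hypothetical additions, and a production system would more likely compare embeddings.

```python
from difflib import SequenceMatcher


class ResponseCache:
    """In-memory cache with fuzzy prompt lookup (illustrative sketch)."""

    def __init__(self, min_similarity=0.8):
        # Assumption: 0.8 is an arbitrary cutoff; tune it for your prompts.
        self.min_similarity = min_similarity
        self._entries = {}  # prompt -> response

    def put(self, prompt, response):
        self._entries[prompt] = response

    def get_similar(self, prompt):
        """Return the response for the closest stored prompt, or None."""
        best_score, best_response = 0.0, None
        for stored_prompt, response in self._entries.items():
            score = SequenceMatcher(
                None, prompt.lower(), stored_prompt.lower()
            ).ratio()
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.min_similarity:
            return best_response
        return None
```

The lookup scans every entry, so this only suits small caches. The threshold matters: set too low, the agent may return a cached answer to a different question.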
Tool Fallback Chains
class ToolWithFallbacks:
    def __init__(self, tools_by_priority):
        self.tools = tools_by_priority  # [primary, fallback1, fallback2]

    def execute(self, query):
        errors = []
        for tool in self.tools:
            try:
                result = tool.run(query)
                return {"success": True, "result": result, "tool": tool.name}
            except Exception as e:
                # Record the failure and move on to the next tool
                errors.append(f"{tool.name}: {e}")
        return {
            "success": False,
            "error": "All tools failed",
            "details": errors,
        }

# Example: Search with fallbacks
# (GoogleSearch, BingSearch, and DuckDuckGoSearch are placeholder tool classes
# that expose a name attribute and a run(query) method.)
search_chain = ToolWithFallbacks([
    GoogleSearch(),      # Primary
    BingSearch(),        # Fallback 1
    DuckDuckGoSearch(),  # Fallback 2
])
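The chain assumes every tool exposes a `name` attribute and a `run(query)` method. One way to make that contract explicit is a `typing.Protocol`, sketched below together with two stub tools. The names `Tool`, `FailingSearch`, and `EchoSearch` are hypothetical and exist only for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Tool(Protocol):
    """Interface the fallback chain expects from each tool."""

    name: str

    def run(self, query: str) -> str: ...


class FailingSearch:
    """Stub that always fails, standing in for an unavailable provider."""

    name = "failing_search"

    def run(self, query: str) -> str:
        raise ConnectionError("provider unreachable")


class EchoSearch:
    """Stub that always succeeds with a canned result."""

    name = "echo_search"

    def run(self, query: str) -> str:
        return f"results for {query!r}"
```

Passing `[FailingSearch(), EchoSearch()]` to `ToolWithFallbacks` should exercise the fallback path without any network access, which makes stubs like these convenient for tests.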
Feature Flags for Degradation
class FeatureManager:
    def __init__(self):
        self.features = {
            "web_search": True,
            "code_execution": True,
            "image_generation": True,
            "file_operations": True,
        }
        # Maps feature name -> zero-argument callable that returns True if healthy
        self.health_checks = {}

    def check_health(self):
        """Periodically check and disable unhealthy features."""
        for feature, checker in self.health_checks.items():
            try:
                self.features[feature] = bool(checker())
            except Exception:
                # A checker that raises counts as unhealthy
                self.features[feature] = False

    def is_available(self, feature):
        # Unknown features are treated as unavailable
        return self.features.get(feature, False)

# Usage in agent (assumes a shared feature_manager instance and a web_search helper)
if feature_manager.is_available("web_search"):
    result = web_search(query)
else:
    result = "Web search is temporarily unavailable. Using cached knowledge."
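The `health_checks` dictionary starts empty, and the lesson does not show how a checker is built. A possible sketch follows, assuming a checker is any zero-argument callable that returns a boolean. `make_health_check` and its `max_latency` parameter are hypothetical names.

```python
import time


def make_health_check(probe, max_latency=2.0):
    """Wrap a zero-argument probe into a checker that returns True or False.

    The feature counts as healthy only if the probe returns without raising
    and finishes within max_latency seconds. This does not interrupt a
    hanging probe, so the probe should set its own network timeouts.
    """
    def checker():
        start = time.monotonic()
        try:
            probe()
        except Exception:
            return False
        return time.monotonic() - start <= max_latency

    return checker
```

A checker built this way could be registered with something like `feature_manager.health_checks["web_search"] = make_health_check(ping_search_api)`, where `ping_search_api` is a placeholder for whatever lightweight request your search provider supports.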
User Communication
Always communicate degraded states clearly:
def format_degraded_response(response, degradation_info):
    """Add transparency about limitations."""
    warnings = []
    if degradation_info.get("using_fallback"):
        warnings.append("Using backup service")
    if degradation_info.get("cached"):
        # cache_date may be missing, so avoid a KeyError
        cache_date = degradation_info.get("cache_date", "an unknown date")
        warnings.append(f"Based on cached data from {cache_date}")
    if degradation_info.get("limited_tools"):
        warnings.append("Some features temporarily unavailable")
    if warnings:
        disclaimer = "\n---\n⚠️ " + " | ".join(warnings)
        return response + disclaimer
    return response
Circuit Breaker Pattern
Prevent cascading failures:
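import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are being rejected."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds
        self.state = "closed"  # closed, open, half-open
        self.last_failure = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # After the timeout, let one trial call through.
            # time.monotonic() is unaffected by system clock adjustments.
            if time.monotonic() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
            else:
                raise CircuitOpenError("Service unavailable")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure = time.monotonic()
            # A failed trial call, or too many failures, opens the circuit
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
            raise
        # Any success closes the circuit and resets the failure count,
        # so only consecutive failures can trip the breaker
        self.state = "closed"
        self.failures = 0
        return result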
Best Practices
| Do | Don't |
|---|---|
| Communicate limitations clearly | Hide failures from users |
| Log all degradation events | Silently switch to fallbacks |
| Set reasonable timeouts | Wait indefinitely |
| Test fallback paths regularly | Assume backups work |
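Two rows of the table, setting timeouts and logging degradation events, can be combined in one small helper. The sketch below uses `asyncio.wait_for` from the standard library. The helper name `call_with_timeout`, the logger name, and the two example services are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("agent.degradation")


async def call_with_timeout(make_call, timeout_s, fallback_value):
    """Await make_call() for at most timeout_s seconds.

    Returns a (value, degraded) pair so the caller can tell the user
    when the fallback value was used.
    """
    try:
        value = await asyncio.wait_for(make_call(), timeout=timeout_s)
        return value, False
    except asyncio.TimeoutError:
        # Log the degradation event instead of switching silently
        logger.warning("Call timed out after %.2fs; using fallback", timeout_s)
        return fallback_value, True


# Hypothetical services used only to demonstrate both outcomes.
async def slow_service():
    await asyncio.sleep(1.0)
    return "fresh answer"


async def fast_service():
    return "fresh answer"
```

Returning the `degraded` flag alongside the value keeps the decision about how to inform the user with the caller, for example by passing it into `format_degraded_response`.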
Next: Learn validation techniques to prevent bad outputs from reaching users.