Error Handling & Recovery

Graceful Degradation

When part of your agent fails, the system should degrade gracefully: keep whatever functionality still works rather than failing completely.

Degradation Strategies

| Strategy | When to Use | Example |
|---|---|---|
| Fallback responses | API failures | Return cached or default response |
| Reduced functionality | Tool unavailable | Skip optional features |
| Alternative paths | Primary method fails | Try backup approach |
| Human handoff | Critical failures | Escalate to human support |
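
The human handoff row is the one strategy not shown in code later in this lesson, so here is a minimal sketch of an escalation rule. It is illustrative only: `HandoffTicket`, `should_escalate`, `handle_failure`, the `CRITICAL_ERRORS` set, and the default threshold of three failures are all hypothetical names and values, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical error kinds treated as critical; adjust to your own system.
CRITICAL_ERRORS = {"payment_failed", "data_loss", "auth_breach"}


@dataclass
class HandoffTicket:
    """What a human operator needs to pick up the conversation."""
    user_id: str
    error_kind: str
    transcript: list = field(default_factory=list)


def should_escalate(error_kind: str, consecutive_failures: int, max_failures: int = 3) -> bool:
    """Escalate on a critical error, or once failures keep repeating."""
    return error_kind in CRITICAL_ERRORS or consecutive_failures >= max_failures


def handle_failure(user_id, error_kind, consecutive_failures, transcript) -> Optional[HandoffTicket]:
    """Return a ticket for human support, or None if the agent should keep retrying."""
    if should_escalate(error_kind, consecutive_failures):
        # Copy the transcript so later turns do not mutate the ticket.
        return HandoffTicket(user_id, error_kind, list(transcript))
    return None
```

Passing the transcript along means the user does not have to repeat themselves once a person takes over.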

Implementing Fallbacks

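import logging

logger = logging.getLogger(__name__)

# RateLimitError, ServiceError, and ResponseCache are assumed to be provided
# by your LLM client library and application code.

class ResilientAgent:
    def __init__(self, primary_llm, fallback_llm):
        self.primary = primary_llm
        self.fallback = fallback_llm
        self.cache = ResponseCache()

    async def generate(self, prompt, context):
        # Note: context is not used in this simplified example.
        # Try primary LLM
        try:
            return await self.primary.generate(prompt)
        except (RateLimitError, ServiceError) as e:
            # Log the switch instead of swallowing the error silently
            logger.warning("Primary LLM failed, trying fallback: %s", e)

        # Try fallback LLM
        try:
            return await self.fallback.generate(prompt)
        except Exception as e:
            logger.warning("Fallback LLM failed, trying cache: %s", e)

        # Try cache
        cached = self.cache.get_similar(prompt)
        if cached:
            return f"[Cached response] {cached}"

        # Last resort: honest failure message
        logger.error("All generation paths failed")
        return "I'm experiencing technical difficulties. Please try again shortly."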

Tool Fallback Chains

class ToolWithFallbacks:
    def __init__(self, tools_by_priority):
        self.tools = tools_by_priority  # [primary, fallback1, fallback2]

    def execute(self, query):
        errors = []

        for tool in self.tools:
            try:
                result = tool.run(query)
                return {"success": True, "result": result, "tool": tool.name}
            except Exception as e:
                # Record the failure; the loop then moves on to the next tool
                errors.append(f"{tool.name}: {e}")

        return {
            "success": False,
            "error": "All tools failed",
            "details": errors
        }

# Example: Search with fallbacks
search_chain = ToolWithFallbacks([
    GoogleSearch(),      # Primary
    BingSearch(),        # Fallback 1
    DuckDuckGoSearch()   # Fallback 2
])
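
To exercise a chain like this without calling real search providers, stub tools that fail on demand are enough. The sketch below is an assumption-laden illustration: `StubTool` and `execute_chain` are hypothetical names, and `execute_chain` repeats the loop from `ToolWithFallbacks.execute` only so the example runs on its own.

```python
class StubTool:
    """Hypothetical stand-in for a real search tool."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def run(self, query):
        if self.fail:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} results for {query!r}"


def execute_chain(tools, query):
    """Try each tool in priority order; collect errors from the ones that fail."""
    errors = []
    for tool in tools:
        try:
            return {"success": True, "result": tool.run(query), "tool": tool.name}
        except Exception as e:
            errors.append(f"{tool.name}: {e}")
    return {"success": False, "error": "All tools failed", "details": errors}


# Primary fails, backup answers.
outcome = execute_chain([StubTool("primary", fail=True), StubTool("backup")], "python asyncio")

# Every tool fails, so the caller gets the full error list.
all_down = execute_chain(
    [StubTool("primary", fail=True), StubTool("backup", fail=True)], "python asyncio"
)

print(outcome["tool"])
print(all_down["details"])
```

Checks like these are a cheap way to confirm that the backup path actually runs before a real outage forces it to.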

Feature Flags for Degradation

class FeatureManager:
    def __init__(self):
        self.features = {
            "web_search": True,
            "code_execution": True,
            "image_generation": True,
            "file_operations": True
        }
        self.health_checks = {}

    def check_health(self):
        """Periodically check and disable unhealthy features"""
        for feature, checker in self.health_checks.items():
            try:
                is_healthy = checker()
                self.features[feature] = is_healthy
            except Exception:  # a bare except would also catch KeyboardInterrupt and SystemExit
                self.features[feature] = False

    def is_available(self, feature):
        return self.features.get(feature, False)

# Usage in agent
if feature_manager.is_available("web_search"):
    result = web_search(query)
else:
    result = "Web search is temporarily unavailable. Using cached knowledge."
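
In the class above, `health_checks` starts empty, so `check_health` has nothing to run until checkers are registered. One possible way to wire that up is sketched below. The `register` method and the two checker functions are hypothetical additions, and the class is a condensed copy so the example runs on its own.

```python
class FeatureManager:
    """Condensed copy of the class above, plus a hypothetical register() helper."""

    def __init__(self):
        self.features = {}
        self.health_checks = {}

    def register(self, feature, checker):
        # Assume a feature is available until a health check says otherwise.
        self.features[feature] = True
        self.health_checks[feature] = checker

    def check_health(self):
        for feature, checker in self.health_checks.items():
            try:
                self.features[feature] = bool(checker())
            except Exception:
                # A checker that raises counts as unhealthy.
                self.features[feature] = False

    def is_available(self, feature):
        return self.features.get(feature, False)


def search_ok():
    """Hypothetical checker: pretend the search backend responded."""
    return True


def sandbox_ok():
    """Hypothetical checker: pretend the code sandbox timed out."""
    raise TimeoutError("sandbox did not respond")


feature_manager = FeatureManager()
feature_manager.register("web_search", search_ok)
feature_manager.register("code_execution", sandbox_ok)
feature_manager.check_health()

print(feature_manager.is_available("web_search"))
print(feature_manager.is_available("code_execution"))
```

In a real agent, `check_health` would run on a schedule, for example from a background task, rather than once at startup.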

User Communication

Always communicate degraded states clearly:

def format_degraded_response(response, degradation_info):
    """Add transparency about limitations"""

    warnings = []

    if degradation_info.get("using_fallback"):
        warnings.append("Using backup service")

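    if degradation_info.get("cached"):
        # Use .get so a missing cache_date does not raise KeyError
        cache_date = degradation_info.get("cache_date", "an unknown date")
        warnings.append(f"Based on cached data from {cache_date}")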

    if degradation_info.get("limited_tools"):
        warnings.append("Some features temporarily unavailable")

    if warnings:
        disclaimer = "\n---\n⚠️ " + " | ".join(warnings)
        return response + disclaimer

    return response

Circuit Breaker Pattern

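A circuit breaker prevents cascading failures: after repeated failures it stops calling the failing service for a cooldown period, then lets a single trial call through to see whether the service has recovered.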

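import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.state = "closed"  # closed, open, half-open
        self.last_failure = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # monotonic() is unaffected by system clock adjustments
            if time.monotonic() - self.last_failure > self.reset_timeout:
                # Cooldown elapsed: let one trial call through
                self.state = "half-open"
            else:
                raise CircuitOpenError("Service unavailable")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure = time.monotonic()
            # A failed trial call reopens the circuit immediately
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
            raise

        # Any success closes the circuit and resets the count, so only
        # consecutive failures can trip the breaker
        self.state = "closed"
        self.failures = 0
        return result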
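
The state transitions are easier to follow in a deterministic run. The sketch below is a condensed breaker with an injectable clock, which is an assumption made here so the demo needs no sleeping. `guarded`, `flaky`, and `healthy` are hypothetical helpers, and the fallback strings stand in for whatever degraded response your agent would return.

```python
class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Condensed breaker; `clock` is any zero-argument function returning seconds."""

    def __init__(self, failure_threshold, reset_timeout, clock):
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.last_failure = None

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.last_failure > self.reset_timeout:
                self.state = "half-open"  # allow one trial call
            else:
                raise CircuitOpenError("Service unavailable")
        try:
            result = func()
        except Exception:
            self.failures += 1
            self.last_failure = self.clock()
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
            raise
        self.state = "closed"
        self.failures = 0
        return result


now = [0.0]  # fake clock, advanced by hand
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60, clock=lambda: now[0])
calls_made = []


def flaky():
    calls_made.append(now[0])
    raise ConnectionError("upstream down")


def healthy():
    return "ok"


def guarded(func):
    """Return the result, or a fallback string if the call fails or is rejected."""
    try:
        return breaker.call(func)
    except CircuitOpenError:
        return "fallback (circuit open)"
    except ConnectionError:
        return "fallback (call failed)"


states = []
first = guarded(flaky); states.append(breaker.state)     # one failure, still closed
second = guarded(flaky); states.append(breaker.state)    # threshold reached, opens
third = guarded(flaky); states.append(breaker.state)     # rejected without calling flaky
now[0] = 61.0                                            # cooldown has elapsed
fourth = guarded(healthy); states.append(breaker.state)  # trial call succeeds, closes
```

The third call never reaches the failing service. That is the point of the pattern: while the circuit is open, the agent serves a fallback immediately instead of waiting on a dependency that is known to be down.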

Best Practices

| Do | Don't |
|---|---|
| Communicate limitations clearly | Hide failures from users |
| Log all degradation events | Silently switch to fallbacks |
| Set reasonable timeouts | Wait indefinitely |
| Test fallback paths regularly | Assume backups work |
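
For the "set reasonable timeouts" row, the standard library's `asyncio.wait_for` is one way to put a ceiling on how long a call may take. This is a minimal sketch: `call_with_timeout`, `slow_tool`, `fast_tool`, and the 0.05-second limit are hypothetical and chosen only to keep the demo quick.

```python
import asyncio


async def call_with_timeout(coro_fn, timeout_s, fallback):
    """Await coro_fn(); return `fallback` if it takes longer than timeout_s seconds."""
    try:
        return await asyncio.wait_for(coro_fn(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def slow_tool():
    """Hypothetical tool that takes far longer than the caller will wait."""
    await asyncio.sleep(1.0)
    return "slow result"


async def fast_tool():
    """Hypothetical tool that answers immediately."""
    return "fast result"


async def main():
    slow = await call_with_timeout(slow_tool, 0.05, "timed out, using default")
    fast = await call_with_timeout(fast_tool, 0.05, "timed out, using default")
    return slow, fast


slow, fast = asyncio.run(main())
print(slow)
print(fast)
```

`asyncio.wait_for` cancels the slow coroutine when the limit is reached, so the agent does not keep waiting on it in the background.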

Next: Learn validation techniques to prevent bad outputs from reaching users.
