Lesson 13 of 20

Production & Enterprise

Evaluation Frameworks

4 min read

"It works on my laptop" doesn't cut it for production agents. You need systematic evaluation before deployment and continuous monitoring after.

The Evaluation Challenge

Agents are non-deterministic. The same input can produce different outputs, different tool sequences, and different outcomes. Traditional unit tests break down.

# This test is useless for agents
def test_agent():
    result = agent.run("Summarize this document")
    assert result == "Expected summary"  # Will fail randomly
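
The fix is to assert properties the output must satisfy rather than an exact string. A minimal sketch, assuming (as in the snippet above) that agent.run() returns the output text:

# Better: test properties of the result, not its exact wording
def test_agent_summarizes(agent, document: str):
    summary = agent.run(f"Summarize this document:\n\n{document}")
    assert summary.strip(), "Agent returned an empty summary"
    assert len(summary) < len(document), "Summary should be shorter than its source"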

Evaluation Dimensions

| Dimension | What It Measures | How to Test |
| --- | --- | --- |
| Task Success | Did it complete the goal? | Outcome-based assertions |
| Correctness | Is the output accurate? | LLM-as-judge, ground truth |
| Efficiency | How many steps/tokens? | Cost and latency metrics |
| Safety | Any harmful outputs? | Red-team testing, filters |
| Reliability | Consistent across runs? | Statistical sampling |

Outcome-Based Testing

Test the result, not the path:

import json
import time

class AgentEvaluator:
    def __init__(self, agent, judge_model: str = "claude-sonnet-4-20250514"):
        self.agent = agent
        self.judge_model = judge_model

    async def evaluate_task(self, task: str, success_criteria: list[str]) -> dict:
        """Run the agent once and evaluate its output against each criterion."""

        # Execute the agent and track wall-clock time
        start = time.time()
        result = await self.agent.run(task)
        duration = time.time() - start

        # Score each criterion independently with an LLM judge
        judgments = []
        for criterion in success_criteria:
            judgment = await self.judge_criterion(result.content, criterion)
            judgments.append(judgment)

        return {
            "task": task,
            "result": result.content,
            "duration_seconds": duration,
            "token_count": result.usage.total_tokens,
            "criteria_results": judgments,
            "overall_success": all(j["passed"] for j in judgments)
        }

    async def judge_criterion(self, result: str, criterion: str) -> dict:
        """Use a judge model to evaluate a single criterion."""
        response = await llm.chat(  # llm: your async chat-completion client, defined elsewhere
            model=self.judge_model,
            messages=[{
                "role": "user",
                "content": f"""Evaluate if this agent output meets the criterion.

Output: {result}

Criterion: {criterion}

Respond with JSON: {{"passed": true/false, "reason": "explanation"}}"""
            }]
        )
        return json.loads(response.content)
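
A usage sketch, assuming an asyncio entry point and an agent instance built elsewhere; the task and criteria are illustrative:

import asyncio

# Illustrative usage; `agent` is whatever agent object you are evaluating
async def main():
    evaluator = AgentEvaluator(agent)
    report = await evaluator.evaluate_task(
        task="Summarize the attached incident report in three bullet points",
        success_criteria=[
            "Output contains exactly three bullet points",
            "Every bullet is grounded in the incident report",
            "Output is under 100 words"
        ]
    )
    print(report["overall_success"], f"{report['duration_seconds']:.1f}s")

asyncio.run(main())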

Statistical Evaluation

Run multiple times to measure reliability:

import statistics

async def evaluate_reliability(agent, task: str, num_runs: int = 10) -> dict:
    """Run the agent multiple times on the same task and measure consistency."""
    results = []

    for i in range(num_runs):
        result = await agent.run(task)
        results.append({
            "run": i,
            "output": result.content,
            "tokens": result.usage.total_tokens,
            "tool_calls": len(result.tool_calls)
        })

    # Consistency: how many distinct outputs appeared across identical runs
    outputs = [r["output"] for r in results]
    unique_outputs = len(set(outputs))

    return {
        "total_runs": num_runs,
        "unique_outputs": unique_outputs,
        "consistency_score": 1 - (unique_outputs - 1) / num_runs,
        "avg_tokens": sum(r["tokens"] for r in results) / num_runs,
        "token_variance": statistics.variance([r["tokens"] for r in results]),
        "results": results
    }
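
With the default 10 runs, each additional distinct output lowers consistency_score by 0.1, so 1.0 means perfectly repeatable output and 0.1 means every run differed (exact string matching is a deliberately strict notion of "same"). A usage sketch; the agent, task, and 0.7 threshold are illustrative:

import asyncio

# Illustrative: flag tasks where the agent's behavior is unstable across runs
report = asyncio.run(evaluate_reliability(agent, "List the three largest files in ./data"))
if report["consistency_score"] < 0.7:
    print(f"{report['unique_outputs']} distinct outputs across {report['total_runs']} runs; "
          "consider tightening the prompt or constraining the tools")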

Benchmark Suites

Create reusable test suites:

# benchmarks/code_agent.yaml
name: "Code Agent Benchmark"
version: "1.0"
tasks:
  - id: "simple_function"
    prompt: "Write a Python function to check if a number is prime"
    criteria:
      - "Function is syntactically valid Python"
      - "Function returns True for 2, 3, 5, 7, 11"
      - "Function returns False for 0, 1, 4, 6, 9"
    max_tokens: 500
    timeout_seconds: 30

  - id: "file_modification"
    prompt: "Add error handling to the process_data function in utils.py"
    criteria:
      - "File was modified successfully"
      - "Try-except block was added"
      - "Original functionality preserved"
    setup: "cp fixtures/utils_original.py utils.py"
    teardown: "rm utils.py"
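
A suite like this needs a small runner to execute it. A minimal sketch, assuming PyYAML is installed and reusing the AgentEvaluator from above; it runs the optional setup/teardown shell commands but skips max_tokens and timeout enforcement for brevity:

# Illustrative runner for the YAML format above
import asyncio
import subprocess
import yaml

async def run_benchmark(agent, path: str = "benchmarks/code_agent.yaml") -> list[dict]:
    with open(path) as f:
        suite = yaml.safe_load(f)

    evaluator = AgentEvaluator(agent)
    reports = []
    for task in suite["tasks"]:
        if "setup" in task:
            subprocess.run(task["setup"], shell=True, check=True)  # prepare fixtures
        try:
            report = await evaluator.evaluate_task(task["prompt"], task["criteria"])
            report["task_id"] = task["id"]
            reports.append(report)
        finally:
            if "teardown" in task:
                subprocess.run(task["teardown"], shell=True, check=True)  # clean up
    return reports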

Continuous Monitoring

Production agents need ongoing evaluation:

import hashlib
from datetime import datetime

class ProductionMonitor:
    def __init__(self, alert_threshold: float = 0.8):
        self.alert_threshold = alert_threshold
        self.metrics = []

    async def log_interaction(self, task: str, result: dict, user_feedback: int | None = None):
        """Log every production interaction."""
        metric = {
            "timestamp": datetime.now().isoformat(),
            "task_hash": hashlib.md5(task.encode()).hexdigest(),  # avoid storing raw user input
            "success": result.get("success"),
            "tokens": result.get("tokens"),
            "latency_ms": result.get("latency_ms"),
            "user_feedback": user_feedback,  # 1-5 rating if available
            "error": result.get("error")
        }
        self.metrics.append(metric)
        await self.check_alerts()

    async def check_alerts(self):
        """Alert if the rolling success rate drops below the threshold."""
        recent = self.metrics[-100:]  # Last 100 interactions
        if len(recent) < 10:
            return  # not enough data to judge

        success_rate = sum(1 for m in recent if m["success"]) / len(recent)
        if success_rate < self.alert_threshold:
            await self.send_alert(f"Success rate dropped to {success_rate:.1%}")

    async def send_alert(self, message: str):
        """Stub: route to your paging or chat tooling in production."""
        print(f"[ALERT] {message}")
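
A wiring sketch showing where the monitor sits in a request handler; `agent`, `handle_request`, and the latency bookkeeping are placeholders for your own stack:

import time

# Illustrative: log every production call through the monitor, success or failure
monitor = ProductionMonitor(alert_threshold=0.8)

async def handle_request(task: str) -> str:
    start = time.time()
    try:
        result = await agent.run(task)
        outcome = {
            "success": True,
            "tokens": result.usage.total_tokens,
            "latency_ms": int((time.time() - start) * 1000)
        }
        return result.content
    except Exception as exc:
        outcome = {
            "success": False,
            "error": str(exc),
            "latency_ms": int((time.time() - start) * 1000)
        }
        raise
    finally:
        await monitor.log_interaction(task, outcome)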

Nerd Note: OpenAI's evals and Anthropic's model card testing both use LLM-as-judge for subjective quality. It's not perfect, but it scales.

Next: Securing your agents against misuse.
