Production & Enterprise
Evaluation Frameworks
4 min read
"It works on my laptop" doesn't cut it for production agents. You need systematic evaluation before deployment and continuous monitoring after.
The Evaluation Challenge
Agents are non-deterministic. The same input can produce different outputs, different tool sequences, and different outcomes. Traditional unit tests break down.
# This test is useless for agents
def test_agent():
    result = agent.run("Summarize this document")
    assert result == "Expected summary"  # Will fail randomly
Evaluation Dimensions
| Dimension | What It Measures | How to Test |
|---|---|---|
| Task Success | Did it complete the goal? | Outcome-based assertions |
| Correctness | Is the output accurate? | LLM-as-judge, ground truth |
| Efficiency | How many steps/tokens? | Cost and latency metrics |
| Safety | Any harmful outputs? | Red-team testing, filters |
| Reliability | Consistent across runs? | Statistical sampling |
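In practice these dimensions end up as fields on a per-run record. Here is a minimal sketch; the `EvalRecord` dataclass and its field names are illustrative, not from any particular framework:

from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    """One evaluated agent run, covering the dimensions above."""
    task_id: str
    success: bool              # Task Success: outcome-based assertion passed
    correctness_score: float   # Correctness: LLM-as-judge or ground-truth match, 0-1
    steps: int                 # Efficiency: number of agent steps / tool calls
    tokens: int                # Efficiency: total token usage
    latency_ms: float          # Efficiency: wall-clock latency
    safety_flags: list[str] = field(default_factory=list)  # Safety: filter hits

def summarize(records: list[EvalRecord]) -> dict:
    """Reliability: aggregate across repeated runs of the same task."""
    return {
        "success_rate": sum(r.success for r in records) / len(records),
        "avg_correctness": sum(r.correctness_score for r in records) / len(records),
        "avg_tokens": sum(r.tokens for r in records) / len(records),
    }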
Outcome-Based Testing
Test the result, not the path:
import json
import time

class AgentEvaluator:
    def __init__(self, agent, judge_model: str = "claude-sonnet-4-20250514"):
        self.agent = agent
        self.judge_model = judge_model

    async def evaluate_task(self, task: str, success_criteria: list[str]) -> dict:
        """Run the agent and evaluate the result against each criterion."""
        # Execute the agent and time the run
        start = time.time()
        result = await self.agent.run(task)
        duration = time.time() - start

        # Use an LLM judge to score each criterion independently
        judgments = []
        for criterion in success_criteria:
            judgment = await self.judge_criterion(result.content, criterion)
            judgments.append(judgment)

        return {
            "task": task,
            "result": result.content,
            "duration_seconds": duration,
            "token_count": result.usage.total_tokens,
            "criteria_results": judgments,
            "overall_success": all(j["passed"] for j in judgments)
        }

    async def judge_criterion(self, result: str, criterion: str) -> dict:
        """Use a judge model to evaluate a single criterion."""
        response = await llm.chat(
            model=self.judge_model,
            messages=[{
                "role": "user",
                "content": f"""Evaluate if this agent output meets the criterion.

Output: {result}
Criterion: {criterion}

Respond with JSON: {{"passed": true/false, "reason": "explanation"}}"""
            }]
        )
        # Assumes the judge returns bare JSON with no surrounding prose
        return json.loads(response.content)
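A usage sketch, assuming `agent` is your agent object with an async `run` method and an asyncio event loop drives it:

import asyncio

async def main():
    evaluator = AgentEvaluator(agent)
    report = await evaluator.evaluate_task(
        task="Summarize this document",
        success_criteria=[
            "Summary is under 200 words",
            "Summary mentions the document's main conclusion",
            "No facts are introduced that aren't in the document",
        ],
    )
    print(report["overall_success"], f"{report['duration_seconds']:.1f}s")

asyncio.run(main())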
Statistical Evaluation
Run multiple times to measure reliability:
import statistics

async def evaluate_reliability(agent, task: str, num_runs: int = 10) -> dict:
    """Run the agent multiple times and measure output consistency."""
    results = []
    for i in range(num_runs):
        result = await agent.run(task)
        results.append({
            "run": i,
            "output": result.content,
            "tokens": result.usage.total_tokens,
            "tool_calls": len(result.tool_calls)
        })

    # Analyze consistency: identical outputs collapse to one entry
    outputs = [r["output"] for r in results]
    unique_outputs = len(set(outputs))

    return {
        "total_runs": num_runs,
        "unique_outputs": unique_outputs,
        "consistency_score": 1 - (unique_outputs - 1) / num_runs,
        "avg_tokens": sum(r["tokens"] for r in results) / num_runs,
        "token_variance": statistics.variance([r["tokens"] for r in results]),
        "results": results
    }
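Usage is a single call. Note that consistency here is exact string match, so paraphrased-but-equivalent outputs count as distinct; swapping in an embedding-similarity or LLM-judge comparison is a common refinement. The snippet below assumes the same `agent` object as above:

stats = await evaluate_reliability(agent, "Summarize this document", num_runs=10)
print(f"{stats['unique_outputs']} distinct outputs out of {stats['total_runs']} runs, "
      f"consistency {stats['consistency_score']:.0%}, avg tokens {stats['avg_tokens']:.0f}")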
Benchmark Suites
Create reusable test suites:
# benchmarks/code_agent.yaml
name: "Code Agent Benchmark"
version: "1.0"
tasks:
  - id: "simple_function"
    prompt: "Write a Python function to check if a number is prime"
    criteria:
      - "Function is syntactically valid Python"
      - "Function returns True for 2, 3, 5, 7, 11"
      - "Function returns False for 0, 1, 4, 6, 9"
    max_tokens: 500
    timeout_seconds: 30

  - id: "file_modification"
    prompt: "Add error handling to the process_data function in utils.py"
    criteria:
      - "File was modified successfully"
      - "Try-except block was added"
      - "Original functionality preserved"
    setup: "cp fixtures/utils_original.py utils.py"
    teardown: "rm utils.py"
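One way to drive this suite is a small runner that loads the YAML and feeds each task to the AgentEvaluator above. This is a sketch, not a standard tool: the loader, timeout handling, and the shelling-out for setup/teardown are assumptions, and it requires PyYAML.

import asyncio
import subprocess
import yaml  # requires PyYAML

async def run_benchmark(path: str, evaluator: AgentEvaluator) -> list[dict]:
    with open(path) as f:
        suite = yaml.safe_load(f)

    reports = []
    for task in suite["tasks"]:
        # Optional per-task setup, e.g. copying fixtures into place
        if "setup" in task:
            subprocess.run(task["setup"], shell=True, check=True)
        try:
            report = await asyncio.wait_for(
                evaluator.evaluate_task(task["prompt"], task["criteria"]),
                timeout=task.get("timeout_seconds", 60),
            )
        except asyncio.TimeoutError:
            report = {"task": task["prompt"], "overall_success": False, "error": "timeout"}
        finally:
            if "teardown" in task:
                subprocess.run(task["teardown"], shell=True)
        reports.append({"id": task["id"], **report})
    return reports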
Continuous Monitoring
Production agents need ongoing evaluation:
import hashlib
from datetime import datetime

class ProductionMonitor:
    def __init__(self, alert_threshold: float = 0.8):
        self.alert_threshold = alert_threshold
        self.metrics = []

    async def log_interaction(self, task: str, result: dict, user_feedback: int | None = None):
        """Log every production interaction."""
        metric = {
            "timestamp": datetime.now().isoformat(),
            "task_hash": hashlib.md5(task.encode()).hexdigest(),
            "success": result.get("success"),
            "tokens": result.get("tokens"),
            "latency_ms": result.get("latency_ms"),
            "user_feedback": user_feedback,  # 1-5 rating if available
            "error": result.get("error")
        }
        self.metrics.append(metric)
        await self.check_alerts()

    async def check_alerts(self):
        """Alert if the rolling success rate drops below the threshold."""
        recent = self.metrics[-100:]  # Last 100 interactions
        if len(recent) < 10:
            return  # Not enough data to be meaningful
        success_rate = sum(1 for m in recent if m["success"]) / len(recent)
        if success_rate < self.alert_threshold:
            await self.send_alert(f"Success rate dropped to {success_rate:.1%}")

    async def send_alert(self, message: str):
        """Placeholder: wire this up to your alerting channel (Slack, PagerDuty, email)."""
        print(f"[ALERT] {message}")
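Wiring the monitor into a request path might look like the sketch below; the `handle_request` shape and the fields in `outcome` are assumptions about your serving layer:

import time

monitor = ProductionMonitor(alert_threshold=0.8)

async def handle_request(task: str):
    start = time.time()
    try:
        result = await agent.run(task)
        outcome = {
            "success": True,
            "tokens": result.usage.total_tokens,
            "latency_ms": (time.time() - start) * 1000,
        }
    except Exception as e:
        outcome = {
            "success": False,
            "error": str(e),
            "latency_ms": (time.time() - start) * 1000,
        }
    await monitor.log_interaction(task, outcome)
    return outcome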
Nerd Note: OpenAI's evals and Anthropic's model card testing both use LLM-as-judge for subjective quality. It's not perfect, but it scales.
Next: Securing your agents against misuse.