Production Monitoring & Next Steps
Alerting & SLOs
3 min read
Production LLM systems need proactive monitoring. Set quality thresholds, configure alerts, and define Service Level Objectives (SLOs) to catch issues before users do.
What are SLOs for LLMs?
Service Level Objectives define acceptable quality levels:
| SLO Type | Example | Threshold |
|---|---|---|
| Latency | P95 response time | < 3 seconds |
| Quality | Helpfulness score | > 0.8 |
| Availability | Successful responses | > 99.5% |
| Cost | Cost per query | < $0.05 |
Defining Quality SLOs
Set thresholds for your evaluation metrics:
# Quality SLOs for a support bot
QUALITY_SLOS = {
"accuracy": {
"target": 0.90,
"warning": 0.85,
"critical": 0.75
},
"helpfulness": {
"target": 0.85,
"warning": 0.80,
"critical": 0.70
},
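    # Latency thresholds run the other way: higher values are worse,
    # so warning and critical sit above the 2-second target.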
"response_time_p95_ms": {
"target": 2000,
"warning": 3000,
"critical": 5000
}
}
Setting Up Alerts
LangSmith Alerting
LangSmith supports alerting on trace metrics:
# Configure alert in LangSmith UI:
# 1. Navigate to Settings > Alerts
# 2. Create new alert rule
# 3. Set conditions:
alert_config = {
"name": "Quality Drop Alert",
"condition": "avg(helpfulness_score) < 0.8",
"window": "1 hour",
"notification": {
"type": "slack",
"channel": "#llm-alerts"
}
}
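If you'd rather poll programmatically than rely on the UI rule alone, a minimal sketch using the LangSmith SDK could look like the following. The project name and `send_slack_alert` helper are placeholders, and it assumes helpfulness scores are logged as feedback on production runs:

```python
from datetime import datetime, timedelta, timezone
from langsmith import Client

client = Client()
since = datetime.now(timezone.utc) - timedelta(hours=1)

# Fetch the last hour of traced runs for the production project
runs = list(client.list_runs(project_name="support-bot-prod", start_time=since))

# Collect helpfulness feedback scores attached to those runs
scores = [
    fb.score
    for fb in client.list_feedback(run_ids=[run.id for run in runs])
    if fb.key == "helpfulness" and fb.score is not None
]

if scores and sum(scores) / len(scores) < 0.8:
    send_slack_alert(f"Helpfulness averaged {sum(scores) / len(scores):.2f} over the last hour")
```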
MLflow Alerting Pattern
import mlflow
def check_quality_slos(results: dict) -> list:
    """Return any SLO violations found in a dict of evaluation metrics."""
    violations = []
    for metric, thresholds in QUALITY_SLOS.items():
        value = results.get(metric)
        if value is None:
            continue
        # Latency-style metrics get worse as they rise (their thresholds sit above
        # the target); quality metrics get worse as they fall.
        lower_is_better = thresholds["critical"] > thresholds["target"]
        if lower_is_better:
            level = ("critical" if value > thresholds["critical"]
                     else "warning" if value > thresholds["warning"] else None)
        else:
            level = ("critical" if value < thresholds["critical"]
                     else "warning" if value < thresholds["warning"] else None)
        if level:
            violations.append({
                "metric": metric,
                "level": level,
                "value": value,
                "threshold": thresholds[level],
            })
    return violations

# After each evaluation (eval_results.metrics is the metrics dict from mlflow.evaluate)
violations = check_quality_slos(eval_results.metrics)
if violations:
send_alert(violations)
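`send_alert` is left to you; one minimal sketch posts the violations to a Slack incoming webhook (the `SLACK_WEBHOOK_URL` environment variable is an assumption) and could just as easily be swapped for email or PagerDuty:

```python
import json
import os
import urllib.request

def send_alert(violations: list) -> None:
    """Post SLO violations to a Slack incoming webhook."""
    lines = [
        f"[{v['level'].upper()}] {v['metric']}: {v['value']} (threshold: {v['threshold']})"
        for v in violations
    ]
    payload = json.dumps({"text": "SLO violations detected:\n" + "\n".join(lines)}).encode()
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```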
W&B Weave Alerting
import weave
@weave.op()
async def production_eval_with_alerts():
    """Run the Weave evaluation, then check the results against SLOs."""
    results = await evaluation.evaluate(production_model)
    # Check against SLOs
    accuracy = results.summary["accuracy"]
    if accuracy < 0.85:
        # Trigger alert
        send_slack_alert(
            message=f"Quality SLO breach: accuracy = {accuracy:.3f}"
        )
    return results
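Because the evaluation call is awaited, `production_eval_with_alerts` is a coroutine: call it with `asyncio.run(production_eval_with_alerts())` from a synchronous entry point, or `await` it inside an existing event loop.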
Alert Channels
Configure multiple notification channels:
| Channel | Use Case |
|---|---|
| Slack | Real-time team notifications |
| Email | Detailed reports and summaries |
| PagerDuty | Critical on-call alerts |
| Webhooks | Custom integrations |
Alert Fatigue Prevention
Too many alerts train people to ignore them. To keep alerts actionable:
- Set appropriate thresholds: Not too sensitive
- Use warning before critical: Catch issues early
- Aggregate alerts: Batch or debounce instead of alerting per-request (see the sketch after this list)
- Add context: Include relevant information
- Define escalation paths: Warning → Critical → Page
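Aggregation can be as simple as a per-metric cooldown, so a sustained quality dip produces one notification instead of hundreds. A minimal sketch (the `notify` helper is a placeholder):

```python
import time

_last_alert: dict[str, float] = {}
COOLDOWN_SECONDS = 30 * 60  # at most one alert per metric every 30 minutes

def maybe_alert(metric: str, message: str) -> None:
    """Forward the alert only if this metric hasn't fired recently."""
    now = time.time()
    if now - _last_alert.get(metric, 0.0) >= COOLDOWN_SECONDS:
        _last_alert[metric] = now
        notify(message)  # e.g. Slack, email, PagerDuty
```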
SLO Dashboard
Track SLO compliance over time:
SLO Dashboard - Last 7 Days
───────────────────────────────────────────
Metric │ Target │ Current │ Status
───────────────────────────────────────────
Accuracy │ 90% │ 92.3% │ ✅
Helpfulness │ 85% │ 87.1% │ ✅
P95 Latency │ 2s │ 1.8s │ ✅
Error Rate │ <1% │ 0.3% │ ✅
Cost/Query │ $0.05 │ $0.042 │ ✅
───────────────────────────────────────────
Overall SLO Compliance: 100%
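You can generate a rough version of this view straight from the SLO definitions above; a sketch, assuming `current` holds the latest value for each metric:

```python
def print_slo_status(current: dict) -> None:
    """Print per-metric SLO status plus overall compliance."""
    met = 0
    for metric, t in QUALITY_SLOS.items():
        value = current[metric]
        lower_is_better = t["critical"] > t["target"]
        ok = value <= t["target"] if lower_is_better else value >= t["target"]
        met += ok
        print(f"{metric:<24} target={t['target']:<7} current={value:<7} {'✅' if ok else '❌'}")
    print(f"Overall SLO compliance: {met / len(QUALITY_SLOS):.0%}")
```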
Best Practices
| Practice | Why |
|---|---|
| Start with few SLOs | Add more as you understand your system |
| Use error budgets | Allow some SLO breaches |
| Review regularly | Adjust thresholds as needed |
| Document runbooks | What to do when alerts fire |
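Error budgets make "allow some SLO breaches" concrete: a 99.5% availability SLO over 100,000 requests budgets 500 failures, and you alert on how fast that budget is being spent rather than on every individual miss. A small sketch:

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left for an availability-style SLO."""
    allowed_failures = (1 - slo_target) * total  # e.g. 500 for 99.5% over 100k requests
    actual_failures = total - good
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1 - actual_failures / allowed_failures)

# 99.5% target, 99,800 good responses out of 100,000 -> 60% of the budget left
print(error_budget_remaining(0.995, 99_800, 100_000))
```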
Tip: Start with 3-5 key SLOs. You can always add more, but too many early on leads to alert fatigue.
Next, we'll explore cost tracking and optimization strategies.