Production Monitoring & Next Steps

Alerting & SLOs

Production LLM systems need proactive monitoring. Set quality thresholds, configure alerts, and define Service Level Objectives (SLOs) to catch issues before users do.

What are SLOs for LLMs?

Service Level Objectives define acceptable quality levels:

SLO Type     │ Example              │ Threshold
──────────────────────────────────────────────────
Latency      │ P95 response time    │ < 3 seconds
Quality      │ Helpfulness score    │ > 0.8
Availability │ Successful responses │ > 99.5%
Cost         │ Cost per query       │ < $0.05
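
One way to make these SLOs checkable in code is to store each threshold together with its comparison direction. The sketch below is illustrative only: the `SLO` class, the metric names, and `evaluate_slos` are hypothetical and not part of any monitoring library.

```python
import operator
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    name: str
    comparator: str   # "<" means lower is better, ">" means higher is better
    threshold: float

    def is_met(self, observed: float) -> bool:
        ops = {"<": operator.lt, ">": operator.gt}
        return ops[self.comparator](observed, self.threshold)

# The four example SLOs from the table, with assumed metric names.
SLOS = [
    SLO("latency_p95_seconds", "<", 3.0),
    SLO("helpfulness_score", ">", 0.8),
    SLO("success_rate", ">", 0.995),
    SLO("cost_per_query_usd", "<", 0.05),
]

def evaluate_slos(observed: dict) -> dict:
    """Return {slo_name: bool} for every SLO that has an observed value."""
    return {s.name: s.is_met(observed[s.name]) for s in SLOS if s.name in observed}
```

Keeping the direction next to the threshold avoids a common mistake: applying a "lower than threshold is bad" rule to latency or cost, where lower is actually good.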

Defining Quality SLOs

Set thresholds for your evaluation metrics:

# Quality SLOs for a support bot
QUALITY_SLOS = {
    "accuracy": {
        "target": 0.90,
        "warning": 0.85,
        "critical": 0.75
    },
    "helpfulness": {
        "target": 0.85,
        "warning": 0.80,
        "critical": 0.70
    },
    "response_time_p95_ms": {
        "target": 2000,
        "warning": 3000,
        "critical": 5000
    }
}
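
Note that the score metrics above are "higher is better" (target > warning > critical), while latency is "lower is better" (target < warning < critical). A small sanity check at load time can catch thresholds entered in the wrong order. This is a minimal sketch; `validate_slo_thresholds` is a hypothetical helper, not a library function.

```python
def validate_slo_thresholds(slos: dict) -> list:
    """Return a list of problems; an empty list means the config is consistent."""
    problems = []
    for metric, t in slos.items():
        missing = {"target", "warning", "critical"} - t.keys()
        if missing:
            problems.append(f"{metric}: missing {sorted(missing)}")
            continue
        # Either direction is valid, as long as the ordering is strict.
        higher_is_better = t["target"] > t["warning"] > t["critical"]
        lower_is_better = t["target"] < t["warning"] < t["critical"]
        if not (higher_is_better or lower_is_better):
            problems.append(f"{metric}: thresholds are not strictly ordered")
    return problems
```

Calling `validate_slo_thresholds(QUALITY_SLOS)` on the dictionary above should return an empty list.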

Setting Up Alerts

LangSmith Alerting

LangSmith supports alerting on trace metrics:

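# Configure the alert in the LangSmith UI:
# 1. Navigate to Settings > Alerts
# 2. Create new alert rule
# 3. Set conditions
#
# The dict below only summarizes the values entered in the UI;
# it is not an API payload.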

alert_config = {
    "name": "Quality Drop Alert",
    "condition": "avg(helpfulness_score) < 0.8",
    "window": "1 hour",
    "notification": {
        "type": "slack",
        "channel": "#llm-alerts"
    }
}
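
To see what a condition such as "average helpfulness below 0.8 over one hour" means in practice, the sketch below evaluates it in plain Python. It does not call LangSmith; `RollingAverageAlert` is a hypothetical class, and it assumes samples are added in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta

class RollingAverageAlert:
    """Fires when the average score inside a time window drops below a threshold."""

    def __init__(self, threshold: float, window: timedelta):
        self.threshold = threshold
        self.window = window
        self._samples = deque()  # (timestamp, score), oldest first

    def add(self, timestamp: datetime, score: float) -> None:
        self._samples.append((timestamp, score))

    def should_alert(self, now: datetime) -> bool:
        # Drop samples that have fallen out of the window.
        cutoff = now - self.window
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
        if not self._samples:
            return False  # no data in the window: nothing to judge
        average = sum(score for _, score in self._samples) / len(self._samples)
        return average < self.threshold
```

Evaluating the average over a window, rather than each score on its own, keeps a single poor response from triggering a notification.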

MLflow Alerting Pattern

import mlflow

def check_quality_slos(results: dict) -> list:
    """Check evaluation results against QUALITY_SLOS and return any violations."""
    violations = []

    for metric, thresholds in QUALITY_SLOS.items():
        value = results.get(metric)
        if value is None:
            continue

        # Scores are "higher is better"; latency is "lower is better".
        # Infer the direction from the ordering of the thresholds.
        lower_is_better = thresholds["critical"] > thresholds["target"]

        # Check critical first so each metric reports only its worst level.
        for level in ("critical", "warning"):
            limit = thresholds[level]
            breached = value > limit if lower_is_better else value < limit
            if breached:
                violations.append({
                    "metric": metric,
                    "level": level,
                    "value": value,
                    "threshold": limit
                })
                break

    return violations

# After each evaluation (eval_results comes from your evaluation run;
# send_alert is your own notification helper)
violations = check_quality_slos(eval_results.metrics)
if violations:
    send_alert(violations)
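
The alert helper is left undefined above. As one possible starting point, the sketch below turns a list of violations (in the dictionary shape produced by `check_quality_slos`) into a readable message; `format_violations` is a hypothetical name, and delivery to Slack, email, or another channel is left out.

```python
def format_violations(violations: list) -> str:
    """Render violations as one line each, most severe first."""
    if not violations:
        return "All SLOs met."
    severity_order = {"critical": 0, "warning": 1}
    ordered = sorted(violations, key=lambda v: severity_order.get(v["level"], 2))
    lines = [
        f"[{v['level'].upper()}] {v['metric']} = {v['value']} "
        f"(threshold {v['threshold']})"
        for v in ordered
    ]
    return "\n".join(lines)
```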

W&B Weave Alerting

import weave

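@weave.op()
async def production_eval_with_alerts():
    """Run evaluation and check SLOs."""
    # `evaluation`, `production_model`, and `send_slack_alert` are defined elsewhere.
    # The function must be async because it awaits the evaluation.
    results = await evaluation.evaluate(production_model)

    # Adjust this lookup to the shape your evaluation actually returns.
    accuracy = results.summary["accuracy"]

    # Check against the accuracy warning threshold
    if accuracy < 0.85:
        # Trigger alert
        send_slack_alert(
            message=f"Quality SLO breach: accuracy = {accuracy}"
        )

    return results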

Alert Channels

Configure multiple notification channels:

Channel   │ Use Case
───────────────────────────────────────────
Slack     │ Real-time team notifications
Email     │ Detailed reports and summaries
PagerDuty │ Critical on-call alerts
Webhooks  │ Custom integrations
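
A simple way to use several channels is to route by severity. The sketch below only decides which channel receives which violation; the `ROUTES` mapping and `route_violations` are hypothetical, the fallback to email for unknown levels is an assumption, and actually sending the messages is left to your own integrations.

```python
# Assumed mapping from severity level to notification channels.
ROUTES = {
    "warning": ["slack"],
    "critical": ["slack", "pagerduty"],
}

def route_violations(violations: list) -> dict:
    """Group violations by the channel that should receive them."""
    by_channel = {}
    for violation in violations:
        # Unknown levels fall back to email so nothing is silently dropped.
        for channel in ROUTES.get(violation["level"], ["email"]):
            by_channel.setdefault(channel, []).append(violation)
    return by_channel
```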

Alert Fatigue Prevention

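Too many alerts train people to ignore them. To keep alerts actionable:

  1. Set appropriate thresholds: Loose enough that normal variation does not fire them
  2. Use warning before critical: Catch issues early
  3. Aggregate alerts: Alert on windowed metrics, not on individual requests
  4. Add context: Include the metric, its value, and the breached threshold
  5. Define escalation paths: Warning → Critical → Page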
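
One common way to reduce repeated notifications is a cooldown per metric and severity. This is a minimal sketch under that assumption; `AlertThrottle` is a hypothetical class, and timestamps are plain seconds so the caller can pass `time.time()` or a test clock.

```python
class AlertThrottle:
    """Suppress repeat alerts for the same (metric, level) within a cooldown."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown = cooldown_seconds
        self._last_sent = {}  # (metric, level) -> timestamp of last sent alert

    def should_send(self, metric: str, level: str, now: float) -> bool:
        key = (metric, level)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown: stay quiet
        self._last_sent[key] = now
        return True
```

Because the key includes the level, a warning that escalates to critical still goes out immediately instead of being swallowed by the warning's cooldown.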

SLO Dashboard

Track SLO compliance over time:

SLO Dashboard - Last 7 Days
───────────────────────────────────────────
Metric          │ Target │ Current │ Status
───────────────────────────────────────────
Accuracy        │ 90%    │ 92.3%   │ ✅
Helpfulness     │ 85%    │ 87.1%   │ ✅
P95 Latency     │ 2s     │ 1.8s    │ ✅
Error Rate      │ <1%    │ 0.3%    │ ✅
Cost/Query      │ $0.05  │ $0.042  │ ✅
───────────────────────────────────────────
Overall SLO Compliance: 100%
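
The "Overall SLO Compliance" figure can be read as the fraction of tracked metrics that currently meet their target. The sketch below reproduces that calculation using the numbers from the dashboard above; the metric names, the `ROWS` layout, and the treatment of latency, error rate, and cost as upper bounds are assumptions.

```python
# (metric, target, current, higher_is_better)
ROWS = [
    ("accuracy", 0.90, 0.923, True),
    ("helpfulness", 0.85, 0.871, True),
    ("p95_latency_s", 2.0, 1.8, False),
    ("error_rate", 0.01, 0.003, False),
    ("cost_per_query_usd", 0.05, 0.042, False),
]

def row_status(target: float, current: float, higher_is_better: bool) -> bool:
    """True when the current value is on the good side of the target."""
    return current >= target if higher_is_better else current <= target

def overall_compliance(rows: list) -> float:
    """Fraction of rows whose current value meets the target."""
    statuses = [row_status(t, c, h) for _, t, c, h in rows]
    return sum(statuses) / len(statuses)
```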

Best Practices

Practice            │ Why
──────────────────────────────────────────────────────────────
Start with few SLOs │ Add more as you understand your system
Use error budgets   │ Allow some SLO breaches
Review regularly    │ Adjust thresholds as needed
Document runbooks   │ What to do when alerts fire
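
An error budget is the amount of failure an SLO tolerates over a period: a 99.5% success target leaves 0.5% of requests that may fail before the SLO is breached. The sketch below shows the arithmetic; `error_budget` and its field names are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute how much of the failure allowance has been used."""
    allowed = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining": allowed - failed_requests,
        # A target of 100% leaves no budget, so any failure overspends it.
        "consumed_fraction": failed_requests / allowed if allowed else float("inf"),
    }
```

For example, with a 99.5% target and 100,000 requests, roughly 500 failures are allowed; 300 failures would use about 60% of the budget. A team might treat a nearly spent budget as the signal to pause risky changes rather than alerting on every individual breach.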

Tip: Start with 3-5 key SLOs. You can always add more, but too many early on leads to alert fatigue.

Next, we'll explore cost tracking and optimization strategies.
