Production Operations & GitOps
Monitoring & Alerting for ML Systems
3 min read
ML production systems require specialized monitoring beyond traditional application metrics. This includes model performance, data drift detection, GPU utilization, and inference latency tracking.
ML Monitoring Stack
┌────────────────────────────────────────────────────────────────────┐
│                     ML Monitoring Architecture                      │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │                  Alerting & Notification                   │    │
│  │    [Alertmanager] → [PagerDuty] [Slack] [Email]            │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                 ↑                                  │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │                  Visualization (Grafana)                   │    │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │    │
│  │  │ GPU Dash │ │ Inference│ │ Model    │ │ Cost     │       │    │
│  │  │          │ │ Latency  │ │ Accuracy │ │ Analysis │       │    │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘       │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                 ↑                                  │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │                      Metrics Storage                       │    │
│  │    [Prometheus]  [Thanos/Cortex for long-term]             │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                 ↑                                  │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │                   Exporters & Collectors                   │    │
│  │  [DCGM] [kube-state] [node-exporter] [custom ML metrics]   │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
GPU Monitoring with DCGM
# DCGM Exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
        ports:
        - containerPort: 9400
          name: metrics
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        securityContext:
          privileged: true
        volumeMounts:
        - name: pod-gpu-resources
          mountPath: /var/lib/kubelet/pod-resources
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
  - port: 9400
    targetPort: 9400
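With the exporter and Service in place, Prometheus still needs to be told to scrape them. If you run the Prometheus Operator (which the `PrometheusRule` resources below assume), a `ServiceMonitor` is the usual way to wire this up. A minimal sketch, assuming the Service above is created in the `monitoring` namespace and that your Prometheus instance selects ServiceMonitors by a `release: prometheus` label; adjust both to your installation:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus        # assumption: whatever label your Prometheus selects on
spec:
  selector:
    matchLabels:
      app: dcgm-exporter       # matches the Service above
  namespaceSelector:
    matchNames:
      - monitoring             # assumption: the Service lives in 'monitoring'
  endpoints:
  - targetPort: 9400           # the Service port above is unnamed, so match by target port
    interval: 30s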
ML-Specific Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-production-alerts
  namespace: monitoring
spec:
  groups:
  - name: inference-sla
    rules:
    - alert: InferenceLatencyHigh
      expr: |
        histogram_quantile(0.99,
          sum(rate(inference_latency_seconds_bucket[5m])) by (le, model)
        ) > 2
      for: 5m
      labels:
        severity: warning
        team: ml-platform
      annotations:
        summary: "P99 inference latency exceeds 2s for {{ $labels.model }}"
        runbook_url: "https://runbooks.example.com/inference-latency"
    - alert: InferenceErrorRateHigh
      expr: |
        sum(rate(inference_requests_total{status="error"}[5m])) by (model) /
        sum(rate(inference_requests_total[5m])) by (model) > 0.01
      for: 5m
      labels:
        severity: critical
        team: ml-platform
      annotations:
        summary: "Error rate > 1% for model {{ $labels.model }}"
  - name: gpu-health
    rules:
    - alert: GPUMemoryExhausted
      # Usage ratio is used / (used + free), not used / free
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU {{ $labels.gpu }} memory usage > 95%"
    - alert: GPUTemperatureHigh
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} temperature {{ $value }}°C"
    - alert: GPUXIDErrors
      expr: increase(DCGM_FI_DEV_XID_ERRORS[1h]) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "GPU XID error detected on {{ $labels.gpu }}"
  - name: model-quality
    rules:
    - alert: ModelAccuracyDrift
      expr: |
        (model_accuracy - model_accuracy offset 1d) / (model_accuracy offset 1d) < -0.05
      for: 1h
      labels:
        severity: warning
        team: ml-engineering
      annotations:
        summary: "Model accuracy dropped >5% compared to yesterday"
    - alert: PredictionDistributionShift
      expr: |
        abs(
          avg_over_time(prediction_mean[1h]) -
          avg_over_time(prediction_mean[1h] offset 7d)
        ) / stddev_over_time(prediction_mean[7d]) > 3
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Prediction distribution shifted significantly"
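The inference-sla rules assume the serving layer actually exports `inference_latency_seconds` and `inference_requests_total` with `model` and `status` labels; neither exists unless you instrument the serving code. A minimal sketch of that instrumentation with `prometheus_client`, where `run_model` is a hypothetical stand-in for your inference call:
from prometheus_client import Counter, Histogram

# Metrics queried by the inference-sla alert rules above
INFERENCE_LATENCY = Histogram(
    'inference_latency_seconds',
    'End-to-end inference latency',
    ['model'],
    buckets=[.05, .1, .25, .5, 1, 2, 5, 10]
)
INFERENCE_REQUESTS = Counter(
    'inference_requests_total',
    'Inference requests by outcome',
    ['model', 'status']
)

def predict(model_name, payload):
    # Observe latency and count success/error per model
    with INFERENCE_LATENCY.labels(model=model_name).time():
        try:
            result = run_model(model_name, payload)  # hypothetical inference call
            INFERENCE_REQUESTS.labels(model=model_name, status='success').inc()
            return result
        except Exception:
            INFERENCE_REQUESTS.labels(model=model_name, status='error').inc()
            raise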
Custom ML Metrics Exporter
# Python metrics exporter for ML services
from prometheus_client import start_http_server, Gauge, Histogram, Counter, Summary
import time

# Define ML-specific metrics
MODEL_ACCURACY = Gauge(
    'model_accuracy',
    'Current model accuracy on validation set',
    ['model_name', 'model_version']
)

PREDICTION_LATENCY = Histogram(
    'prediction_latency_seconds',
    'Time spent processing prediction',
    ['model_name', 'batch_size'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

FEATURE_DRIFT = Gauge(
    'feature_drift_score',
    'Feature drift score (PSI/KL divergence)',
    ['model_name', 'feature_name']
)

PREDICTION_DISTRIBUTION = Summary(
    'prediction_values',
    'Distribution of prediction values',
    ['model_name', 'class_label']
)

CACHE_HIT_RATIO = Gauge(
    'model_cache_hit_ratio',
    'Ratio of predictions served from cache',
    ['model_name']
)

# Export metrics on port 8000
if __name__ == '__main__':
    start_http_server(8000)
    while True:
        # Update metrics from your ML system
        MODEL_ACCURACY.labels(model_name='fraud_detector', model_version='v2').set(0.956)
        time.sleep(30)
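The `feature_drift_score` gauge is only useful if something actually computes a drift statistic to feed it. A minimal sketch of a Population Stability Index (PSI) calculation against a stored reference sample; the bin count, sample sources, and the `amount` feature name are illustrative assumptions:
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between a reference sample and a current sample of one feature."""
    # Bin edges come from the reference distribution
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions; epsilon avoids log(0) and division by zero
    eps = 1e-6
    ref_pct = ref_counts / max(ref_counts.sum(), 1) + eps
    cur_pct = cur_counts / max(cur_counts.sum(), 1) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Example usage (names are illustrative):
# psi = population_stability_index(training_sample['amount'], last_hour['amount'])
# FEATURE_DRIFT.labels(model_name='fraud_detector', feature_name='amount').set(psi)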
Alertmanager Configuration
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx'
    route:
      group_by: ['alertname', 'severity', 'model']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty-ml'
        continue: true
      - match:
          team: ml-platform
        receiver: 'slack-ml-platform'
      - match:
          team: ml-engineering
        receiver: 'slack-ml-engineering'
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#ml-alerts'
        send_resolved: true
    - name: 'pagerduty-ml'
      pagerduty_configs:
      - service_key: '<pagerduty-key>'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ .CommonAnnotations.summary }}'
    - name: 'slack-ml-platform'
      slack_configs:
      - channel: '#ml-platform-alerts'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    - name: 'slack-ml-engineering'
      slack_configs:
      - channel: '#ml-engineering-alerts'
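One refinement worth considering: once a critical alert is already paging for a model, warning-level alerts for the same model mostly add noise. A hedged sketch of an inhibit rule that could be appended at the top level of the `alertmanager.yaml` above, alongside `route:` and `receivers:` (this uses the `source_matchers`/`target_matchers` syntax available in Alertmanager 0.22+; older releases use `source_match`/`target_match`):
# Appended at the top level of alertmanager.yaml
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['model']   # suppress warnings for a model that already has a critical alert firing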
Next lesson: Cost optimization for GPU workloads.