Production Operations & GitOps

Monitoring & Alerting for ML Systems

ML production systems require specialized monitoring beyond traditional application metrics. This includes model performance, data drift detection, GPU utilization, and inference latency tracking.

ML Monitoring Stack

┌─────────────────────────────────────────────────────────────────────┐
│                    ML Monitoring Architecture                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   Alerting & Notification                    │    │
│  │  [Alertmanager] → [PagerDuty] [Slack] [Email]               │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              ↑                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   Visualization (Grafana)                    │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │    │
│  │  │ GPU Dash │  │ Inference│  │  Model   │  │  Cost    │    │    │
│  │  │          │  │ Latency  │  │ Accuracy │  │ Analysis │    │    │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              ↑                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   Metrics Storage                            │    │
│  │  [Prometheus] [Thanos/Cortex for long-term]                 │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              ↑                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   Exporters & Collectors                     │    │
│  │  [DCGM] [kube-state] [node-exporter] [custom ML metrics]   │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
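
Each Grafana panel in the diagram is ultimately a PromQL query against the metrics collected by the layer below it. A few representative queries, assuming the DCGM exporter and the custom ML metrics introduced later in this lesson (the hourly GPU price in the cost query is a placeholder):

# Average utilization per GPU, from the DCGM exporter
avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)

# P99 inference latency per model (same query the SLA alert below uses)
histogram_quantile(0.99,
  sum(rate(inference_latency_seconds_bucket[5m])) by (le, model))

# Current validation accuracy per model, from the custom exporter
model_accuracy

# Rough cost burn rate: number of reporting GPUs times an assumed $2.50/hour
count(DCGM_FI_DEV_GPU_UTIL) * 2.50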

GPU Monitoring with DCGM

# DCGM Exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
        ports:
        - containerPort: 9400
          name: metrics
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        securityContext:
          privileged: true
        volumeMounts:
        - name: pod-gpu-resources
          mountPath: /var/lib/kubelet/pod-resources
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
  - port: 9400
    targetPort: 9400
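
Prometheus still needs to be told to scrape the exporter. Since the alert rules below use the Prometheus Operator's PrometheusRule CRD, the matching way to wire this up is a ServiceMonitor selecting the Service above; a minimal sketch, assuming the operator is configured to pick up ServiceMonitors in the monitoring namespace:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - targetPort: 9400
    interval: 30s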

ML-Specific Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-production-alerts
  namespace: monitoring
spec:
  groups:
  - name: inference-sla
    rules:
    - alert: InferenceLatencyHigh
      expr: |
        histogram_quantile(0.99,
          sum(rate(inference_latency_seconds_bucket[5m])) by (le, model)
        ) > 2
      for: 5m
      labels:
        severity: warning
        team: ml-platform
      annotations:
        summary: "P99 inference latency exceeds 2s for {{ $labels.model }}"
        runbook_url: "https://runbooks.example.com/inference-latency"

    - alert: InferenceErrorRateHigh
      expr: |
        sum(rate(inference_requests_total{status="error"}[5m])) by (model) /
        sum(rate(inference_requests_total[5m])) by (model) > 0.01
      for: 5m
      labels:
        severity: critical
        team: ml-platform
      annotations:
        summary: "Error rate > 1% for model {{ $labels.model }}"

  - name: gpu-health
    rules:
    - alert: GPUMemoryExhausted
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU {{ $labels.gpu }} memory usage > 95%"

    - alert: GPUTemperatureHigh
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} temperature {{ $value }}°C"

    - alert: GPUXIDErrors
      expr: increase(DCGM_FI_DEV_XID_ERRORS[1h]) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "GPU XID error detected on {{ $labels.gpu }}"

  - name: model-quality
    rules:
    - alert: ModelAccuracyDrift
      expr: |
        (model_accuracy - model_accuracy offset 1d) / model_accuracy offset 1d < -0.05
      for: 1h
      labels:
        severity: warning
        team: ml-engineering
      annotations:
        summary: "Model accuracy dropped >5% compared to yesterday"

    - alert: PredictionDistributionShift
      expr: |
        abs(
          avg_over_time(prediction_mean[1h]) -
          avg_over_time(prediction_mean[1h] offset 7d)
        ) / stddev_over_time(prediction_mean[7d]) > 3
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Prediction distribution shifted significantly"

Custom ML Metrics Exporter

# Python metrics exporter for ML services
from prometheus_client import start_http_server, Gauge, Histogram, Counter, Summary
import time

# Define ML-specific metrics
MODEL_ACCURACY = Gauge(
    'model_accuracy',
    'Current model accuracy on validation set',
    ['model_name', 'model_version']
)

PREDICTION_LATENCY = Histogram(
    'prediction_latency_seconds',
    'Time spent processing prediction',
    ['model_name', 'batch_size'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

FEATURE_DRIFT = Gauge(
    'feature_drift_score',
    'Feature drift score (PSI/KL divergence)',
    ['model_name', 'feature_name']
)

PREDICTION_DISTRIBUTION = Summary(
    'prediction_values',
    'Distribution of prediction values',
    ['model_name', 'class_label']
)

CACHE_HIT_RATIO = Gauge(
    'model_cache_hit_ratio',
    'Ratio of predictions served from cache',
    ['model_name']
)

# Export metrics on port 8000
if __name__ == '__main__':
    start_http_server(8000)
    while True:
        # Update metrics from your ML system
        MODEL_ACCURACY.labels(model_name='fraud_detector', model_version='v2').set(0.956)
        time.sleep(30)
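
In a real service these metrics are updated on the serving path and by a periodic drift job rather than hard-coded. A sketch of both, assuming a hypothetical predict() call and pandas DataFrames holding the training baseline and recent live traffic; the PSI binning choices are also assumptions:

import numpy as np

def record_prediction(model_name, features):
    # Time the prediction and record its value; predict() is a hypothetical model call
    with PREDICTION_LATENCY.labels(model_name=model_name, batch_size='1').time():
        score = predict(features)
    PREDICTION_DISTRIBUTION.labels(model_name=model_name, class_label='positive').observe(score)
    return score

def population_stability_index(expected, actual, bins=10):
    # PSI = sum((actual% - expected%) * ln(actual% / expected%)) over shared bins
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def update_drift_metrics(model_name, training_df, live_df):
    # Compare each feature's live distribution against the training baseline
    for feature in training_df.columns:
        psi = population_stability_index(training_df[feature], live_df[feature])
        FEATURE_DRIFT.labels(model_name=model_name, feature_name=feature).set(psi)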

Alertmanager Configuration

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx'

    route:
      group_by: ['alertname', 'severity', 'model']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty-ml'
        continue: true
      - match:
          team: ml-platform
        receiver: 'slack-ml-platform'
      - match:
          team: ml-engineering
        receiver: 'slack-ml-engineering'

    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#ml-alerts'
        send_resolved: true

    - name: 'pagerduty-ml'
      pagerduty_configs:
      - service_key: '<pagerduty-key>'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ .CommonAnnotations.summary }}'

    - name: 'slack-ml-platform'
      slack_configs:
      - channel: '#ml-platform-alerts'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

    - name: 'slack-ml-engineering'
      slack_configs:
      - channel: '#ml-engineering-alerts'

Next lesson: Cost optimization for GPU workloads.
