Monitoring & Observability

Infrastructure Monitoring for ML


ML infrastructure monitoring goes beyond standard DevOps monitoring: alongside CPU, memory, and request metrics, interviewers expect you to track GPU utilization and temperature, inference latency, pipeline health, and ML-specific signals such as prediction confidence and distribution shift.

ML Infrastructure Metrics

| Category  | Metrics                               | Alert Threshold           |
|-----------|---------------------------------------|---------------------------|
| GPU       | Utilization, memory, temperature      | <30% utilization (warning)|
| Inference | Latency p50/p99, throughput, errors   | p99 > SLA                 |
| Pipeline  | Success rate, duration, data freshness| <95% success              |
| Model     | Prediction distribution, confidence   | Distribution shift        |
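
The Pipeline row covers batch jobs (training, feature generation), which usually finish before Prometheus can scrape them, so they push their metrics instead. Below is a minimal sketch of reporting success time and duration through a Pushgateway; the Pushgateway address and the job name are assumptions, not part of the setup described later.

# Sketch: push pipeline health metrics from a batch job.
# Assumptions: a Pushgateway reachable at pushgateway:9091 and a
# hypothetical job name "feature-pipeline".
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
LAST_SUCCESS = Gauge(
    'pipeline_last_success_timestamp_seconds',
    'Unix timestamp of the last successful pipeline run',
    registry=registry,
)
RUN_DURATION = Gauge(
    'pipeline_duration_seconds',
    'Duration of the last pipeline run in seconds',
    registry=registry,
)

def report_successful_run(duration_seconds: float) -> None:
    # Record freshness and duration, then push to the gateway for scraping
    LAST_SUCCESS.set_to_current_time()
    RUN_DURATION.set(duration_seconds)
    push_to_gateway('pushgateway:9091', job='feature-pipeline', registry=registry)

Data freshness then falls out of the timestamp: alert when time() - pipeline_last_success_timestamp_seconds exceeds your freshness budget.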

Prometheus + Grafana Setup

Interview Question: "How would you set up monitoring for a model serving cluster?"

# prometheus.yml - Scrape configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'model-serving'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: model-serving
        action: keep
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        target_label: __metrics_path__
        regex: (.+)

  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['dcgm-exporter:9400']

Custom ML Metrics:

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Inference metrics
INFERENCE_LATENCY = Histogram(
    'model_inference_latency_seconds',
    'Model inference latency',
    ['model_name', 'model_version'],
    buckets=[.01, .025, .05, .1, .25, .5, 1.0, 2.5]
)

INFERENCE_REQUESTS = Counter(
    'model_inference_requests_total',
    'Total inference requests',
    ['model_name', 'model_version', 'status']
)

PREDICTION_CONFIDENCE = Histogram(
    'model_prediction_confidence',
    'Model prediction confidence scores',
    ['model_name'],
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
)

# Decorator to record latency, request counts, and prediction confidence
def instrument_inference(func):
    def wrapper(model_name, model_version, features):
        with INFERENCE_LATENCY.labels(model_name, model_version).time():
            try:
                result = func(model_name, model_version, features)
                INFERENCE_REQUESTS.labels(model_name, model_version, 'success').inc()
                PREDICTION_CONFIDENCE.labels(model_name).observe(result['confidence'])
                return result
            except Exception as e:
                INFERENCE_REQUESTS.labels(model_name, model_version, 'error').inc()
                raise
    return wrapper
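
The decorator only records metrics; Prometheus still needs an endpoint to scrape. A minimal usage sketch, assuming a hypothetical predict function and port 8000 (which must match the pod's scrape annotations):

# Usage sketch: the predict implementation and port are illustrative.
@instrument_inference
def predict(model_name, model_version, features):
    # ... run the actual model here ...
    return {'prediction': 1, 'confidence': 0.93}

if __name__ == '__main__':
    start_http_server(8000)  # serves /metrics on :8000 for Prometheus to scrape
    predict('fraud-model', 'v3', {'amount': 120.0})
    # ... keep serving requests (e.g. behind your web framework) ...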

GPU Monitoring with DCGM

# DCGM Exporter deployment
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
      - name: dcgm-exporter
        image: nvidia/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
        ports:
        - containerPort: 9400
          name: metrics
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Key GPU Metrics:

# DCGM metrics to monitor
critical_gpu_metrics:
  DCGM_FI_DEV_GPU_UTIL: "GPU utilization (%)"
  DCGM_FI_DEV_MEM_COPY_UTIL: "Memory copy utilization (%)"
  DCGM_FI_DEV_FB_FREE: "Framebuffer memory free (MiB)"
  DCGM_FI_DEV_FB_USED: "Framebuffer memory used (MiB)"
  DCGM_FI_DEV_GPU_TEMP: "GPU temperature (°C)"
  DCGM_FI_DEV_POWER_USAGE: "Power usage (W)"
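
The same series can also be queried programmatically, for example from a capacity-planning script, via the standard Prometheus HTTP API (/api/v1/query). A minimal sketch; the Prometheus URL is an assumption:

import requests

PROMETHEUS_URL = 'http://prometheus:9090'  # assumed in-cluster Prometheus address

def avg_gpu_utilization(window: str = '5m') -> float:
    """Cluster-wide average GPU utilization over the given window, in percent."""
    resp = requests.get(
        f'{PROMETHEUS_URL}/api/v1/query',
        params={'query': f'avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[{window}]))'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()['data']['result']
    # The API returns a vector of [timestamp, value] pairs; take the single aggregate
    return float(result[0]['value'][1]) if result else 0.0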

Alerting Rules

# prometheus-rules.yml
groups:
  - name: ml-infrastructure
    rules:
      - alert: HighInferenceLatency
        expr: |
          histogram_quantile(0.99,
            sum by (model_name, le) (rate(model_inference_latency_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High p99 latency for {{ $labels.model_name }}"
          description: "p99 latency is {{ $value }}s, SLA is 500ms"

      - alert: LowGPUUtilization
        expr: DCGM_FI_DEV_GPU_UTIL < 30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low GPU utilization on {{ $labels.gpu }}"
          description: "GPU utilization is {{ $value }}%. Consider consolidation."

      - alert: HighPredictionErrorRate
        expr: |
          sum by (model_name) (rate(model_inference_requests_total{status="error"}[5m]))
          / sum by (model_name) (rate(model_inference_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.model_name }}"

Interview Follow-up: Dashboard Design

Question: "What would your ideal ML monitoring dashboard show?"

ml_dashboard_panels:
  row_1_overview:
    - "Total requests/sec (all models)"
    - "Overall error rate"
    - "Active model versions"
    - "GPU cluster utilization"

  row_2_latency:
    - "p50/p95/p99 latency by model"
    - "Latency heatmap over time"
    - "SLA compliance percentage"

  row_3_model_health:
    - "Prediction confidence distribution"
    - "Prediction class distribution"
    - "Feature drift indicators"

  row_4_infrastructure:
    - "GPU memory by node"
    - "Pod restart count"
    - "Queue depth (if batching)"

Expert Insight: "We correlate low GPU utilization with high latency to detect batching misconfigurations: if GPUs are idle but latency is high, the batch size is too small."
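
One way to make that correlation visible on the dashboard is to export queue depth and batch size directly from the serving path. A minimal sketch of in-process batching instrumentation; the queue and the collect_batch helper are illustrative, not part of any serving framework:

import queue
from prometheus_client import Gauge, Histogram

QUEUE_DEPTH = Gauge(
    'model_batch_queue_depth',
    'Number of requests waiting to be batched',
    ['model_name'],
)
BATCH_SIZE = Histogram(
    'model_batch_size',
    'Number of requests per GPU batch',
    ['model_name'],
    buckets=[1, 2, 4, 8, 16, 32, 64],
)

request_queue = queue.Queue()

def collect_batch(model_name, max_batch_size=32, timeout_s=0.01):
    """Drain up to max_batch_size requests, recording queue depth and batch size."""
    QUEUE_DEPTH.labels(model_name).set(request_queue.qsize())
    batch = []
    try:
        while len(batch) < max_batch_size:
            batch.append(request_queue.get(timeout=timeout_s))
    except queue.Empty:
        pass
    BATCH_SIZE.labels(model_name).observe(len(batch))
    return batch

Consistently full batches with a deep queue point to under-provisioning; tiny batches alongside idle GPUs point to the misconfiguration described in the insight above.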

Next, we'll cover logging and debugging production ML systems.
