Monitoring & Observability
Infrastructure Monitoring for ML
ML infrastructure monitoring goes beyond standard DevOps monitoring: alongside the usual host-level metrics, interviewers expect you to track GPU utilization and memory plus ML-specific signals such as inference latency and prediction distributions.
ML Infrastructure Metrics
| Category | Metrics | Example Alert Condition |
|---|---|---|
| GPU | Utilization, memory, temperature | <30% util warning |
| Inference | Latency p50/p99, throughput, errors | p99 > SLA |
| Pipeline | Success rate, duration, data freshness | <95% success |
| Model | Prediction distribution, confidence | Distribution shift |
Prometheus + Grafana Setup
Interview Question: "How would you set up monitoring for a model serving cluster?"
# prometheus.yml - Scrape configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'model-serving'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: model-serving
        action: keep
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        target_label: __metrics_path__
        regex: (.+)

  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['dcgm-exporter:9400']
Custom ML Metrics:
import functools

from prometheus_client import Counter, Histogram, start_http_server

# Inference metrics
INFERENCE_LATENCY = Histogram(
    'model_inference_latency_seconds',
    'Model inference latency',
    ['model_name', 'model_version'],
    buckets=[.01, .025, .05, .1, .25, .5, 1.0, 2.5]
)

INFERENCE_REQUESTS = Counter(
    'model_inference_requests_total',
    'Total inference requests',
    ['model_name', 'model_version', 'status']
)

PREDICTION_CONFIDENCE = Histogram(
    'model_prediction_confidence',
    'Model prediction confidence scores',
    ['model_name'],
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
)

# Expose the /metrics endpoint so Prometheus can scrape this pod (port 8000 is an example)
start_http_server(8000)

def instrument_inference(func):
    """Record latency, request count, and confidence for every inference call."""
    @functools.wraps(func)
    def wrapper(model_name, model_version, features):
        with INFERENCE_LATENCY.labels(model_name, model_version).time():
            try:
                result = func(model_name, model_version, features)
                INFERENCE_REQUESTS.labels(model_name, model_version, 'success').inc()
                PREDICTION_CONFIDENCE.labels(model_name).observe(result['confidence'])
                return result
            except Exception:
                INFERENCE_REQUESTS.labels(model_name, model_version, 'error').inc()
                raise
    return wrapper
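A minimal usage sketch: the decorator wraps whatever serving function you have, so the hypothetical `predict` below (its name, signature, and return shape are illustrative, not from any particular framework) is instrumented automatically.

```python
# Hypothetical serving function instrumented with the decorator above.
@instrument_inference
def predict(model_name, model_version, features):
    # ... load the model and run inference; must return a dict with a 'confidence' key
    return {"prediction": "fraud", "confidence": 0.93}

result = predict("fraud-detector", "v3", {"amount": 120.0})
```

Once the process is running, `model_inference_latency_seconds`, `model_inference_requests_total`, and `model_prediction_confidence` all appear on the pod's `/metrics` endpoint and are picked up by the scrape job above.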
GPU Monitoring with DCGM
# DCGM Exporter deployment
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter  # must match the selector above
    spec:
      containers:
        - name: dcgm-exporter
          image: nvidia/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
          ports:
            - containerPort: 9400
              name: metrics
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
Key GPU Metrics:
# DCGM metrics to monitor
critical_gpu_metrics:
  DCGM_FI_DEV_GPU_UTIL: "GPU utilization (%)"
  DCGM_FI_DEV_MEM_COPY_UTIL: "Memory copy utilization (%)"
  DCGM_FI_DEV_FB_FREE: "Framebuffer memory free (MiB)"
  DCGM_FI_DEV_FB_USED: "Framebuffer memory used (MiB)"
  DCGM_FI_DEV_GPU_TEMP: "GPU temperature (°C)"
  DCGM_FI_DEV_POWER_USAGE: "Power usage (W)"
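To sanity-check that DCGM metrics are flowing, and to spot GPUs sitting below the 30% warning threshold from the table above, you can hit the Prometheus HTTP API directly. A minimal sketch, assuming Prometheus is reachable at `http://prometheus:9090` (the URL and threshold handling are illustrative):

```python
import requests

# Hypothetical Prometheus endpoint; adjust for your cluster or port-forward.
PROMETHEUS_URL = "http://prometheus:9090"
UTILIZATION_WARNING_THRESHOLD = 30  # percent, matching the alert table above

def underutilized_gpus(threshold=UTILIZATION_WARNING_THRESHOLD):
    """Return (labels, utilization) pairs for GPUs averaging below the threshold."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": f"avg_over_time(DCGM_FI_DEV_GPU_UTIL[15m]) < {threshold}"},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return [(r["metric"], float(r["value"][1])) for r in results]

for labels, util in underutilized_gpus():
    host = labels.get("Hostname", labels.get("instance", "?"))
    print(f"GPU {labels.get('gpu', '?')} on {host}: {util:.1f}%")
```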
Alerting Rules
# prometheus-rules.yml
groups:
  - name: ml-infrastructure
    rules:
      - alert: HighInferenceLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le, model_name, model_version) (
              rate(model_inference_latency_seconds_bucket[5m])
            )
          ) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High p99 latency for {{ $labels.model_name }}"
          description: "p99 latency is {{ $value }}s, SLA is 500ms"

      - alert: LowGPUUtilization
        expr: DCGM_FI_DEV_GPU_UTIL < 30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low GPU utilization on GPU {{ $labels.gpu }}"
          description: "GPU utilization is {{ $value }}%. Consider consolidating workloads."

      - alert: HighPredictionErrorRate
        expr: |
          sum by (model_name, model_version) (rate(model_inference_requests_total{status="error"}[5m]))
            /
          sum by (model_name, model_version) (rate(model_inference_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.model_name }}"
          description: "Error ratio is {{ $value }} over the last 5 minutes (SLO: <1%)."
Interview Follow-up: Dashboard Design
Question: "What would your ideal ML monitoring dashboard show?"
ml_dashboard_panels:
  row_1_overview:
    - "Total requests/sec (all models)"
    - "Overall error rate"
    - "Active model versions"
    - "GPU cluster utilization"
  row_2_latency:
    - "p50/p95/p99 latency by model"
    - "Latency heatmap over time"
    - "SLA compliance percentage"
  row_3_model_health:
    - "Prediction confidence distribution"
    - "Prediction class distribution"
    - "Feature drift indicators"
  row_4_infrastructure:
    - "GPU memory by node"
    - "Pod restart count"
    - "Queue depth (if batching)"
Expert Insight: "We correlate low GPU utilization with high latency to detect batching misconfigurations: if GPUs are idle but latency is high, the batch size is too small."
Next, we'll cover logging and debugging production ML systems.