Model Serving & Inference
Autoscaling & Traffic Management
3 min read
Production ML inference needs autoscaling driven by GPU and request metrics, plus careful traffic management for safe rollouts. This lesson covers GPU-based HPA, KEDA, KServe autoscaling, and canary/traffic-splitting strategies with Argo Rollouts and Istio.
Autoscaling Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ ML Inference Autoscaling │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Prometheus │───→│ KEDA │ │
│ │ GPU Metrics │ │ ScaledObject │ │
│ └──────────────────┘ └────────┬─────────┘ │
│ │ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ HPA Controller │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ CPU/Memory │ │ GPU Util │ │ Custom │ │ │
│ │ │ Metrics │ │ Metrics │ │ Metrics │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Inference Deployment (1-N replicas) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
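The GPU metrics in this diagram come from NVIDIA's dcgm-exporter, which runs as a DaemonSet on every GPU node and is scraped by Prometheus. A minimal sketch is below; the namespace, image tag, node label, and scrape annotations are assumptions to adapt to your cluster (in practice the NVIDIA GPU Operator or the dcgm-exporter Helm chart installs this for you).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
      annotations:
        # Assumes Prometheus honors scrape annotations (or use a ServiceMonitor instead)
        prometheus.io/scrape: "true"
        prometheus.io/port: "9400"
    spec:
      # Hypothetical GPU-node label; adjust to however your cluster marks GPU nodes
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04  # example tag, use a current release
        ports:
        - name: metrics
          containerPort: 9400
        securityContext:
          capabilities:
            add: ["SYS_ADMIN"]  # required only for DCGM profiling metrics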
GPU-Based HPA with DCGM
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: triton-gpu-hpa
namespace: ml-serving
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: triton-inference
minReplicas: 2
maxReplicas: 20
metrics:
# GPU utilization from DCGM exporter
- type: Pods
pods:
metric:
name: DCGM_FI_DEV_GPU_UTIL
target:
type: AverageValue
averageValue: "70"
  # GPU memory bandwidth utilization (copy engine activity, not memory capacity)
- type: Pods
pods:
metric:
name: DCGM_FI_DEV_MEM_COPY_UTIL
target:
type: AverageValue
averageValue: "80"
  # Inference request rate (the Triton counter must be exposed as a per-second
  # rate by the metrics adapter; see the sketch after this manifest)
- type: Pods
pods:
metric:
name: nv_inference_request_success
target:
type: AverageValue
averageValue: "100"
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 120
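This HPA can only see DCGM and Triton metrics if something serves them through the Kubernetes custom metrics API. A common choice is prometheus-adapter; the ConfigMap below is a sketch of rules that map DCGM_FI_DEV_GPU_UTIL and a per-second rate of Triton's nv_inference_request_success counter to per-pod metrics, assuming both exporters label their series with pod and namespace.
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    # Expose raw GPU utilization (0-100) per pod
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "DCGM_FI_DEV_GPU_UTIL"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    # Expose the Triton success counter as a per-second rate
    - seriesQuery: 'nv_inference_request_success{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "nv_inference_request_success"
        as: "nv_inference_request_success"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'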
KEDA for Advanced Scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-scaler
namespace: ml-serving
spec:
scaleTargetRef:
name: triton-inference
minReplicaCount: 1
maxReplicaCount: 50
pollingInterval: 15
cooldownPeriod: 300
triggers:
# Prometheus GPU metrics
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: gpu_utilization
threshold: "70"
query: |
avg(DCGM_FI_DEV_GPU_UTIL{pod=~"triton-.*"})
# Request queue length
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: inference_queue_size
threshold: "100"
query: |
sum(nv_inference_pending_request_count{pod=~"triton-.*"})
# Kafka topic lag (for async inference)
- type: kafka
metadata:
bootstrapServers: kafka:9092
consumerGroup: inference-consumer
topic: inference-requests
lagThreshold: "1000"
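KEDA creates and manages an HPA under the hood, so the scale-up/scale-down tuning from the earlier HPA example still applies. One way to express it is the ScaledObject's advanced.horizontalPodAutoscalerConfig block; the sketch below shows the same ScaledObject with that block added and only one trigger kept for brevity (values are illustrative).
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: triton-inference
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
          - type: Percent
            value: 100
            periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 25
            periodSeconds: 120
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      threshold: "70"
      query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"triton-.*"})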
KServe Autoscaling
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llm-service
annotations:
# Use HPA instead of KPA
serving.kserve.io/autoscalerClass: hpa
# Scale on GPU utilization
serving.kserve.io/metric: gpu
serving.kserve.io/targetUtilizationPercentage: "70"
spec:
predictor:
minReplicas: 1
maxReplicas: 10
scaleTarget: 70
scaleMetric: gpu
model:
modelFormat:
name: pytorch
storageUri: "s3://models/llm"
resources:
requests:
nvidia.com/gpu: 1
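The manifest above switches KServe to the HPA autoscaler. If you keep the default Knative autoscaler (KPA) in serverless mode, the predictor can instead scale on in-flight request concurrency straight from the spec, with no GPU metrics pipeline involved. A sketch with illustrative numbers:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service-concurrency
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    # Knative KPA scales on concurrent requests per replica
    scaleMetric: concurrency
    scaleTarget: 4
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/llm"
      resources:
        requests:
          nvidia.com/gpu: 1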
Canary Deployments
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: model-v2-canary
spec:
predictor:
    # Route 10% of traffic to the newly rolled-out revision of this
    # InferenceService; the previous (stable) revision keeps the other 90%
    canaryTrafficPercent: 10
model:
modelFormat:
name: pytorch
storageUri: "s3://models/v2"
---
# Argo Rollouts for advanced canary
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: inference-rollout
spec:
replicas: 10
selector:
matchLabels:
app: inference
  template:
    metadata:
      labels:
        app: inference   # must match spec.selector.matchLabels above
    spec:
      containers:
      - name: inference
        image: inference:v2
        resources:
          limits:
            nvidia.com/gpu: 1
strategy:
canary:
steps:
- setWeight: 5
- pause: {duration: 10m}
- analysis:
templates:
- templateName: success-rate
- setWeight: 25
- pause: {duration: 10m}
- analysis:
templates:
- templateName: latency-check
- setWeight: 50
- pause: {duration: 15m}
- setWeight: 100
canaryService: inference-canary
stableService: inference-stable
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate
spec:
metrics:
- name: success-rate
interval: 1m
successCondition: result[0] >= 0.99
provider:
prometheus:
address: http://prometheus:9090
query: |
sum(rate(inference_success_total{app="inference"}[5m])) /
sum(rate(inference_requests_total{app="inference"}[5m]))
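The rollout's second analysis step references a latency-check template that also has to exist. A sketch is below; the histogram metric name (inference_request_duration_seconds_bucket) and the 500 ms p99 threshold are assumptions to adapt to your own instrumentation.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  metrics:
  - name: p99-latency
    interval: 1m
    # Fail the analysis if p99 latency exceeds 500ms (result is in seconds)
    successCondition: result[0] <= 0.5
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(inference_request_duration_seconds_bucket{app="inference"}[5m])) by (le))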
Traffic Splitting with Istio
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: inference-routing
spec:
hosts:
- inference.ml.svc.cluster.local
http:
- match:
- headers:
x-model-version:
exact: "v2"
route:
- destination:
host: inference-v2
port:
number: 8000
- route:
- destination:
host: inference-v1
port:
number: 8000
weight: 90
- destination:
host: inference-v2
port:
number: 8000
weight: 10
---
# A/B testing based on user segments
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: ab-testing
spec:
hosts:
- inference.ml.svc.cluster.local
http:
- match:
- headers:
x-user-segment:
exact: "premium"
route:
- destination:
host: inference-premium
- route:
- destination:
host: inference-standard
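If both model versions run behind a single Kubernetes Service and differ only by a pod label, the same split can be expressed with DestinationRule subsets instead of separate hosts. A sketch, assuming pods carry a version: v1 / version: v2 label:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inference-versions
spec:
  host: inference.ml.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-subset-split
spec:
  hosts:
  - inference.ml.svc.cluster.local
  http:
  - route:
    - destination:
        host: inference.ml.svc.cluster.local
        subset: v1
      weight: 90
    - destination:
        host: inference.ml.svc.cluster.local
        subset: v2
      weight: 10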
Scale-to-Zero with Knative
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: batch-model
annotations:
# Enable scale to zero
serving.kserve.io/enable-scale-to-zero: "true"
# Minimum scale
autoscaling.knative.dev/min-scale: "0"
# Scale down delay
autoscaling.knative.dev/scale-down-delay: "5m"
# Target concurrency per pod
autoscaling.knative.dev/target: "10"
spec:
  predictor:
    # KServe enables scale-to-zero when minReplicas is 0
    minReplicas: 0
model:
modelFormat:
name: sklearn
storageUri: "s3://models/batch"
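Scale-to-zero also has to be enabled cluster-wide in Knative's autoscaler ConfigMap (it is on by default, but worth verifying); the grace period below controls how long the last replica lingers after traffic stops. The values shown are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Global switch for scale-to-zero (default "true")
  enable-scale-to-zero: "true"
  # How long the final replica is kept around after traffic drops to zero
  scale-to-zero-grace-period: "30s"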
Next lesson: LLM serving with vLLM and TGI for large language model inference.