Model Serving & Inference

Autoscaling & Traffic Management

Production ML inference requires autoscaling driven by GPU metrics and careful traffic management for safe rollouts. This lesson covers GPU-aware HPA, KEDA, KServe autoscaling, canary deployments with Argo Rollouts, Istio traffic splitting, and scale-to-zero with Knative.

Autoscaling Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    ML Inference Autoscaling                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────────┐    ┌──────────────────┐                       │
│  │   Prometheus     │───→│       KEDA       │                       │
│  │   GPU Metrics    │    │   ScaledObject   │                       │
│  └──────────────────┘    └────────┬─────────┘                       │
│                                   │                                  │
│                                   ↓                                  │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   HPA Controller                             │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │    │
│  │  │ CPU/Memory  │  │ GPU Util    │  │ Custom      │          │    │
│  │  │ Metrics     │  │ Metrics     │  │ Metrics     │          │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘          │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                   │                                  │
│                                   ↓                                  │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              Inference Deployment (1-N replicas)             │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

GPU-Based HPA with DCGM

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-gpu-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # GPU utilization from DCGM exporter
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"
  # GPU memory bandwidth (copy engine) utilization
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_MEM_COPY_UTIL
      target:
        type: AverageValue
        averageValue: "80"
  # Inference requests per second
  - type: Pods
    pods:
      metric:
        name: nv_inference_request_success
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 120
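
The Pods-type metrics above only reach the HPA if something serves them through the custom metrics API, typically prometheus-adapter. A minimal rule sketch for the GPU utilization metric, assuming the DCGM exporter series carry namespace and pod labels (label names depend on your scrape and relabel configuration):

# Sketch: expose DCGM GPU utilization to the HPA via prometheus-adapter
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "DCGM_FI_DEV_GPU_UTIL"
      # Average across GPUs attached to each pod
      metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)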

KEDA for Advanced Scaling

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: triton-inference
  minReplicaCount: 1
  maxReplicaCount: 50
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
  # Prometheus GPU metrics
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: gpu_utilization
      threshold: "70"
      query: |
        avg(DCGM_FI_DEV_GPU_UTIL{pod=~"triton-.*"})
  # Request queue length
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: inference_queue_size
      threshold: "100"
      query: |
        sum(nv_inference_pending_request_count{pod=~"triton-.*"})
  # Kafka topic lag (for async inference)
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: inference-consumer
      topic: inference-requests
      lagThreshold: "1000"
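
KEDA manages an HPA behind the scenes, so the scale-up/scale-down behavior tuning from the previous section can be passed through the ScaledObject. A sketch using KEDA's advanced.horizontalPodAutoscalerConfig section (verify the field names against your KEDA version):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler-tuned
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: triton-inference
  minReplicaCount: 1
  maxReplicaCount: 50
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Mirror the conservative scale-down from the HPA example
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 25
            periodSeconds: 120
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: gpu_utilization
      threshold: "70"
      query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"triton-.*"})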

KServe Autoscaling

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
  annotations:
    # Use HPA instead of KPA
    serving.kserve.io/autoscalerClass: hpa
    # Scale on GPU utilization
    serving.kserve.io/metric: gpu
    serving.kserve.io/targetUtilizationPercentage: "70"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 70
    scaleMetric: gpu
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/llm"
      resources:
        requests:
          nvidia.com/gpu: 1
        # Extended resources such as GPUs must also set a matching limit
        limits:
          nvidia.com/gpu: 1
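
For request-driven workloads, KServe's default Knative autoscaler (KPA) can scale on in-flight requests rather than resource utilization. A sketch assuming serverless deployment mode; the service name and the target of 4 concurrent requests per replica are illustrative:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service-concurrency
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    # Target 4 in-flight requests per replica
    scaleMetric: concurrency
    scaleTarget: 4
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/llm"
      resources:
        limits:
          nvidia.com/gpu: 1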

Canary Deployments

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-v2-canary
spec:
  predictor:
    # Send 10% of traffic to the new revision; the previous (stable) revision keeps 90%
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/v2"
---
# Argo Rollouts for advanced canary
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: inference-rollout
spec:
  replicas: 10
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: inference
        image: inference:v2
        resources:
          limits:
            nvidia.com/gpu: 1
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 25
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: latency-check
      - setWeight: 50
      - pause: {duration: 15m}
      - setWeight: 100
      canaryService: inference-canary
      stableService: inference-stable
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 1m
    successCondition: result[0] >= 0.99
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(inference_success_total{app="inference"}[5m])) /
          sum(rate(inference_requests_total{app="inference"}[5m]))
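
The rollout steps also reference a latency-check template that is not defined above. A sketch, assuming a Prometheus histogram named inference_request_duration_seconds (the metric name and 500 ms threshold are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  metrics:
  - name: p99-latency
    interval: 1m
    # Fail the canary if p99 latency exceeds 500 ms
    successCondition: result[0] <= 0.5
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(inference_request_duration_seconds_bucket{app="inference"}[5m])) by (le))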

Traffic Splitting with Istio

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-routing
spec:
  hosts:
  - inference.ml.svc.cluster.local
  http:
  - match:
    - headers:
        x-model-version:
          exact: "v2"
    route:
    - destination:
        host: inference-v2
        port:
          number: 8000
  - route:
    - destination:
        host: inference-v1
        port:
          number: 8000
      weight: 90
    - destination:
        host: inference-v2
        port:
          number: 8000
      weight: 10
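
The split above routes to two separate Services (inference-v1, inference-v2). If both versions run behind a single Service and differ only by pod labels, the same weights can target subsets defined in a DestinationRule; a sketch assuming a version label on the pods:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inference-versions
spec:
  host: inference.ml.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

The VirtualService destinations would then use host inference.ml.svc.cluster.local with subset: v1 or subset: v2 instead of separate hosts.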
---
# A/B testing based on user segments
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ab-testing
spec:
  hosts:
  - inference.ml.svc.cluster.local
  http:
  - match:
    - headers:
        x-user-segment:
          exact: "premium"
    route:
    - destination:
        host: inference-premium
  - route:
    - destination:
        host: inference-standard

Scale-to-Zero with Knative

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: batch-model
  annotations:
    # Enable scale to zero
    serving.kserve.io/enable-scale-to-zero: "true"
    # Minimum scale
    autoscaling.knative.dev/min-scale: "0"
    # Scale down delay
    autoscaling.knative.dev/scale-down-delay: "5m"
    # Target concurrency per pod
    autoscaling.knative.dev/target: "10"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/batch"

Next lesson: LLM serving with vLLM and TGI for large language model inference.
