Infrastructure & Deployment

Kubernetes for ML Workloads


Kubernetes expertise is what separates MLOps Engineers from general DevOps engineers. Expect deep questions on ML-specific Kubernetes patterns.

Core Interview Question: ML Deployment

Question: "Design a Kubernetes deployment for a model serving 10,000 requests per second."

Answer Framework:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 10  # Start with horizontal scaling
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
      - name: model-server
        image: model-serving:v1.2.3
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: "1"  # GPU request
          limits:
            memory: "8Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30  # Model loading time
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: model-serving
              topologyKey: "kubernetes.io/hostname"
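
A Deployment alone doesn't receive traffic; mention the Service (with an Ingress or cloud load balancer in front of it) that routes the 10,000 RPS to the pods. A minimal sketch, assuming the containerPort 8080 above:

apiVersion: v1
kind: Service
metadata:
  name: model-serving
spec:
  selector:
    app: model-serving      # Matches the Deployment's pod labels
  ports:
  - port: 80                # Port exposed to clients inside the cluster
    targetPort: 8080        # containerPort from the Deployment above
  type: ClusterIP           # Front with an Ingress or LoadBalancer for external traffic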

Key Concepts Interviewers Test

Concept            | Why It Matters for ML                              | Common Question
Requests vs Limits | GPU memory is not overcommittable                  | "What happens if you don't set limits?"
Tolerations        | GPU nodes are often tainted (taint example below)  | "How do you schedule on GPU nodes?"
Probes             | Models take time to load                           | "Why is initialDelaySeconds important?"
Anti-affinity      | Spread pods across nodes for availability          | "How do you handle node failures?"
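
The toleration in the Deployment above only matters if the GPU nodes actually carry a matching taint. A typical setup (node name is illustrative; the taint key mirrors the toleration):

# Taint GPU nodes so only pods that explicitly tolerate the taint land on them
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule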

Horizontal Pod Autoscaler for ML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: inference_latency_p99
      target:
        type: AverageValue
        averageValue: "100m"  # 100ms target (0.1 in quantity notation, assuming the metric is reported in seconds)
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # Conservative scale-down
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
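
Note that inference_latency_p99 is not a built-in metric: the Pods metric type assumes a custom metrics adapter (commonly the Prometheus Adapter) exposes it through the custom metrics API. A rough sketch of a matching adapter rule, assuming the latency is exported per pod in seconds:

rules:
- seriesQuery: 'inference_latency_p99{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "inference_latency_p99"
    as: "inference_latency_p99"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

You can confirm the HPA can see the metric with: kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1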

Interview Follow-up: "Why is scale-down stabilization longer than scale-up?"

Answer: "Model warm-up is expensive. Scaling down too fast causes latency spikes when traffic returns. We use 5-minute stabilization to avoid thrashing."

GPU Scheduling Deep Dive

# Node labels for GPU types
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=t4

# Pod with specific GPU selection (image name is a placeholder)
apiVersion: v1
kind: Pod
metadata:
  name: training
spec:
  nodeSelector:
    gpu-type: a100           # Only schedule onto nodes labeled above
  containers:
  - name: training
    image: training:v1.2.3   # Placeholder training image
    resources:
      limits:
        nvidia.com/gpu: 4    # 4 A100s; extended resources are set under limits
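
nodeSelector is a hard constraint. If the interviewer pushes on flexibility, node affinity can express a preference for a GPU type instead of a requirement; a sketch against the same gpu-type labels, dropped into the pod spec above:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: gpu-type
          operator: In
          values: ["a100"]   # Prefer A100 nodes but allow scheduling elsewhere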

Interview Question: "How would you handle multi-GPU training jobs?"

Answer: "Use Kubernetes Jobs with multiple replicas, each requesting 1+ GPUs. For distributed training, use Kubeflow's Training Operator which manages PyTorchJob or TFJob resources with proper pod-to-pod networking."

Pro Tip: Know the difference between nvidia.com/gpu and other vendors' GPU resource names (e.g., amd.com/gpu for AMD, gpu.intel.com/i915 for Intel) for cloud-agnostic discussions.

Next, we'll explore model serving infrastructure.
