Infrastructure & Deployment

Kubernetes for ML Workloads


Kubernetes expertise is what separates MLOps Engineers from general DevOps engineers. Expect deep questions on ML-specific Kubernetes patterns.

Core Interview Question: ML Deployment

Question: "Design a Kubernetes deployment for a model serving 10,000 requests per second."

Answer Framework:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 10  # Start with horizontal scaling
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
      - name: model-server
        image: model-serving:v1.2.3
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: "1"  # GPU request
          limits:
            memory: "8Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30  # Model loading time
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: model-serving
              topologyKey: "kubernetes.io/hostname"
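
A Deployment alone doesn't receive traffic; mention the Service (with an Ingress or cloud load balancer in front of it) that routes the 10,000 RPS to the pods. A minimal sketch, assuming the containerPort 8080 above:

apiVersion: v1
kind: Service
metadata:
  name: model-serving
spec:
  selector:
    app: model-serving      # Matches the Deployment's pod labels
  ports:
  - port: 80                # Port exposed to clients inside the cluster
    targetPort: 8080        # containerPort from the Deployment above
  type: ClusterIP           # Front with an Ingress or LoadBalancer for external traffic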

Key Concepts Interviewers Test

Concept            | Why It Matters for ML                              | Common Question
Requests vs Limits | GPU memory is not overcommittable                  | "What happens if you don't set limits?"
Tolerations        | GPU nodes are often tainted (taint example below)  | "How do you schedule on GPU nodes?"
Probes             | Models take time to load                           | "Why is initialDelaySeconds important?"
Anti-affinity      | Spread pods across nodes for availability          | "How do you handle node failures?"
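
The toleration in the Deployment above only matters if the GPU nodes actually carry a matching taint. A typical setup (node name is illustrative; the taint key mirrors the toleration):

# Taint GPU nodes so only pods that explicitly tolerate the taint land on them
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule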

Horizontal Pod Autoscaler for ML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: inference_latency_p99
      target:
        type: AverageValue
        averageValue: "100m"  # 100ms target (0.1 in quantity notation, assuming the metric is reported in seconds)
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # Conservative scale-down
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
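
Note that inference_latency_p99 is not a built-in metric: the Pods metric type assumes a custom metrics adapter (commonly the Prometheus Adapter) exposes it through the custom metrics API. A rough sketch of a matching adapter rule, assuming the latency is exported per pod in seconds:

rules:
- seriesQuery: 'inference_latency_p99{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "inference_latency_p99"
    as: "inference_latency_p99"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

You can confirm the HPA can see the metric with: kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1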

Interview Follow-up: "Why is scale-down stabilization longer than scale-up?"

Answer: "Model warm-up is expensive. Scaling down too fast causes latency spikes when traffic returns. We use 5-minute stabilization to avoid thrashing."

GPU Scheduling Deep Dive

# Node labels for GPU types
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=t4

# Pod with specific GPU selection (image name is a placeholder)
apiVersion: v1
kind: Pod
metadata:
  name: training
spec:
  nodeSelector:
    gpu-type: a100           # Only schedule onto nodes labeled above
  containers:
  - name: training
    image: training:v1.2.3   # Placeholder training image
    resources:
      limits:
        nvidia.com/gpu: 4    # 4 A100s; extended resources are set under limits
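
nodeSelector is a hard constraint. If the interviewer pushes on flexibility, node affinity can express a preference for a GPU type instead of a requirement; a sketch against the same gpu-type labels, dropped into the pod spec above:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: gpu-type
          operator: In
          values: ["a100"]   # Prefer A100 nodes but allow scheduling elsewhere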

Interview Question: "How would you handle multi-GPU training jobs?"

Answer: "Use Kubernetes Jobs with multiple replicas, each requesting 1+ GPUs. For distributed training, use Kubeflow's Training Operator which manages PyTorchJob or TFJob resources with proper pod-to-pod networking."

Pro Tip: Know the difference between nvidia.com/gpu and other vendors' GPU resource names (e.g., amd.com/gpu for AMD, gpu.intel.com/i915 for Intel) for cloud-agnostic discussions.

Next, we'll explore model serving infrastructure.
