Infrastructure & Deployment
Kubernetes for ML Workloads
Kubernetes expertise is what separates MLOps engineers from general DevOps engineers. Expect deep questions on ML-specific K8s patterns.
Core Interview Question: ML Deployment
Question: "Design a Kubernetes deployment for a model serving 10,000 requests per second."
Answer Framework:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 10  # Start with horizontal scaling
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
        - name: model-server
          image: model-serving:v1.2.3
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
              nvidia.com/gpu: "1"  # GPU request
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30  # Model loading time
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: model-serving
                topologyKey: "kubernetes.io/hostname"
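In an interview it also helps to mention how traffic reaches these pods: a Deployment alone exposes nothing. A minimal sketch of a Service fronting the pods above; the Service name and port mapping are illustrative assumptions, not part of the question.
# Hypothetical Service fronting the model-serving pods above
apiVersion: v1
kind: Service
metadata:
  name: model-serving
spec:
  selector:
    app: model-serving       # Matches the Deployment's pod labels
  ports:
    - port: 80               # Port exposed inside the cluster
      targetPort: 8080       # containerPort of the model server
  type: ClusterIP            # Front with an Ingress or LoadBalancer for external traffic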
Key Concepts Interviewers Test
| Concept | Why It Matters for ML | Common Question |
|---|---|---|
| Requests vs Limits | GPU memory is not overcommittable | "What happens if you don't set limits?" |
| Tolerations | GPU nodes often tainted | "How do you schedule on GPU nodes?" |
| Probes | Models take time to load | "Why is initialDelaySeconds important?" |
| Anti-affinity | Spread for availability (see the sketch after this table) | "How do you handle node failures?" |
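For the availability question, anti-affinity spreads replicas across nodes; a common companion answer is a PodDisruptionBudget, which limits how many replicas voluntary disruptions (node drains, cluster upgrades) can remove at once. A minimal sketch, assuming the model-serving labels from the Deployment above and an illustrative availability floor:
# Sketch: PodDisruptionBudget complementing anti-affinity
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: model-serving-pdb
spec:
  minAvailable: 8            # Assumed floor for a 10-replica Deployment
  selector:
    matchLabels:
      app: model-serving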
Horizontal Pod Autoscaler for ML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: inference_latency_p99
        target:
          type: AverageValue
          averageValue: "100m"  # 0.1 in resource units, i.e. a 100ms target if the metric is reported in seconds
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # Conservative scale-down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
Interview Follow-up: "Why is scale-down stabilization longer than scale-up?"
Answer: "Model warm-up is expensive. Scaling down too fast causes latency spikes when traffic returns. We use 5-minute stabilization to avoid thrashing."
GPU Scheduling Deep Dive
# Node labels for GPU types
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=t4
# Pod with specific GPU selection
spec:
  nodeSelector:
    gpu-type: a100
  containers:
    - name: training
      resources:
        limits:
          nvidia.com/gpu: 4  # Request 4 A100s
Interview Question: "How would you handle multi-GPU training jobs?"
Answer: "Use Kubernetes Jobs with parallelism set to the number of workers, each worker requesting one or more GPUs. For distributed training, use Kubeflow's Training Operator, which manages PyTorchJob or TFJob resources and handles the pod-to-pod networking and rank assignment."
Pro Tip: Know the difference between nvidia.com/gpu and other vendors' GPU resource names (e.g., amd.com/gpu for AMD, or the resources exposed by Intel's GPU device plugin such as gpu.intel.com/i915) for cloud-agnostic discussions.
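For example, scheduling onto AMD GPUs only changes the resource name, assuming the AMD device plugin is installed on the node:
# Sketch: requesting an AMD GPU instead of an NVIDIA one
resources:
  limits:
    amd.com/gpu: 1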
Next, we'll explore model serving infrastructure.