Kubernetes Foundations for ML
Kubernetes Architecture for ML Engineers
4 min read
Understanding Kubernetes architecture is essential for optimizing ML workloads. This lesson covers the core components through the lens of machine learning requirements.
Control Plane Components
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ Control Plane │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐│
│ │ kube-apiserver│ │ etcd │ │ kube-controller-manager││
│ │ │ │ (state) │ │ ││
│ └──────────────┘ └──────────────┘ └────────────────────────┘│
│ ┌──────────────┐ ┌──────────────────────────────────────────┐│
│ │kube-scheduler │ │ cloud-controller-manager (cloud only) ││
│ │ + DRA │ │ ││
│ └──────────────┘ └──────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────┤
│ Worker Nodes │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Node 1 (GPU) │ Node 2 (GPU) │ Node 3 (CPU)││
│ │ ┌────────┐ ┌────────┐ │ ┌────────┐ │ ┌────────┐ ││
│ │ │kubelet │ │kube- │ │ │Training│ │ │Inference│ ││
│ │ │ │ │proxy │ │ │ Pod │ │ │ Pod │ ││
│ │ └────────┘ └────────┘ │ └────────┘ │ └────────┘ ││
│ │ ┌────────────────────┐│ │ ││
│ │ │ NVIDIA Device ││ │ ││
│ │ │ Plugin ││ │ ││
│ │ └────────────────────┘│ │ ││
│ └────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
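You can see most of these components directly in a running cluster. On kubeadm-style clusters the control plane runs as static pods in the kube-system namespace; managed offerings (GKE, EKS, AKS) hide the control plane and expose only node-level components:

# List system components (control-plane pods appear on self-managed clusters)
kubectl get pods -n kube-system -o wide

# Quick health check against the API server
kubectl get --raw='/readyz?verbose'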
Scheduler: The ML Workload Brain
The kube-scheduler assigns pods to nodes. For ML workloads, this means matching GPU requests against node capacity and, when needed, steering pods onto specific accelerator types:
Standard Scheduling:
# Pod requesting GPU resources
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 4    # Request 4 GPUs; for extended resources, request must equal limit
        memory: "64Gi"
        cpu: "16"
      requests:
        nvidia.com/gpu: 4
        memory: "32Gi"
        cpu: "8"
Advanced Scheduling with Node Affinity:
# Schedule on specific GPU types
apiVersion: v1
kind: Pod
metadata:
  name: llm-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-A100-SXM4-80GB
            - NVIDIA-H100-SXM5-80GB
  containers:
  - name: trainer
    image: my-llm-trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 8
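Affinity pulls pods toward the right nodes; taints push everything else away. GPU nodes are commonly tainted so CPU-only workloads don't occupy them, and GPU pods then carry a matching toleration. The taint key below is a common convention, but the exact key depends on how your nodes are provisioned:

# Keep non-GPU workloads off the GPU node
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

The GPU pod spec then needs a toleration alongside its affinity rules:

tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule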
Worker Node Components
Node Architecture for ML
┌─────────────────────────────────────────────────────────────┐
│ GPU Worker Node │
├─────────────────────────────────────────────────────────────┤
│ kubelet │ Manages pods, reports node status │
│ kube-proxy │ Network rules for service traffic │
│ Container Runtime│ containerd (recommended for ML) │
├─────────────────────────────────────────────────────────────┤
│ Device Plugins │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ NVIDIA Device Plugin ││
│ │ - Advertises GPUs as schedulable resources ││
│ │ - Manages nvidia.com/gpu resource ││
│ │ - Handles device isolation ││
│ └─────────────────────────────────────────────────────────┘│
│ ┌─────────────────────────────────────────────────────────┐│
│ │ NVIDIA GPU Operator (2025 Standard) ││
│ │ - Installs drivers, toolkit, device plugin ││
│ │ - DCGM Exporter for metrics ││
│ │ - GPU Feature Discovery ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
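In practice you rarely install the device plugin by hand: the GPU Operator bundles the driver, container toolkit, device plugin, DCGM exporter, and GPU Feature Discovery. A typical Helm installation looks roughly like this (check NVIDIA's documentation for current chart options):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace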
Node Labels for ML Scheduling
# Common GPU node labels
kubectl label nodes gpu-node-1 \
  nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB \
  nvidia.com/gpu.memory=81920 \
  node-type=training \
  accelerator=nvidia-gpu

# View node resources
kubectl describe node gpu-node-1 | grep -A5 "Allocatable"
# Allocatable:
#   cpu:                64
#   memory:             512Gi
#   nvidia.com/gpu:     8
#   ephemeral-storage:  1Ti
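Once nodes carry labels like these, simple placement needs can use a plain nodeSelector instead of full affinity rules. This fragment assumes the node-type=training label applied above:

# Pod spec fragment: run only on nodes labeled for training
spec:
  nodeSelector:
    node-type: training
  containers:
  - name: trainer
    image: my-llm-trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 1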
Kubernetes Objects for ML
Core Workload Resources
| Resource | ML Use Case | When to Use |
|---|---|---|
| Pod | Smallest deployable unit (one or more containers) | Direct testing, simple inference |
| Deployment | Stateless inference servers | Model serving, API endpoints |
| StatefulSet | Distributed training | Parameter servers, sharded models |
| Job | Training runs | One-time training, experiments |
| CronJob | Scheduled retraining | Daily model updates |
| DaemonSet | Node-level services | GPU monitoring, log collection |
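As a sketch of the CronJob row above, a nightly retraining run could look like this (the image name and schedule are placeholders):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-retrain
spec:
  schedule: "0 2 * * *"        # every day at 02:00
  concurrencyPolicy: Forbid    # skip a run if the previous one is still going
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: retrain
            image: my-retrainer:latest   # placeholder
            resources:
              limits:
                nvidia.com/gpu: 1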
Training Job Example
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 3
  ttlSecondsAfterFinished: 86400  # Cleanup after 24h
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: pytorch
        image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
        command: ["python", "train.py"]
        args:
        - "--epochs=100"
        - "--batch-size=64"
        resources:
          limits:
            nvidia.com/gpu: 2
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 2
            memory: "16Gi"
        volumeMounts:
        - name: data
          mountPath: /data
        - name: checkpoints
          mountPath: /checkpoints
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: training-data
      - name: checkpoints
        persistentVolumeClaim:
          claimName: model-checkpoints
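Submitting and monitoring the Job follows the usual kubectl flow (assuming the manifest is saved as training-job.yaml):

kubectl apply -f training-job.yaml
kubectl get job pytorch-training                        # COMPLETIONS column tracks progress
kubectl logs -f job/pytorch-training                    # stream training output
kubectl get pods -l job-name=pytorch-training -o wide   # see which GPU node the pod landed on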
Inference Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: model-server
        image: my-model:v1.0
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "8Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "4Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: inference
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
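Since the Service is ClusterIP, it is only reachable from inside the cluster. For a quick smoke test, port-forward to it and hit the same /health endpoint the probes use; scaling replicas is a one-liner as well:

kubectl port-forward svc/inference-service 8080:80
# In another terminal:
curl http://localhost:8080/health

# Scale out under load (or let an HPA do it)
kubectl scale deployment model-inference --replicas=6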
Storage for ML Workloads
Storage Classes for ML
# High-performance SSD for training data
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# PVC for training data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi
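Because of WaitForFirstConsumer, the claim stays Pending until a pod that uses it is scheduled, which lets the CSI driver provision the disk in the same zone as the chosen GPU node. You can watch this happen:

kubectl get pvc training-data         # STATUS remains Pending with no consumer
kubectl describe pvc training-data    # Events note that binding waits for the first consumer
kubectl get storageclass fast-ssd     # confirm binding mode and expansion settings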
Shared Storage for Distributed Training
# NFS for multi-pod access (distributed training)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-checkpoints
spec:
  accessModes:
  - ReadWriteMany  # Multiple pods can write
  storageClassName: nfs-client
  resources:
    requests:
      storage: 100Gi
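The nfs-client storage class assumes an NFS provisioner is already running in the cluster; one common choice is the NFS subdir external provisioner, installed roughly like this (the server address and export path are placeholders for your NFS server):

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=10.0.0.5 \
  --set nfs.path=/exports/checkpoints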
Next, we'll explore namespaces, resource quotas, and multi-tenancy patterns for ML teams.