Kubernetes Foundations for ML
Kubernetes Architecture for ML Engineers
4 min read
Understanding Kubernetes architecture is essential for optimizing ML workloads. This lesson covers the core components through the lens of machine learning requirements.
Control Plane Components
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ Control Plane │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐│
│ │ kube-apiserver│ │ etcd │ │ kube-controller-manager││
│ │ │ │ (state) │ │ ││
│ └──────────────┘ └──────────────┘ └────────────────────────┘│
│ ┌──────────────┐ ┌──────────────────────────────────────────┐│
│ │kube-scheduler │ │ cloud-controller-manager (cloud only) ││
│ │ + DRA │ │ ││
│ └──────────────┘ └──────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────┤
│ Worker Nodes │
│ ┌────────────────────────────────────────────────────────────┐│
│ │ Node 1 (GPU) │ Node 2 (GPU) │ Node 3 (CPU)││
│ │ ┌────────┐ ┌────────┐ │ ┌────────┐ │ ┌────────┐ ││
│ │ │kubelet │ │kube- │ │ │Training│ │ │Inference│ ││
│ │ │ │ │proxy │ │ │ Pod │ │ │ Pod │ ││
│ │ └────────┘ └────────┘ │ └────────┘ │ └────────┘ ││
│ │ ┌────────────────────┐│ │ ││
│ │ │ NVIDIA Device ││ │ ││
│ │ │ Plugin ││ │ ││
│ │ └────────────────────┘│ │ ││
│ └────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
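You can see most of these components directly in a running cluster. On kubeadm-style clusters the control plane runs as static pods in the kube-system namespace; managed offerings (GKE, EKS, AKS) hide the control plane and expose only node-level components:

# List system components (control-plane pods appear on self-managed clusters)
kubectl get pods -n kube-system -o wide

# Quick health check against the API server
kubectl get --raw='/readyz?verbose'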
Scheduler: The ML Workload Brain
The kube-scheduler assigns pods to nodes. For ML workloads, this means matching GPU requests against node capacity and, when needed, steering pods onto specific accelerator types:
Standard Scheduling:
# Pod requesting GPU resources
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 4    # Request 4 GPUs; for extended resources, request must equal limit
        memory: "64Gi"
        cpu: "16"
      requests:
        nvidia.com/gpu: 4
        memory: "32Gi"
        cpu: "8"
Advanced Scheduling with Node Affinity:
# Schedule on specific GPU types
apiVersion: v1
kind: Pod
metadata:
  name: llm-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-A100-SXM4-80GB
            - NVIDIA-H100-SXM5-80GB
  containers:
  - name: trainer
    image: my-llm-trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 8
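Affinity pulls pods toward the right nodes; taints push everything else away. GPU nodes are commonly tainted so CPU-only workloads don't occupy them, and GPU pods then carry a matching toleration. The taint key below is a common convention, but the exact key depends on how your nodes are provisioned:

# Keep non-GPU workloads off the GPU node
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

The GPU pod spec then needs a toleration alongside its affinity rules:

tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule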
Worker Node Components
Node Architecture for ML
┌─────────────────────────────────────────────────────────────┐
│ GPU Worker Node │
├─────────────────────────────────────────────────────────────┤
│ kubelet │ Manages pods, reports node status │
│ kube-proxy │ Network rules for service traffic │
│ Container Runtime│ containerd (recommended for ML) │
├─────────────────────────────────────────────────────────────┤
│ Device Plugins │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ NVIDIA Device Plugin ││
│ │ - Advertises GPUs as schedulable resources ││
│ │ - Manages nvidia.com/gpu resource ││
│ │ - Handles device isolation ││
│ └─────────────────────────────────────────────────────────┘│
│ ┌─────────────────────────────────────────────────────────┐│
│ │ NVIDIA GPU Operator (2025 Standard) ││
│ │ - Installs drivers, toolkit, device plugin ││
│ │ - DCGM Exporter for metrics ││
│ │ - GPU Feature Discovery ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
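In practice you rarely install the device plugin by hand: the GPU Operator bundles the driver, container toolkit, device plugin, DCGM exporter, and GPU Feature Discovery. A typical Helm installation looks roughly like this (check NVIDIA's documentation for current chart options):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace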
Node Labels for ML Scheduling
# Common GPU node labels
kubectl label nodes gpu-node-1 \
  nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB \
  nvidia.com/gpu.memory=81920 \
  node-type=training \
  accelerator=nvidia-gpu

# View node resources
kubectl describe node gpu-node-1 | grep -A5 "Allocatable"
# Allocatable:
#   cpu:                64
#   memory:             512Gi
#   nvidia.com/gpu:     8
#   ephemeral-storage:  1Ti
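Once nodes carry labels like these, simple placement needs can use a plain nodeSelector instead of full affinity rules. This fragment assumes the node-type=training label applied above:

# Pod spec fragment: run only on nodes labeled for training
spec:
  nodeSelector:
    node-type: training
  containers:
  - name: trainer
    image: my-llm-trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 1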
Kubernetes Objects for ML
Core Workload Resources
| Resource | ML Use Case | When to Use |
|---|---|---|
| Pod | Smallest deployable unit (one or more containers) | Direct testing, simple inference |
| Deployment | Stateless inference servers | Model serving, API endpoints |
| StatefulSet | Distributed training | Parameter servers, sharded models |
| Job | Training runs | One-time training, experiments |
| CronJob | Scheduled retraining | Daily model updates |
| DaemonSet | Node-level services | GPU monitoring, log collection |
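As a sketch of the CronJob row above, a nightly retraining run could look like this (the image name and schedule are placeholders):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-retrain
spec:
  schedule: "0 2 * * *"        # every day at 02:00
  concurrencyPolicy: Forbid    # skip a run if the previous one is still going
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: retrain
            image: my-retrainer:latest   # placeholder
            resources:
              limits:
                nvidia.com/gpu: 1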
Training Job Example
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 3
  ttlSecondsAfterFinished: 86400  # Cleanup after 24h
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: pytorch
        image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
        command: ["python", "train.py"]
        args:
        - "--epochs=100"
        - "--batch-size=64"
        resources:
          limits:
            nvidia.com/gpu: 2
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 2
            memory: "16Gi"
        volumeMounts:
        - name: data
          mountPath: /data
        - name: checkpoints
          mountPath: /checkpoints
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: training-data
      - name: checkpoints
        persistentVolumeClaim:
          claimName: model-checkpoints
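Submitting and monitoring the Job follows the usual kubectl flow (assuming the manifest is saved as training-job.yaml):

kubectl apply -f training-job.yaml
kubectl get job pytorch-training                        # COMPLETIONS column tracks progress
kubectl logs -f job/pytorch-training                    # stream training output
kubectl get pods -l job-name=pytorch-training -o wide   # see which GPU node the pod landed on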
Inference Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: model-server
        image: my-model:v1.0
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "8Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "4Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: inference
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
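Since the Service is ClusterIP, it is only reachable from inside the cluster. For a quick smoke test, port-forward to it and hit the same /health endpoint the probes use; scaling replicas is a one-liner as well:

kubectl port-forward svc/inference-service 8080:80
# In another terminal:
curl http://localhost:8080/health

# Scale out under load (or let an HPA do it)
kubectl scale deployment model-inference --replicas=6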
Storage for ML Workloads
Storage Classes for ML
# High-performance SSD for training data
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# PVC for training data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi
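Because of WaitForFirstConsumer, the claim stays Pending until a pod that uses it is scheduled, which lets the CSI driver provision the disk in the same zone as the chosen GPU node. You can watch this happen:

kubectl get pvc training-data         # STATUS remains Pending with no consumer
kubectl describe pvc training-data    # Events note that binding waits for the first consumer
kubectl get storageclass fast-ssd     # confirm binding mode and expansion settings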
Shared Storage for Distributed Training
# NFS for multi-pod access (distributed training)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-checkpoints
spec:
  accessModes:
  - ReadWriteMany  # Multiple pods can write
  storageClassName: nfs-client
  resources:
    requests:
      storage: 100Gi
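The nfs-client storage class assumes an NFS provisioner is already running in the cluster; one common choice is the NFS subdir external provisioner, installed roughly like this (the server address and export path are placeholders for your NFS server):

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=10.0.0.5 \
  --set nfs.path=/exports/checkpoints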
Next, we'll explore namespaces, resource quotas, and multi-tenancy patterns for ML teams.