GPU Scheduling & Resource Management
NVIDIA GPU Operator & Device Plugin
The NVIDIA GPU Operator automates GPU management in Kubernetes, installing and maintaining the driver, container toolkit, device plugin, and monitoring components as a single, versioned stack. As of 2025 it is the de facto standard for running production GPU clusters.
GPU Scheduling Challenges
Traditional Limitations
┌──────────────────────────────────────────────────┐
│            Traditional GPU Scheduling            │
├──────────────────────────────────────────────────┤
│ Pods request: nvidia.com/gpu: 1                  │
│                                                  │
│ Limitations:                                     │
│ ├── GPUs treated as opaque, black-box devices    │
│ ├── No memory awareness (80GB GPU for 8GB job)   │
│ ├── No compute utilization visibility            │
│ ├── No topology awareness (NVLink, PCIe)         │
│ └── Exclusive access regardless of actual needs  │
└──────────────────────────────────────────────────┘
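For reference, this is what that opaque request looks like in practice. The manifest below is a minimal sketch (the image and training command are placeholders): because nvidia.com/gpu is an extended resource, it is declared under limits, and the pod receives one entire physical GPU whether the job needs 8 GB or 80 GB of its memory.
apiVersion: v1
kind: Pod
metadata:
  name: whole-gpu-job        # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: nvcr.io/nvidia/pytorch:24.01-py3   # any CUDA-capable image works here
      command: ["python", "train.py"]           # hypothetical training entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1  # one whole device, regardless of how much of it is used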
NVIDIA's Solution Stack
┌──────────────────────────────────────────────────┐
│           NVIDIA GPU Management Stack            │
├──────────────────────────────────────────────────┤
│ GPU Operator (Top Level)                         │
│ ├── Manages all components below                 │
│ ├── Automated driver installation                │
│ └── Lifecycle management                         │
├──────────────────────────────────────────────────┤
│ NVIDIA Container Toolkit                         │
│ ├── nvidia-container-runtime                     │
│ └── Enables GPU access in containers             │
├──────────────────────────────────────────────────┤
│ Device Plugin                                    │
│ ├── Advertises GPUs to Kubernetes                │
│ └── Allocates GPUs to pods                       │
├──────────────────────────────────────────────────┤
│ GPU Feature Discovery                            │
│ ├── Labels nodes with GPU attributes             │
│ └── Enables fine-grained scheduling              │
├──────────────────────────────────────────────────┤
│ DCGM Exporter                                    │
│ ├── Prometheus metrics                           │
│ └── GPU utilization, memory, temperature         │
└──────────────────────────────────────────────────┘
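In practice, the Helm values you install with (next section) are rendered into a single cluster-scoped ClusterPolicy custom resource that the operator reconciles, and each layer of the stack runs as a DaemonSet. A quick sketch for inspecting a running install, assuming the default release and namespace names:
# The operator's desired state lives in a ClusterPolicy custom resource
kubectl get clusterpolicies.nvidia.com
kubectl describe clusterpolicy cluster-policy

# Each component (driver, toolkit, device plugin, GFD, DCGM) runs as a DaemonSet
kubectl get daemonsets -n gpu-operator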
Installing GPU Operator
Prerequisites
# Verify GPU nodes are in cluster
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
# Node Feature Discovery (NFD) applies the label above. The GPU Operator chart
# deploys NFD by default (nfd.enabled=true), so only install it separately if
# you want to manage NFD yourself, e.g. via its kustomize overlay:
kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=master"
# Verify NFD is running
kubectl get pods -n node-feature-discovery
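The pci-10de selector above keys off NVIDIA's PCI vendor ID, which NFD's worker pods publish as a node label. To double-check a single node (substitute your own node name):
# Confirm the NFD-applied PCI vendor label on a specific GPU node
kubectl get node <gpu-node-name> --show-labels | tr ',' '\n' | grep pci-10de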
Helm Installation
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator with defaults
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--wait
# Install with custom configuration
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set driver.version="550.54.15" \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set dcgmExporter.enabled=true \
--set gfd.enabled=true \
--set mig.strategy=mixed
Custom GPU Operator Configuration
# gpu-operator-values.yaml
driver:
  enabled: true
  version: "550.54.15"
  repository: nvcr.io/nvidia
  image: driver
  imagePullPolicy: IfNotPresent

toolkit:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.14.3-ubuntu20.04

devicePlugin:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: v0.14.3
  config:
    name: time-slicing-config
    default: any

dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.3.0-3.2.0-ubuntu22.04

gfd:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: v0.14.3

mig:
  strategy: mixed  # none, single, mixed

# Tolerations so GPU Operator daemonsets schedule onto tainted GPU nodes
daemonsets:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
# Install with custom values
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
-f gpu-operator-values.yaml
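Note that the devicePlugin.config block above only references a ConfigMap named time-slicing-config; the chart does not create it for you. A minimal sketch of what it could contain, following the device plugin's documented sharing format (the key name matches default: any, and the replica count of 4 is just an example):
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # advertise each physical GPU as 4 schedulable slices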
GPU Feature Discovery Labels
Auto-Generated Node Labels
# View GPU feature labels on a node
kubectl get node gpu-node-1 -o yaml | grep nvidia
# Common labels generated:
# nvidia.com/gpu.present=true
# nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
# nvidia.com/gpu.memory=81920
# nvidia.com/gpu.count=8
# nvidia.com/gpu.family=ampere
# nvidia.com/cuda.driver.major=550
# nvidia.com/cuda.runtime.major=12
# nvidia.com/mig.capable=true
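A convenient way to survey these labels across the whole fleet is kubectl's -L flag, which prints the selected labels as extra columns:
# Show key GPU attributes for every node in one table
kubectl get nodes -L nvidia.com/gpu.product -L nvidia.com/gpu.count -L nvidia.com/gpu.memory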
Using GPU Labels in Pod Specs
apiVersion: v1
kind: Pod
metadata:
  name: a100-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Require A100 or H100 GPUs
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
                  - NVIDIA-H100-SXM5-80GB
              # Label value is in MiB; require more than 80,000 MiB of GPU memory
              - key: nvidia.com/gpu.memory
                operator: Gt
                values:
                  - "80000"
  containers:
    - name: trainer
      image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
      resources:
        limits:
          nvidia.com/gpu: 4
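When an exact product match is all you need, a plain nodeSelector on the same labels is a lighter-weight sketch of the same idea (no In/Gt expressions, at the cost of flexibility):
apiVersion: v1
kind: Pod
metadata:
  name: a100-only
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  containers:
    - name: trainer
      image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
      resources:
        limits:
          nvidia.com/gpu: 1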
Verifying GPU Access
# Check GPU Operator pods
kubectl get pods -n gpu-operator
# Expected output:
# NAME READY STATUS RESTARTS AGE
# gpu-feature-discovery-xxxx 1/1 Running 0 5m
# gpu-operator-xxxx 1/1 Running 0 5m
# nvidia-container-toolkit-daemonset-xxxx 1/1 Running 0 5m
# nvidia-cuda-validator-xxxx 0/1 Completed 0 4m
# nvidia-dcgm-exporter-xxxx 1/1 Running 0 5m
# nvidia-device-plugin-daemonset-xxxx 1/1 Running 0 5m
# nvidia-driver-daemonset-xxxx 1/1 Running 0 5m
# nvidia-operator-validator-xxxx 0/1 Completed 0 4m
# Run a throwaway test pod that requests one GPU and prints nvidia-smi
# (GPUs are requested via resource limits in the pod spec override)
kubectl run gpu-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.1.1-base-ubuntu22.04 \
  --overrides='{"apiVersion": "v1", "spec": {"containers": [{"name": "gpu-test",
    "image": "nvidia/cuda:12.1.1-base-ubuntu22.04", "args": ["nvidia-smi"],
    "resources": {"limits": {"nvidia.com/gpu": "1"}}}]}}'
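You can also confirm that the device plugin has advertised GPUs to the scheduler by reading node capacity directly (the backslash escapes the dots inside the resource name):
# Allocatable GPU count per node, as reported by the device plugin
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'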
DCGM Metrics for Monitoring
Key GPU Metrics
| Metric | Description | Use Case |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU utilization % | Training efficiency |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth % | Data loading bottleneck |
| DCGM_FI_DEV_FB_USED | GPU memory used (MiB) | OOM prevention |
| DCGM_FI_DEV_FB_FREE | GPU memory free (MiB) | Capacity planning |
| DCGM_FI_DEV_GPU_TEMP | Temperature (°C) | Thermal throttling |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) | Cost/energy monitoring |
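Before building dashboards on these metrics, it is worth spot-checking that the exporter is actually serving them. A quick sketch, assuming the default service name and port created by the operator (nvidia-dcgm-exporter on 9400; adjust for your install):
# Port-forward the DCGM exporter service and pull a few raw metrics
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|FB_USED)'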
Prometheus Query Examples
# GPU utilization by pod
avg(DCGM_FI_DEV_GPU_UTIL{pod=~"training-.*"}) by (pod)
# Memory usage percentage
(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100
# Alert on low GPU utilization (waste detection)
avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname) < 30
# Power consumption per node
sum(DCGM_FI_DEV_POWER_USAGE) by (Hostname)
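To turn the waste-detection query into a standing alert with the Prometheus Operator, a PrometheusRule along these lines could work (the threshold, duration, and severity label are assumptions to tune for your cluster):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
  namespace: gpu-operator
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUUnderutilized
          # Node-level average GPU utilization below 30% for 30 minutes
          expr: avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname) < 30
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU node {{ $labels.Hostname }} has been underutilized for 30 minutes"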
Next, we'll explore GPU sharing strategies including MIG, time-slicing, and MPS for better utilization.