GPU Scheduling & Resource Management

NVIDIA GPU Operator & Device Plugin

The NVIDIA GPU Operator automates GPU management in Kubernetes, installing the driver, container toolkit, device plugin, and monitoring components. As of 2025, it is the de facto standard for production GPU clusters.

GPU Scheduling Challenges

Traditional Limitations

┌─────────────────────────────────────────────────────────────────┐
│              Traditional GPU Scheduling                          │
├─────────────────────────────────────────────────────────────────┤
│  Pods request: nvidia.com/gpu: 1                                │
│                                                                  │
│  Limitations:                                                    │
│  ├── GPUs treated as opaque, black-box devices                  │
│  ├── No memory awareness (80GB GPU for 8GB job)                 │
│  ├── No compute utilization visibility                          │
│  ├── No topology awareness (NVLink, PCIe)                       │
│  └── Exclusive access regardless of actual needs                │
└─────────────────────────────────────────────────────────────────┘
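For reference, the traditional request looks like the pod below: it asks for a whole GPU by count, with no way to express memory, compute share, or interconnect needs (the pod name and image tag are illustrative).

apiVersion: v1
kind: Pod
metadata:
  name: whole-gpu-job              # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: nvidia/cuda:12.1.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # one whole GPU, however little the job actually uses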

NVIDIA's Solution Stack

┌─────────────────────────────────────────────────────────────────┐
│                NVIDIA GPU Management Stack                       │
├─────────────────────────────────────────────────────────────────┤
│  GPU Operator (Top Level)                                        │
│  ├── Manages all components below                                │
│  ├── Automated driver installation                               │
│  └── Lifecycle management                                        │
├─────────────────────────────────────────────────────────────────┤
│  NVIDIA Container Toolkit                                        │
│  ├── nvidia-container-runtime                                    │
│  └── Enables GPU access in containers                            │
├─────────────────────────────────────────────────────────────────┤
│  Device Plugin                                                   │
│  ├── Advertises GPUs to Kubernetes                               │
│  └── Allocates GPUs to pods                                      │
├─────────────────────────────────────────────────────────────────┤
│  GPU Feature Discovery                                           │
│  ├── Labels nodes with GPU attributes                            │
│  └── Enables fine-grained scheduling                             │
├─────────────────────────────────────────────────────────────────┤
│  DCGM Exporter                                                   │
│  ├── Prometheus metrics                                          │
│  └── GPU utilization, memory, temperature                        │
└─────────────────────────────────────────────────────────────────┘

Installing GPU Operator

Prerequisites

# Verify GPU nodes are in the cluster
# (the pci-10de vendor label only appears once NFD is running)
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# Install Node Feature Discovery manually only if you manage NFD yourself;
# the GPU Operator Helm chart deploys NFD by default (nfd.enabled=true)
kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.15.1"

# Verify NFD is running
kubectl get pods -n node-feature-discovery

Helm Installation

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator with defaults
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --wait

# Install with custom configuration
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set driver.version="550.54.15" \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set gfd.enabled=true \
  --set mig.strategy=mixed

Custom GPU Operator Configuration

# gpu-operator-values.yaml
driver:
  enabled: true
  version: "550.54.15"
  repository: nvcr.io/nvidia
  image: driver
  imagePullPolicy: IfNotPresent

toolkit:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.14.3-ubuntu20.04

devicePlugin:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: v0.14.3
  config:
    name: time-slicing-config
    default: any
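    # (the time-slicing-config ConfigMap itself is sketched below, after the install command)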

dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.3.0-3.2.0-ubuntu22.04

gfd:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: v0.14.3

mig:
  strategy: mixed  # none, single, mixed

# Tolerations so the operator's daemonsets can run on tainted GPU nodes
daemonsets:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

# Install with custom values
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  -f gpu-operator-values.yaml
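
The values file above points devicePlugin.config at a ConfigMap named time-slicing-config, with any as the default key. A minimal sketch of that ConfigMap (the replica count of 4 is only an example) looks like this:

# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources

Create it in the gpu-operator namespace so the device plugin can load the config named by devicePlugin.config.name. GPU sharing via time-slicing is covered in more depth in the next lesson.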

GPU Feature Discovery Labels

Auto-Generated Node Labels

# View GPU feature labels on a node
kubectl get node gpu-node-1 -o yaml | grep nvidia

# Common labels generated:
# nvidia.com/gpu.present=true
# nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
# nvidia.com/gpu.memory=81920
# nvidia.com/gpu.count=8
# nvidia.com/gpu.family=ampere
# nvidia.com/cuda.driver.major=550
# nvidia.com/cuda.runtime.major=12
# nvidia.com/mig.capable=true

Using GPU Labels in Pod Specs

apiVersion: v1
kind: Pod
metadata:
  name: a100-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # Require A100 or H100 GPUs
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-A100-SXM4-80GB
            - NVIDIA-H100-SXM5-80GB
          # Require more than ~80 GB of GPU memory (gpu.memory is reported in MiB)
          - key: nvidia.com/gpu.memory
            operator: Gt
            values:
            - "80000"
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 4
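
When a single label is enough, a plain nodeSelector is simpler than full node affinity. A sketch (pod name and image tag are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: a100-simple
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # exact-match label from GPU Feature Discovery
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1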

Verifying GPU Access

# Check GPU Operator pods
kubectl get pods -n gpu-operator

# Expected output:
# NAME                                       READY   STATUS    RESTARTS   AGE
# gpu-feature-discovery-xxxx                 1/1     Running   0          5m
# gpu-operator-xxxx                          1/1     Running   0          5m
# nvidia-container-toolkit-daemonset-xxxx   1/1     Running   0          5m
# nvidia-cuda-validator-xxxx                 0/1     Completed 0          4m
# nvidia-dcgm-exporter-xxxx                  1/1     Running   0          5m
# nvidia-device-plugin-daemonset-xxxx        1/1     Running   0          5m
# nvidia-driver-daemonset-xxxx               1/1     Running   0          5m
# nvidia-operator-validator-xxxx             0/1     Completed 0          4m

# Run a test pod (newer kubectl releases removed the --requests/--limits
# flags from kubectl run, so the GPU limit is passed via --overrides)
kubectl run gpu-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.1.0-base-ubuntu22.04 \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.1.0-base-ubuntu22.04","args":["nvidia-smi"],"stdin":true,"tty":true,"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
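
The same check can be written as a declarative manifest, which avoids kubectl run flag differences across versions (apply it, then read the output with kubectl logs gpu-test):

# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.1.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1        # GPUs are requested via limits; requests, if set, must equal limits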

DCGM Metrics for Monitoring

Key GPU Metrics

Metric                       Description              Use Case
DCGM_FI_DEV_GPU_UTIL         GPU utilization (%)      Training efficiency
DCGM_FI_DEV_MEM_COPY_UTIL    Memory bandwidth (%)     Data-loading bottlenecks
DCGM_FI_DEV_FB_USED          GPU memory used (MiB)    OOM prevention
DCGM_FI_DEV_FB_FREE          GPU memory free (MiB)    Capacity planning
DCGM_FI_DEV_GPU_TEMP         Temperature (°C)         Thermal throttling
DCGM_FI_DEV_POWER_USAGE      Power draw (W)           Cost/energy monitoring
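
For Prometheus to collect these metrics, the DCGM exporter service needs to be scraped. If you run the Prometheus Operator, a ServiceMonitor along these lines works; the selector labels, port name, and release label are assumptions to verify against your own deployment (kubectl get svc -n gpu-operator --show-labels):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
  labels:
    release: kube-prometheus-stack     # assumption: must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter        # assumption: label on the DCGM exporter Service
  endpoints:
  - port: gpu-metrics                  # assumption: named port on the Service
    interval: 30s

Depending on the chart version, the operator may be able to create this object for you; check the chart's dcgmExporter values.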

Prometheus Query Examples

# GPU utilization by pod
avg(DCGM_FI_DEV_GPU_UTIL{pod=~"training-.*"}) by (pod)

# Memory usage percentage
(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100

# Alert on low GPU utilization (waste detection)
avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname) < 30

# Power consumption per node
sum(DCGM_FI_DEV_POWER_USAGE) by (Hostname)
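
The waste-detection query above can also drive a standing alert. A sketch of a PrometheusRule, assuming the Prometheus Operator is installed (the 30% threshold and 30-minute window are arbitrary examples):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUUnderutilized
      expr: avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname) < 30
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "GPU node {{ $labels.Hostname }} has averaged under 30% utilization for 30 minutes"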

Next, we'll explore GPU sharing strategies including MIG, time-slicing, and MPS for better utilization.
