GPU Scheduling & Resource Management
NVIDIA KAI Scheduler & Dynamic Resource Allocation
In early 2025, NVIDIA open-sourced the KAI (Kubernetes AI) Scheduler, bringing enterprise-grade GPU scheduling to the open-source community. Combined with Dynamic Resource Allocation (DRA), which went GA in Kubernetes 1.34, it represents the current state of the art in GPU scheduling on Kubernetes.
NVIDIA KAI Scheduler
KAI Architecture
┌─────────────────────────────────────────────────────────────────┐
│ NVIDIA KAI Scheduler │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ GPU-Aware Scheduler ││
│ │ ├── Topology awareness (NVLink, NVSwitch) ││
│ │ ├── Memory-aware placement ││
│ │ ├── Multi-GPU job optimization ││
│ │ └── Preemption & priority policies ││
│ └─────────────────────────────────────────────────────────────┘│
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Queue Management ││
│ │ ├── Fair share scheduling ││
│ │ ├── Gang scheduling ││
│ │ ├── Quota management ││
│ │ └── Borrowing/lending ││
│ └─────────────────────────────────────────────────────────────┘│
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Integration Layer ││
│ │ ├── Works with GPU Operator ││
│ │ ├── Compatible with Kueue ││
│ │ └── DRA-ready ││
│ └─────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────┘
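The queue layer in the diagram above (fair share, quotas, borrowing/lending) is configured through KAI's Queue objects. The sketch below follows the upstream quickstart, but the API group and field names are assumptions that may differ between releases; a quota of -1 is treated as unbounded, and overQuotaWeight controls a queue's share of idle capacity when borrowing.
# Hedged sketch of a KAI queue; verify the schema against your KAI release.
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: ml-research
spec:
  resources:
    gpu:
      quota: 8            # guaranteed GPUs for this queue
      limit: 16           # hard cap, including borrowed capacity
      overQuotaWeight: 2  # share of idle GPUs when borrowing
    cpu:
      quota: -1           # -1 = unbounded
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1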
KAI Scheduling Policies
# KAI scheduling policy
apiVersion: kai.nvidia.com/v1alpha1
kind: SchedulingPolicy
metadata:
  name: gpu-topology-aware
spec:
  # Topology-aware placement
  topologyPolicy:
    strategy: BestFit        # BestFit, SpreadEvenly, SingleNode
    preferNVLink: true
  # Memory-aware scheduling
  memoryPolicy:
    strategy: FirstFit
    memoryThreshold: 80%     # Don't schedule if GPU memory > 80%
  # Multi-GPU optimization
  multiGPUPolicy:
    preferSameSwitch: true   # GPUs on same NVSwitch
    maxGPUsPerNode: 8
  # Preemption
  preemption:
    enabled: true
    gracePeriod: 30s
Using KAI with Workloads
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training
  annotations:
    kai.nvidia.com/scheduling-policy: gpu-topology-aware
spec:
  template:
    spec:
      schedulerName: kai-scheduler
      restartPolicy: Never
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 8
        env:
        - name: NCCL_DEBUG
          value: INFO
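Gang scheduling (listed under Queue Management in the architecture above) matters most for multi-pod training, where a partially started job holds GPUs without making progress. Below is a minimal sketch of an indexed Job that relies on it; the queue label key is an assumption and varies between KAI releases.
# All four workers are scheduled together or not at all (gang scheduling).
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training-distributed
  labels:
    kai.scheduler/queue: ml-research   # assumed label key for queue targeting
spec:
  completionMode: Indexed
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        kai.scheduler/queue: ml-research
    spec:
      schedulerName: kai-scheduler
      restartPolicy: Never
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 8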
Dynamic Resource Allocation (DRA)
DRA in Kubernetes 1.34 (GA)
┌─────────────────────────────────────────────────────────────────┐
│ DRA vs Traditional GPU Scheduling │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional (Device Plugin): │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Pod: nvidia.com/gpu: 1 ││
│ │ - Integer-based (whole GPUs only) ││
│ │ - No memory/compute awareness ││
│ │ - Exclusive access ││
│ │ - Static allocation ││
│ └─────────────────────────────────────────────────────────────┘│
│ │
│ DRA (Kubernetes 1.34 GA): │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ ResourceClaim: gpu-claim ││
│ │ - Fine-grained device selection ││
│ │ - Memory/compute constraints ││
│ │ - Multi-pod sharing ││
│ │ - Just-in-time allocation ││
│ │ - Device attributes (NVLink, MIG) ││
│ └─────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────┘
DRA Components
# DeviceClass defines GPU types
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      expression: device.driver == "nvidia.com"
---
# ResourceClaimTemplate for repeated use
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
  namespace: ml-research
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: nvidia-gpu
          count: 1
          selectors:
          - cel:
              # Capacity/attribute names depend on the DRA driver; "memory" is illustrative
              expression: device.capacity["nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
Using DRA in Pods
apiVersion: v1
kind: Pod
metadata:
  name: dra-training
  namespace: ml-research
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1-cuda12.1
    resources:
      claims:
      - name: gpu-claim
  resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: gpu-claim-template
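A ResourceClaimTemplate stamps out a fresh ResourceClaim for every pod that references it, so each replica below gets its own GPU; the next section shows the opposite pattern, where several pods reference one pre-created claim. The Deployment name and image are illustrative.
# Each replica gets its own ResourceClaim generated from gpu-claim-template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dra-inference
  namespace: ml-research
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dra-inference
  template:
    metadata:
      labels:
        app: dra-inference
    spec:
      containers:
      - name: server
        image: pytorch/pytorch:2.1-cuda12.1
        resources:
          claims:
          - name: gpu-claim
      resourceClaims:
      - name: gpu-claim
        resourceClaimTemplateName: gpu-claim-template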
Advanced DRA: Multi-Pod GPU Sharing
# Shareable ResourceClaim
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu
  namespace: ml-research
spec:
  devices:
    requests:
    - name: inference-gpu
      exactly:
        deviceClassName: nvidia-gpu
        count: 1
      # A claim created directly (not from a template) is allocated once;
      # every pod that references it by name shares the allocated device.
---
# Multiple pods sharing the GPU
apiVersion: v1
kind: Pod
metadata:
  name: inference-1
  namespace: ml-research
spec:
  containers:
  - name: server
    image: inference:latest
    resources:
      claims:
      - name: shared-gpu
        request: inference-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-gpu   # Reference existing claim
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-2
  namespace: ml-research
spec:
  containers:
  - name: server
    image: inference:latest
    resources:
      claims:
      - name: shared-gpu
        request: inference-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-gpu   # Same claim
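Once both pods are admitted, the claim records every consumer in status.reservedFor, which is how sharing is tracked. The excerpt below is roughly what kubectl get resourceclaim shared-gpu -o yaml would show after allocation; pool and device names and UIDs are illustrative.
# Illustrative status excerpt only; the full object includes further details.
status:
  allocation:
    devices:
      results:
      - request: inference-gpu
        driver: nvidia.com        # matches the DeviceClass selector above
        pool: node-a100-01        # assumed pool name published by the driver
        device: gpu-0             # assumed device name
  reservedFor:
  - resource: pods
    name: inference-1
    uid: "..."
  - resource: pods
    name: inference-2
    uid: "..."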
Comparison: KAI vs DRA vs Kueue
| Feature | KAI Scheduler | DRA | Kueue |
|---|---|---|---|
| GPU Topology | Native | Via selectors | No |
| Queue Management | Built-in | No | Native |
| Fine-grained Allocation | Yes | Yes | No |
| Multi-pod Sharing | Via policies | Native | No |
| Gang Scheduling | Yes | No | Yes |
| Kubernetes Native | Extension | Native (1.34) | Native |
Recommended Stack for 2026
┌─────────────────────────────────────────────────────────────────┐
│ Production GPU Scheduling Stack │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: Queue Management (Kueue) │
│ └── Multi-tenant fairness, quotas, borrowing │
│ │
│ Layer 2: GPU Scheduling (KAI or DRA) │
│ └── Topology-aware, memory-aware placement │
│ │
│ Layer 3: Resource Management (GPU Operator) │
│ └── Drivers, device plugin, metrics │
│ │
│ Layer 4: Infrastructure (Kubernetes 1.34+) │
│ └── DRA GA, VolumeAttributesClass, OCI artifacts │
│ │
└─────────────────────────────────────────────────────────────────┘
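As a concrete illustration of layers 1 through 3 working together, the hedged sketch below labels a Job for a Kueue LocalQueue (admission and quota), hands pod placement to the KAI scheduler, and requests GPUs exposed by the GPU Operator's device plugin; the queue name is a placeholder.
# Kueue admits the suspended Job once quota allows (layer 1), KAI places its
# pods with topology awareness (layer 2), and the GPU Operator exposes
# nvidia.com/gpu on the nodes (layer 3).
apiVersion: batch/v1
kind: Job
metadata:
  name: stacked-training
  labels:
    kueue.x-k8s.io/queue-name: ml-research-queue
spec:
  suspend: true                    # Kueue unsuspends when quota is available
  template:
    spec:
      schedulerName: kai-scheduler
      restartPolicy: Never
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 4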
Next module: Kubeflow and ML Pipelines for end-to-end ML workflow orchestration.