GPU Scheduling & Resource Management

NVIDIA KAI Scheduler & Dynamic Resource Allocation

In early 2025, NVIDIA open-sourced the KAI (Kubernetes AI) Scheduler, bringing its enterprise-grade GPU scheduling capabilities to the community. Combined with Dynamic Resource Allocation (DRA), which reached GA in Kubernetes 1.34, it represents the current state of the art in GPU scheduling on Kubernetes.

NVIDIA KAI Scheduler

KAI Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    NVIDIA KAI Scheduler                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    GPU-Aware Scheduler                      ││
│  │  ├── Topology awareness (NVLink, NVSwitch)                 ││
│  │  ├── Memory-aware placement                                 ││
│  │  ├── Multi-GPU job optimization                            ││
│  │  └── Preemption & priority policies                        ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    Queue Management                         ││
│  │  ├── Fair share scheduling                                  ││
│  │  ├── Gang scheduling                                        ││
│  │  ├── Quota management                                       ││
│  │  └── Borrowing/lending                                      ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    Integration Layer                        ││
│  │  ├── Works with GPU Operator                               ││
│  │  ├── Compatible with Kueue                                  ││
│  │  └── DRA-ready                                              ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
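
The queue-management layer is driven by queue objects that carry per-resource quota, borrowing weight, and limits. The sketch below shows a minimal single queue granting a team a GPU quota; the apiVersion, field names, and values follow the upstream KAI Scheduler quickstart at the time of writing and should be treated as assumptions to verify against your installed release.

# Hypothetical KAI queue with a GPU quota; verify field names against your release
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: ml-research
spec:
  resources:
    gpu:
      quota: 8             # GPUs guaranteed to this queue
      overQuotaWeight: 2   # share of idle capacity when borrowing
      limit: 16            # hard cap, including borrowed GPUs
    cpu:
      quota: -1            # -1 means unlimited
      overQuotaWeight: 1
      limit: -1
    memory:
      quota: -1
      overQuotaWeight: 1
      limit: -1

Workloads are attached to a queue through a pod label (the quickstart uses kai.scheduler/queue: <queue-name>) in addition to setting schedulerName: kai-scheduler, as in the Job example further down.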

KAI Scheduling Policies

# KAI scheduling policy
apiVersion: kai.nvidia.com/v1alpha1
kind: SchedulingPolicy
metadata:
  name: gpu-topology-aware
spec:
  # Topology-aware placement
  topologyPolicy:
    strategy: BestFit  # BestFit, SpreadEvenly, SingleNode
    preferNVLink: true

  # Memory-aware scheduling
  memoryPolicy:
    strategy: FirstFit
    memoryThreshold: 80%  # Don't schedule if GPU memory > 80%

  # Multi-GPU optimization
  multiGPUPolicy:
    preferSameSwitch: true  # GPUs on same NVSwitch
    maxGPUsPerNode: 8

  # Preemption
  preemption:
    enabled: true
    gracePeriod: 30s

Using KAI with Workloads

apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training
  annotations:
    kai.nvidia.com/scheduling-policy: gpu-topology-aware
spec:
  template:
    spec:
      restartPolicy: Never  # Job pods must use Never or OnFailure
      schedulerName: kai-scheduler
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 8
        env:
        - name: NCCL_DEBUG
          value: INFO

Dynamic Resource Allocation (DRA)

DRA in Kubernetes 1.34 (GA)

┌─────────────────────────────────────────────────────────────────┐
│              DRA vs Traditional GPU Scheduling                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Traditional (Device Plugin):                                    │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │  Pod: nvidia.com/gpu: 1                                     ││
│  │  - Integer-based (whole GPUs only)                          ││
│  │  - No memory/compute awareness                              ││
│  │  - Exclusive access                                         ││
│  │  - Static allocation                                        ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
│  DRA (Kubernetes 1.34 GA):                                      │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │  ResourceClaim: gpu-claim                                   ││
│  │  - Fine-grained device selection                            ││
│  │  - Memory/compute constraints                               ││
│  │  - Multi-pod sharing                                        ││
│  │  - Just-in-time allocation                                  ││
│  │  - Device attributes (NVLink, MIG)                          ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

DRA Components

# DeviceClass defines GPU types (resource.k8s.io/v1 is the GA API in 1.34)
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      # Match devices published by the GPU DRA driver
      # (use the driver name your installed driver registers under)
      expression: device.driver == "nvidia.com"
---
# ResourceClaimTemplate for repeated use
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
  namespace: ml-research
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: nvidia-gpu
          count: 1
          selectors:
          - cel:
              # Only GPUs exposing at least 40Gi of memory
              expression: device.capacity["nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
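
The attributes and capacities available to these CEL selectors are whatever the DRA driver publishes in its ResourceSlice objects; inspecting them (for example with kubectl get resourceslices -o yaml) shows the exact keys you can match against.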

Using DRA in Pods

apiVersion: v1
kind: Pod
metadata:
  name: dra-training
  namespace: ml-research
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1-cuda12.1
    resources:
      claims:
      - name: gpu-claim
  resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: gpu-claim-template
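
When a pod references a ResourceClaimTemplate, Kubernetes generates a dedicated ResourceClaim for that pod and deletes it when the pod goes away. To share a device across pods, reference an existing ResourceClaim directly instead, as in the next example.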

Advanced DRA: Multi-Pod GPU Sharing

# Shareable ResourceClaim: a standalone claim is allocated once and can be
# referenced by multiple pods, which then share the allocated device
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu
  namespace: ml-research
spec:
  devices:
    requests:
    - name: inference-gpu
      exactly:
        deviceClassName: nvidia-gpu
        count: 1
---
# Multiple pods sharing the GPU
apiVersion: v1
kind: Pod
metadata:
  name: inference-1
spec:
  containers:
  - name: server
    image: inference:latest
    resources:
      claims:
      - name: shared-gpu
        request: inference-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-gpu  # Reference existing claim
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-2
spec:
  containers:
  - name: server
    image: inference:latest
    resources:
      claims:
      - name: shared-gpu
        request: inference-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-gpu  # Same claim
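
Both pods see the same physical GPU; Kubernetes does not partition memory or compute between them, so isolation has to come from the workloads themselves or from driver-level mechanisms such as MIG or MPS.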

Comparison: KAI vs DRA vs Kueue

Feature                   KAI Scheduler   DRA             Kueue
GPU Topology              Native          Via selectors   No
Queue Management          Built-in        No              Native
Fine-grained Allocation   Yes             Yes             No
Multi-pod Sharing         Via policies    Native          No
Gang Scheduling           Yes             No              Yes
Kubernetes Native         Extension       Native (1.34)   Native

┌─────────────────────────────────────────────────────────────────┐
│              Production GPU Scheduling Stack                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Layer 1: Queue Management (Kueue)                              │
│  └── Multi-tenant fairness, quotas, borrowing                   │
│                                                                  │
│  Layer 2: GPU Scheduling (KAI or DRA)                           │
│  └── Topology-aware, memory-aware placement                     │
│                                                                  │
│  Layer 3: Resource Management (GPU Operator)                    │
│  └── Drivers, device plugin, metrics                            │
│                                                                  │
│  Layer 4: Infrastructure (Kubernetes 1.34+)                     │
│  └── DRA GA, VolumeAttributesClass, OCI artifacts               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
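
As a concrete sketch of Layer 1, the manifests below set up a Kueue GPU quota: a ResourceFlavor, a ClusterQueue holding the quota, and a LocalQueue that jobs in the ml-research namespace submit to. The flavor name and quota values are illustrative, not prescriptive.

# Illustrative Kueue quota layer (Layer 1)
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-research
spec:
  namespaceSelector: {}  # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-queue
  namespace: ml-research
spec:
  clusterQueue: ml-research

Jobs opt in with the kueue.x-k8s.io/queue-name: gpu-queue label; once Kueue admits a workload against the quota, placement onto specific GPUs is left to the scheduling layer below (KAI or DRA).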

Next module: Kubeflow and ML Pipelines for end-to-end ML workflow orchestration.
