GPU Scheduling & Resource Management

Kueue & Volcano: Advanced GPU Scheduling

Organizations that treat GPUs as shared, policy-driven resources get far more out of their clusters at AI scale than those that let teams claim nodes ad hoc. Kueue and Volcano supply what native Kubernetes scheduling lacks for ML workloads: queue-based admission control and gang scheduling.

The Queue Management Problem

Without Queue Management

┌─────────────────────────────────────────────────────────────────┐
│              Problem: Native Kubernetes Scheduling               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Team A submits: 32 GPU job                                     │
│  Team B submits: 8 GPU job                                      │
│  Team C submits: 64 GPU job                                     │
│                                                                  │
│  Kubernetes behavior:                                           │
│  - First pod scheduled gets resources                           │
│  - No fairness across teams                                     │
│  - Distributed training: partial pod allocation (deadlock!)     │
│  - No borrowing/lending between quotas                          │
│  - Jobs stuck waiting with no visibility                        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

With Queue Management

┌─────────────────────────────────────────────────────────────────┐
│              Solution: Kueue Queue Management                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Cohort: gpu-cluster (64 GPUs total)                            │
│  ├── ClusterQueue: team-a (quota: 16 GPUs, can borrow 32)       │
│  ├── ClusterQueue: team-b (quota: 16 GPUs, can borrow 32)       │
│  └── ClusterQueue: team-c (quota: 32 GPUs, can borrow 16)       │
│  (teams submit via a LocalQueue in their own namespace)         │
│                                                                  │
│  Workload submitted → Admission control → Gang scheduling       │
│                                                                  │
│  Benefits:                                                       │
│  - Fair share across teams                                      │
│  - Gang admission (all-or-nothing)                              │
│  - Borrowing when queues are idle                               │
│  - Preemption policies                                          │
│  - Queue visibility and priorities                              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Kueue: Kubernetes-Native Job Queueing

Installing Kueue

# Install Kueue
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.9.0/manifests.yaml

# Verify installation
kubectl get pods -n kueue-system

# Check CRDs
kubectl get crd | grep kueue
# clusterqueues.kueue.x-k8s.io
# localqueues.kueue.x-k8s.io
# resourceflavors.kueue.x-k8s.io
# workloads.kueue.x-k8s.io

Resource Flavors

# Define GPU types as flavors
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: nvidia-a100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: nvidia-h100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: nvidia-l4
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-L4
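
These flavors only match nodes that actually carry the corresponding label, which is normally applied by NVIDIA GPU Feature Discovery (GFD). A quick sanity check, assuming GFD is running in your cluster (the flavor's nodeLabels must match these values exactly):

# Show each node's GPU product label as reported by GFD
kubectl get nodes -L nvidia.com/gpu.product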

ClusterQueue Configuration

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster
spec:
  cohort: gpu-cohort     # borrowingLimit only takes effect within a cohort
  namespaceSelector: {}  # All namespaces
  queueingStrategy: BestEffortFIFO
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: nvidia-a100
      resources:
      - name: "cpu"
        nominalQuota: 256
        borrowingLimit: 128
      - name: "memory"
        nominalQuota: 1Ti
        borrowingLimit: 512Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 32
        borrowingLimit: 16
    - name: nvidia-h100
      resources:
      - name: "cpu"
        nominalQuota: 128
      - name: "memory"
        nominalQuota: 512Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 16
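
Borrowing happens between ClusterQueues that share a cohort, so a lone ClusterQueue has no one to borrow from. A minimal sketch of a hypothetical second ClusterQueue joining the same cohort (the name and quotas are illustrative):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-batch
spec:
  cohort: gpu-cohort     # same cohort => unused quota can flow both ways
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: nvidia-a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16
        borrowingLimit: 8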

LocalQueue per Team

# Team A's queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-research-queue
  namespace: ml-research
spec:
  clusterQueue: gpu-cluster
---
# Team B's queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-production-queue
  namespace: ml-production
spec:
  clusterQueue: gpu-cluster
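
Teams can inspect their own queue without cluster-scope access; the LocalQueue status reports pending and admitted workload counts:

# List queues in the team namespace (shows pending/admitted workloads)
kubectl get localqueues -n ml-research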

Submitting Jobs to Kueue

apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-research
  labels:
    kueue.x-k8s.io/queue-name: ml-research-queue
spec:
  parallelism: 4  # 4 pods x 8 GPUs each = 32 GPUs, admitted as one workload
  completions: 4
  template:
    spec:
      containers:
      - name: trainer
        image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
        resources:
          requests:
            nvidia.com/gpu: 8
            cpu: "32"
            memory: "128Gi"
          limits:
            nvidia.com/gpu: 8
            cpu: "32"
            memory: "128Gi"
      restartPolicy: Never
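
Kueue gates the Job through the built-in suspend field: the Job is created suspended, and its pods are only released once the whole 32-GPU workload can be admitted at once. To watch this happen (names from the example above):

# "true" until Kueue admits the workload, then flipped to "false"
kubectl get job distributed-training -n ml-research -o jsonpath='{.spec.suspend}{"\n"}'

# Kueue tracks the Job through a generated Workload object
kubectl get workloads -n ml-research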

Monitoring Kueue

# Check queue status
kubectl get clusterqueue gpu-cluster -o yaml

# View pending/admitted workloads
kubectl get workloads -n ml-research

# Check LocalQueue status
kubectl describe localqueue ml-research-queue -n ml-research

Volcano: Gang Scheduling for Distributed Training

Why Gang Scheduling?

┌─────────────────────────────────────────────────────────────────┐
│              Problem: Partial Pod Allocation                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  4-worker distributed training needs 4 GPUs simultaneously      │
│                                                                  │
│  Without gang scheduling:                                        │
│  Worker 0: ✓ Scheduled (waiting for others)                     │
│  Worker 1: ✓ Scheduled (waiting for others)                     │
│  Worker 2: ✗ Pending (no GPU)                                   │
│  Worker 3: ✗ Pending (no GPU)                                   │
│                                                                  │
│  Result: DEADLOCK - GPUs wasted, training stuck!                │
│                                                                  │
│  With gang scheduling:                                          │
│  All 4 workers: ✗ Waiting until 4 GPUs available                │
│  All 4 workers: ✓ Admitted together                             │
│                                                                  │
│  Result: No wasted resources                                    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
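
Volcano implements gang semantics through a PodGroup: the scheduler binds none of the member pods until minMember of them can all be placed. The VolcanoJob shown later creates its PodGroup automatically from minAvailable, but a minimal sketch for plain pods looks like this (names are illustrative):

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-gang
  namespace: ml-research
spec:
  minMember: 4              # bind all 4 member pods or none
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  namespace: ml-research
  annotations:
    scheduling.k8s.io/group-name: training-gang  # join the gang
spec:
  schedulerName: volcano
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never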

Installing Volcano

# Install Volcano
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

# Verify
kubectl get pods -n volcano-system
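
# As with Kueue, confirm the CRDs landed
kubectl get crd | grep volcano.sh
# jobs.batch.volcano.sh
# podgroups.scheduling.volcano.sh
# queues.scheduling.volcano.sh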

Volcano Job Example

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-distributed
  namespace: ml-research
spec:
  minAvailable: 4  # Gang scheduling: all 4 or nothing
  schedulerName: volcano
  plugins:
    env: []  # injects VC_TASK_INDEX into each pod
    svc: []  # headless service so pods resolve each other by DNS
  queue: default
  tasks:
  - replicas: 1
    name: master
    template:
      spec:
        containers:
        - name: pytorch
          image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
          # svc plugin gives pods DNS names <job>-<task>-<index>.<job>;
          # torchrun replaces the deprecated torch.distributed.launch
          command: ["sh", "-c"]
          args:
          - torchrun --nnodes=4 --nproc_per_node=1 --node_rank=0
            --master_addr=pytorch-distributed-master-0.pytorch-distributed
            --master_port=29500 train.py
          resources:
            limits:
              nvidia.com/gpu: 1
        restartPolicy: OnFailure
  - replicas: 3
    name: worker
    template:
      spec:
        containers:
        - name: pytorch
          image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
          # env plugin injects VC_TASK_INDEX (0..2); workers take ranks 1..3
          command: ["sh", "-c"]
          args:
          - torchrun --nnodes=4 --nproc_per_node=1
            --node_rank=$((VC_TASK_INDEX + 1))
            --master_addr=pytorch-distributed-master-0.pytorch-distributed
            --master_port=29500 train.py
          resources:
            limits:
              nvidia.com/gpu: 1
        restartPolicy: OnFailure
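
Assuming the manifest above is saved as pytorch-distributed.yaml (a hypothetical filename), submit it and watch the gang admit as a unit (vcjob is the short name for Volcano Jobs):

kubectl apply -f pytorch-distributed.yaml

# All pods stay Pending until 4 GPUs are free, then start together
kubectl get vcjob pytorch-distributed -n ml-research
kubectl get pods -n ml-research

# The auto-created PodGroup reports gang admission state
kubectl get podgroups -n ml-research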

Kueue vs Volcano

Feature          Kueue                         Volcano
---------------  ----------------------------  ---------------------
Primary focus    Job queueing & admission      Gang scheduling
Preemption       Advanced policies             Basic
Multi-tenancy    Strong (cohorts, borrowing)   Basic
CRD required     Uses native Jobs              Custom VolcanoJob
CNCF status      Kubernetes SIG project        CNCF Incubating
Best for         Fair sharing, quotas          Distributed training

Recommendation: Use Kueue for queue management + Volcano for strict gang scheduling.

Next, we'll cover NVIDIA KAI Scheduler and Dynamic Resource Allocation (DRA) for cutting-edge GPU management.
