GPU Scheduling & Resource Management

GPU Sharing: MIG, Time-Slicing & MPS


GPU utilization in Kubernetes clusters often averages only 30-50%. Sharing strategies such as MIG, time-slicing, and MPS can dramatically improve efficiency, each with a different isolation trade-off.

GPU Sharing Strategies Comparison

| Strategy     | Isolation  | Use Case                                   | GPU Support     |
|--------------|------------|--------------------------------------------|-----------------|
| MIG          | Hardware   | Production inference, guaranteed resources | A100, H100      |
| Time-Slicing | Software   | Development, burst workloads               | All NVIDIA GPUs |
| MPS          | Process    | High-throughput inference                  | Pascal+         |
| vGPU         | Hypervisor | VMs, multi-tenant                          | Enterprise      |
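
A quick way to tell which strategies a node can even offer is to ask the driver whether its GPUs are MIG-capable. A minimal check (a sketch; run it on the GPU node or in any container that has the NVIDIA driver mounted):

# The second column prints the MIG mode for MIG-capable GPUs and
# "[N/A]" (or similar) otherwise, in which case only time-slicing or MPS apply
nvidia-smi --query-gpu=name,mig.mode.current --format=csv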

Multi-Instance GPU (MIG)

MIG Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    A100-80GB with MIG                            │
├─────────────────────────────────────────────────────────────────┤
│  Without MIG: 1 workload gets entire 80GB                       │
│                                                                  │
│  With MIG (7 instances):                                        │
│  ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│  │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │
│  │  Pod 1  │  Pod 2  │  Pod 3  │  Pod 4  │  Pod 5  │  Pod 6  │  Pod 7  │
│  └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
│                                                                  │
│  Alternative configurations:                                     │
│  ┌─────────────────────┬─────────────────────┬─────────────────────┐
│  │      2g.20gb        │      2g.20gb        │      3g.40gb        │
│  │       Pod 1         │       Pod 2         │       Pod 3         │
│  └─────────────────────┴─────────────────────┴─────────────────────┘
└─────────────────────────────────────────────────────────────────┘
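
MIG mode itself is toggled at the driver level. The GPU Operator's MIG Manager normally handles this for you, but for reference, here is a sketch of the manual steps on a node (enabling MIG may require draining workloads and resetting the GPU):

# Enable MIG mode on GPU 0 (admin privileges required)
sudo nvidia-smi -i 0 -mig 1

# Confirm the mode change (a GPU reset or node reboot may be needed first)
nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv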

MIG Profiles

| Profile | Memory | SMs | Use Case        |
|---------|--------|-----|-----------------|
| 1g.10gb | 10 GB  | 14  | Small inference |
| 2g.20gb | 20 GB  | 28  | Medium models   |
| 3g.40gb | 40 GB  | 42  | Large inference |
| 4g.40gb | 40 GB  | 56  | Training        |
| 7g.80gb | 80 GB  | 98  | Full GPU        |
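
Exact profile names and sizes vary by GPU model (the table above is for the A100-80GB). To see what a particular GPU offers, nvidia-smi can list the available profiles and any instances already created; a sketch assuming MIG mode is already enabled:

# List the GPU instance profiles supported by this GPU
nvidia-smi mig -lgip

# List the GPU instances that currently exist
nvidia-smi mig -lgi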

Configuring MIG in Kubernetes

# MIG configuration ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # All GPUs as small instances (inference)
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7

      # Mixed configuration
      mixed:
        - devices: [0, 1]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [2, 3]
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7

      # Training configuration
      training-4g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "4g.40gb": 1
            "3g.40gb": 1

Using MIG Instances in Pods

apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
  - name: model-server
    image: my-inference:latest
    resources:
      limits:
        # Request specific MIG slice
        nvidia.com/mig-1g.10gb: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-medium
spec:
  containers:
  - name: model-server
    image: my-inference:latest
    resources:
      limits:
        nvidia.com/mig-3g.40gb: 1
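
This resource naming assumes the device plugin's mixed MIG strategy; with the single strategy every slice is exposed as a plain nvidia.com/gpu instead. To confirm the slices are actually schedulable (gpu-node-1 again hypothetical):

# Each MIG slice appears as its own allocatable extended resource
kubectl describe node gpu-node-1 | grep "nvidia.com/mig-"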

Time-Slicing

Time-Slicing Configuration

# Time-slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # 4 pods can share each GPU

Applying Time-Slicing

# Apply configuration
kubectl apply -f time-slicing-config.yaml

# (Optional) Pin specific nodes to a named entry in the ConfigMap;
# the label value must match a data key ("any" in this example)
kubectl label nodes gpu-node-1 \
  nvidia.com/device-plugin.config=any

# Patch GPU Operator to use config
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

Time-Slicing Pod Example

# 4 pods sharing one GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-deployment
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: inference
        image: my-inference:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Each gets time-slice

Time-slicing considerations:

  • No memory isolation (OOM affects all pods)
  • Context switching overhead (~5-10%)
  • Best for burst/interactive workloads
  • Not recommended for latency-sensitive inference

Multi-Process Service (MPS)

MPS Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    MPS Architecture                              │
├─────────────────────────────────────────────────────────────────┤
│  Without MPS:                                                    │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                     │
│  │ Process │    │ Process │    │ Process │  (Context switches) │
│  └────┬────┘    └────┬────┘    └────┬────┘                     │
│       │              │              │                           │
│       └──────────────┼──────────────┘                           │
│                      ↓                                          │
│              ┌───────────────┐                                  │
│              │      GPU      │                                  │
│              └───────────────┘                                  │
├─────────────────────────────────────────────────────────────────┤
│  With MPS:                                                       │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                     │
│  │ Process │    │ Process │    │ Process │                     │
│  └────┬────┘    └────┬────┘    └────┬────┘                     │
│       │              │              │                           │
│       └──────────────┼──────────────┘                           │
│                      ↓                                          │
│              ┌───────────────┐                                  │
│              │  MPS Server   │  (Single context)               │
│              └───────┬───────┘                                  │
│                      ↓                                          │
│              ┌───────────────┐                                  │
│              │      GPU      │                                  │
│              └───────────────┘                                  │
└─────────────────────────────────────────────────────────────────┘

MPS Daemon Configuration

# Deploy MPS daemon as DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-mps
  template:
    metadata:
      labels:
        app: nvidia-mps
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: mps
        image: nvidia/cuda:12.1.1-base-ubuntu22.04
        command:
        - /bin/bash
        - -c
        - |
          nvidia-cuda-mps-control -d
          sleep infinity
        securityContext:
          privileged: true
        env:
        - name: CUDA_MPS_PIPE_DIRECTORY
          value: /tmp/nvidia-mps
        - name: CUDA_MPS_LOG_DIRECTORY
          value: /tmp/nvidia-mps-log
        volumeMounts:
        - name: mps-pipe
          mountPath: /tmp/nvidia-mps
        resources:
          limits:
            nvidia.com/gpu: 1
      volumes:
      - name: mps-pipe
        hostPath:
          path: /tmp/nvidia-mps
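
Client pods only benefit from MPS if they reach the same pipe directory as the daemon. A minimal sketch of a client (my-inference:latest is a placeholder, and how the client gets GPU visibility, for example by combining MPS with time-slicing, depends on your cluster setup, so resource requests are omitted here):

apiVersion: v1
kind: Pod
metadata:
  name: mps-client
spec:
  containers:
  - name: inference
    image: my-inference:latest
    env:
    # Must match the pipe directory the MPS control daemon was started with
    - name: CUDA_MPS_PIPE_DIRECTORY
      value: /tmp/nvidia-mps
    # Optional: cap the fraction of SMs this client may use
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
      value: "25"
    volumeMounts:
    - name: mps-pipe
      mountPath: /tmp/nvidia-mps
  volumes:
  - name: mps-pipe
    hostPath:
      path: /tmp/nvidia-mps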

Choosing the Right Strategy

┌─────────────────────────────────────────────────────────────────┐
│              Decision Tree: GPU Sharing Strategy                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Need hardware isolation? ──Yes──> MIG (A100/H100)              │
│           │                                                      │
│          No                                                      │
│           │                                                      │
│  High throughput inference? ──Yes──> MPS                        │
│           │                                                      │
│          No                                                      │
│           │                                                      │
│  Development/burst workloads? ──Yes──> Time-Slicing             │
│           │                                                      │
│          No                                                      │
│           │                                                      │
│  Full GPU exclusive ─────────────────> No sharing               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

| Workload Type         | Recommended Strategy |
|-----------------------|----------------------|
| Production inference  | MIG                  |
| Development notebooks | Time-slicing         |
| Batch inference       | MPS                  |
| Large model training  | Exclusive            |
| Hyperparameter tuning | Time-slicing         |

Next, we'll explore Kueue and Volcano for advanced GPU queue management and gang scheduling.
