GPU Scheduling & Resource Management
GPU Sharing: MIG, Time-Slicing & MPS
4 min read
GPU utilization in Kubernetes clusters commonly averages only 30-50%. Sharing strategies such as MIG, time-slicing, and MPS can dramatically improve efficiency; the trade-off is how much isolation each approach preserves, as the comparison below shows.
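To see where your own cluster sits, utilization can be spot-checked per node or scraped cluster-wide. A minimal sketch, assuming NVIDIA drivers on the node and the DCGM exporter feeding Prometheus (metric name per dcgm-exporter defaults):
# Per-node spot check
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv
# Cluster-wide average (PromQL against dcgm-exporter metrics):
#   avg(DCGM_FI_DEV_GPU_UTIL)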
GPU Sharing Strategies Comparison
| Strategy | Isolation | Use Case | GPU Support |
|---|---|---|---|
| MIG | Hardware | Production inference, guaranteed resources | A100, H100 |
| Time-Slicing | Software | Development, burst workloads | All NVIDIA GPUs |
| MPS | Process | High-throughput inference | Volta+ (limited on older GPUs) |
| vGPU | Hypervisor | VMs, multi-tenant | Enterprise |
Multi-Instance GPU (MIG)
MIG Architecture
┌─────────────────────────────────────────────────────────────────┐
│ A100-80GB with MIG │
├─────────────────────────────────────────────────────────────────┤
│ Without MIG: 1 workload gets entire 80GB │
│ │
│ With MIG (7 instances): │
│ ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │
│ │ Pod 1 │ Pod 2 │ Pod 3 │ Pod 4 │ Pod 5 │ Pod 6 │ Pod 7 │
│ └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
│ │
│ Alternative configurations: │
│ ┌─────────────────────┬─────────────────────┬─────────────────────┐
│ │ 2g.20gb │ 2g.20gb │ 3g.40gb │
│ │ Pod 1 │ Pod 2 │ Pod 3 │
│ └─────────────────────┴─────────────────────┴─────────────────────┘
└─────────────────────────────────────────────────────────────────┘
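Before any partitioning, MIG mode must be enabled on each GPU. In a GPU Operator cluster the MIG manager handles this for you; a manual sketch looks like this (GPU index 0 assumed; a GPU reset or node reboot may be required):
# Enable MIG mode on GPU 0 (admin privileges required)
sudo nvidia-smi -i 0 -mig 1
# Verify MIG mode is enabled
nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv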
MIG Profiles
| Profile | Memory | SMs | Use Case |
|---|---|---|---|
| 1g.10gb | 10GB | 14 | Small inference |
| 2g.20gb | 20GB | 28 | Medium models |
| 3g.40gb | 40GB | 42 | Large inference |
| 4g.40gb | 40GB | 56 | Training |
| 7g.80gb | 80GB | 98 | Full GPU |
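The same profiles can be inspected and created directly with nvidia-smi, which is useful for validating a layout before encoding it in Kubernetes. A sketch, assuming MIG mode is already enabled on the GPU:
# List the GPU instance profiles supported by this GPU
sudo nvidia-smi mig -lgip
# Create two 3g.40gb GPU instances plus matching compute instances
sudo nvidia-smi mig -cgi 3g.40gb,3g.40gb -C
# Show the resulting MIG devices
nvidia-smi -L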
Configuring MIG in Kubernetes
# MIG configuration ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # All GPUs as small instances (inference)
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      # Mixed configuration
      mixed:
        - devices: [0, 1]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [2, 3]
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      # Training configuration
      training-4g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "4g.40gb": 1
            "3g.40gb": 1
Using MIG Instances in Pods
apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
  - name: model-server
    image: my-inference:latest
    resources:
      limits:
        # Request specific MIG slice
        nvidia.com/mig-1g.10gb: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-medium
spec:
  containers:
  - name: model-server
    image: my-inference:latest
    resources:
      limits:
        nvidia.com/mig-3g.40gb: 1
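Once the device plugin picks up the new layout, the MIG slices show up as node-level extended resources. A quick check (node name assumed):
# MIG slices appear as allocatable extended resources
kubectl describe node gpu-node-1 | grep nvidia.com/mig
# e.g. nvidia.com/mig-1g.10gb: 7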
Time-Slicing
Time-Slicing Configuration
# Time-slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # 4 pods can share each GPU
Applying Time-Slicing
# Apply configuration
kubectl apply -f time-slicing-config.yaml
# (Optional) Pin a node to a named config key from the ConfigMap;
# the value must match a key in the ConfigMap ("any" here)
kubectl label nodes gpu-node-1 \
  nvidia.com/device-plugin.config=any
# Patch GPU Operator to use config
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
-n gpu-operator \
--type merge \
-p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
Time-Slicing Pod Example
# 4 pods sharing one GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-deployment
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: inference
        image: my-inference:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Each gets time-slice
Time-slicing considerations:
- No memory isolation (OOM affects all pods)
- Context-switching overhead (typically around 5-10%, workload-dependent)
- Best for burst/interactive workloads
- Not recommended for latency-sensitive inference
Multi-Process Service (MPS)
MPS Architecture
┌─────────────────────────────────────────────────────────────────┐
│ MPS Architecture │
├─────────────────────────────────────────────────────────────────┤
│ Without MPS: │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Process │ │ Process │ │ Process │ (Context switches) │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └──────────────┼──────────────┘ │
│ ↓ │
│ ┌───────────────┐ │
│ │ GPU │ │
│ └───────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ With MPS: │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Process │ │ Process │ │ Process │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └──────────────┼──────────────┘ │
│ ↓ │
│ ┌───────────────┐ │
│ │ MPS Server │ (Single context) │
│ └───────┬───────┘ │
│ ↓ │
│ ┌───────────────┐ │
│ │ GPU │ │
│ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
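Outside Kubernetes, the MPS control daemon is managed with nvidia-cuda-mps-control. A minimal sketch of starting, tuning, and stopping it on a node (directories match the DaemonSet below; the thread-percentage cap is optional):
# Start the MPS control daemon
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d
# Optionally cap each client's share of SMs (here ~25%)
echo "set_default_active_thread_percentage 25" | nvidia-cuda-mps-control
# Stop the daemon
echo quit | nvidia-cuda-mps-control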
MPS Daemon Configuration
# Deploy MPS daemon as DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-mps
  template:
    metadata:
      labels:
        app: nvidia-mps
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: mps
        image: nvidia/cuda:12.1.0-base-ubuntu22.04
        command:
        - /bin/bash
        - -c
        - |
          nvidia-cuda-mps-control -d
          sleep infinity
        securityContext:
          privileged: true
        env:
        - name: CUDA_MPS_PIPE_DIRECTORY
          value: /tmp/nvidia-mps
        - name: CUDA_MPS_LOG_DIRECTORY
          value: /tmp/nvidia-mps-log
        volumeMounts:
        - name: mps-pipe
          mountPath: /tmp/nvidia-mps
        resources:
          limits:
            nvidia.com/gpu: 1
      volumes:
      - name: mps-pipe
        hostPath:
          path: /tmp/nvidia-mps
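Workload pods attach to this daemon through the shared pipe directory. A sketch of the client side (names are illustrative; because the DaemonSet above already holds the nvidia.com/gpu resource, the client here relies on the NVIDIA container runtime for device access rather than requesting the resource itself):
# MPS client pod (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: mps-client
spec:
  hostIPC: true                        # clients typically share IPC with the MPS daemon
  containers:
  - name: inference
    image: my-inference:latest
    env:
    - name: CUDA_MPS_PIPE_DIRECTORY    # must match the daemon's pipe directory
      value: /tmp/nvidia-mps
    - name: NVIDIA_VISIBLE_DEVICES     # device access via the NVIDIA container runtime
      value: all
    volumeMounts:
    - name: mps-pipe
      mountPath: /tmp/nvidia-mps
  volumes:
  - name: mps-pipe
    hostPath:
      path: /tmp/nvidia-mps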
Choosing the Right Strategy
┌─────────────────────────────────────────────────────────────────┐
│ Decision Tree: GPU Sharing Strategy │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Need hardware isolation? ──Yes──> MIG (A100/H100) │
│ │ │
│ No │
│ │ │
│ High throughput inference? ──Yes──> MPS │
│ │ │
│ No │
│ │ │
│ Development/burst workloads? ──Yes──> Time-Slicing │
│ │ │
│ No │
│ │ │
│ Full GPU exclusive ─────────────────> No sharing │
│ │
└─────────────────────────────────────────────────────────────────┘
| Workload Type | Recommended Strategy |
|---|---|
| Production inference | MIG |
| Development notebooks | Time-slicing |
| Batch inference | MPS |
| Large model training | Exclusive |
| Hyperparameter tuning | Time-slicing |
Next, we'll explore Kueue and Volcano for advanced GPU queue management and gang scheduling.