GPU Scheduling & Resource Management

GPU Sharing: MIG, Time-Slicing & MPS


GPU utilization in Kubernetes clusters often averages only 30-50%. Sharing strategies such as MIG, time-slicing, and MPS can dramatically improve efficiency; each trades off isolation, overhead, and flexibility differently.

GPU Sharing Strategies Comparison

| Strategy     | Isolation  | Use Case                                   | GPU Support     |
|--------------|------------|--------------------------------------------|-----------------|
| MIG          | Hardware   | Production inference, guaranteed resources | A100, H100      |
| Time-Slicing | Software   | Development, burst workloads               | All NVIDIA GPUs |
| MPS          | Process    | High-throughput inference                  | Pascal+         |
| vGPU         | Hypervisor | VMs, multi-tenant                          | Enterprise      |
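
Before choosing a strategy, it is worth confirming what the hardware actually supports. A minimal check with nvidia-smi (assuming the NVIDIA driver is installed on the node) shows each GPU's model and whether MIG mode is currently enabled:

# Query each GPU's name and current MIG mode
nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv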

Multi-Instance GPU (MIG)

MIG Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    A100-80GB with MIG                            │
├─────────────────────────────────────────────────────────────────┤
│  Without MIG: 1 workload gets entire 80GB                       │
│                                                                  │
│  With MIG (7 instances):                                        │
│  ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│  │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │
│  │  Pod 1  │  Pod 2  │  Pod 3  │  Pod 4  │  Pod 5  │  Pod 6  │  Pod 7  │
│  └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
│                                                                  │
│  Alternative configurations:                                     │
│  ┌─────────────────────┬─────────────────────┬─────────────────────┐
│  │      2g.20gb        │      2g.20gb        │      3g.40gb        │
│  │       Pod 1         │       Pod 2         │       Pod 3         │
│  └─────────────────────┴─────────────────────┴─────────────────────┘
└─────────────────────────────────────────────────────────────────┘

MIG Profiles

| Profile | Memory | SMs | Use Case        |
|---------|--------|-----|-----------------|
| 1g.10gb | 10 GB  | 14  | Small inference |
| 2g.20gb | 20 GB  | 28  | Medium models   |
| 3g.40gb | 40 GB  | 42  | Large inference |
| 4g.40gb | 40 GB  | 56  | Training        |
| 7g.80gb | 80 GB  | 98  | Full GPU        |
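
The profiles a GPU exposes depend on its model and memory size; the table above reflects the A100-80GB. A quick sketch for inspecting this directly on a node (assuming root access on the host):

# Enable MIG mode on GPU 0 (may require a GPU reset or reboot)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this hardware supports
sudo nvidia-smi mig -lgip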

Configuring MIG in Kubernetes

# MIG configuration ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # All GPUs as small instances (inference)
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7

      # Mixed configuration
      mixed:
        - devices: [0, 1]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [2, 3]
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7

      # Training configuration
      training-4g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "4g.40gb": 1
            "3g.40gb": 1

Using MIG Instances in Pods

apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
  - name: model-server
    image: my-inference:latest
    resources:
      limits:
        # Request specific MIG slice
        nvidia.com/mig-1g.10gb: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-medium
spec:
  containers:
  - name: model-server
    image: my-inference:latest
    resources:
      limits:
        nvidia.com/mig-3g.40gb: 1
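
Once partitioning completes, the device plugin advertises each slice as an extended resource, so you can confirm what a node offers before scheduling against it (node name assumed):

# Show the MIG resources this node advertises
kubectl describe node gpu-node-1 | grep -i "nvidia.com/mig"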

Time-Slicing

Time-Slicing Configuration

# Time-slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # 4 pods can share each GPU

Applying Time-Slicing

# Apply configuration
kubectl apply -f time-slicing-config.yaml

# Label nodes for time-slicing
kubectl label nodes gpu-node-1 \
  nvidia.com/device-plugin.config=time-slicing

# Patch GPU Operator to use config
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

Time-Slicing Pod Example

# 4 pods sharing one GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-deployment
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: inference
        image: my-inference:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Each gets time-slice

Time-slicing considerations:

  • No memory isolation (OOM affects all pods)
  • Context switching overhead (~5-10%)
  • Best for burst/interactive workloads
  • Not recommended for latency-sensitive inference

Multi-Process Service (MPS)

MPS Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    MPS Architecture                              │
├─────────────────────────────────────────────────────────────────┤
│  Without MPS:                                                    │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                     │
│  │ Process │    │ Process │    │ Process │  (Context switches) │
│  └────┬────┘    └────┬────┘    └────┬────┘                     │
│       │              │              │                           │
│       └──────────────┼──────────────┘                           │
│                      ↓                                          │
│              ┌───────────────┐                                  │
│              │      GPU      │                                  │
│              └───────────────┘                                  │
├─────────────────────────────────────────────────────────────────┤
│  With MPS:                                                       │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                     │
│  │ Process │    │ Process │    │ Process │                     │
│  └────┬────┘    └────┬────┘    └────┬────┘                     │
│       │              │              │                           │
│       └──────────────┼──────────────┘                           │
│                      ↓                                          │
│              ┌───────────────┐                                  │
│              │  MPS Server   │  (Single context)               │
│              └───────┬───────┘                                  │
│                      ↓                                          │
│              ┌───────────────┐                                  │
│              │      GPU      │                                  │
│              └───────────────┘                                  │
└─────────────────────────────────────────────────────────────────┘

MPS Daemon Configuration

# Deploy MPS daemon as DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-mps
  template:
    metadata:
      labels:
        app: nvidia-mps
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: mps
        image: nvidia/cuda:12.1-base-ubuntu22.04
        command:
        - /bin/bash
        - -c
        - |
          nvidia-cuda-mps-control -d
          sleep infinity
        securityContext:
          privileged: true
        env:
        - name: CUDA_MPS_PIPE_DIRECTORY
          value: /tmp/nvidia-mps
        - name: CUDA_MPS_LOG_DIRECTORY
          value: /tmp/nvidia-mps-log
        volumeMounts:
        - name: mps-pipe
          mountPath: /tmp/nvidia-mps
        resources:
          limits:
            nvidia.com/gpu: 1
      volumes:
      - name: mps-pipe
        hostPath:
          path: /tmp/nvidia-mps
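
Client pods must reach the same MPS control daemon, which in this hostPath setup means sharing the host IPC namespace and the pipe directory. A minimal sketch, assuming the DaemonSet above; CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is an optional knob that caps how many SMs each client may occupy:

# MPS client pod sharing the daemon's pipe directory
apiVersion: v1
kind: Pod
metadata:
  name: inference-mps-client
spec:
  hostIPC: true  # share the host IPC namespace with the MPS daemon
  containers:
  - name: model-server
    image: my-inference:latest
    env:
    - name: CUDA_MPS_PIPE_DIRECTORY
      value: /tmp/nvidia-mps  # must match the daemon's pipe directory
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
      value: "25"  # cap this client at roughly a quarter of the SMs
    volumeMounts:
    - name: mps-pipe
      mountPath: /tmp/nvidia-mps
  volumes:
  - name: mps-pipe
    hostPath:
      path: /tmp/nvidia-mps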

Choosing the Right Strategy

┌─────────────────────────────────────────────────────────────────┐
│              Decision Tree: GPU Sharing Strategy                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Need hardware isolation? ──Yes──> MIG (A100/H100)              │
│           │                                                      │
│          No                                                      │
│           │                                                      │
│  High throughput inference? ──Yes──> MPS                        │
│           │                                                      │
│          No                                                      │
│           │                                                      │
│  Development/burst workloads? ──Yes──> Time-Slicing             │
│           │                                                      │
│          No                                                      │
│           │                                                      │
│  Full GPU exclusive ─────────────────> No sharing               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

| Workload Type         | Recommended Strategy |
|-----------------------|----------------------|
| Production inference  | MIG                  |
| Development notebooks | Time-slicing         |
| Batch inference       | MPS                  |
| Large model training  | Exclusive            |
| Hyperparameter tuning | Time-slicing         |

Next, we'll explore Kueue and Volcano for advanced GPU queue management and gang scheduling.
