GPU Scheduling & Resource Management
GPU Sharing: MIG, Time-Slicing & MPS
4 min read
GPU utilization in Kubernetes clusters commonly averages only 30-50%. Sharing strategies such as MIG, time-slicing, and MPS can dramatically improve efficiency; the trade-off is how much isolation each approach preserves, as the comparison below shows.
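To see where your own cluster sits, utilization can be spot-checked per node or scraped cluster-wide. A minimal sketch, assuming NVIDIA drivers on the node and the DCGM exporter feeding Prometheus (metric name per dcgm-exporter defaults):
# Per-node spot check
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv
# Cluster-wide average (PromQL against dcgm-exporter metrics):
#   avg(DCGM_FI_DEV_GPU_UTIL)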
GPU Sharing Strategies Comparison
| Strategy | Isolation | Use Case | GPU Support |
|---|---|---|---|
| MIG | Hardware | Production inference, guaranteed resources | A100, H100 |
| Time-Slicing | Software | Development, burst workloads | All NVIDIA GPUs |
| MPS | Process | High-throughput inference | Volta+ (limited on older GPUs) |
| vGPU | Hypervisor | VMs, multi-tenant | Enterprise |
Multi-Instance GPU (MIG)
MIG Architecture
┌─────────────────────────────────────────────────────────────────┐
│ A100-80GB with MIG │
├─────────────────────────────────────────────────────────────────┤
│ Without MIG: 1 workload gets entire 80GB │
│ │
│ With MIG (7 instances): │
│ ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │ 1g.10gb │
│ │ Pod 1 │ Pod 2 │ Pod 3 │ Pod 4 │ Pod 5 │ Pod 6 │ Pod 7 │
│ └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
│ │
│ Alternative configurations: │
│ ┌─────────────────────┬─────────────────────┬─────────────────────┐
│ │ 2g.20gb │ 2g.20gb │ 3g.40gb │
│ │ Pod 1 │ Pod 2 │ Pod 3 │
│ └─────────────────────┴─────────────────────┴─────────────────────┘
└─────────────────────────────────────────────────────────────────┘
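Before any partitioning, MIG mode must be enabled on each GPU. In a GPU Operator cluster the MIG manager handles this for you; a manual sketch looks like this (GPU index 0 assumed; a GPU reset or node reboot may be required):
# Enable MIG mode on GPU 0 (admin privileges required)
sudo nvidia-smi -i 0 -mig 1
# Verify MIG mode is enabled
nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv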
MIG Profiles
| Profile | Memory | SMs | Use Case |
|---|---|---|---|
| 1g.10gb | 10GB | 14 | Small inference |
| 2g.20gb | 20GB | 28 | Medium models |
| 3g.40gb | 40GB | 42 | Large inference |
| 4g.40gb | 40GB | 56 | Training |
| 7g.80gb | 80GB | 98 | Full GPU |
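The same profiles can be inspected and created directly with nvidia-smi, which is useful for validating a layout before encoding it in Kubernetes. A sketch, assuming MIG mode is already enabled on the GPU:
# List the GPU instance profiles supported by this GPU
sudo nvidia-smi mig -lgip
# Create two 3g.40gb GPU instances plus matching compute instances
sudo nvidia-smi mig -cgi 3g.40gb,3g.40gb -C
# Show the resulting MIG devices
nvidia-smi -L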
Configuring MIG in Kubernetes
# MIG configuration ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # All GPUs as small instances (inference)
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      # Mixed configuration
      mixed:
        - devices: [0, 1]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [2, 3]
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      # Training configuration
      training-4g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "4g.40gb": 1
            "3g.40gb": 1
Using MIG Instances in Pods
apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
  - name: model-server
    image: my-inference:latest
    resources:
      limits:
        # Request specific MIG slice
        nvidia.com/mig-1g.10gb: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-medium
spec:
  containers:
  - name: model-server
    image: my-inference:latest
    resources:
      limits:
        nvidia.com/mig-3g.40gb: 1
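Once the device plugin picks up the new layout, the MIG slices show up as node-level extended resources. A quick check (node name assumed):
# MIG slices appear as allocatable extended resources
kubectl describe node gpu-node-1 | grep nvidia.com/mig
# e.g. nvidia.com/mig-1g.10gb: 7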
Time-Slicing
Time-Slicing Configuration
# Time-slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # 4 pods can share each GPU
Applying Time-Slicing
# Apply configuration
kubectl apply -f time-slicing-config.yaml
# (Optional) Pin a node to a named config key from the ConfigMap;
# the value must match a key in the ConfigMap ("any" here)
kubectl label nodes gpu-node-1 \
  nvidia.com/device-plugin.config=any
# Patch GPU Operator to use config
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
-n gpu-operator \
--type merge \
-p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
Time-Slicing Pod Example
# 4 pods sharing one GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-deployment
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: inference
        image: my-inference:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Each gets time-slice
Time-slicing considerations:
- No memory isolation (OOM affects all pods)
- Context-switching overhead (typically around 5-10%, workload-dependent)
- Best for burst/interactive workloads
- Not recommended for latency-sensitive inference
Multi-Process Service (MPS)
MPS Architecture
┌─────────────────────────────────────────────────────────────────┐
│ MPS Architecture │
├─────────────────────────────────────────────────────────────────┤
│ Without MPS: │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Process │ │ Process │ │ Process │ (Context switches) │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └──────────────┼──────────────┘ │
│ ↓ │
│ ┌───────────────┐ │
│ │ GPU │ │
│ └───────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ With MPS: │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Process │ │ Process │ │ Process │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └──────────────┼──────────────┘ │
│ ↓ │
│ ┌───────────────┐ │
│ │ MPS Server │ (Single context) │
│ └───────┬───────┘ │
│ ↓ │
│ ┌───────────────┐ │
│ │ GPU │ │
│ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
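Outside Kubernetes, the MPS control daemon is managed with nvidia-cuda-mps-control. A minimal sketch of starting, tuning, and stopping it on a node (directories match the DaemonSet below; the thread-percentage cap is optional):
# Start the MPS control daemon
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d
# Optionally cap each client's share of SMs (here ~25%)
echo "set_default_active_thread_percentage 25" | nvidia-cuda-mps-control
# Stop the daemon
echo quit | nvidia-cuda-mps-control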
MPS Daemon Configuration
# Deploy MPS daemon as DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-mps
  template:
    metadata:
      labels:
        app: nvidia-mps
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: mps
        image: nvidia/cuda:12.1.0-base-ubuntu22.04
        command:
        - /bin/bash
        - -c
        - |
          nvidia-cuda-mps-control -d
          sleep infinity
        securityContext:
          privileged: true
        env:
        - name: CUDA_MPS_PIPE_DIRECTORY
          value: /tmp/nvidia-mps
        - name: CUDA_MPS_LOG_DIRECTORY
          value: /tmp/nvidia-mps-log
        volumeMounts:
        - name: mps-pipe
          mountPath: /tmp/nvidia-mps
        resources:
          limits:
            nvidia.com/gpu: 1
      volumes:
      - name: mps-pipe
        hostPath:
          path: /tmp/nvidia-mps
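Workload pods attach to this daemon through the shared pipe directory. A sketch of the client side (names are illustrative; because the DaemonSet above already holds the nvidia.com/gpu resource, the client here relies on the NVIDIA container runtime for device access rather than requesting the resource itself):
# MPS client pod (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: mps-client
spec:
  hostIPC: true                        # clients typically share IPC with the MPS daemon
  containers:
  - name: inference
    image: my-inference:latest
    env:
    - name: CUDA_MPS_PIPE_DIRECTORY    # must match the daemon's pipe directory
      value: /tmp/nvidia-mps
    - name: NVIDIA_VISIBLE_DEVICES     # device access via the NVIDIA container runtime
      value: all
    volumeMounts:
    - name: mps-pipe
      mountPath: /tmp/nvidia-mps
  volumes:
  - name: mps-pipe
    hostPath:
      path: /tmp/nvidia-mps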
Choosing the Right Strategy
┌─────────────────────────────────────────────────────────────────┐
│ Decision Tree: GPU Sharing Strategy │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Need hardware isolation? ──Yes──> MIG (A100/H100) │
│ │ │
│ No │
│ │ │
│ High throughput inference? ──Yes──> MPS │
│ │ │
│ No │
│ │ │
│ Development/burst workloads? ──Yes──> Time-Slicing │
│ │ │
│ No │
│ │ │
│ Full GPU exclusive ─────────────────> No sharing │
│ │
└─────────────────────────────────────────────────────────────────┘
| Workload Type | Recommended Strategy |
|---|---|
| Production inference | MIG |
| Development notebooks | Time-slicing |
| Batch inference | MPS |
| Large model training | Exclusive |
| Hyperparameter tuning | Time-slicing |
Next, we'll explore Kueue and Volcano for advanced GPU queue management and gang scheduling.