Production Operations & GitOps

Cost Optimization for GPU Workloads

GPU costs often dominate ML infrastructure budgets. Effective cost optimization combines right-sizing, spot instances, GPU sharing, and intelligent scheduling; together these techniques can typically reduce GPU spend by 40-70% while maintaining performance.

Cost Optimization Strategies

┌──────────────────────────────────────────────────────────────┐
│                 GPU Cost Optimization Layers                 │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Layer 1: Right-Sizing                                 │  │
│  │  - Match GPU type to workload requirements             │  │
│  │  - Avoid over-provisioning memory/compute              │  │
│  │  - Use profiling to identify actual needs              │  │
│  └────────────────────────────────────────────────────────┘  │
│                              │                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Layer 2: Spot/Preemptible Instances (50-90% savings)  │  │
│  │  - Training jobs on spot instances                     │  │
│  │  - Checkpointing for fault tolerance                   │  │
│  │  - Fallback to on-demand when needed                   │  │
│  └────────────────────────────────────────────────────────┘  │
│                              │                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Layer 3: GPU Sharing & Time-Slicing                   │  │
│  │  - MIG for A100/H100 workload isolation                │  │
│  │  - Time-slicing for smaller workloads                  │  │
│  │  - Multi-instance serving                              │  │
│  └────────────────────────────────────────────────────────┘  │
│                              │                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Layer 4: Intelligent Scheduling                       │  │
│  │  - Bin packing optimization                            │  │
│  │  - Preemption policies                                 │  │
│  │  - Queue-based admission (Kueue)                       │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
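
Layer 3 depends on how GPUs are shared at the node level. As a minimal sketch, assuming the NVIDIA device plugin (or GPU Operator) is configured to read a sharing config, a time-slicing ConfigMap like the following advertises each physical GPU as several schedulable nvidia.com/gpu replicas; the name, namespace, and replica count are illustrative:

# Illustrative time-slicing config for the NVIDIA k8s-device-plugin.
# Each physical GPU is exposed as 4 nvidia.com/gpu replicas so several
# small inference pods can share one card (unlike MIG, no memory isolation).
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # illustrative name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4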

Spot Instance Configuration

# GKE spot GPU node pool, expressed as a Config Connector ContainerNodePool
# (cluster reference and location are illustrative)
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: gpu-spot-pool
spec:
  location: us-central1
  clusterRef:
    name: ml-production
  autoscaling:
    minNodeCount: 0   # scale to zero when no GPU workloads are pending
    maxNodeCount: 4
  nodeConfig:
    machineType: a2-highgpu-4g  # 4x A100 40GB
    spot: true
    guestAccelerator:
    - type: nvidia-tesla-a100
      count: 4
    taint:
    - key: cloud.google.com/gke-spot
      value: "true"
      effect: NO_SCHEDULE
---
# Training job with spot tolerance
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  template:
    spec:
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: trainer
        image: training:v1
        resources:
          limits:
            nvidia.com/gpu: 4
        env:
        - name: CHECKPOINT_DIR
          value: "s3://checkpoints/training-job"
        - name: CHECKPOINT_INTERVAL
          value: "300"  # Save every 5 minutes
      restartPolicy: OnFailure
---
# AWS Karpenter for spot GPU provisioning
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["p4d.24xlarge", "p3.16xlarge", "g5.48xlarge"]
  - key: nvidia.com/gpu.product
    operator: In
    values: ["NVIDIA-A100-SXM4-40GB", "NVIDIA-V100-SXM2-16GB"]
  limits:
    resources:
      nvidia.com/gpu: 100
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 86400
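
With the Provisioner above allowing both capacity types, individual pods can also express a spot preference while remaining schedulable on on-demand capacity. A minimal sketch using the karpenter.sh/capacity-type node label (pod and image names are illustrative):

# Weighted node affinity: prefer spot-provisioned nodes, but fall back to
# on-demand nodes when no spot capacity is available.
apiVersion: v1
kind: Pod
metadata:
  name: spot-preferred-trainer   # illustrative name
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: karpenter.sh/capacity-type
            operator: In
            values: ["spot"]
  containers:
  - name: trainer
    image: training:v1
    resources:
      limits:
        nvidia.com/gpu: 1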

GPU Utilization Monitoring

# OpenCost for GPU cost tracking
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opencost
  namespace: opencost
spec:
  selector:
    matchLabels:
      app: opencost
  template:
    metadata:
      labels:
        app: opencost
    spec:
      containers:
      - name: opencost
        image: ghcr.io/opencost/opencost:1.110
        env:
        - name: CLUSTER_ID
          value: "ml-production"
        - name: PROMETHEUS_SERVER_ENDPOINT
          value: "http://prometheus:9090"
        - name: GPU_ENABLED
          value: "true"
        - name: CLOUD_PROVIDER_API_KEY
          valueFrom:
            secretKeyRef:
              name: cloud-provider-key
              key: api-key
---
# Grafana dashboard for GPU cost analysis
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-cost-dashboard
data:
  dashboard.json: |
    {
      "panels": [
        {
          "title": "GPU Cost by Namespace",
          "targets": [{
            "expr": "sum(gpu_cost_total) by (namespace)"
          }]
        },
        {
          "title": "GPU Utilization vs Cost Efficiency",
          "targets": [{
            "expr": "avg(DCGM_FI_DEV_GPU_UTIL) by (pod) / sum(gpu_cost_hourly) by (pod)"
          }]
        },
        {
          "title": "Idle GPU Cost (Waste)",
          "targets": [{
            "expr": "sum(gpu_cost_hourly * (1 - DCGM_FI_DEV_GPU_UTIL/100))"
          }]
        }
      ]
    }
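
To act on the "Idle GPU Cost (Waste)" panel rather than just chart it, an alerting rule can flag GPUs that sit idle. A minimal sketch, assuming the Prometheus Operator CRDs and DCGM exporter metrics are available (names and thresholds are illustrative):

# Alert when a GPU averages under 10% utilization for two hours,
# which usually indicates paid-for capacity that could be reclaimed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-idle-waste   # illustrative name
  namespace: monitoring
spec:
  groups:
  - name: gpu-cost
    rules:
    - alert: IdleGPU
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 10
      for: 2h
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is under 10% utilization"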

Right-Sizing Recommendations

# Vertical Pod Autoscaler for GPU workloads
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inference-vpa
  namespace: ml-serving
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  updatePolicy:
    updateMode: "Off"  # Recommendation only
  resourcePolicy:
    containerPolicies:
    - containerName: inference
      controlledResources: ["cpu", "memory"]
      # GPU resources managed separately
---
# Custom GPU rightsizing based on utilization
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gpu-rightsizing-report
spec:
  schedule: "0 8 * * 1"  # Weekly Monday 8 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: analyzer
            image: gpu-analyzer:v1
            command:
            - /bin/sh
            - -c
            - |
              # Query average GPU utilization over past week
              UTIL=$(curl -sG "http://prometheus:9090/api/v1/query" \
                --data-urlencode 'query=avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d])')

              # Generate rightsizing recommendations
              python3 /scripts/analyze.py --utilization "$UTIL" \
                --output-format markdown \
                --send-to slack:#ml-cost-alerts
          restartPolicy: Never  # Job pod templates must use Never or OnFailure

Kueue for Batch Job Scheduling

# ResourceFlavor for different GPU types
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot-a100
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
    cloud.google.com/gke-spot: "true"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: ondemand-a100
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
    cloud.google.com/gke-spot: "false"
---
# ClusterQueue with cost optimization
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-training
spec:
  cohort: ml-gpu  # borrowing limits only apply within a cohort (name illustrative)
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]  # the flavors below define GPU quota only
    flavors:
    - name: spot-a100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 32
        borrowingLimit: 16
    - name: ondemand-a100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 8
        borrowingLimit: 0
---
# LocalQueue that workloads submit to; it points at the ClusterQueue above
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-training
  namespace: ml-team-a
spec:
  clusterQueue: ml-training
---
# Training Job admitted through Kueue; spot quota is tried first,
# with on-demand capacity as the fallback flavor
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: ml-team-a
  labels:
    kueue.x-k8s.io/queue-name: ml-training
    # priority can be attached via the kueue.x-k8s.io/priority-class label (see below)
spec:
  suspend: true  # Kueue unsuspends the Job once quota is admitted
  parallelism: 4
  completions: 4
  template:
    spec:
      containers:
      - name: trainer
        image: training:v1
        resources:
          requests:
            nvidia.com/gpu: 4
      restartPolicy: OnFailure
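
Preemption in the ClusterQueue above is driven by workload priority. A minimal sketch of a Kueue WorkloadPriorityClass (name and value are illustrative); jobs reference it with the kueue.x-k8s.io/priority-class label:

# Higher-value classes preempt lower ones within the queue, so cheap
# exploratory runs yield GPUs to production training jobs.
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: production-training   # illustrative name
value: 10000
description: "Production training jobs; may preempt exploratory workloads"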

Cost Allocation Tags

# Labels for cost attribution
apiVersion: v1
kind: Namespace
metadata:
  name: ml-team-a
  labels:
    cost-center: "ml-research"
    team: "team-a"
    project: "llm-development"
---
# Pod with cost allocation
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  labels:
    cost-center: "ml-research"
    workload-type: "training"
    priority: "batch"
    spot-eligible: "true"
spec:
  containers:
  - name: training
    image: training:v1
    resources:
      limits:
        nvidia.com/gpu: 1
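
Cost labels attribute spend after the fact; a per-namespace quota caps it up front. A small sketch; the 8-GPU limit is illustrative:

# Hard cap on GPU requests in team-a's namespace; pods beyond the quota
# are rejected at admission instead of silently adding cost.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team-a
spec:
  hard:
    requests.nvidia.com/gpu: "8"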

Next lesson: CI/CD pipelines for ML model deployment.
