Production Operations & GitOps
Cost Optimization for GPU Workloads
3 min read
GPU costs often dominate ML infrastructure budgets. Effective cost optimization combines right-sizing, spot instances, GPU sharing, and intelligent scheduling; together these techniques can cut GPU spend by 40-70% while maintaining performance.
Cost Optimization Strategies
GPU Cost Optimization Layers

Layer 1: Right-Sizing
  - Match GPU type to workload requirements
  - Avoid over-provisioning memory/compute
  - Use profiling to identify actual needs

Layer 2: Spot/Preemptible Instances (50-90% savings)
  - Training jobs on spot instances
  - Checkpointing for fault tolerance
  - Fallback to on-demand when needed

Layer 3: GPU Sharing & Time-Slicing (see the device-plugin sketch below)
  - MIG for A100/H100 workload isolation
  - Time-slicing for smaller workloads
  - Multi-instance serving

Layer 4: Intelligent Scheduling
  - Bin packing optimization
  - Preemption policies
  - Queue-based admission (Kueue)
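Layer 3 depends on how GPUs are exposed to the scheduler. As a concrete illustration, the NVIDIA device plugin (or GPU Operator) can advertise each physical GPU as several schedulable replicas via time-slicing. The sketch below assumes the GPU Operator is installed; the namespace, ConfigMap name, config key, and replica count are illustrative.

# Time-slicing sketch: advertise each physical GPU as 4 schedulable replicas
# (assumes the NVIDIA GPU Operator / k8s-device-plugin is configured to read this ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

With this in place, four pods that each request nvidia.com/gpu: 1 can share one physical GPU. Time-slicing provides no memory isolation, which is what MIG adds on A100/H100.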
Spot Instance Configuration
# GKE spot GPU node pool (Config Connector ContainerNodePool)
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: gpu-spot-pool
spec:
  location: us-central1          # cluster location (placeholder)
  clusterRef:
    name: ml-cluster             # target GKE cluster (placeholder)
  nodeConfig:
    machineType: a2-highgpu-4g   # 4x A100 40GB
    guestAccelerator:
    - type: nvidia-tesla-a100
      count: 4
    spot: true
    taint:
    - key: cloud.google.com/gke-spot
      value: "true"
      effect: NO_SCHEDULE
---
# Training job with spot tolerance
apiVersion: batch/v1
kind: Job
metadata:
name: distributed-training
spec:
template:
spec:
tolerations:
- key: cloud.google.com/gke-spot
operator: Equal
value: "true"
effect: NoSchedule
containers:
- name: trainer
image: training:v1
resources:
limits:
nvidia.com/gpu: 4
env:
- name: CHECKPOINT_DIR
value: "s3://checkpoints/training-job"
- name: CHECKPOINT_INTERVAL
value: "300" # Save every 5 minutes
restartPolicy: OnFailure
---
# AWS Karpenter spot GPU provisioning (legacy v1alpha5 Provisioner API)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
name: gpu-spot
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["p4d.24xlarge", "p3.16xlarge", "g5.48xlarge"]
- key: nvidia.com/gpu.product
operator: In
values: ["NVIDIA-A100-SXM4-40GB", "NVIDIA-V100-SXM2-16GB"]
limits:
resources:
nvidia.com/gpu: 100
providerRef:
name: default
ttlSecondsAfterEmpty: 30
ttlSecondsUntilExpired: 86400
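Because the Provisioner allows both capacity types, individual pods can still state a preference for spot and fall back to on-demand when no spot capacity is available. A minimal fragment for the training Job's pod spec (.spec.template.spec), assuming Karpenter's standard karpenter.sh/capacity-type node label; the weight value is arbitrary.

# Prefer spot nodes, but allow scheduling onto on-demand nodes as a fallback
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]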
GPU Utilization Monitoring
# OpenCost for GPU cost tracking
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opencost
  namespace: opencost
spec:
  selector:
    matchLabels:
      app: opencost
  template:
    metadata:
      labels:
        app: opencost
    spec:
      containers:
      - name: opencost
        image: ghcr.io/opencost/opencost:1.110
        env:
        - name: CLUSTER_ID
          value: "ml-production"
        - name: PROMETHEUS_SERVER_ENDPOINT
          value: "http://prometheus:9090"
        - name: GPU_ENABLED
          value: "true"
        - name: CLOUD_PROVIDER_API_KEY
          valueFrom:
            secretKeyRef:
              name: cloud-provider-key
              key: api-key
---
# Grafana dashboard for GPU cost analysis
apiVersion: v1
kind: ConfigMap
metadata:
name: gpu-cost-dashboard
data:
dashboard.json: |
{
"panels": [
{
"title": "GPU Cost by Namespace",
"targets": [{
"expr": "sum(gpu_cost_total) by (namespace)"
}]
},
{
"title": "GPU Utilization vs Cost Efficiency",
"targets": [{
"expr": "avg(DCGM_FI_DEV_GPU_UTIL) by (pod) / sum(gpu_cost_hourly) by (pod)"
}]
},
{
"title": "Idle GPU Cost (Waste)",
"targets": [{
"expr": "sum(gpu_cost_hourly * (1 - DCGM_FI_DEV_GPU_UTIL/100))"
}]
}
]
}
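Dashboards surface waste after the fact; an alert closes the loop. A minimal sketch of a Prometheus alerting rule for chronically idle GPUs, assuming the Prometheus Operator and dcgm-exporter are deployed; the 10% threshold, durations, and label names are illustrative.

# Alert on GPUs that stay nearly idle for hours while still allocated
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-idle-waste
  namespace: monitoring
spec:
  groups:
  - name: gpu-cost
    rules:
    - alert: GPUIdleButAllocated
      # Average utilization over the last hour stays below 10%
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 10
      for: 2h
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} below 10% utilization for 2h"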
Right-Sizing Recommendations
# Vertical Pod Autoscaler for GPU workloads
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: inference-vpa
namespace: ml-serving
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: inference-service
updatePolicy:
updateMode: "Off" # Recommendation only
resourcePolicy:
containerPolicies:
- containerName: inference
controlledResources: ["cpu", "memory"]
# GPU resources managed separately
---
# Custom GPU right-sizing report based on observed utilization
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gpu-rightsizing-report
spec:
  schedule: "0 8 * * 1"   # Weekly, Monday 08:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: analyzer
            image: gpu-analyzer:v1
            command:
            - /bin/sh
            - -c
            - |
              # Query average GPU utilization over the past week
              # (-G sends the query as URL parameters; --data-urlencode handles the [7d] brackets)
              UTIL=$(curl -sG "http://prometheus:9090/api/v1/query" \
                --data-urlencode 'query=avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d])')
              # Generate right-sizing recommendations
              python3 /scripts/analyze.py --utilization "$UTIL" \
                --output-format markdown \
                --send-to slack:#ml-cost-alerts
          restartPolicy: Never
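Right-sizing also applies to the GPU itself: on MIG-enabled A100/H100 nodes, a small inference service can request a fraction of a device instead of a whole one. A minimal sketch, assuming the GPU Operator exposes MIG profiles as extended resources in mixed strategy; the deployment name, image, and profile are illustrative.

# Small inference workload pinned to a MIG slice instead of a full GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: small-inference
  template:
    metadata:
      labels:
        app: small-inference
    spec:
      containers:
      - name: inference
        image: inference:v1              # illustrative image
        resources:
          limits:
            nvidia.com/mig-1g.5gb: 1     # 1/7 of an A100 40GB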
Kueue for Batch Job Scheduling
# ResourceFlavor for different GPU types
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: spot-a100
spec:
nodeLabels:
cloud.google.com/gke-accelerator: nvidia-tesla-a100
cloud.google.com/gke-spot: "true"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: ondemand-a100
spec:
nodeLabels:
cloud.google.com/gke-accelerator: nvidia-tesla-a100
cloud.google.com/gke-spot: "false"
---
# ClusterQueue with cost optimization
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: ml-training
spec:
  cohort: ml                    # borrowing and reclaim operate within this cohort
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]   # each covered resource needs a quota in every flavor
flavors:
- name: spot-a100
resources:
- name: nvidia.com/gpu
nominalQuota: 32
borrowingLimit: 16
- name: ondemand-a100
resources:
- name: nvidia.com/gpu
nominalQuota: 8
borrowingLimit: 0
---
# Jobs are submitted through a LocalQueue; Kueue tries flavors in the order
# listed in the ClusterQueue, so spot is used first with on-demand as fallback
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-training
spec:
  clusterQueue: ml-training
---
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  labels:
    kueue.x-k8s.io/queue-name: ml-training
spec:
  suspend: true        # Kueue unsuspends the Job once it is admitted
  parallelism: 4
  completions: 4
  template:
    spec:
      containers:
      - name: trainer
        image: training:v1
        resources:
          requests:
            nvidia.com/gpu: 4
          limits:
            nvidia.com/gpu: 4
      restartPolicy: OnFailure
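The ClusterQueue above preempts lower-priority workloads within the queue, so priorities have to be defined somewhere. A minimal sketch of a Kueue WorkloadPriorityClass; the name and value are illustrative.

# Priority class that training jobs can reference via the
# kueue.x-k8s.io/priority-class label
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: production-training
value: 1000
description: "Production training jobs; may preempt lower-priority batch work"

A Job opts in by adding kueue.x-k8s.io/priority-class: production-training alongside its queue-name label.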
Cost Allocation Tags
# Labels for cost attribution
apiVersion: v1
kind: Namespace
metadata:
name: ml-team-a
labels:
cost-center: "ml-research"
team: "team-a"
project: "llm-development"
---
# Pod with cost allocation labels
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  labels:
    cost-center: "ml-research"
    workload-type: "training"
    priority: "batch"
    spot-eligible: "true"
spec:
  containers:
  - name: training
    image: training:v1
    resources:
      limits:
        nvidia.com/gpu: 1
Next lesson: CI/CD pipelines for ML model deployment.