Cost Optimization & Scaling
Scaling Patterns
Scaling LLM infrastructure requires understanding the unique characteristics of inference workloads: GPU-bound computation, memory-intensive KV caches, and variable latency requirements.
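To see why memory is the binding constraint, consider the KV cache: it grows linearly with context length and batch size. A minimal sketch of the arithmetic, using layer and head counts that are only illustrative of an 8B-class model with grouped-query attention:

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size: keys and values stored per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len * batch_size

# Illustrative 8B-class model: 32 layers, 8 KV heads, head dim 128, FP16
per_request_gb = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1) / 1e9
print(f"~{per_request_gb:.2f} GB of KV cache per 4K-token request")  # ~0.54 GB

At a batch of 32 such requests, the cache alone approaches the capacity of a 24 GB GPU, which is why batch size, context length, and GPU memory have to be planned together.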
Scaling Dimensions
┌─────────────────────────────────────────────────────────────┐
│                   LLM Scaling Dimensions                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Horizontal Scaling (More Replicas)                         │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐                   │  │
│  │  │ GPU │  │ GPU │  │ GPU │  │ GPU │  ...              │  │
│  │  │  1  │  │  2  │  │  3  │  │  4  │                   │  │
│  │  └─────┘  └─────┘  └─────┘  └─────┘                   │  │
│  │  Same model, parallel requests                        │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  Vertical Scaling (Bigger GPUs)                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  A10G (24GB) → A100 (80GB) → H100 (80GB) → B200       │  │
│  │  More memory, faster compute, larger batch sizes      │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  Model Parallelism (Split Model)                            │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  ┌──────────┐   ┌──────────┐   ┌──────────┐           │  │
│  │  │  Layers  │   │  Layers  │   │  Layers  │           │  │
│  │  │   1-12   │ → │  13-24   │ → │  25-36   │           │  │
│  │  │  GPU 1   │   │  GPU 2   │   │  GPU 3   │           │  │
│  │  └──────────┘   └──────────┘   └──────────┘           │  │
│  │  Pipeline parallelism for large models                │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
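These dimensions combine: a model that does not fit on one GPU is sharded first, and the resulting multi-GPU replica is then scaled horizontally. The heuristic below sketches that decision; the 30% memory reserve and the example numbers are illustrative assumptions, not measured values.

import math

def plan_scaling(model_mem_gb: float, gpu_mem_gb: float, peak_concurrency: int,
                 concurrent_per_replica: int = 10) -> dict:
    """Rough heuristic: shard the model only if it cannot fit on one GPU
    (keeping headroom for KV cache), then add replicas for request volume."""
    usable_gb = gpu_mem_gb * 0.7  # reserve ~30% for KV cache and activations
    gpus_per_replica = max(1, math.ceil(model_mem_gb / usable_gb))
    replicas = max(1, math.ceil(peak_concurrency / concurrent_per_replica))
    return {
        "gpus_per_replica": gpus_per_replica,   # tensor/pipeline parallel degree
        "replicas": replicas,                   # horizontal scaling
        "total_gpus": gpus_per_replica * replicas,
    }

# Illustrative: 70B weights in FP16 (~140 GB) on 80 GB GPUs, 200 concurrent requests
print(plan_scaling(model_mem_gb=140, gpu_mem_gb=80, peak_concurrency=200))
# {'gpus_per_replica': 3, 'replicas': 20, 'total_gpus': 60}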
Kubernetes HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # GPU utilization (primary)
    - type: External
      external:
        metric:
          name: dcgm_gpu_utilization
          selector:
            matchLabels:
              deployment: vllm
        target:
          type: AverageValue
          averageValue: "75"  # Scale up at 75% GPU util
    # Queue depth (secondary)
    - type: External
      external:
        metric:
          name: llm_pending_requests
        target:
          type: AverageValue
          averageValue: "10"  # Scale if >10 pending
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
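Both external metrics have to reach the HPA through a metrics pipeline, typically Prometheus plus an adapter such as prometheus-adapter or KEDA; dcgm_gpu_utilization normally comes from the NVIDIA DCGM exporter. The serving layer is responsible for publishing the queue-depth metric itself. A minimal sketch with prometheus_client, where the port and function names are assumptions:

from prometheus_client import Gauge, start_http_server

# Name must match the external metric referenced in the HPA above.
PENDING_REQUESTS = Gauge(
    "llm_pending_requests",
    "Requests waiting in the inference queue on this replica",
)

def report_queue_depth(queue) -> None:
    # Call from the serving loop whenever requests are enqueued or dequeued;
    # `queue` is any queue-like object with a qsize() method.
    PENDING_REQUESTS.set(queue.qsize())

if __name__ == "__main__":
    start_http_server(9400)  # expose /metrics for Prometheus to scrape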
Queue-Based Scaling
import math
from dataclasses import dataclass


@dataclass
class QueueMetrics:
    pending_requests: int
    processing_requests: int
    avg_wait_time_ms: float
    avg_process_time_ms: float


class AdaptiveScaler:
    def __init__(
        self,
        min_replicas: int = 2,
        max_replicas: int = 20,
        target_wait_time_ms: float = 1000,
        target_utilization: float = 0.75,
        concurrent_requests_per_replica: int = 10,
    ):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.target_wait_time_ms = target_wait_time_ms
        self.target_utilization = target_utilization
        self.concurrent_requests_per_replica = concurrent_requests_per_replica

    def calculate_desired_replicas(
        self,
        current_replicas: int,
        metrics: QueueMetrics,
    ) -> int:
        # Method 1: queue depth. Size the fleet so each replica runs at the
        # target utilization of its concurrent request slots.
        queue_based = metrics.pending_requests / (
            self.target_utilization * self.concurrent_requests_per_replica
        )

        # Method 2: wait time. Scale proportionally to how far observed
        # queueing delay exceeds the target.
        if metrics.avg_wait_time_ms > 0:
            wait_ratio = metrics.avg_wait_time_ms / self.target_wait_time_ms
            wait_based = current_replicas * wait_ratio
        else:
            wait_based = current_replicas

        # Take the larger signal, round up so we never under-provision,
        # then clamp to the configured bounds.
        desired = math.ceil(max(queue_based, wait_based))
        return max(self.min_replicas, min(self.max_replicas, desired))
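Wired into a control loop, the scaler is polled on a fixed interval. The sketch below passes the metrics source and the replica-update function in as callables, since both depend on your metrics stack and Kubernetes client; neither is part of the class above.

import asyncio
from typing import Awaitable, Callable

async def autoscale_loop(
    scaler: AdaptiveScaler,
    fetch_metrics: Callable[[], Awaitable[QueueMetrics]],   # e.g. query Prometheus
    apply_replicas: Callable[[int], Awaitable[None]],       # e.g. patch the Deployment
    interval_s: float = 30.0,
) -> None:
    current = scaler.min_replicas
    while True:
        metrics = await fetch_metrics()
        desired = scaler.calculate_desired_replicas(current, metrics)
        if desired != current:
            await apply_replicas(desired)
            current = desired
        await asyncio.sleep(interval_s)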
Multi-Region Deployment
┌─────────────────────────────────────────────────────────────┐
│                  Multi-Region Architecture                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│        US-East                        EU-West               │
│  ┌──────────────────┐          ┌──────────────────┐         │
│  │  ┌────────────┐  │          │  ┌────────────┐  │         │
│  │  │  LLM Pods  │  │          │  │  LLM Pods  │  │         │
│  │  │  (H100x8)  │  │          │  │  (H100x8)  │  │         │
│  │  └──────┬─────┘  │          │  └──────┬─────┘  │         │
│  │         │        │          │         │        │         │
│  │  ┌──────┴─────┐  │          │  ┌──────┴─────┐  │         │
│  │  │   Cache    │  │          │  │   Cache    │  │         │
│  │  │  (Redis)   │  │          │  │  (Redis)   │  │         │
│  │  └────────────┘  │          │  └────────────┘  │         │
│  └────────┬─────────┘          └────────┬─────────┘         │
│           │                             │                   │
│           └──────────────┬──────────────┘                   │
│                          │                                  │
│               ┌──────────┴──────────┐                       │
│               │     Global Load     │                       │
│               │      Balancer       │                       │
│               └─────────────────────┘                       │
│                                                             │
│  Latency-based routing: <50ms to nearest region             │
│  Failover: Automatic cross-region on outage                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
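Latency-based routing is normally the global load balancer's job, but the logic is easy to sketch client-side for testing: probe each region's health endpoint and route to the fastest healthy one. The region URLs and /health path below are illustrative.

import time
import urllib.request

REGIONS = {  # illustrative endpoints
    "us-east": "https://us-east.llm.example.com",
    "eu-west": "https://eu-west.llm.example.com",
}

def measure_latency_ms(base_url: str, timeout_s: float = 2.0) -> float:
    """Round-trip time to the region's health endpoint; inf if unreachable."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s):
            pass
    except OSError:
        return float("inf")
    return (time.monotonic() - start) * 1000

def pick_region() -> str:
    """Route to the lowest-latency healthy region; unhealthy regions are skipped."""
    latencies = {name: measure_latency_ms(url) for name, url in REGIONS.items()}
    best = min(latencies, key=latencies.get)
    if latencies[best] == float("inf"):
        raise RuntimeError("No healthy region available")
    return best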
Capacity Planning
from dataclasses import dataclass


@dataclass
class CapacityRequirements:
    peak_rps: float           # Requests per second at peak
    avg_input_tokens: int
    avg_output_tokens: int
    target_ttft_ms: float     # Time to first token
    target_tpot_ms: float     # Time per output token


def calculate_gpu_requirements(req: CapacityRequirements) -> dict:
    """Estimate GPU requirements for the given capacity."""
    # Empirical throughput (tokens/second) per GPU type and model size;
    # 0 means the model does not fit on that GPU.
    GPU_THROUGHPUT = {
        "A10G":      {"70B": 0,    "8B": 500,  "7B": 600},
        "A100-40GB": {"70B": 150,  "8B": 1500, "7B": 1800},
        "A100-80GB": {"70B": 300,  "8B": 2000, "7B": 2500},
        "H100":      {"70B": 600,  "8B": 4000, "7B": 5000},
        "B200":      {"70B": 1200, "8B": 8000, "7B": 10000},
    }

    total_tokens_per_request = req.avg_input_tokens + req.avg_output_tokens
    tokens_per_second = req.peak_rps * total_tokens_per_request

    recommendations = {}
    for gpu, models in GPU_THROUGHPUT.items():
        for model_size, throughput in models.items():
            if throughput == 0:
                continue  # Model does not fit on this GPU
            gpus_needed = tokens_per_second / throughput
            # Round up to whole GPUs (adds one spare when the estimate is exact)
            gpus_provisioned = int(gpus_needed) + 1
            recommendations[f"{gpu}_{model_size}"] = {
                "gpus_needed": gpus_provisioned,
                "headroom": 1.0 - tokens_per_second / (throughput * gpus_provisioned),
            }
    return recommendations


# Example: 100 RPS × 1,500 tokens/request ≈ 150,000 tokens/s at peak
req = CapacityRequirements(
    peak_rps=100,
    avg_input_tokens=1000,
    avg_output_tokens=500,
    target_ttft_ms=500,
    target_tpot_ms=30,
)
print(calculate_gpu_requirements(req))
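The estimate above sizes the fleet for 100% utilization of each GPU. To provision for a utilization target instead (for example 75%, leaving burst headroom), divide the per-GPU throughput by the target before rounding up, as in this small extension:

import math

def gpus_at_target_utilization(req: CapacityRequirements, throughput_per_gpu: float,
                               target_util: float = 0.75) -> int:
    """GPUs needed if each one should run at `target_util` of its peak token throughput."""
    tokens_per_second = req.peak_rps * (req.avg_input_tokens + req.avg_output_tokens)
    return math.ceil(tokens_per_second / (throughput_per_gpu * target_util))

# Example: the 70B model on H100s (600 tokens/s per GPU) at a 75% utilization target
print(gpus_at_target_utilization(req, throughput_per_gpu=600))  # 334 GPUs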
Scaling Best Practices
| Practice | Recommendation |
|---|---|
| Min replicas | Always ≥2 for high availability |
| Scale-up speed | Fast (1-2 min) to handle bursts |
| Scale-down speed | Slow (5-10 min) to avoid thrashing |
| GPU utilization target | 70-80% (leave headroom) |
| Queue depth alert | >20 pending for >2 minutes |
| Pre-warming | Load models before receiving traffic (sketch below) |
| Regional failover | <30 second detection and switch |
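Pre-warming, referenced in the table above, usually amounts to sending one small generation request before the pod reports ready, so weights are loaded and kernels are compiled off the request path. A sketch against an OpenAI-compatible endpoint such as vLLM's; the URL, model name, and timeout are illustrative:

import json
import time
import urllib.request

def warm_up(base_url: str = "http://localhost:8000", model: str = "llama-3-8b",
            timeout_s: float = 120.0) -> bool:
    """Send one tiny completion so the model is fully loaded before traffic arrives;
    wire this into a readinessProbe or postStart hook."""
    payload = json.dumps({"model": model, "prompt": "ping", "max_tokens": 1}).encode()
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(request, timeout=10):
                return True
        except OSError:
            time.sleep(2)  # Server still loading the model; retry
    return False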