Cost Optimization & Scaling

Scaling Patterns

Scaling LLM infrastructure requires understanding the unique characteristics of inference workloads: GPU-bound computation, memory-intensive KV caches, and variable latency requirements.
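
Memory is usually the first constraint you hit. As a rough sketch (the model dimensions below are for a Llama-3-8B-class model with grouped-query attention; substitute your own), the KV cache grows linearly with sequence length and batch size:

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: one K and one V tensor per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# Llama-3-8B-class model: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 weights
per_seq = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=1)
print(f"{per_seq / 2**30:.2f} GiB per 8K-token sequence")  # ~1 GiB

On a 24GB A10G, an 8B model in fp16 already takes roughly 16GB for weights, so only a handful of long sequences fit before the KV cache forces requests to queue.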

Scaling Dimensions

┌─────────────────────────────────────────────────────────────┐
│                  LLM Scaling Dimensions                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Horizontal Scaling (More Replicas)                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐                │   │
│  │  │ GPU │  │ GPU │  │ GPU │  │ GPU │  ...           │   │
│  │  │  1  │  │  2  │  │  3  │  │  4  │                │   │
│  │  └─────┘  └─────┘  └─────┘  └─────┘                │   │
│  │  Same model, parallel requests                      │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Vertical Scaling (Bigger GPUs)                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  A10G (24GB) → A100 (80GB) → H100 (80GB) → B200     │   │
│  │  More memory, faster compute, larger batch sizes    │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Model Parallelism (Split Model)                            │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │   │
│  │  │ Layers   │  │ Layers   │  │ Layers   │          │   │
│  │  │  1-12    │→ │  13-24   │→ │  25-36   │          │   │
│  │  │  GPU 1   │  │  GPU 2   │  │  GPU 3   │          │   │
│  │  └──────────┘  └──────────┘  └──────────┘          │   │
│  │  Pipeline parallelism for large models              │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
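
Horizontal and vertical scaling are handled by the orchestrator, but model parallelism is configured in the inference engine itself. A minimal sketch using vLLM's offline API (parameter names as in recent vLLM releases; verify against the version you run):

from vllm import LLM, SamplingParams

# Split a 70B model across 4 GPUs on one node with tensor parallelism.
# Pipeline parallelism (as drawn above) can additionally split groups of layers
# across nodes via pipeline_parallel_size in newer vLLM releases.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache
)

outputs = llm.generate(
    ["Explain pipeline parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)

Tensor parallelism splits each layer across GPUs on a single node; pipeline parallelism splits contiguous groups of layers and is typically reserved for models too large for one node.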

Kubernetes HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # GPU utilization (primary)
    - type: External
      external:
        metric:
          name: dcgm_gpu_utilization
          selector:
            matchLabels:
              deployment: vllm
        target:
          type: AverageValue
          averageValue: "75"  # Scale up at 75% GPU util

    # Queue depth (secondary)
    - type: External
      external:
        metric:
          name: llm_pending_requests
        target:
          type: AverageValue
          averageValue: "10"  # Scale if >10 pending

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
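
For reference, the HPA derives its replica count by comparing the observed metric to the target: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with a small tolerance band to prevent flapping. A quick sketch of that formula applied to the GPU-utilization target above:

import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, tolerance: float = 0.1) -> int:
    """Core HPA formula: desired = ceil(current * currentMetric / targetMetric).

    Changes within the tolerance band (10% by default) are ignored to avoid flapping.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 replicas averaging 90% GPU utilization against a 75% target -> scale to 5
print(hpa_desired_replicas(4, current_metric=90, target_metric=75))

The behavior stanza then rate-limits how fast that desired count is applied: at most 2 pods per minute when scaling up, 1 pod per 2 minutes when scaling down, with the longer stabilization window keeping scale-down conservative.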

Queue-Based Scaling

import math
from dataclasses import dataclass

@dataclass
class QueueMetrics:
    pending_requests: int
    processing_requests: int
    avg_wait_time_ms: float
    avg_process_time_ms: float

class AdaptiveScaler:
    def __init__(
        self,
        min_replicas: int = 2,
        max_replicas: int = 20,
        target_wait_time_ms: float = 1000,
        target_utilization: float = 0.75,
    ):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.target_wait_time_ms = target_wait_time_ms
        self.target_utilization = target_utilization

    def calculate_desired_replicas(
        self,
        current_replicas: int,
        metrics: QueueMetrics,
    ) -> int:
        # Method 1: Based on queue depth
        queue_based = (
            metrics.pending_requests /
            (self.target_utilization * 10)  # 10 concurrent per replica
        )

        # Method 2: Based on wait time
        if metrics.avg_wait_time_ms > 0:
            wait_ratio = metrics.avg_wait_time_ms / self.target_wait_time_ms
            wait_based = current_replicas * wait_ratio
        else:
            wait_based = current_replicas

        # Use maximum of both signals
        desired = max(queue_based, wait_based)

        # Round up, then clamp to the configured bounds
        desired = max(self.min_replicas, min(self.max_replicas, math.ceil(desired)))

        return desired
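
For example, with 60 requests queued against 4 replicas and wait times at twice the target, both signals agree on 8 replicas:

metrics = QueueMetrics(
    pending_requests=60,
    processing_requests=40,
    avg_wait_time_ms=2000,   # 2x the 1000 ms target
    avg_process_time_ms=800,
)

scaler = AdaptiveScaler(min_replicas=2, max_replicas=20)
print(scaler.calculate_desired_replicas(current_replicas=4, metrics=metrics))
# queue-based: 60 / (0.75 * 10) = 8; wait-based: 4 * 2.0 = 8 -> 8 replicas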

Multi-Region Deployment

┌─────────────────────────────────────────────────────────────┐
│                 Multi-Region Architecture                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│     US-East                    EU-West                      │
│  ┌──────────────────┐    ┌──────────────────┐              │
│  │  ┌────────────┐  │    │  ┌────────────┐  │              │
│  │  │  LLM Pods  │  │    │  │  LLM Pods  │  │              │
│  │  │  (H100x8)  │  │    │  │  (H100x8)  │  │              │
│  │  └──────┬─────┘  │    │  └──────┬─────┘  │              │
│  │         │        │    │         │        │              │
│  │  ┌──────┴─────┐  │    │  ┌──────┴─────┐  │              │
│  │  │   Cache    │  │    │  │   Cache    │  │              │
│  │  │  (Redis)   │  │    │  │  (Redis)   │  │              │
│  │  └────────────┘  │    │  └────────────┘  │              │
│  └────────┬─────────┘    └────────┬─────────┘              │
│           │                       │                         │
│           └───────────┬───────────┘                         │
│                       │                                     │
│            ┌──────────┴──────────┐                         │
│            │   Global Load       │                         │
│            │   Balancer          │                         │
│            └─────────────────────┘                         │
│                                                             │
│  Latency-based routing: <50ms to nearest region            │
│  Failover: Automatic cross-region on outage                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
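
The region-selection logic itself is simple; a minimal sketch follows (the health-check URLs are placeholders, and in practice this usually lives in the global load balancer as latency-based DNS or anycast rather than in application code):

import time
import urllib.request

# Hypothetical per-region health endpoints (placeholders).
REGIONS = {
    "us-east": "https://us-east.llm.example.com/healthz",
    "eu-west": "https://eu-west.llm.example.com/healthz",
}

def pick_region(timeout_s: float = 0.5) -> str | None:
    """Probe each region and pick the healthy one with the lowest latency."""
    latencies = {}
    for region, url in REGIONS.items():
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                if resp.status == 200:
                    latencies[region] = time.monotonic() - start
        except OSError:
            continue  # region unhealthy or unreachable -> candidate for failover
    return min(latencies, key=latencies.get) if latencies else None

print(pick_region())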

Capacity Planning

import math
from dataclasses import dataclass

@dataclass
class CapacityRequirements:
    peak_rps: float  # Requests per second
    avg_input_tokens: int
    avg_output_tokens: int
    target_ttft_ms: float
    target_tpot_ms: float  # Time per output token

def calculate_gpu_requirements(req: CapacityRequirements) -> dict:
    """Estimate GPU requirements for capacity."""

    # Empirical throughput (tokens/second) per GPU type
    GPU_THROUGHPUT = {
        "A10G": {"70B": 0, "8B": 500, "7B": 600},
        "A100-40GB": {"70B": 150, "8B": 1500, "7B": 1800},
        "A100-80GB": {"70B": 300, "8B": 2000, "7B": 2500},
        "H100": {"70B": 600, "8B": 4000, "7B": 5000},
        "B200": {"70B": 1200, "8B": 8000, "7B": 10000},
    }

    total_tokens_per_request = req.avg_input_tokens + req.avg_output_tokens
    tokens_per_second = req.peak_rps * total_tokens_per_request

    recommendations = {}
    for gpu, models in GPU_THROUGHPUT.items():
        for model_size, throughput in models.items():
            if throughput == 0:
                continue
            # Round up to whole GPUs; headroom is the unused fraction of that capacity
            gpus_needed = math.ceil(tokens_per_second / throughput)
            recommendations[f"{gpu}_{model_size}"] = {
                "gpus_needed": gpus_needed,
                "headroom": 1.0 - (tokens_per_second / (throughput * gpus_needed)),
            }

    return recommendations

# Example
req = CapacityRequirements(
    peak_rps=100,
    avg_input_tokens=1000,
    avg_output_tokens=500,
    target_ttft_ms=500,
    target_tpot_ms=30,
)

print(calculate_gpu_requirements(req))
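
With these inputs the fleet must sustain 100 × (1,000 + 500) = 150,000 tokens per second at peak. Against the illustrative throughput table above, that works out to 150,000 / 600 = 250 H100s for a 70B model, but only ceil(150,000 / 4,000) = 38 H100s for an 8B model, before any headroom or redundancy, which is why model size dominates the capacity bill.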

Scaling Best Practices

Practice                  Recommendation
Min replicas              Always ≥2 for high availability
Scale-up speed            Fast (1-2 min) to handle bursts
Scale-down speed          Slow (5-10 min) to avoid thrashing
GPU utilization target    70-80% (leave headroom)
Queue depth alert         >20 pending for >2 minutes
Pre-warming               Load models before receiving traffic
Regional failover         <30 second detection and switch
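
Pre-warming deserves particular care: if a pod reports ready before the model is loaded, the first requests pay the full load time. A minimal readiness-gate sketch (FastAPI is used purely as an illustration; the same pattern applies to any serving framework):

import threading
from fastapi import FastAPI, Response

app = FastAPI()
model_ready = threading.Event()

def load_model() -> None:
    # Placeholder for the real model load (weight download, CUDA init,
    # and a warm-up generation to compile kernels / populate caches).
    ...
    model_ready.set()

@app.on_event("startup")
def start_loading() -> None:
    # Load in the background so the liveness endpoint stays responsive.
    threading.Thread(target=load_model, daemon=True).start()

@app.get("/healthz/ready")
def ready() -> Response:
    # Kubernetes readinessProbe target: only admit traffic once the model is loaded.
    return Response(status_code=200 if model_ready.is_set() else 503)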