Cost Optimization & Scaling

LLM Cost Analysis


Understanding and managing LLM costs is critical for sustainable production deployments. This lesson covers cost structures, tracking methods, and analysis techniques.

LLM Cost Components

┌─────────────────────────────────────────────────────────────┐
│                   Total Cost of Ownership                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Direct API Costs                                           │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Input tokens  × Input price per 1M tokens          │   │
│  │  Output tokens × Output price per 1M tokens         │   │
│  │  + Embedding costs                                   │   │
│  │  + Image/audio costs (if applicable)                 │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Infrastructure Costs (Self-hosted)                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  GPU compute (per-hour or reserved)                  │   │
│  │  Memory (HBM for larger models)                      │   │
│  │  Storage (model weights, KV cache)                   │   │
│  │  Network egress                                      │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Operational Costs                                          │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Observability platform fees                         │   │
│  │  Gateway/proxy infrastructure                        │   │
│  │  Engineering time for maintenance                    │   │
│  │  Quality assurance and evaluation                    │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
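
The direct API cost lines in the diagram reduce to simple arithmetic. The following is a minimal sketch, assuming a hypothetical helper named `api_cost` and an illustrative request size; the prices passed in are placeholders, not a quote from any provider.

```python
def api_cost(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the dollar cost of one request given per-1M-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical request: 2,000 prompt tokens and 500 completion tokens,
# priced at $3.00 (input) and $15.00 (output) per 1M tokens.
cost = api_cost(2_000, 500, 3.00, 15.00)
print(f"${cost:.4f} per request")
```

Output tokens are often priced several times higher than input tokens, so response length can dominate the bill even when prompts are long.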

Model Pricing Comparison (January 2026)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context | Notes |
|---|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | 128K | General purpose |
| GPT-4o-mini | $0.15 | $0.60 | 128K | Cost-effective |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | Balanced |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K | Fast & cheap |
| Llama 3.1 405B | $3.00 | $3.00 | 128K | Open-source |
| Llama 3.1 70B | $0.90 | $0.90 | 128K | Best value |
| Mistral Large | $2.00 | $6.00 | 128K | European option |
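
To see how these list prices translate into a monthly bill, the sketch below ranks the models for one workload. The prices are copied from the table above; the function names (`monthly_cost`, `rank_models`) and the workload (10,000 requests per day, 1,000 input and 300 output tokens each) are assumptions chosen for illustration.

```python
# Prices per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "GPT-4o": (5.00, 15.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "Llama 3.1 405B": (3.00, 3.00),
    "Llama 3.1 70B": (0.90, 0.90),
    "Mistral Large": (2.00, 6.00),
}


def monthly_cost(model: str, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Projected monthly spend for a fixed per-request token profile."""
    in_price, out_price = PRICES[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests_per_day * days


def rank_models(requests_per_day: int, input_tokens: int, output_tokens: int):
    """Return (model, monthly cost) pairs sorted from cheapest to most expensive."""
    costs = {
        model: monthly_cost(model, requests_per_day, input_tokens, output_tokens)
        for model in PRICES
    }
    return sorted(costs.items(), key=lambda item: item[1])


# Hypothetical workload: 10,000 requests/day, 1,000 input + 300 output tokens each.
for model, cost in rank_models(10_000, 1_000, 300):
    print(f"{model:<20} ${cost:,.2f}/month")
```

For this workload the projection spans roughly $99 per month (GPT-4o-mini) to $2,850 per month (GPT-4o). Price alone does not settle the choice, since quality and latency differ between models.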

Self-Hosted Cost Comparison

| Model | GPU required | GPU cost per hour | Tokens per $ | Break-even |
|---|---|---|---|---|
| Llama 3.1 8B | 1× A10G | $1.00 | ~500K | 50K req/day |
| Llama 3.1 70B | 2× A100 | $6.00 | ~80K | 200K req/day |
| Llama 3.1 405B | 8× H100 | $40.00 | ~25K | 500K req/day |
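
A break-even point like the ones in the table can be estimated by dividing the daily GPU bill by the API cost of a single request. The sketch below is one way to do that, not the method used to produce the table. It assumes the GPUs run around the clock, that the hardware can actually serve the break-even volume, and that operational costs are ignored. The function name and the 1,000-tokens-per-request figure are hypothetical.

```python
def break_even_requests_per_day(gpu_cost_per_hour: float,
                                api_cost_per_request: float,
                                hours_per_day: float = 24.0) -> float:
    """Daily request volume at which self-hosting costs the same as the API."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    daily_gpu_cost = gpu_cost_per_hour * hours_per_day
    return daily_gpu_cost / api_cost_per_request


# Llama 3.1 70B row: $6.00/hour for GPUs, compared against a hosted API
# at $0.90 per 1M tokens with an assumed 1,000 tokens per request.
api_cost_per_request = 1_000 / 1_000_000 * 0.90
volume = break_even_requests_per_day(6.00, api_cost_per_request)
print(f"Break-even at about {volume:,.0f} requests/day")
```

Under these assumptions the break-even lands near 160,000 requests per day. Below that volume the API is cheaper; above it, self-hosting starts to pay off, provided throughput and engineering time hold up.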

Cost Tracking Implementation

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int
    model: str
    timestamp: datetime
    user_id: str
    request_type: str

class CostTracker:
    # Pricing per million tokens
    PRICING = {
        "gpt-4o": {"input": 5.0, "output": 15.0},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
        "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.0},
    }

    def __init__(self):
        self.usage_log = []

    def record_usage(self, usage: TokenUsage):
        self.usage_log.append(usage)

    def calculate_cost(self, usage: TokenUsage) -> float:
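        # Unknown models fall back to zero cost, which under-reports spend;
        # keep PRICING in sync with every model you call.
        pricing = self.PRICING.get(usage.model, {"input": 0, "output": 0})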
        input_cost = (usage.input_tokens / 1_000_000) * pricing["input"]
        output_cost = (usage.output_tokens / 1_000_000) * pricing["output"]
        return input_cost + output_cost

    def get_daily_report(self, date: datetime) -> dict:
        # Truncate to midnight, including microseconds, so the window is exact
        day_start = date.replace(hour=0, minute=0, second=0, microsecond=0)
        day_end = day_start + timedelta(days=1)

        day_usage = [
            u for u in self.usage_log
            if day_start <= u.timestamp < day_end
        ]

        report = {
            "date": date.isoformat(),
            "total_requests": len(day_usage),
            "total_input_tokens": sum(u.input_tokens for u in day_usage),
            "total_output_tokens": sum(u.output_tokens for u in day_usage),
            "total_cost": sum(self.calculate_cost(u) for u in day_usage),
            "by_model": {},
            "by_user": {},
            "by_type": {},
        }

        for u in day_usage:
            cost = self.calculate_cost(u)
            # Aggregate count and cost along each reporting dimension
            for dimension, key in (
                ("by_model", u.model),
                ("by_user", u.user_id),
                ("by_type", u.request_type),
            ):
                bucket = report[dimension].setdefault(key, {"count": 0, "cost": 0.0})
                bucket["count"] += 1
                bucket["cost"] += cost

        return report

Cost Allocation Strategies

┌─────────────────────────────────────────────────────────────┐
│                   Cost Allocation Methods                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Per-Team Budgets                                        │
│     ┌─────────────────────────────────────────────────┐    │
│     │  Engineering: $5,000/month                       │    │
│     │  Data Science: $10,000/month                     │    │
│     │  Customer Support: $2,000/month                  │    │
│     └─────────────────────────────────────────────────┘    │
│                                                             │
│  2. Per-User Quotas                                         │
│     ┌─────────────────────────────────────────────────┐    │
│     │  Free tier: 1,000 requests/day                   │    │
│     │  Pro tier: 10,000 requests/day                   │    │
│     │  Enterprise: Unlimited (budget-controlled)       │    │
│     └─────────────────────────────────────────────────┘    │
│                                                             │
│  3. Per-Feature Tracking                                    │
│     ┌─────────────────────────────────────────────────┐    │
│     │  Chat: 40% of spend                              │    │
│     │  Code completion: 35% of spend                   │    │
│     │  Summarization: 15% of spend                     │    │
│     │  Other: 10% of spend                             │    │
│     └─────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
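
Team budgets and per-user quotas can be enforced with a check that runs before each request is sent. The sketch below is a minimal in-memory version, assuming a hypothetical `BudgetGuard` class; the limits are copied from the diagram above, while the team keys, tier names, and the idea of passing an estimated cost are illustrative. Resetting counters at the end of each day or month is left out.

```python
from dataclasses import dataclass, field
from typing import Optional

# Monthly team budgets in USD, copied from the diagram above.
TEAM_BUDGETS_USD = {
    "engineering": 5_000,
    "data_science": 10_000,
    "customer_support": 2_000,
}

# Daily request quotas per tier; None means no request cap (budget-controlled).
TIER_DAILY_REQUESTS: dict[str, Optional[int]] = {
    "free": 1_000,
    "pro": 10_000,
    "enterprise": None,
}


@dataclass
class BudgetGuard:
    team_spend: dict = field(default_factory=dict)     # team -> USD spent this month
    user_requests: dict = field(default_factory=dict)  # user_id -> requests today

    def allow(self, team: str, user_id: str, tier: str, estimated_cost: float) -> bool:
        """Return True and record the request if it fits both budget and quota."""
        spent = self.team_spend.get(team, 0.0)
        if spent + estimated_cost > TEAM_BUDGETS_USD[team]:
            return False  # would exceed the team's monthly budget

        limit = TIER_DAILY_REQUESTS[tier]
        used = self.user_requests.get(user_id, 0)
        if limit is not None and used >= limit:
            return False  # user has exhausted the daily quota

        self.team_spend[team] = spent + estimated_cost
        self.user_requests[user_id] = used + 1
        return True


guard = BudgetGuard()
print(guard.allow("customer_support", "user-1", "free", estimated_cost=0.02))
```

In a multi-process deployment the counters would need to live in a shared store rather than in process memory; this sketch only shows the decision logic.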

Cost Anomaly Detection

import numpy as np

class CostAnomalyDetector:
    def __init__(self, window_size: int = 7):
        self.window_size = window_size
        self.daily_costs = []

    def add_daily_cost(self, cost: float):
        self.daily_costs.append(cost)

    def detect_anomaly(self, current_cost: float) -> dict:
        if len(self.daily_costs) < self.window_size:
            return {"is_anomaly": False, "reason": "Insufficient data"}

        recent = self.daily_costs[-self.window_size:]
        mean = np.mean(recent)
        std = np.std(recent)

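        # A z-score above 2 is treated as anomalous. If the window has zero
        # variance, any increase over the mean counts as a spike.
        if std > 0:
            z_score = (current_cost - mean) / std
        else:
            z_score = float("inf") if current_cost > mean else 0.0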

        if z_score > 2:
            return {
                "is_anomaly": True,
                "reason": "Cost spike detected",
                "z_score": z_score,
                "expected_range": (mean - 2*std, mean + 2*std),
                "actual": current_cost,
            }

        return {"is_anomaly": False}

# Usage
detector = CostAnomalyDetector()

# Historical data
for cost in [100, 105, 98, 102, 99, 103, 101]:
    detector.add_daily_cost(cost)

# Check today's cost
result = detector.detect_anomaly(250)  # Unusual spike
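if result["is_anomaly"]:
    # Swap print for your own alerting hook (pager, chat, email)
    print(f"ALERT: {result['reason']} (z-score {result['z_score']:.1f})")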

Key Cost Metrics to Monitor

| Metric | Formula | Target |
|---|---|---|
| Cost per request | Total cost / Total requests | Track trend |
| Cost per user | Total cost / Active users | By tier |
| Token efficiency | Output tokens / Input tokens | >0.5 |
| Cache savings | Cached tokens × Token price | >20% of total cost |
| Model cost mix | Spend per model / Total spend | Optimize |
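
The formulas in the table can be computed from a handful of aggregate totals. The sketch below assumes a hypothetical `cost_metrics` function and made-up totals; in practice the inputs would come from whatever usage log or billing export you already keep.

```python
def cost_metrics(total_cost: float, total_requests: int, active_users: int,
                 input_tokens: int, output_tokens: int,
                 cached_tokens: int, token_price_per_m: float,
                 spend_by_model: dict) -> dict:
    """Compute the monitoring metrics from the table for one reporting period."""
    return {
        "cost_per_request": total_cost / total_requests if total_requests else 0.0,
        "cost_per_user": total_cost / active_users if active_users else 0.0,
        "token_efficiency": output_tokens / input_tokens if input_tokens else 0.0,
        # Dollars not spent because these tokens were served from cache
        "cache_savings": cached_tokens / 1_000_000 * token_price_per_m,
        # Share of total spend attributed to each model
        "model_cost_mix": {
            model: (spend / total_cost if total_cost else 0.0)
            for model, spend in spend_by_model.items()
        },
    }


# Illustrative totals for a single day (not real data).
metrics = cost_metrics(
    total_cost=100.0, total_requests=1_000, active_users=50,
    input_tokens=2_000_000, output_tokens=1_000_000,
    cached_tokens=500_000, token_price_per_m=3.00,
    spend_by_model={"gpt-4o": 75.0, "gpt-4o-mini": 25.0},
)
for name, value in metrics.items():
    print(name, value)
```

Tracking these per day makes the trend visible, which matters more than any single day's value.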