Cost Optimization & Scaling

LLM Cost Analysis

Understanding and managing LLM costs is critical for sustainable production deployments. This lesson covers cost structures, tracking methods, and analysis techniques.

LLM Cost Components

┌─────────────────────────────────────────────────────────────┐
│                   Total Cost of Ownership                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Direct API Costs                                           │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Input tokens  × Input price per 1M tokens          │   │
│  │  Output tokens × Output price per 1M tokens         │   │
│  │  + Embedding costs                                   │   │
│  │  + Image/audio costs (if applicable)                 │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Infrastructure Costs (Self-hosted)                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  GPU compute (per-hour or reserved)                  │   │
│  │  Memory (HBM for larger models)                      │   │
│  │  Storage (model weights, KV cache)                   │   │
│  │  Network egress                                      │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Operational Costs                                          │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Observability platform fees                         │   │
│  │  Gateway/proxy infrastructure                        │   │
│  │  Engineering time for maintenance                    │   │
│  │  Quality assurance and evaluation                    │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
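
As a minimal sketch of the direct API cost formula in the diagram, the function below multiplies token counts by per-1M-token prices. The function name, the example token counts, and the prices are illustrative assumptions, not values from any provider.

```python
def direct_api_cost(input_tokens: int, output_tokens: int,
                    input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one request from token counts and per-1M-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical request: 2,000 input tokens and 500 output tokens,
# priced at an illustrative $3.00 (input) and $15.00 (output) per 1M tokens.
cost = direct_api_cost(2_000, 500, 3.00, 15.00)
print(f"${cost:.4f} per request")
```

Output tokens are usually priced higher than input tokens, so in this example the 500 output tokens cost more than the 2,000 input tokens.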

Model Pricing Comparison (April 2026)

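| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context window | Notes |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | 400K | General purpose |
| GPT-5.4 Mini | $0.75 | $4.50 | 128K | Cost-effective |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Balanced |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | Fast & cheap |
| Llama 3.3 70B | $0.59 | $0.79 | 128K | Best value |
| Mistral Large | $2.00 | $6.00 | 128K | European option |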

⚠ Prices change frequently. The values above are for illustration only and may be out of date. Always verify current pricing directly with the provider before making cost decisions: Anthropic · OpenAI · Google Gemini · Google Vertex AI · AWS Bedrock · Azure OpenAI · Mistral · Cohere · Together AI · DeepSeek · Groq · Fireworks AI · Perplexity · xAI · Cursor · GitHub Copilot · Windsurf.
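
To see how these per-token prices translate into a monthly bill, the sketch below applies the illustrative prices from the table to a hypothetical workload. The workload figures (10,000 requests per day, 1,500 input and 400 output tokens per request, 30 days) and the `monthly_cost` helper are assumptions chosen only for the example.

```python
# Illustrative prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.4 Mini": (0.75, 4.50),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Llama 3.3 70B": (0.59, 0.79),
    "Mistral Large": (2.00, 6.00),
}


def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 prices: tuple, days: int = 30) -> float:
    """Estimate monthly spend for a workload with a fixed token profile per request."""
    input_price, output_price = prices
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical workload: 10,000 requests/day, 1,500 input + 400 output tokens each.
estimates = {name: monthly_cost(10_000, 1_500, 400, p) for name, p in PRICES.items()}

# Print cheapest first.
for name, cost in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} ${cost:>10,.2f}/month")
```

The ranking depends on the input/output ratio of the workload, so it is worth rerunning the estimate with token counts measured from real traffic.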

Self-Hosted Cost Comparison

| Model | GPU required | GPU cost per hour | Tokens per $ | Break-even |
|---|---|---|---|---|
| Llama 3.1 8B | 1× A10G | $1.00 | ~500K | 50K req/day |
| Llama 3.1 70B | 2× A100 | $6.00 | ~80K | 200K req/day |
| Llama 3.1 405B | 8× H100 | $40.00 | ~25K | 500K req/day |

⚠ GPU and instance rental rates change frequently. The per-hour/GPU rates above are for illustration only and vary by region, commitment term, and spot availability. Always verify current pricing directly with the provider before committing compute budget: AWS EC2 (GPU) · Google Cloud GPU · Azure N-series · CoreWeave · Lambda · RunPod · Modal · Replicate · Anyscale · Together AI · Fireworks AI · Hugging Face Inference.
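
A break-even figure depends on workload assumptions that the table does not state, such as tokens per request and the API price being compared against. The sketch below shows one simple way to estimate it: divide the fixed daily GPU bill by the API cost of a single request. The function name and the example numbers are assumptions, and the calculation ignores whether the hardware can actually serve that volume, as well as engineering and operational costs.

```python
def breakeven_requests_per_day(gpu_cost_per_hour: float,
                               api_cost_per_request: float,
                               hours_per_day: float = 24.0) -> float:
    """Daily request volume at which a fixed GPU bill equals the equivalent API bill."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    daily_gpu_cost = gpu_cost_per_hour * hours_per_day
    return daily_gpu_cost / api_cost_per_request


# Hypothetical comparison: GPUs rented for $6.00/hour around the clock,
# versus an API where an average request costs $0.002.
volume = breakeven_requests_per_day(6.00, 0.002)
print(f"Break-even at roughly {volume:,.0f} requests/day")
```

Below that volume the API is cheaper under these assumptions; above it, self-hosting starts to pay off only if the GPUs can sustain the load.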

Cost Tracking Implementation

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int
    model: str
    timestamp: datetime
    user_id: str
    request_type: str

class CostTracker:
    # Pricing per million tokens (April 2026)
    PRICING = {
        "gpt-5.4": {"input": 2.50, "output": 15.0},
        "gpt-5.4-mini": {"input": 0.75, "output": 4.50},
        "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
        "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0},
    }

    def __init__(self):
        self.usage_log = []

    def record_usage(self, usage: TokenUsage):
        self.usage_log.append(usage)

    def calculate_cost(self, usage: TokenUsage) -> float:
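        # Models missing from PRICING fall back to zero cost, which silently
        # under-reports spend; keep PRICING in sync with the models you call.
        pricing = self.PRICING.get(usage.model, {"input": 0, "output": 0})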
        input_cost = (usage.input_tokens / 1_000_000) * pricing["input"]
        output_cost = (usage.output_tokens / 1_000_000) * pricing["output"]
        return input_cost + output_cost

    def get_daily_report(self, date: datetime) -> dict:
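        # Zero the microseconds too, so usage logged in the first second of the day is included.
        day_start = date.replace(hour=0, minute=0, second=0, microsecond=0)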
        day_end = day_start + timedelta(days=1)

        day_usage = [
            u for u in self.usage_log
            if day_start <= u.timestamp < day_end
        ]

        report = {
            "date": date.isoformat(),
            "total_requests": len(day_usage),
            "total_input_tokens": sum(u.input_tokens for u in day_usage),
            "total_output_tokens": sum(u.output_tokens for u in day_usage),
            "total_cost": sum(self.calculate_cost(u) for u in day_usage),
            "by_model": {},
            "by_user": {},
            "by_type": {},
        }

        for u in day_usage:
            cost = self.calculate_cost(u)  # compute once per record

            # Aggregate request count and cost by model, user, and request type.
            for bucket, key in (
                ("by_model", u.model),
                ("by_user", u.user_id),
                ("by_type", u.request_type),
            ):
                entry = report[bucket].setdefault(key, {"count": 0, "cost": 0.0})
                entry["count"] += 1
                entry["cost"] += cost

        return report

Cost Allocation Strategies

┌─────────────────────────────────────────────────────────────┐
│                   Cost Allocation Methods                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Per-Team Budgets                                        │
│     ┌─────────────────────────────────────────────────┐    │
│     │  Engineering: $5,000/month                       │    │
│     │  Data Science: $10,000/month                     │    │
│     │  Customer Support: $2,000/month                  │    │
│     └─────────────────────────────────────────────────┘    │
│                                                             │
│  2. Per-User Quotas                                         │
│     ┌─────────────────────────────────────────────────┐    │
│     │  Free tier: 1,000 requests/day                   │    │
│     │  Pro tier: 10,000 requests/day                   │    │
│     │  Enterprise: Unlimited (budget-controlled)       │    │
│     └─────────────────────────────────────────────────┘    │
│                                                             │
│  3. Per-Feature Tracking                                    │
│     ┌─────────────────────────────────────────────────┐    │
│     │  Chat: 40% of spend                              │    │
│     │  Code completion: 35% of spend                   │    │
│     │  Summarization: 15% of spend                     │    │
│     │  Other: 10% of spend                             │    │
│     └─────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
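
As one possible way to enforce per-team budgets, the sketch below compares month-to-date spend against the example budgets from the diagram and labels each team. The `BudgetStatus` class, the `check_budgets` helper, the 80% warning threshold, and the spend figures are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Monthly budgets in USD, taken from the per-team example in the diagram above.
MONTHLY_BUDGETS = {
    "engineering": 5_000.0,
    "data_science": 10_000.0,
    "customer_support": 2_000.0,
}

WARNING_THRESHOLD = 0.8  # assumed: warn once 80% of the budget is used


@dataclass
class BudgetStatus:
    team: str
    spent: float
    budget: float

    @property
    def used_fraction(self) -> float:
        return self.spent / self.budget

    @property
    def level(self) -> str:
        if self.used_fraction >= 1.0:
            return "over_budget"
        if self.used_fraction >= WARNING_THRESHOLD:
            return "warning"
        return "ok"


def check_budgets(month_to_date_spend: dict) -> list:
    """Return one BudgetStatus per budgeted team; teams with no recorded spend count as 0."""
    return [
        BudgetStatus(team, month_to_date_spend.get(team, 0.0), budget)
        for team, budget in MONTHLY_BUDGETS.items()
    ]


# Hypothetical month-to-date spend per team.
statuses = check_budgets(
    {"engineering": 4_200.0, "data_science": 3_100.0, "customer_support": 2_050.0}
)
for s in statuses:
    print(f"{s.team:<18} {s.used_fraction:6.1%}  {s.level}")
```

The same pattern extends to per-user quotas by swapping dollar budgets for request counts per day.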

Cost Anomaly Detection

import numpy as np

class CostAnomalyDetector:
    def __init__(self, window_size: int = 7):
        self.window_size = window_size
        self.daily_costs = []

    def add_daily_cost(self, cost: float):
        self.daily_costs.append(cost)

    def detect_anomaly(self, current_cost: float) -> dict:
        if len(self.daily_costs) < self.window_size:
            return {"is_anomaly": False, "reason": "Insufficient data"}

        recent = self.daily_costs[-self.window_size:]
        mean = np.mean(recent)
        std = np.std(recent)

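        # Z-score > 2 is treated as anomalous. If the window has zero variance,
        # any cost above the mean counts as a spike (an assumption of this sketch);
        # otherwise a jump after perfectly flat spend would never be flagged.
        if std > 0:
            z_score = (current_cost - mean) / std
        else:
            z_score = float("inf") if current_cost > mean else 0.0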

        if z_score > 2:
            return {
                "is_anomaly": True,
                "reason": "Cost spike detected",
                "z_score": z_score,
                "expected_range": (mean - 2*std, mean + 2*std),
                "actual": current_cost,
            }

        return {"is_anomaly": False}

# Usage
detector = CostAnomalyDetector()

# Historical data
for cost in [100, 105, 98, 102, 99, 103, 101]:
    detector.add_daily_cost(cost)

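# Check today's cost
result = detector.detect_anomaly(250)  # Unusual spike
if result["is_anomaly"]:
    # Swap print for your own alerting hook (for example Slack, PagerDuty, or email).
    print(f"Cost anomaly detected: {result}")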

Key Cost Metrics to Monitor

| Metric | Formula | Target |
|---|---|---|
| Cost per request | Total cost / Total requests | Track trend |
| Cost per user | Total cost / Active users | By tier |
| Token efficiency | Output tokens / Input tokens | >0.5 |
| Cache savings | Cached tokens × Token price | >20% |
| Model cost mix | Spend per model / Total | Optimize |
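
The formulas in the table can be computed directly from aggregate usage figures. The sketch below is one way to do so; the `cost_metrics` function, its parameter names, and every input number are hypothetical. Cache savings follows the table's formula and therefore assumes cached tokens would otherwise have been billed at the full input price.

```python
def cost_metrics(total_cost: float, total_requests: int, active_users: int,
                 input_tokens: int, output_tokens: int,
                 cached_tokens: int, input_price_per_m: float,
                 spend_by_model: dict) -> dict:
    """Compute the table's cost metrics from aggregate figures for one period."""
    return {
        "cost_per_request": total_cost / total_requests,
        "cost_per_user": total_cost / active_users,
        "token_efficiency": output_tokens / input_tokens,
        # Cached tokens × token price, assuming the full input price is avoided.
        "cache_savings": cached_tokens / 1_000_000 * input_price_per_m,
        # Share of total spend attributed to each model.
        "model_cost_mix": {m: c / total_cost for m, c in spend_by_model.items()},
    }


# Hypothetical daily aggregates.
metrics = cost_metrics(
    total_cost=120.0,
    total_requests=40_000,
    active_users=800,
    input_tokens=60_000_000,
    output_tokens=18_000_000,
    cached_tokens=10_000_000,
    input_price_per_m=2.50,
    spend_by_model={"gpt-5.4": 90.0, "gpt-5.4-mini": 30.0},
)
for name, value in metrics.items():
    print(name, value)
```

Tracking these values per day makes trends visible, which matters more than any single reading.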