Cost Optimization & Scaling
LLM Cost Analysis
3 min read
Understanding and managing LLM costs is critical for sustainable production deployments. This lesson covers cost structures, tracking methods, and analysis techniques.
LLM Cost Components
┌─────────────────────────────────────────────────────────────┐
│ Total Cost of Ownership │
├─────────────────────────────────────────────────────────────┤
│ │
│ Direct API Costs │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Input tokens × Input price per 1M tokens │ │
│ │ Output tokens × Output price per 1M tokens │ │
│ │ + Embedding costs │ │
│ │ + Image/audio costs (if applicable) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Infrastructure Costs (Self-hosted) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ GPU compute (per-hour or reserved) │ │
│ │ Memory (HBM for larger models) │ │
│ │ Storage (model weights, KV cache) │ │
│ │ Network egress │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Operational Costs │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Observability platform fees │ │
│ │ Gateway/proxy infrastructure │ │
│ │ Engineering time for maintenance │ │
│ │ Quality assurance and evaluation │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
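Before diving into provider pricing, it helps to put rough numbers on these buckets. The sketch below is a back-of-the-envelope monthly estimate; the request volume, token counts, and operational overhead are illustrative assumptions, and the per-token prices are taken from the comparison table that follows.

# Back-of-the-envelope monthly TCO estimate (all figures illustrative)

def monthly_api_cost(requests_per_day: int,
                     avg_input_tokens: int,
                     avg_output_tokens: int,
                     input_price_per_1m: float,
                     output_price_per_1m: float) -> float:
    """Direct API spend for a 30-day month."""
    monthly_requests = requests_per_day * 30
    input_cost = monthly_requests * avg_input_tokens / 1_000_000 * input_price_per_1m
    output_cost = monthly_requests * avg_output_tokens / 1_000_000 * output_price_per_1m
    return input_cost + output_cost

# Example: 20K requests/day, ~1,500 input / 400 output tokens, GPT-4o-mini pricing
api = monthly_api_cost(20_000, 1_500, 400, input_price_per_1m=0.15, output_price_per_1m=0.60)
operational = 500.0  # assumed monthly observability + gateway fees
print(f"Direct API:  ${api:,.2f}/month")
print(f"Operational: ${operational:,.2f}/month")
print(f"Total:       ${api + operational:,.2f}/month")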
Model Pricing Comparison (January 2026)
| Model | Input ($ per 1M tokens) | Output ($ per 1M tokens) | Context window | Notes |
|---|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | 128K | General purpose |
| GPT-4o-mini | $0.15 | $0.60 | 128K | Cost-effective |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | Balanced |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K | Fast & cheap |
| Llama 3.1 405B | $3.00 | $3.00 | 128K | Open-source |
| Llama 3.1 70B | $0.90 | $0.90 | 128K | Best value |
| Mistral Large | $2.00 | $6.00 | 128K | European option |
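Per-token prices only become meaningful once multiplied by a realistic request shape. A minimal sketch, using the prices above and an assumed 2,000 input / 500 output tokens per request, compares cost per 1,000 requests across models:

# Cost per 1,000 requests for a representative request shape
# (2,000 input tokens, 500 output tokens), using the per-1M prices above
PRICES = {  # model: (input $/1M, output $/1M)
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3.5-haiku": (0.80, 4.00),
    "llama-3.1-70b": (0.90, 0.90),
}

for model, (input_price, output_price) in PRICES.items():
    cost_per_1k = 1_000 * (2_000 * input_price + 500 * output_price) / 1_000_000
    print(f"{model:20s} ${cost_per_1k:6.2f} per 1K requests")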
Self-Hosted Cost Comparison
| Model | GPUs required | GPU cost/hour | Tokens per $1 | Break-even vs. API |
|---|---|---|---|---|
| Llama 3.1 8B | 1× A10G | $1.00 | ~500K | 50K req/day |
| Llama 3.1 70B | 2× A100 | $6.00 | ~80K | 200K req/day |
| Llama 3.1 405B | 8× H100 | $40.00 | ~25K | 500K req/day |
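The break-even figures above depend heavily on utilization: a GPU billed by the hour costs the same whether it is saturated or idle. A rough sketch, assuming the table's GPU rate for Llama 3.1 70B and a hypothetical sustained throughput, compares the effective self-hosted price against a hosted-API price per 1M tokens:

def self_hosted_cost_per_1m(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Effective $ per 1M tokens when the GPUs sustain the given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Llama 3.1 70B on 2× A100 at $6.00/hour, assuming ~1,200 tokens/s sustained
self_hosted = self_hosted_cost_per_1m(6.00, 1_200)
api_price = 0.90  # hosted Llama 3.1 70B $/1M from the pricing table
print(f"Self-hosted: ${self_hosted:.2f} per 1M tokens")
print(f"Hosted API:  ${api_price:.2f} per 1M tokens")

At lower utilization the self-hosted rate rises proportionally, which is why the break-even column is expressed as a minimum daily request volume.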
Cost Tracking Implementation
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int
    model: str
    timestamp: datetime
    user_id: str
    request_type: str


class CostTracker:
    # Pricing per million tokens (USD)
    PRICING = {
        "gpt-4o": {"input": 5.0, "output": 15.0},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
        "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.0},
    }

    def __init__(self):
        self.usage_log = []

    def record_usage(self, usage: TokenUsage):
        self.usage_log.append(usage)

    def calculate_cost(self, usage: TokenUsage) -> float:
        pricing = self.PRICING.get(usage.model, {"input": 0, "output": 0})
        input_cost = (usage.input_tokens / 1_000_000) * pricing["input"]
        output_cost = (usage.output_tokens / 1_000_000) * pricing["output"]
        return input_cost + output_cost

    def get_daily_report(self, date: datetime) -> dict:
        day_start = date.replace(hour=0, minute=0, second=0, microsecond=0)
        day_end = day_start + timedelta(days=1)
        day_usage = [
            u for u in self.usage_log
            if day_start <= u.timestamp < day_end
        ]
        report = {
            "date": date.isoformat(),
            "total_requests": len(day_usage),
            "total_input_tokens": sum(u.input_tokens for u in day_usage),
            "total_output_tokens": sum(u.output_tokens for u in day_usage),
            "total_cost": sum(self.calculate_cost(u) for u in day_usage),
            "by_model": {},
            "by_user": {},
            "by_type": {},
        }
        for u in day_usage:
            cost = self.calculate_cost(u)
            # Aggregate by model
            if u.model not in report["by_model"]:
                report["by_model"][u.model] = {"count": 0, "cost": 0.0}
            report["by_model"][u.model]["count"] += 1
            report["by_model"][u.model]["cost"] += cost
            # Aggregate by user
            if u.user_id not in report["by_user"]:
                report["by_user"][u.user_id] = {"count": 0, "cost": 0.0}
            report["by_user"][u.user_id]["count"] += 1
            report["by_user"][u.user_id]["cost"] += cost
            # Aggregate by request type
            if u.request_type not in report["by_type"]:
                report["by_type"][u.request_type] = {"count": 0, "cost": 0.0}
            report["by_type"][u.request_type]["count"] += 1
            report["by_type"][u.request_type]["cost"] += cost
        return report
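A short usage example for the tracker; the token counts and IDs are placeholders you would normally populate from the provider's usage metadata:

# Example: record one request and pull the daily rollup
tracker = CostTracker()

tracker.record_usage(TokenUsage(
    input_tokens=1_850,   # placeholder: read from the provider's usage field
    output_tokens=420,
    model="gpt-4o-mini",
    timestamp=datetime.now(),
    user_id="user-123",
    request_type="chat",
))

report = tracker.get_daily_report(datetime.now())
print(f"Requests: {report['total_requests']}, cost: ${report['total_cost']:.4f}")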
Cost Allocation Strategies
┌─────────────────────────────────────────────────────────────┐
│ Cost Allocation Methods │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. Per-Team Budgets │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Engineering: $5,000/month │ │
│ │ Data Science: $10,000/month │ │
│ │ Customer Support: $2,000/month │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ 2. Per-User Quotas │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Free tier: 1,000 requests/day │ │
│ │ Pro tier: 10,000 requests/day │ │
│ │ Enterprise: Unlimited (budget-controlled) │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ 3. Per-Feature Tracking │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Chat: 40% of spend │ │
│ │ Code completion: 35% of spend │ │
│ │ Summarization: 15% of spend │ │
│ │ Other: 10% of spend │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
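Whichever allocation method you choose, enforcement usually happens in the gateway before the model call. The sketch below is one possible shape for a guard that applies the per-team budgets and per-user quotas from the diagram above; the class name, methods, and limits are illustrative, not a specific library API.

from collections import defaultdict

class BudgetGuard:
    """Gateway-side check against per-team budgets and per-user quotas."""

    def __init__(self, team_budgets: dict, user_quotas: dict):
        self.team_budgets = team_budgets        # team -> monthly budget ($)
        self.user_quotas = user_quotas          # tier -> requests per day
        self.team_spend = defaultdict(float)    # running spend this month
        self.user_requests = defaultdict(int)   # running request count today

    def allow(self, team: str, user_id: str, tier: str) -> bool:
        if self.team_spend[team] >= self.team_budgets.get(team, 0.0):
            return False
        if self.user_requests[user_id] >= self.user_quotas.get(tier, 0):
            return False
        return True

    def record(self, team: str, user_id: str, cost: float):
        self.team_spend[team] += cost
        self.user_requests[user_id] += 1

guard = BudgetGuard(
    team_budgets={"engineering": 5_000, "data_science": 10_000, "support": 2_000},
    user_quotas={"free": 1_000, "pro": 10_000},
)
if guard.allow("engineering", "user-123", tier="pro"):
    guard.record("engineering", "user-123", cost=0.0042)

Resetting the counters (monthly for budgets, daily for quotas) would be handled by a scheduled job.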
Cost Anomaly Detection
import numpy as np


class CostAnomalyDetector:
    def __init__(self, window_size: int = 7):
        self.window_size = window_size
        self.daily_costs = []

    def add_daily_cost(self, cost: float):
        self.daily_costs.append(cost)

    def detect_anomaly(self, current_cost: float) -> dict:
        if len(self.daily_costs) < self.window_size:
            return {"is_anomaly": False, "reason": "Insufficient data"}
        recent = self.daily_costs[-self.window_size:]
        mean = np.mean(recent)
        std = np.std(recent)
        # A z-score above 2 flags the day as anomalous
        z_score = (current_cost - mean) / std if std > 0 else 0
        if z_score > 2:
            return {
                "is_anomaly": True,
                "reason": "Cost spike detected",
                "z_score": z_score,
                "expected_range": (mean - 2 * std, mean + 2 * std),
                "actual": current_cost,
            }
        return {"is_anomaly": False}


# Usage
detector = CostAnomalyDetector()

# Historical data: a stable week of daily spend
for cost in [100, 105, 98, 102, 99, 103, 101]:
    detector.add_daily_cost(cost)

# Check today's cost
result = detector.detect_anomaly(250)  # unusual spike
if result["is_anomaly"]:
    print(f"ALERT: {result['reason']} (z={result['z_score']:.1f})")  # or call your alerting hook
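In production the detector is typically fed by a daily job that pulls the total from the cost tracker; one possible wiring, assuming the tracker and detector instances from the earlier examples:

# Daily job: compare today's spend against the trailing window, then record it
report = tracker.get_daily_report(datetime.now())
result = detector.detect_anomaly(report["total_cost"])
detector.add_daily_cost(report["total_cost"])
if result["is_anomaly"]:
    print(f"Cost anomaly: {result}")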
Key Cost Metrics to Monitor
| Metric | Formula | Target |
|---|---|---|
| Cost per request | Total cost / Total requests | Track trend |
| Cost per user | Total cost / Active users | By tier |
| Token efficiency | Output tokens / Input tokens | >0.5 |
| Cache savings | Cached input tokens × cache discount | >20% of input spend |
| Model cost mix | Spend per model / Total | Optimize |
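Most of these can be derived directly from the CostTracker daily report. A minimal sketch; the cached-token count and per-token cache discount are assumptions about what your gateway or provider reports:

def cost_metrics(report: dict, active_users: int,
                 cached_input_tokens: int = 0,
                 cache_discount_per_1m: float = 0.0) -> dict:
    """Derive headline cost metrics from a CostTracker daily report."""
    requests = max(report["total_requests"], 1)
    input_tokens = max(report["total_input_tokens"], 1)
    total_cost = report["total_cost"]
    return {
        "cost_per_request": total_cost / requests,
        "cost_per_user": total_cost / max(active_users, 1),
        "token_efficiency": report["total_output_tokens"] / input_tokens,
        "cache_savings": cached_input_tokens / 1_000_000 * cache_discount_per_1m,
        "model_cost_mix": {
            model: stats["cost"] / total_cost if total_cost else 0.0
            for model, stats in report["by_model"].items()
        },
    }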