Cost Optimization & Scaling
Cost Optimization Strategies
4 min read
Effective cost optimization can reduce LLM expenses by 50-80% without sacrificing quality. This lesson covers proven strategies for optimizing LLM costs in production.
Optimization Hierarchy
┌─────────────────────────────────────────────────────────────┐
│                  Cost Optimization Pyramid                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                  ┌─────────────┐                            │
│                  │    Model    │  ← Biggest impact          │
│                  │  Selection  │    (50-90% savings)        │
│                  └─────────────┘                            │
│             ┌───────────────────────┐                       │
│             │  Prompt Engineering   │  ← 20-40% savings     │
│             │     & Compression     │                       │
│             └───────────────────────┘                       │
│        ┌─────────────────────────────────┐                  │
│        │       Caching & Batching        │  ← 30-50%        │
│        └─────────────────────────────────┘                  │
│   ┌───────────────────────────────────────────┐             │
│   │        Infrastructure Optimization        │  ← 10-30%   │
│   │       (Quantization, Self-hosting)        │             │
│   └───────────────────────────────────────────┘             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Strategy 1: Model Selection & Routing
Choose the right model for each task:
from litellm import completion


class SmartRouter:
    MODEL_TIERS = {
        "simple": {
            "model": "gpt-4o-mini",
            "cost_per_1k": 0.00075,  # Blended average of input/output pricing
            "tasks": ["classification", "extraction", "simple_qa"],
        },
        "standard": {
            "model": "claude-3-5-haiku-20241022",
            "cost_per_1k": 0.0024,
            "tasks": ["summarization", "translation", "analysis"],
        },
        "complex": {
            "model": "gpt-4o",
            "cost_per_1k": 0.01,
            "tasks": ["reasoning", "code_generation", "creative"],
        },
    }

    def classify_task(self, prompt: str, task_type: str | None = None) -> str:
        """Determine the appropriate model tier."""
        if task_type:
            for tier, config in self.MODEL_TIERS.items():
                if task_type in config["tasks"]:
                    return tier
        # Heuristic fallback: longer prompts tend to need stronger models
        word_count = len(prompt.split())
        if word_count < 50:
            return "simple"
        elif word_count < 200:
            return "standard"
        return "complex"

    def route(self, prompt: str, task_type: str | None = None) -> str:
        tier = self.classify_task(prompt, task_type)
        return self.MODEL_TIERS[tier]["model"]

# Estimated savings: 60-80% for mixed workloads
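Routing then plugs straight into the completion call. A minimal usage sketch, assuming the SmartRouter class above and provider API keys in the environment (the example prompt and helper name are illustrative):

from litellm import completion

router = SmartRouter()

def answer(prompt: str, task_type: str | None = None) -> str:
    model = router.route(prompt, task_type)  # e.g. "gpt-4o-mini" for a classification task
    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("Is this review positive or negative? 'Great battery life.'", task_type="classification"))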
Strategy 2: Prompt Optimization
Reduce token usage through efficient prompts:
# Before: verbose prompt padded with boilerplate instructions
VERBOSE_PROMPT = """
You are a helpful AI assistant. Your task is to help the user with their question.
Please provide a comprehensive, detailed response that covers all aspects of the topic.
Make sure to be thorough and explain everything clearly.
If you need any clarification, please ask.
Here is the user's question: {question}
"""

# After: the same request in a handful of tokens (an order-of-magnitude reduction)
OPTIMIZED_PROMPT = """Answer concisely: {question}"""

# Even better with structured output
# (JSON braces are doubled so str.format() leaves them intact)
STRUCTURED_PROMPT = """{question}
Respond in JSON: {{"answer": "...", "confidence": 0-1}}"""
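To verify reductions like this, measure tokens rather than guessing. A quick check with tiktoken, assuming a tiktoken version that knows the gpt-4o encoding (counts vary slightly by model and tokenizer):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
question = "What is retrieval-augmented generation?"

for name, template in [("verbose", VERBOSE_PROMPT), ("optimized", OPTIMIZED_PROMPT)]:
    prompt = template.format(question=question)
    print(f"{name}: {len(enc.encode(prompt))} tokens")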
Token Compression Techniques
import tiktoken
from litellm import completion

def compress_context(text: str, max_tokens: int, model: str = "gpt-4o") -> str:
    """Compress context to fit within a token limit."""
    encoder = tiktoken.encoding_for_model(model)
    tokens = encoder.encode(text)
    if len(tokens) <= max_tokens:
        return text

    # Strategy 1: truncate from the middle (preserve start and end)
    keep_tokens = max_tokens - 50  # Buffer for the ellipsis marker
    start_tokens = keep_tokens // 2
    end_tokens = keep_tokens - start_tokens
    compressed = (
        encoder.decode(tokens[:start_tokens]) +
        "\n...[truncated]...\n" +
        encoder.decode(tokens[-end_tokens:])
    )
    return compressed

def summarize_long_context(text: str, model: str = "gpt-4o-mini") -> str:
    """Summarize long context using a cheap model."""
    if len(text.split()) < 500:
        return text
    response = completion(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Summarize in 200 words:\n\n{text}"
        }]
    )
    return response.choices[0].message.content
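The two functions compose naturally: summarize first with the cheap model, then hard-truncate only if the summary still exceeds the budget. A sketch (the 2,000-token budget and variable names are arbitrary examples):

def fit_to_budget(text: str, max_tokens: int = 2000) -> str:
    """Cheap summarization first, middle-truncation as a last resort."""
    summarized = summarize_long_context(text)
    return compress_context(summarized, max_tokens=max_tokens)

long_document = "..."  # e.g. a retrieved document or chat history
context = fit_to_budget(long_document, max_tokens=2000)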
Strategy 3: Caching
Cache responses for repeated or similar queries:
import hashlib
import json
from typing import Optional

from litellm import completion

class SemanticCache:
    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        self.embedding_model = embedding_model
        self.cache = {}  # In production, use Redis or similar
        self.similarity_threshold = 0.95

    def _get_cache_key(self, messages: list) -> str:
        """Generate a cache key from the messages."""
        content = json.dumps(messages, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, messages: list) -> Optional[str]:
        """Check the cache for an exact or semantic match."""
        # Exact match
        key = self._get_cache_key(messages)
        if key in self.cache:
            return self.cache[key]
        # Semantic match (simplified)
        # In production, use a vector DB for similarity search
        return None

    def set(self, messages: list, response: str):
        """Cache the response."""
        key = self._get_cache_key(messages)
        self.cache[key] = response

# Usage with LiteLLM
cache = SemanticCache()

def cached_completion(messages: list, model: str):
    cached = cache.get(messages)
    if cached:
        return {"cached": True, "response": cached}
    response = completion(model=model, messages=messages)
    content = response.choices[0].message.content
    cache.set(messages, content)
    return {"cached": False, "response": content}
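The semantic branch in get() is left as a stub above. One way to fill it in without a vector database is a brute-force cosine-similarity scan over stored query embeddings; the sketch below assumes LiteLLM's embedding() returns an OpenAI-style response, and the helper names are illustrative. In production, the linear scan is exactly what a vector database (or Redis with a vector index) replaces.

import numpy as np
from litellm import embedding

def embed_text(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    response = embedding(model=model, input=[text])
    return np.array(response.data[0]["embedding"])

def semantic_lookup(entries: list, query: str, threshold: float = 0.95):
    """entries: list of (stored_embedding, cached_response) pairs."""
    q = embed_text(query)
    for vec, cached_response in entries:
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= threshold:
            return cached_response  # Close enough: reuse the cached answer
    return None  # Fall through to a real LLM call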
Strategy 4: Batching
Process multiple requests together:
import asyncio
from litellm import acompletion

async def batch_process(prompts: list, model: str, batch_size: int = 10):
    """Process prompts in batches to optimize throughput."""
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        # Process the batch concurrently
        tasks = [
            acompletion(
                model=model,
                messages=[{"role": "user", "content": p}]
            )
            for p in batch
        ]
        batch_results = await asyncio.gather(*tasks)
        results.extend(batch_results)
    return results

# For bulk operations, consider using a batch API
# OpenAI Batch API: 50% discount for async processing
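When results can wait up to 24 hours, the OpenAI Batch API trades latency for that 50% discount. A rough sketch of the flow, with requests written to a JSONL file and submitted as a job (the file name and prompts are placeholders; check the current API reference for exact fields):

import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request, each with a custom_id for matching results later
with open("requests.jsonl", "w") as f:
    for i, prompt in enumerate(["Summarize document A", "Summarize document B"]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # Poll client.batches.retrieve(batch.id) until completed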
Strategy 5: Output Limits
Control response length:
from litellm import completion

def controlled_completion(prompt: str, max_output: int = 150):
    """Limit output tokens to control costs."""
    response = completion(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_output,  # Hard limit on generated tokens
    )
    return response

# Alternative: instruct in the prompt
CONCISE_PROMPT = """
{question}
Answer in 2-3 sentences maximum.
"""
Optimization Checklist
┌─────────────────────────────────────────────────────────────┐
│                 Cost Optimization Checklist                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  □ Use smallest effective model for each task               │
│  □ Implement semantic caching for repeated queries          │
│  □ Compress/summarize long contexts                         │
│  □ Set max_tokens limits on all requests                    │
│  □ Batch similar requests together                          │
│  □ Cache embeddings (they don't change)                     │
│  □ Use prompt templates (fewer tokens)                      │
│  □ Monitor and alert on cost spikes                         │
│  □ Track cost per feature/user/team (see the sketch below)  │
│  □ Review top cost drivers weekly                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
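For the tracking items above, LiteLLM can estimate the cost of each response, which makes per-feature attribution straightforward. A minimal sketch using LiteLLM's completion_cost helper; the in-memory ledger and feature labels are illustrative, and in production you would emit these numbers to your metrics system:

from collections import defaultdict
from litellm import completion, completion_cost

cost_by_feature = defaultdict(float)

def tracked_completion(feature: str, model: str, messages: list):
    response = completion(model=model, messages=messages)
    cost_by_feature[feature] += completion_cost(completion_response=response)
    return response

# Example: attribute spend to the feature that triggered the call
tracked_completion("search_summary", "gpt-4o-mini",
                   [{"role": "user", "content": "Summarize: ..."}])
print(dict(cost_by_feature))  # Review weekly to find the top cost drivers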
Real-World Savings Examples
| Optimization | Before | After | Savings |
|---|---|---|---|
| Model routing | $10,000/mo | $3,000/mo | 70% |
| Caching | $5,000/mo | $2,500/mo | 50% |
| Prompt optimization | $3,000/mo | $1,800/mo | 40% |
| Output limits | $2,000/mo | $1,400/mo | 30% |
| Combined | $10,000/mo | $2,000/mo | 80% |