Cost Optimization & Scaling

Cost Optimization Strategies

4 min read

Effective cost optimization can often reduce LLM expenses by 50-80% without sacrificing output quality. This lesson covers proven strategies for cutting LLM costs in production.

Optimization Hierarchy

┌─────────────────────────────────────────────────────────────┐
│              Cost Optimization Pyramid                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                    ┌───────────┐                            │
│                    │  Model    │  ← Biggest impact          │
│                    │ Selection │     (50-90% savings)       │
│                    └─────┬─────┘                            │
│               ┌──────────┴──────────┐                       │
│               │   Prompt Engineering│  ← 20-40% savings     │
│               │   & Compression     │                       │
│               └──────────┬──────────┘                       │
│          ┌───────────────┴───────────────┐                  │
│          │        Caching & Batching     │  ← 30-50%        │
│          └───────────────┬───────────────┘                  │
│     ┌────────────────────┴────────────────────┐             │
│     │      Infrastructure Optimization         │  ← 10-30%  │
│     │    (Quantization, Self-hosting)         │             │
│     └─────────────────────────────────────────┘             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Strategy 1: Model Selection & Routing

Choose the right model for each task:

from litellm import completion

class SmartRouter:
    MODEL_TIERS = {
        "simple": {
            "model": "gpt-4o-mini",
            "cost_per_1k": 0.00075,  # Avg of input/output
            "tasks": ["classification", "extraction", "simple_qa"]
        },
        "standard": {
            "model": "claude-3-5-haiku-20241022",
            "cost_per_1k": 0.0024,
            "tasks": ["summarization", "translation", "analysis"]
        },
        "complex": {
            "model": "gpt-4o",
            "cost_per_1k": 0.01,
            "tasks": ["reasoning", "code_generation", "creative"]
        }
    }

    def classify_task(self, prompt: str, task_type: str | None = None) -> str:
        """Determine appropriate model tier."""
        if task_type:
            for tier, config in self.MODEL_TIERS.items():
                if task_type in config["tasks"]:
                    return tier

        # Heuristic fallback
        word_count = len(prompt.split())
        if word_count < 50:
            return "simple"
        elif word_count < 200:
            return "standard"
        return "complex"

    def route(self, prompt: str, task_type: str | None = None) -> str:
        tier = self.classify_task(prompt, task_type)
        return self.MODEL_TIERS[tier]["model"]

# Estimated savings: 60-80% for mixed workloads
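
Routing then becomes a thin wrapper around the normal completion call. A minimal usage sketch (the routed_completion helper and example prompts are illustrative):

router = SmartRouter()

def routed_completion(prompt: str, task_type: str | None = None):
    """Pick the cheapest suitable model for the task, then call it."""
    model = router.route(prompt, task_type)
    return completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

# Extraction work lands on the cheap tier, labeled reasoning work on the large model
routed_completion("Extract the order ID from: 'Order #4521 shipped today'", task_type="extraction")
routed_completion("Design a phased plan to split our billing monolith into services.", task_type="reasoning")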

Strategy 2: Prompt Optimization

Reduce token usage through efficient prompts:

# Before: ~75 tokens of boilerplate instruction
VERBOSE_PROMPT = """
You are a helpful AI assistant. Your task is to help the user with their question.
Please provide a comprehensive, detailed response that covers all aspects of the topic.
Make sure to be thorough and explain everything clearly.
If you need any clarification, please ask.
Here is the user's question: {question}
"""

# After: ~6 tokens (most of the prompt overhead removed)
OPTIMIZED_PROMPT = """Answer concisely: {question}"""

# Even better with structured output (JSON braces doubled so .format() leaves them literal)
STRUCTURED_PROMPT = """{question}

Respond in JSON: {{"answer": "...", "confidence": 0-1}}"""
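
Exact counts depend on the tokenizer, so measure rather than guess. A quick check with tiktoken (the token figures in the comments above are approximate and will vary by model):

import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")

question = "What is a vector database?"
for name, template in [("verbose", VERBOSE_PROMPT), ("optimized", OPTIMIZED_PROMPT)]:
    n_tokens = len(encoder.encode(template.format(question=question)))
    print(f"{name}: {n_tokens} tokens")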

Token Compression Techniques

import tiktoken
from litellm import completion

def compress_context(text: str, max_tokens: int, model: str = "gpt-4o") -> str:
    """Compress context to fit within token limit."""
    encoder = tiktoken.encoding_for_model(model)
    tokens = encoder.encode(text)

    if len(tokens) <= max_tokens:
        return text

    # Keep the start and end of the context, dropping the middle
    keep_tokens = max(max_tokens - 50, 2)  # reserve room for the truncation marker
    start_tokens = keep_tokens // 2
    end_tokens = keep_tokens - start_tokens

    compressed = (
        encoder.decode(tokens[:start_tokens]) +
        "\n...[truncated]...\n" +
        encoder.decode(tokens[-end_tokens:])
    )

    return compressed

def summarize_long_context(text: str, model: str = "gpt-4o-mini") -> str:
    """Summarize long context using cheap model."""
    if len(text.split()) < 500:
        return text

    response = completion(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Summarize in 200 words:\n\n{text}"
        }]
    )

    return response.choices[0].message.content
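
The two helpers trade off differently: middle truncation is free and instant but can drop the most relevant passage, while summarization costs one cheap model call and preserves more meaning. A small wrapper that combines them (the fit_context name and prefer_summary flag are illustrative):

def fit_context(text: str, max_tokens: int, prefer_summary: bool = True) -> str:
    """Shrink context to a token budget, preferring summarization when allowed."""
    encoder = tiktoken.encoding_for_model("gpt-4o")
    if len(encoder.encode(text)) <= max_tokens:
        return text
    if prefer_summary:
        # One cheap model call, but keeps meaning better than truncation
        text = summarize_long_context(text)
    # Truncation as a final guarantee that the budget is respected
    return compress_context(text, max_tokens)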

Strategy 3: Caching

Cache responses for repeated or similar queries:

import hashlib
import json
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        self.embedding_model = embedding_model
        self.cache = {}  # In production, use Redis or similar
        self.similarity_threshold = 0.95

    def _get_cache_key(self, messages: list) -> str:
        """Generate cache key from messages."""
        content = json.dumps(messages, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, messages: list) -> Optional[str]:
        """Check cache for exact or semantic match."""
        # Exact match
        key = self._get_cache_key(messages)
        if key in self.cache:
            return self.cache[key]

        # Semantic match (simplified)
        # In production, use vector DB for similarity search
        return None

    def set(self, messages: list, response: str):
        """Cache the response."""
        key = self._get_cache_key(messages)
        self.cache[key] = response

# Usage with LiteLLM
cache = SemanticCache()

def cached_completion(messages: list, model: str):
    cached = cache.get(messages)
    if cached:
        return {"cached": True, "response": cached}

    response = completion(model=model, messages=messages)
    content = response.choices[0].message.content

    cache.set(messages, content)
    return {"cached": False, "response": content}

Strategy 4: Batching

Process multiple requests together:

import asyncio
from litellm import acompletion

async def batch_process(prompts: list, model: str, batch_size: int = 10):
    """Process prompts in batches to optimize throughput."""
    results = []

    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]

        # Process batch concurrently
        tasks = [
            acompletion(
                model=model,
                messages=[{"role": "user", "content": p}]
            )
            for p in batch
        ]

        batch_results = await asyncio.gather(*tasks)
        results.extend(batch_results)

    return results

# For bulk operations, consider using batch API
# OpenAI Batch API: 50% discount for async processing
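
When results can wait up to 24 hours, submitting a batch job avoids paying real-time rates entirely. A hedged sketch of submitting a batch with the OpenAI SDK; verify the request-file fields against the current Batch API reference:

import json
from openai import OpenAI

client = OpenAI()

def submit_batch(prompts: list[str], model: str = "gpt-4o-mini") -> str:
    """Write a JSONL request file, upload it, and create a batch job."""
    with open("batch_requests.jsonl", "w") as f:
        for i, prompt in enumerate(prompts):
            f.write(json.dumps({
                "custom_id": f"req-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }) + "\n")

    batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    return batch.id  # poll client.batches.retrieve(batch.id) for results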

Strategy 5: Output Limits

Control response length:

def controlled_completion(prompt: str, max_output: int = 150):
    """Limit output tokens to control costs."""
    response = completion(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_output,  # Hard limit
    )

    return response

# Alternative: Instruct in prompt
CONCISE_PROMPT = """
{question}

Answer in 2-3 sentences maximum.
"""

Optimization Checklist

┌─────────────────────────────────────────────────────────────┐
│              Cost Optimization Checklist                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  □ Use smallest effective model for each task               │
│  □ Implement semantic caching for repeated queries          │
│  □ Compress/summarize long contexts                         │
│  □ Set max_tokens limits on all requests                    │
│  □ Batch similar requests together                          │
│  □ Cache embeddings (they don't change)                     │
│  □ Use prompt templates (fewer tokens)                      │
│  □ Monitor and alert on cost spikes                         │
│  □ Track cost per feature/user/team                         │
│  □ Review top cost drivers weekly                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
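
The monitoring items in the checklist only work if costs are recorded per call. A minimal sketch of per-feature tracking, assuming litellm's completion_cost() helper and an in-memory dict standing in for a real metrics backend:

from collections import defaultdict
from litellm import completion, completion_cost

feature_costs = defaultdict(float)  # in production: emit to your metrics backend

def tracked_completion(feature: str, model: str, messages: list):
    response = completion(model=model, messages=messages)
    # completion_cost estimates USD cost from the model and token usage
    feature_costs[feature] += completion_cost(completion_response=response)
    return response

tracked_completion("search_summary", "gpt-4o-mini",
                   [{"role": "user", "content": "Summarize: ..."}])
print(dict(feature_costs))  # review top cost drivers weekly, alert on spikes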

Real-World Savings Examples

Optimization           Before         After          Savings
Model routing          $10,000/mo     $3,000/mo      70%
Caching                $5,000/mo      $2,500/mo      50%
Prompt optimization    $3,000/mo      $1,800/mo      40%
Output limits          $2,000/mo      $1,400/mo      30%
Combined               $10,000/mo     $2,000/mo      80%
