Production Monitoring & Next Steps

Cost Tracking & Optimization


LLM costs can grow quickly in production. Track token usage, optimize prompts, and implement smart routing to manage expenses while maintaining quality.

Understanding LLM Costs

Cost Factor    │ Impact
─────────────────────────────────────────────────────────
Input tokens   │ Prompt length
Output tokens  │ Response length (usually priced higher per token)
Model choice   │ GPT-4 vs GPT-4o-mini pricing
Request volume │ Total API calls

Tracking Token Usage

With LangSmith

Token usage is automatically tracked:

from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai

# Wrapping the OpenAI client lets LangSmith record token usage for each call
client = wrap_openai(OpenAI())

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    # Token counts logged automatically
    return response.choices[0].message.content

View in LangSmith UI:

  • Total tokens per trace
  • Input vs output token breakdown
  • Cost estimates per request

With MLflow

import mlflow

with mlflow.start_run():
    response = call_llm(prompt)

    # Log token metrics
    mlflow.log_metric("input_tokens", response.usage.prompt_tokens)
    mlflow.log_metric("output_tokens", response.usage.completion_tokens)
    mlflow.log_metric("total_tokens", response.usage.total_tokens)

    # Calculate and log cost
    cost = calculate_cost(
        model="gpt-4o-mini",
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens
    )
    mlflow.log_metric("cost_usd", cost)

With W&B Weave

import weave
from openai import OpenAI

client = OpenAI()
weave.init("llm-cost-tracking")  # example project name; ops are logged to this project

@weave.op()
def tracked_llm_call(prompt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )

    return {
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
        "cost": calculate_cost(
            model="gpt-4o-mini",
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens
        )
    }

Cost Optimization Strategies

1. Model Routing

Use cheaper models when quality allows:

def smart_model_router(query: str, complexity: str) -> str:
    """Route to appropriate model based on complexity."""
    if complexity == "simple":
        return "gpt-4o-mini"  # $0.15/1M input tokens
    elif complexity == "medium":
        return "gpt-4o"       # $2.50/1M input tokens
    else:
        return "gpt-4"        # $30/1M input tokens

# Classify query complexity first
complexity = classify_complexity(query)
model = smart_model_router(query, complexity)
response = call_llm(query, model=model)
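
The classify_complexity call above is a placeholder. A minimal heuristic sketch (the keyword list and length thresholds are assumptions to tune against your own traffic):

def classify_complexity(query: str) -> str:
    """Rough heuristic: long or reasoning-heavy queries get routed to a stronger model."""
    reasoning_markers = ("why", "explain", "compare", "analyze", "step by step")
    words = query.lower().split()

    if len(words) > 100 or any(marker in query.lower() for marker in reasoning_markers):
        return "complex"
    if len(words) > 30:
        return "medium"
    return "simple"

In practice, a cheap classification call to gpt-4o-mini often outperforms keyword heuristics and adds only a fraction of a cent per query.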

2. Prompt Optimization

Shorter prompts = lower costs:

# Before: 500 tokens
system_prompt_verbose = """
You are a helpful customer support assistant. You should always
be polite and professional. When answering questions, provide
detailed information but also be concise. Make sure to address
all parts of the customer's question...
"""

# After: ~20 tokens
system_prompt_optimized = """
Customer support assistant. Be helpful, polite, concise.
Address all parts of the question.
"""

3. Response Length Limits

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200  # Limit output length
)

4. Caching

Cache repeated queries:

import hashlib

# Simple in-process, exact-match cache; production systems typically use a
# shared store (e.g. Redis) with a TTL
cache = {}

def cached_llm_call(prompt: str) -> str:
    # Hash the prompt so cache keys stay short even for long prompts
    cache_key = hashlib.md5(prompt.encode()).hexdigest()

    if cache_key in cache:
        return cache[cache_key]  # cache hit: no API call, no cost

    response = call_llm(prompt)
    cache[cache_key] = response

    return response

Cost Dashboard

Track costs over time:

Cost Dashboard - This Month
───────────────────────────────────────────
Model         │ Requests │ Tokens   │ Cost
───────────────────────────────────────────
gpt-4o-mini   │ 45,230   │ 12.3M    │ $24.60
gpt-4o        │ 8,420    │ 4.1M     │ $82.00
gpt-4         │ 320      │ 0.2M     │ $12.00
───────────────────────────────────────────
Total         │ 53,970   │ 16.6M    │ $118.60

Projected Monthly: $142.32
Budget: $200.00
Status: ✅ On track
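
A dashboard like this can be built from the cost metrics logged earlier. A minimal sketch of the month-end projection and budget status (the 80% warning threshold is an assumption):

from datetime import date
import calendar

def project_monthly_cost(spend_to_date: float, budget: float) -> dict:
    """Extrapolate month-to-date spend to a full-month projection and compare to budget."""
    today = date.today()
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = spend_to_date / today.day * days_in_month

    if projected > budget:
        status = "over budget"
    elif projected > 0.8 * budget:
        status = "approaching budget"
    else:
        status = "on track"

    return {"projected_usd": round(projected, 2), "budget_usd": budget, "status": status}

# Using the figures from the dashboard above
print(project_monthly_cost(spend_to_date=118.60, budget=200.00))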

Cost vs Quality Trade-offs

Optimization     │ Cost Savings │ Quality Impact
─────────────────────────────────────────────────
Smaller model    │ 50-90%       │ May decrease
Shorter prompts  │ 10-30%       │ Usually none
Caching          │ Variable     │ None
Response limits  │ 20-40%       │ May truncate
Batch processing │ 10-20%       │ None

Best Practices

  1. Set budgets: Define monthly/daily limits
  2. Alert on spikes: Catch runaway costs early (see the sketch after this list)
  3. A/B test models: Find quality/cost balance
  4. Monitor trends: Track cost per query over time
  5. Regular review: Identify optimization opportunities
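
For item 2, a simple spike check compares the latest day's spend against a trailing average. A minimal sketch, assuming you already aggregate cost per day (the 2x threshold is an assumption):

def check_cost_spike(daily_costs: list[float], threshold: float = 2.0) -> bool:
    """Flag a spike when the latest day's spend exceeds `threshold` times the trailing average."""
    if len(daily_costs) < 2:
        return False

    today = daily_costs[-1]
    trailing_avg = sum(daily_costs[:-1]) / len(daily_costs[:-1])
    return today > threshold * trailing_avg

# Example: per-day spend for the last week
if check_cost_spike([3.80, 4.10, 3.95, 4.20, 4.05, 3.90, 9.70]):
    print("Cost spike detected - investigate before the bill grows")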

Tip: Start with the most capable (and most expensive) model, then experiment with cheaper alternatives. A quality drop after downgrading is easier to detect than guessing the right model upfront.

Next, we'll explore how to integrate LLM evaluation into your CI/CD pipeline.
