Production Monitoring & Next Steps

Cost Tracking & Optimization


LLM costs can grow quickly in production. Track token usage, optimize prompts, and implement smart routing to manage expenses while maintaining quality.

Understanding LLM Costs

Cost Factor    │ Impact
─────────────────────────────────────────────────────────
Input tokens   │ Prompt length
Output tokens  │ Response length (usually priced higher per token)
Model choice   │ GPT-4 vs GPT-4o-mini pricing
Request volume │ Total API calls

Tracking Token Usage

With LangSmith

Token usage is automatically tracked:

from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai

# Wrapping the OpenAI client lets LangSmith record token usage for each call
client = wrap_openai(OpenAI())

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    # Token counts logged automatically
    return response.choices[0].message.content

View in LangSmith UI:

  • Total tokens per trace
  • Input vs output token breakdown
  • Cost estimates per request

With MLflow

import mlflow

with mlflow.start_run():
    response = call_llm(prompt)

    # Log token metrics
    mlflow.log_metric("input_tokens", response.usage.prompt_tokens)
    mlflow.log_metric("output_tokens", response.usage.completion_tokens)
    mlflow.log_metric("total_tokens", response.usage.total_tokens)

    # Calculate and log cost
    cost = calculate_cost(
        model="gpt-4o-mini",
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens
    )
    mlflow.log_metric("cost_usd", cost)

With W&B Weave

import weave
from openai import OpenAI

client = OpenAI()
weave.init("llm-cost-tracking")  # example project name; ops are logged to this project

@weave.op()
def tracked_llm_call(prompt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )

    return {
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
        "cost": calculate_cost(
            model="gpt-4o-mini",
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens
        )
    }

Cost Optimization Strategies

1. Model Routing

Use cheaper models when quality allows:

def smart_model_router(query: str, complexity: str) -> str:
    """Route to appropriate model based on complexity."""
    if complexity == "simple":
        return "gpt-4o-mini"  # $0.15/1M input tokens
    elif complexity == "medium":
        return "gpt-4o"       # $2.50/1M input tokens
    else:
        return "gpt-4"        # $30/1M input tokens

# Classify query complexity first
complexity = classify_complexity(query)
model = smart_model_router(query, complexity)
response = call_llm(query, model=model)
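
The classify_complexity call above is a placeholder. A minimal heuristic sketch (the keyword list and length thresholds are assumptions to tune against your own traffic):

def classify_complexity(query: str) -> str:
    """Rough heuristic: long or reasoning-heavy queries get routed to a stronger model."""
    reasoning_markers = ("why", "explain", "compare", "analyze", "step by step")
    words = query.lower().split()

    if len(words) > 100 or any(marker in query.lower() for marker in reasoning_markers):
        return "complex"
    if len(words) > 30:
        return "medium"
    return "simple"

In practice, a cheap classification call to gpt-4o-mini often outperforms keyword heuristics and adds only a fraction of a cent per query.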

2. Prompt Optimization

Shorter prompts = lower costs:

# Before: 500 tokens
system_prompt_verbose = """
You are a helpful customer support assistant. You should always
be polite and professional. When answering questions, provide
detailed information but also be concise. Make sure to address
all parts of the customer's question...
"""

# After: ~20 tokens
system_prompt_optimized = """
Customer support assistant. Be helpful, polite, concise.
Address all parts of the question.
"""

3. Response Length Limits

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200  # Limit output length
)

4. Caching

Cache repeated queries:

import hashlib

# Simple in-process, exact-match cache; production systems typically use a
# shared store (e.g. Redis) with a TTL
cache = {}

def cached_llm_call(prompt: str) -> str:
    # Hash the prompt so cache keys stay short even for long prompts
    cache_key = hashlib.md5(prompt.encode()).hexdigest()

    if cache_key in cache:
        return cache[cache_key]  # cache hit: no API call, no cost

    response = call_llm(prompt)
    cache[cache_key] = response

    return response

Cost Dashboard

Track costs over time:

Cost Dashboard - This Month
───────────────────────────────────────────
Model         │ Requests │ Tokens   │ Cost
───────────────────────────────────────────
gpt-4o-mini   │ 45,230   │ 12.3M    │ $24.60
gpt-4o        │ 8,420    │ 4.1M     │ $82.00
gpt-4         │ 320      │ 0.2M     │ $12.00
───────────────────────────────────────────
Total         │ 53,970   │ 16.6M    │ $118.60

Projected Monthly: $142.32
Budget: $200.00
Status: ✅ On track
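
A dashboard like this can be built from the cost metrics logged earlier. A minimal sketch of the month-end projection and budget status (the 80% warning threshold is an assumption):

from datetime import date
import calendar

def project_monthly_cost(spend_to_date: float, budget: float) -> dict:
    """Extrapolate month-to-date spend to a full-month projection and compare to budget."""
    today = date.today()
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = spend_to_date / today.day * days_in_month

    if projected > budget:
        status = "over budget"
    elif projected > 0.8 * budget:
        status = "approaching budget"
    else:
        status = "on track"

    return {"projected_usd": round(projected, 2), "budget_usd": budget, "status": status}

# Using the figures from the dashboard above
print(project_monthly_cost(spend_to_date=118.60, budget=200.00))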

Cost vs Quality Trade-offs

Optimization     │ Cost Savings │ Quality Impact
─────────────────────────────────────────────────
Smaller model    │ 50-90%       │ May decrease
Shorter prompts  │ 10-30%       │ Usually none
Caching          │ Variable     │ None
Response limits  │ 20-40%       │ May truncate
Batch processing │ 10-20%       │ None

Best Practices

  1. Set budgets: Define monthly/daily limits
  2. Alert on spikes: Catch runaway costs early (see the sketch after this list)
  3. A/B test models: Find quality/cost balance
  4. Monitor trends: Track cost per query over time
  5. Regular review: Identify optimization opportunities
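
For item 2, a simple spike check compares the latest day's spend against a trailing average. A minimal sketch, assuming you already aggregate cost per day (the 2x threshold is an assumption):

def check_cost_spike(daily_costs: list[float], threshold: float = 2.0) -> bool:
    """Flag a spike when the latest day's spend exceeds `threshold` times the trailing average."""
    if len(daily_costs) < 2:
        return False

    today = daily_costs[-1]
    trailing_avg = sum(daily_costs[:-1]) / len(daily_costs[:-1])
    return today > threshold * trailing_avg

# Example: per-day spend for the last week
if check_cost_spike([3.80, 4.10, 3.95, 4.20, 4.05, 3.90, 9.70]):
    print("Cost spike detected - investigate before the bill grows")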

Tip: Start with the most capable (and most expensive) model, then experiment with cheaper alternatives. A quality drop after downgrading is easier to detect than guessing the right model upfront.

Next, we'll explore how to integrate LLM evaluation into your CI/CD pipeline.
