Production Monitoring & Next Steps
Cost Tracking & Optimization
LLM costs can grow quickly in production. Track token usage, optimize prompts, and implement smart routing to manage expenses while maintaining quality.
Understanding LLM Costs
| Cost Factor | What Drives It |
|---|---|
| Input tokens | Prompt length (system prompt, context, user message) |
| Output tokens | Response length (typically billed at a higher per-token rate than input) |
| Model choice | Per-token rates vary widely, e.g. GPT-4 vs GPT-4o-mini |
| Request volume | Total number of API calls |
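
The tracking examples below use a `calculate_cost` helper that is not part of any SDK. Here is a minimal sketch, with illustrative per-million-token rates (verify against your provider's current price list):

```python
# Illustrative USD rates per 1M tokens; check current provider pricing
PRICES_PER_1M = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4": {"input": 30.00, "output": 60.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request from its token counts."""
    rates = PRICES_PER_1M[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```

For example, `calculate_cost("gpt-4o-mini", 1200, 300)` works out to roughly $0.00036.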
Tracking Token Usage
With LangSmith
Token usage is captured automatically when you trace the call and wrap the OpenAI client:

```python
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrapping the client lets LangSmith record token counts for each call it makes
client = wrap_openai(OpenAI())

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Token counts are logged automatically with the trace
    return response.choices[0].message.content
```
View in LangSmith UI:
- Total tokens per trace
- Input vs output token breakdown
- Cost estimates per request
With MLflow
Log token counts and cost as run metrics:

```python
import mlflow

with mlflow.start_run():
    # call_llm is a thin helper (defined elsewhere) that returns the raw OpenAI response
    response = call_llm(prompt)

    # Log token metrics
    mlflow.log_metric("input_tokens", response.usage.prompt_tokens)
    mlflow.log_metric("output_tokens", response.usage.completion_tokens)
    mlflow.log_metric("total_tokens", response.usage.total_tokens)

    # Calculate and log cost
    cost = calculate_cost(
        model="gpt-4o-mini",
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
    )
    mlflow.log_metric("cost_usd", cost)
```
With W&B Weave
Decorate the call with `@weave.op()` so each invocation is logged:

```python
import weave
from openai import OpenAI

client = OpenAI()
weave.init("llm-cost-tracking")  # project name is illustrative

@weave.op()
def tracked_llm_call(prompt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return {
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
        "cost": calculate_cost(
            model="gpt-4o-mini",
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens,
        ),
    }
```
Cost Optimization Strategies
1. Model Routing
Use cheaper models when quality allows:
```python
def smart_model_router(query: str, complexity: str) -> str:
    """Route to an appropriate model based on complexity."""
    if complexity == "simple":
        return "gpt-4o-mini"  # $0.15 / 1M input tokens
    elif complexity == "medium":
        return "gpt-4o"       # $2.50 / 1M input tokens
    else:
        return "gpt-4"        # $30 / 1M input tokens

# Classify query complexity first, then route
complexity = classify_complexity(query)
model = smart_model_router(query, complexity)
response = call_llm(query, model=model)
```
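
`classify_complexity` is left undefined above. One minimal, hypothetical approach is a keyword-and-length heuristic; a small classifier model or a cheap LLM call would also work:

```python
def classify_complexity(query: str) -> str:
    """Very rough heuristic: long or reasoning-heavy queries are treated as harder."""
    reasoning_markers = ("why", "compare", "analyze", "step by step", "prove")
    has_reasoning = any(marker in query.lower() for marker in reasoning_markers)
    word_count = len(query.split())
    if has_reasoning and word_count > 30:
        return "complex"
    if has_reasoning or word_count > 15:
        return "medium"
    return "simple"
```

Whatever classifier you use, keep it much cheaper than the models it routes to, otherwise the routing step eats the savings.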
2. Prompt Optimization
Shorter prompts = lower costs:
```python
# Before: 500 tokens
system_prompt_verbose = """
You are a helpful customer support assistant. You should always
be polite and professional. When answering questions, provide
detailed information but also be concise. Make sure to address
all parts of the customer's question...
"""

# After: 100 tokens
system_prompt_optimized = """
Customer support assistant. Be helpful, polite, concise.
Address all parts of the question.
"""
```
3. Response Length Limits
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200,  # Limit output length
)
```
4. Caching
Cache repeated queries:
```python
import hashlib

# Simple in-process, exact-match cache keyed by a hash of the prompt
cache = {}

def cached_llm_call(prompt: str) -> str:
    cache_key = hashlib.md5(prompt.encode()).hexdigest()
    if cache_key in cache:
        return cache[cache_key]
    response = call_llm(prompt)
    cache[cache_key] = response
    return response
```
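
Note that this is an in-process, exact-match cache: identical prompts hit, near-duplicate prompts miss, and everything is lost on restart. For production traffic, a shared store such as Redis with a TTL is the usual next step.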
Cost Dashboard
Track costs over time:
```
Cost Dashboard - This Month
─────────────────────────────────────────────
Model        │ Requests │ Tokens  │ Cost
─────────────────────────────────────────────
gpt-4o-mini  │  45,230  │  12.3M  │  $24.60
gpt-4o       │   8,420  │   4.1M  │  $82.00
gpt-4        │     320  │   0.2M  │  $12.00
─────────────────────────────────────────────
Total        │  53,970  │  16.6M  │ $118.60

Projected Monthly: $142.32
Budget:            $200.00
Status:            ✅ On track
```
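
A summary like this can be generated from the per-request metrics you are already logging. A minimal sketch, assuming each record carries a model name, token count, and pre-computed cost:

```python
from collections import defaultdict

def summarize_costs(records: list[dict], days_elapsed: int, days_in_month: int = 30) -> dict:
    """Aggregate logged usage records into per-model totals and a monthly projection."""
    per_model = defaultdict(lambda: {"requests": 0, "tokens": 0, "cost": 0.0})
    for record in records:  # e.g. {"model": "gpt-4o-mini", "tokens": 812, "cost": 0.0004}
        row = per_model[record["model"]]
        row["requests"] += 1
        row["tokens"] += record["tokens"]
        row["cost"] += record["cost"]
    total_cost = sum(row["cost"] for row in per_model.values())
    return {
        "per_model": dict(per_model),
        "total_cost": total_cost,
        "projected_monthly": total_cost / days_elapsed * days_in_month,
    }
```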
Cost vs Quality Trade-offs
| Optimization | Cost Savings | Quality Impact |
|---|---|---|
| Smaller model | 50-90% | May decrease |
| Shorter prompts | 10-30% | Usually none |
| Caching | Variable | None |
| Response limits | 20-40% | May truncate |
| Batch processing | 10-20% | None |
Best Practices
- Set budgets: Define monthly/daily limits
- Alert on spikes: Catch runaway costs early (see the sketch below)
- A/B test models: Find the quality/cost balance
- Monitor trends: Track cost per query over time
- Review regularly: Identify optimization opportunities
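
The budget and spike alerts above can start as a simple threshold check over daily cost totals; a hypothetical sketch (the limits and the alerting hook are whatever fits your stack):

```python
DAILY_BUDGET_USD = 7.00  # illustrative daily limit
SPIKE_FACTOR = 2.0       # alert if today is 2x the trailing average

def check_spend(daily_costs: list[float]) -> list[str]:
    """Return alert messages for budget overruns or sudden cost spikes."""
    alerts = []
    today = daily_costs[-1]
    if today > DAILY_BUDGET_USD:
        alerts.append(f"Daily budget exceeded: ${today:.2f} > ${DAILY_BUDGET_USD:.2f}")
    previous = daily_costs[:-1]
    if previous:
        baseline = sum(previous) / len(previous)
        if baseline > 0 and today > SPIKE_FACTOR * baseline:
            alerts.append(f"Cost spike: ${today:.2f} vs ${baseline:.2f} daily average")
    return alerts
```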
Tip: Start with the most capable (and most expensive) model, then experiment with cheaper alternatives; a quality drop after a downgrade is easier to detect than picking the right model upfront.
Next, we'll explore how to integrate LLM evaluation into your CI/CD pipeline.