Production Observability
LLM Observability Fundamentals
LLM observability extends traditional application monitoring to address the unique challenges of AI systems: non-deterministic outputs, complex multi-step reasoning, and quality assessment at scale.
Why LLM Observability Matters
┌──────────────────────────────────────────────────────────────┐
│               Traditional vs LLM Observability               │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Traditional Monitoring          LLM Observability           │
│  ──────────────────────          ─────────────────           │
│  • Response time                 • Response time             │
│  • Error rates                   • Error rates               │
│  • Throughput                    • Throughput                │
│                                                              │
│  + LLM-Specific:                                             │
│  ───────────────                                             │
│  • Token usage & costs           • Prompt/completion tracing │
│  • Output quality scores         • Hallucination detection   │
│  • Latency breakdown             • User feedback loops       │
│    (TTFT, generation)            • Model comparison A/B      │
│  • Conversation context          • Retrieval quality (RAG)   │
│  • Safety/guardrail triggers     • Prompt injection attempts │
│                                                              │
└──────────────────────────────────────────────────────────────┘
The Observability Stack
┌──────────────────────────────────────────────────────────────┐
│                   LLM Observability Stack                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Layer 4: Analytics & Insights                              │
│   ┌─────────────────────────────────────────────────────┐    │
│   │ Dashboards, reports, cost analysis, quality trends  │    │
│   └─────────────────────────────────────────────────────┘    │
│                              ↑                               │
│   Layer 3: Evaluation                                        │
│   ┌─────────────────────────────────────────────────────┐    │
│   │ LLM-as-judge, human feedback, automated scoring     │    │
│   └─────────────────────────────────────────────────────┘    │
│                              ↑                               │
│   Layer 2: Tracing                                           │
│   ┌─────────────────────────────────────────────────────┐    │
│   │ Spans, traces, prompts, completions, metadata       │    │
│   └─────────────────────────────────────────────────────┘    │
│                              ↑                               │
│   Layer 1: Collection                                        │
│   ┌─────────────────────────────────────────────────────┐    │
│   │ SDK instrumentation, API proxies, log aggregation   │    │
│   └─────────────────────────────────────────────────────┘    │
│                                                              │
└──────────────────────────────────────────────────────────────┘
Core Concepts
Traces and Spans
LLM traces capture the full lifecycle of a request:
# Conceptual trace structure
trace = {
    "trace_id": "abc-123",
    "name": "customer_support_query",
    "spans": [
        {
            "span_id": "span-1",
            "name": "embedding_generation",
            "input": "How do I reset my password?",
            "output": "[0.123, 0.456, ...]",
            "model": "text-embedding-3-small",
            "tokens": 8,
            "latency_ms": 45,
        },
        {
            "span_id": "span-2",
            "name": "vector_search",
            "input": {"query_vector": "..."},
            "output": {"documents": [...], "scores": [...]},
            "latency_ms": 12,
        },
        {
            "span_id": "span-3",
            "name": "llm_completion",
            "input": {"system": "...", "user": "..."},
            "output": "To reset your password...",
            "model": "gpt-4o",
            "prompt_tokens": 1250,
            "completion_tokens": 185,
            "latency_ms": 890,
            "cost_usd": 0.0134,
        },
    ],
    "total_latency_ms": 947,
    "total_cost_usd": 0.0136,
    "user_id": "user-456",
    "session_id": "session-789",
}
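The trace-level roll-ups are just aggregates over the spans. A minimal sketch of that bookkeeping, assuming the dictionary layout shown above (field names like `latency_ms` and `cost_usd` follow the example; real observability SDKs compute these totals for you):

```python
# Hypothetical helper: derives trace-level totals from the spans above.
def summarize_trace(trace: dict) -> dict:
    spans = trace.get("spans", [])
    return {
        "total_latency_ms": sum(s.get("latency_ms", 0) for s in spans),
        "total_cost_usd": round(sum(s.get("cost_usd", 0.0) for s in spans), 4),
        "total_tokens": sum(
            s.get("tokens", 0)
            + s.get("prompt_tokens", 0)
            + s.get("completion_tokens", 0)
            for s in spans
        ),
    }

summary = summarize_trace(trace)
```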
Evaluation Dimensions
Quality Evaluation Matrix:
| Dimension | Methods | Automation Level |
|---|---|---|
| Correctness | LLM-as-judge, RAG ground truth comparison | High |
| Relevance | Semantic similarity, topic classification | High |
| Helpfulness | User ratings, task completion rates | Medium |
| Safety | Guardrail checks, toxicity detection | High |
| Coherence | LLM-as-judge, readability scores | High |
| Groundedness (RAG) | Citation verification, source attribution | Medium |
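Several of the highly automatable dimensions rely on LLM-as-judge scoring: a second model grades each response against a rubric. A minimal sketch using the OpenAI Python SDK; the rubric, judge model, and 1–5 scale here are illustrative choices, not a fixed standard:

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative rubric; tune the dimensions and scale to your use case.
JUDGE_PROMPT = """Rate the assistant's answer from 1 (poor) to 5 (excellent) for
correctness and coherence, given the user's question. Respond as JSON:
{{"correctness": <int>, "coherence": <int>, "rationale": "<one sentence>"}}

Question: {question}
Answer: {answer}"""

def judge(question: str, answer: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return json.loads(completion.choices[0].message.content)
```

Judge scores drift when the prompt or model changes, so calibrate them periodically against human labels and run them on a sample of production traffic rather than every request.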
Key Metrics to Track
Latency Metrics
| Metric | Description | Target |
|---|---|---|
| TTFT | Time to first token | <500ms |
| Total latency | End-to-end response time | <3s |
| P95 latency | 95th percentile response | <5s |
| Generation speed | Tokens per second | >30 tok/s |
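TTFT and generation speed are easiest to measure around a streaming call. A rough sketch with the OpenAI Python SDK (the chars-per-token estimate and the model name are assumptions; exact token counts come from the provider's usage fields):

```python
import time
from openai import OpenAI

client = OpenAI()

def timed_completion(messages, model="gpt-4o"):
    start = time.perf_counter()
    ttft = None
    pieces = []
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content or ""
        if delta and ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        pieces.append(delta)
    total = time.perf_counter() - start
    text = "".join(pieces)
    # ~4 chars per token is a rough estimate; use the API usage field for exact counts.
    tokens_per_s = (len(text) / 4) / max(total - (ttft or 0.0), 1e-6)
    return text, {"ttft_s": ttft, "total_s": total, "tokens_per_s": tokens_per_s}
```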
Quality Metrics
| Metric | Description | Target |
|---|---|---|
| User satisfaction | Thumbs up/down ratio | >85% positive |
| Task completion | Did user achieve goal? | >90% |
| Hallucination rate | Factually incorrect responses | <5% |
| Guardrail triggers | Safety filter activations | <1% |
Cost Metrics
| Metric | Description | Optimization |
|---|---|---|
| Cost per query | Average $ per request | Track trends |
| Token efficiency | Output/input ratio | Optimize prompts |
| Cache hit rate | Reused responses | >70% for similar queries |
| Model cost mix | Spend by model tier | Route appropriately |
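Cost per query falls out of the token counts recorded on each span plus a per-model price table. A sketch with illustrative prices (check your provider's current price sheet; these change often):

```python
# Illustrative prices in USD per million tokens; not a current price sheet.
PRICES_PER_MTOK = {
    "gpt-4o": {"prompt": 2.50, "completion": 10.00},
    "gpt-4o-mini": {"prompt": 0.15, "completion": 0.60},
    "text-embedding-3-small": {"prompt": 0.02, "completion": 0.00},
}

def span_cost_usd(span: dict) -> float:
    """Cost of one span, using the token fields from the trace structure above."""
    price = PRICES_PER_MTOK.get(span.get("model", ""), {"prompt": 0.0, "completion": 0.0})
    prompt_tokens = span.get("prompt_tokens", span.get("tokens", 0))
    completion_tokens = span.get("completion_tokens", 0)
    return (prompt_tokens * price["prompt"] + completion_tokens * price["completion"]) / 1_000_000
```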
Observability Platform Comparison
| Platform | Strengths | Best For |
|---|---|---|
| Langfuse | Open-source, self-host, LLM-as-judge | Full control, privacy |
| Helicone | Ultra-low latency proxy, caching | High-scale production |
| LangSmith | LangChain integration, playground | LangChain apps |
| Weights & Biases | ML experiment tracking | Research teams |
| Datadog LLM | Enterprise APM integration | Existing Datadog users |
Integration Patterns
Proxy-Based (Zero-Code)
Your App → Observability Proxy → LLM Provider
                    ↓
           Analytics Dashboard
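With the proxy pattern, the only application change is where the client points. A sketch assuming an OpenAI-compatible proxy; the proxy URL and auth header name are placeholders that vary by vendor:

```python
from openai import OpenAI

# base_url and the auth header below are vendor-specific placeholders.
client = OpenAI(
    base_url="https://llm-proxy.example.com/v1",
    default_headers={"X-Observability-Api-Key": "<your-platform-key>"},
)

# Requests now flow app -> proxy -> provider; the proxy records prompts,
# completions, latency, and token counts with no further code changes.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
```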
SDK-Based (Detailed Control)
from observability_sdk import trace, span
@trace(name="chat_completion")
def process_query(user_message):
    with span("embedding"):
        embedding = get_embedding(user_message)
    with span("retrieval"):
        docs = search_documents(embedding)
    with span("completion"):
        response = generate_response(docs, user_message)
    return response
OpenTelemetry Compatible
from opentelemetry import trace
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
# Auto-instrument OpenAI calls
OpenAIInstrumentor().instrument()
# Traces flow to your OTel backend
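Auto-instrumentation only creates spans; you still need a tracer provider and an exporter configured so they reach a backend. A minimal OTLP setup sketch (the endpoint is a placeholder for whatever collector or observability platform you run):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Configure this once at startup, before instrumenting.
# The endpoint is a placeholder for your collector or vendor backend.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otel-collector.example.com/v1/traces"))
)
trace.set_tracer_provider(provider)
```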