Production Observability
LLM Observability Fundamentals
LLM observability extends traditional application monitoring to address the unique challenges of AI systems: non-deterministic outputs, complex multi-step reasoning, and quality assessment at scale.
Why LLM Observability Matters
┌──────────────────────────────────────────────────────────────┐
│               Traditional vs LLM Observability               │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Traditional Monitoring          LLM Observability           │
│  ──────────────────────          ─────────────────           │
│  • Response time                 • Response time             │
│  • Error rates                   • Error rates               │
│  • Throughput                    • Throughput                │
│                                                              │
│  + LLM-Specific:                                             │
│  ───────────────                                             │
│  • Token usage & costs           • Prompt/completion tracing │
│  • Output quality scores         • Hallucination detection   │
│  • Latency breakdown             • User feedback loops       │
│    (TTFT, generation)            • Model comparison A/B      │
│  • Conversation context          • Retrieval quality (RAG)   │
│  • Safety/guardrail triggers     • Prompt injection attempts │
│                                                              │
└──────────────────────────────────────────────────────────────┘
The Observability Stack
┌──────────────────────────────────────────────────────────────┐
│                   LLM Observability Stack                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Layer 4: Analytics & Insights                              │
│   ┌─────────────────────────────────────────────────────┐    │
│   │ Dashboards, reports, cost analysis, quality trends  │    │
│   └─────────────────────────────────────────────────────┘    │
│                              ↑                               │
│   Layer 3: Evaluation                                        │
│   ┌─────────────────────────────────────────────────────┐    │
│   │ LLM-as-judge, human feedback, automated scoring     │    │
│   └─────────────────────────────────────────────────────┘    │
│                              ↑                               │
│   Layer 2: Tracing                                           │
│   ┌─────────────────────────────────────────────────────┐    │
│   │ Spans, traces, prompts, completions, metadata       │    │
│   └─────────────────────────────────────────────────────┘    │
│                              ↑                               │
│   Layer 1: Collection                                        │
│   ┌─────────────────────────────────────────────────────┐    │
│   │ SDK instrumentation, API proxies, log aggregation   │    │
│   └─────────────────────────────────────────────────────┘    │
│                                                              │
└──────────────────────────────────────────────────────────────┘
Core Concepts
Traces and Spans
LLM traces capture the full lifecycle of a request:
# Conceptual trace structure
trace = {
    "trace_id": "abc-123",
    "name": "customer_support_query",
    "spans": [
        {
            "span_id": "span-1",
            "name": "embedding_generation",
            "input": "How do I reset my password?",
            "output": "[0.123, 0.456, ...]",
            "model": "text-embedding-3-small",
            "tokens": 8,
            "latency_ms": 45,
        },
        {
            "span_id": "span-2",
            "name": "vector_search",
            "input": {"query_vector": "..."},
            "output": {"documents": [...], "scores": [...]},
            "latency_ms": 12,
        },
        {
            "span_id": "span-3",
            "name": "llm_completion",
            "input": {"system": "...", "user": "..."},
            "output": "To reset your password...",
            "model": "gpt-4o",
            "prompt_tokens": 1250,
            "completion_tokens": 185,
            "latency_ms": 890,
            "cost_usd": 0.0134,
        },
    ],
    "total_latency_ms": 947,
    "total_cost_usd": 0.0136,
    "user_id": "user-456",
    "session_id": "session-789",
}
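The trace-level roll-ups are just aggregates over the spans. A minimal sketch of that bookkeeping, assuming the dictionary layout shown above (field names like `latency_ms` and `cost_usd` follow the example; real observability SDKs compute these totals for you):

```python
# Hypothetical helper: derives trace-level totals from the spans above.
def summarize_trace(trace: dict) -> dict:
    spans = trace.get("spans", [])
    return {
        "total_latency_ms": sum(s.get("latency_ms", 0) for s in spans),
        "total_cost_usd": round(sum(s.get("cost_usd", 0.0) for s in spans), 4),
        "total_tokens": sum(
            s.get("tokens", 0)
            + s.get("prompt_tokens", 0)
            + s.get("completion_tokens", 0)
            for s in spans
        ),
    }

summary = summarize_trace(trace)
```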
Evaluation Dimensions
Quality Evaluation Matrix:
| Dimension | Methods | Automation Level |
|---|---|---|
| Correctness | LLM-as-judge, RAG ground truth comparison | High |
| Relevance | Semantic similarity, topic classification | High |
| Helpfulness | User ratings, task completion rates | Medium |
| Safety | Guardrail checks, toxicity detection | High |
| Coherence | LLM-as-judge, readability scores | High |
| Groundedness (RAG) | Citation verification, source attribution | Medium |
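Several of the highly automatable dimensions rely on LLM-as-judge scoring: a second model grades each response against a rubric. A minimal sketch using the OpenAI Python SDK; the rubric, judge model, and 1–5 scale here are illustrative choices, not a fixed standard:

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative rubric; tune the dimensions and scale to your use case.
JUDGE_PROMPT = """Rate the assistant's answer from 1 (poor) to 5 (excellent) for
correctness and coherence, given the user's question. Respond as JSON:
{{"correctness": <int>, "coherence": <int>, "rationale": "<one sentence>"}}

Question: {question}
Answer: {answer}"""

def judge(question: str, answer: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return json.loads(completion.choices[0].message.content)
```

Judge scores drift when the prompt or model changes, so calibrate them periodically against human labels and run them on a sample of production traffic rather than every request.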
Key Metrics to Track
Latency Metrics
| Metric | Description | Target |
|---|---|---|
| TTFT | Time to first token | <500ms |
| Total latency | End-to-end response time | <3s |
| P95 latency | 95th percentile response | <5s |
| Generation speed | Tokens per second | >30 tok/s |
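TTFT and generation speed are easiest to measure around a streaming call. A rough sketch with the OpenAI Python SDK (the chars-per-token estimate and the model name are assumptions; exact token counts come from the provider's usage fields):

```python
import time
from openai import OpenAI

client = OpenAI()

def timed_completion(messages, model="gpt-4o"):
    start = time.perf_counter()
    ttft = None
    pieces = []
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content or ""
        if delta and ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        pieces.append(delta)
    total = time.perf_counter() - start
    text = "".join(pieces)
    # ~4 chars per token is a rough estimate; use the API usage field for exact counts.
    tokens_per_s = (len(text) / 4) / max(total - (ttft or 0.0), 1e-6)
    return text, {"ttft_s": ttft, "total_s": total, "tokens_per_s": tokens_per_s}
```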
Quality Metrics
| Metric | Description | Target |
|---|---|---|
| User satisfaction | Thumbs up/down ratio | >85% positive |
| Task completion | Did user achieve goal? | >90% |
| Hallucination rate | Factually incorrect responses | <5% |
| Guardrail triggers | Safety filter activations | <1% |
Cost Metrics
| Metric | Description | Optimization |
|---|---|---|
| Cost per query | Average $ per request | Track trends |
| Token efficiency | Output/input ratio | Optimize prompts |
| Cache hit rate | Reused responses | >70% for similar queries |
| Model cost mix | Spend by model tier | Route appropriately |
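Cost per query falls out of the token counts recorded on each span plus a per-model price table. A sketch with illustrative prices (check your provider's current price sheet; these change often):

```python
# Illustrative prices in USD per million tokens; not a current price sheet.
PRICES_PER_MTOK = {
    "gpt-4o": {"prompt": 2.50, "completion": 10.00},
    "gpt-4o-mini": {"prompt": 0.15, "completion": 0.60},
    "text-embedding-3-small": {"prompt": 0.02, "completion": 0.00},
}

def span_cost_usd(span: dict) -> float:
    """Cost of one span, using the token fields from the trace structure above."""
    price = PRICES_PER_MTOK.get(span.get("model", ""), {"prompt": 0.0, "completion": 0.0})
    prompt_tokens = span.get("prompt_tokens", span.get("tokens", 0))
    completion_tokens = span.get("completion_tokens", 0)
    return (prompt_tokens * price["prompt"] + completion_tokens * price["completion"]) / 1_000_000
```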
Observability Platform Comparison
| Platform | Strengths | Best For |
|---|---|---|
| Langfuse | Open-source, self-host, LLM-as-judge | Full control, privacy |
| Helicone | Ultra-low latency proxy, caching | High-scale production |
| LangSmith | LangChain integration, playground | LangChain apps |
| Weights & Biases | ML experiment tracking | Research teams |
| Datadog LLM | Enterprise APM integration | Existing Datadog users |
Integration Patterns
Proxy-Based (Zero-Code)
Your App → Observability Proxy → LLM Provider
                    ↓
           Analytics Dashboard
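With the proxy pattern, the only application change is where the client points. A sketch assuming an OpenAI-compatible proxy; the proxy URL and auth header name are placeholders that vary by vendor:

```python
from openai import OpenAI

# base_url and the auth header below are vendor-specific placeholders.
client = OpenAI(
    base_url="https://llm-proxy.example.com/v1",
    default_headers={"X-Observability-Api-Key": "<your-platform-key>"},
)

# Requests now flow app -> proxy -> provider; the proxy records prompts,
# completions, latency, and token counts with no further code changes.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
```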
SDK-Based (Detailed Control)
from observability_sdk import trace, span
@trace(name="chat_completion")
def process_query(user_message):
    with span("embedding"):
        embedding = get_embedding(user_message)
    with span("retrieval"):
        docs = search_documents(embedding)
    with span("completion"):
        response = generate_response(docs, user_message)
    return response
OpenTelemetry Compatible
from opentelemetry import trace
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
# Auto-instrument OpenAI calls
OpenAIInstrumentor().instrument()
# Traces flow to your OTel backend
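Auto-instrumentation only creates spans; you still need a tracer provider and an exporter configured so they reach a backend. A minimal OTLP setup sketch (the endpoint is a placeholder for whatever collector or observability platform you run):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Configure this once at startup, before instrumenting.
# The endpoint is a placeholder for your collector or vendor backend.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otel-collector.example.com/v1/traces"))
)
trace.set_tracer_provider(provider)
```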