Production Observability

Langfuse: Open-Source LLM Observability

4 min read

Langfuse is a widely adopted open-source platform for LLM observability, offering tracing, evaluation, and prompt management. In June 2025, Langfuse open-sourced its previously commercial features, including LLM-as-a-judge evaluations, annotation queues, and the prompt playground, under the MIT license.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     Langfuse Platform                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                    Web Dashboard                     │   │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────────────┐   │   │
│  │  │  Traces  │ │ Sessions │ │ Prompt Management  │   │   │
│  │  └──────────┘ └──────────┘ └────────────────────┘   │   │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────────────┐   │   │
│  │  │  Scores  │ │ Datasets │ │ Annotation Queues  │   │   │
│  │  └──────────┘ └──────────┘ └────────────────────┘   │   │
│  └─────────────────────────────────────────────────────┘   │
│                          ↑ REST API                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Backend (TypeScript/Node.js)            │   │
│  │         PostgreSQL + ClickHouse (analytics)          │   │
│  └─────────────────────────────────────────────────────┘   │
│                          ↑ SDKs                             │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐   │
│  │  Python   │ │ TypeScript│ │ LangChain │ │LlamaIndex│   │
│  └───────────┘ └───────────┘ └───────────┘ └──────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Start with Python SDK

Installation and Setup

pip install langfuse

# Set environment variables
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"  # or self-hosted
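
The SDK reads these variables automatically when the client is constructed; keys can also be passed explicitly, and auth_check() verifies connectivity before any traces are sent:

from langfuse import Langfuse

# Picks up LANGFUSE_* environment variables by default; explicit
# arguments are useful for multi-project setups
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com"
)

assert langfuse.auth_check()  # fails fast on bad credentials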

Basic Tracing

from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

client = OpenAI()

@observe()
def embed_query(user_question: str) -> list:
    # Nested @observe functions appear as child spans of the calling trace
    embedding_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_question
    )
    langfuse_context.update_current_observation(
        input=user_question,
        output={"dimensions": len(embedding_response.data[0].embedding)}
    )
    return embedding_response.data[0].embedding

@observe(as_type="generation")
def generate_answer(user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_question}
        ]
    )
    # Record model, input/output, and token usage on the generation
    langfuse_context.update_current_observation(
        model="gpt-4o",
        input=user_question,
        output=response.choices[0].message.content,
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens
        }
    )
    return response.choices[0].message.content

@observe()
def process_query(user_question: str) -> str:
    embed_query(user_question)
    return generate_answer(user_question)

# Trace is captured automatically, with both steps as child observations
result = process_query("What is the capital of France?")
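
The @observe decorator is the most flexible approach, but for plain OpenAI usage Langfuse also ships a drop-in wrapper: import the client through langfuse.openai and every completion is traced without manual instrumentation.

from langfuse.openai import openai  # drop-in replacement for the openai package

client = openai.OpenAI()

# Traced automatically: model, messages, output, token usage, latency
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)

In short-lived scripts, call langfuse_context.flush() before exiting so buffered events are delivered.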

Session Tracking for Conversations

@observe()
def chat(messages: list, session_id: str, user_id: str):
    # Link to session and user
    langfuse_context.update_current_trace(
        session_id=session_id,
        user_id=user_id,
        metadata={"source": "web_chat"}
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )

    return response.choices[0].message.content
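
Passing the same session_id across calls groups the traces into a single conversation view in the dashboard. A hypothetical two-turn exchange:

history = [{"role": "user", "content": "Recommend a sci-fi novel."}]
answer = chat(history, session_id="session-42", user_id="user-7")

history += [
    {"role": "assistant", "content": answer},
    {"role": "user", "content": "Is there a film adaptation?"}
]
chat(history, session_id="session-42", user_id="user-7")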

LLM-as-a-Judge Evaluations (Open-Source June 2025)

Langfuse now ships managed LLM-as-a-judge evaluators. The judge prompt is version-controlled like any other prompt:

from langfuse import Langfuse

langfuse = Langfuse()

# Create an evaluation template
langfuse.create_prompt(
    name="helpfulness-evaluator",
    prompt="""Rate the helpfulness of this AI response on a scale of 1-5.

User Question: {{question}}
AI Response: {{response}}

Criteria:
- 1: Not helpful at all, incorrect or irrelevant
- 2: Minimally helpful, partially addresses the question
- 3: Moderately helpful, addresses the question adequately
- 4: Very helpful, addresses the question well with good detail
- 5: Extremely helpful, comprehensive and insightful

Provide your rating as a single number.""",
    config={
        "model": "gpt-4o-mini",
        "temperature": 0
    }
)

Running the evaluator is configured in the Langfuse dashboard rather than triggered from the SDK: you pick the judge model, map trace input and output fields to the {{question}} and {{response}} template variables, and set a target filter (for example, tag = production) together with a sampling rate. Matching traces are then scored automatically as they arrive; a client-side alternative is sketched below.
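
If you would rather drive the judge from your own code, a batch loop can be assembled from documented SDK primitives (fetch_traces, get_prompt, score). Treat this as a sketch: the shape of trace.input and trace.output depends on what your application logs.

import re
from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()
client = OpenAI()

judge = langfuse.get_prompt("helpfulness-evaluator")

# Sample recent production traces (fetch_traces is paginated)
traces = langfuse.fetch_traces(tags=["production"], limit=100).data

for trace in traces:
    compiled = judge.compile(
        question=str(trace.input),
        response=str(trace.output)
    )
    verdict = client.chat.completions.create(
        model=judge.config["model"],
        temperature=judge.config["temperature"],
        messages=[{"role": "user", "content": compiled}]
    )
    # The template asks for a single 1-5 rating
    match = re.search(r"[1-5]", verdict.choices[0].message.content)
    if match:
        langfuse.score(
            trace_id=trace.id,
            name="helpfulness",
            value=int(match.group())
        )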

Custom Evaluator Functions

import re

from langfuse import Langfuse
from langfuse.decorators import observe
from openai import OpenAI

langfuse = Langfuse()
client = OpenAI()

def extract_score(text: str) -> float:
    # Minimal helper: pull the first number out of the judge's reply
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else 0.0

@observe(as_type="generation")
def evaluate_response(trace_id: str, response: str) -> dict:
    # Your custom evaluation logic
    evaluation = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Evaluate the following response..."},
            {"role": "user", "content": response}
        ]
    )

    score = extract_score(evaluation.choices[0].message.content)

    # Log the score back to the original trace
    langfuse.score(
        trace_id=trace_id,
        name="custom-quality",
        value=score,
        comment="Automated evaluation"
    )

    return {"score": score}

Annotation Queues (Open-Source June 2025)

Create human review workflows:

Annotation queues are created and managed in the dashboard: define a queue (for example, "low-confidence-reviews"), attach the score configs reviewers should fill in, and add traces or observations to it, either one at a time from the trace view or in bulk from a filtered trace list (such as metadata.confidence < 0.7). Reviewers then work through the queue in the UI, and the scores they assign are linked back to the underlying traces automatically.
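
Items can also be pushed into a queue programmatically via the public REST API, which authenticates with HTTP Basic auth (public key as username, secret key as password). The annotation-queue route below matches recent Langfuse versions but should be checked against the API reference for your deployment; the queue ID is hypothetical:

import requests

LANGFUSE_HOST = "https://cloud.langfuse.com"
QUEUE_ID = "copied-from-dashboard"  # hypothetical: copy the real ID from the UI

# Assumed route: POST /api/public/annotation-queues/{queueId}/items
resp = requests.post(
    f"{LANGFUSE_HOST}/api/public/annotation-queues/{QUEUE_ID}/items",
    auth=("pk-lf-...", "sk-lf-..."),
    json={"objectId": "trace-id-to-review", "objectType": "TRACE"}
)
resp.raise_for_status()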

Prompt Management

Version-Controlled Prompts

# Fetch production prompt
prompt = langfuse.get_prompt("customer-support-v2")

# Use in your application
response = client.chat.completions.create(
    model=prompt.config["model"],
    messages=[
        {"role": "system", "content": prompt.compile()},
        {"role": "user", "content": user_message}
    ]
)

# Link the generation to the prompt version for per-version metrics:
# inside an @observe(as_type="generation") function, pass the prompt
# object to the current observation
langfuse_context.update_current_observation(prompt=prompt)
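
New versions are published with create_prompt; the "production" label controls which version get_prompt returns by default, so promoting a prompt is a label change rather than a redeploy. Prompt name and text here are illustrative:

# Publish a new version and make it the default via the label
langfuse.create_prompt(
    name="customer-support-v2",
    prompt="You are a support agent for {{product}}. Be concise.",
    config={"model": "gpt-4o", "temperature": 0.3},
    labels=["production"]
)

# Pin an exact version instead, e.g. for an A/B test arm
candidate = langfuse.get_prompt("customer-support-v2", version=3)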

Self-Hosting Langfuse

# Minimal docker-compose.yml for self-hosted Langfuse.
# Production deployments (Langfuse v3) also need Redis and S3-compatible
# blob storage plus a langfuse-worker service; see the official
# docker-compose in the Langfuse repo for the complete stack.
services:
services:
  langfuse:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/langfuse
      - CLICKHOUSE_URL=http://clickhouse:8123
      - NEXTAUTH_SECRET=${NEXTAUTH_SECRET}
      - SALT=${SALT}
    depends_on:
      - db
      - clickhouse

  db:
    image: postgres:16
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=langfuse
    volumes:
      - postgres_data:/var/lib/postgresql/data

  clickhouse:
    image: clickhouse/clickhouse-server:latest
    volumes:
      - clickhouse_data:/var/lib/clickhouse

volumes:
  postgres_data:
  clickhouse_data:
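
NEXTAUTH_SECRET and SALT must be set to random values before bringing the stack up; the Langfuse docs suggest generating them with openssl:

# Generate required secrets, then start the stack
export NEXTAUTH_SECRET=$(openssl rand -base64 32)
export SALT=$(openssl rand -base64 32)

docker compose up -d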

Integration with LangChain

from langchain_openai import ChatOpenAI
from langfuse.callback import CallbackHandler

# The handler ships with the Langfuse SDK (langfuse.callback),
# not with LangChain itself
handler = CallbackHandler(
    public_key="pk-lf-...",
    secret_key="sk-lf-..."
)

llm = ChatOpenAI(model="gpt-4o")

# Every call that receives the handler is traced, including
# nested chains, tools, and retrievers
response = llm.invoke("What is machine learning?", config={"callbacks": [handler]})
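
The handler also accepts trace attributes, so LangChain traces get the same session and user grouping as the decorator examples above (IDs illustrative):

# Attach session/user context to every trace from this handler
handler = CallbackHandler(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    session_id="session-42",
    user_id="user-7",
    tags=["production"]
)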
