Production Observability
Langfuse: Open-Source LLM Observability
4 min read
Langfuse is the leading open-source platform for LLM observability, offering tracing, evaluation, and prompt management. In June 2025, Langfuse open-sourced its previously commercial features, including LLM-as-a-judge evaluations, annotation queues, and the prompt playground, under the MIT license.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│                      Langfuse Platform                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                     Web Dashboard                     │  │
│  │   ┌──────────┐ ┌──────────┐ ┌─────────────────────┐   │  │
│  │   │  Traces  │ │ Sessions │ │  Prompt Management  │   │  │
│  │   └──────────┘ └──────────┘ └─────────────────────┘   │  │
│  │   ┌──────────┐ ┌──────────┐ ┌─────────────────────┐   │  │
│  │   │  Scores  │ │ Datasets │ │  Annotation Queues  │   │  │
│  │   └──────────┘ └──────────┘ └─────────────────────┘   │  │
│  └───────────────────────────────────────────────────────┘  │
│                         ↑ REST API                          │
│  ┌───────────────────────────────────────────────────────┐  │
│  │             Backend (TypeScript/Node.js)              │  │
│  │          PostgreSQL + ClickHouse (analytics)          │  │
│  └───────────────────────────────────────────────────────┘  │
│                           ↑ SDKs                            │
│   ┌───────────┐ ┌────────────┐ ┌───────────┐ ┌──────────┐   │
│   │  Python   │ │ TypeScript │ │ LangChain │ │ LlamaIdx │   │
│   └───────────┘ └────────────┘ └───────────┘ └──────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
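The SDKs batch events in the background and ship them to the platform's public ingestion endpoint. You would never do this by hand in real code, but a minimal sketch of that traffic helps clarify the architecture; the event envelope below follows the documented POST /api/public/ingestion batch format, authenticated with the public/secret key pair via HTTP basic auth:

import os
import uuid
from datetime import datetime, timezone

import requests

# Sketch of the batched ingestion call the SDKs make under the hood.
# Each event carries an id (idempotency key), a type, a timestamp, and a body.
event = {
    "id": str(uuid.uuid4()),
    "type": "trace-create",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "body": {"name": "manual-trace", "input": "What is the capital of France?"},
}
resp = requests.post(
    os.environ["LANGFUSE_HOST"] + "/api/public/ingestion",
    json={"batch": [event]},
    auth=(os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"]),
)
resp.raise_for_status()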
Quick Start with Python SDK
Installation and Setup
pip install langfuse openai  # the examples below also use the OpenAI SDK
# Set environment variables
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com" # or self-hosted
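Before instrumenting anything, it is worth verifying that the keys resolve against the right project. The SDK's auth_check() makes a blocking network call, so run it once at startup rather than on a hot path:

from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables

# Fails fast on bad keys or a wrong host
assert langfuse.auth_check()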
Basic Tracing
from langfuse import Langfuse, observe
from openai import OpenAI

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables
client = OpenAI()

@observe()
def process_query(user_question: str) -> str:
    # Trace embedding generation as a child span
    with langfuse.start_as_current_span(name="embedding") as span:
        embedding_response = client.embeddings.create(
            model="text-embedding-3-small",
            input=user_question
        )
        span.update(
            input=user_question,
            output={"dimensions": len(embedding_response.data[0].embedding)}
        )

    # Trace the LLM call as a generation (captures model, tokens, and cost)
    with langfuse.start_as_current_generation(
        name="completion", model="gpt-4o", input=user_question
    ) as gen:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": user_question}
            ]
        )
        gen.update(
            output=response.choices[0].message.content,
            usage_details={
                "input": response.usage.prompt_tokens,
                "output": response.usage.completion_tokens
            }
        )
    return response.choices[0].message.content
# Trace is automatically captured
result = process_query("What is the capital of France?")
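The SDK queues events and delivers them from a background thread. In long-running servers that is exactly what you want, but in scripts and batch jobs you should flush before the process exits, or the tail of the trace can be lost:

# Block until all buffered events have been delivered
langfuse.flush()

# Or, at final shutdown, flush and tear down the background workers
langfuse.shutdown()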
Session Tracking for Conversations
@observe()
def chat(messages: list, session_id: str, user_id: str):
    # Link the trace to a session and user for grouping in the dashboard
    langfuse.update_current_trace(
        session_id=session_id,
        user_id=user_id,
        metadata={"source": "web_chat"}
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content
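Reusing the same session_id across turns is what groups them into a single conversation in the Sessions view. A quick sketch of a two-turn exchange (the IDs are placeholders):

import uuid

session_id = str(uuid.uuid4())  # one ID per conversation, reused across turns
history = [{"role": "user", "content": "Recommend a sci-fi novel."}]

answer = chat(history, session_id=session_id, user_id="user-123")
history += [
    {"role": "assistant", "content": answer},
    {"role": "user", "content": "Something older, pre-1980?"},
]
answer = chat(history, session_id=session_id, user_id="user-123")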
LLM-as-a-Judge Evaluations (Open-Source June 2025)
Langfuse now ships built-in LLM-as-a-judge evaluators. The judge's prompt template can be version-controlled like any other prompt:
from langfuse import Langfuse

langfuse = Langfuse()

# Create an evaluation template
langfuse.create_prompt(
    name="helpfulness-evaluator",
    prompt="""Rate the helpfulness of this AI response on a scale of 1-5.

User Question: {{question}}
AI Response: {{response}}

Criteria:
- 1: Not helpful at all, incorrect or irrelevant
- 2: Minimally helpful, partially addresses the question
- 3: Moderately helpful, addresses the question adequately
- 4: Very helpful, addresses the question well with good detail
- 5: Extremely helpful, comprehensive and insightful

Provide your rating as a single number.""",
    config={
        "model": "gpt-4o-mini",
        "temperature": 0
    }
)
Running the evaluator happens server-side rather than through an SDK call: in the Langfuse UI, create an evaluator from the template, pick the judge model (here gpt-4o-mini), map the {{question}} and {{response}} variables to trace input and output, and set a target filter and sampling rate, for example traces tagged production created on or after 2026-01-01, sampled at 100 traces. Langfuse then scores matching traces as they arrive and attaches the results as scores on each trace.
Custom Evaluator Functions
import re

from langfuse import observe

def extract_score(text: str) -> float:
    # Pull the first number out of the judge's reply (e.g. "4" or "Rating: 4/5")
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else 0.0

@observe(as_type="generation")
def evaluate_response(trace_id: str, response: str) -> dict:
    # Your custom evaluation logic
    evaluation = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Evaluate the following response..."},
            {"role": "user", "content": response}
        ]
    )
    score = extract_score(evaluation.choices[0].message.content)

    # Log the score back to the original trace
    langfuse.create_score(
        trace_id=trace_id,
        name="custom-quality",
        value=score,
        comment="Automated evaluation"
    )
    return {"score": score}
Annotation Queues (Open-Source June 2025)
Create human review workflows:
Annotation queues are created in the Langfuse dashboard: give the queue a name (for example, low-confidence-reviews), a description such as "Review traces where model confidence was low", and the score configs annotators should fill in. Traces or individual observations are then added to the queue as items, either manually from a trace view or programmatically via the public API, so you can route every trace whose metadata.confidence falls below 0.7 to human review. Annotators work through the queue in the dashboard, and the scores they add are automatically linked back to the underlying traces.
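If you want to enqueue items from code, the public API exposes annotation-queue endpoints. A rough sketch using requests; the endpoint path and payload follow the public API reference as I understand it, and the queue ID is a placeholder you would copy from the dashboard:

import os

import requests

LANGFUSE_HOST = os.environ["LANGFUSE_HOST"]
AUTH = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

# Add a trace to an existing queue ("queue-id-from-dashboard" is a placeholder)
resp = requests.post(
    f"{LANGFUSE_HOST}/api/public/annotation-queues/queue-id-from-dashboard/items",
    json={"objectId": "trace-id-to-review", "objectType": "TRACE"},
    auth=AUTH,
)
resp.raise_for_status()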
Prompt Management
Version-Controlled Prompts
# Fetch the prompt version currently labeled "production" (the default label)
prompt = langfuse.get_prompt("customer-support-v2")

# Use it in your application, wrapping the call in a generation linked to the
# prompt version so quality and cost can be compared across versions (A/B tests)
with langfuse.start_as_current_generation(name="completion", prompt=prompt) as gen:
    response = client.chat.completions.create(
        model=prompt.config["model"],
        messages=[
            {"role": "system", "content": prompt.compile()},
            {"role": "user", "content": user_message}
        ]
    )
    gen.update(output=response.choices[0].message.content)
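New prompt versions are published from code or the UI, and a label such as production determines which version get_prompt returns, so a rollout or rollback is just a label move. A sketch; the prompt text and {{product}} variable here are illustrative:

# Publish a new version and point the "production" label at it. Existing
# get_prompt("customer-support-v2") callers pick it up on their next
# (cache-expired) fetch — no redeploy needed.
langfuse.create_prompt(
    name="customer-support-v2",
    prompt="You are a senior support agent for {{product}}. Be concise.",
    config={"model": "gpt-4o", "temperature": 0.3},
    labels=["production"],
)

# compile() substitutes the {{product}} variable into the template
prompt = langfuse.get_prompt("customer-support-v2")
system_text = prompt.compile(product="Acme Cloud")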
Self-Hosting Langfuse
# docker-compose.yml for self-hosted Langfuse (v3)
# Simplified for readability: the official docker-compose.yml in the
# langfuse/langfuse repo additionally runs langfuse-worker, Redis, and
# MinIO (S3-compatible object storage), all of which v3 needs in production.
services:
  langfuse:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/langfuse
      - CLICKHOUSE_URL=http://clickhouse:8123
      - CLICKHOUSE_MIGRATION_URL=clickhouse://clickhouse:9000
      - CLICKHOUSE_USER=clickhouse
      - CLICKHOUSE_PASSWORD=clickhouse
      - NEXTAUTH_URL=http://localhost:3000
      - NEXTAUTH_SECRET=${NEXTAUTH_SECRET}
      - SALT=${SALT}
    depends_on:
      - db
      - clickhouse
  db:
    image: postgres:16
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=langfuse
    volumes:
      - postgres_data:/var/lib/postgresql/data
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    environment:
      - CLICKHOUSE_USER=clickhouse
      - CLICKHOUSE_PASSWORD=clickhouse
    volumes:
      - clickhouse_data:/var/lib/clickhouse
volumes:
  postgres_data:
  clickhouse_data:
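Generate the two required secrets and start the stack; NEXTAUTH_SECRET signs dashboard sessions and SALT is used to hash API keys:

export NEXTAUTH_SECRET=$(openssl rand -base64 32)
export SALT=$(openssl rand -base64 32)

docker compose up -d
# Dashboard is now available at http://localhost:3000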
Integration with LangChain
from langchain_openai import ChatOpenAI
from langfuse.langchain import CallbackHandler

# Automatic tracing for LangChain; the handler authenticates via the
# LANGFUSE_* environment variables set earlier
handler = CallbackHandler()

llm = ChatOpenAI(
    model="gpt-4o",
    callbacks=[handler]
)
# All LangChain operations are traced
response = llm.invoke("What is machine learning?")
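Trace attributes can also be attached per invocation through reserved metadata keys, which is how LangChain runs get session and user grouping. A sketch following the key names in the Langfuse LangChain integration docs (the session and user IDs are placeholders):

# Pass the handler and trace attributes per call rather than per model,
# so one LLM instance can serve many users and sessions
response = llm.invoke(
    "What is machine learning?",
    config={
        "callbacks": [handler],
        "metadata": {
            "langfuse_session_id": "session-abc",
            "langfuse_user_id": "user-123",
            "langfuse_tags": ["production"],
        },
    },
)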