Production Observability
LangSmith: LangChain's Observability Platform
3 min read
LangSmith is LangChain's integrated observability and evaluation platform, providing deep visibility into LangChain applications with features like tracing, testing, and the prompt playground.
Architecture and Integration
┌─────────────────────────────────────────────────────────────┐
│ LangSmith Platform │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Web Dashboard │ │
│ │ ┌──────────┐ ┌──────────┐ ┌────────────────────┐ │ │
│ │ │ Traces │ │Playground│ │ Evaluations │ │ │
│ │ └──────────┘ └──────────┘ └────────────────────┘ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌────────────────────┐ │ │
│ │ │ Datasets │ │Annotations│ │ Hub (Prompts) │ │ │
│ │ └──────────┘ └──────────┘ └────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ ↑ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ LangChain Integration │ │
│ │ │ │
│ │ Automatic tracing for: │ │
│ │ • Chains, Agents, Tools │ │
│ │ • LLM calls (any provider) │ │
│ │ • Retrieval operations │ │
│ │ • Custom components │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
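The integration layer also covers custom components: any Python function can be pulled into a trace with the @traceable decorator, and a run_type hint tells LangSmith how to render it. A minimal sketch with a stubbed document-store lookup (assuming the environment setup from the Quick Start below):
from langsmith import traceable

@traceable(run_type="retriever", name="search_docs")
def search_docs(query: str) -> list[dict]:
    # Stubbed lookup against your own document store; runs marked as
    # "retriever" are rendered as retrieved documents in the trace view.
    return [{
        "page_content": "Reset your password from Settings > Security.",
        "metadata": {"source": "kb/password-reset"},
    }]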
Quick Start
Setup
pip install langsmith langchain langchain-openai
# Set environment variables
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="ls__..."
export LANGCHAIN_PROJECT="my-project"
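The same configuration can be set from Python before any chains are created, which is handy in notebooks; a sketch using the same key and project name as above:
import os

# Programmatic equivalent of the exports above (set before building chains)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."   # your LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "my-project"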
Automatic LangChain Tracing
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# All LangChain operations are automatically traced
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant."),
("user", "{question}")
])
chain = prompt | llm | StrOutputParser()
# This entire chain is traced in LangSmith
response = chain.invoke({"question": "What is machine learning?"})
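Per-invocation run names, tags, and metadata can be attached through the standard RunnableConfig, which makes traces easy to filter by version later (see Best Practices below); for example:
# Tag this invocation so it is filterable in the LangSmith UI
response = chain.invoke(
    {"question": "What is machine learning?"},
    config={
        "run_name": "faq_chain",
        "tags": ["v1.2.0", "faq"],
        "metadata": {"user_tier": "free"},
    },
)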
Manual Tracing with @traceable
from langsmith import traceable
from openai import OpenAI
client = OpenAI()
@traceable(name="generate_response")
def generate_response(user_message: str) -> str:
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": user_message}
]
)
return response.choices[0].message.content
@traceable(name="process_query")
def process_query(query: str) -> dict:
# Nested traces are automatically linked
response = generate_response(query)
return {"query": query, "response": response}
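In the snippet above only the two decorated functions are recorded; the raw OpenAI call inside generate_response is not captured as a child LLM run. The SDK's OpenAI wrapper adds that, a small sketch swapping in the wrapped client:
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Calls made through the wrapped client are traced as LLM runs,
# nested under whichever @traceable function invoked them.
client = wrap_openai(OpenAI())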
Prompt Playground
Test and iterate on prompts directly in LangSmith:
from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

client = Client()

# Create a prompt and push it to LangSmith (editable in the Playground)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a customer support agent for {company}. Respond helpfully and professionally."),
    ("user", "{query}"),
])
client.push_prompt("customer_support", object=prompt)

# Pull the latest version back into your application
prompt_template = client.pull_prompt("customer_support")
chain = prompt_template | llm | StrOutputParser()
Dataset-Based Evaluation
Create evaluation datasets and run automated tests:
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
client = Client()
# Create a dataset
dataset = client.create_dataset(
    dataset_name="support_qa_test",
    description="Test cases for customer support"
)
# Add examples
client.create_examples(
inputs=[
{"query": "How do I reset my password?"},
{"query": "What are your business hours?"},
],
outputs=[
{"expected": "password reset instructions"},
{"expected": "hours of operation"},
],
dataset_id=dataset.id
)
# Define your chain/function to evaluate
def my_chain(inputs: dict) -> dict:
response = chain.invoke(inputs)
return {"response": response}
# Run evaluation
results = evaluate(
my_chain,
data=dataset.name,
    evaluators=[
        # Off-the-shelf LLM-as-judge evaluators (wrap LangChain evaluators)
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),
    ],
experiment_prefix="support-v2"
)
Custom Evaluators
from langsmith.evaluation import EvaluationResult
def custom_evaluator(run, example) -> EvaluationResult:
"""Check if response mentions key terms."""
response = run.outputs.get("response", "")
expected_terms = example.outputs.get("expected_terms", [])
matches = sum(1 for term in expected_terms if term.lower() in response.lower())
score = matches / len(expected_terms) if expected_terms else 1.0
return EvaluationResult(
key="term_coverage",
score=score,
comment=f"Matched {matches}/{len(expected_terms)} terms"
)
results = evaluate(
my_chain,
data=dataset.name,
evaluators=[custom_evaluator]
)
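Experiment-level metrics can be added the same way with a summary evaluator, which receives all runs and examples at once; a sketch (the pass/fail rule here is illustrative):
def pass_rate(runs: list, examples: list) -> dict:
    """Fraction of runs that produced a non-empty response."""
    passed = sum(1 for run in runs if run.outputs and run.outputs.get("response"))
    return {"key": "pass_rate", "score": passed / len(runs) if runs else 0.0}

results = evaluate(
    my_chain,
    data=dataset.name,
    evaluators=[custom_evaluator],
    summary_evaluators=[pass_rate],
)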
Annotation Queues
Set up human review workflows:
# Create an annotation queue
queue = client.create_annotation_queue(
name="low_confidence_reviews",
description="Review responses with low confidence"
)
# Add runs to the queue programmatically
from langsmith.run_helpers import get_current_run_tree

@traceable
def generate_with_review(query: str):
    response = chain.invoke({"query": query})
    # estimate_confidence is an application-specific heuristic (not shown here)
    confidence = estimate_confidence(response)
    if confidence < 0.7:
        # Send the current run to the annotation queue for human review
        run = get_current_run_tree()
        client.add_runs_to_annotation_queue(queue.id, run_ids=[run.id])
    return response
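Reviewer scores land on runs as feedback, and the same Client API can record programmatic or end-user feedback (for example a thumbs-up from your UI), which appears alongside human annotations; run_id below is assumed to be captured at request time (e.g. via get_current_run_tree()):
# Attach end-user feedback to a traced run
client.create_feedback(
    run_id,
    key="user_rating",
    score=1.0,
    comment="Thumbs up from the in-app widget",
)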
A/B Testing with Experiments
Compare different configurations:
# Run experiment with variant A
results_a = evaluate(
chain_v1,
data="test_dataset",
experiment_prefix="chain-v1"
)
# Run experiment with variant B
results_b = evaluate(
chain_v2,
data="test_dataset",
experiment_prefix="chain-v2"
)
# Compare in LangSmith dashboard
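Attaching metadata to each experiment makes the variants easier to filter and compare in the Experiments view; for example, the variant-B call with labels added (the labels themselves are illustrative):
results_b = evaluate(
    chain_v2,
    data="test_dataset",
    experiment_prefix="chain-v2",
    metadata={"variant": "B", "prompt_version": "v2"},
)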
Production Monitoring
from langsmith import Client
from datetime import datetime, timedelta
client = Client()
# Query recent runs
runs = client.list_runs(
project_name="production",
start_time=datetime.now() - timedelta(hours=24),
filter='eq(status, "error")' # Find errors
)
# Analyze the failed runs
for run in runs:
    latency_s = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    print(f"Error: {run.error}")
    print(f"Latency: {latency_s}s")
    print(f"Tokens: {run.total_tokens}")
Best Practices
┌─────────────────────────────────────────────────────────────┐
│ LangSmith Best Practices │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. Project Organization │
│ ├── Separate projects for dev/staging/prod │
│ ├── Use consistent naming conventions │
│ └── Tag runs with version info │
│ │
│ 2. Evaluation Strategy │
│ ├── Create representative test datasets │
│ ├── Run evals before each deployment │
│ ├── Track metrics over time │
│ └── Set up regression alerts │
│ │
│ 3. Prompt Management │
│ ├── Version prompts in LangChain Hub │
│ ├── Test in playground before prod │
│ └── Link generations to prompt versions │
│ │
│ 4. Human Feedback │
│ ├── Set up annotation queues │
│ ├── Route edge cases for review │
│ └── Use feedback to improve evaluators │
│ │
└─────────────────────────────────────────────────────────────┘
LangSmith vs Alternatives
| Feature | LangSmith | Langfuse | Helicone |
|---|---|---|---|
| LangChain integration | Native | Good | Basic |
| Self-hosting | No | Yes | Yes |
| Prompt playground | Excellent | Good | Basic |
| Evaluation framework | Built-in | Built-in | Limited |
| Pricing | Paid tiers | Open-source core | Free tier |
| Best for | LangChain apps | Open-source needs | High-scale proxy |