Production Observability

LangSmith: LangChain's Observability Platform

3 min read

LangSmith is LangChain's integrated observability and evaluation platform. It provides deep visibility into LLM applications, with or without LangChain, through tracing, dataset-based evaluation, and a prompt playground.

Architecture and Integration

┌─────────────────────────────────────────────────────────────┐
│                   LangSmith Platform                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                    Web Dashboard                     │   │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────────────┐   │   │
│  │  │  Traces  │ │Playground│ │    Evaluations     │   │   │
│  │  └──────────┘ └──────────┘ └────────────────────┘   │   │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────────────┐   │   │
│  │  │ Datasets │ │Annotations│ │   Hub (Prompts)   │   │   │
│  │  └──────────┘ └──────────┘ └────────────────────┘   │   │
│  └─────────────────────────────────────────────────────┘   │
│                          ↑                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              LangChain Integration                   │   │
│  │                                                      │   │
│  │  Automatic tracing for:                              │   │
│  │  • Chains, Agents, Tools                             │   │
│  │  • LLM calls (any provider)                          │   │
│  │  • Retrieval operations                              │   │
│  │  • Custom components                                 │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Start

Setup

pip install langsmith langchain langchain-openai

# Set environment variables
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="ls__..."
export LANGCHAIN_PROJECT="my-project"
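
The same configuration can also be applied from Python before any chains are created, which is convenient in notebooks. A minimal sketch (the API key value is a placeholder):

import os

# Equivalent to the shell exports above; set these before constructing
# any LangChain or LangSmith objects so tracing is picked up
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."   # placeholder key
os.environ["LANGCHAIN_PROJECT"] = "my-project"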

Automatic LangChain Tracing

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# All LangChain operations are automatically traced
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

# This entire chain is traced in LangSmith
response = chain.invoke({"question": "What is machine learning?"})
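
Runs can also be named, tagged, and annotated with metadata at invocation time through the standard RunnableConfig; these values appear on the trace in LangSmith and can be filtered on later. The tag and metadata values below are arbitrary examples:

# Attach a run name, tags, and metadata to this traced invocation
response = chain.invoke(
    {"question": "What is machine learning?"},
    config={
        "run_name": "qa_chain",
        "tags": ["docs-example", "v1"],
        "metadata": {"prompt_version": "2024-06"},
    },
)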

Manual Tracing with @traceable

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(name="generate_response")
def generate_response(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": user_message}
        ]
    )
    return response.choices[0].message.content

@traceable(name="process_query")
def process_query(query: str) -> dict:
    # Nested traces are automatically linked
    response = generate_response(query)
    return {"query": query, "response": response}
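
Raw OpenAI calls can also be captured by wrapping the client with langsmith.wrappers.wrap_openai; each completion call then appears as a child LLM run under the surrounding @traceable span. A minimal sketch (the summarize function is just an illustrative example):

from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrapped client: every chat.completions.create call is traced as an LLM run
client = wrap_openai(OpenAI())

@traceable(name="summarize")
def summarize(text: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize this: {text}"}],
    )
    return result.choices[0].message.content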

Prompt Playground

Test and iterate on prompts directly in LangSmith:

from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

client = Client()

# Push a prompt to LangSmith so it can be opened, edited, and tested
# in the Playground (and versioned in the Hub)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a customer support agent for {company}."),
    ("user", "{query}\n\nRespond helpfully and professionally."),
])

client.push_prompt("customer_support", object=prompt)

# Use the prompt in your application
from langchain import hub

prompt_template = hub.pull("customer_support")
chain = prompt_template | llm | StrOutputParser()

Dataset-Based Evaluation

Create evaluation datasets and run automated tests:

from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="support_qa_test",
    description="Test cases for customer support"
)

# Add examples
client.create_examples(
    inputs=[
        {"query": "How do I reset my password?"},
        {"query": "What are your business hours?"},
    ],
    outputs=[
        {"expected": "password reset instructions"},
        {"expected": "hours of operation"},
    ],
    dataset_id=dataset.id
)

# Define your chain/function to evaluate
def my_chain(inputs: dict) -> dict:
    # The chain defined earlier expects a "question" key
    response = chain.invoke({"question": inputs["query"]})
    return {"response": response}

# Run evaluation
results = evaluate(
    my_chain,
    data=dataset.name,
    evaluators=[
        # Prebuilt LLM-as-judge evaluators from LangChain
        LangChainStringEvaluator("qa"),  # correctness vs. the reference output
        LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),
    ],
    experiment_prefix="support-v2"
)

Custom Evaluators

from langsmith.evaluation import EvaluationResult

def custom_evaluator(run, example) -> EvaluationResult:
    """Check whether the response mentions key terms.

    Expects dataset examples whose outputs include an "expected_terms" list.
    """
    response = run.outputs.get("response", "")
    expected_terms = example.outputs.get("expected_terms", [])

    matches = sum(1 for term in expected_terms if term.lower() in response.lower())
    score = matches / len(expected_terms) if expected_terms else 1.0

    return EvaluationResult(
        key="term_coverage",
        score=score,
        comment=f"Matched {matches}/{len(expected_terms)} terms"
    )

results = evaluate(
    my_chain,
    data=dataset.name,
    evaluators=[custom_evaluator]
)
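
Custom evaluators do not have to construct an EvaluationResult; returning a plain dict with a key and a score is also accepted, and such a function can be passed to evaluate alongside custom_evaluator above. A small deterministic sketch (the 40-character threshold is arbitrary):

def length_check(run, example) -> dict:
    """Flag responses that are too short to be useful."""
    response = run.outputs.get("response", "")
    return {"key": "min_length", "score": 1.0 if len(response) >= 40 else 0.0}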

Annotation Queues

Set up human review workflows:

# Create an annotation queue
queue = client.create_annotation_queue(
    name="low_confidence_reviews",
    description="Review responses with low confidence"
)

# Add runs to the queue programmatically
from langsmith import get_current_run_tree, traceable

@traceable
def generate_with_review(query: str):
    response = chain.invoke({"question": query})

    # estimate_confidence is a placeholder for your own heuristic
    confidence = estimate_confidence(response)
    if confidence < 0.7:
        # Queue the current run for human review
        run = get_current_run_tree()
        client.add_runs_to_annotation_queue(queue.id, run_ids=[run.id])

    return response

A/B Testing with Experiments

Compare different configurations:

# Run experiment with variant A
results_a = evaluate(
    chain_v1,
    data="test_dataset",
    experiment_prefix="chain-v1"
)

# Run experiment with variant B
results_b = evaluate(
    chain_v2,
    data="test_dataset",
    experiment_prefix="chain-v2"
)

# Compare in LangSmith dashboard
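
The chain_v1 and chain_v2 above are assumed to be two variants of the same pipeline. A hypothetical sketch in which the variants differ only in the underlying model:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{question}"),
])

# Variant A: smaller, cheaper model; Variant B: larger model
chain_v1 = qa_prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
chain_v2 = qa_prompt | ChatOpenAI(model="gpt-4o") | StrOutputParser()

If your langsmith version does not accept a Runnable directly as the evaluate target, wrap each variant in a small function that returns a dict, as with my_chain above.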

Production Monitoring

from langsmith import Client
from datetime import datetime, timedelta

client = Client()

# Query recent failed runs
runs = client.list_runs(
    project_name="production",
    start_time=datetime.now() - timedelta(hours=24),
    error=True,  # only runs that raised an error
)

# Analyze metrics
for run in runs:
    latency = (run.end_time - run.start_time).total_seconds()
    print(f"Error: {run.error}")
    print(f"Latency: {latency:.2f}s")
    print(f"Tokens: {run.total_tokens}")
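
The same query interface supports rough, client-side health metrics. The sketch below assumes the is_root filter to restrict results to top-level traces:

# Error rate across top-level traces in the last 24 hours
recent = list(client.list_runs(
    project_name="production",
    start_time=datetime.now() - timedelta(hours=24),
    is_root=True,  # top-level traces only
))
failed = [r for r in recent if r.error]
rate = len(failed) / len(recent) if recent else 0.0
print(f"Traces: {len(recent)}, error rate: {rate:.1%}")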

Best Practices

┌─────────────────────────────────────────────────────────────┐
│              LangSmith Best Practices                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Project Organization                                    │
│     ├── Separate projects for dev/staging/prod              │
│     ├── Use consistent naming conventions                   │
│     └── Tag runs with version info                          │
│                                                             │
│  2. Evaluation Strategy                                     │
│     ├── Create representative test datasets                 │
│     ├── Run evals before each deployment                    │
│     ├── Track metrics over time                             │
│     └── Set up regression alerts                            │
│                                                             │
│  3. Prompt Management                                       │
│     ├── Version prompts in LangChain Hub                    │
│     ├── Test in playground before prod                      │
│     └── Link generations to prompt versions                 │
│                                                             │
│  4. Human Feedback                                          │
│     ├── Set up annotation queues                            │
│     ├── Route edge cases for review                         │
│     └── Use feedback to improve evaluators                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
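
Human and programmatic feedback (practice 4 above) can be attached to any traced run and later used to filter runs or assemble datasets. A minimal sketch, assuming run_id holds the ID of a previously traced run and the feedback key is your own convention:

from langsmith import Client

client = Client()

# Attach a score and a free-text comment to an existing run
client.create_feedback(
    run_id=run_id,       # ID of the run being rated (assumed available)
    key="user_rating",   # feedback key: a project-level convention
    score=1,             # e.g. 1 = helpful, 0 = not helpful
    comment="Resolved the customer's issue on the first reply",
)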

LangSmith vs Alternatives

Feature                 LangSmith         Langfuse            Helicone
LangChain integration   Native            Good                Basic
Self-hosting            Enterprise only   Yes                 Yes
Prompt playground       Excellent         Good                Basic
Evaluation framework    Built-in          Built-in            Limited
Pricing                 Paid tiers        Open-source core    Free tier
Best for                LangChain apps    Open-source needs   High-scale proxy
