LangSmith Deep Dive

Custom Evaluators & Datasets

3 min read

Built-in evaluators cover common cases, but production systems need custom evaluators tailored to your specific quality criteria.

Creating Evaluation Datasets

First, create a dataset to evaluate against:

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="support-qa-eval",
    description="Customer support Q&A evaluation set"
)

# Add examples
client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "What are your business hours?"},
        {"question": "Can I get a refund?"}
    ],
    outputs=[
        {"answer": "Click 'Forgot Password' on the login page..."},
        {"answer": "We're open Monday-Friday, 9 AM to 5 PM..."},
        {"answer": "Refunds are available within 30 days..."}
    ],
    dataset_id=dataset.id
)
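
To sanity-check what was stored, you can read the examples back with the same client (continuing from the snippet above; list_examples returns the stored Example objects):

# Verify the dataset contents by reading the examples back
for example in client.list_examples(dataset_id=dataset.id):
    print(example.inputs["question"], "->", example.outputs["answer"])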

Writing Custom Evaluators

Simple Scoring Function

from langsmith.evaluation import evaluate

def check_length(run, example) -> dict:
    """Evaluator that checks response length."""
    output = run.outputs.get("output", "")
    is_appropriate = 50 <= len(output) <= 500
    return {
        "key": "appropriate_length",
        "score": 1 if is_appropriate else 0,
        "comment": f"Length: {len(output)} characters"
    }

LLM-as-Judge Evaluator

from langsmith.evaluation import LangChainStringEvaluator

# Use an LLM to evaluate
helpfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "helpfulness": "Is the response helpful and actionable?"
        }
    }
)

Custom LLM Judge

import json

from openai import OpenAI

judge_client = OpenAI()

def custom_judge(run, example) -> dict:
    """Custom LLM-as-Judge evaluator."""
    question = example.inputs.get("question", "")
    expected = example.outputs.get("answer", "")
    actual = run.outputs.get("output", "")

    response = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """You are an evaluation judge. Score the response on:
            1. Accuracy (matches expected answer semantically)
            2. Completeness (covers all key points)
            3. Clarity (easy to understand)

            Return JSON: {"score": 1-5, "reasoning": "..."}"""
        }, {
            "role": "user",
            "content": f"Question: {question}\nExpected: {expected}\nActual: {actual}"
        }],
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)
    return {
        "key": "quality_score",
        "score": result["score"] / 5,  # Normalize to 0-1
        "comment": result["reasoning"]
    }

Running Evaluations

from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    """Your LLM application."""
    question = inputs["question"]
    answer = "..."  # Your logic here: call your model or chain with `question`
    return {"output": answer}

# Run evaluation
results = evaluate(
    my_app,
    data="support-qa-eval",  # Dataset name
    evaluators=[
        check_length,
        custom_judge,
        helpfulness_evaluator
    ],
    experiment_prefix="v1.2-gpt4o"
)

Viewing Results

In the LangSmith UI:

  1. Go to Datasets & Testing
  2. Select your dataset
  3. Click Compare Experiments

Compare across:

  • Different prompts
  • Different models (see the sketch below)
  • Code changes
  • Parameter variations
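
For example, to compare two models on the same dataset, run the evaluation once per variant with a distinct experiment_prefix. A minimal sketch, where make_app is a hypothetical factory that builds your application around a given model name:

# Run the same dataset through two model variants, then compare them in the UI
for model in ["gpt-4o-mini", "gpt-4o"]:
    app = make_app(model)  # hypothetical: returns a callable like my_app wired to `model`
    evaluate(
        app,
        data="support-qa-eval",
        evaluators=[check_length, custom_judge],
        experiment_prefix=f"compare-{model}",
    )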

Evaluation Best Practices

  • Version your evaluators: track changes in scoring criteria
  • Use multiple evaluators: different aspects need different metrics
  • Include baselines: compare against a known good version
  • Log reasoning: understand why scores were given
  • Test evaluators too: ensure evaluators themselves are consistent (see the sketch below)
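
Evaluators deserve tests of their own. A minimal sketch for check_length, using SimpleNamespace objects as stand-ins for the run and example arguments (these are plain test doubles, not LangSmith types):

from types import SimpleNamespace

def test_check_length():
    # Plain stand-ins exposing the attributes the evaluator reads
    good_run = SimpleNamespace(outputs={"output": "x" * 100})
    short_run = SimpleNamespace(outputs={"output": "too short"})
    example = SimpleNamespace(inputs={}, outputs={})

    assert check_length(good_run, example)["score"] == 1
    assert check_length(short_run, example)["score"] == 0

test_check_length()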

Example: Complete Evaluation Pipeline

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# 1. Define evaluators
evaluators = [
    check_length,
    custom_judge,
    LangChainStringEvaluator("criteria", config={
        "criteria": {"accuracy": "Is the response factually correct?"}
    })
]

# 2. Run evaluation
results = evaluate(
    target=my_app,
    data="support-qa-eval",
    evaluators=evaluators,
    experiment_prefix="prod-v2.1",
    max_concurrency=4
)

# 3. Check results (to_pandas() requires pandas; evaluator feedback appears as "feedback.<key>" columns)
df = results.to_pandas()
print(f"Mean quality score: {df['feedback.quality_score'].mean():.2f}")

Key Takeaways

  1. Datasets are versioned: Track changes to your test data
  2. Evaluators are functions: Simple Python with run and example args
  3. LLM judges scale: Use LLMs for nuanced evaluation
  4. Compare experiments: Track quality across versions
  5. Automate in CI: Run evaluations on every PR (see the gate sketch below)
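
For the CI point, one lightweight pattern is a gating script that runs the evaluation and exits non-zero when the mean judge score drops below a threshold. This is a minimal sketch, assuming a 0.8 threshold and that to_pandas() exposes evaluator feedback as feedback.<key> columns (verify the column names against your own results):

import sys

# Quality gate for CI: run the evaluation and fail the build on regression
results = evaluate(
    my_app,
    data="support-qa-eval",
    evaluators=[check_length, custom_judge],
    experiment_prefix="ci",
)

df = results.to_pandas()  # one row per example; evaluator feedback appears as feedback.<key> columns
mean_quality = df["feedback.quality_score"].mean()  # "quality_score" is the key returned by custom_judge

THRESHOLD = 0.8  # assumed minimum acceptable mean score for this example
if mean_quality < THRESHOLD:
    print(f"Quality gate failed: mean quality_score {mean_quality:.2f} < {THRESHOLD}")
    sys.exit(1)

print(f"Quality gate passed: mean quality_score {mean_quality:.2f}")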

With LangSmith mastered, let's explore MLflow's approach to LLM evaluation.
