LangSmith Deep Dive

Custom Evaluators & Datasets

3 min read

Built-in evaluators cover common cases, but production systems need custom evaluators tailored to your specific quality criteria.

Creating Evaluation Datasets

First, create a dataset to evaluate against:

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="support-qa-eval",
    description="Customer support Q&A evaluation set"
)

# Add examples
client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "What are your business hours?"},
        {"question": "Can I get a refund?"}
    ],
    outputs=[
        {"answer": "Click 'Forgot Password' on the login page..."},
        {"answer": "We're open Monday-Friday, 9 AM to 5 PM..."},
        {"answer": "Refunds are available within 30 days..."}
    ],
    dataset_id=dataset.id
)
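
To sanity-check what was stored, you can read the examples back with the same client (continuing from the snippet above; list_examples returns the stored Example objects):

# Verify the dataset contents by reading the examples back
for example in client.list_examples(dataset_id=dataset.id):
    print(example.inputs["question"], "->", example.outputs["answer"])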

Writing Custom Evaluators

Simple Scoring Function

from langsmith.evaluation import evaluate

def check_length(run, example) -> dict:
    """Evaluator that checks response length."""
    output = run.outputs.get("output", "")
    is_appropriate = 50 <= len(output) <= 500
    return {
        "key": "appropriate_length",
        "score": 1 if is_appropriate else 0,
        "comment": f"Length: {len(output)} characters"
    }

LLM-as-Judge Evaluator

from langsmith.evaluation import LangChainStringEvaluator

# Use an LLM to evaluate
helpfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "helpfulness": "Is the response helpful and actionable?"
        }
    }
)

Custom LLM Judge

import json

from openai import OpenAI

judge_client = OpenAI()

def custom_judge(run, example) -> dict:
    """Custom LLM-as-Judge evaluator."""
    question = example.inputs.get("question", "")
    expected = example.outputs.get("answer", "")
    actual = run.outputs.get("output", "")

    response = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """You are an evaluation judge. Score the response on:
            1. Accuracy (matches expected answer semantically)
            2. Completeness (covers all key points)
            3. Clarity (easy to understand)

            Return JSON: {"score": 1-5, "reasoning": "..."}"""
        }, {
            "role": "user",
            "content": f"Question: {question}\nExpected: {expected}\nActual: {actual}"
        }],
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)
    return {
        "key": "quality_score",
        "score": result["score"] / 5,  # Normalize to 0-1
        "comment": result["reasoning"]
    }

Running Evaluations

from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    """Your LLM application."""
    question = inputs["question"]
    answer = "..."  # Your logic here: call your model or chain with `question`
    return {"output": answer}

# Run evaluation
results = evaluate(
    my_app,
    data="support-qa-eval",  # Dataset name
    evaluators=[
        check_length,
        custom_judge,
        helpfulness_evaluator
    ],
    experiment_prefix="v1.2-gpt4o"
)

Viewing Results

In the LangSmith UI:

  1. Go to Datasets & Testing
  2. Select your dataset
  3. Click Compare Experiments

Compare across:

  • Different prompts
  • Different models (see the sketch below)
  • Code changes
  • Parameter variations
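
For example, to compare two models on the same dataset, run the evaluation once per variant with a distinct experiment_prefix. A minimal sketch, where make_app is a hypothetical factory that builds your application around a given model name:

# Run the same dataset through two model variants, then compare them in the UI
for model in ["gpt-4o-mini", "gpt-4o"]:
    app = make_app(model)  # hypothetical: returns a callable like my_app wired to `model`
    evaluate(
        app,
        data="support-qa-eval",
        evaluators=[check_length, custom_judge],
        experiment_prefix=f"compare-{model}",
    )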

Evaluation Best Practices

  • Version your evaluators: track changes in scoring criteria
  • Use multiple evaluators: different aspects need different metrics
  • Include baselines: compare against a known good version
  • Log reasoning: understand why scores were given
  • Test evaluators too: ensure evaluators themselves are consistent (see the sketch below)
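
Evaluators deserve tests of their own. A minimal sketch for check_length, using SimpleNamespace objects as stand-ins for the run and example arguments (these are plain test doubles, not LangSmith types):

from types import SimpleNamespace

def test_check_length():
    # Plain stand-ins exposing the attributes the evaluator reads
    good_run = SimpleNamespace(outputs={"output": "x" * 100})
    short_run = SimpleNamespace(outputs={"output": "too short"})
    example = SimpleNamespace(inputs={}, outputs={})

    assert check_length(good_run, example)["score"] == 1
    assert check_length(short_run, example)["score"] == 0

test_check_length()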

Example: Complete Evaluation Pipeline

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# 1. Define evaluators
evaluators = [
    check_length,
    custom_judge,
    LangChainStringEvaluator("criteria", config={
        "criteria": {"accuracy": "Is the response factually correct?"}
    })
]

# 2. Run evaluation
results = evaluate(
    target=my_app,
    data="support-qa-eval",
    evaluators=evaluators,
    experiment_prefix="prod-v2.1",
    max_concurrency=4
)

# 3. Check results (to_pandas() requires pandas; evaluator feedback appears as "feedback.<key>" columns)
df = results.to_pandas()
print(f"Mean quality score: {df['feedback.quality_score'].mean():.2f}")

Key Takeaways

  1. Datasets are versioned: Track changes to your test data
  2. Evaluators are functions: Simple Python with run and example args
  3. LLM judges scale: Use LLMs for nuanced evaluation
  4. Compare experiments: Track quality across versions
  5. Automate in CI: Run evaluations on every PR (see the gate sketch below)
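
For the CI point, one lightweight pattern is a gating script that runs the evaluation and exits non-zero when the mean judge score drops below a threshold. This is a minimal sketch, assuming a 0.8 threshold and that to_pandas() exposes evaluator feedback as feedback.<key> columns (verify the column names against your own results):

import sys

# Quality gate for CI: run the evaluation and fail the build on regression
results = evaluate(
    my_app,
    data="support-qa-eval",
    evaluators=[check_length, custom_judge],
    experiment_prefix="ci",
)

df = results.to_pandas()  # one row per example; evaluator feedback appears as feedback.<key> columns
mean_quality = df["feedback.quality_score"].mean()  # "quality_score" is the key returned by custom_judge

THRESHOLD = 0.8  # assumed minimum acceptable mean score for this example
if mean_quality < THRESHOLD:
    print(f"Quality gate failed: mean quality_score {mean_quality:.2f} < {THRESHOLD}")
    sys.exit(1)

print(f"Quality gate passed: mean quality_score {mean_quality:.2f}")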

With LangSmith mastered, let's explore MLflow's approach to LLM evaluation.
