LangSmith Deep Dive
Custom Evaluators & Datasets
3 min read
Built-in evaluators cover common cases, but production systems need custom evaluators tailored to your specific quality criteria.
Creating Evaluation Datasets
First, create a dataset to evaluate against:
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="support-qa-eval",
    description="Customer support Q&A evaluation set",
)

# Add examples
client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "What are your business hours?"},
        {"question": "Can I get a refund?"},
    ],
    outputs=[
        {"answer": "Click 'Forgot Password' on the login page..."},
        {"answer": "We're open Monday-Friday, 9 AM to 5 PM..."},
        {"answer": "Refunds are available within 30 days..."},
    ],
    dataset_id=dataset.id,
)
Writing Custom Evaluators
Simple Scoring Function
from langsmith.evaluation import evaluate

def check_length(run, example) -> dict:
    """Evaluator that checks response length."""
    output = run.outputs.get("output", "")
    is_appropriate = 50 <= len(output) <= 500
    return {
        "key": "appropriate_length",
        "score": 1 if is_appropriate else 0,
        "comment": f"Length: {len(output)} characters",
    }
LLM-as-Judge Evaluator
from langsmith.evaluation import LangChainStringEvaluator

# Use an LLM to evaluate
helpfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "helpfulness": "Is the response helpful and actionable?"
        }
    },
)
Custom LLM Judge
import json

from openai import OpenAI

judge_client = OpenAI()

def custom_judge(run, example) -> dict:
    """Custom LLM-as-Judge evaluator."""
    question = example.inputs.get("question", "")
    expected = example.outputs.get("answer", "")
    actual = run.outputs.get("output", "")

    response = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """You are an evaluation judge. Score the response on:
1. Accuracy (matches expected answer semantically)
2. Completeness (covers all key points)
3. Clarity (easy to understand)
Return JSON: {"score": 1-5, "reasoning": "..."}"""
        }, {
            "role": "user",
            "content": f"Question: {question}\nExpected: {expected}\nActual: {actual}"
        }],
        response_format={"type": "json_object"},
    )

    result = json.loads(response.choices[0].message.content)
    return {
        "key": "quality_score",
        "score": result["score"] / 5,  # Normalize to 0-1
        "comment": result["reasoning"],
    }
Running Evaluations
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    """Your LLM application."""
    question = inputs["question"]
    answer = ...  # Your logic here: call your chain/model and build the answer string
    return {"output": answer}

# Run evaluation
results = evaluate(
    my_app,
    data="support-qa-eval",  # Dataset name
    evaluators=[
        check_length,
        custom_judge,
        helpfulness_evaluator,
    ],
    experiment_prefix="v1.2-gpt4o",
)
Viewing Results
In the LangSmith UI:
- Go to Datasets & Testing
- Select your dataset
- Click Compare Experiments
Compare across (the sketch after this list shows a two-model sweep):
- Different prompts
- Different models
- Code changes
- Parameter variations
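As a hedged illustration of a model comparison, the sketch below runs the same dataset and evaluators through two OpenAI models under different experiment prefixes, so the resulting experiments line up in Compare Experiments. `make_app` is an illustrative helper, not a LangSmith API; it assumes the `check_length` and `custom_judge` evaluators defined above and an `OPENAI_API_KEY` in the environment.

```python
from openai import OpenAI
from langsmith.evaluation import evaluate

oai = OpenAI()

def make_app(model: str):
    """Build a target function that answers questions with the given model (illustrative)."""
    def app(inputs: dict) -> dict:
        resp = oai.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": inputs["question"]}],
        )
        return {"output": resp.choices[0].message.content}
    return app

# Same dataset, same evaluators, different model per experiment prefix --
# distinct prefixes make the runs easy to select together in Compare Experiments.
for model in ["gpt-4o", "gpt-4o-mini"]:
    evaluate(
        make_app(model),
        data="support-qa-eval",
        evaluators=[check_length, custom_judge],
        experiment_prefix=f"v1.2-{model}",
    )
```

Each call creates its own experiment; selecting both in the UI gives a side-by-side score table.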
Evaluation Best Practices
| Practice | Why |
|---|---|
| Version your evaluators | Track changes in scoring criteria |
| Use multiple evaluators | Different aspects need different metrics |
| Include baselines | Compare against a known good version |
| Log reasoning | Understand why scores were given |
| Test evaluators too | Ensure evaluators are consistent (see the sketch below) |
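The last row deserves its own example: evaluators are plain functions, so they can be unit-tested with hand-built stand-ins before they ever judge a real run. A minimal sketch, assuming the `check_length` evaluator from above and using `SimpleNamespace` objects as stand-ins for the run and example (the evaluator only reads `run.outputs` here):

```python
from types import SimpleNamespace

def test_check_length_flags_short_answers():
    """Sanity-check the evaluator itself with hand-built run/example stand-ins."""
    fake_run = SimpleNamespace(outputs={"output": "Too short."})
    fake_example = SimpleNamespace(inputs={}, outputs={})

    result = check_length(fake_run, fake_example)

    assert result["key"] == "appropriate_length"
    assert result["score"] == 0  # 10 characters is below the 50-character floor

def test_check_length_accepts_normal_answers():
    fake_run = SimpleNamespace(outputs={"output": "x" * 120})
    result = check_length(fake_run, SimpleNamespace(inputs={}, outputs={}))
    assert result["score"] == 1
```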
Example: Complete Evaluation Pipeline
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate

client = Client()

# 1. Define evaluators
evaluators = [
    check_length,
    custom_judge,
    LangChainStringEvaluator("criteria", config={
        "criteria": {"accuracy": "Is the response factually correct?"}
    }),
]

# 2. Run evaluation
results = evaluate(
    my_app,
    data="support-qa-eval",
    evaluators=evaluators,
    experiment_prefix="prod-v2.1",
    max_concurrency=4,
)

# 3. Check results: to_pandas() (requires pandas) flattens runs and feedback scores into a DataFrame
df = results.to_pandas()
print(df.filter(like="quality_score").mean())
Key Takeaways
- Datasets are versioned: Track changes to your test data
- Evaluators are functions: Simple Python with run and example args
- LLM judges scale: Use LLMs for nuanced evaluation
- Compare experiments: Track quality across versions
- Automate in CI: Run evaluations on every PR (a gating sketch follows below)
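For the CI point, a minimal gating script (a sketch, not an official LangSmith pattern) can run the evaluation and fail the build when the judge's mean score dips below a threshold. It assumes `my_app`, `check_length`, and `custom_judge` from earlier are importable; the 0.7 threshold and the column lookup via `filter(like=...)` are assumptions to adapt to your own `to_pandas()` output.

```python
import sys

from langsmith.evaluation import evaluate

def main() -> int:
    """Run the eval suite and gate the build on the aggregate judge score."""
    results = evaluate(
        my_app,
        data="support-qa-eval",
        evaluators=[check_length, custom_judge],
        experiment_prefix="ci",
    )

    df = results.to_pandas()  # requires pandas
    # Feedback columns are keyed by evaluator key; select the judge's score
    # without assuming the exact prefix to_pandas() uses for feedback columns.
    mean_quality = df.filter(like="quality_score").mean().iloc[0]
    print(f"Mean quality_score: {mean_quality:.2f}")

    threshold = 0.7  # example value -- tune against a known-good baseline
    return 0 if mean_quality >= threshold else 1

if __name__ == "__main__":
    sys.exit(main())
```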
With LangSmith mastered, let's explore MLflow's approach to LLM evaluation.