W&B Weave for Evaluation
LLM-as-Judge in Weave
3 min read
When you need nuanced evaluation beyond simple matching, use LLMs as judges. Weave makes it easy to build LLM-powered scorers that evaluate quality, helpfulness, and other subjective criteria.
Why LLM-as-Judge in Weave?
| Benefit | Description |
|---|---|
| Nuanced evaluation | Assess tone, helpfulness, coherence |
| Scalable | Evaluate thousands of examples automatically |
| Customizable | Define your own criteria |
| Tracked | All judge calls logged and versioned |
Basic LLM Judge
Create a simple LLM-powered scorer:
import json

import weave
from openai import OpenAI
weave.init('my-team/my-project')
client = OpenAI()
@weave.op()
def helpfulness_judge(output: str, question: str) -> dict:
"""Use an LLM to evaluate helpfulness."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": """You are an evaluation judge. Rate the helpfulness
of the response on a scale of 1-5.
1 = Not helpful at all
3 = Somewhat helpful
5 = Extremely helpful
Return only a JSON object: {"score": <number>, "reasoning": "<explanation>"}"""
}, {
"role": "user",
"content": f"Question: {question}\n\nResponse: {output}"
}],
response_format={"type": "json_object"}
)
result = json.loads(response.choices[0].message.content)
return {
"helpfulness_score": result["score"] / 5.0, # Normalize to 0-1
"reasoning": result["reasoning"]
}
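To sanity-check a judge before wiring it into an evaluation, call it directly. Because it is decorated with @weave.op(), the call (including the prompt sent to the judge) is traced in your project. A minimal sketch; the question and answer below are made up for illustration:
# Quick sanity check with an illustrative question/answer pair
result = helpfulness_judge(
    output="Go to Settings > Security > Reset password and follow the emailed link.",
    question="How do I reset my password?"
)
print(result["helpfulness_score"], "-", result["reasoning"])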
Multi-Criteria Judge
Evaluate on multiple dimensions:
@weave.op()
def quality_judge(output: str, question: str, context: str | None = None) -> dict:
"""Evaluate response on multiple quality dimensions."""
criteria_prompt = """
Evaluate this response on these criteria (1-5 each):
1. **Accuracy**: Is the information correct?
2. **Completeness**: Does it fully answer the question?
3. **Clarity**: Is it easy to understand?
4. **Tone**: Is the tone appropriate and professional?
Return JSON: {
"accuracy": <1-5>,
"completeness": <1-5>,
"clarity": <1-5>,
"tone": <1-5>,
"overall": <1-5>,
"feedback": "<specific feedback>"
}
"""
    # Include the optional context so the judge can check accuracy against it
    user_content = f"Question: {question}\nResponse: {output}"
    if context:
        user_content += f"\nContext: {context}"
    response = client.chat.completions.create(
        model="gpt-4o",  # Use a stronger model for judging
        messages=[
            {"role": "system", "content": criteria_prompt},
            {"role": "user", "content": user_content}
],
response_format={"type": "json_object"}
)
result = json.loads(response.choices[0].message.content)
# Normalize all scores to 0-1
return {
"accuracy": result["accuracy"] / 5.0,
"completeness": result["completeness"] / 5.0,
"clarity": result["clarity"] / 5.0,
"tone": result["tone"] / 5.0,
"overall": result["overall"] / 5.0,
"feedback": result["feedback"]
}
Pairwise Comparison Judge
Compare two responses directly:
@weave.op()
def pairwise_judge(response_a: str, response_b: str, question: str) -> dict:
"""Judge which response is better."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": """Compare these two responses to the same question.
Which is better? Return JSON:
{
"winner": "A" or "B" or "tie",
"reasoning": "<explanation>"
}"""
}, {
"role": "user",
"content": f"""Question: {question}
Response A: {response_a}
Response B: {response_b}"""
}],
response_format={"type": "json_object"}
)
result = json.loads(response.choices[0].message.content)
return {
"winner": result["winner"],
"a_wins": 1.0 if result["winner"] == "A" else 0.0,
"reasoning": result["reasoning"]
}
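A common way to use a pairwise judge is to compute a win rate across a set of questions. A minimal sketch, assuming you already have paired responses from two model variants in memory (the data below is hypothetical):
# Hypothetical paired responses from two model variants
pairs = [
    {"question": "How do I export my data?",
     "response_a": "Go to Settings > Export and choose CSV or JSON.",
     "response_b": "Exporting is not currently supported."},
    # ... more pairs
]

a_wins = 0.0
for pair in pairs:
    verdict = pairwise_judge(pair["response_a"], pair["response_b"], pair["question"])
    a_wins += verdict["a_wins"]

print(f"Model A win rate: {a_wins / len(pairs):.2f}")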
Using LLM Judges in Evaluation
Integrate judges into your evaluation pipeline:
import weave
weave.init('my-team/my-project')
# Your model
class SupportBot(weave.Model):
@weave.op()
def predict(self, question: str) -> str:
        # Replace with your model's generation logic; it must return a string
        return "..."
# Evaluation with LLM judges
evaluation = weave.Evaluation(
dataset=weave.ref("support-test-cases").get(),
scorers=[
helpfulness_judge,
quality_judge
]
)
# Run evaluation (evaluate() is async; use asyncio.run(...) outside a notebook)
model = SupportBot()
results = await evaluation.evaluate(model)

# View aggregate scores; results are grouped per scorer, e.g.:
print(results)
# {
#   "helpfulness_judge": {"helpfulness_score": {"mean": 0.82}},
#   "quality_judge": {"accuracy": {"mean": 0.88}, "completeness": {"mean": 0.75}, ...},
#   ...
# }
Best Practices for LLM Judges
| Practice | Why |
|---|---|
| Use structured output | JSON ensures parseable results |
| Include reasoning | Helps debug scoring decisions |
| Normalize scores | 0-1 range for easy comparison |
| Use strong judge models | GPT-4 class for nuanced evaluation |
| Add examples | Few-shot examples improve consistency (see the sketch below) |
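As an example of the "add examples" practice, you can embed a couple of anchor ratings directly in the judge's system prompt. This is a sketch of a few-shot variant of the earlier helpfulness judge; the helpfulness_judge_fewshot name and the example Q&A pairs are illustrative:
@weave.op()
def helpfulness_judge_fewshot(output: str, question: str) -> dict:
    """Helpfulness judge with few-shot anchor examples in the prompt."""
    system_prompt = """You are an evaluation judge. Rate the helpfulness of the response on a scale of 1-5.
Examples:
Q: "How do I cancel my plan?" R: "Figure it out yourself." -> {"score": 1, "reasoning": "Dismissive, no actionable steps."}
Q: "How do I cancel my plan?" R: "Go to Billing > Cancel plan and confirm. You keep access until the billing period ends." -> {"score": 5, "reasoning": "Direct, complete, and actionable."}
Return only a JSON object: {"score": <number>, "reasoning": "<explanation>"}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {output}"}
        ],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return {
        "helpfulness_score": result["score"] / 5.0,  # Normalize to 0-1
        "reasoning": result["reasoning"]
    }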
Calibrating Your Judge
Verify judge accuracy on known examples:
# Known good/bad examples for calibration
calibration_set = [
{"output": "excellent response", "expected_score": 0.9},
{"output": "poor response", "expected_score": 0.2},
{"output": "average response", "expected_score": 0.5}
]
# Check if judge scores match expectations
for example in calibration_set:
result = helpfulness_judge(example["output"], "test question")
assert abs(result["helpfulness_score"] - example["expected_score"]) < 0.2
Tracking Judge Consistency
All LLM judge calls are logged in Weave:
- See exact prompts sent to judge
- Review judge responses
- Identify inconsistent scoring (a repeatability check is sketched below)
- Debug unexpected results
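One simple way to surface inconsistency is to score the same response several times and look at the spread; each call appears as a separate trace in Weave, so you can compare the judge's reasoning run by run. A minimal sketch with an illustrative question and answer:
import statistics

# Repeatability check: score the same response several times and inspect the spread
question = "How do I reset my password?"
answer = "Go to Settings > Security > Reset password and follow the emailed link."

scores = [helpfulness_judge(answer, question)["helpfulness_score"] for _ in range(5)]
print("mean:", round(statistics.mean(scores), 2), "stdev:", round(statistics.pstdev(scores), 3))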
Tip: Log the judge's reasoning alongside scores. This helps debug why certain responses scored high or low.
With all three tools mastered, let's explore production monitoring patterns and next steps.