W&B Weave for Evaluation

LLM-as-Judge in Weave


When you need nuanced evaluation beyond simple matching, use LLMs as judges. Weave makes it easy to build LLM-powered scorers that evaluate quality, helpfulness, and other subjective criteria.

Why LLM-as-Judge in Weave?

| Benefit | Description |
| --- | --- |
| Nuanced evaluation | Assess tone, helpfulness, coherence |
| Scalable | Evaluate thousands of examples automatically |
| Customizable | Define your own criteria |
| Tracked | All judge calls logged and versioned |

Basic LLM Judge

Create a simple LLM-powered scorer:

import json

import weave
from openai import OpenAI

weave.init('my-team/my-project')

client = OpenAI()

@weave.op()
def helpfulness_judge(output: str, question: str) -> dict:
    """Use an LLM to evaluate helpfulness."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """You are an evaluation judge. Rate the helpfulness
            of the response on a scale of 1-5.

            1 = Not helpful at all
            3 = Somewhat helpful
            5 = Extremely helpful

            Return only a JSON object: {"score": <number>, "reasoning": "<explanation>"}"""
        }, {
            "role": "user",
            "content": f"Question: {question}\n\nResponse: {output}"
        }],
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)
    return {
        "helpfulness_score": result["score"] / 5.0,  # Normalize to 0-1
        "reasoning": result["reasoning"]
    }
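
Because the scorer is just a weave.op-decorated function, you can call it directly to sanity-check the prompt before wiring it into an evaluation. A quick sketch (the question and response strings below are only placeholders):

# Sanity-check the judge on a single hand-written example
check = helpfulness_judge(
    output="Restart your router, then run the speed test again to confirm the fix.",
    question="My internet is slow. What should I do?"
)
print(check["helpfulness_score"], check["reasoning"])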

Multi-Criteria Judge

Evaluate on multiple dimensions:

@weave.op()
def quality_judge(output: str, question: str, context: str | None = None) -> dict:
    """Evaluate response on multiple quality dimensions."""
    criteria_prompt = """
    Evaluate this response on these criteria (1-5 each):

    1. **Accuracy**: Is the information correct?
    2. **Completeness**: Does it fully answer the question?
    3. **Clarity**: Is it easy to understand?
    4. **Tone**: Is the tone appropriate and professional?

    Return JSON: {
        "accuracy": <1-5>,
        "completeness": <1-5>,
        "clarity": <1-5>,
        "tone": <1-5>,
        "overall": <1-5>,
        "feedback": "<specific feedback>"
    }
    """

    user_content = f"Question: {question}\nResponse: {output}"
    if context:
        user_content += f"\nContext: {context}"

    response = client.chat.completions.create(
        model="gpt-4o",  # Use a stronger model for judging
        messages=[
            {"role": "system", "content": criteria_prompt},
            {"role": "user", "content": user_content}
        ],
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)

    # Normalize all scores to 0-1
    return {
        "accuracy": result["accuracy"] / 5.0,
        "completeness": result["completeness"] / 5.0,
        "clarity": result["clarity"] / 5.0,
        "tone": result["tone"] / 5.0,
        "overall": result["overall"] / 5.0,
        "feedback": result["feedback"]
    }

Pairwise Comparison Judge

Compare two responses directly:

@weave.op()
def pairwise_judge(response_a: str, response_b: str, question: str) -> dict:
    """Judge which response is better."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Compare these two responses to the same question.
            Which is better? Return JSON:
            {
                "winner": "A" or "B" or "tie",
                "reasoning": "<explanation>"
            }"""
        }, {
            "role": "user",
            "content": f"""Question: {question}

            Response A: {response_a}

            Response B: {response_b}"""
        }],
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)
    return {
        "winner": result["winner"],
        "a_wins": 1.0 if result["winner"] == "A" else 0.0,
        "reasoning": result["reasoning"]
    }
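
LLM judges often favor whichever response they see first, so it is worth running the comparison in both orders and confirming the verdicts agree. A minimal sketch, using placeholder responses:

# Run the comparison in both orders to check for position bias
question = "How do I reset my password?"
resp_1 = "Click 'Forgot password' on the login page and follow the emailed link."
resp_2 = "Contact support."

forward = pairwise_judge(resp_1, resp_2, question)   # resp_1 shown as "A"
reverse = pairwise_judge(resp_2, resp_1, question)   # resp_1 shown as "B"

consistent = (
    (forward["winner"] == "A" and reverse["winner"] == "B")
    or (forward["winner"] == "B" and reverse["winner"] == "A")
    or (forward["winner"] == "tie" and reverse["winner"] == "tie")
)
print("Verdicts agree across orderings:", consistent)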

Using LLM Judges in Evaluation

Integrate judges into your evaluation pipeline:

import asyncio

import weave

weave.init('my-team/my-project')

# Your model
class SupportBot(weave.Model):
    @weave.op()
    def predict(self, question: str) -> str:
        # Generate response
        pass

# Evaluation with LLM judges
evaluation = weave.Evaluation(
    dataset=weave.ref("support-test-cases").get(),
    scorers=[
        helpfulness_judge,
        quality_judge
    ]
)

# Run evaluation (evaluate() is async; use asyncio.run at top level, or await it inside async code)
model = SupportBot()
results = asyncio.run(evaluation.evaluate(model))

# View aggregate scores returned by evaluate()
print(results)
# {
#     "helpfulness_score": 0.82,
#     "accuracy": 0.88,
#     "completeness": 0.75,
#     ...
# }

Best Practices for LLM Judges

| Practice | Why |
| --- | --- |
| Use structured output | JSON ensures parseable results |
| Include reasoning | Helps debug scoring decisions |
| Normalize scores | 0-1 range for easy comparison |
| Use strong judge models | GPT-4o class for nuanced evaluation |
| Add examples | Few-shot improves consistency |
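
The "Add examples" practice can be as simple as embedding one or two scored examples in the system prompt. A rough sketch of a few-shot variant of the helpfulness judge, reusing the OpenAI client from earlier (the example ratings are illustrative, not from a real rubric):

FEW_SHOT_JUDGE_PROMPT = """You are an evaluation judge. Rate the helpfulness
of the response on a scale of 1-5 and return only a JSON object:
{"score": <number>, "reasoning": "<explanation>"}

Example 1:
Question: How do I export my data?
Response: Go to Settings > Export and choose CSV.
{"score": 5, "reasoning": "Direct, correct, and actionable."}

Example 2:
Question: How do I export my data?
Response: Exporting is possible.
{"score": 2, "reasoning": "Acknowledges the question but gives no steps."}
"""

@weave.op()
def few_shot_helpfulness_judge(output: str, question: str) -> dict:
    """Helpfulness judge with few-shot examples in the system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": FEW_SHOT_JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {output}"}
        ],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return {
        "helpfulness_score": result["score"] / 5.0,  # Normalize to 0-1
        "reasoning": result["reasoning"]
    }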

Calibrating Your Judge

Verify judge accuracy on known examples:

# Known good/bad examples for calibration
calibration_set = [
    {"output": "excellent response", "expected_score": 0.9},
    {"output": "poor response", "expected_score": 0.2},
    {"output": "average response", "expected_score": 0.5}
]

# Check if judge scores match expectations
for example in calibration_set:
    result = helpfulness_judge(example["output"], "test question")
    assert abs(result["helpfulness_score"] - example["expected_score"]) < 0.2

Tracking Judge Consistency

All LLM judge calls are logged in Weave:

  • See exact prompts sent to judge
  • Review judge responses
  • Identify inconsistent scoring
  • Debug unexpected results

Tip: Log the judge's reasoning alongside scores. This helps debug why certain responses scored high or low.
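
Beyond reviewing calls in the UI, a quick way to quantify consistency is to score the same example several times and look at the spread; a noticeable standard deviation suggests the judge prompt or temperature needs tightening. A small sketch using the helpfulness judge from earlier (the example strings are placeholders):

import statistics

# Score the same (question, response) pair several times
scores = [
    helpfulness_judge(
        output="Clear your browser cache, then log in again.",
        question="I can't log in to my account."
    )["helpfulness_score"]
    for _ in range(5)
]

print("mean:", statistics.mean(scores), "stdev:", statistics.stdev(scores))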

With all three tools mastered, let's explore production monitoring patterns and next steps.
