W&B Weave for Evaluation

LLM-as-Judge in Weave


When you need nuanced evaluation beyond simple matching, use LLMs as judges. Weave makes it easy to build LLM-powered scorers that evaluate quality, helpfulness, and other subjective criteria.

Why LLM-as-Judge in Weave?

| Benefit            | Description                                   |
|--------------------|-----------------------------------------------|
| Nuanced evaluation | Assess tone, helpfulness, coherence           |
| Scalable           | Evaluate thousands of examples automatically  |
| Customizable       | Define your own criteria                      |
| Tracked            | All judge calls logged and versioned          |

Basic LLM Judge

Create a simple LLM-powered scorer:

import json

import weave
from openai import OpenAI

weave.init('my-team/my-project')

client = OpenAI()

@weave.op()
def helpfulness_judge(output: str, question: str) -> dict:
    """Use an LLM to evaluate helpfulness."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """You are an evaluation judge. Rate the helpfulness
            of the response on a scale of 1-5.

            1 = Not helpful at all
            3 = Somewhat helpful
            5 = Extremely helpful

            Return only a JSON object: {"score": <number>, "reasoning": "<explanation>"}"""
        }, {
            "role": "user",
            "content": f"Question: {question}\n\nResponse: {output}"
        }],
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)
    return {
        "helpfulness_score": result["score"] / 5.0,  # Normalize to 0-1
        "reasoning": result["reasoning"]
    }
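You can sanity-check the scorer by calling it directly; because it is decorated with @weave.op(), the call (including the judge's prompt and response) is traced in Weave. The question and answer below are placeholders:

# Illustrative direct call to the judge
score = helpfulness_judge(
    output="Go to Settings > Account > Reset Password and follow the emailed link.",
    question="How do I reset my password?"
)
print(score)  # e.g. {"helpfulness_score": 0.8, "reasoning": "..."}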

Multi-Criteria Judge

Evaluate on multiple dimensions:

@weave.op()
def quality_judge(output: str, question: str, context: str | None = None) -> dict:
    """Evaluate response on multiple quality dimensions."""
    criteria_prompt = """
    Evaluate this response on these criteria (1-5 each):

    1. **Accuracy**: Is the information correct?
    2. **Completeness**: Does it fully answer the question?
    3. **Clarity**: Is it easy to understand?
    4. **Tone**: Is the tone appropriate and professional?

    Return JSON: {
        "accuracy": <1-5>,
        "completeness": <1-5>,
        "clarity": <1-5>,
        "tone": <1-5>,
        "overall": <1-5>,
        "feedback": "<specific feedback>"
    }
    """

    # Include retrieval context in the prompt when it is provided
    user_content = f"Question: {question}\nResponse: {output}"
    if context:
        user_content = f"Context: {context}\n\n{user_content}"

    response = client.chat.completions.create(
        model="gpt-4o",  # Use a stronger model for judging
        messages=[
            {"role": "system", "content": criteria_prompt},
            {"role": "user", "content": user_content}
        ],
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)

    # Normalize all scores to 0-1
    return {
        "accuracy": result["accuracy"] / 5.0,
        "completeness": result["completeness"] / 5.0,
        "clarity": result["clarity"] / 5.0,
        "tone": result["tone"] / 5.0,
        "overall": result["overall"] / 5.0,
        "feedback": result["feedback"]
    }

Pairwise Comparison Judge

Compare two responses directly:

@weave.op()
def pairwise_judge(response_a: str, response_b: str, question: str) -> dict:
    """Judge which response is better."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Compare these two responses to the same question.
            Which is better? Return JSON:
            {
                "winner": "A" or "B" or "tie",
                "reasoning": "<explanation>"
            }"""
        }, {
            "role": "user",
            "content": f"""Question: {question}

            Response A: {response_a}

            Response B: {response_b}"""
        }],
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)
    return {
        "winner": result["winner"],
        "a_wins": 1.0 if result["winner"] == "A" else 0.0,
        "reasoning": result["reasoning"]
    }
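For example, you might compare the outputs of two prompt or model variants on the same question (the responses below are placeholders). Averaging a_wins over a dataset then gives variant A's win rate, with ties counted as losses for A:

# Illustrative comparison of two candidate responses
comparison = pairwise_judge(
    response_a="You can reset your password from the Settings page.",
    response_b="Go to Settings > Account > Reset Password, then follow the emailed link.",
    question="How do I reset my password?"
)
print(comparison["winner"], "-", comparison["reasoning"])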

Using LLM Judges in Evaluation

Integrate judges into your evaluation pipeline:

import asyncio

import weave

weave.init('my-team/my-project')

# Your model (illustrative implementation -- replace with your own logic)
class SupportBot(weave.Model):
    @weave.op()
    def predict(self, question: str) -> str:
        # Generate a response with a single LLM call (reuses `client` from above)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful support agent."},
                {"role": "user", "content": question}
            ]
        )
        return response.choices[0].message.content

# Evaluation with LLM judges
evaluation = weave.Evaluation(
    dataset=weave.ref("support-test-cases").get(),
    scorers=[
        helpfulness_judge,
        quality_judge
    ]
)

# Run evaluation (evaluate() is async: use asyncio.run() in a script,
# or `await evaluation.evaluate(model)` in a notebook)
model = SupportBot()
results = asyncio.run(evaluation.evaluate(model))

# View aggregate scores (a dict keyed by scorer name)
print(results)
# Example shape:
# {
#     "helpfulness_judge": {"helpfulness_score": {"mean": 0.82}, ...},
#     "quality_judge": {"accuracy": {"mean": 0.88}, "completeness": {"mean": 0.75}, ...},
#     ...
# }

Best Practices for LLM Judges

| Practice                | Why                                                   |
|-------------------------|-------------------------------------------------------|
| Use structured output   | JSON ensures parseable results                        |
| Include reasoning       | Helps debug scoring decisions                         |
| Normalize scores        | 0-1 range for easy comparison                         |
| Use strong judge models | GPT-4-class models for nuanced evaluation             |
| Add examples            | Few-shot improves consistency (see the sketch below)  |
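The "Add examples" practice, for instance, can be applied by embedding one or two scored examples directly in the judge's system prompt. A minimal sketch, with made-up example answers and scores:

FEW_SHOT_JUDGE_PROMPT = """You are an evaluation judge. Rate the helpfulness
of the response on a scale of 1-5.
Return only a JSON object: {"score": <number>, "reasoning": "<explanation>"}

Examples:

Question: How do I cancel my subscription?
Response: Go to Billing > Subscriptions and click Cancel. You keep access until the end of the billing period.
{"score": 5, "reasoning": "Exact steps, and it sets expectations."}

Question: How do I cancel my subscription?
Response: Please contact us.
{"score": 2, "reasoning": "Vague; gives no concrete steps."}
"""

@weave.op()
def few_shot_helpfulness_judge(output: str, question: str) -> dict:
    """Helpfulness judge with few-shot examples in the system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": FEW_SHOT_JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {output}"}
        ],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return {
        "helpfulness_score": result["score"] / 5.0,  # Normalize to 0-1
        "reasoning": result["reasoning"]
    }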

Calibrating Your Judge

Verify judge accuracy on known examples:

# Known good/bad examples for calibration
calibration_set = [
    {"output": "excellent response", "expected_score": 0.9},
    {"output": "poor response", "expected_score": 0.2},
    {"output": "average response", "expected_score": 0.5}
]

# Check if judge scores match expectations
for example in calibration_set:
    result = helpfulness_judge(example["output"], "test question")
    assert abs(result["helpfulness_score"] - example["expected_score"]) < 0.2

Tracking Judge Consistency

All LLM judge calls are logged in Weave:

  • See exact prompts sent to judge
  • Review judge responses
  • Identify inconsistent scoring
  • Debug unexpected results

Tip: Log the judge's reasoning alongside scores. This helps debug why certain responses scored high or low.
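One simple way to spot-check consistency is to run the same judge several times on an identical input and look at the spread of scores; a wide spread suggests the judge prompt needs tightening or more examples. A minimal sketch (the question and answer are placeholders):

import statistics

# Re-run the judge on one fixed input to gauge score stability
scores = [
    helpfulness_judge(
        output="Go to Settings > Account > Reset Password.",
        question="How do I reset my password?"
    )["helpfulness_score"]
    for _ in range(5)
]

print(f"mean={statistics.mean(scores):.2f}, stdev={statistics.stdev(scores):.2f}")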

With all three tools mastered, let's explore production monitoring patterns and next steps.
