W&B Weave for Evaluation
Setting Up Evaluation Pipelines
3 min read
Weave provides a structured approach to evaluating LLM applications. You define scorers, create datasets, and run evaluations that track results automatically.
Evaluation Components
| Component | Purpose |
|---|---|
| Model | The LLM application to evaluate |
| Dataset | Test cases with inputs and expected outputs |
| Scorers | Functions that measure quality |
| Evaluation | Combines model, dataset, and scorers |
Creating a Scorer
Scorers are functions that evaluate model outputs:
```python
import weave

weave.init('my-team/my-project')

@weave.op()
def accuracy_scorer(output: str, expected: str) -> dict:
    """Check whether the output contains the expected answer."""
    is_correct = expected.lower() in output.lower()
    return {
        "correct": is_correct,
        "score": 1.0 if is_correct else 0.0
    }
```
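Because a scorer is just a decorated function, you can sanity-check it directly before wiring it into an evaluation. The strings below are illustrative inputs, not rows from the dataset:

```python
# Quick manual check with made-up inputs; the call is traced by Weave
result = accuracy_scorer(output="The capital of France is Paris.", expected="Paris")
print(result)  # {'correct': True, 'score': 1.0}
```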
Defining a Dataset
Create a dataset with test cases:
```python
import weave

weave.init('my-team/my-project')

# Define evaluation examples
dataset = [
    {
        "question": "What is the capital of France?",
        "expected": "Paris"
    },
    {
        "question": "What is 2 + 2?",
        "expected": "4"
    },
    {
        "question": "Who wrote Romeo and Juliet?",
        "expected": "Shakespeare"
    }
]

# Save as a Weave dataset
weave_dataset = weave.Dataset(
    name="qa-test-set",
    rows=dataset
)
weave.publish(weave_dataset)
```
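Once published, the dataset can be fetched back by name. A minimal sketch, assuming the same project is initialized; the `rows` attribute mirrors the list passed to `weave.Dataset` above:

```python
# Fetch the published dataset and inspect its rows
fetched = weave.ref("qa-test-set").get()
for row in fetched.rows:
    print(row["question"], "->", row["expected"])
```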
Creating Your Model
Wrap your LLM application as a Weave Model:
```python
import weave
from openai import OpenAI

class QAModel(weave.Model):
    model_name: str = "gpt-4o-mini"
    system_prompt: str = "Answer questions concisely."

    @weave.op()
    def predict(self, question: str) -> str:
        client = OpenAI()
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": question}
            ]
        )
        return response.choices[0].message.content
```
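You can spot-check a single prediction before running a full evaluation. This assumes `weave.init` has already been called and an `OPENAI_API_KEY` is set in your environment:

```python
# Call the model directly; the traced call shows up in the Weave UI
model = QAModel()
print(model.predict("What is the capital of France?"))
```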
Running an Evaluation
Combine everything into an evaluation:
```python
import asyncio
import weave

weave.init('my-team/my-project')

# Create model instance
model = QAModel()

# Load the published dataset
dataset = weave.ref("qa-test-set").get()

# Define scorers
@weave.op()
def contains_answer(output: str, expected: str) -> dict:
    return {"contains_expected": expected.lower() in output.lower()}

# Create the evaluation
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[contains_answer]
)

# evaluate() is async: await it in a notebook, or wrap it with asyncio.run() in a script
results = asyncio.run(evaluation.evaluate(model))
print(results)
```
Multiple Scorers
Add multiple scorers for comprehensive evaluation:
```python
@weave.op()
def length_scorer(output: str) -> dict:
    """Check if response length is appropriate."""
    word_count = len(output.split())
    return {
        "word_count": word_count,
        "appropriate_length": 5 <= word_count <= 100
    }

@weave.op()
def format_scorer(output: str) -> dict:
    """Check response formatting."""
    return {
        "has_period": output.strip().endswith('.'),
        "capitalized": output[0].isupper() if output else False
    }

# Use all scorers
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[contains_answer, length_scorer, format_scorer]
)
```
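Run it the same way as the single-scorer evaluation; each scorer's keys appear as separate columns in the results (using `asyncio` imported in the previous section):

```python
# Run the expanded evaluation with all three scorers
results = asyncio.run(evaluation.evaluate(model))
print(results)
```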
Viewing Results
After running an evaluation:
- Go to your project at wandb.ai
- Navigate to Evaluations
- See aggregate metrics and per-example scores
- Compare across different model versions
Evaluation Results Structure
```
Evaluation Results
├── Summary Metrics
│   ├── contains_expected: 0.85 (85% correct)
│   ├── appropriate_length: 0.95
│   └── has_period: 0.90
├── Per-Example Results
│   ├── Example 1: scores for each scorer
│   ├── Example 2: scores for each scorer
│   └── ...
└── Model Info
    ├── Version
    └── Parameters
```
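The summary portion of this structure corresponds to the dictionary returned by `evaluation.evaluate()`. Below is a sketch of pulling out one scorer's aggregates; the keys follow the scorer names defined above, and the exact nesting may vary between Weave versions:

```python
# results is keyed by scorer name; exact nesting may vary by Weave version
print(results["contains_answer"])    # aggregate stats for the contains_answer scorer
print(results.get("model_latency"))  # per-call latency summary, if present
```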
Tip: Start with simple scorers like exact match or contains, then add more sophisticated scorers as you understand your evaluation needs.
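For example, an exact-match scorer is just a stricter variant of the contains check used earlier (a sketch; the whitespace and case normalization is an assumption about your data):

```python
@weave.op()
def exact_match_scorer(output: str, expected: str) -> dict:
    """Strict match after trimming whitespace and lowercasing."""
    return {"exact_match": output.strip().lower() == expected.strip().lower()}
```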
Next, we'll learn how to compare experiments and track improvements over time.