W&B Weave for Evaluation
Comparing Experiments
3 min read
Weave's power shines when comparing different versions of your LLM application. Track improvements, identify regressions, and make data-driven decisions.
Why Compare Experiments?
| Scenario | What to Compare |
|---|---|
| Prompt engineering | Different prompt versions |
| Model selection | GPT-4 vs Claude vs Llama |
| Parameter tuning | Temperature, max tokens |
| System changes | RAG configurations, chain designs |
Versioned Models
Weave automatically versions your models:
```python
import weave

weave.init('my-team/my-project')

class SupportBot(weave.Model):
    model_name: str = "gpt-4o-mini"
    temperature: float = 0.7
    system_prompt: str = "You are a helpful support agent."

    @weave.op()
    def predict(self, question: str) -> str:
        # Your LLM logic
        pass

# Version 1: Default settings
bot_v1 = SupportBot()

# Version 2: Lower temperature
bot_v2 = SupportBot(temperature=0.3)

# Version 3: Updated prompt
bot_v3 = SupportBot(
    temperature=0.3,
    system_prompt="You are a friendly and efficient support agent. Be concise."
)
```
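The `predict` body above is intentionally a placeholder. A minimal sketch of filling it in, assuming the OpenAI Python client and an `OPENAI_API_KEY` in the environment (the exact call is an assumption, not part of the original example):

```python
import weave
from openai import OpenAI

class SupportBot(weave.Model):
    model_name: str = "gpt-4o-mini"
    temperature: float = 0.7
    system_prompt: str = "You are a helpful support agent."

    @weave.op()
    def predict(self, question: str) -> str:
        # Sketch: call the OpenAI Chat Completions API with this version's settings
        client = OpenAI()
        response = client.chat.completions.create(
            model=self.model_name,
            temperature=self.temperature,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content
```

Keeping the prompt and sampling settings as model attributes is what lets Weave treat each configuration as a distinct, comparable version.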
Running Comparative Evaluations
Evaluate multiple versions against the same dataset:
```python
import weave

weave.init('my-team/my-project')

# Load dataset
dataset = weave.ref("support-test-cases").get()

# Define scorers
@weave.op()
def helpfulness_scorer(output: str, expected: str) -> dict:
    # Your scoring logic; placeholder: substring match against the expected answer
    score = 1.0 if expected.lower() in output.lower() else 0.0
    return {"helpfulness": score}

# Create evaluation
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[helpfulness_scorer]
)

# Evaluate each version (run inside an async function, or wrap each call
# in asyncio.run(...) when running as a plain script)
results_v1 = await evaluation.evaluate(bot_v1)
results_v2 = await evaluation.evaluate(bot_v2)
results_v3 = await evaluation.evaluate(bot_v3)
```
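`weave.ref("support-test-cases")` assumes a dataset with that name has already been published to the project. A minimal sketch of publishing one (the rows and column values below are illustrative; the column names just need to match the `predict` and scorer arguments):

```python
import weave

weave.init('my-team/my-project')

# Illustrative test cases; "question" feeds predict(), "expected" feeds the scorer
rows = [
    {"question": "How do I reset my password?",
     "expected": "Use the 'Forgot password' link on the login page."},
    {"question": "Can I change my billing date?",
     "expected": "Yes, under Account Settings > Billing."},
]

dataset = weave.Dataset(name="support-test-cases", rows=rows)
weave.publish(dataset)
```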
Comparison View
In the W&B UI, compare evaluations:
| Metric | bot_v1 | bot_v2 | bot_v3 |
|---|---|---|---|
| helpfulness | 0.72 | 0.78 | 0.85 |
| accuracy | 0.80 | 0.82 | 0.88 |
| response_time | 1.2 s | 1.1 s | 1.0 s |
| cost_per_query | $0.003 | $0.003 | $0.003 |
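The same numbers are also available programmatically: each call to `evaluation.evaluate(...)` returns an aggregated summary dictionary. The exact nesting depends on your scorers, so the sketch below just prints the summaries side by side:

```python
# Summaries returned by the three evaluate() calls above
summaries = {
    "bot_v1": results_v1,
    "bot_v2": results_v2,
    "bot_v3": results_v3,
}

for name, summary in summaries.items():
    # Each summary is a dict of aggregated scorer results for that version
    print(f"{name}: {summary}")
```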
A/B Analysis Patterns
Pattern 1: Prompt Variations
```python
prompts = [
    "Answer the question directly.",
    "Provide a helpful and detailed answer.",
    "Be concise but thorough in your response."
]

for i, prompt in enumerate(prompts):
    model = SupportBot(system_prompt=prompt)
    results = await evaluation.evaluate(model)
    print(f"Prompt {i+1}: {results}")  # aggregated score summary
```
Pattern 2: Model Comparison
```python
models_to_test = [
    {"name": "gpt-4o-mini", "provider": "openai"},
    {"name": "gpt-4o", "provider": "openai"},
    {"name": "claude-3-sonnet", "provider": "anthropic"}
]

for model_config in models_to_test:
    model = SupportBot(model_name=model_config["name"])
    results = await evaluation.evaluate(model)
    # Results tracked with model metadata
```
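As written, `SupportBot` ignores the `provider` field. One way to use it is a provider-aware variant that dispatches between clients; the class below is an illustrative sketch (assuming the OpenAI and Anthropic Python clients with API keys in the environment), not part of the original example:

```python
import weave
from openai import OpenAI
from anthropic import Anthropic

class MultiProviderBot(weave.Model):
    model_name: str = "gpt-4o-mini"
    provider: str = "openai"
    temperature: float = 0.7
    system_prompt: str = "You are a helpful support agent."

    @weave.op()
    def predict(self, question: str) -> str:
        if self.provider == "openai":
            response = OpenAI().chat.completions.create(
                model=self.model_name,
                temperature=self.temperature,
                messages=[
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": question},
                ],
            )
            return response.choices[0].message.content
        # Anthropic's Messages API takes the system prompt as a separate argument
        response = Anthropic().messages.create(
            model=self.model_name,
            max_tokens=512,
            temperature=self.temperature,
            system=self.system_prompt,
            messages=[{"role": "user", "content": question}],
        )
        return response.content[0].text

# model = MultiProviderBot(model_name=model_config["name"],
#                          provider=model_config["provider"])
```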
Pattern 3: Parameter Sweeps
```python
temperatures = [0.0, 0.3, 0.5, 0.7, 1.0]

for temp in temperatures:
    model = SupportBot(temperature=temp)
    results = await evaluation.evaluate(model)
    # Track correlation between temperature and quality
```
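To make the temperature-versus-quality relationship easy to inspect, the sweep can collect each summary alongside its temperature (a sketch; as above, the shape of the summary dict depends on your scorers):

```python
# Run in the same async context as the loop above
sweep_results = []

for temp in temperatures:
    model = SupportBot(temperature=temp)
    summary = await evaluation.evaluate(model)
    sweep_results.append({"temperature": temp, "summary": summary})

# Quick look at how scores move with temperature
for entry in sweep_results:
    print(entry["temperature"], entry["summary"])
```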
Tracking Over Time
Monitor quality across deployments:
```python
# Tag evaluations with deployment info
@weave.op()
async def run_production_eval():
    model = SupportBot.load_production()  # however you load the deployed configuration
    evaluation = weave.Evaluation(
        dataset=weave.ref("prod-test-cases").get(),
        scorers=[accuracy_scorer, latency_scorer]
    )
    results = await evaluation.evaluate(model)
    # Results automatically timestamped and versioned
    return results
```
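To attach deployment metadata to the run and execute it from a plain script, one option (a sketch; the tag keys are illustrative) is Weave's `weave.attributes` context manager combined with `asyncio.run`:

```python
import asyncio
import weave

# Illustrative deployment tags recorded as attributes on the traced calls
with weave.attributes({"env": "production", "release": "v1.2.3"}):
    results = asyncio.run(run_production_eval())
```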
Visualization Features
Weave UI provides:
| Feature | Description |
|---|---|
| Side-by-side | Compare outputs for same input |
| Metric charts | Visualize score distributions |
| Diff view | See what changed between versions |
| Drill-down | Inspect individual examples |
Best Practices
- Use consistent datasets: Same test cases for fair comparison
- Track metadata: Record model version, prompt version, date
- Run multiple times: Account for LLM variance
- Save baselines: Keep reference points for comparison
- Document changes: Note what changed between versions
Tip: Create a "golden" baseline evaluation early. Compare all future experiments against this baseline.
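A minimal sketch of the baseline-plus-repeat-runs idea from the practices above: evaluate the "golden" baseline model several times against the fixed test set and keep the summaries for later comparison (the repeat count and variable names are arbitrary):

```python
# Run in an async context, as in the earlier examples
baseline_model = SupportBot()  # the version you designate as the golden baseline
baseline_runs = []

for _ in range(3):  # repeat to get a sense of run-to-run LLM variance
    summary = await evaluation.evaluate(baseline_model)
    baseline_runs.append(summary)

# Future experiments can then be compared against these baseline summaries,
# either programmatically or in the Weave comparison view.
```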
Next, we'll explore how to build LLM-as-Judge evaluators in Weave.