W&B Weave for Evaluation

Comparing Experiments


Weave's power shines when comparing different versions of your LLM application. Track improvements, identify regressions, and make data-driven decisions.

Why Compare Experiments?

Scenario            │ What to Compare
────────────────────┼───────────────────────────────────
Prompt engineering  │ Different prompt versions
Model selection     │ GPT-4 vs Claude vs Llama
Parameter tuning    │ Temperature, max tokens
System changes      │ RAG configurations, chain designs

Versioned Models

Weave automatically versions your models:

import weave

weave.init('my-team/my-project')

class SupportBot(weave.Model):
    model_name: str = "gpt-4o-mini"
    temperature: float = 0.7
    system_prompt: str = "You are a helpful support agent."

    @weave.op()
    def predict(self, question: str) -> str:
        # Your LLM call goes here (one possible implementation is sketched below)
        pass

# Version 1: Default settings
bot_v1 = SupportBot()

# Version 2: Lower temperature
bot_v2 = SupportBot(temperature=0.3)

# Version 3: Updated prompt
bot_v3 = SupportBot(
    temperature=0.3,
    system_prompt="You are a friendly and efficient support agent. Be concise."
)
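
The predict stub above needs a real LLM call before an evaluation can run. Here is a minimal sketch of one way to fill it in, assuming the official OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY in the environment; swap in whichever provider you actually use:

from openai import OpenAI

class SupportBot(weave.Model):
    model_name: str = "gpt-4o-mini"
    temperature: float = 0.7
    system_prompt: str = "You are a helpful support agent."

    @weave.op()
    def predict(self, question: str) -> str:
        # Call the OpenAI Chat Completions API with this version's settings
        client = OpenAI()
        response = client.chat.completions.create(
            model=self.model_name,
            temperature=self.temperature,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content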

Running Comparative Evaluations

Evaluate multiple versions against the same dataset:

import weave

weave.init('my-team/my-project')

# Load dataset
dataset = weave.ref("support-test-cases").get()

# Define scorers
@weave.op()
def helpfulness_scorer(output: str, expected: str) -> dict:
    # Your scoring logic (placeholder: check whether the expected answer appears in the output)
    score = float(expected.lower() in output.lower())
    return {"helpfulness": score}

# Create evaluation
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[helpfulness_scorer]
)

# Evaluate each version against the same dataset
# (await works in a notebook; in a plain script, wrap each call in asyncio.run(...))
results_v1 = await evaluation.evaluate(bot_v1)
results_v2 = await evaluation.evaluate(bot_v2)
results_v3 = await evaluation.evaluate(bot_v3)
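
The weave.ref("support-test-cases") call assumes that dataset has already been published to the project. A minimal sketch of publishing one with weave.Dataset; the row keys (question, expected) and the sample rows are illustrative and must match the argument names used by predict() and the scorers:

import weave

weave.init('my-team/my-project')

# Row keys must line up with predict(question=...) and the scorers' expected argument
dataset = weave.Dataset(
    name="support-test-cases",
    rows=[
        {"question": "How do I reset my password?",
         "expected": "Use the 'Forgot password' link on the sign-in page."},
        {"question": "Can I change my billing plan mid-cycle?",
         "expected": "Yes, plan changes take effect immediately from the billing page."},
    ],
)
weave.publish(dataset)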

Comparison View

In the W&B UI, compare evaluations:

Evaluation Comparison
────────────────────────────────────────────
Metric          │ bot_v1  │ bot_v2  │ bot_v3
────────────────────────────────────────────
helpfulness     │ 0.72    │ 0.78    │ 0.85
accuracy        │ 0.80    │ 0.82    │ 0.88
response_time   │ 1.2s    │ 1.1s    │ 1.0s
cost_per_query  │ $0.003  │ $0.003  │ $0.003
────────────────────────────────────────────

A/B Analysis Patterns

Pattern 1: Prompt Variations

prompts = [
    "Answer the question directly.",
    "Provide a helpful and detailed answer.",
    "Be concise but thorough in your response."
]

for i, prompt in enumerate(prompts):
    model = SupportBot(system_prompt=prompt)
    results = await evaluation.evaluate(model)
    print(f"Prompt {i+1}: {results.summary}")

Pattern 2: Model Comparison

models_to_test = [
    {"name": "gpt-4o-mini", "provider": "openai"},
    {"name": "gpt-4o", "provider": "openai"},
    {"name": "claude-3-sonnet", "provider": "anthropic"}
]

for model_config in models_to_test:
    model = SupportBot(model_name=model_config["name"])
    results = await evaluation.evaluate(model)
    # Each run is logged against the SupportBot version (and its attributes) that produced it
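
As written, SupportBot only receives model_name and ignores the provider field. A rough sketch of provider-aware routing, assuming the official openai and anthropic clients with API keys in the environment; the MultiProviderBot class and its provider field are illustrative additions, not part of the example above:

import anthropic
from openai import OpenAI

class MultiProviderBot(weave.Model):
    model_name: str = "gpt-4o-mini"
    provider: str = "openai"
    temperature: float = 0.7
    system_prompt: str = "You are a helpful support agent."

    @weave.op()
    def predict(self, question: str) -> str:
        if self.provider == "anthropic":
            # Anthropic Messages API: the system prompt is a top-level argument
            client = anthropic.Anthropic()
            response = client.messages.create(
                model=self.model_name,
                max_tokens=512,
                temperature=self.temperature,
                system=self.system_prompt,
                messages=[{"role": "user", "content": question}],
            )
            return response.content[0].text
        # Default: OpenAI Chat Completions
        client = OpenAI()
        response = client.chat.completions.create(
            model=self.model_name,
            temperature=self.temperature,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

for model_config in models_to_test:
    model = MultiProviderBot(
        model_name=model_config["name"],
        provider=model_config["provider"],
    )
    results = await evaluation.evaluate(model)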

Pattern 3: Parameter Sweeps

temperatures = [0.0, 0.3, 0.5, 0.7, 1.0]

for temp in temperatures:
    model = SupportBot(temperature=temp)
    results = await evaluation.evaluate(model)
    # Track correlation between temperature and quality
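
To make the correlation comment concrete, the sweep can collect (temperature, score) pairs and compute a simple Pearson correlation. The nested key path into the summary dict is an assumption based on a scorer named helpfulness_scorer returning a "helpfulness" key; check your actual results (or the Weave UI) for the exact structure:

import statistics

temps, scores = [], []
for temp in temperatures:
    model = SupportBot(temperature=temp)
    results = await evaluation.evaluate(model)
    temps.append(temp)
    # Assumed summary shape: {"helpfulness_scorer": {"helpfulness": {"mean": ...}}}
    scores.append(results["helpfulness_scorer"]["helpfulness"]["mean"])

# Pearson correlation between temperature and mean helpfulness (Python 3.10+)
print(statistics.correlation(temps, scores))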

Tracking Over Time

Monitor quality across deployments:

# Tag evaluations with deployment info
@weave.op()
async def run_production_eval():
    # load_production() is your own helper for fetching the currently deployed model version
    model = SupportBot.load_production()
    evaluation = weave.Evaluation(
        dataset=weave.ref("prod-test-cases").get(),
        scorers=[accuracy_scorer, latency_scorer]
    )
    results = await evaluation.evaluate(model)

    # Results automatically timestamped and versioned
    return results
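
accuracy_scorer and latency_scorer are referenced but not defined above. A minimal sketch of an exact-match accuracy scorer and a stub latency scorer (Weave already records per-call latency in its traces, so the stub is only there to make the scorer list resolve), plus launching the eval from a script with asyncio.run:

import asyncio
import weave

@weave.op()
def accuracy_scorer(output: str, expected: str) -> dict:
    # Placeholder: case-insensitive exact match against the expected answer
    return {"accuracy": float(output.strip().lower() == expected.strip().lower())}

@weave.op()
def latency_scorer(output: str) -> dict:
    # Stub: per-call latency is captured automatically in Weave traces;
    # replace with custom timing logic if you record it per example
    return {"latency_recorded": True}

if __name__ == "__main__":
    weave.init('my-team/my-project')
    asyncio.run(run_production_eval())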

Visualization Features

The Weave UI provides:

Feature        │ Description
───────────────┼──────────────────────────────────────
Side-by-side   │ Compare outputs for the same input
Metric charts  │ Visualize score distributions
Diff view      │ See what changed between versions
Drill-down     │ Inspect individual examples

Best Practices

  1. Use consistent datasets: Same test cases for fair comparison
  2. Track metadata: Record model version, prompt version, date
  3. Run multiple times: Account for LLM variance
  4. Save baselines: Keep reference points for comparison
  5. Document changes: Note what changed between versions

Tip: Create a "golden" baseline evaluation early. Compare all future experiments against this baseline.
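
One lightweight way to apply this tip is to store the golden baseline's summary numbers and diff every new run against them. A sketch in plain Python; the baseline values are illustrative (taken from the comparison table above) and the key paths into the results dict are assumptions to verify against your own scorer output:

# Golden baseline metrics, captured once (e.g. committed alongside the eval code)
BASELINE = {"helpfulness": 0.72, "accuracy": 0.80}

def compare_to_baseline(results: dict, tolerance: float = 0.02) -> None:
    # Assumed summary shape: results["helpfulness_scorer"]["helpfulness"]["mean"], etc.
    current = {
        "helpfulness": results["helpfulness_scorer"]["helpfulness"]["mean"],
        "accuracy": results["accuracy_scorer"]["accuracy"]["mean"],
    }
    for metric, base_value in BASELINE.items():
        delta = current[metric] - base_value
        status = "regression" if delta < -tolerance else "ok"
        print(f"{metric}: {current[metric]:.3f} (baseline {base_value:.3f}, {status})")

compare_to_baseline(results_v3)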

Next, we'll explore how to build LLM-as-Judge evaluators in Weave.
