AI Product Metrics & Evaluation
A/B Testing AI Features
A/B testing AI features is more complex than testing traditional features: the AI itself introduces variability that standard experiment designs don't account for.
Why AI A/B Tests Are Different
| Traditional A/B Test | AI A/B Test |
|---|---|
| Same input → Same output | Same input → Different outputs possible |
| Feature either works or doesn't | Feature works with varying accuracy |
| User behavior is the variable | AI behavior + user behavior are variables |
| One-time measurement | Need continuous monitoring |
The Three Layers of AI Testing
Layer 1: Model Testing (Offline)
Before users see anything, test the model itself:
| Test Type | What It Measures | When to Use |
|---|---|---|
| Holdout testing | Accuracy on unseen data | Before any deployment |
| Cross-validation | Consistency across data splits | During development |
| Slice analysis | Performance by segment | Before production |
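To make slice analysis concrete, here is a minimal sketch using only the Python standard library; the holdout tuples and segment names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical holdout set: (ground-truth label, model prediction, user segment)
holdout = [
    (1, 1, "new"), (0, 0, "new"),
    (1, 0, "returning"), (1, 1, "returning"),
    (0, 0, "power"), (1, 1, "power"),
]

# Overall holdout accuracy
overall = sum(label == pred for label, pred, _ in holdout) / len(holdout)

# Slice analysis: accuracy per segment, to spot groups the model underserves
totals, correct = defaultdict(int), defaultdict(int)
for label, pred, segment in holdout:
    totals[segment] += 1
    correct[segment] += int(label == pred)

print(f"overall: {overall:.2f}")
for segment in totals:
    print(f"{segment}: {correct[segment] / totals[segment]:.2f}")
```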
Layer 2: Shadow Testing
Deploy the new model alongside the production model and compare their outputs without affecting users:
User Request → Production Model → User sees this
↓
→ New Model → Log results (user doesn't see)
What to compare:
- Agreement rate between models
- Response time differences
- Edge case handling
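A minimal shadow-mode sketch is below, assuming models that expose a `predict` method and a plain list as the shadow log (both are assumptions about your serving stack, not a prescribed API). In practice the candidate call would usually run asynchronously so it adds no user-facing latency.

```python
def handle_request(user_id, user_input, production_model, candidate_model, shadow_log):
    # Production model serves the user exactly as before
    prod_response = production_model.predict(user_input)

    # Candidate model sees the same input; its output is only logged
    try:
        shadow_response = candidate_model.predict(user_input)
        shadow_log.append({
            "user_id": user_id,
            "agrees": shadow_response == prod_response,
            "prod": prod_response,
            "shadow": shadow_response,
        })
    except Exception as exc:
        # A shadow failure must never affect the user-facing path
        shadow_log.append({"user_id": user_id, "error": str(exc)})

    return prod_response
```

Agreement rate, latency deltas, and edge-case differences then come straight out of the log.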
Layer 3: Live A/B Testing
Once the model passes offline and shadow tests, test it with real users:
Users randomly assigned:
├── Control (50%): Production model
└── Treatment (50%): New model
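A common way to implement the split is deterministic hashing, so a user keeps the same variant across sessions. This is a sketch; the function name and the 50/50 split are illustrative.

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user into control or treatment."""
    # Hash user + experiment so assignment is stable across sessions
    # and independent across experiments
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user-123", "new-ranker-test"))
```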
Sample Size Challenges
AI experiments need larger sample sizes because:
- Model variability - Same input can produce different outputs
- User variability - Users interact differently with AI
- Segment effects - AI may perform differently for different users
Calculating Sample Size
For AI features, inflate the traditional sample size by 1.5-2x:
Traditional sample size formula (for a proportion):
n = (Z_(α/2)² × p × (1 − p)) / E²
where Z_(α/2) is the critical value for your confidence level, p is the baseline rate, and E is the margin of error.
AI-adjusted:
n_ai = n × 1.5 to 2.0
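A small calculator along these lines, using the proportion-based formula above plus a configurable inflation factor (the function name and defaults are illustrative):

```python
from math import ceil
from statistics import NormalDist

def ai_sample_size(baseline_rate: float, margin_of_error: float,
                   confidence: float = 0.95, ai_inflation: float = 1.5) -> int:
    """Per-variant sample size, inflated for AI variability."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z_(α/2)
    n = (z ** 2 * baseline_rate * (1 - baseline_rate)) / margin_of_error ** 2
    return ceil(n * ai_inflation)

# e.g. 15% baseline conversion, ±1% margin of error, 95% confidence
print(ai_sample_size(0.15, 0.01))  # 7347 (≈ 4,898 before the 1.5x inflation)
```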
Rule of thumb:
| Effect Size | Minimum Users Per Variant |
|---|---|
| Large (>20% improvement) | 5,000+ |
| Medium (10-20% improvement) | 15,000+ |
| Small (<10% improvement) | 50,000+ |
Controlling for AI Variability
Problem: Non-Deterministic Outputs
The same user with the same input might get different AI responses.
Solution 1: Seed Control
Fix the random seed per user so they get consistent responses:
# Stable per-user seed: the same user in the same experiment always gets the same seed
user_seed = hash(f"{user_id}:{experiment_id}")
# Passing the seed makes repeat requests return a consistent response
ai_response = model.predict(user_input, seed=user_seed)
Solution 2: Response Caching
Cache AI responses for the same input during the experiment:
# Key on user, input, and variant so each experiment arm caches its own responses
cache_key = hash(f"{user_id}:{user_input}:{variant}")
if cache_key in cache:
    # Reuse the response this user already saw during the experiment
    return cache[cache_key]
else:
    response = model.predict(user_input)
    cache[cache_key] = response
    return response
Solution 3: Longer Measurement Windows
AI variability smooths out over time. Run tests longer:
| Traditional Test | AI Test |
|---|---|
| 1-2 weeks typical | 3-4 weeks minimum |
| Stop at significance | Wait for stability |
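"Wait for stability" can be made operational with a simple heuristic like the one below; the seven-day window and the tolerance are arbitrary illustrative choices, not a standard.

```python
def is_stable(daily_lift: list[float], window: int = 7, tolerance: float = 0.02) -> bool:
    """Treat the experiment as stable if the last `window` daily lift
    readings all stay within `tolerance` of their mean."""
    if len(daily_lift) < window:
        return False
    recent = daily_lift[-window:]
    center = sum(recent) / window
    return all(abs(x - center) <= tolerance for x in recent)

# Daily treatment-vs-control lift over the last ten days (made-up numbers)
print(is_stable([0.04, 0.02, 0.035, 0.03, 0.031, 0.029, 0.03, 0.032, 0.028, 0.03]))  # True
```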
What Metrics to Measure
Primary Metrics (Pick 1-2)
| Use Case | Primary Metric |
|---|---|
| Search | Click-through rate on top 3 results |
| Recommendations | Conversion rate |
| Content generation | Task completion rate |
| Classification | Accuracy + user corrections |
Secondary Metrics (Monitor)
| Metric | Why It Matters |
|---|---|
| Latency | Slower AI kills engagement |
| Engagement time | Shows if users trust/use AI |
| Override rate | How often users change AI output |
| Error rate | System reliability |
Guardrail Metrics (Don't Make Worse)
| Metric | Threshold |
|---|---|
| Revenue per user | No decrease |
| User complaints | No increase |
| System errors | No increase |
Interpreting AI Test Results
When Results Are Clear
Significant improvement + Stable over time = Ship it
Control: 15% conversion
Treatment: 18% conversion
p-value: < 0.01
Stable for 2 weeks
→ Roll out to 100%
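For a quick sanity check of numbers like these, a two-proportion z-test needs only the standard library. The 15,000 users per variant below is an assumed figure; the example above doesn't state sample sizes.

```python
from math import sqrt, erfc

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, erfc(abs(z) / sqrt(2))  # erfc(|z|/√2) is the two-sided p-value

# Assumed 15,000 users per variant: 15% control vs 18% treatment conversion
z, p = two_proportion_z_test(conv_a=2250, n_a=15000, conv_b=2700, n_b=15000)
print(f"z = {z:.1f}, p = {p:.1g}")  # p is far below 0.01
```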
When Results Are Confusing
Common AI testing pitfalls:
| Observation | Possible Cause | Action |
|---|---|---|
| High variance in results | AI variability | Extend test duration |
| Great for some users, bad for others | Segment effects | Analyze by segment, consider personalization |
| Good initially, then declines | Novelty effect | Run longer, check for habituation |
| Statistical significance but small effect | May not be worth complexity | Calculate ROI including maintenance cost |
Segment Analysis
AI often performs differently across segments:
| Segment | Check For |
|---|---|
| New vs returning users | Different trust levels |
| Power vs casual users | Different expectations |
| High vs low engagement | Different tolerance for errors |
| Geographic | Different language/cultural fit |
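Once the experiment log is in a DataFrame, the per-segment breakdown is a short pivot; the column names and toy data below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-user experiment log: variant, segment, and whether they converted
log = pd.DataFrame({
    "variant":   ["control", "treatment", "control", "treatment", "control", "treatment"],
    "segment":   ["new", "new", "returning", "returning", "power", "power"],
    "converted": [0, 1, 1, 1, 0, 0],
})

# Conversion rate per segment and variant, then treatment-over-control lift
rates = log.pivot_table(index="segment", columns="variant",
                        values="converted", aggfunc="mean")
rates["lift"] = rates["treatment"] - rates["control"]
print(rates)
```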
Post-Test: Rolling Out AI Features
Gradual Rollout Plan
Don't go from test to 100%. Use stages:
| Stage | Traffic | Duration | Gate to Next |
|---|---|---|---|
| 1 | 10% | 1 week | Metrics stable |
| 2 | 25% | 1 week | No degradation |
| 3 | 50% | 1 week | Positive trend continues |
| 4 | 100% | Ongoing | Continuous monitoring |
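One way to keep the plan honest is to encode it as data plus a small gate check, sketched below; the gate names are placeholders for whatever checks you actually run.

```python
# Rollout stages as data: traffic share, minimum time at the stage,
# and the (hypothetical) gate that must pass before advancing.
STAGES = [
    {"traffic": 0.10, "min_days": 7, "gate": "metrics_stable"},
    {"traffic": 0.25, "min_days": 7, "gate": "no_degradation"},
    {"traffic": 0.50, "min_days": 7, "gate": "positive_trend"},
    {"traffic": 1.00, "min_days": None, "gate": "continuous_monitoring"},
]

def next_stage(current: int, days_at_stage: int, gate_passed: bool) -> int:
    """Advance only after the minimum duration at this stage and a passing gate."""
    stage = STAGES[current]
    if stage["min_days"] is None:  # already at 100%, keep monitoring
        return current
    if days_at_stage >= stage["min_days"] and gate_passed:
        return min(current + 1, len(STAGES) - 1)
    return current

print(STAGES[next_stage(0, days_at_stage=8, gate_passed=True)]["traffic"])  # 0.25
```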
Rollback Triggers
Define in advance when to pause or roll back:
| Trigger | Action |
|---|---|
| Primary metric drops >10% | Pause, investigate |
| Guardrail metric violated | Rollback immediately |
| Error rate spikes | Rollback immediately |
| User complaints spike | Pause, investigate |
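These triggers are easy to automate. The sketch below maps the table to actions; the metric names and the 1.5x "spike" thresholds are illustrative assumptions, not fixed values.

```python
def check_rollout_health(current: dict, baseline: dict) -> str:
    """Return 'rollback', 'pause', or 'continue' per the trigger table above."""
    if current["error_rate"] > baseline["error_rate"] * 1.5:
        return "rollback"   # error rate spike
    if current["revenue_per_user"] < baseline["revenue_per_user"]:
        return "rollback"   # guardrail metric violated
    if current["primary_metric"] < baseline["primary_metric"] * 0.90:
        return "pause"      # primary metric dropped more than 10%
    if current["complaints"] > baseline["complaints"] * 1.5:
        return "pause"      # user complaints spike
    return "continue"

baseline = {"error_rate": 0.010, "revenue_per_user": 4.20,
            "primary_metric": 0.15, "complaints": 12}
current  = {"error_rate": 0.011, "revenue_per_user": 4.25,
            "primary_metric": 0.13, "complaints": 14}
print(check_rollout_health(current, baseline))  # "pause": primary metric down ~13%
```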
Key Takeaway
AI A/B tests require larger samples, longer durations, and careful control of AI variability. Don't rush results—AI behavior needs time to stabilize and reveal true patterns.
Next: Let's talk about the business side of AI, from managing costs to calculating ROI.