# A/B Testing & Experimentation
## A/B Test Design
Designing a solid experiment is half the battle. Poor design leads to inconclusive or misleading results, wasting time and resources.
### Metric Selection
Choose metrics carefully - they drive decisions:
Primary Metric: The one metric that determines success/failure.
- Should directly measure what you care about
- Must be measurable within experiment timeframe
- Should have enough signal (not too rare)
Secondary Metrics: Provide additional insight but don't drive the launch decision.
Guardrail Metrics: Must not regress. A significant drop in a guardrail blocks the launch even if the primary metric improves.
Example - Testing a new recommendation algorithm:
| Type | Metric | Why |
|---|---|---|
| Primary | Click-through rate on recommendations | Direct measure of relevance |
| Secondary | Time spent on clicked items | Quality of recommendations |
| Guardrail | Overall page load time | Algorithm shouldn't slow things down |
| Guardrail | Revenue per user | Shouldn't hurt monetization |
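Many teams encode this plan in the experiment configuration so guardrail checks run automatically at analysis time. A minimal sketch, with metric names and thresholds that are purely illustrative:

```python
# Hypothetical metric plan for the recommendation test above; names and
# thresholds are illustrative, not from any real experimentation platform.
METRIC_PLAN = {
    "primary": {"metric": "rec_click_through_rate", "direction": "increase"},
    "secondary": [
        {"metric": "time_on_clicked_items"},          # informs, doesn't gate
    ],
    "guardrails": [
        # Launch is blocked if a guardrail regresses past its threshold.
        {"metric": "page_load_time_p95_ms", "max_regression_pct": 1.0},
        {"metric": "revenue_per_user", "max_regression_pct": 0.5},
    ],
}
```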
### Minimum Detectable Effect (MDE)
The MDE is the smallest effect size your experiment can reliably detect.
Tradeoffs:
- Smaller MDE → need more users → longer experiment
- Larger MDE → need fewer users → but real effects smaller than the MDE may go undetected
How to choose the MDE:
- Business impact: What size of change would matter? A 0.1% lift in conversion might be worth millions for a large platform.
- Feasibility: Given the baseline rate and available traffic, what can you detect in a reasonable timeframe? (See the sample-size sketch after this list.)
- Expected effect: What do similar changes typically produce?
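To make the feasibility point concrete, here is a rough sample-size check, assuming `statsmodels` is available; the baseline rate and relative MDE are illustrative:

```python
# Rough feasibility check: users per variant needed to detect a relative
# lift on a conversion-rate baseline. All numbers are illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                        # current conversion rate
mde_relative = 0.02                    # smallest relative lift worth detecting
target = baseline * (1 + mde_relative)

effect = abs(proportion_effectsize(baseline, target))   # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,            # significance level
    power=0.8,             # 80% chance of detecting the MDE if it exists
    ratio=1.0,             # equal-sized groups
    alternative="two-sided",
)
print(f"Need roughly {n_per_variant:,.0f} users per variant")
```

Because the required sample size scales roughly with 1/MDE², halving the MDE roughly quadruples the users you need, which is why tiny lifts get expensive to detect.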
Interview question: "The PM wants to detect a 1% relative lift in conversion (from 5% to 5.05%). Your power calculation says you need 3 million users per group. What do you do?"
Good answer: "I'd push back and discuss:
- Is 1% lift realistic? What evidence suggests this small an effect?
- Can we run for longer to accumulate users?
- Is there a proxy metric with lower variance?
- Should we focus on a more impactful change first?"
### Test Duration
How long should an experiment run?
Factors to consider:
| Factor | Impact |
|---|---|
| Sample size needs | Primary driver of duration |
| Weekly patterns | Run full weeks (capture weekday/weekend) |
| Novelty effects | New features may spike then normalize |
| External events | Avoid holidays, launches, outages |
| Maturation | Some effects take time to develop |
Minimum recommendations:
- At least 1 full week (ideally 2)
- At least 1,000 conversions per variant
- Long enough for novelty to wear off (2+ weeks for major UI changes)
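A quick way to turn the required sample size into a duration, then round up to full weeks as recommended above (all figures are made up):

```python
import math

# Back-of-the-envelope duration. n_per_variant would come from a power
# calculation like the one above; the traffic figure is illustrative.
n_per_variant = 400_000          # required users per variant
daily_eligible_users = 80_000    # users who hit the tested surface per day
allocation = 0.5                 # fraction of traffic in each variant (50/50)

days_needed = n_per_variant / (daily_eligible_users * allocation)
weeks = max(2, math.ceil(days_needed / 7))   # full weeks, minimum of two
print(f"Plan to run for about {weeks} weeks")
```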
### Traffic Allocation
How to split users between variants:
| Split | Use Case |
|---|---|
| 50/50 | Standard - maximizes statistical power |
| 90/10 | Testing risky changes, want to minimize exposure |
| Multi-arm (e.g. 25/25/25/25) | Testing multiple variants (A/B/C/D) |
Ramp-up strategy:
- Start with 1% traffic (catch major bugs)
- Increase to 10% (monitor metrics)
- Full 50/50 (run experiment)
Interview insight: "I always recommend a ramp-up phase for new features. Starting at 1% lets us catch implementation bugs before affecting many users."
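Splits and ramp-ups are typically implemented with deterministic, hash-based bucketing so a given user always sees the same variant. A minimal sketch, not tied to any particular platform:

```python
import hashlib

def _bucket(key: str) -> float:
    """Map a string to a stable, roughly uniform value in [0, 1)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return (int(digest, 16) % 10_000) / 10_000

def assign_variant(user_id: str, experiment: str, ramp: float = 1.0,
                   variants=(("control", 0.5), ("treatment", 0.5))):
    """Return the user's variant, or None if they are not enrolled.

    ramp is the fraction of all traffic enrolled (0.01 during a 1% ramp-up).
    """
    # Enrollment and variant use separate hashes, so raising `ramp`
    # only adds users; it never flips anyone between variants.
    if _bucket(f"enroll:{experiment}:{user_id}") >= ramp:
        return None
    point, cumulative = _bucket(f"variant:{experiment}:{user_id}"), 0.0
    for name, weight in variants:
        cumulative += weight
        if point < cumulative:
            return name
    return variants[-1][0]   # guard against floating-point edge cases

# Example: 10% ramp with a 90/10 split inside the enrolled traffic.
print(assign_variant("user_42", "new_rec_algo", ramp=0.10,
                     variants=(("control", 0.9), ("treatment", 0.1))))
```

Using separate hashes for enrollment and variant assignment means raising the ramp from 1% to 50% only enrolls new users; it never moves an existing user between control and treatment, which would contaminate the comparison.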
### Pre-Registration
Document your experiment design before you run it:
Pre-registration document:
1. Hypothesis: Clear prediction
2. Primary metric: One metric, how it's measured
3. Sample size: Calculation and assumptions
4. Duration: Start/end dates
5. Analysis plan: Statistical tests to use
6. Decision criteria: What leads to launch/no-launch
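The analysis plan (item 5) can even be written down as runnable code so there is no ambiguity later. A sketch assuming the primary metric is a conversion rate and `statsmodels` is available, with illustrative counts:

```python
# Pre-specified analysis: a two-sided two-proportion z-test on the
# primary metric. Counts and exposures below are illustrative.
from statsmodels.stats.proportion import proportions_ztest

conversions = [5_150, 5_410]       # control, treatment
exposures = [100_000, 100_000]     # users per variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures,
                                    alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Decision criterion from the pre-registration: launch only if p < 0.05
# and no guardrail metric regresses past its threshold.
```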
Why it matters:
- Prevents p-hacking (changing analysis to get significance)
- Documents assumptions for stakeholders
- Creates accountability
Pre-registration is increasingly expected at top companies. Mention it proactively to show you understand rigorous experimentation.