# A/B Testing & Experimentation
## A/B Test Design
Designing a solid experiment is half the battle. Poor design leads to inconclusive or misleading results, wasting time and resources.
### Metric Selection
Choose metrics carefully - they drive decisions:
Primary Metric: The one metric that determines success/failure.
- Should directly measure what you care about
- Must be measurable within experiment timeframe
- Should have enough signal (not too rare)
Secondary Metrics: Provide additional insight but don't drive the launch decision.
Guardrail Metrics: Must not regress. A significant drop in a guardrail blocks the launch even if the primary metric improves.
Example - Testing a new recommendation algorithm:
| Type | Metric | Why |
|---|---|---|
| Primary | Click-through rate on recommendations | Direct measure of relevance |
| Secondary | Time spent on clicked items | Quality of recommendations |
| Guardrail | Overall page load time | Algorithm shouldn't slow things down |
| Guardrail | Revenue per user | Shouldn't hurt monetization |
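Many teams encode this plan in the experiment configuration so guardrail checks run automatically at analysis time. A minimal sketch, with metric names and thresholds that are purely illustrative:

```python
# Hypothetical metric plan for the recommendation test above; names and
# thresholds are illustrative, not from any real experimentation platform.
METRIC_PLAN = {
    "primary": {"metric": "rec_click_through_rate", "direction": "increase"},
    "secondary": [
        {"metric": "time_on_clicked_items"},          # informs, doesn't gate
    ],
    "guardrails": [
        # Launch is blocked if a guardrail regresses past its threshold.
        {"metric": "page_load_time_p95_ms", "max_regression_pct": 1.0},
        {"metric": "revenue_per_user", "max_regression_pct": 0.5},
    ],
}
```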
### Minimum Detectable Effect (MDE)
The MDE is the smallest effect size your experiment can reliably detect.
Tradeoffs:
- Smaller MDE → need more users → longer experiment
- Larger MDE → need fewer users → but real effects smaller than the MDE may go undetected
How to choose the MDE:
- Business impact: What size of change would matter? A 0.1% lift in conversion might be worth millions for a large platform.
- Feasibility: Given the baseline rate and available traffic, what can you detect in a reasonable timeframe? (See the sample-size sketch after this list.)
- Expected effect: What do similar changes typically produce?
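To make the feasibility point concrete, here is a rough sample-size check, assuming `statsmodels` is available; the baseline rate and relative MDE are illustrative:

```python
# Rough feasibility check: users per variant needed to detect a relative
# lift on a conversion-rate baseline. All numbers are illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                        # current conversion rate
mde_relative = 0.02                    # smallest relative lift worth detecting
target = baseline * (1 + mde_relative)

effect = abs(proportion_effectsize(baseline, target))   # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,            # significance level
    power=0.8,             # 80% chance of detecting the MDE if it exists
    ratio=1.0,             # equal-sized groups
    alternative="two-sided",
)
print(f"Need roughly {n_per_variant:,.0f} users per variant")
```

Because the required sample size scales roughly with 1/MDE², halving the MDE roughly quadruples the users you need, which is why tiny lifts get expensive to detect.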
Interview question: "The PM wants to detect a 1% relative lift in conversion (from 5% to 5.05%). Your power calculation says you need 3 million users per group. What do you do?"
Good answer: "I'd push back and discuss:
- Is 1% lift realistic? What evidence suggests this small an effect?
- Can we run for longer to accumulate users?
- Is there a proxy metric with lower variance?
- Should we focus on a more impactful change first?"
### Test Duration
How long should an experiment run?
Factors to consider:
| Factor | Impact |
|---|---|
| Sample size needs | Primary driver of duration |
| Weekly patterns | Run full weeks (capture weekday/weekend) |
| Novelty effects | New features may spike then normalize |
| External events | Avoid holidays, launches, outages |
| Maturation | Some effects take time to develop |
Minimum recommendations:
- At least 1 full week (ideally 2)
- At least 1,000 conversions per variant
- Long enough for novelty to wear off (2+ weeks for major UI changes)
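A quick way to turn the required sample size into a duration, then round up to full weeks as recommended above (all figures are made up):

```python
import math

# Back-of-the-envelope duration. n_per_variant would come from a power
# calculation like the one above; the traffic figure is illustrative.
n_per_variant = 400_000          # required users per variant
daily_eligible_users = 80_000    # users who hit the tested surface per day
allocation = 0.5                 # fraction of traffic in each variant (50/50)

days_needed = n_per_variant / (daily_eligible_users * allocation)
weeks = max(2, math.ceil(days_needed / 7))   # full weeks, minimum of two
print(f"Plan to run for about {weeks} weeks")
```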
### Traffic Allocation
How to split users between variants:
| Split | Use Case |
|---|---|
| 50/50 | Standard - maximizes statistical power |
| 90/10 | Testing risky changes, want to minimize exposure |
| Multi-arm (e.g. 25/25/25/25) | Testing multiple variants (A/B/C/D) |
Ramp-up strategy:
- Start with 1% traffic (catch major bugs)
- Increase to 10% (monitor metrics)
- Full 50/50 (run experiment)
Interview insight: "I always recommend a ramp-up phase for new features. Starting at 1% lets us catch implementation bugs before affecting many users."
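Splits and ramp-ups are typically implemented with deterministic, hash-based bucketing so a given user always sees the same variant. A minimal sketch, not tied to any particular platform:

```python
import hashlib

def _bucket(key: str) -> float:
    """Map a string to a stable, roughly uniform value in [0, 1)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return (int(digest, 16) % 10_000) / 10_000

def assign_variant(user_id: str, experiment: str, ramp: float = 1.0,
                   variants=(("control", 0.5), ("treatment", 0.5))):
    """Return the user's variant, or None if they are not enrolled.

    ramp is the fraction of all traffic enrolled (0.01 during a 1% ramp-up).
    """
    # Enrollment and variant use separate hashes, so raising `ramp`
    # only adds users; it never flips anyone between variants.
    if _bucket(f"enroll:{experiment}:{user_id}") >= ramp:
        return None
    point, cumulative = _bucket(f"variant:{experiment}:{user_id}"), 0.0
    for name, weight in variants:
        cumulative += weight
        if point < cumulative:
            return name
    return variants[-1][0]   # guard against floating-point edge cases

# Example: 10% ramp with a 90/10 split inside the enrolled traffic.
print(assign_variant("user_42", "new_rec_algo", ramp=0.10,
                     variants=(("control", 0.9), ("treatment", 0.1))))
```

Using separate hashes for enrollment and variant assignment means raising the ramp from 1% to 50% only enrolls new users; it never moves an existing user between control and treatment, which would contaminate the comparison.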
### Pre-Registration
Document your experiment design before you run it:
Pre-registration document:
1. Hypothesis: Clear prediction
2. Primary metric: One metric, how it's measured
3. Sample size: Calculation and assumptions
4. Duration: Start/end dates
5. Analysis plan: Statistical tests to use
6. Decision criteria: What leads to launch/no-launch
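The analysis plan (item 5) can even be written down as runnable code so there is no ambiguity later. A sketch assuming the primary metric is a conversion rate and `statsmodels` is available, with illustrative counts:

```python
# Pre-specified analysis: a two-sided two-proportion z-test on the
# primary metric. Counts and exposures below are illustrative.
from statsmodels.stats.proportion import proportions_ztest

conversions = [5_150, 5_410]       # control, treatment
exposures = [100_000, 100_000]     # users per variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures,
                                    alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Decision criterion from the pre-registration: launch only if p < 0.05
# and no guardrail metric regresses past its threshold.
```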
Why it matters:
- Prevents p-hacking (changing analysis to get significance)
- Documents assumptions for stakeholders
- Creates accountability
Pre-registration is increasingly expected at top companies. Mention it proactively to show you understand rigorous experimentation.