A/B Testing & Experimentation

A/B Test Design

Designing a solid experiment is half the battle. Poor design leads to inconclusive or misleading results, wasting time and resources.

Metric Selection

Choose metrics carefully - they drive decisions:

Primary Metric: The one metric that determines success/failure.

  • Should directly measure what you care about
  • Must be measurable within experiment timeframe
  • Should have enough signal (not too rare)

Secondary Metrics: Metrics that add context and insight but don't change the launch decision.

Guardrail Metrics: Red flags that stop a launch.

Example - Testing a new recommendation algorithm:

Type      | Metric                                | Why
Primary   | Click-through rate on recommendations | Direct measure of relevance
Secondary | Time spent on clicked items           | Quality of recommendations
Guardrail | Overall page load time                | Algorithm shouldn't slow things down
Guardrail | Revenue per user                      | Shouldn't hurt monetization

Minimum Detectable Effect (MDE)

The MDE is the smallest effect size your experiment can reliably detect at your chosen significance level and power.

Tradeoffs:
- Smaller MDE → Need more users → Longer experiment
- Larger MDE → Need fewer users → Might miss smaller real effects

How to choose MDE:

  1. Business impact: What change would matter? A 0.1% lift in conversion might be worth millions for a large platform.

  2. Feasibility: Based on the baseline rate and available traffic, what can you detect in a reasonable timeframe? (See the sketch after this list.)

  3. Expected effect: What do similar changes typically produce?
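
To ground the feasibility check, here is a minimal sample-size sketch using statsmodels. It assumes a 5% baseline conversion rate, a 1% relative MDE (the same numbers as the interview question below), a two-sided test at 5% significance, and 80% power; all of these parameters are illustrative choices, not requirements.

```python
# Sketch: users needed per variant for a given MDE (illustrative parameters).
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.05                      # assumed current conversion rate
mde_relative = 0.01                  # smallest relative lift worth detecting
target = baseline * (1 + mde_relative)

# Cohen's h, the standardized effect size for comparing two proportions.
effect_size = proportion_effectsize(target, baseline)

# Users needed in EACH variant (two-sided test, alpha=0.05, power=0.8).
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=1.0
)
print(f"~{n_per_variant:,.0f} users per variant")  # roughly 3 million here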

Interview question: "The PM wants to detect a 1% relative lift in conversion (from 5% to 5.05%). Your calculator says you need 3 million users per group. What do you do?"

Good answer: "I'd push back and discuss:

  1. Is 1% lift realistic? What evidence suggests this small an effect?
  2. Can we run for longer to accumulate users?
  3. Is there a proxy metric with lower variance?
  4. Should we focus on a more impactful change first?"
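
Point 2 of that answer ("can we run for longer?") can be made concrete by inverting the calculation: given the traffic you can realistically accumulate, what lift is actually detectable? A sketch with assumed numbers; the 200,000 users per variant per week is purely illustrative.

```python
# Sketch: smallest detectable lift for the traffic we can realistically collect.
import numpy as np
from statsmodels.stats.power import NormalIndPower

baseline = 0.05                  # assumed baseline conversion rate
weekly_per_variant = 200_000     # illustrative traffic per variant per week

for weeks in (2, 4, 8):
    n = weekly_per_variant * weeks
    # Solve for the detectable effect size (Cohen's h) at this sample size.
    h = NormalIndPower().solve_power(effect_size=None, nobs1=n,
                                     alpha=0.05, power=0.8, ratio=1.0)
    # Invert the arcsine transform to turn h back into a conversion rate.
    detectable_rate = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
    relative_lift = detectable_rate / baseline - 1
    print(f"{weeks} weeks: ~{relative_lift:.1%} relative lift detectable")
```

Showing the detectable lift at 2, 4, and 8 weeks gives the PM a concrete menu of tradeoffs instead of a flat "no".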

Test Duration

How long should an experiment run?

Factors to consider:

Factor            | Impact
Sample size needs | Primary driver of duration
Weekly patterns   | Run full weeks (capture weekday/weekend)
Novelty effects   | New features may spike then normalize
External events   | Avoid holidays, launches, outages
Maturation        | Some effects take time to develop

Minimum recommendations:

  • At least 1 full week (ideally 2)
  • At least 1,000 conversions per variant
  • Long enough for novelty to wear off (2+ weeks for major UI changes)
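
To turn a sample-size requirement into a calendar duration, divide by the traffic that actually enters the experiment and round up to whole weeks, per the guidance above. A rough sketch; the sample-size and traffic numbers are placeholders, and it assumes each day brings distinct users.

```python
# Sketch: convert a sample-size requirement into an experiment duration.
import math

n_per_variant = 3_000_000          # output of the power calculation (placeholder)
n_variants = 2                     # control + one treatment
eligible_users_per_day = 150_000   # assumed daily traffic entering the experiment

days_needed = math.ceil(n_per_variant * n_variants / eligible_users_per_day)
weeks_needed = math.ceil(days_needed / 7)  # round up to capture full weekly cycles
print(f"{days_needed} days -> run for at least {weeks_needed} full weeks")
```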

Traffic Allocation

How to split users between variants:

Split     | Use Case
50/50     | Standard; maximizes statistical power
90/10     | Testing risky changes, want to minimize exposure
Multi-arm | Testing multiple variants (A/B/C/D)
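
The "maximizes statistical power" claim for the even split can be checked directly: for a fixed effect size and power, an uneven split needs more users in total. A sketch under an arbitrary illustrative effect size; in statsmodels, ratio is the size of the second group relative to the first.

```python
# Sketch: total users needed for the same power under 50/50 vs 90/10 splits.
from statsmodels.stats.power import NormalIndPower

effect_size = 0.01        # illustrative Cohen's h
analysis = NormalIndPower()

for label, ratio in [("50/50", 1.0), ("90/10", 1 / 9)]:
    # nobs1 is the first (larger) group; ratio = second group / first group.
    n1 = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                              power=0.8, ratio=ratio)
    total = n1 * (1 + ratio)
    print(f"{label}: {total:,.0f} users in total")
```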

Ramp-up strategy:

  1. Start with 1% traffic (catch major bugs)
  2. Increase to 10% (monitor metrics)
  3. Full 50/50 (run experiment)

Interview insight: "I always recommend a ramp-up phase for new features. Starting at 1% lets us catch implementation bugs before affecting many users."
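
The section doesn't specify how the split is implemented; one common pattern (an assumption here, not from the text) is deterministic hashing of user IDs, which keeps each user's assignment stable across sessions and makes the ramp-up percentage a single dial. A minimal sketch; the function and experiment names are made up.

```python
# Sketch: stable hash-based bucketing with a ramp-up percentage (illustrative).
import hashlib

def assign(user_id: str, experiment: str, ramp_pct: float) -> str:
    """Return 'not_in_experiment', 'control', or 'treatment' for this user."""
    def bucket(salt: str) -> float:
        # Hash user + experiment + salt into a stable number in [0, 1).
        digest = hashlib.sha256(f"{experiment}:{salt}:{user_id}".encode()).hexdigest()
        return int(digest[:8], 16) / 2**32

    # Enrollment and variant use separate hashes, so raising ramp_pct later
    # adds new users without flipping anyone between control and treatment.
    if bucket("enroll") >= ramp_pct:
        return "not_in_experiment"
    return "treatment" if bucket("variant") < 0.5 else "control"

print(assign("user_42", "new_rec_algo", ramp_pct=0.01))  # 1% ramp stage
```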

Pre-Registration

Document your experiment design before you run it:

Pre-registration document:
1. Hypothesis: Clear prediction
2. Primary metric: One metric, how it's measured
3. Sample size: Calculation and assumptions
4. Duration: Start/end dates
5. Analysis plan: Statistical tests to use
6. Decision criteria: What leads to launch/no-launch
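
Items 5 and 6 of such a document can be written down as a small script before launch, so the analysis that eventually runs is exactly the one that was pre-registered. A minimal sketch using a two-proportion z-test; the thresholds and counts below are placeholders, not real results.

```python
# Sketch: a pre-registered analysis plan captured as code (placeholder numbers).
from statsmodels.stats.proportion import proportions_ztest

# Decision criteria, fixed before launch.
ALPHA = 0.05
MIN_RELATIVE_LIFT = 0.01      # launch only if the lift is at least 1% relative

# Filled in after the experiment ends: [control, treatment].
conversions = [5_210, 5_420]  # placeholder counts
users = [100_000, 100_000]

z_stat, p_value = proportions_ztest(conversions, users, alternative="two-sided")
lift = (conversions[1] / users[1]) / (conversions[0] / users[0]) - 1

launch = (p_value < ALPHA) and (lift >= MIN_RELATIVE_LIFT)
print(f"p={p_value:.4f}, lift={lift:.1%}, "
      f"decision={'launch' if launch else 'no launch'}")
```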

Why it matters:

  • Prevents p-hacking (changing analysis to get significance)
  • Documents assumptions for stakeholders
  • Creates accountability

Pre-registration is increasingly expected at top companies. Mention it proactively to show you understand rigorous experimentation.
