A/B Testing & Experimentation

A/B Test Design


Designing a solid experiment is half the battle. Poor design leads to inconclusive or misleading results, wasting time and resources.

Metric Selection

Choose metrics carefully - they drive decisions:

Primary Metric: The one metric that determines success/failure.

  • Should directly measure what you care about
  • Must be measurable within experiment timeframe
  • Should have enough signal (not too rare)

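Secondary Metrics: Provide additional insight into why the primary metric moved, but do not change the launch decision.

Guardrail Metrics: Metrics that must not get worse. A meaningful regression here blocks the launch even if the primary metric wins.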

Example - Testing a new recommendation algorithm:

| Type | Metric | Why |
| --- | --- | --- |
| Primary | Click-through rate on recommendations | Direct measure of relevance |
| Secondary | Time spent on clicked items | Quality of recommendations |
| Guardrail | Overall page load time | Algorithm shouldn't slow things down |
| Guardrail | Revenue per user | Shouldn't hurt monetization |
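
To make the roles of the three metric types concrete, here is a minimal sketch of a launch rule. The function name, its arguments, and the metric name `page_load_time` are hypothetical and only illustrate the logic: guardrails can veto, the primary metric decides, and secondary metrics are deliberately not an input.

```python
def launch_decision(primary_significant_win, guardrail_regressions):
    """Return a launch call from the primary metric and guardrail checks.

    primary_significant_win: True if the primary metric improved significantly.
    guardrail_regressions: names of guardrail metrics that regressed beyond
        their tolerance (the tolerance itself is assumed to be set elsewhere).
    Secondary metrics are intentionally absent: they inform, not decide.
    """
    if guardrail_regressions:
        # Any guardrail breach blocks the launch, even with a primary win.
        return "no-launch (guardrail: " + ", ".join(guardrail_regressions) + ")"
    return "launch" if primary_significant_win else "no-launch"


print(launch_decision(True, []))
print(launch_decision(True, ["page_load_time"]))
```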

Minimum Detectable Effect (MDE)

MDE is the smallest effect size your experiment can reliably detect, given its sample size, significance level, and statistical power.

Tradeoffs:
- Smaller MDE → Need more users → Longer experiment
- Larger MDE → Need fewer users → Might miss real effects smaller than the MDE
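
The tradeoff can be seen numerically with a rough sample-size sketch. It assumes a two-sided two-proportion z-test with the standard normal approximation, alpha = 0.05, and 80% power; the function name and defaults are illustrative choices, not a prescribed calculator.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Shrinking the MDE by 10x raises the required sample by roughly 100x.
for mde in (0.01, 0.05, 0.10):
    n = sample_size_per_group(baseline=0.05, relative_mde=mde)
    print(f"{mde:.0%} relative lift: {n:,} users per group")
```

Because the required sample grows with the inverse square of the effect size, halving the MDE roughly quadruples the number of users needed.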

How to choose MDE:

  1. Business impact: What change would matter? A 0.1% lift in conversion might be worth millions for a large platform.

  2. Feasibility: Based on baseline rate and traffic, what can you detect in a reasonable timeframe?

  3. Expected effect: What do similar changes typically produce?
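
For the feasibility question, the calculation can be turned around: given the users you can realistically collect, what lift is detectable? The sketch below is an approximation under stated assumptions: a two-sided z-test, alpha = 0.05, 80% power, and the baseline variance used for both groups. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_relative_lift(baseline, users_per_group, alpha=0.05, power=0.80):
    """Approximate smallest relative lift detectable with the given sample."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Standard error of the difference, using the baseline rate for both arms.
    absolute_mde = z * sqrt(2 * baseline * (1 - baseline) / users_per_group)
    return absolute_mde / baseline


lift = detectable_relative_lift(baseline=0.05, users_per_group=100_000)
print(f"Detectable relative lift: {lift:.1%}")
```

If the result is much larger than the effect similar changes have produced, the experiment is unlikely to be conclusive at that traffic level.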

Interview question: "The PM wants to detect a 1% relative lift in conversion (from 5% to 5.05%). Your calculator says you need about 3 million users per group. What do you do?"

Good answer: "I'd push back and discuss:

  1. Is a 1% lift what we realistically expect? What evidence suggests the effect will be this small?
  2. Can we run for longer to accumulate users?
  3. Is there a proxy metric with lower variance?
  4. Should we focus on a more impactful change first?"

Test Duration

How long should an experiment run?

Factors to consider:

| Factor | Impact |
| --- | --- |
| Sample size needs | Primary driver of duration |
| Weekly patterns | Run full weeks (capture weekday/weekend) |
| Novelty effects | New features may spike then normalize |
| External events | Avoid holidays, launches, outages |
| Maturation | Some effects take time to develop |

Minimum recommendations:

  • At least 1 full week (ideally 2)
  • At least 1,000 conversions per variant
  • Long enough for novelty to wear off (2+ weeks for major UI changes)
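
A simple way to combine sample size needs with the full-week rule is to compute the raw number of days and round up to whole weeks. This sketch assumes each user enters the experiment once and that daily eligible traffic is roughly constant; the function name and inputs are hypothetical.

```python
from math import ceil


def experiment_duration_days(users_per_variant, n_variants,
                             daily_eligible_users, min_weeks=1):
    """Days to run, rounded up to full weeks, with a minimum week count."""
    total_users = users_per_variant * n_variants
    raw_days = ceil(total_users / daily_eligible_users)
    weeks = max(min_weeks, ceil(raw_days / 7))  # never stop mid-week
    return weeks * 7


# 122,000 users per variant, two variants, 20,000 new eligible users per day.
print(experiment_duration_days(122_000, 2, 20_000))
```

For a major UI change, passing `min_weeks=2` or more applies the novelty-effect recommendation as a floor.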

Traffic Allocation

How to split users between variants:

| Split | Use Case |
| --- | --- |
| 50/50 | Standard - maximizes statistical power |
| 90/10 | Testing risky changes, want to minimize exposure |
| Multi-arm | Testing multiple variants (A/B/C/D) |
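
To see why an even split maximizes power, the sketch below compares a 50/50 and a 90/10 allocation of the same total traffic. It uses the normal approximation for a two-sided two-proportion z-test; the rates (5% vs 5.5%) and the total of 100,000 users are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_treatment, n_control, n_treatment, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_treatment * (1 - p_treatment) / n_treatment)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return NormalDist().cdf(abs(p_treatment - p_control) / se - z_crit)


total = 100_000
even = approx_power(0.05, 0.055, total // 2, total // 2)
skewed = approx_power(0.05, 0.055, int(total * 0.9), int(total * 0.1))
print(f"50/50 power: {even:.2f}")
print(f"90/10 power: {skewed:.2f}")
```

The smaller arm dominates the standard error, so a 90/10 split needs a longer run to reach the power a 50/50 split gets from the same traffic.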

Ramp-up strategy:

  1. Start with 1% traffic (catch major bugs)
  2. Increase to 10% (monitor metrics)
  3. Full 50/50 (run experiment)
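
One common way to implement a ramp-up is deterministic hash-based bucketing, so a user who sees the treatment at 1% still sees it at 10% and 50%. The sketch below is one possible approach, not a specific platform's API; the function names, the experiment key `new_recs`, and the 10,000-bucket granularity are assumptions.

```python
import hashlib

BUCKETS = 10_000  # assumed granularity: 0.01% of traffic per bucket


def bucket(user_id, experiment):
    """Map a user to a stable bucket, salted by experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % BUCKETS


def assign(user_id, experiment, treatment_share):
    """Return 'treatment' for the first treatment_share of buckets."""
    cutoff = round(treatment_share * BUCKETS)
    return "treatment" if bucket(user_id, experiment) < cutoff else "control"


# Ramp from 1% to 10% to 50%: the treatment group only ever grows.
for share in (0.01, 0.10, 0.50):
    treated = sum(assign(u, "new_recs", share) == "treatment" for u in range(2000))
    print(f"share={share:.0%}: {treated} of 2000 users in treatment")
```

Salting the hash with the experiment name keeps assignments independent across experiments, so the same users are not always the first to be exposed.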

Interview insight: "I always recommend a ramp-up phase for new features. Starting at 1% lets us catch implementation bugs before affecting many users."

Pre-Registration

Document your experiment design before running:

Pre-registration document:
1. Hypothesis: Clear prediction
2. Primary metric: One metric, how it's measured
3. Sample size: Calculation and assumptions
4. Duration: Start/end dates
5. Analysis plan: Statistical tests to use
6. Decision criteria: What leads to launch/no-launch
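
If you want the document to be machine-readable, the six fields above can be captured in an immutable record. This is only a sketch: the class name, field names, dates, and sample size are placeholders for illustration, reusing the recommendation-algorithm example from earlier.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import date


@dataclass(frozen=True)  # frozen: fields cannot be edited after registration
class PreRegistration:
    hypothesis: str
    primary_metric: str
    users_per_variant: int
    start: date
    end: date
    analysis_plan: str
    decision_criteria: str


plan = PreRegistration(
    hypothesis="New recommendation algorithm raises CTR on recommendations",
    primary_metric="Click-through rate on recommendations",
    users_per_variant=122_000,  # illustrative value, not a real calculation
    start=date(2025, 1, 6),     # placeholder dates covering two full weeks
    end=date(2025, 1, 19),
    analysis_plan="Two-sided two-proportion z-test, alpha = 0.05",
    decision_criteria="Launch if CTR lift is significant and no guardrail regresses",
)

try:
    plan.primary_metric = "Time spent on clicked items"  # switching metrics later
except FrozenInstanceError:
    print("Primary metric is locked once the plan is registered.")
```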

Why it matters:

  • Prevents p-hacking (changing analysis to get significance)
  • Documents assumptions for stakeholders
  • Creates accountability

Pre-registration is increasingly expected at top companies. Mention it proactively to show you understand rigorous experimentation.
