A/B Testing & Experimentation

Common A/B Testing Pitfalls

Knowing what can go wrong is as important as knowing how to run experiments. These pitfalls appear frequently in interviews and real-world testing.

The Peeking Problem

What it is: Checking results repeatedly and stopping when you see significance.

Why it's bad: Each peek inflates your false positive rate. With daily checks on a 2-week experiment, your effective false positive rate can reach 20-30%, even though your nominal α is 5%.

Example:
- Day 3: p = 0.08 → Continue
- Day 5: p = 0.12 → Continue
- Day 7: p = 0.04 → "Significant! Ship it!"

Reality: You just got lucky on day 7. The true effect might be zero.
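The inflation is easy to demonstrate with an A/A simulation: both arms share the same true conversion rate, so every "significant" result is a false positive. The sample sizes and rates below are illustrative assumptions, not from any real experiment.

```python
import numpy as np
from scipy import stats

# A/A simulation: both arms have the SAME true conversion rate, so any
# "significant" result is a false positive. All parameters are illustrative.
rng = np.random.default_rng(7)
n_sims, daily_n, days, alpha, p_true = 1000, 200, 14, 0.05, 0.10

peeking_fp = 0  # would have stopped early at some peek with p < alpha
fixed_fp = 0    # looked exactly once, at the pre-committed end date

for _ in range(n_sims):
    a = rng.binomial(1, p_true, daily_n * days)
    b = rng.binomial(1, p_true, daily_n * days)
    peeked = False
    for day in range(1, days + 1):
        n = day * daily_n
        # t-test on 0/1 outcomes approximates a two-proportion z-test at large n
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < alpha:
            peeked = True
        if day == days and p < alpha:
            fixed_fp += 1
    if peeked:
        peeking_fp += 1

print(f"fixed-horizon false positive rate: {fixed_fp / n_sims:.3f}")   # near 0.05
print(f"daily-peeking false positive rate: {peeking_fp / n_sims:.3f}")  # far above 0.05
```

The fixed-horizon rate lands near the nominal 5%, while "stop at the first significant peek" typically fires on 20%+ of these null experiments.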

Solutions:

  1. Pre-commit to duration: Don't look at results until experiment ends
  2. Sequential testing: Use methods that account for multiple looks (e.g., group sequential designs)
  3. Adjust α: Use Pocock or O'Brien-Fleming boundaries

Interview answer: "I never stop an experiment early just because p < 0.05. I either use sequential testing with proper alpha-spending functions, or I commit to a fixed duration upfront."

Novelty and Primacy Effects

Novelty effect: Users engage more with new features simply because they're new. Effect fades over time.

Primacy effect: Users resist change and engage less initially. Effect fades as they adapt.

Effect    Initial response   Long-term response    Example
-------   ----------------   -------------------   ------------------------------------
Novelty   Metric up          Returns to baseline   New button color gets more clicks
Primacy   Metric down        Returns to baseline   Redesigned navigation confuses users

Detection strategies:

  • Run experiments for 2+ weeks minimum
  • Plot metrics over time within experiment
  • Compare week 1 vs week 2 within treatment group
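The week 1 vs week 2 comparison can be sketched with hypothetical daily counts (all numbers invented to mimic a decaying novelty effect):

```python
# Hypothetical (conversions, users) per day over a 14-day experiment.
# Treatment's lift decays after week 1: the novelty-effect signature.
treatment = [(172, 1000)] * 7 + [(155, 1000)] * 7
control = [(150, 1000)] * 14

def rate(days):
    conversions = sum(c for c, _ in days)
    users = sum(n for _, n in days)
    return conversions / users

lifts = {}
for label, window in [("week 1", slice(0, 7)), ("week 2", slice(7, 14))]:
    t, c = rate(treatment[window]), rate(control[window])
    lifts[label] = (t - c) / c
    print(f"{label}: treatment {t:.1%} vs control {c:.1%}, lift {lifts[label]:+.1%}")
# week 1 lift +14.7%, week 2 lift +3.3%
```

A large week-1 lift that shrinks in week 2, while the control stays flat, is the pattern to look for before trusting the aggregate number.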

Interview scenario: "We launched a new feature and saw 15% lift in week 1, but only 3% lift in week 2. What happened?"

Answer: "This looks like a novelty effect. Users were initially excited about the new feature, but engagement normalized. The true long-term effect is likely closer to 3%. I'd recommend running for another 1-2 weeks to confirm the steady-state effect."

Simpson's Paradox in Experiments

When aggregate results contradict segmented results:

Overall:
- Treatment: 10.5% conversion
- Control: 10.0% conversion
- Conclusion: Treatment wins!

By segment:
Mobile:   Treatment 8.0% < Control 8.5%
Desktop:  Treatment 11.0% < Control 12.0%
- Conclusion: Control wins in BOTH segments!

How this happens: Treatment group had more desktop users (who convert higher), creating a misleading aggregate.

Prevention:

  • Check balance across key segments
  • Stratified randomization (force equal distribution)
  • Always segment analysis by device, new/returning, etc.
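The aggregate-vs-segment flip above can be reproduced with concrete counts that match the quoted rates (the absolute user counts are assumptions chosen so the arithmetic works out):

```python
# (conversions, users) per segment; rates match the example above.
# Treatment skews desktop, control skews mobile, which drives the paradox.
treatment = {"mobile": (80, 1000), "desktop": (550, 5000)}   # 8.0%, 11.0%
control = {"mobile": (340, 4000), "desktop": (360, 3000)}    # 8.5%, 12.0%

def seg_rate(conv_users):
    conv, users = conv_users
    return conv / users

def overall(groups):
    conv = sum(c for c, _ in groups.values())
    users = sum(u for _, u in groups.values())
    return conv / users

print(f"overall: treatment {overall(treatment):.1%} vs control {overall(control):.1%}")
# overall: treatment 10.5% vs control 10.0% (treatment "wins")
for seg in treatment:
    # control wins within every segment
    print(f"{seg}: treatment {seg_rate(treatment[seg]):.1%} < control {seg_rate(control[seg]):.1%}")
```

Because treatment has 5x more desktop traffic (the high-converting segment), its blended rate beats control even though it loses in both segments.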

Network Effects and Interference

When treatment affects control through user interactions:

Example - Messaging feature test:

  • Treatment users can share content in new format
  • They share with control users
  • Control users see the new format anyway
  • Experiment is contaminated

Solutions:

  • Cluster randomization: Randomize by geography, team, or social cluster
  • Market-level experiments: Different cities get different treatments
  • Holdout groups: Some percentage of users never gets the feature, preserving a clean long-term baseline

Interview question: "How would you test a viral referral feature?"

Answer: "I'd use cluster randomization. Individual user randomization doesn't work because treated users refer control users. I might randomize by city or by connected user clusters, accepting that I'll need more time (fewer independent units = less power)."
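A minimal sketch of deterministic cluster assignment, assuming the clustering key (here, a city name) and the salt string are chosen per experiment:

```python
import hashlib

def assign_arm(cluster_id: str, salt: str = "referral-test-v1") -> str:
    """Hash-based cluster assignment: every user in a cluster gets the
    same arm, so treated users mostly refer people who are already in
    the same arm, limiting contamination of the control group."""
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Assignment is deterministic: the same city always lands in the same arm.
cities = [f"city-{i}" for i in range(5)]
print({city: assign_arm(city) for city in cities})
```

The analysis then treats the cluster, not the user, as the unit of randomization, which is exactly why power drops: a few hundred cities is far fewer independent units than millions of users.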

Multiple Comparison Problem

Testing many metrics increases false positives:

If testing 20 metrics at α = 0.05:
Expected false positives = 20 × 0.05 = 1

On average, you'll find one "significant" result by chance alone.

Corrections:

  • Bonferroni: Divide α by number of tests (conservative)
  • Benjamini-Hochberg: Controls false discovery rate (less conservative)
  • Pre-specify primary metric: Only one metric determines launch decision
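Both corrections can be sketched on hypothetical p-values from 20 metrics (the values are invented so the two methods visibly disagree):

```python
# 5 small p-values plus 15 clearly null metrics (illustrative values).
pvals = [0.001, 0.004, 0.006, 0.008, 0.012] + [0.2 + 0.04 * i for i in range(15)]
alpha = 0.05
m = len(pvals)

# Bonferroni: compare every p-value to alpha / m (0.0025 here).
bonferroni = [p < alpha / m for p in pvals]

# Benjamini-Hochberg: reject the k smallest p-values, where k is the
# largest rank i such that p_(i) <= (i / m) * alpha.
order = sorted(range(m), key=lambda i: pvals[i])
k = 0
for rank, i in enumerate(order, start=1):
    if pvals[i] <= rank / m * alpha:
        k = rank
bh = [False] * m
for i in order[:k]:
    bh[i] = True

print(f"Bonferroni rejects {sum(bonferroni)} of {m}")          # 1 of 20
print(f"Benjamini-Hochberg rejects {sum(bh)} of {m}")          # 5 of 20
```

Bonferroni keeps only the single smallest p-value; Benjamini-Hochberg keeps all five genuinely small ones, which is why it is the usual choice for screening many secondary metrics.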

Interview tip: "I distinguish between primary metrics (require correction) and exploratory metrics (interpret with caution, no correction)."

Selection Bias

When groups aren't truly comparable:

Bias type        Example                                         Problem
--------------   ---------------------------------------------   -------------------------
Survivor bias    Only analyzing users who completed onboarding   Misses dropouts
Self-selection   Users opt into a beta                           Beta users are different
Timing bias      Treatment rolled out during a holiday           Seasonal effects confound

Prevention: Always verify that treatment and control groups are balanced on pre-experiment characteristics (demographics, past behavior).
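One cheap, standard balance check is a sample-ratio-mismatch (SRM) test on group sizes; the counts below are hypothetical:

```python
from scipy.stats import chisquare

# With a 50/50 split, group sizes should be statistically indistinguishable.
# A tiny p-value means randomization (or event logging) is broken; do not
# trust the experiment's results until the cause is found.
n_treatment, n_control = 50_550, 49_450  # hypothetical assignment counts
expected = [(n_treatment + n_control) / 2] * 2
stat, p_value = chisquare([n_treatment, n_control], f_exp=expected)
print(f"SRM chi-square p-value: {p_value:.2g}")  # well below 0.001: investigate
```

The same comparison should be repeated on pre-experiment covariates (past sessions, tenure, device mix) before reading any outcome metric.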

The best experimenters are paranoid. Always ask "What could make these results misleading?"
