A/B Testing & Experimentation
Common A/B Testing Pitfalls
Knowing what can go wrong is as important as knowing how to run experiments. These pitfalls appear frequently in interviews and real-world testing.
The Peeking Problem
What it is: Checking results repeatedly and stopping when you see significance.
Why it's bad: Each peek is another chance for a false positive. With daily checks on a 2-week experiment, your actual false positive rate can reach 20-30%, not the nominal 5%.
Example:
- Day 3: p = 0.08 → Continue
- Day 5: p = 0.12 → Continue
- Day 7: p = 0.04 → "Significant! Ship it!"
Reality: You just got lucky on day 7. The true effect might be zero.
Solutions:
- Pre-commit to duration: Don't look at results until experiment ends
- Sequential testing: Use methods that account for multiple looks (e.g., group sequential designs)
- Adjust α per look: Pocock or O'Brien-Fleming boundaries spread the error budget across interim analyses
Interview answer: "I never stop an experiment early just because p < 0.05. I either use sequential testing with proper alpha-spending functions, or I commit to a fixed duration upfront."
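The inflation is easy to demonstrate with a simulation. The sketch below (sample sizes and peek schedule are made up for illustration) runs A/A experiments where the true effect is zero, peeks daily with a z-test, and compares the resulting false positive rate against a single fixed-horizon test:

```python
import numpy as np

def simulate_peeking(n_sims=1000, n_per_day=200, n_days=14, seed=0):
    """Run A/A experiments (true effect = 0) and compare false positive
    rates for daily peeking vs. a single fixed-horizon test."""
    rng = np.random.default_rng(seed)
    peek_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        # Both arms draw from the same distribution: any "effect" is noise.
        a = rng.normal(0, 1, n_per_day * n_days)
        b = rng.normal(0, 1, n_per_day * n_days)
        # Peeking: stop at the first day the z-test crosses |z| > 1.96.
        for day in range(1, n_days + 1):
            n = day * n_per_day
            diff = a[:n].mean() - b[:n].mean()
            se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
            if abs(diff / se) > 1.96:
                peek_hits += 1
                break
        # Fixed horizon: one look, at the end of the experiment.
        n = n_per_day * n_days
        diff = a.mean() - b.mean()
        se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
        if abs(diff / se) > 1.96:
            fixed_hits += 1
    return peek_hits / n_sims, fixed_hits / n_sims
```

With these defaults, the peeking false positive rate typically lands at several times the nominal 5%, while the fixed-horizon test stays near it.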
Novelty and Primacy Effects
Novelty effect: Users engage more with new features simply because they're new. Effect fades over time.
Primacy effect: Users resist change and engage less initially. Effect fades as they adapt.
| Effect | Initial Response | Long-term | Example |
|---|---|---|---|
| Novelty | Metric up | Returns to baseline | New button color gets more clicks |
| Primacy | Metric down | Returns to baseline | Redesigned navigation confuses users |
Detection strategies:
- Run experiments for 2+ weeks minimum
- Plot metrics over time within experiment
- Compare week 1 vs week 2 within treatment group
Interview scenario: "We launched a new feature and saw 15% lift in week 1, but only 3% lift in week 2. What happened?"
Answer: "This looks like a novelty effect. Users were initially excited about the new feature, but engagement normalized. The true long-term effect is likely closer to 3%. I'd recommend running for another 1-2 weeks to confirm the steady-state effect."
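The week-over-week check described above can be automated. The sketch below uses hypothetical weekly conversion rates and a simple decay heuristic (the `0.5` threshold is an arbitrary choice for illustration):

```python
def weekly_lift(treatment_by_week, control_by_week):
    """Relative lift per week from per-week metric means
    (e.g. conversion rates). A shrinking lift suggests novelty."""
    return [(t - c) / c for t, c in zip(treatment_by_week, control_by_week)]

def novelty_suspected(lifts, decay_threshold=0.5):
    """Flag a likely novelty effect if the latest lift has fallen below
    `decay_threshold` times the week-1 lift."""
    return lifts[0] > 0 and lifts[-1] < decay_threshold * lifts[0]

# The interview scenario: 15% lift in week 1, 3% in week 2.
lifts = weekly_lift([0.115, 0.103], [0.10, 0.10])
print(novelty_suspected(lifts))  # → True
```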
Simpson's Paradox in Experiments
When aggregate results contradict segmented results:
Overall:
- Treatment: 10.5% conversion
- Control: 10.0% conversion
- Conclusion: Treatment wins!
By segment:
- Mobile: Treatment 8.0% < Control 8.5%
- Desktop: Treatment 11.0% < Control 12.0%
- Conclusion: Control wins in BOTH segments!
How this happens: Treatment group had more desktop users (who convert higher), creating a misleading aggregate.
Prevention:
- Check balance across key segments
- Stratified randomization (force equal distribution)
- Always segment analysis by device, new/returning, etc.
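The reversal is easy to verify numerically. The segment counts below are hypothetical but chosen to reproduce the rates above, with the treatment arm skewed toward desktop:

```python
# Hypothetical (conversions, users) per arm and segment, chosen to
# reproduce the rates in the text. Treatment is desktop-heavy.
segments = {
    "mobile":  {"treat": (160, 2000),   "ctrl": (340, 4000)},  # 8.0% vs 8.5%
    "desktop": {"treat": (1100, 10000), "ctrl": (360, 3000)},  # 11.0% vs 12.0%
}

def rate(conversions, users):
    return conversions / users

# Control wins within every segment...
for seg in segments.values():
    assert rate(*seg["treat"]) < rate(*seg["ctrl"])

# ...but the desktop-heavy treatment arm "wins" in aggregate.
treat_rate = (sum(s["treat"][0] for s in segments.values())
              / sum(s["treat"][1] for s in segments.values()))
ctrl_rate = (sum(s["ctrl"][0] for s in segments.values())
             / sum(s["ctrl"][1] for s in segments.values()))
print(f"aggregate: treatment {treat_rate:.1%} vs control {ctrl_rate:.1%}")
# → aggregate: treatment 10.5% vs control 10.0%
```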
Network Effects and Interference
When treatment affects control through user interactions:
Example - Messaging feature test:
- Treatment users can share content in new format
- They share with control users
- Control users see the new format anyway
- Experiment is contaminated
Solutions:
- Cluster randomization: Randomize by geography, team, or social cluster
- Market-level experiments: Different cities get different treatments
- Long-term holdouts: Some percentage never gets the feature, so long-run effects stay measurable
Interview question: "How would you test a viral referral feature?"
Answer: "I'd use cluster randomization. Individual user randomization doesn't work because treated users refer control users. I might randomize by city or by connected user clusters, accepting that I'll need more time (fewer independent units = less power)."
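One common way to implement cluster randomization is deterministic hash-based bucketing of the cluster ID rather than the user ID. The sketch below assumes a hypothetical experiment salt; everyone in the same cluster lands in the same arm, so treated users can't leak the feature to control users in their own cluster:

```python
import hashlib

def assign_cluster(cluster_id, salt="referral_test_v1"):
    """Deterministically assign a whole cluster (e.g. a city) to an arm.

    Hashing `salt:cluster_id` keeps the assignment stable across sessions
    and servers. `salt` is a hypothetical per-experiment key; changing it
    reshuffles clusters for the next experiment.
    """
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"
```

Usage: `assign_cluster("nyc")` returns the same arm every time it is called, and across many clusters both arms appear.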
Multiple Comparison Problem
Testing many metrics increases false positives:
If testing 20 metrics at α = 0.05:
Expected false positives = 20 × 0.05 = 1
Even if no metric truly moved, you'll find about one "significant" result by chance alone.
Corrections:
- Bonferroni: Divide α by number of tests (conservative)
- Benjamini-Hochberg: Controls false discovery rate (less conservative)
- Pre-specify primary metric: Only one metric determines launch decision
Interview tip: "I distinguish between primary metrics (require correction) and exploratory metrics (interpret with caution, no correction)."
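Both corrections are only a few lines to implement. Here is a sketch of Bonferroni and the Benjamini-Hochberg step-up procedure:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p < alpha / m. Controls the family-wise error
    rate; conservative when tests are many or correlated."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up: find the largest rank k with p_(k) <= (k/m) * alpha,
    then reject the k smallest p-values. Controls the false discovery
    rate, so it rejects more than Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject

ps = [0.01, 0.02, 0.03, 0.5]
print(bonferroni(ps))          # → [True, False, False, False]
print(benjamini_hochberg(ps))  # → [True, True, True, False]
```

Note how BH rejects three of the four hypotheses where Bonferroni keeps only one, illustrating the conservative/less-conservative trade-off above.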
Selection Bias
When groups aren't truly comparable:
| Bias Type | Example | Problem |
|---|---|---|
| Survivor bias | Only analyze users who completed onboarding | Misses dropouts |
| Self-selection | Users opt into beta | Beta users are different |
| Timing bias | Treatment rolled out during holiday | Seasonal effects confound |
Prevention: Always verify that treatment and control groups are balanced on pre-experiment characteristics (demographics, past behavior).
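A standard balance check is the standardized mean difference (SMD) on pre-experiment covariates; |SMD| < 0.1 is a common rule of thumb for acceptable balance. A sketch with simulated covariates (the distributions are made up for illustration):

```python
import numpy as np

def standardized_mean_diff(treat, ctrl):
    """SMD on a PRE-experiment covariate (e.g. past sessions per user):
    difference in means divided by the pooled standard deviation."""
    treat = np.asarray(treat, dtype=float)
    ctrl = np.asarray(ctrl, dtype=float)
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    return (treat.mean() - ctrl.mean()) / pooled_sd

rng = np.random.default_rng(7)
# Well-randomized groups: same covariate distribution in both arms.
balanced = standardized_mean_diff(rng.normal(10, 2, 5000),
                                  rng.normal(10, 2, 5000))
# Selection-biased groups: control skews toward heavier past usage.
skewed = standardized_mean_diff(rng.normal(10, 2, 5000),
                                rng.normal(11, 2, 5000))
```

Here `balanced` lands near zero while `skewed` is far outside the 0.1 threshold, the signal to investigate before trusting the experiment's results.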
The best experimenters are paranoid. Always ask "What could make these results misleading?"