A/B Testing & Experimentation
Common A/B Testing Pitfalls
Knowing what can go wrong is as important as knowing how to run experiments. These pitfalls appear frequently in interviews and real-world testing.
The Peeking Problem
What it is: Checking results repeatedly and stopping when you see significance.
Why it's bad: Each peek is another chance for a false positive. With daily checks on a 2-week experiment, your actual false positive rate can reach 20-30%, not the nominal 5%.
Example:
- Day 3: p = 0.08 → Continue
- Day 5: p = 0.12 → Continue
- Day 7: p = 0.04 → "Significant! Ship it!"
Reality: You just got lucky on day 7. The true effect might be zero.
Solutions:
- Pre-commit to duration: Don't look at results until experiment ends
- Sequential testing: Use methods that account for multiple looks (e.g., group sequential designs)
- Adjust α per look: Pocock or O'Brien-Fleming boundaries spread the error budget across interim analyses
Interview answer: "I never stop an experiment early just because p < 0.05. I either use sequential testing with proper alpha-spending functions, or I commit to a fixed duration upfront."
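The inflation is easy to demonstrate with a simulation. The sketch below (sample sizes and peek schedule are made up for illustration) runs A/A experiments where the true effect is zero, peeks daily with a z-test, and compares the resulting false positive rate against a single fixed-horizon test:

```python
import numpy as np

def simulate_peeking(n_sims=1000, n_per_day=200, n_days=14, seed=0):
    """Run A/A experiments (true effect = 0) and compare false positive
    rates for daily peeking vs. a single fixed-horizon test."""
    rng = np.random.default_rng(seed)
    peek_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        # Both arms draw from the same distribution: any "effect" is noise.
        a = rng.normal(0, 1, n_per_day * n_days)
        b = rng.normal(0, 1, n_per_day * n_days)
        # Peeking: stop at the first day the z-test crosses |z| > 1.96.
        for day in range(1, n_days + 1):
            n = day * n_per_day
            diff = a[:n].mean() - b[:n].mean()
            se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
            if abs(diff / se) > 1.96:
                peek_hits += 1
                break
        # Fixed horizon: one look, at the end of the experiment.
        n = n_per_day * n_days
        diff = a.mean() - b.mean()
        se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
        if abs(diff / se) > 1.96:
            fixed_hits += 1
    return peek_hits / n_sims, fixed_hits / n_sims
```

With these defaults, the peeking false positive rate typically lands at several times the nominal 5%, while the fixed-horizon test stays near it.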
Novelty and Primacy Effects
Novelty effect: Users engage more with new features simply because they're new. Effect fades over time.
Primacy effect: Users resist change and engage less initially. Effect fades as they adapt.
| Effect | Initial Response | Long-term | Example |
|---|---|---|---|
| Novelty | Metric up | Returns to baseline | New button color gets more clicks |
| Primacy | Metric down | Returns to baseline | Redesigned navigation confuses users |
Detection strategies:
- Run experiments for 2+ weeks minimum
- Plot metrics over time within experiment
- Compare week 1 vs week 2 within treatment group
Interview scenario: "We launched a new feature and saw 15% lift in week 1, but only 3% lift in week 2. What happened?"
Answer: "This looks like a novelty effect. Users were initially excited about the new feature, but engagement normalized. The true long-term effect is likely closer to 3%. I'd recommend running for another 1-2 weeks to confirm the steady-state effect."
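The week-over-week check described above can be automated. The sketch below uses hypothetical weekly conversion rates and a simple decay heuristic (the `0.5` threshold is an arbitrary choice for illustration):

```python
def weekly_lift(treatment_by_week, control_by_week):
    """Relative lift per week from per-week metric means
    (e.g. conversion rates). A shrinking lift suggests novelty."""
    return [(t - c) / c for t, c in zip(treatment_by_week, control_by_week)]

def novelty_suspected(lifts, decay_threshold=0.5):
    """Flag a likely novelty effect if the latest lift has fallen below
    `decay_threshold` times the week-1 lift."""
    return lifts[0] > 0 and lifts[-1] < decay_threshold * lifts[0]

# The interview scenario: 15% lift in week 1, 3% in week 2.
lifts = weekly_lift([0.115, 0.103], [0.10, 0.10])
print(novelty_suspected(lifts))  # → True
```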
Simpson's Paradox in Experiments
When aggregate results contradict segmented results:
Overall:
- Treatment: 10.5% conversion
- Control: 10.0% conversion
- Conclusion: Treatment wins!
By segment:
- Mobile: Treatment 8.0% < Control 8.5%
- Desktop: Treatment 11.0% < Control 12.0%
- Conclusion: Control wins in BOTH segments!
How this happens: Treatment group had more desktop users (who convert higher), creating a misleading aggregate.
Prevention:
- Check balance across key segments
- Stratified randomization (force equal distribution)
- Always segment analysis by device, new/returning, etc.
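The reversal is easy to verify numerically. The segment counts below are hypothetical but chosen to reproduce the rates above, with the treatment arm skewed toward desktop:

```python
# Hypothetical (conversions, users) per arm and segment, chosen to
# reproduce the rates in the text. Treatment is desktop-heavy.
segments = {
    "mobile":  {"treat": (160, 2000),   "ctrl": (340, 4000)},  # 8.0% vs 8.5%
    "desktop": {"treat": (1100, 10000), "ctrl": (360, 3000)},  # 11.0% vs 12.0%
}

def rate(conversions, users):
    return conversions / users

# Control wins within every segment...
for seg in segments.values():
    assert rate(*seg["treat"]) < rate(*seg["ctrl"])

# ...but the desktop-heavy treatment arm "wins" in aggregate.
treat_rate = (sum(s["treat"][0] for s in segments.values())
              / sum(s["treat"][1] for s in segments.values()))
ctrl_rate = (sum(s["ctrl"][0] for s in segments.values())
             / sum(s["ctrl"][1] for s in segments.values()))
print(f"aggregate: treatment {treat_rate:.1%} vs control {ctrl_rate:.1%}")
# → aggregate: treatment 10.5% vs control 10.0%
```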
Network Effects and Interference
When treatment affects control through user interactions:
Example - Messaging feature test:
- Treatment users can share content in new format
- They share with control users
- Control users see the new format anyway
- Experiment is contaminated
Solutions:
- Cluster randomization: Randomize by geography, team, or social cluster
- Market-level experiments: Different cities get different treatments
- Long-term holdouts: Some percentage never gets the feature, so long-run effects stay measurable
Interview question: "How would you test a viral referral feature?"
Answer: "I'd use cluster randomization. Individual user randomization doesn't work because treated users refer control users. I might randomize by city or by connected user clusters, accepting that I'll need more time (fewer independent units = less power)."
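One common way to implement cluster randomization is deterministic hash-based bucketing of the cluster ID rather than the user ID. The sketch below assumes a hypothetical experiment salt; everyone in the same cluster lands in the same arm, so treated users can't leak the feature to control users in their own cluster:

```python
import hashlib

def assign_cluster(cluster_id, salt="referral_test_v1"):
    """Deterministically assign a whole cluster (e.g. a city) to an arm.

    Hashing `salt:cluster_id` keeps the assignment stable across sessions
    and servers. `salt` is a hypothetical per-experiment key; changing it
    reshuffles clusters for the next experiment.
    """
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"
```

Usage: `assign_cluster("nyc")` returns the same arm every time it is called, and across many clusters both arms appear.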
Multiple Comparison Problem
Testing many metrics increases false positives:
If testing 20 metrics at α = 0.05:
Expected false positives = 20 × 0.05 = 1
Even if no metric truly moved, you'll find about one "significant" result by chance alone.
Corrections:
- Bonferroni: Divide α by number of tests (conservative)
- Benjamini-Hochberg: Controls false discovery rate (less conservative)
- Pre-specify primary metric: Only one metric determines launch decision
Interview tip: "I distinguish between primary metrics (require correction) and exploratory metrics (interpret with caution, no correction)."
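Both corrections are only a few lines to implement. Here is a sketch of Bonferroni and the Benjamini-Hochberg step-up procedure:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p < alpha / m. Controls the family-wise error
    rate; conservative when tests are many or correlated."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up: find the largest rank k with p_(k) <= (k/m) * alpha,
    then reject the k smallest p-values. Controls the false discovery
    rate, so it rejects more than Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject

ps = [0.01, 0.02, 0.03, 0.5]
print(bonferroni(ps))          # → [True, False, False, False]
print(benjamini_hochberg(ps))  # → [True, True, True, False]
```

Note how BH rejects three of the four hypotheses where Bonferroni keeps only one, illustrating the conservative/less-conservative trade-off above.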
Selection Bias
When groups aren't truly comparable:
| Bias Type | Example | Problem |
|---|---|---|
| Survivor bias | Only analyze users who completed onboarding | Misses dropouts |
| Self-selection | Users opt into beta | Beta users are different |
| Timing bias | Treatment rolled out during holiday | Seasonal effects confound |
Prevention: Always verify that treatment and control groups are balanced on pre-experiment characteristics (demographics, past behavior).
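A standard balance check is the standardized mean difference (SMD) on pre-experiment covariates; |SMD| < 0.1 is a common rule of thumb for acceptable balance. A sketch with simulated covariates (the distributions are made up for illustration):

```python
import numpy as np

def standardized_mean_diff(treat, ctrl):
    """SMD on a PRE-experiment covariate (e.g. past sessions per user):
    difference in means divided by the pooled standard deviation."""
    treat = np.asarray(treat, dtype=float)
    ctrl = np.asarray(ctrl, dtype=float)
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    return (treat.mean() - ctrl.mean()) / pooled_sd

rng = np.random.default_rng(7)
# Well-randomized groups: same covariate distribution in both arms.
balanced = standardized_mean_diff(rng.normal(10, 2, 5000),
                                  rng.normal(10, 2, 5000))
# Selection-biased groups: control skews toward heavier past usage.
skewed = standardized_mean_diff(rng.normal(10, 2, 5000),
                                rng.normal(11, 2, 5000))
```

Here `balanced` lands near zero while `skewed` is far outside the 0.1 threshold, the signal to investigate before trusting the experiment's results.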
The best experimenters are paranoid. Always ask "What could make these results misleading?"