A/B Testing & Experimentation
Experimentation Fundamentals
A/B testing is how modern tech companies make data-driven decisions. Interviewers expect you to understand not just the statistics, but the practical design considerations.
What Makes a Good Experiment
A well-designed experiment has these components:
| Component | Description | Why It Matters |
|---|---|---|
| Randomization | Users randomly assigned to groups | Eliminates selection bias |
| Control Group | Baseline experience | Provides comparison point |
| Treatment Group | New experience | Tests the change |
| Clear Hypothesis | Specific prediction | Focuses analysis |
| Primary Metric | One key outcome | Prevents cherry-picking |
Randomization: The Foundation
Random assignment ensures groups are comparable:
Good randomization:
- User ID hash (deterministic, reproducible)
- Each user always sees same variant
- Balanced group sizes
Bad randomization:
- Time-based (Monday vs Tuesday users differ)
- Geographic (regions have different behaviors)
- Self-selection (users choose their experience)
Interview question: "How would you randomize users for a checkout flow experiment?"
Good answer: "I'd use a hash of the user ID modulo 100. Users with hash 0-49 see control, 50-99 see treatment. This ensures each user consistently sees the same variant across sessions, and the split is balanced."
The Experiment Unit
Choose what you're randomizing carefully:
| Unit | Use When | Considerations |
|---|---|---|
| User | Most experiments | Most common, straightforward |
| Session | Short-term tests | Same user may see different variants |
| Page view | Very granular | High noise, hard to interpret |
| Device | Users can't be identified across devices | One person with several devices may see both variants |
| Cluster (geo, team) | Network effects expected | Fewer units, less power |
Network effects example: If testing a messaging feature, randomizing by user doesn't work - treated users message control users, so the control group's experience is contaminated (interference). Randomize by geography or social cluster instead, as in the sketch below.
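The same hashing idea can be applied at the cluster level; here is a hypothetical sketch (the cluster IDs and experiment name are made up). Note that with cluster randomization, the analysis should also treat the cluster, not the user, as the unit.

```python
import hashlib

def assign_cluster_variant(cluster_id: str, experiment: str) -> str:
    """Assign an entire cluster (e.g., a metro area) to one variant so that
    users who interact with each other share the same experience."""
    digest = hashlib.md5(f"{experiment}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Every user in this region sees the same variant of the messaging feature
print(assign_cluster_variant("seattle_metro", "group_messaging"))
```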
Treatment Effects
What you're trying to measure:
Average Treatment Effect (ATE):
ATE = E[Y(treatment)] - E[Y(control)]
The average difference in outcome between treatment and control groups.
Example interpretation:
- Control conversion: 5.0%
- Treatment conversion: 5.5%
- ATE: +0.5 percentage points (10% relative lift)
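The same arithmetic as a quick sketch, using the rates from the example above:

```python
control_rate = 0.050      # 5.0% control conversion
treatment_rate = 0.055    # 5.5% treatment conversion

ate = treatment_rate - control_rate        # +0.005 = +0.5 percentage points
relative_lift = ate / control_rate         # 0.10  = 10% relative lift
print(f"ATE: {ate * 100:.1f} pp, relative lift: {relative_lift:.0%}")
```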
Sample Size and Power
Before running any experiment, calculate required sample size:
Key parameters:
- α (significance level): Typically 0.05 (5% false positive rate)
- β (Type II error): Typically 0.20 (80% power)
- Minimum Detectable Effect (MDE): Smallest effect worth detecting
- Baseline rate: Current metric value
- Variance: How much the metric varies
Rule of thumb: because required sample size scales with 1/δ², detecting a 1% relative change requires ~100x the sample of detecting a 10% relative change.
Sample size formula (simplified, per group):
n ≈ 16 × σ² / δ²
Where:
- σ² = variance of the metric
- δ = minimum detectable effect
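A small sketch applying this approximation to a conversion-rate metric, where σ² = p(1 − p) for baseline rate p (the function name and numbers are illustrative):

```python
def sample_size_per_group(baseline_rate: float, mde_relative: float) -> int:
    """Approximate per-group sample size using n ≈ 16 * σ² / δ²
    (roughly α = 0.05, power = 0.80)."""
    p = baseline_rate
    variance = p * (1 - p)            # Bernoulli variance of a conversion metric
    delta = p * mde_relative          # absolute minimum detectable effect
    return round(16 * variance / delta ** 2)

# 5% baseline, 10% relative MDE (detect 5.0% -> 5.5%): ~30,400 users per group
print(sample_size_per_group(0.05, 0.10))
# 1% relative MDE needs ~100x that: ~3,040,000 users per group
print(sample_size_per_group(0.05, 0.01))
```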
Interview tip: Always ask about expected effect size. If stakeholders expect a 1% lift but you can only detect 5%, the experiment can't answer their question.
Guardrail Metrics
Beyond your primary metric, monitor guardrails:
| Type | Example | Purpose |
|---|---|---|
| Business guardrails | Revenue, customer support tickets | Ensure no major harm |
| Engagement guardrails | Session length, pages per visit | Catch unintended effects |
| Technical guardrails | Latency, error rates | Ensure implementation quality |
Example scenario: A new checkout flow increases conversion by 2% but increases customer support tickets by 50%. The guardrail metric (support tickets) suggests investigating before launch.
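A sketch of how that check might look, using a two-proportion z-test from statsmodels (all counts are made up to roughly match the scenario):

```python
from statsmodels.stats.proportion import proportions_ztest

n = 800_000  # users per group (hypothetical)

# Primary metric: conversions (treatment ~2% relative lift: 5.1% vs 5.0%)
_, p_conversion = proportions_ztest(count=[40_800, 40_000], nobs=[n, n])

# Guardrail: support tickets (treatment up ~50%: 1.5% vs 1.0%)
_, p_tickets = proportions_ztest(count=[12_000, 8_000], nobs=[n, n])

print(f"conversion p-value: {p_conversion:.3f}")
print(f"support-ticket p-value: {p_tickets:.3g}")
# A significant regression on the guardrail argues for investigating before
# launch, even though the primary metric moved in the right direction.
```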
Every experiment should have at least 2-3 guardrail metrics. They protect against optimizing one thing while breaking another.