A/B Testing & Experimentation

Experimentation Fundamentals


A/B testing is how modern tech companies make data-driven decisions. Interviewers expect you to understand not just the statistics, but the practical design considerations.

What Makes a Good Experiment

A well-designed experiment has these components:

Component | Description | Why It Matters
--- | --- | ---
Randomization | Users randomly assigned to groups | Eliminates selection bias
Control Group | Baseline experience | Provides comparison point
Treatment Group | New experience | Tests the change
Clear Hypothesis | Specific prediction | Focuses analysis
Primary Metric | One key outcome | Prevents cherry-picking

Randomization: The Foundation

Random assignment ensures groups are comparable:

Good randomization:
- User ID hash (deterministic, reproducible)
- Each user always sees same variant
- Balanced group sizes

Bad randomization:
- Time-based (Monday vs Tuesday users differ)
- Geographic (regions have different behaviors)
- Self-selection (users choose their experience)

Interview question: "How would you randomize users for a checkout flow experiment?"

Good answer: "I'd use a hash of the user ID modulo 100. Users with hash 0-49 see control, 50-99 see treatment. This ensures each user consistently sees the same variant across sessions, and the split is balanced."

The Experiment Unit

Choose what you're randomizing carefully:

Unit | Use When | Considerations
--- | --- | ---
User | Most experiments | Most common, straightforward
Session | Short-term tests | Same user may see different variants
Page view | Very granular | High noise, hard to interpret
Device | Cross-device tracking issues | User may have multiple devices
Cluster (geo, team) | Network effects expected | Fewer units, less power

Network effects example: If testing a messaging feature, randomizing by user doesn't work because treated users interact with control users, so the treatment spills over into the control group. Randomize by geography or social cluster instead.

Treatment Effects

What you're trying to measure:

Average Treatment Effect (ATE):

ATE = E[Y(treatment)] - E[Y(control)]

The average difference in outcome between treatment and control groups.

Example interpretation:

  • Control conversion: 5.0%
  • Treatment conversion: 5.5%
  • ATE: +0.5 percentage points (10% relative lift)
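This arithmetic is easy to verify directly; a quick Python check using the numbers above:

```python
control_conversion = 0.050    # 5.0% baseline conversion
treatment_conversion = 0.055  # 5.5% conversion in treatment

ate = treatment_conversion - control_conversion   # absolute difference
relative_lift = ate / control_conversion          # difference relative to baseline

print(f"ATE: {ate:+.2%} (absolute)")              # +0.50%
print(f"Relative lift: {relative_lift:.0%}")      # 10%
```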

Sample Size and Power

Before running any experiment, calculate required sample size:

Key parameters:

  • α (significance level): Typically 0.05 (5% false positive rate)
  • β (Type II error): Typically 0.20 (80% power)
  • Minimum Detectable Effect (MDE): Smallest effect worth detecting
  • Baseline rate: Current metric value
  • Variance: How much the metric varies

Rule of thumb: Because sample size scales with 1/δ², detecting a 1% relative change requires ~100x the sample of detecting a 10% relative change.

Sample size formula (simplified, per group):
n ≈ 16 × σ² / δ²

Where:
- σ² = variance of the metric
- δ = minimum detectable effect

Interview tip: Always ask about expected effect size. If stakeholders expect a 1% lift but your sample size only allows detecting a 5% lift, the experiment can't answer their question.

Guardrail Metrics

Beyond your primary metric, monitor guardrails:

Type | Example | Purpose
--- | --- | ---
Business guardrails | Revenue, customer support tickets | Ensure no major harm
Engagement guardrails | Session length, pages per visit | Catch unintended effects
Technical guardrails | Latency, error rates | Ensure implementation quality

Example scenario: A new checkout flow increases conversion by 2% but increases customer support tickets by 50%. The guardrail metric (support tickets) suggests investigating before launch.
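A simple sketch of how such a check might be automated; the metric names and thresholds here are hypothetical:

```python
# Maximum acceptable relative increase for each guardrail metric (illustrative)
GUARDRAIL_LIMITS = {
    "support_tickets": 0.05,   # at most +5% more tickets
    "p95_latency_ms": 0.10,    # at most +10% higher latency
}

def guardrail_violations(control: dict, treatment: dict) -> list:
    """Return the guardrail metrics whose relative increase exceeds its limit."""
    violations = []
    for metric, limit in GUARDRAIL_LIMITS.items():
        change = (treatment[metric] - control[metric]) / control[metric]
        if change > limit:
            violations.append(f"{metric}: {change:+.0%} (limit {limit:+.0%})")
    return violations

control = {"support_tickets": 200, "p95_latency_ms": 310}
treatment = {"support_tickets": 300, "p95_latency_ms": 315}
print(guardrail_violations(control, treatment))  # ['support_tickets: +50% (limit +5%)']
```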

Every experiment should have at least 2-3 guardrail metrics. They protect against optimizing one thing while breaking another.
