A/B Testing & Experimentation

Experimentation Fundamentals

A/B testing is how modern tech companies make data-driven decisions. Interviewers expect you to understand not just the statistics, but the practical design considerations.

What Makes a Good Experiment

A well-designed experiment has these components:

| Component | Description | Why It Matters |
| --- | --- | --- |
| Randomization | Users randomly assigned to groups | Eliminates selection bias |
| Control Group | Baseline experience | Provides comparison point |
| Treatment Group | New experience | Tests the change |
| Clear Hypothesis | Specific prediction | Focuses analysis |
| Primary Metric | One key outcome | Prevents cherry-picking |

Randomization: The Foundation

Random assignment ensures groups are comparable:

Good randomization:
- User ID hash (deterministic, reproducible)
- Each user always sees same variant
- Balanced group sizes

Bad randomization:
- Time-based (Monday vs Tuesday users differ)
- Geographic (regions have different behaviors)
- Self-selection (users choose their experience)

Interview question: "How would you randomize users for a checkout flow experiment?"

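Good answer: "I'd hash the user ID together with an experiment-specific salt and take the result modulo 100. Users in buckets 0-49 see control, 50-99 see treatment. Each user consistently sees the same variant across sessions, the split is balanced in expectation, and the salt keeps assignments independent of other running experiments."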
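
The bucketing described in that answer can be sketched in a few lines of Python. This is a minimal illustration, not a production assignment service: the function name `assign_variant`, the experiment name `checkout_flow_v2`, and the choice of SHA-256 with an experiment-name salt are all assumptions made for the example.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hypothetical helper: salting with the experiment name is an assumption
    that keeps bucket assignments independent across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    # With treatment_pct=50: buckets 0-49 are control, 50-99 are treatment.
    return "treatment" if bucket >= 100 - treatment_pct else "control"


# Sanity check on made-up user IDs: the split should be close to 50/50.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "checkout_flow_v2")] += 1
```

A cryptographic hash from `hashlib` is used rather than Python's built-in `hash()`, because the built-in is randomized per process for strings by default and would not give the same variant across sessions.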

The Experiment Unit

Choose what you're randomizing carefully:

| Unit | Use When | Considerations |
| --- | --- | --- |
| User | Most experiments | Most common, straightforward |
| Session | Short-term tests | Same user may see different variants |
| Page view | Very granular | High noise, hard to interpret |
| Device | Cross-device tracking issues | User may have multiple devices |
| Cluster (geo, team) | Network effects expected | Fewer units, less power |

Network effects example: If testing a messaging feature, randomizing by user breaks down because treated users interact with control users, so the treatment leaks into the control group and biases the measured effect. Randomize by geography or social cluster instead.
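
One way to picture cluster-level randomization is to hash the cluster ID instead of the user ID, so everyone in a cluster shares a variant. The sketch below assumes a hypothetical `user_cluster` lookup with made-up users and regions; how clusters are actually defined (geography, social graph communities) is outside what this example covers.

```python
import hashlib


def assign_cluster_variant(cluster_id: str, experiment: str) -> str:
    """Assign a whole cluster to one variant by hashing the cluster ID."""
    key = f"{experiment}:{cluster_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket >= 50 else "control"


# Hypothetical mapping from user to cluster (illustrative data only).
user_cluster = {
    "alice": "geo-berlin",
    "bob": "geo-berlin",
    "carol": "geo-lyon",
}


def variant_for_user(user_id: str, experiment: str) -> str:
    """Look up the user's cluster, then return that cluster's variant."""
    return assign_cluster_variant(user_cluster[user_id], experiment)
```

Because the randomization unit is now the cluster, the effective sample size is the number of clusters rather than the number of users, which is why this design has less power.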

Treatment Effects

What you're trying to measure:

Average Treatment Effect (ATE):

ATE = E[Y(treatment)] - E[Y(control)]

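The average difference in outcome between treatment and control groups. Because assignment is random, the difference in observed group means is an unbiased estimate of the ATE.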

Example interpretation:

  • Control conversion: 5.0%
  • Treatment conversion: 5.5%
  • ATE: +0.5 percentage points (10% relative lift)
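
The arithmetic in this example can be reproduced with a short script. It is a sketch under stated assumptions: the group sizes (20,000 users each) and conversion counts are made up to match the 5.0% and 5.5% rates, the function name `ate_for_proportions` is hypothetical, and the confidence interval uses the standard normal approximation for a difference of two proportions.

```python
from math import sqrt
from statistics import NormalDist


def ate_for_proportions(conv_c: int, n_c: int, conv_t: int, n_t: int,
                        alpha: float = 0.05) -> dict:
    """Estimate the ATE on a conversion rate with a normal-approximation CI."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    ate = p_t - p_c  # absolute difference, in proportion units
    # Standard error of the difference of two independent proportions.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "ate": ate,
        "relative_lift": ate / p_c,
        "ci": (ate - z * se, ate + z * se),
    }


# Hypothetical counts: 1,000 / 20,000 = 5.0% and 1,100 / 20,000 = 5.5%.
result = ate_for_proportions(conv_c=1_000, n_c=20_000, conv_t=1_100, n_t=20_000)
```

Reporting both the absolute difference (percentage points) and the relative lift avoids the common confusion between "0.5%" and "10%" when describing the same result.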

Sample Size and Power

Before running any experiment, calculate required sample size:

Key parameters:

  • α (significance level): Typically 0.05 (5% false positive rate)
  • β (Type II error rate): Typically 0.20 (power = 1 − β = 80%)
  • Minimum Detectable Effect (MDE): Smallest effect worth detecting
  • Baseline rate: Current metric value
  • Variance: How much the metric varies

Rule of thumb: Required sample size scales with 1/δ², so detecting a 1% relative change requires ~100x the sample of detecting a 10% relative change.

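Sample size formula (simplified, per group, for α = 0.05 and 80% power):
n ≈ 16 × σ² / δ²

Where:
- n = required sample size in each group
- σ² = variance of the metric (for a conversion rate p, σ² = p(1 − p))
- δ = minimum detectable effect, as an absolute difference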
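
The constant 16 comes from the more general two-sample expression n = 2 × (z₁₋α/₂ + z₁₋β)² × σ² / δ², which evaluates to roughly 15.7 × σ² / δ² at α = 0.05 and 80% power. A minimal sketch of that calculation, assuming a two-sided test with equal group sizes and a hypothetical 5% baseline conversion rate (the function name `sample_size_per_group` is illustrative):

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(variance: float, mde_abs: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


baseline = 0.05                       # hypothetical baseline conversion rate
variance = baseline * (1 - baseline)  # Bernoulli variance p(1 - p)

n_10pct = sample_size_per_group(variance, mde_abs=baseline * 0.10)  # 10% relative lift
n_1pct = sample_size_per_group(variance, mde_abs=baseline * 0.01)   # 1% relative lift
ratio = n_1pct / n_10pct
```

Since δ appears squared in the denominator, shrinking the detectable effect tenfold raises the required sample about a hundredfold, which is what `ratio` shows here.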

Interview tip: Always ask about expected effect size. If stakeholders expect a 1% lift but you can only detect 5%, the experiment can't answer their question.

Guardrail Metrics

Beyond your primary metric, monitor guardrails:

| Type | Example | Purpose |
| --- | --- | --- |
| Business guardrails | Revenue, customer support tickets | Ensure no major harm |
| Engagement guardrails | Session length, pages per visit | Catch unintended effects |
| Technical guardrails | Latency, error rates | Ensure implementation quality |

Example scenario: A new checkout flow increases conversion by 2% but increases customer support tickets by 50%. The guardrail metric (support tickets) suggests investigating before launch.
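
A guardrail check like this one can be expressed as a simple threshold comparison. The sketch below is illustrative only: the metric names, values, and tolerances are made up, it assumes an increase in each guardrail metric is bad, and a real check would also test whether the change is statistically significant.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    control: float
    treatment: float
    max_relative_increase: float  # hypothetical tolerance, e.g. 0.10 = +10%

    @property
    def relative_change(self) -> float:
        """Relative change of treatment versus control."""
        return (self.treatment - self.control) / self.control

    @property
    def breached(self) -> bool:
        """True when the metric rose by more than the allowed tolerance."""
        return self.relative_change > self.max_relative_increase


# Made-up numbers: support tickets rise 50%, latency rises 1.25%.
guardrails = [
    Guardrail("support_tickets_per_1k_orders", control=10.0, treatment=15.0,
              max_relative_increase=0.10),
    Guardrail("p95_latency_ms", control=800.0, treatment=810.0,
              max_relative_increase=0.05),
]

breached = [g.name for g in guardrails if g.breached]
```

A non-empty `breached` list would be the signal to investigate before launch, even when the primary metric improved.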

Every experiment should have at least 2-3 guardrail metrics. They protect against optimizing one thing while breaking another.
