A/B Testing & Experimentation
Experimentation Fundamentals
A/B testing is how modern tech companies make data-driven decisions. Interviewers expect you to understand not just the statistics, but the practical design considerations.
What Makes a Good Experiment
A well-designed experiment has these components:
| Component | Description | Why It Matters |
|---|---|---|
| Randomization | Users randomly assigned to groups | Eliminates selection bias |
| Control Group | Baseline experience | Provides comparison point |
| Treatment Group | New experience | Tests the change |
| Clear Hypothesis | Specific prediction | Focuses analysis |
| Primary Metric | One key outcome | Prevents cherry-picking |
Randomization: The Foundation
Random assignment ensures groups are comparable:
Good randomization:
- User ID hash (deterministic, reproducible)
- Each user always sees same variant
- Balanced group sizes
Bad randomization:
- Time-based (Monday vs Tuesday users differ)
- Geographic (regions have different behaviors)
- Self-selection (users choose their experience)
Interview question: "How would you randomize users for a checkout flow experiment?"
Good answer: "I'd use a hash of the user ID modulo 100. Users with hash 0-49 see control, 50-99 see treatment. This ensures each user consistently sees the same variant across sessions, and the split is balanced."
The Experiment Unit
Choose what you're randomizing carefully:
| Unit | Use When | Considerations |
|---|---|---|
| User | Most experiments | Most common, straightforward |
| Session | Short-term tests | Same user may see different variants |
| Page view | Very granular | High noise, hard to interpret |
| Device | Users can't be identified across devices | One person with several devices may see both variants |
| Cluster (geo, team) | Network effects expected | Fewer units, less power |
Network effects example: If testing a messaging feature, randomizing by user doesn't work - treated users message control users, so the control group's experience is contaminated (interference). Randomize by geography or social cluster instead, as in the sketch below.
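The same hashing idea can be applied at the cluster level; here is a hypothetical sketch (the cluster IDs and experiment name are made up). Note that with cluster randomization, the analysis should also treat the cluster, not the user, as the unit.

```python
import hashlib

def assign_cluster_variant(cluster_id: str, experiment: str) -> str:
    """Assign an entire cluster (e.g., a metro area) to one variant so that
    users who interact with each other share the same experience."""
    digest = hashlib.md5(f"{experiment}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Every user in this region sees the same variant of the messaging feature
print(assign_cluster_variant("seattle_metro", "group_messaging"))
```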
Treatment Effects
What you're trying to measure:
Average Treatment Effect (ATE):
ATE = E[Y(treatment)] - E[Y(control)]
The average difference in outcome between treatment and control groups.
Example interpretation:
- Control conversion: 5.0%
- Treatment conversion: 5.5%
- ATE: +0.5 percentage points (10% relative lift)
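The same arithmetic as a quick sketch, using the rates from the example above:

```python
control_rate = 0.050      # 5.0% control conversion
treatment_rate = 0.055    # 5.5% treatment conversion

ate = treatment_rate - control_rate        # +0.005 = +0.5 percentage points
relative_lift = ate / control_rate         # 0.10  = 10% relative lift
print(f"ATE: {ate * 100:.1f} pp, relative lift: {relative_lift:.0%}")
```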
Sample Size and Power
Before running any experiment, calculate required sample size:
Key parameters:
- α (significance level): Typically 0.05 (5% false positive rate)
- β (Type II error): Typically 0.20 (80% power)
- Minimum Detectable Effect (MDE): Smallest effect worth detecting
- Baseline rate: Current metric value
- Variance: How much the metric varies
Rule of thumb: because required sample size scales with 1/δ², detecting a 1% relative change requires ~100x the sample of detecting a 10% relative change.
Sample size formula (simplified, per group):
n ≈ 16 × σ² / δ²
Where:
- σ² = variance of the metric
- δ = minimum detectable effect
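A small sketch applying this approximation to a conversion-rate metric, where σ² = p(1 − p) for baseline rate p (the function name and numbers are illustrative):

```python
def sample_size_per_group(baseline_rate: float, mde_relative: float) -> int:
    """Approximate per-group sample size using n ≈ 16 * σ² / δ²
    (roughly α = 0.05, power = 0.80)."""
    p = baseline_rate
    variance = p * (1 - p)            # Bernoulli variance of a conversion metric
    delta = p * mde_relative          # absolute minimum detectable effect
    return round(16 * variance / delta ** 2)

# 5% baseline, 10% relative MDE (detect 5.0% -> 5.5%): ~30,400 users per group
print(sample_size_per_group(0.05, 0.10))
# 1% relative MDE needs ~100x that: ~3,040,000 users per group
print(sample_size_per_group(0.05, 0.01))
```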
Interview tip: Always ask about expected effect size. If stakeholders expect a 1% lift but you can only detect 5%, the experiment can't answer their question.
Guardrail Metrics
Beyond your primary metric, monitor guardrails:
| Type | Example | Purpose |
|---|---|---|
| Business guardrails | Revenue, customer support tickets | Ensure no major harm |
| Engagement guardrails | Session length, pages per visit | Catch unintended effects |
| Technical guardrails | Latency, error rates | Ensure implementation quality |
Example scenario: A new checkout flow increases conversion by 2% but increases customer support tickets by 50%. The guardrail metric (support tickets) suggests investigating before launch.
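A sketch of how that check might look, using a two-proportion z-test from statsmodels (all counts are made up to roughly match the scenario):

```python
from statsmodels.stats.proportion import proportions_ztest

n = 800_000  # users per group (hypothetical)

# Primary metric: conversions (treatment ~2% relative lift: 5.1% vs 5.0%)
_, p_conversion = proportions_ztest(count=[40_800, 40_000], nobs=[n, n])

# Guardrail: support tickets (treatment up ~50%: 1.5% vs 1.0%)
_, p_tickets = proportions_ztest(count=[12_000, 8_000], nobs=[n, n])

print(f"conversion p-value: {p_conversion:.3f}")
print(f"support-ticket p-value: {p_tickets:.3g}")
# A significant regression on the guardrail argues for investigating before
# launch, even though the primary metric moved in the right direction.
```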
Every experiment should have at least 2-3 guardrail metrics. They protect against optimizing one thing while breaking another.