A/B Testing & Experimentation

Analyzing Experiment Results

When the experiment ends, the analysis begins. Your job is to extract truth from data while avoiding common interpretation mistakes.

The Analysis Framework

Follow this systematic approach:

1. Validity checks → Is the experiment trustworthy?
2. Primary metric → What does it show?
3. Statistical significance → Is it real?
4. Practical significance → Does it matter?
5. Segment analysis → Who benefits?
6. Guardrails → Any red flags?
7. Decision → Launch, iterate, or kill?

Validity Checks First

Before looking at results, verify the experiment ran correctly:

Check | What to Look For | Red Flag
Sample ratio | 50/50 split achieved? | >1% deviation
Pre-experiment metrics | Groups balanced? | Different baselines
Implementation | Feature deployed correctly? | Engineering bugs
Duration | Full weeks completed? | Partial weeks

Interview insight: "I always check sample ratio mismatch (SRM) first. If my 50/50 split ended up 52/48 on a large sample, something went wrong with randomization or logging and the results aren't trustworthy."
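Whether a given split counts as a mismatch depends on sample size, so SRM is usually checked with a chi-square goodness-of-fit test rather than by eye. The sketch below is one minimal way to do that with the standard library; the user counts and the 0.001 alert threshold are illustrative assumptions, not values from a real experiment.

```python
import math

def srm_p_value(control_n: int, treatment_n: int, expected_ratio: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = control_n + treatment_n
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

SRM_THRESHOLD = 0.001  # assumed alert level; strict because SRM is checked on every test

# Hypothetical counts: the same 52/48 split at two very different sample sizes.
small_sample = srm_p_value(52, 48)           # 100 users
large_sample = srm_p_value(52_000, 48_000)   # 100,000 users

print("small sample flagged:", small_sample < SRM_THRESHOLD)
print("large sample flagged:", large_sample < SRM_THRESHOLD)
```

With 100 users, a 52/48 split is ordinary noise; with 100,000 users, the same split is a strong sign that assignment or logging is broken.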

Statistical vs Practical Significance

Two separate questions:

Statistical significance: Is the observed effect unlikely to be random noise?

  • Answer: p < 0.05 (typically)

Practical significance: Is the effect large enough to matter?

  • Answer: Depends on business context

Example:
- p = 0.01 (highly significant)
- Effect: +0.01 percentage points conversion (5.00% → 5.01%)
- 95% CI: [0.002, 0.018] percentage points

Statistically significant, but is +0.01% worth the engineering maintenance cost?
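Both questions can be answered from the same calculation. Here is a minimal sketch of a two-proportion z-test that returns the absolute difference, the p-value, and a confidence interval, assuming a simple conversion metric and samples large enough for the normal approximation. The function name and the conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Return (absolute difference, two-sided p-value, confidence interval)."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c

    # Hypothesis test: pooled standard error under the null of no difference.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled standard error around the observed difference.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, p_value, (diff - z_crit * se, diff + z_crit * se)

# Hypothetical counts: 10,000 users per arm, 500 vs 580 conversions.
diff, p_value, (ci_low, ci_high) = two_proportion_test(500, 10_000, 580, 10_000)
print(f"diff={diff:.4f}, p={p_value:.4f}, 95% CI=[{ci_low:.4f}, {ci_high:.4f}]")
```

The p-value answers the statistical question; the interval bounds, compared against what the business needs, answer the practical one.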

Interview question: "We found a significant result with p=0.02, but the lift is only 0.5%. Should we launch?"

Good answer: "I'd calculate the business impact. If 0.5% lift means $1M annual revenue, probably yes. If it means $10K but requires ongoing maintenance, maybe not. I'd also check if the confidence interval includes effects large enough to be clearly worthwhile."
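The arithmetic behind that answer is short. This sketch assumes the lift is relative, applies to all revenue, and holds for a full year; the baseline revenues and the maintenance cost are hypothetical figures chosen to reproduce the $1M and $10K cases.

```python
def annual_impact(baseline_annual_revenue: float, relative_lift: float) -> float:
    """Extra annual revenue if the relative lift holds for a full year."""
    return baseline_annual_revenue * relative_lift

def worth_launching(baseline_annual_revenue: float, relative_lift: float,
                    annual_maintenance_cost: float) -> bool:
    """True when the projected gain exceeds the assumed cost of keeping the feature."""
    return annual_impact(baseline_annual_revenue, relative_lift) > annual_maintenance_cost

# The same 0.5% relative lift at two hypothetical businesses.
large_business = annual_impact(200_000_000, 0.005)  # about $1M per year
small_business = annual_impact(2_000_000, 0.005)    # about $10K per year

# Assumed maintenance cost of $50K per year.
print(worth_launching(200_000_000, 0.005, 50_000))
print(worth_launching(2_000_000, 0.005, 50_000))
```

Running the same calculation on the lower and upper bounds of the confidence interval gives a pessimistic and an optimistic case to set beside the point estimate.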

Confidence Intervals Over p-Values

Confidence intervals provide more information:

Scenario | p-value | 95% CI | Interpretation
A | 0.02 | [0.5%, 3.0%] | Significant, effect likely 0.5-3%
B | 0.02 | [0.01%, 0.1%] | Significant, but tiny effect
C | 0.15 | [-0.5%, 2.5%] | Not significant, but could be meaningful

Pro tip: If the 95% CI includes zero, the result is not significant at the 0.05 level. The width of the CI shows your precision.
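The three scenarios can be read mechanically by comparing each interval with zero and with the smallest effect the business cares about. The sketch below shows one way to encode that reading; the 0.5% threshold and the label wording are assumptions for illustration.

```python
def interpret_ci(ci_low: float, ci_high: float, min_meaningful: float) -> str:
    """Classify a confidence interval against zero and a minimum meaningful effect."""
    if ci_high < 0:
        return "significant negative effect"
    if ci_low > 0:
        if ci_low >= min_meaningful:
            return "significant and meaningful"
        if ci_high < min_meaningful:
            return "significant but too small to matter"
        return "significant, size uncertain"
    # The interval includes zero, so the result is not significant.
    if ci_high >= min_meaningful:
        return "not significant, meaningful effect not ruled out"
    return "not significant, any positive effect is small"

MIN_MEANINGFUL = 0.5  # assumed business threshold, in the same units as the CI

scenarios = {"A": (0.5, 3.0), "B": (0.01, 0.1), "C": (-0.5, 2.5)}
for name, (low, high) in scenarios.items():
    print(name, "->", interpret_ci(low, high, MIN_MEANINGFUL))
```

Scenarios A and B share a p-value but get different labels, which is the information the p-value alone hides.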

Segment Analysis

Look beyond the aggregate:

Key segments to always check:

  • Device (mobile vs desktop)
  • New vs returning users
  • Geography (if relevant)
  • User tenure/maturity

Example finding:

Overall: +2% conversion (significant)

By device:
- Mobile: +5% conversion (significant)
- Desktop: -1% conversion (not significant)

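Insight: The feature works well on mobile; the desktop result is inconclusive but may indicate harm.
Consider a mobile-only launch, and treat segment findings as hypotheses to confirm, since every extra segment is another chance for a false positive.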
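Each segment you test is another comparison, so the significance threshold should tighten as segments are added. The sketch below runs a two-proportion z-test per segment and applies a Bonferroni correction, which is one simple and conservative option rather than the only valid one. The segment counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def p_value_two_proportions(conv_c, n_c, conv_t, n_t):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data: (control conversions, control users, treatment conversions, treatment users)
segments = {
    "mobile": (1000, 20_000, 1120, 20_000),
    "desktop": (900, 15_000, 880, 15_000),
}

alpha = 0.05
adjusted_alpha = alpha / len(segments)  # Bonferroni: split alpha across the comparisons

results = {}
for name, (conv_c, n_c, conv_t, n_t) in segments.items():
    relative_lift = (conv_t / n_t) / (conv_c / n_c) - 1
    p = p_value_two_proportions(conv_c, n_c, conv_t, n_t)
    results[name] = {"lift": relative_lift, "p": p, "significant": p < adjusted_alpha}
    print(f"{name}: lift={relative_lift:+.1%}, p={p:.3f}, significant={p < adjusted_alpha}")
```

Segments chosen before the experiment deserve more trust than segments discovered while browsing the results.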

Interpreting Null Results

"Not significant" doesn't mean "no effect":

Possible interpretations:

  1. No true effect exists
  2. Effect exists but is smaller than the test was designed to detect
  3. Effect exists but the test was underpowered (too few users or too noisy a metric)
  4. Effect exists in segments we didn't analyze

How to report:

Good: "We observed a +0.8% lift, but this was not statistically
significant (p=0.23, 95% CI: [-0.5%, 2.1%]). With our sample size,
we could only reliably detect effects ≥2%. We cannot conclude
whether the feature has a small positive effect or no effect."

Bad: "The feature doesn't work."
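The "could only reliably detect" figure in the good report can be approximated from the confidence interval itself. This is a rough sketch, assuming a normal approximation, a two-sided test at alpha = 0.05, and 80% power; the function name is hypothetical.

```python
from statistics import NormalDist

def minimum_detectable_effect(ci_low: float, ci_high: float,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate MDE implied by a reported two-sided (1 - alpha) confidence interval."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_power = normal.inv_cdf(power)
    # The CI half-width equals z_alpha * standard error, so recover the standard error.
    standard_error = (ci_high - ci_low) / (2 * z_alpha)
    return (z_alpha + z_power) * standard_error

# Interval from the example report: 95% CI of [-0.5%, 2.1%].
mde = minimum_detectable_effect(-0.5, 2.1)
print(f"Approximate minimum detectable effect: {mde:.2f}%")
```

For that interval the result lands a little under 2%, in line with the report's statement that only effects of roughly that size could be detected reliably.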

Making the Decision

Combine all evidence:

Signal | Launch | Don't Launch
Primary metric | Significant lift | Not significant or negative
Practical size | Business-meaningful | Too small to matter
Guardrails | All healthy | Any red flags
Segments | Consistent or positive | Harms key segments
Confidence | Narrow CI, clear result | Wide CI, uncertain

When it's ambiguous:

  • Run longer if more data would help
  • Consider limited launch (one segment)
  • Iterate on the feature and retest
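The decision table and the ambiguous case can be written down as a small rule, which helps keep decisions consistent across experiments. This is a simplified sketch, not a replacement for judgment; the field names, the positive `min_meaningful` threshold, and the exact rule order are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    ci_low: float              # lower bound of the 95% CI on the primary metric
    ci_high: float             # upper bound of the 95% CI
    min_meaningful: float      # smallest lift worth launching for (assumed positive)
    guardrails_healthy: bool
    harms_key_segment: bool

def recommend(e: Evidence) -> str:
    """Map the combined evidence to launch, don't launch, or ambiguous."""
    if not e.guardrails_healthy or e.harms_key_segment:
        return "don't launch"
    if e.ci_high < e.min_meaningful:
        # Even the optimistic bound is too small to matter (or the effect is negative).
        return "don't launch"
    if e.ci_low >= e.min_meaningful:
        # Even the pessimistic bound is business-meaningful.
        return "launch"
    return "ambiguous"  # run longer, try a limited launch, or iterate and retest

# Hypothetical experiment: clear lift, healthy guardrails, no harmed segments.
print(recommend(Evidence(0.5, 3.0, 0.5, True, False)))
```

An "ambiguous" result is where the three options in the list above apply.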

Interview framework: "For this decision, I'd summarize: The treatment showed a [X%] lift in [primary metric] (p=[value], 95% CI: [range]). Guardrail metrics [were/weren't] impacted. Segment analysis revealed [findings]. My recommendation is [launch/don't launch/iterate] because [reasoning]."

Always tie statistical findings back to business impact. Numbers alone don't make decisions - context does.
