A/B Testing & Experimentation
Analyzing Experiment Results
When the experiment ends, the analysis begins. Your job is to extract truth from data while avoiding common interpretation mistakes.
The Analysis Framework
Follow this systematic approach:
1. Validity checks → Is the experiment trustworthy?
2. Primary metric → What does it show?
3. Statistical significance → Is it real?
4. Practical significance → Does it matter?
5. Segment analysis → Who benefits?
6. Guardrails → Any red flags?
7. Decision → Launch, iterate, or kill?
Validity Checks First
Before looking at results, verify the experiment ran correctly:
| Check | What to Look For | Red Flag |
|---|---|---|
| Sample ratio | Configured split (e.g., 50/50) achieved? | Statistically significant deviation |
| Pre-experiment metrics | Groups balanced? | Different baselines |
| Implementation | Feature deployed correctly? | Engineering bugs |
| Duration | Full weeks completed? | Partial weeks |
Interview insight: "I always check sample ratio mismatch (SRM) first. If my 50/50 split ended up 52/48, something went wrong with randomization and the results aren't trustworthy."
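One common way to run that SRM check is a chi-square goodness-of-fit test against the configured split. A minimal sketch, with made-up assignment counts; the strict 0.001 alpha is the conventional SRM threshold, chosen because a false SRM alarm is cheap but a bad experiment is not:

```python
from scipy.stats import chisquare

# Hypothetical assignment counts pulled from your logs.
control_users = 50_912
treatment_users = 49_088  # roughly a 50.9/49.1 split

observed = [control_users, treatment_users]
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # the 50/50 split we configured

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# A very small p-value means the split deviates more than chance allows.
if p_value < 0.001:
    print(f"SRM detected (p={p_value:.2e}): debug randomization before reading results.")
else:
    print(f"No SRM detected (p={p_value:.3f}).")
```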
Statistical vs Practical Significance
Two separate questions:
Statistical significance: Is the effect real (not random noise)?
- Answer: p < 0.05 (typically)
Practical significance: Is the effect large enough to matter?
- Answer: Depends on business context
Example:
- p = 0.01 (highly significant)
- Effect: +0.01 percentage points of conversion (5.00% → 5.01%)
- 95% CI: [0.005%, 0.015%]
Statistically significant, but is a +0.01-point lift worth the ongoing engineering maintenance cost?
Interview question: "We found a significant result with p=0.02, but the lift is only 0.5%. Should we launch?"
Good answer: "I'd calculate the business impact. If 0.5% lift means $1M annual revenue, probably yes. If it means $10K but requires ongoing maintenance, maybe not. I'd also check if the confidence interval includes effects large enough to be clearly worthwhile."
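One way to make both questions concrete is to compute them side by side. A minimal sketch with a hand-rolled two-proportion z-test; the counts, annual traffic, and revenue-per-conversion figures below are all assumed for illustration:

```python
import math
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, 2 * (1 - norm.cdf(abs(z)))

# Statistical question: hypothetical counts, 20M users per arm.
lift, p = two_proportion_ztest(1_000_000, 20_000_000, 1_005_000, 20_000_000)
print(f"Lift: {lift:+.3%} absolute, p = {p:.4f}")  # tiny lift, highly significant

# Practical question: translate the lift into dollars before deciding.
annual_visitors = 50_000_000       # assumed
revenue_per_conversion = 20.0      # assumed
print(f"~${lift * annual_visitors * revenue_per_conversion:,.0f} incremental revenue/year")
```

With a large enough sample, even a 0.025-point lift is highly significant; the dollar line is what tells you whether to care.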
Confidence Intervals Over p-Values
Confidence intervals provide more information:
| Scenario | p-value | 95% CI | Interpretation |
|---|---|---|---|
| A | 0.02 | [0.5%, 3.0%] | Significant, effect likely 0.5-3% |
| B | 0.02 | [0.01%, 0.1%] | Significant, but tiny effect |
| C | 0.15 | [-0.5%, 2.5%] | Not significant, but could be meaningful |
Pro tip: If the 95% CI includes zero, the result is not significant at the 5% level. The width of the CI shows how precisely you've measured the effect.
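A sketch of computing such an interval yourself, using the standard Wald interval for a difference in proportions (the counts are hypothetical):

```python
import math

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% Wald CI for the lift in conversion rate (treatment - control)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error for the difference in proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = lift_ci(5_000, 100_000, 5_300, 100_000)
print(f"95% CI for the lift: [{low:.2%}, {high:.2%}]")
# Excludes zero -> significant at the 5% level; the width shows precision.
```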
Segment Analysis
Look beyond the aggregate:
Key segments to always check:
- Device (mobile vs desktop)
- New vs returning users
- Geography (if relevant)
- User tenure/maturity
Example finding:
Overall: +2% conversion (significant)
By device:
- Mobile: +5% conversion (significant)
- Desktop: -1% conversion (not significant)
Insight: The feature works well on mobile but may hurt desktop.
Consider a mobile-only launch. One caveat: slicing results by many segments inflates the false-positive rate, so treat a surprising segment finding as a hypothesis to confirm in a follow-up test, not a conclusion.
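A sketch of what that per-segment readout might look like in code, using statsmodels' two-proportion z-test on hypothetical per-device counts:

```python
from statsmodels.stats.proportion import proportions_ztest

# segment: (treatment conversions, treatment users, control conversions, control users)
segments = {
    "mobile":  (31_500, 600_000, 30_000, 600_000),
    "desktop": (19_800, 400_000, 20_000, 400_000),
}

for name, (t_conv, t_n, c_conv, c_n) in segments.items():
    lift = t_conv / t_n - c_conv / c_n
    _, p = proportions_ztest(count=[t_conv, c_conv], nobs=[t_n, c_n])
    print(f"{name}: lift = {lift:+.3%} absolute, p = {p:.3f}")
```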
Interpreting Null Results
"Not significant" doesn't mean "no effect":
Possible interpretations:
- No true effect exists
- A true effect exists, but it is smaller than your minimum detectable effect
- The experiment was underpowered (too little traffic or too short a run)
- The effect exists only in segments you didn't analyze
How to report:
Good: "We observed a +0.8% lift, but this was not statistically
significant (p=0.23, 95% CI: [-0.5%, 2.1%]). With our sample size,
we could only reliably detect effects ≥2%. We cannot conclude
whether the feature has a small positive effect or no effect."
Bad: "The feature doesn't work."
Making the Decision
Combine all evidence:
| Signal | Launch | Don't Launch |
|---|---|---|
| Primary metric | Significant lift | Not significant or negative |
| Practical size | Business-meaningful | Too small to matter |
| Guardrails | All healthy | Any red flags |
| Segments | Consistent or positive | Harms key segments |
| Confidence | Narrow CI, clear result | Wide CI, uncertain |
When it's ambiguous:
- Run longer if more data would help
- Consider limited launch (one segment)
- Iterate on the feature and retest
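To make the trade-offs explicit, the decision table and the ambiguity options above could be encoded as a first-pass rule, as in this sketch; the field names, thresholds, and rule are illustrative, and the final call still needs human judgment:

```python
def recommend(significant: bool, meaningful: bool, guardrails_ok: bool,
              segments_ok: bool, ci_narrow: bool) -> str:
    """First-pass launch recommendation mirroring the decision table above."""
    if not guardrails_ok:
        return "don't launch: guardrail red flag"
    if significant and meaningful and segments_ok and ci_narrow:
        return "launch"
    if significant and meaningful:
        return "ambiguous: run longer, launch to one segment, or iterate"
    return "don't launch: effect absent, too small, or too uncertain"

print(recommend(significant=True, meaningful=True, guardrails_ok=True,
                segments_ok=False, ci_narrow=True))
# -> ambiguous: run longer, launch to one segment, or iterate
```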
Interview framework: "For this decision, I'd summarize: The treatment showed a [X%] lift in [primary metric] (p=[value], 95% CI: [range]). Guardrail metrics [were/weren't] impacted. Segment analysis revealed [findings]. My recommendation is [launch/don't launch/iterate] because [reasoning]."
Always tie statistical findings back to business impact. Numbers alone don't make decisions; context does.