Hypothesis Testing

Hypothesis testing is the foundation of data-driven decision making. Interviewers expect you to understand not just the mechanics, but the interpretation and limitations.

The Hypothesis Testing Framework

Every hypothesis test follows this structure:

  1. State hypotheses:
     • H₀ (Null): The default assumption (no effect, no difference)
     • H₁ (Alternative): What we're testing for
  2. Choose significance level (α): Typically 0.05 or 0.01
  3. Calculate test statistic: Based on sample data
  4. Make decision: Reject H₀ if p-value < α
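
To make the framework concrete, here is a minimal sketch of the four steps as a one-sample t-test in Python; the data, the 200 ms reference value, and the seed are invented purely for illustration:

```python
import numpy as np
from scipy import stats

# Illustrative data: 40 simulated response times (ms)
rng = np.random.default_rng(42)
sample = rng.normal(loc=205, scale=20, size=40)

# 1. State hypotheses: H0: mean = 200, H1: mean != 200
mu_0 = 200

# 2. Choose significance level
alpha = 0.05

# 3. Calculate test statistic (and its p-value) from the sample
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0)

# 4. Make decision
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.3f} -> {decision}")
```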

Type I and Type II Errors

| Decision          | H₀ True          | H₀ False          |
|-------------------|------------------|-------------------|
| Reject H₀         | Type I Error (α) | Correct           |
| Fail to reject H₀ | Correct          | Type II Error (β) |

Type I Error (False Positive):

  • Rejecting H₀ when it's true
  • Probability = α (significance level)
  • "Crying wolf" - saying there's an effect when there isn't

Type II Error (False Negative):

  • Failing to reject H₀ when it's false
  • Probability = β
  • Missing a real effect

Power = 1 - β: The probability of correctly detecting an effect when it exists.
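
One way to build intuition for power is simulation: generate many experiments in which a real effect exists, test each one, and count how often the test detects it. A rough sketch (the effect size, group size, and number of simulations below are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_per_group = 50        # sample size per group
true_effect = 0.5       # real difference in means (in standard-deviation units)

n_sims = 5_000
detected = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, size=n_per_group)
    treatment = rng.normal(true_effect, 1.0, size=n_per_group)  # H0 is false here
    _, p = stats.ttest_ind(treatment, control)
    detected += p < alpha

# Fraction of simulated experiments where the real effect was detected ≈ power (1 - β)
print(f"Estimated power ≈ {detected / n_sims:.2f}")
```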

Interview question: "A/B test shows p=0.03. Is the new feature better?"

Good answer: "At α=0.05, we'd reject the null hypothesis that there's no difference. However, statistical significance doesn't mean practical significance. I'd also look at effect size and confidence intervals before recommending a launch."

p-Values: What They Really Mean

p-value = Probability of observing data this extreme (or more) IF the null hypothesis is true

Common misinterpretations to avoid:

| Wrong                                   | Correct                                                              |
|-----------------------------------------|----------------------------------------------------------------------|
| p=0.03 means 3% chance H₀ is true       | p=0.03 means 3% chance of data this extreme (or more) if H₀ is true  |
| p=0.03 means 97% chance effect is real  | The effect size is a separate question                               |
| p=0.06 means "almost significant"       | Either significant or not at the chosen α; there is no "almost"      |
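
A quick simulation makes the definition tangible: run many A/A tests where H₀ is true by construction and look at how often small p-values occur. The group sizes and seed below are arbitrary:

```python
import numpy as np
from scipy import stats

# A/A tests: both groups are drawn from the SAME distribution, so H0 is true every time.
rng = np.random.default_rng(1)
p_values = []
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, size=100)
    b = rng.normal(0.0, 1.0, size=100)
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)
p_values = np.array(p_values)

# Under H0, p-values are roughly uniform: about 3% of tests give p <= 0.03.
# That frequency is what a p-value quantifies -- not the probability that H0 is true.
print(f"Share of A/A tests with p <= 0.03: {np.mean(p_values <= 0.03):.3f}")
print(f"Share of A/A tests with p <= 0.05: {np.mean(p_values <= 0.05):.3f}")
```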

Interview trap: "We got p=0.06. Should we collect more data?"

Answer: Collecting more data just because p is close to the threshold is a form of p-hacking (optional stopping). Decide the sample size in advance with a power analysis; extending the experiment whenever the result looks "almost significant" inflates the false positive rate.
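
To fix the sample size up front, the standard tool is a power analysis. A sketch using statsmodels; the baseline rate and minimum detectable lift below are illustrative assumptions, not values from the question above:

```python
# Sample-size planning for a two-proportion A/B test (illustrative numbers)
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.05    # assumed control conversion rate
target_rate = 0.055     # smallest treatment rate worth detecting (+0.5 pp absolute)

effect_size = proportion_effectsize(target_rate, baseline_rate)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # Type I error rate
    power=0.80,              # desired 1 - beta
    ratio=1.0,               # equal group sizes
    alternative="two-sided",
)
print(f"Required users per group: {n_per_group:,.0f}")
```

With these particular assumptions the required size comes out well above 10,000 users per group, which foreshadows why the worked A/B example at the end of this section fails to reach significance.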

Confidence Intervals

A 95% confidence interval means: If we repeated this experiment many times, 95% of the intervals would contain the true parameter.

It does NOT mean: There's a 95% probability the true value is in this interval.

Construction for mean (large sample):

CI = x̄ ± z × (s / √n)

For 95% CI: z = 1.96
For 99% CI: z = 2.58
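
A minimal sketch of this construction with NumPy/SciPy; the sample below is simulated just to have numbers to plug in:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=10.0, scale=2.0, size=200)  # illustrative data

x_bar = sample.mean()
s = sample.std(ddof=1)               # sample standard deviation
n = sample.size

z = stats.norm.ppf(0.975)            # 1.96 for a 95% CI
margin = z * s / np.sqrt(n)
print(f"95% CI: [{x_bar - margin:.2f}, {x_bar + margin:.2f}]")
```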

Interview insight: Confidence intervals are more informative than p-values because they show effect size and uncertainty.

Scenario A: p=0.04, 95% CI = [0.1%, 2.0%]
Scenario B: p=0.04, 95% CI = [5.0%, 25.0%]

Both are "significant," but Scenario B shows a practically meaningful effect.

Common Tests Quick Reference

| Test                | Use Case                      | Assumptions                   |
|---------------------|-------------------------------|-------------------------------|
| t-test (one sample) | Mean vs. known value          | Normal distribution or n > 30 |
| t-test (two sample) | Compare two group means       | Independence, normality       |
| Paired t-test       | Before/after on same subjects | Paired observations           |
| Chi-square          | Categorical independence      | Expected count ≥ 5 per cell   |
| ANOVA               | Compare 3+ group means        | Normality, equal variance     |
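
For reference, each row of the table maps onto a one-line call in scipy.stats; the arrays below are placeholders just to show the signatures:

```python
import numpy as np
from scipy import stats

# Placeholder data, only to illustrate the call signatures
a = np.array([5.1, 4.9, 5.3, 5.0, 5.2])
b = np.array([4.8, 4.7, 5.0, 4.9, 4.6])
c = np.array([5.4, 5.5, 5.2, 5.6, 5.3])
counts = np.array([[30, 70],
                   [45, 55]])        # 2x2 contingency table of observed counts

stats.ttest_1samp(a, popmean=5.0)    # one-sample t-test: mean vs. known value
stats.ttest_ind(a, b)                # two-sample t-test: compare two group means
stats.ttest_rel(a, b)                # paired t-test: before/after on the same subjects
stats.chi2_contingency(counts)       # chi-square test of independence
stats.f_oneway(a, b, c)              # one-way ANOVA: compare 3+ group means
```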

Interview Problem: A/B Test Analysis

Scenario: Control has 5% conversion rate. Treatment has 5.5% conversion rate. Each group has 10,000 users. Is this significant?

Solution:

H₀: p_treatment = p_control
H₁: p_treatment ≠ p_control

Pooled proportion: p = (500 + 550) / 20000 = 0.0525

Standard error: SE = √[p(1-p)(1/n₁ + 1/n₂)]
                   = √[0.0525 × 0.9475 × (1/10000 + 1/10000)]
                   = √[0.0497 × 0.0002]
                   ≈ 0.00315

z = (0.055 - 0.05) / 0.00315 ≈ 1.59

p-value ≈ 0.11 (two-tailed)

Conclusion: Not significant at α=0.05. The 10% relative improvement (0.5 percentage points absolute) could be due to chance.
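
These numbers can be reproduced in a couple of lines; a sketch using statsmodels' two-proportion z-test, which by default uses the pooled proportion for the standard error as above:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conversions = np.array([550, 500])        # treatment, control conversions
group_sizes = np.array([10_000, 10_000])  # users per group

z_stat, p_value = proportions_ztest(conversions, group_sizes)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # expect roughly z ≈ 1.59, p ≈ 0.11
```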

Always connect statistical results to business implications. "Not significant" doesn't mean "no effect"; it means we can't distinguish the observed difference from noise.
