Statistics & Probability

Regression & Statistical Modeling

Regression is a workhorse of data science. Interviewers test both the mechanics and your understanding of when results are trustworthy.

Linear Regression Fundamentals

The model:

y = β₀ + β₁x₁ + β₂x₂ + ... + ε

Where:
- β₀ = intercept
- βᵢ = coefficient for feature i
- ε = error term (residual)

Interpretation: "A one-unit increase in x₁ is associated with a β₁ change in y, holding other variables constant."

Linear Regression Assumptions

Know these by heart - interviewers love asking about them:

| Assumption | Violation | Consequence |
| --- | --- | --- |
| Linearity | Curved relationship | Biased predictions |
| Independence | Autocorrelated errors | Underestimated standard errors |
| Homoscedasticity | Variance changes with X | Invalid p-values |
| Normality | Non-normal residuals | Unreliable confidence intervals |
| No multicollinearity | Correlated predictors | Unstable coefficients |

How to check:

  • Linearity: Residual vs fitted plot (should be random scatter)
  • Independence: Durbin-Watson test (DW ≈ 2 is good)
  • Homoscedasticity: Residual vs fitted plot (constant spread)
  • Normality: Q-Q plot, Shapiro-Wilk test
  • Multicollinearity: VIF > 10 indicates a problem
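
A sketch of those checks in code, assuming `model` and `X` come from the statsmodels fit in the previous snippet:

```python
# Diagnostic checks for a fitted OLS model (`model`) and its design matrix (`X`).
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

resid = model.resid
fitted = model.fittedvalues

# Independence: Durbin-Watson near 2 suggests no first-order autocorrelation
print("Durbin-Watson:", durbin_watson(resid))

# Normality: a small Shapiro-Wilk p-value suggests non-normal residuals
print("Shapiro-Wilk p:", stats.shapiro(resid).pvalue)

# Multicollinearity: VIF for each predictor (column 0 is the constant, so skip it)
for j in range(1, X.shape[1]):
    print(f"VIF for column {j}:", variance_inflation_factor(X, j))

# Linearity and homoscedasticity: the residual-vs-fitted plot should be a
# patternless band of roughly constant spread around zero
plt.scatter(fitted, resid, alpha=0.5)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```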

R-Squared Interpretation

R² = Proportion of variance in y explained by the model

| R² Value | Interpretation |
| --- | --- |
| 0.0 - 0.3 | Weak explanatory power |
| 0.3 - 0.6 | Moderate |
| 0.6 - 0.9 | Strong |
| > 0.9 | Very strong (or overfitting) |

Interview trap: "High R² means a good model"

Reality:

  • R² never decreases when you add predictors, even useless ones (use Adjusted R² instead)
  • High R² doesn't mean predictions are accurate
  • High R² doesn't prove causation
  • For some domains (social science), R² = 0.3 is excellent
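
A quick illustration of the first point, reusing the synthetic `X`, `y`, `n`, and `rng` from the earlier sketch: adding a pure-noise predictor still nudges R² upward, while Adjusted R² can fall because it penalizes the extra parameter.

```python
# Compare R^2 and Adjusted R^2 before and after adding a noise-only predictor.
import numpy as np
import statsmodels.api as sm

noise = rng.normal(size=n)            # predictor with no relation to y
X_more = np.column_stack([X, noise])  # X already contains the constant column

base = sm.OLS(y, X).fit()
bigger = sm.OLS(y, X_more).fit()

print("R2:     ", base.rsquared, "->", bigger.rsquared)          # never goes down
print("Adj. R2:", base.rsquared_adj, "->", bigger.rsquared_adj)  # can go down
```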

Multicollinearity

When predictors are highly correlated:

  • Individual coefficients become unreliable
  • Standard errors inflate
  • Signs may flip unexpectedly

Detection: Variance Inflation Factor (VIF)

VIFⱼ = 1 / (1 - R²ⱼ), where R²ⱼ is the R² from regressing predictor j on all the other predictors

VIF > 5: Moderate concern
VIF > 10: Serious problem
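
A sketch of the formula itself on made-up, deliberately collinear data: regress one predictor on the others to get R²ⱼ, then plug it in.

```python
# VIF by hand: b is almost a copy of a, so its VIF should be very large.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)  # nearly duplicates a -> collinear
c = rng.normal(size=n)

# R^2_j from regressing b on the remaining predictors (a and c)
others = sm.add_constant(np.column_stack([a, c]))
r2_j = sm.OLS(b, others).fit().rsquared
print("VIF(b) =", 1.0 / (1.0 - r2_j))  # far above 10 here -> serious problem
```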

Solutions:

  1. Remove one of the correlated variables
  2. Combine into a single variable (PCA)
  3. Use regularization (Ridge, Lasso)
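
A sketch of option 3, reusing the collinear `a`, `b`, `c` from the snippet above and comparing plain OLS against ridge regression in scikit-learn (the penalty strength `alpha=10.0` is illustrative only):

```python
# Ridge shrinks coefficients on near-duplicate predictors instead of letting
# them split arbitrarily between the two copies.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X_col = np.column_stack([a, b, c])
y2 = 1.0 * a + 0.5 * c + rng.normal(size=n)

ols = LinearRegression().fit(X_col, y2)
ridge = Ridge(alpha=10.0).fit(X_col, y2)

print("OLS coefficients:  ", ols.coef_)    # weight splits unstably between a and b
print("Ridge coefficients:", ridge.coef_)  # penalty spreads the weight more stably
```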

Logistic Regression

For binary outcomes (yes/no, click/no-click):

log(p / (1-p)) = β₀ + β₁x₁ + ...

Where p = probability of positive outcome

Coefficient interpretation: "A one-unit increase in x₁ is associated with a β₁ increase in the log-odds of the outcome."

More intuitive: Use odds ratios = exp(β₁)

  • OR = 1.5: 50% higher odds per unit increase
  • OR = 0.8: 20% lower odds per unit increase
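
A sketch on simulated click data, fitting a logistic regression with statsmodels and converting the coefficient to an odds ratio; the true coefficient of 0.4 is chosen so the odds ratio comes out near 1.5.

```python
# Fit log(p/(1-p)) = b0 + b1*x on synthetic click data and report exp(b1).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
logit_p = -1.0 + 0.4 * x               # true log-odds
p = 1 / (1 + np.exp(-logit_p))
clicked = rng.binomial(1, p)

X_logit = sm.add_constant(x)
fit = sm.Logit(clicked, X_logit).fit(disp=0)

print("log-odds coefficient:", fit.params[1])
print("odds ratio:          ", np.exp(fit.params[1]))  # ~1.5 -> ~50% higher odds per unit
```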

Interview Question Pattern

"You run a regression and find that ice cream sales predict crime rates (β = 0.7, p < 0.01). What do you conclude?"

Good answer: "This is a classic example of confounding. Both ice cream sales and crime rates are likely caused by a third variable - temperature. Hot weather increases both. I would:

  1. Control for temperature in the model
  2. Note that correlation doesn't imply causation
  3. Look at the residual relationship after controlling for confounders"
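
A simulated version of this scenario (all numbers are made up): temperature drives both series, and the ice-cream coefficient collapses once temperature enters the model.

```python
# Spurious association from a confounder: ice cream "predicts" crime only
# until temperature is controlled for.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 365
temp = rng.normal(25, 5, size=n)                    # daily temperature
ice_cream = 2 * temp + rng.normal(scale=5, size=n)  # driven by temperature
crime = 0.5 * temp + rng.normal(scale=3, size=n)    # also driven by temperature

naive = sm.OLS(crime, sm.add_constant(ice_cream)).fit()
adjusted = sm.OLS(crime, sm.add_constant(np.column_stack([ice_cream, temp]))).fit()

print("Naive ice-cream coefficient:   ", naive.params[1])    # spuriously positive
print("Adjusted ice-cream coefficient:", adjusted.params[1])  # near zero once temp is controlled
```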

Regression tells you about associations, not causes. Always be ready to explain why a relationship might be spurious.
