Statistics & Probability
Regression & Statistical Modeling
Regression is a workhorse of data science. Interviewers test both the mechanics and your understanding of when results are trustworthy.
Linear Regression Fundamentals
The model:
y = β₀ + β₁x₁ + β₂x₂ + ... + ε
Where:
- β₀ = intercept
- βᵢ = coefficient for feature i
- ε = error term (residual)
Interpretation: "A one-unit increase in x₁ is associated with a β₁ change in y, holding other variables constant."
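The fit above can be sketched in a few lines of numpy: build a design matrix with an intercept column and solve the least-squares problem. The data and true β values here are synthetic, chosen only to show that OLS recovers them:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Assumed true model for this illustration: y = 1.0 + 2.0*x1 - 0.5*x2 + noise
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones gives the intercept beta_0
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))  # close to [1.0, 2.0, -0.5]
```

In practice you would use `statsmodels` or `sklearn`, which also give standard errors and p-values; the point here is that the coefficients are just the least-squares solution.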
Linear Regression Assumptions
Know these by heart - interviewers love asking about them:
| Assumption | Violation | Consequence |
|---|---|---|
| Linearity | Curved relationship | Biased predictions |
| Independence | Autocorrelated errors | Underestimated standard errors |
| Homoscedasticity | Variance changes with X | Invalid p-values |
| Normality | Non-normal residuals | Unreliable confidence intervals |
| No multicollinearity | Correlated predictors | Unstable coefficients |
How to check:
- Linearity: Residual vs fitted plot (should be random scatter)
- Independence: Durbin-Watson test (DW ≈ 2 is good)
- Homoscedasticity: Residual vs fitted plot (constant spread)
- Normality: Q-Q plot, Shapiro-Wilk test
- Multicollinearity: VIF > 10 indicates a problem
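Some of these checks are easy to compute by hand. A minimal sketch of the Durbin-Watson statistic on synthetic, well-behaved data (in practice you would use `statsmodels.stats.stattools.durbin_watson` on your model's residuals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)  # synthetic data with independent errors

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Durbin-Watson: sum of squared successive differences over sum of squares.
# Values near 2 indicate no autocorrelation; near 0 positive, near 4 negative.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(round(dw, 2))  # ≈ 2 for independent errors
```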
R-Squared Interpretation
R² = Proportion of variance in y explained by the model
| R² Value | Interpretation |
|---|---|
| 0.0 - 0.3 | Weak explanatory power |
| 0.3 - 0.6 | Moderate |
| 0.6 - 0.9 | Strong |
| > 0.9 | Very strong (or overfitting) |
Interview trap: "High R² means a good model"
Reality:
- R² never decreases as you add predictors, even useless ones (use Adjusted R² instead)
- High R² doesn't mean predictions are accurate
- High R² doesn't prove causation
- In some domains (e.g., social science), R² = 0.3 is excellent
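The first point above is easy to demonstrate: fitting nested models with least squares, the model with an extra predictor can never have a higher residual sum of squares, so R² can only go up. A sketch with a pure-noise predictor (Adjusted R² penalizes the extra term, which is why it is the better comparison metric):

```python
import numpy as np

def r2_adj(y, X):
    """Return (R², Adjusted R²) for an OLS fit of y on X (X includes intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    n, p = X.shape[0], X.shape[1] - 1  # p excludes the intercept column
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
noise = rng.normal(size=n)  # predictor unrelated to y

X1 = np.column_stack([np.ones(n), x])          # real predictor only
X2 = np.column_stack([np.ones(n), x, noise])   # plus the noise predictor
r2_1, adj_1 = r2_adj(y, X1)
r2_2, adj_2 = r2_adj(y, X2)
print(r2_2 >= r2_1)  # True: R² never decreases when a predictor is added
```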
Multicollinearity
When predictors are highly correlated:
- Individual coefficients become unreliable
- Standard errors inflate
- Signs may flip unexpectedly
Detection: Variance Inflation Factor (VIF)
VIFⱼ = 1 / (1 - R²ⱼ)
where R²ⱼ is the R² from regressing predictor j on all the other predictors.
VIF > 5: Moderate concern
VIF > 10: Serious problem
Solutions:
- Remove one of the correlated variables
- Combine into a single variable (PCA)
- Use regularization (Ridge, Lasso)
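The VIF definition is simple enough to implement directly: regress each predictor on the others and plug its R² into 1/(1 - R²). A sketch on synthetic data where one predictor is nearly a copy of another (in practice, `statsmodels.stats.outliers_influence.variance_inflation_factor` does this for you):

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        r2 = 1 - resid.var() / X[:, j].var()  # R² of predictor j on the rest
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
n = 400
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)  # nearly a copy of a -> collinear pair
c = rng.normal(size=n)            # independent predictor

vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))  # a and b: very large; c: near 1
```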
Logistic Regression
For binary outcomes (yes/no, click/no-click):
log(p / (1-p)) = β₀ + β₁x₁ + ...
Where p = probability of positive outcome
Coefficient interpretation: "A one-unit increase in x₁ is associated with a β₁ increase in the log-odds of the outcome."
More intuitive: Use odds ratios = exp(β₁)
- OR = 1.5: 50% higher odds per unit increase
- OR = 0.8: 20% lower odds per unit increase
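To make the odds-ratio interpretation concrete, here is a sketch that simulates a binary outcome with a known log-odds slope of 1.0 and fits it with Newton-Raphson (IRLS), which is how logistic models are typically estimated; the recovered odds ratio should be near exp(1.0) ≈ 2.7. The data-generating parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
logit = -0.5 + 1.0 * x              # assumed true log-odds
p = 1 / (1 + np.exp(-logit))
y = rng.binomial(1, p)

# Newton-Raphson / IRLS: beta <- beta + (X'WX)^-1 X'(y - mu)
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))   # predicted probabilities
    W = mu * (1 - mu)                    # IRLS weights
    grad = X.T @ (y - mu)
    hess = X.T @ (X * W[:, None])
    beta += np.linalg.solve(hess, grad)

odds_ratio = np.exp(beta[1])
print(round(odds_ratio, 2))  # ≈ exp(1.0) ≈ 2.7
```

A one-unit increase in x multiplies the odds of the outcome by roughly 2.7 here, which is the number you would report in an interview rather than the raw log-odds coefficient.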
Interview Question Pattern
"You run a regression and find that ice cream sales predict crime rates (β = 0.7, p < 0.01). What do you conclude?"
Good answer: "This is a classic example of confounding. Both ice cream sales and crime rates are likely caused by a third variable - temperature. Hot weather increases both. I would:
- Control for temperature in the model
- Note that correlation doesn't imply causation
- Look at the residual relationship after controlling for confounders"
Regression tells you about associations, not causes. Always be ready to explain why a relationship might be spurious.
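The ice-cream-and-crime story can be simulated directly: let temperature drive both variables, then compare the naive coefficient with the one obtained after controlling for temperature. The effect sizes here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
temp = rng.normal(size=n)                    # confounder: temperature
ice_cream = 0.8 * temp + rng.normal(size=n)  # driven by temperature
crime = 0.7 * temp + rng.normal(size=n)      # also driven by temperature
                                             # (no direct ice cream -> crime link)

def slope(y, cols):
    """OLS coefficient on the first predictor in cols."""
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

naive = slope(crime, [ice_cream])             # spurious positive association
adjusted = slope(crime, [ice_cream, temp])    # controls for the confounder
print(round(naive, 2), round(adjusted, 2))    # naive > 0, adjusted near 0
```

Once temperature enters the model, the ice-cream coefficient collapses toward zero, which is exactly the residual-relationship check described in the good answer above.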