ML Fundamentals Questions
Supervised Learning Deep Dive
Core Algorithms You Must Know
1. Linear Regression
Interview Question: "Explain how linear regression works and when you'd use it."
Answer Framework:
- What it is: Fits a linear relationship between features and continuous target
- Formula: y = w₁x₁ + w₂x₂ + ... + b
- How it learns: Minimizes Mean Squared Error (MSE) using normal equation or gradient descent
- When to use: Simple baseline, interpretable coefficients, linear relationships
- Limitations: Assumes linearity, sensitive to outliers, can't model complex patterns
Code Example:
```python
import numpy as np

def fit_linear_regression(X, y):
    """
    Fit using the normal equation: w = (X^T X)^(-1) X^T y
    Time: O(n * d^2 + d^3) where n=samples, d=features
    Space: O(d^2)
    """
    X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    weights = np.linalg.inv(X_with_bias.T @ X_with_bias) @ X_with_bias.T @ y
    return weights

# Test
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])
weights = fit_linear_regression(X, y)
print(f"Intercept: {weights[0]:.2f}, Slope: {weights[1]:.2f}")
```
Common Follow-ups:
- "What if X^T X is not invertible?" → Use regularization (Ridge/Lasso) or SVD
- "How do you handle categorical features?" → One-hot encoding or target encoding
- "Linear regression vs logistic regression?" → Linear for continuous, logistic for binary classification
2. Logistic Regression
Interview Question: "How does logistic regression work for classification?"
Answer Framework:
- What it is: Linear model + sigmoid function for probability estimates
- Formula: P(y=1|x) = σ(w^T x) where σ(z) = 1/(1 + e^(-z))
- Loss function: Binary cross-entropy (log loss)
- Decision boundary: Linear (can be made non-linear with feature engineering)
- Output: Probabilities between 0 and 1
Key Insight: Despite "regression" in the name, it's a classification algorithm.
Implementation:
```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_regression_predict(X, weights):
    """
    Predict probabilities using logistic regression
    X: (n_samples, n_features)
    weights: (n_features + 1,) including bias
    """
    X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    logits = X_with_bias @ weights
    probabilities = sigmoid(logits)
    return probabilities

# Convert probabilities to class predictions
def predict_class(probabilities, threshold=0.5):
    return (probabilities >= threshold).astype(int)
```
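To show how the weights used above could actually be learned, here is a sketch of batch gradient descent on the binary cross-entropy loss. The function name, learning rate, and iteration count are illustrative assumptions:

```python
def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """
    Minimize binary cross-entropy with batch gradient descent.
    Gradient of the loss w.r.t. the weights: X^T (sigmoid(Xw) - y) / n
    """
    X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    weights = np.zeros(X_with_bias.shape[1])
    for _ in range(n_iters):
        probs = sigmoid(X_with_bias @ weights)
        gradient = X_with_bias.T @ (probs - y) / len(y)
        weights -= lr * gradient
    return weights
```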
Common Interview Questions:
- "Why use sigmoid?" → Maps any real number to (0,1), differentiable for gradient descent
- "Multi-class logistic regression?" → Softmax regression (one-vs-all or multinomial)
- "How to choose threshold?" → Depends on precision/recall trade-off for your use case
3. Decision Trees
Interview Question: "Explain how decision trees make decisions and their pros/cons."
Answer Framework:
- How they work: Recursively split data based on features to maximize information gain
- Splitting criteria:
- Classification: Gini impurity or entropy
- Regression: Variance reduction
- Pros: Interpretable, handles non-linear relationships, no feature scaling needed
- Cons: Prone to overfitting, high variance, unstable (small data changes → different tree)
Gini Impurity Formula:
Gini = 1 - Σ(p_i)² for all classes i
Example Calculation:
```python
from collections import Counter

def gini_impurity(labels):
    """
    Calculate Gini impurity for a set of labels
    Example: [0, 0, 1, 1, 1] -> 0.48
    """
    counts = Counter(labels)
    total = len(labels)
    impurity = 1.0
    for count in counts.values():
        prob = count / total
        impurity -= prob ** 2
    return impurity

# Test
labels = [0, 0, 1, 1, 1]
print(f"Gini: {gini_impurity(labels):.2f}")  # 0.48
```
Key Interview Points:
- "How to prevent overfitting?" → Limit max_depth, min_samples_split, pruning
- "Decision tree vs random forest?" → Single tree overfits; forest aggregates many trees for better generalization
- "Can handle missing values?" → Yes, with surrogate splits or imputation
4. Random Forests
Interview Question: "How do random forests improve on decision trees?"
Answer Framework:
- Key technique: Bagging (Bootstrap Aggregating)
- Process:
- Create N bootstrap samples (sample with replacement)
- Train decision tree on each, using random subset of features at each split
- Average predictions (regression) or vote (classification)
- Why it works: Reduces variance while maintaining low bias
- Feature randomness: Decorrelates trees, prevents dominant features from appearing in every tree
Code Concept:
```python
from scipy import stats

def random_forest_predict(X, trees):
    """
    Aggregate predictions from multiple fitted trees.
    For classification: majority vote
    For regression: average
    """
    predictions = np.array([tree.predict(X) for tree in trees])
    # Classification: majority vote across trees
    final_pred = stats.mode(predictions, axis=0, keepdims=False).mode
    # Regression alternative:
    # final_pred = np.mean(predictions, axis=0)
    return final_pred
```
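To complement the aggregation step above, here is a hedged sketch of the bootstrap-sampling side of bagging, which also yields the out-of-bag indices mentioned below. The function name and use of numpy's random generator are illustrative assumptions:

```python
def bootstrap_sample(X, y, rng=None):
    """
    Draw a bootstrap sample (sampling with replacement) and return it
    together with the out-of-bag (OOB) indices: rows never drawn,
    which can serve as a built-in validation set for that tree.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    indices = rng.integers(0, n, size=n)   # Sample n rows with replacement
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[indices] = False              # Rows never selected are out-of-bag
    return X[indices], y[indices], np.where(oob_mask)[0]
```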
Advantages Over Single Tree:
- Lower variance (less overfitting)
- More robust to noise
- Feature importance estimates
- Out-of-bag (OOB) error for validation
Trade-offs:
- Less interpretable than single tree
- Slower to train and predict
- More memory intensive
5. Gradient Boosting (XGBoost, LightGBM)
Interview Question: "Explain gradient boosting and when to use it over random forests."
Answer Framework:
- Key technique: Boosting (sequential ensemble)
- Process:
- Train weak learner (shallow tree)
- Calculate residuals (errors)
- Train next tree to predict residuals
- Add to ensemble with learning rate
- Repeat
- Difference from bagging: Sequential (each tree corrects previous) vs parallel
Pseudocode:
```python
def gradient_boosting_concept(X, y, n_trees, learning_rate):
    """
    Conceptual gradient boosting (squared-error loss):
    F_0(x) = initial prediction (mean of y)
    F_m(x) = F_{m-1}(x) + learning_rate * h_m(x)
    where h_m(x) is a tree trained on the residuals of F_{m-1}
    """
    F = np.full(len(y), y.mean())  # Initial prediction
    for m in range(n_trees):
        residuals = y - F                 # Errors of the current ensemble
        tree_m = fit_tree(X, residuals)   # fit_tree is a placeholder for any regression-tree learner
        F += learning_rate * tree_m.predict(X)
    return F
```
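One way (an assumption, not the only choice) to turn the pseudocode into something runnable is to plug in scikit-learn's DecisionTreeRegressor as the weak learner:

```python
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Simple gradient boosting for regression with squared-error loss."""
    F = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_trees):
        residuals = y - F
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)            # Each tree learns the remaining errors
        F += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def gradient_boosting_predict(X, base_pred, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], base_pred)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```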
When to Use:
- Gradient Boosting (XGBoost, LightGBM):
  - Tabular data competitions (Kaggle)
  - Need highest accuracy
  - Have time for hyperparameter tuning
  - Features are heterogeneous
- Random Forests:
  - Want out-of-the-box performance
  - Less tuning time
  - More robust to hyperparameters
  - Easier to parallelize
Key Interview Points:
- "How to prevent overfitting?" → Lower learning rate, max_depth, early stopping
- "XGBoost vs LightGBM?" → LightGBM faster for large datasets, uses histogram-based splits
- "Why learning rate?" → Prevents overfitting by making smaller updates
Algorithm Comparison Table
| Algorithm | Interpretability | Speed | Accuracy | Overfitting Risk | Hyperparameter Tuning |
|---|---|---|---|---|---|
| Linear Regression | High | Fast | Low-Medium | Low | Minimal |
| Logistic Regression | High | Fast | Medium | Low | Minimal |
| Decision Tree | Medium-High | Fast | Medium | High | Medium |
| Random Forest | Low | Medium | High | Medium | Medium |
| Gradient Boosting | Low | Slow | Very High | High if not tuned | High |
How to Answer "When would you use algorithm X?"
Template:
- Nature of data: Linear vs non-linear patterns, feature types
- Problem requirements: Interpretability, speed, accuracy priority
- Data size: Small datasets → simpler models; large datasets → more complex models become viable
- Baseline: Start simple (linear/logistic), then try ensembles if needed
Example Answer:
"For a credit scoring problem with 50 features and 100K samples, I'd start with logistic regression as a baseline for interpretability and speed. If accuracy isn't sufficient, I'd try random forest for robust performance with minimal tuning. For maximum accuracy in a Kaggle-style competition, I'd use XGBoost with careful cross-validation and hyperparameter tuning."
Key Takeaways
- Know the fundamentals - Be able to explain math and intuition
- Understand trade-offs - No algorithm is best for everything
- Implementation matters - Know how to code from scratch for interviews
- Relate to experience - Connect to projects you've worked on
What's Next?
In the next lesson, we'll cover neural networks and deep learning concepts that frequently appear in ML interviews.
:::