ML Fundamentals Questions
Supervised Learning Deep Dive
Core Algorithms You Must Know
1. Linear Regression
Interview Question: "Explain how linear regression works and when you'd use it."
Answer Framework:
- What it is: Fits a linear relationship between features and continuous target
- Formula: y = w₁x₁ + w₂x₂ + ... + b
- How it learns: Minimizes Mean Squared Error (MSE) using normal equation or gradient descent
- When to use: Simple baseline, interpretable coefficients, linear relationships
- Limitations: Assumes linearity, sensitive to outliers, can't model complex patterns
Code Example:
```python
import numpy as np

def fit_linear_regression(X, y):
    """
    Fit using the normal equation: w = (X^T X)^(-1) X^T y
    Time: O(n * d^2 + d^3) where n=samples, d=features
    Space: O(d^2)
    """
    X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    weights = np.linalg.inv(X_with_bias.T @ X_with_bias) @ X_with_bias.T @ y
    return weights

# Test
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])
weights = fit_linear_regression(X, y)
print(f"Intercept: {weights[0]:.2f}, Slope: {weights[1]:.2f}")
```
Common Follow-ups:
- "What if X^T X is not invertible?" → Use regularization (Ridge/Lasso) or SVD
- "How do you handle categorical features?" → One-hot encoding or target encoding
- "Linear regression vs logistic regression?" → Linear for continuous, logistic for binary classification
2. Logistic Regression
Interview Question: "How does logistic regression work for classification?"
Answer Framework:
- What it is: Linear model + sigmoid function for probability estimates
- Formula: P(y=1|x) = σ(w^T x) where σ(z) = 1/(1 + e^(-z))
- Loss function: Binary cross-entropy (log loss)
- Decision boundary: Linear (can be made non-linear with feature engineering)
- Output: Probabilities between 0 and 1
Key Insight: Despite "regression" in the name, it's a classification algorithm.
Implementation:
```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_regression_predict(X, weights):
    """
    Predict probabilities using logistic regression
    X: (n_samples, n_features)
    weights: (n_features + 1,) including bias
    """
    X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    logits = X_with_bias @ weights
    probabilities = sigmoid(logits)
    return probabilities

# Convert probabilities to class predictions
def predict_class(probabilities, threshold=0.5):
    return (probabilities >= threshold).astype(int)
```
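To show how the weights used above could actually be learned, here is a sketch of batch gradient descent on the binary cross-entropy loss. The function name, learning rate, and iteration count are illustrative assumptions:

```python
def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """
    Minimize binary cross-entropy with batch gradient descent.
    Gradient of the loss w.r.t. the weights: X^T (sigmoid(Xw) - y) / n
    """
    X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    weights = np.zeros(X_with_bias.shape[1])
    for _ in range(n_iters):
        probs = sigmoid(X_with_bias @ weights)
        gradient = X_with_bias.T @ (probs - y) / len(y)
        weights -= lr * gradient
    return weights
```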
Common Interview Questions:
- "Why use sigmoid?" → Maps any real number to (0,1), differentiable for gradient descent
- "Multi-class logistic regression?" → Softmax regression (one-vs-all or multinomial)
- "How to choose threshold?" → Depends on precision/recall trade-off for your use case
3. Decision Trees
Interview Question: "Explain how decision trees make decisions and their pros/cons."
Answer Framework:
- How they work: Recursively split data based on features to maximize information gain
- Splitting criteria:
- Classification: Gini impurity or entropy
- Regression: Variance reduction
- Pros: Interpretable, handles non-linear relationships, no feature scaling needed
- Cons: Prone to overfitting, high variance, unstable (small data changes → different tree)
Gini Impurity Formula:
Gini = 1 - Σ(p_i)² for all classes i
Example Calculation:
```python
from collections import Counter

def gini_impurity(labels):
    """
    Calculate Gini impurity for a set of labels
    Example: [0, 0, 1, 1, 1] -> 0.48
    """
    counts = Counter(labels)
    total = len(labels)
    impurity = 1.0
    for count in counts.values():
        prob = count / total
        impurity -= prob ** 2
    return impurity

# Test
labels = [0, 0, 1, 1, 1]
print(f"Gini: {gini_impurity(labels):.2f}")  # 0.48
```
Key Interview Points:
- "How to prevent overfitting?" → Limit max_depth, min_samples_split, pruning
- "Decision tree vs random forest?" → Single tree overfits; forest aggregates many trees for better generalization
- "Can handle missing values?" → Yes, with surrogate splits or imputation
4. Random Forests
Interview Question: "How do random forests improve on decision trees?"
Answer Framework:
- Key technique: Bagging (Bootstrap Aggregating)
- Process:
- Create N bootstrap samples (sample with replacement)
- Train decision tree on each, using random subset of features at each split
- Average predictions (regression) or vote (classification)
- Why it works: Reduces variance while maintaining low bias
- Feature randomness: Decorrelates trees, prevents dominant features from appearing in every tree
Code Concept:
```python
from scipy import stats

def random_forest_predict(X, trees):
    """
    Aggregate predictions from multiple fitted trees.
    For classification: majority vote
    For regression: average
    """
    predictions = np.array([tree.predict(X) for tree in trees])
    # Classification: majority vote across trees
    final_pred = stats.mode(predictions, axis=0, keepdims=False).mode
    # Regression alternative:
    # final_pred = np.mean(predictions, axis=0)
    return final_pred
```
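To complement the aggregation step above, here is a hedged sketch of the bootstrap-sampling side of bagging, which also yields the out-of-bag indices mentioned below. The function name and use of numpy's random generator are illustrative assumptions:

```python
def bootstrap_sample(X, y, rng=None):
    """
    Draw a bootstrap sample (sampling with replacement) and return it
    together with the out-of-bag (OOB) indices: rows never drawn,
    which can serve as a built-in validation set for that tree.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    indices = rng.integers(0, n, size=n)   # Sample n rows with replacement
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[indices] = False              # Rows never selected are out-of-bag
    return X[indices], y[indices], np.where(oob_mask)[0]
```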
Advantages Over Single Tree:
- Lower variance (less overfitting)
- More robust to noise
- Feature importance estimates
- Out-of-bag (OOB) error for validation
Trade-offs:
- Less interpretable than single tree
- Slower to train and predict
- More memory intensive
5. Gradient Boosting (XGBoost, LightGBM)
Interview Question: "Explain gradient boosting and when to use it over random forests."
Answer Framework:
- Key technique: Boosting (sequential ensemble)
- Process:
- Train weak learner (shallow tree)
- Calculate residuals (errors)
- Train next tree to predict residuals
- Add to ensemble with learning rate
- Repeat
- Difference from bagging: Sequential (each tree corrects previous) vs parallel
Pseudocode:
```python
def gradient_boosting_concept(X, y, n_trees, learning_rate):
    """
    Conceptual gradient boosting (squared-error loss):
    F_0(x) = initial prediction (mean of y)
    F_m(x) = F_{m-1}(x) + learning_rate * h_m(x)
    where h_m(x) is a tree trained on the residuals of F_{m-1}
    """
    F = np.full(len(y), y.mean())  # Initial prediction
    for m in range(n_trees):
        residuals = y - F                 # Errors of the current ensemble
        tree_m = fit_tree(X, residuals)   # fit_tree is a placeholder for any regression-tree learner
        F += learning_rate * tree_m.predict(X)
    return F
```
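One way (an assumption, not the only choice) to turn the pseudocode into something runnable is to plug in scikit-learn's DecisionTreeRegressor as the weak learner:

```python
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Simple gradient boosting for regression with squared-error loss."""
    F = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_trees):
        residuals = y - F
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)            # Each tree learns the remaining errors
        F += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def gradient_boosting_predict(X, base_pred, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], base_pred)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```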
When to Use:
- Gradient Boosting (XGBoost, LightGBM):
  - Tabular data competitions (Kaggle)
  - Need highest accuracy
  - Have time for hyperparameter tuning
  - Features are heterogeneous
- Random Forests:
  - Want out-of-the-box performance
  - Less tuning time
  - More robust to hyperparameters
  - Easier to parallelize
Key Interview Points:
- "How to prevent overfitting?" → Lower learning rate, max_depth, early stopping
- "XGBoost vs LightGBM?" → LightGBM faster for large datasets, uses histogram-based splits
- "Why learning rate?" → Prevents overfitting by making smaller updates
Algorithm Comparison Table
| Algorithm | Interpretability | Speed | Accuracy | Overfitting Risk | Hyperparameter Tuning |
|---|---|---|---|---|---|
| Linear Regression | High | Fast | Low-Medium | Low | Minimal |
| Logistic Regression | High | Fast | Medium | Low | Minimal |
| Decision Tree | Medium-High | Fast | Medium | High | Medium |
| Random Forest | Low | Medium | High | Medium | Medium |
| Gradient Boosting | Low | Slow | Very High | High if not tuned | High |
How to Answer "When would you use algorithm X?"
Template:
- Nature of data: Linear vs non-linear patterns, feature types
- Problem requirements: Interpretability, speed, accuracy priority
- Data size: Small datasets → simpler models; large datasets → more complex models become viable
- Baseline: Start simple (linear/logistic), then try ensembles if needed
Example Answer:
"For a credit scoring problem with 50 features and 100K samples, I'd start with logistic regression as a baseline for interpretability and speed. If accuracy isn't sufficient, I'd try random forest for robust performance with minimal tuning. For maximum accuracy in a Kaggle-style competition, I'd use XGBoost with careful cross-validation and hyperparameter tuning."
Key Takeaways
- Know the fundamentals - Be able to explain math and intuition
- Understand trade-offs - No algorithm is best for everything
- Implementation matters - Know how to code from scratch for interviews
- Relate to experience - Connect to projects you've worked on
What's Next?
In the next lesson, we'll cover neural networks and deep learning concepts that frequently appear in ML interviews.
:::