ML Fundamentals Questions

Supervised Learning Deep Dive

5 min read

Core Algorithms You Must Know

1. Linear Regression

Interview Question: "Explain how linear regression works and when you'd use it."

Answer Framework:

  • What it is: Fits a linear relationship between features and continuous target
  • Formula: y = w₁x₁ + w₂x₂ + ... + b
  • How it learns: Minimizes Mean Squared Error (MSE) using normal equation or gradient descent
  • When to use: Simple baseline, interpretable coefficients, linear relationships
  • Limitations: Assumes linearity, sensitive to outliers, can't model complex patterns

Code Example:

import numpy as np

def fit_linear_regression(X, y):
    """
    Fit using normal equation: w = (X^T X)^(-1) X^T y

    Time: O(n * d^2 + d^3) to form X^T X and invert it, where n=samples, d=features
    Space: O(d^2)
    """
    X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    weights = np.linalg.inv(X_with_bias.T @ X_with_bias) @ X_with_bias.T @ y
    return weights

# Test
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])
weights = fit_linear_regression(X, y)
print(f"Intercept: {weights[0]:.2f}, Slope: {weights[1]:.2f}")

Common Follow-ups:

  • "What if X^T X is not invertible?" → Use regularization (Ridge/Lasso) or SVD
  • "How do you handle categorical features?" → One-hot encoding or target encoding
  • "Linear regression vs logistic regression?" → Linear for continuous, logistic for binary classification

2. Logistic Regression

Interview Question: "How does logistic regression work for classification?"

Answer Framework:

  • What it is: Linear model + sigmoid function for probability estimates
  • Formula: P(y=1|x) = σ(w^T x) where σ(z) = 1/(1 + e^(-z))
  • Loss function: Binary cross-entropy (log loss)
  • Decision boundary: Linear (can be made non-linear with feature engineering)
  • Output: Probabilities between 0 and 1

Key Insight: Despite "regression" in the name, it's a classification algorithm.

Implementation:

def sigmoid(z):
    # Maps any real value to (0, 1); very large negative z may trigger an overflow warning in np.exp
    return 1 / (1 + np.exp(-z))

def logistic_regression_predict(X, weights):
    """
    Predict probabilities using logistic regression

    X: (n_samples, n_features)
    weights: (n_features + 1,) including bias
    """
    X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    logits = X_with_bias @ weights
    probabilities = sigmoid(logits)
    return probabilities

# Convert probabilities to class predictions
def predict_class(probabilities, threshold=0.5):
    return (probabilities >= threshold).astype(int)
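
The framework above names binary cross-entropy as the loss; below is a minimal training sketch by batch gradient descent (function names, learning rate, and iteration count are illustrative assumptions, not a fixed recipe):

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average log loss; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Learn weights (bias included) by gradient descent on the log loss."""
    X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    weights = np.zeros(X_with_bias.shape[1])
    for _ in range(n_iters):
        probabilities = sigmoid(X_with_bias @ weights)
        gradient = X_with_bias.T @ (probabilities - y) / len(y)  # gradient of the mean log loss
        weights -= lr * gradient
    return weights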

Common Interview Questions:

  • "Why use sigmoid?" → Maps any real number to (0,1), differentiable for gradient descent
  • "Multi-class logistic regression?" → Softmax regression (one-vs-all or multinomial)
  • "How to choose threshold?" → Depends on precision/recall trade-off for your use case

3. Decision Trees

Interview Question: "Explain how decision trees make decisions and their pros/cons."

Answer Framework:

  • How they work: Recursively split data based on features to maximize information gain
  • Splitting criteria:
    • Classification: Gini impurity or entropy
    • Regression: Variance reduction
  • Pros: Interpretable, handles non-linear relationships, no feature scaling needed
  • Cons: Prone to overfitting, high variance, unstable (small data changes → different tree)

Gini Impurity Formula:

Gini = 1 - Σ(p_i)² for all classes i

Example Calculation:

def gini_impurity(labels):
    """
    Calculate Gini impurity for a set of labels

    Example: [0, 0, 1, 1, 1] -> 0.48
    """
    from collections import Counter
    counts = Counter(labels)
    total = len(labels)
    impurity = 1.0

    for count in counts.values():
        prob = count / total
        impurity -= prob ** 2

    return impurity

# Test
labels = [0, 0, 1, 1, 1]
print(f"Gini: {gini_impurity(labels):.2f}")  # 0.48

Key Interview Points:

  • "How to prevent overfitting?" → Limit max_depth, min_samples_split, pruning
  • "Decision tree vs random forest?" → Single tree overfits; forest aggregates many trees for better generalization
  • "Can handle missing values?" → Yes, with surrogate splits or imputation

4. Random Forests

Interview Question: "How do random forests improve on decision trees?"

Answer Framework:

  • Key technique: Bagging (Bootstrap Aggregating)
  • Process:
    1. Create N bootstrap samples (sample with replacement; sketched after this list)
    2. Train decision tree on each, using random subset of features at each split
    3. Average predictions (regression) or vote (classification)
  • Why it works: Reduces variance while maintaining low bias
  • Feature randomness: Decorrelates trees, prevents dominant features from appearing in every tree
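
A minimal sketch of the bootstrap step (step 1 above); the helper name and use of NumPy's random Generator are illustrative:

def bootstrap_sample(X, y, rng=None):
    """Draw one bootstrap sample: n rows sampled with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]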

Code Concept:

def random_forest_predict(X, trees):
    """
    Aggregate predictions from multiple trees

    For classification: majority vote
    For regression: average
    """
    predictions = np.array([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)

    # Classification: majority vote across trees
    from scipy import stats
    final_pred = stats.mode(predictions, axis=0, keepdims=False).mode

    # Regression alternative:
    # final_pred = np.mean(predictions, axis=0)

    return final_pred

Advantages Over Single Tree:

  • Lower variance (less overfitting)
  • More robust to noise
  • Feature importance estimates
  • Out-of-bag (OOB) error for validation
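
On the OOB point: each tree never sees the rows left out of its bootstrap sample, so those rows act as a free validation set for that tree. A minimal sketch (helper name is illustrative):

def oob_indices(n_samples, bootstrap_idx):
    """Rows never drawn in a bootstrap sample; usable as a free validation set."""
    return np.setdiff1d(np.arange(n_samples), bootstrap_idx)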

Trade-offs:

  • Less interpretable than single tree
  • Slower to train and predict
  • More memory intensive

5. Gradient Boosting (XGBoost, LightGBM)

Interview Question: "Explain gradient boosting and when to use it over random forests."

Answer Framework:

  • Key technique: Boosting (sequential ensemble)
  • Process:
    1. Train weak learner (shallow tree)
    2. Calculate residuals (errors)
    3. Train next tree to predict residuals
    4. Add to ensemble with learning rate
    5. Repeat
  • Difference from bagging: Sequential (each tree corrects previous) vs parallel

Pseudocode:

def gradient_boosting_concept(X, y, n_trees, learning_rate):
    """
    Conceptual gradient boosting

    F_0(x) = initial prediction (mean)
    F_m(x) = F_{m-1}(x) + learning_rate * h_m(x)

    where h_m(x) is a tree trained on residuals
    """
    F = np.full(len(y), y.mean())  # Initial prediction

    for m in range(n_trees):
        residuals = y - F
        tree_m = fit_tree(X, residuals)  # fit_tree: placeholder for fitting a shallow regression tree to the residuals
        F += learning_rate * tree_m.predict(X)

    return F

When to Use:

  • Gradient Boosting (XGBoost, LightGBM):

    • Tabular data competitions (Kaggle)
    • Need highest accuracy
    • Have time for hyperparameter tuning
    • Features are heterogeneous
  • Random Forests:

    • Want out-of-the-box performance
    • Less tuning time
    • More robust to hyperparameters
    • Easier to parallelize

Key Interview Points:

  • "How to prevent overfitting?" → Lower learning rate, max_depth, early stopping
  • "XGBoost vs LightGBM?" → LightGBM faster for large datasets, uses histogram-based splits
  • "Why learning rate?" → Prevents overfitting by making smaller updates

Algorithm Comparison Table

Algorithm            Interpretability   Speed    Accuracy     Overfitting Risk     Hyperparameter Tuning
Linear Regression    High               Fast     Low-Medium   Low                  Minimal
Logistic Regression  High               Fast     Medium       Low                  Minimal
Decision Tree        Medium-High        Fast     Medium       High                 Medium
Random Forest        Low                Medium   High         Medium               Medium
Gradient Boosting    Low                Slow     Very High    High if not tuned    High

How to Answer "When would you use algorithm X?"

Template:

  1. Nature of data: Linear vs non-linear patterns, feature types
  2. Problem requirements: Interpretability, speed, accuracy priority
  3. Data size: Small datasets → simpler models; large → can use complex
  4. Baseline: Start simple (linear/logistic), then try ensembles if needed

Example Answer:

"For a credit scoring problem with 50 features and 100K samples, I'd start with logistic regression as a baseline for interpretability and speed. If accuracy isn't sufficient, I'd try random forest for robust performance with minimal tuning. For maximum accuracy in a Kaggle-style competition, I'd use XGBoost with careful cross-validation and hyperparameter tuning."

Key Takeaways

  1. Know the fundamentals - Be able to explain math and intuition
  2. Understand trade-offs - No algorithm is best for everything
  3. Implementation matters - Know how to code from scratch for interviews
  4. Relate to experience - Connect to projects you've worked on

What's Next?

In the next lesson, we'll cover neural networks and deep learning concepts that frequently appear in ML interviews.
