ML Fundamentals Questions

Optimization & Training

Gradient Descent Variants

Interview Question: "Explain different gradient descent algorithms and trade-offs."

1. Batch Gradient Descent (BGD)

# Update using entire dataset
for epoch in range(num_epochs):
    gradient = compute_gradient(X_all, y_all, weights)
    weights -= learning_rate * gradient
  • Pros: Stable convergence, exact gradient
  • Cons: Slow for large datasets, memory intensive
  • Use: Small datasets
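
For concreteness, here is what a compute_gradient helper might look like for linear regression with MSE loss (a hypothetical implementation; the snippets in this section just assume something equivalent exists):

import numpy as np

def compute_gradient(X, y, weights):
    # Gradient of MSE for a linear model y_hat = X @ weights; X has shape (n, d)
    y_pred = X @ weights
    error = y_pred - y
    return 2 * X.T @ error / len(y)

For the per-sample SGD loop in the next section, x_i and y_i would need to keep a leading batch dimension (shape (1, d)) for the same helper to work.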

2. Stochastic Gradient Descent (SGD)

# Update after each sample
for epoch in range(num_epochs):
    for x_i, y_i in shuffle(data):
        gradient = compute_gradient(x_i, y_i, weights)
        weights -= learning_rate * gradient
  • Pros: Fast updates, can escape local minima
  • Cons: Noisy updates; won't settle exactly at the minimum unless the learning rate is decayed
  • Use: Online learning, large datasets

3. Mini-Batch SGD

# Update using small batches (e.g., 32, 64, 128)
for epoch in range(num_epochs):
    for X_batch, y_batch in get_batches(X, y, batch_size=32):
        gradient = compute_gradient(X_batch, y_batch, weights)
        weights -= learning_rate * gradient
  • Pros: Best of both worlds, GPU-friendly
  • Cons: Requires batch size tuning
  • Use: Most common in practice
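
The get_batches helper is not defined above; a minimal sketch, assuming NumPy arrays, could look like:

import numpy as np

def get_batches(X, y, batch_size=32, shuffle=True):
    # Yield (X_batch, y_batch) pairs covering the dataset once per epoch
    indices = np.arange(len(X))
    if shuffle:
        np.random.shuffle(indices)  # fresh shuffle each epoch keeps updates decorrelated
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]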

4. Momentum

# Accumulates an exponentially decaying moving average of gradients (velocity)
velocity = 0
for t in range(num_iterations):
    gradient = compute_gradient(...)
    velocity = beta * velocity + gradient
    weights -= learning_rate * velocity
  • Benefit: Accelerates in relevant direction, dampens oscillations
  • Hyperparameter: β typically 0.9

5. Adam (Adaptive Moment Estimation)

m = 0  # First moment estimate (mean of gradients)
v = 0  # Second moment estimate (uncentered variance of gradients)

for t in range(1, num_iterations + 1):  # t starts at 1 for bias correction
    gradient = compute_gradient(...)
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient**2

    m_hat = m / (1 - beta1**t)  # Bias correction (moments are initialized at 0)
    v_hat = v / (1 - beta2**t)

    weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
  • Pros: Adaptive learning rates, works well out-of-the-box
  • Defaults: β1=0.9, β2=0.999, ε=1e-8
  • Use: Default choice for most deep learning
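
For reference, the same update with library defaults in PyTorch might look like this toy sketch (the linear model, random data, and loss are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # toy model
loss_fn = nn.MSELoss()
X, y = torch.randn(256, 10), torch.randn(256, 1)  # placeholder data

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

for step in range(100):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(X), y)      # forward pass
    loss.backward()                  # backprop
    optimizer.step()                 # Adam update with the defaults above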

Comparison:

Method       Speed    Memory   Convergence   Hyperparameters
BGD          Slow     High     Smooth        Few
SGD          Fast     Low      Noisy         Few
Mini-batch   Medium   Medium   Good          Batch size
Momentum     Medium   Medium   Faster        β, LR
Adam         Medium   Medium   Fastest       β1, β2, LR

Learning Rate Strategies

Interview Question: "How do you choose and adjust learning rate?"

1. Fixed Learning Rate

lr = 0.001  # Keep constant
  • Simple but suboptimal

2. Step Decay

lr = initial_lr * (decay_rate ** (epoch // drop_every))
# Example: 0.1 → 0.01 → 0.001

3. Exponential Decay

lr = initial_lr * exp(-decay_rate * epoch)

4. Cosine Annealing

lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(epoch / total_epochs * π))
  • Smooth decay following cosine curve
  • Used in state-of-the-art models

5. Learning Rate Warmup

# Start small, linearly increase to target
if epoch < warmup_epochs:
    lr = target_lr * (epoch + 1) / warmup_epochs
else:
    lr = target_lr
  • Prevents early instability
  • Common in transformer training
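
These strategies are often combined. A minimal sketch of a warmup-then-cosine schedule (the function name get_lr and its defaults are illustrative):

import math

def get_lr(epoch, total_epochs, target_lr=3e-4, min_lr=1e-6, warmup_epochs=5):
    # Linear warmup to target_lr, then cosine decay down to min_lr
    if epoch < warmup_epochs:
        return target_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (target_lr - min_lr) * (1 + math.cos(math.pi * progress))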

Finding Initial LR

LR Range Test:

  1. Start with very small LR (1e-7)
  2. Increase exponentially each batch
  3. Plot loss vs LR
  4. Choose LR where loss decreases fastest

def find_lr(model, data, start_lr=1e-7, end_lr=10, num_iter=100):
    # train_step (assumed helper) runs one training step at the given lr
    # and returns the training loss
    lrs = []
    losses = []
    lr = start_lr

    for i in range(num_iter):
        loss = train_step(model, data, lr)
        lrs.append(lr)
        losses.append(loss)
        lr *= (end_lr / start_lr) ** (1 / num_iter)

    # Plot and find steepest descent
    import matplotlib.pyplot as plt
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.show()

Loss Functions

Interview Question: "What loss function should you use for X task?"

Classification

Binary Cross-Entropy (Logistic Loss):

def binary_crossentropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))
  • Use: Binary classification with sigmoid output

Categorical Cross-Entropy:

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=1).mean()
  • Use: Multi-class classification with softmax output

Focal Loss (for imbalanced data):

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Down-weight easy examples: (1 - p)^gamma for positives, p^gamma for negatives
    pos = -alpha * (1 - y_pred)**gamma * y_true * np.log(y_pred)
    neg = -(1 - alpha) * y_pred**gamma * (1 - y_true) * np.log(1 - y_pred)
    return np.mean(pos + neg)
  • Down-weights easy examples
  • Use: Severe class imbalance (object detection)

Regression

Mean Squared Error (MSE / L2):

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)
  • Sensitive to outliers
  • Penalizes large errors more

Mean Absolute Error (MAE / L1):

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
  • Robust to outliers
  • Equal weight to all errors

Huber Loss (smooth L1):

def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    small_error = 0.5 * error**2
    large_error = delta * (np.abs(error) - 0.5 * delta)
    return np.where(is_small, small_error, large_error).mean()
  • Combines MSE and MAE
  • L2 for small errors, L1 for large

Training Best Practices

1. Data Preprocessing

# Normalization (mean=0, std=1)
X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# Min-max scaling (per feature)
X_train = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
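
A point worth stating explicitly in an interview: compute the statistics on the training split only and reuse them for validation/test. A small sketch with placeholder data:

import numpy as np

X_train, X_test = np.random.randn(100, 5), np.random.randn(20, 5)  # placeholder data

# Fit normalization statistics on the training split only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std   # reuse the train statistics; never refit on test data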

2. Weight Initialization

Xavier/Glorot (for tanh/sigmoid):

W = np.random.randn(n_in, n_out) * np.sqrt(2 / (n_in + n_out))

He Initialization (for ReLU):

W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)

Poor initialization → vanishing/exploding gradients

3. Batch Normalization

  • Normalizes layer inputs
  • Reduces internal covariate shift
  • Allows higher learning rates
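
A minimal sketch of the batch norm forward pass in training mode (ignoring the running statistics used at inference):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then apply a learnable scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta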

4. Gradient Clipping

# Value clipping: prevent exploding gradients by capping each component
gradients = np.clip(gradients, -threshold, threshold)

# Or norm clipping: rescale the whole gradient vector if its norm is too large
if np.linalg.norm(gradients) > threshold:
    gradients = gradients * threshold / np.linalg.norm(gradients)

5. Early Stopping

best_val_loss = float('inf')
patience = 10
patience_counter = 0

for epoch in range(max_epochs):
    train(...)
    val_loss = validate(...)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_model(...)
        patience_counter = 0
    else:
        patience_counter += 1

    if patience_counter >= patience:
        print("Early stopping")
        break

Interview Debugging Scenarios

Q: "Training loss plateaus at high value. What's wrong?" A:

  • Learning rate too low → Increase
  • Bad initialization → Try He/Xavier
  • Vanishing gradient → Check activations, add batch norm
  • Model too simple → Increase capacity

Q: "Training loss decreases but validation loss increases. What's wrong?" A:

  • Overfitting
  • Solutions: Dropout, L2 regularization, more data, early stopping, data augmentation

Q: "Loss is NaN or explodes. What to check?" A:

  • Learning rate too high → Reduce
  • Exploding gradients → Gradient clipping
  • Division by zero → Add epsilon in denominators
  • Input not normalized → Normalize features

Q: "Model doesn't learn (loss stays constant)?" A:

  • Dead ReLUs → Try Leaky ReLU, check initialization
  • Learning rate too high or too low → LR range test
  • Wrong loss function → Verify task matches loss
  • Bug in code → Check gradient flow, print intermediate outputs
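
For the last two points, a per-layer gradient check usually locates the problem quickly. An illustrative PyTorch sketch with a toy model and placeholder data:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))  # toy model
X, y = torch.randn(64, 10), torch.randn(64, 1)                          # placeholder data

loss = nn.MSELoss()(model(X), y)
loss.backward()

for name, param in model.named_parameters():
    # Near-zero norms in early layers point to vanishing gradients or dead ReLUs;
    # very large norms point to exploding gradients
    print(f"{name}: grad norm = {param.grad.norm().item():.3e}")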

Key Takeaways

  1. Adam is default - Works well for most problems
  2. LR is most important hyperparameter - Use LR finder or start with 3e-4 (Karpathy's constant)
  3. Normalize inputs - Critical for convergence
  4. He init for ReLU - Prevents vanishing gradients
  5. Monitor both train and val loss - Early stopping prevents overfitting
