ML Fundamentals Questions
Optimization & Training
Gradient Descent Variants
Interview Question: "Explain different gradient descent algorithms and trade-offs."
1. Batch Gradient Descent (BGD)
```python
# Update using entire dataset
for epoch in range(num_epochs):
    gradient = compute_gradient(X_all, y_all, weights)
    weights -= learning_rate * gradient
```
- Pros: Stable convergence, exact gradient
- Cons: Slow for large datasets, memory intensive
- Use: Small datasets
2. Stochastic Gradient Descent (SGD)
```python
# Update after each sample
for epoch in range(num_epochs):
    for x_i, y_i in shuffle(data):
        gradient = compute_gradient(x_i, y_i, weights)
        weights -= learning_rate * gradient
```
- Pros: Fast updates, can escape local minima
- Cons: Noisy updates, doesn't converge exactly
- Use: Online learning, large datasets
3. Mini-Batch SGD
```python
# Update using small batches (e.g., 32, 64, 128)
for epoch in range(num_epochs):
    for X_batch, y_batch in get_batches(X, y, batch_size=32):
        gradient = compute_gradient(X_batch, y_batch, weights)
        weights -= learning_rate * gradient
```
- Pros: Best of both worlds, GPU-friendly
- Cons: Requires batch size tuning
- Use: Most common in practice
4. Momentum
```python
# Accumulates velocity as a decaying average of past gradients
velocity = 0
for t in range(num_iterations):
    gradient = compute_gradient(...)
    velocity = beta * velocity + gradient
    weights -= learning_rate * velocity
```
- Benefit: Accelerates progress along the relevant direction, dampens oscillations
- Hyperparameter: β typically 0.9
5. Adam (Adaptive Moment Estimation)
```python
m = 0  # First moment (mean of gradients)
v = 0  # Second moment (uncentered variance of gradients)
for t in range(1, num_iterations + 1):
    gradient = compute_gradient(...)
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient**2
    m_hat = m / (1 - beta1**t)  # Bias correction
    v_hat = v / (1 - beta2**t)
    weights -= learning_rate * m_hat / (sqrt(v_hat) + epsilon)
```
- Pros: Adaptive learning rates, works well out-of-the-box
- Defaults: β1=0.9, β2=0.999, ε=1e-8
- Use: Default choice for most deep learning
Comparison:
| Method | Speed | Memory | Convergence | Hyperparameters |
|---|---|---|---|---|
| BGD | Slow | High | Smooth | LR |
| SGD | Fast | Low | Noisy | LR |
| Mini-batch | Medium | Medium | Good | LR, batch size |
| Momentum | Medium | Medium | Faster | LR, β |
| Adam | Medium | Medium | Fastest | LR, β1, β2 |
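As a quick sanity check of these trade-offs, the sketch below compares plain SGD and Adam on a toy one-dimensional quadratic; the loss function, learning rate, and iteration count are illustrative assumptions, not part of the comparison above.

```python
import numpy as np

def grad(w):
    return 2 * (w - 3.0)      # Gradient of the toy loss f(w) = (w - 3)^2

# Plain SGD
w_sgd, lr = 0.0, 0.1
for _ in range(50):
    w_sgd -= lr * grad(w_sgd)

# Adam with the default hyperparameters listed above
w_adam, m, v = 0.0, 0.0, 0.0
beta1, beta2, eps = 0.9, 0.999, 1e-8
for t in range(1, 51):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w_sgd, w_adam)          # Both should end up close to the minimum at w = 3
```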
Learning Rate Strategies
Interview Question: "How do you choose and adjust learning rate?"
1. Fixed Learning Rate
```python
lr = 0.001  # Keep constant
```
- Simple but suboptimal
2. Step Decay
```python
lr = initial_lr * (decay_rate ** (epoch // drop_every))
# Example: 0.1 → 0.01 → 0.001
```
3. Exponential Decay
```python
import math
lr = initial_lr * math.exp(-decay_rate * epoch)
```
4. Cosine Annealing
```python
import math
lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))
```
- Smooth decay following cosine curve
- Used in state-of-the-art models
5. Learning Rate Warmup
```python
# Start small, linearly increase to target
if epoch < warmup_epochs:
    lr = target_lr * (epoch / warmup_epochs)
else:
    lr = target_lr
```
- Prevents early instability
- Common in transformer training
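Warmup and cosine decay are frequently combined in practice. A minimal sketch of such a combined schedule follows; the function name `lr_at` and its default arguments are illustrative assumptions, not from a specific library.

```python
import math

def lr_at(step, total_steps, warmup_steps=1000, max_lr=3e-4, min_lr=1e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```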
Finding Initial LR
LR Range Test:
- Start with very small LR (1e-7)
- Increase exponentially each batch
- Plot loss vs LR
- Choose LR where loss decreases fastest
```python
import matplotlib.pyplot as plt

def find_lr(model, data, start_lr=1e-7, end_lr=10, num_iter=100):
    lrs, losses = [], []
    lr = start_lr
    for i in range(num_iter):
        loss = train_step(model, data, lr)   # One training step at the current LR
        lrs.append(lr)
        losses.append(loss)
        lr *= (end_lr / start_lr) ** (1 / num_iter)
    # Plot loss vs. LR on a log axis and pick the LR where loss falls fastest
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.show()
```
Loss Functions
Interview Question: "What loss function should you use for X task?"
Classification
Binary Cross-Entropy (Logistic Loss):
```python
import numpy as np

def binary_crossentropy(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)   # Avoid log(0)
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))
```
- Use: Binary classification with sigmoid output
Categorical Cross-Entropy:
```python
def categorical_crossentropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred), axis=1).mean()
```
- Use: Multi-class classification with softmax output
Focal Loss (for imbalanced data):
```python
def focal_loss(y_true, y_pred, alpha=0.25, gamma=2):
    ce = -y_true * np.log(y_pred)                      # Standard cross-entropy term
    return np.mean(alpha * (1 - y_pred)**gamma * ce)   # Down-weight easy, well-classified examples
```
- Down-weights easy examples
- Use: Severe class imbalance (object detection)
Regression
Mean Squared Error (MSE / L2):
```python
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)
```
- Sensitive to outliers
- Penalizes large errors more
Mean Absolute Error (MAE / L1):
```python
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
```
- Robust to outliers
- Equal weight to all errors
Huber Loss (smooth L1):
```python
def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    small_error = 0.5 * error**2
    large_error = delta * (np.abs(error) - 0.5 * delta)
    return np.where(is_small, small_error, large_error).mean()
```
- Combines MSE and MAE
- L2 for small errors, L1 for large
Training Best Practices
1. Data Preprocessing
```python
# Normalization (mean=0, std=1), computed per feature on the training set
X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# Alternative: min-max scaling to [0, 1], also per feature
X_train = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
```
Fit the scaling statistics on the training set only, then apply the same statistics to validation and test data.
2. Weight Initialization
Xavier/Glorot (for tanh/sigmoid):
```python
W = np.random.randn(n_in, n_out) * np.sqrt(1 / n_in)
```
He Initialization (for ReLU):
```python
W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)
```
Poor initialization → vanishing/exploding gradients
3. Batch Normalization
- Normalizes layer inputs
- Reduces internal covariate shift
- Allows higher learning rates
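A minimal sketch of the batch-norm forward pass at training time, assuming a (batch_size, num_features) input and learnable scale/shift parameters `gamma` and `beta` (running-statistics tracking for inference is omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features); gamma, beta: learnable per-feature scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # Normalize each feature over the batch
    return gamma * x_hat + beta               # Learnable rescaling preserves capacity
```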
4. Gradient Clipping
```python
# Value clipping: cap each gradient component to prevent exploding gradients
gradients = np.clip(gradients, -threshold, threshold)

# Or norm clipping: rescale if the overall gradient norm exceeds the threshold
grad_norm = np.linalg.norm(gradients)
if grad_norm > threshold:
    gradients = gradients * threshold / grad_norm
```
5. Early Stopping
```python
best_val_loss = float('inf')
patience = 10
patience_counter = 0

for epoch in range(max_epochs):
    train(...)
    val_loss = validate(...)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_model(...)
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping")
            break
```
Interview Debugging Scenarios
Q: "Training loss plateaus at high value. What's wrong?" A:
- Learning rate too low → Increase
- Bad initialization → Try He/Xavier
- Vanishing gradient → Check activations, add batch norm
- Model too simple → Increase capacity
Q: "Training loss decreases but validation loss increases. What's wrong?" A:
- Overfitting
- Solutions: Dropout, L2 regularization, more data, early stopping, data augmentation
Q: "Loss is NaN or explodes. What to check?" A:
- Learning rate too high → Reduce
- Exploding gradients → Gradient clipping
- Division by zero → Add epsilon in denominators
- Input not normalized → Normalize features
Q: "Model doesn't learn (loss stays constant)?" A:
- Dead ReLUs → Try Leaky ReLU, check initialization
- Learning rate too high or too low → LR range test
- Wrong loss function → Verify task matches loss
- Bug in code → Check gradient flow, print intermediate outputs
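For the last two scenarios, printing per-layer gradient norms is a quick diagnostic. A minimal sketch, assuming gradients are stored in a dict mapping layer names to NumPy arrays (that layout is an assumption for illustration):

```python
import numpy as np

def report_gradient_flow(gradients):
    # gradients: dict of layer name -> gradient array
    for name, grad in gradients.items():
        norm = np.linalg.norm(grad)
        if not np.all(np.isfinite(grad)):
            note = "  <- NaN/Inf: exploding gradients or division by zero"
        elif norm < 1e-7:
            note = "  <- near-zero: vanishing gradients or dead units"
        else:
            note = ""
        print(f"{name}: |grad| = {norm:.3e}{note}")
```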
Key Takeaways
- Adam is the default - Works well for most problems
- LR is the most important hyperparameter - Use an LR finder or start with 3e-4 (Karpathy's constant)
- Normalize inputs - Critical for convergence
- He init for ReLU - Prevents vanishing gradients
- Monitor both train and val loss - Early stopping prevents overfitting