ML Fundamentals Questions

Optimization & Training

Gradient Descent Variants

Interview Question: "Explain different gradient descent algorithms and trade-offs."

1. Batch Gradient Descent (BGD)

# Update using entire dataset
for epoch in range(num_epochs):
    gradient = compute_gradient(X_all, y_all, weights)
    weights -= learning_rate * gradient
  • Pros: Stable convergence, exact gradient
  • Cons: Slow for large datasets, memory intensive
  • Use: Small datasets
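
For concreteness, here is what a compute_gradient helper might look like for linear regression with MSE loss (a hypothetical implementation; the snippets in this section just assume something equivalent exists):

import numpy as np

def compute_gradient(X, y, weights):
    # Gradient of MSE for a linear model y_hat = X @ weights; X has shape (n, d)
    y_pred = X @ weights
    error = y_pred - y
    return 2 * X.T @ error / len(y)

For the per-sample SGD loop in the next section, x_i and y_i would need to keep a leading batch dimension (shape (1, d)) for the same helper to work.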

2. Stochastic Gradient Descent (SGD)

# Update after each sample
for epoch in range(num_epochs):
    for x_i, y_i in shuffle(data):
        gradient = compute_gradient(x_i, y_i, weights)
        weights -= learning_rate * gradient
  • Pros: Fast updates, can escape local minima
  • Cons: Noisy updates; won't settle exactly at the minimum unless the learning rate is decayed
  • Use: Online learning, large datasets

3. Mini-Batch SGD

# Update using small batches (e.g., 32, 64, 128)
for epoch in range(num_epochs):
    for X_batch, y_batch in get_batches(X, y, batch_size=32):
        gradient = compute_gradient(X_batch, y_batch, weights)
        weights -= learning_rate * gradient
  • Pros: Best of both worlds, GPU-friendly
  • Cons: Requires batch size tuning
  • Use: Most common in practice
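
The get_batches helper is not defined above; a minimal sketch, assuming NumPy arrays, could look like:

import numpy as np

def get_batches(X, y, batch_size=32, shuffle=True):
    # Yield (X_batch, y_batch) pairs covering the dataset once per epoch
    indices = np.arange(len(X))
    if shuffle:
        np.random.shuffle(indices)  # fresh shuffle each epoch keeps updates decorrelated
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]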

4. Momentum

# Accumulates an exponentially decaying moving average of gradients (velocity)
velocity = 0
for t in range(num_iterations):
    gradient = compute_gradient(...)
    velocity = beta * velocity + gradient
    weights -= learning_rate * velocity
  • Benefit: Accelerates in relevant direction, dampens oscillations
  • Hyperparameter: β typically 0.9

5. Adam (Adaptive Moment Estimation)

m = 0  # First moment estimate (mean of gradients)
v = 0  # Second moment estimate (uncentered variance of gradients)

for t in range(1, num_iterations + 1):  # t starts at 1 for bias correction
    gradient = compute_gradient(...)
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient**2

    m_hat = m / (1 - beta1**t)  # Bias correction (moments are initialized at 0)
    v_hat = v / (1 - beta2**t)

    weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
  • Pros: Adaptive learning rates, works well out-of-the-box
  • Defaults: β1=0.9, β2=0.999, ε=1e-8
  • Use: Default choice for most deep learning
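
For reference, the same update with library defaults in PyTorch might look like this toy sketch (the linear model, random data, and loss are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # toy model
loss_fn = nn.MSELoss()
X, y = torch.randn(256, 10), torch.randn(256, 1)  # placeholder data

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

for step in range(100):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(X), y)      # forward pass
    loss.backward()                  # backprop
    optimizer.step()                 # Adam update with the defaults above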

Comparison:

Method       Speed    Memory   Convergence   Hyperparameters
BGD          Slow     High     Smooth        Few
SGD          Fast     Low      Noisy         Few
Mini-batch   Medium   Medium   Good          Batch size
Momentum     Medium   Medium   Faster        β, LR
Adam         Medium   Medium   Fastest       β1, β2, LR

Learning Rate Strategies

Interview Question: "How do you choose and adjust learning rate?"

1. Fixed Learning Rate

lr = 0.001  # Keep constant
  • Simple but suboptimal

2. Step Decay

lr = initial_lr * (decay_rate ** (epoch // drop_every))
# Example: 0.1 → 0.01 → 0.001

3. Exponential Decay

lr = initial_lr * exp(-decay_rate * epoch)

4. Cosine Annealing

lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(epoch / total_epochs * π))
  • Smooth decay following cosine curve
  • Used in state-of-the-art models

5. Learning Rate Warmup

# Start small, linearly increase to target
if epoch < warmup_epochs:
    lr = target_lr * (epoch + 1) / warmup_epochs
else:
    lr = target_lr
  • Prevents early instability
  • Common in transformer training
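
These strategies are often combined. A minimal sketch of a warmup-then-cosine schedule (the function name get_lr and its defaults are illustrative):

import math

def get_lr(epoch, total_epochs, target_lr=3e-4, min_lr=1e-6, warmup_epochs=5):
    # Linear warmup to target_lr, then cosine decay down to min_lr
    if epoch < warmup_epochs:
        return target_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (target_lr - min_lr) * (1 + math.cos(math.pi * progress))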

Finding Initial LR

LR Range Test:

  1. Start with very small LR (1e-7)
  2. Increase exponentially each batch
  3. Plot loss vs LR
  4. Choose LR where loss decreases fastest

def find_lr(model, data, start_lr=1e-7, end_lr=10, num_iter=100):
    # train_step (assumed helper) runs one training step at the given lr
    # and returns the training loss
    lrs = []
    losses = []
    lr = start_lr

    for i in range(num_iter):
        loss = train_step(model, data, lr)
        lrs.append(lr)
        losses.append(loss)
        lr *= (end_lr / start_lr) ** (1 / num_iter)

    # Plot and find steepest descent
    import matplotlib.pyplot as plt
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.show()

Loss Functions

Interview Question: "What loss function should you use for X task?"

Classification

Binary Cross-Entropy (Logistic Loss):

def binary_crossentropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))
  • Use: Binary classification with sigmoid output

Categorical Cross-Entropy:

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=1).mean()
  • Use: Multi-class classification with softmax output

Focal Loss (for imbalanced data):

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Down-weight easy examples: (1 - p)^gamma for positives, p^gamma for negatives
    pos = -alpha * (1 - y_pred)**gamma * y_true * np.log(y_pred)
    neg = -(1 - alpha) * y_pred**gamma * (1 - y_true) * np.log(1 - y_pred)
    return np.mean(pos + neg)
  • Down-weights easy examples
  • Use: Severe class imbalance (object detection)

Regression

Mean Squared Error (MSE / L2):

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)
  • Sensitive to outliers
  • Penalizes large errors more

Mean Absolute Error (MAE / L1):

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
  • Robust to outliers
  • Equal weight to all errors

Huber Loss (smooth L1):

def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    small_error = 0.5 * error**2
    large_error = delta * (np.abs(error) - 0.5 * delta)
    return np.where(is_small, small_error, large_error).mean()
  • Combines MSE and MAE
  • L2 for small errors, L1 for large

Training Best Practices

1. Data Preprocessing

# Normalization (mean=0, std=1)
X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# Min-max scaling (per feature)
X_train = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
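
A point worth stating explicitly in an interview: compute the statistics on the training split only and reuse them for validation/test. A small sketch with placeholder data:

import numpy as np

X_train, X_test = np.random.randn(100, 5), np.random.randn(20, 5)  # placeholder data

# Fit normalization statistics on the training split only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std   # reuse the train statistics; never refit on test data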

2. Weight Initialization

Xavier/Glorot (for tanh/sigmoid):

W = np.random.randn(n_in, n_out) * np.sqrt(2 / (n_in + n_out))

He Initialization (for ReLU):

W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)

Poor initialization → vanishing/exploding gradients

3. Batch Normalization

  • Normalizes layer inputs
  • Reduces internal covariate shift
  • Allows higher learning rates
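
A minimal sketch of the batch norm forward pass in training mode (ignoring the running statistics used at inference):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then apply a learnable scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta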

4. Gradient Clipping

# Value clipping: prevent exploding gradients by capping each component
gradients = np.clip(gradients, -threshold, threshold)

# Or norm clipping: rescale the whole gradient vector if its norm is too large
if np.linalg.norm(gradients) > threshold:
    gradients = gradients * threshold / np.linalg.norm(gradients)

5. Early Stopping

best_val_loss = float('inf')
patience = 10
patience_counter = 0

for epoch in range(max_epochs):
    train(...)
    val_loss = validate(...)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_model(...)
        patience_counter = 0
    else:
        patience_counter += 1

    if patience_counter >= patience:
        print("Early stopping")
        break

Interview Debugging Scenarios

Q: "Training loss plateaus at high value. What's wrong?" A:

  • Learning rate too low → Increase
  • Bad initialization → Try He/Xavier
  • Vanishing gradient → Check activations, add batch norm
  • Model too simple → Increase capacity

Q: "Training loss decreases but validation loss increases. What's wrong?" A:

  • Overfitting
  • Solutions: Dropout, L2 regularization, more data, early stopping, data augmentation

Q: "Loss is NaN or explodes. What to check?" A:

  • Learning rate too high → Reduce
  • Exploding gradients → Gradient clipping
  • Division by zero → Add epsilon in denominators
  • Input not normalized → Normalize features

Q: "Model doesn't learn (loss stays constant)?" A:

  • Dead ReLUs → Try Leaky ReLU, check initialization
  • Learning rate too high or too low → LR range test
  • Wrong loss function → Verify task matches loss
  • Bug in code → Check gradient flow, print intermediate outputs
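
For the last two points, a per-layer gradient check usually locates the problem quickly. An illustrative PyTorch sketch with a toy model and placeholder data:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))  # toy model
X, y = torch.randn(64, 10), torch.randn(64, 1)                          # placeholder data

loss = nn.MSELoss()(model(X), y)
loss.backward()

for name, param in model.named_parameters():
    # Near-zero norms in early layers point to vanishing gradients or dead ReLUs;
    # very large norms point to exploding gradients
    print(f"{name}: grad norm = {param.grad.norm().item():.3e}")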

Key Takeaways

  1. Adam is default - Works well for most problems
  2. LR is most important hyperparameter - Use LR finder or start with 3e-4 (Karpathy's constant)
  3. Normalize inputs - Critical for convergence
  4. He init for ReLU - Prevents vanishing gradients
  5. Monitor both train and val loss - Early stopping prevents overfitting
