ML Fundamentals Questions

Neural Networks & Deep Learning

5 min read

Core Concepts

1. Activation Functions

Interview Question: "Explain common activation functions and when to use each."

ReLU (Rectified Linear Unit):

f(x) = max(0, x)
  • Pros: Fast, helps with vanishing gradient, sparse activation
  • Cons: Dead neurons (neurons that always output 0)
  • Use: Hidden layers in most architectures

Sigmoid:

f(x) = 1 / (1 + e^(-x))
  • Pros: Smooth gradient, outputs in (0,1)
  • Cons: Vanishing gradient problem, not zero-centered
  • Use: Binary classification output layer

Tanh:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
  • Pros: Zero-centered, stronger gradients than sigmoid
  • Cons: Still vanishing gradient
  • Use: RNNs, sometimes hidden layers

Softmax (multi-class output):

f(x_i) = e^(x_i) / Σ(e^(x_j))
  • Use: Multi-class classification output layer

Implementation:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def softmax(x):
    # Subtract the row-wise max before exponentiating for numerical stability
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
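
Tanh (covered above) can be implemented in the same style; a small sketch using NumPy's built-in tanh:

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    # Derivative of tanh: 1 - tanh(x)^2
    return 1 - np.tanh(x) ** 2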

2. Backpropagation

Interview Question: "Explain backpropagation."

Simple Answer:

  1. Forward pass: Calculate predictions
  2. Loss calculation: Compare with true labels
  3. Backward pass: Calculate gradients using chain rule
  4. Weight update: Adjust weights proportional to gradients

Formula (chain rule):

∂Loss/∂w = ∂Loss/∂output × ∂output/∂activation × ∂activation/∂weighted_sum × ∂weighted_sum/∂w

Conceptual Code:

def backprop_step(x, y_true, weights):
    """
    Single backprop step for one linear layer with ReLU activation and MSE loss

    Returns: gradient of the loss with respect to the weights
    """
    # Forward pass
    z = x @ weights                          # Weighted sum
    a = relu(z)                              # Activation
    loss = np.mean((a - y_true) ** 2)        # MSE loss

    # Backward pass (chain rule)
    d_loss = 2 * (a - y_true) / y_true.size  # dLoss/da
    d_relu = relu_derivative(z)              # da/dz
    d_weights = x.T @ (d_loss * d_relu)      # chain rule through z = x @ weights

    return d_weights

Key Points:

  • Efficiently computes all gradients in one backward pass
  • Chain rule propagates error from output to input
  • Vanishing/exploding gradients can occur in deep networks
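
A common way to sanity-check a backprop implementation like backprop_step above is a finite-difference gradient check; a minimal sketch (the helper name and tolerance are illustrative, reusing the ReLU/MSE setup from the example):

def gradient_check(x, y_true, weights, analytic_grad, eps=1e-5):
    """Compare analytic gradients with numerical (finite-difference) gradients."""
    numeric_grad = np.zeros_like(weights)
    for idx in np.ndindex(*weights.shape):
        w_plus, w_minus = weights.copy(), weights.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        loss_plus = np.mean((relu(x @ w_plus) - y_true) ** 2)
        loss_minus = np.mean((relu(x @ w_minus) - y_true) ** 2)
        numeric_grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    # Relative error should be tiny (e.g. < 1e-5) if the analytic gradients are correct
    denom = np.abs(numeric_grad) + np.abs(analytic_grad) + 1e-12
    return np.max(np.abs(numeric_grad - analytic_grad) / denom)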

3. Regularization Techniques

Interview Question: "How do you prevent overfitting in neural networks?"

Dropout:

  • Randomly set neurons to 0 during training (e.g., 50%)
  • Forces network to learn robust features
  • Applied only during training, not at inference

def dropout(x, keep_prob=0.5, training=True):
    # Inverted dropout: surviving units are scaled by 1/keep_prob,
    # so no rescaling is needed at inference time
    if not training:
        return x
    mask = np.random.binomial(1, keep_prob, size=x.shape) / keep_prob
    return x * mask

Batch Normalization:

  • Normalize layer inputs to zero mean, unit variance
  • Reduces internal covariate shift
  • Allows higher learning rates
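
A minimal sketch of the training-time computation (gamma and beta are the learnable scale and shift; running statistics for inference are omitted):

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization for a batch of shape (N, features), training mode."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift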

L2 Regularization (Weight Decay):

  • Add penalty: Loss = MSE + λ * Σ(w²)
  • Encourages smaller weights
  • Prevents overfitting
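
A minimal sketch of the penalty term and its gradient contribution (lam is the regularization strength λ):

def l2_penalty(weights, lam=1e-3):
    return lam * np.sum(weights ** 2)   # added to the data loss

def l2_penalty_gradient(weights, lam=1e-3):
    return 2 * lam * weights            # added to the data-loss gradient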

Early Stopping:

  • Monitor validation loss
  • Stop training when validation loss stops improving
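
A minimal patience-based sketch (train_one_epoch, validation_loss, and model.get_weights are placeholder hooks you would supply):

def train_with_early_stopping(model, patience=5, max_epochs=100):
    best_loss, best_weights, stale_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)             # placeholder: one pass over the training data
        val_loss = validation_loss(model)  # placeholder: loss on a held-out validation set
        if val_loss < best_loss:
            best_loss, best_weights = val_loss, model.get_weights()  # placeholder API
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                      # validation loss stopped improving
    return best_weights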

4. Convolutional Neural Networks (CNNs)

Interview Question: "How do CNNs work and why for images?"

Key Components:

  1. Convolution Layer: Applies filters to detect patterns
  2. Pooling Layer: Downsamples (max/average pooling)
  3. Fully Connected Layer: Final classification

Why CNNs for Images:

  • Parameter sharing: Same filter across image (fewer parameters)
  • Translation invariance: Detects features anywhere in image
  • Hierarchical features: Early layers → edges; deep layers → objects

Convolution Example:

def conv2d_single_filter(image, filter):
    """
    Simple 2D convolution

    image: (H, W)
    filter: (FH, FW)
    output: (H-FH+1, W-FW+1)
    """
    FH, FW = filter.shape
    H, W = image.shape
    output = np.zeros((H - FH + 1, W - FW + 1))

    for i in range(H - FH + 1):
        for j in range(W - FW + 1):
            region = image[i:i+FH, j:j+FW]
            output[i, j] = np.sum(region * filter)

    return output
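
Pooling (component 2 above) can be sketched in the same style; a minimal 2x2 max pooling with stride 2:

def max_pool2d(image, size=2, stride=2):
    """Max pooling over a single-channel image of shape (H, W)."""
    H, W = image.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride+size, j*stride:j*stride+size]
            output[i, j] = np.max(region)
    return output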

Common Architectures:

  • VGG: Stacking 3x3 convolutions
  • ResNet: Skip connections solve vanishing gradient
  • MobileNet: Depthwise separable convolutions for efficiency

5. Recurrent Neural Networks (RNNs)

Interview Question: "Explain RNNs and their limitations."

How RNNs Work:

  • Process sequences by maintaining hidden state
  • Hidden state updated at each time step
  • Formula: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
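
That update as code; a minimal sketch (weight matrices are assumed to have compatible shapes):

def rnn_cell(x_t, h_prev, W_h, W_x, b):
    """Vanilla RNN step: new hidden state from the previous state and current input."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)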

Limitations:

  1. Vanishing gradient: Gradients diminish over long sequences
  2. Exploding gradient: Gradients grow exponentially
  3. Long-term dependencies: Struggles to remember distant information

Solution: LSTM (Long Short-Term Memory):

  • Gates control information flow (forget, input, output)
  • Cell state carries long-term information
  • Better at long sequences

Conceptual LSTM:

def lstm_cell(x_t, h_prev, c_prev, weights):
    """
    Simplified LSTM cell (biases omitted)

    h_prev: previous hidden state
    c_prev: previous cell state
    """
    # Concatenate previous hidden state and current input
    hx = np.concatenate([h_prev, x_t])

    # Forget gate: what to discard from the cell state
    f_t = sigmoid(weights['Wf'] @ hx)

    # Input gate: what new information to add
    i_t = sigmoid(weights['Wi'] @ hx)
    c_tilde = np.tanh(weights['Wc'] @ hx)

    # Update cell state
    c_t = f_t * c_prev + i_t * c_tilde

    # Output gate: what to expose as the hidden state
    o_t = sigmoid(weights['Wo'] @ hx)
    h_t = o_t * np.tanh(c_t)

    return h_t, c_t

6. Transformers

Interview Question: "Why do transformers work better than RNNs for NLP?"

Key Innovation: Attention Mechanism

  • Allows model to focus on relevant parts of input
  • Parallel processing (vs sequential in RNNs)
  • Better long-range dependencies

Self-Attention Formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Q = Query (what each position is looking for)
K = Key (what each position offers to be matched against)
V = Value (the information that is actually passed along)
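
A minimal NumPy sketch of scaled dot-product self-attention (single head, no masking), following the formula above and reusing the softmax defined earlier:

def scaled_dot_product_attention(Q, K, V):
    """
    Q, K: (seq_len, d_k), V: (seq_len, d_v)
    Returns: (seq_len, d_v) attended output
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how well each query matches each key
    weights = softmax(scores)         # row-wise attention weights
    return weights @ V                # weighted sum of values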

Why Better Than RNNs:

  1. Parallelization: Process entire sequence at once
  2. Long-range dependencies: Direct connections via attention
  3. No vanishing gradient: Shorter paths for gradients

Popular Models:

  • BERT: Bidirectional, masked language modeling, pre-trained
  • GPT: Autoregressive, next-token prediction
  • T5: Text-to-text framework

When to Use:

  • Image Classification: CNN (ResNet, EfficientNet)
  • Object Detection: CNN (YOLO, Faster R-CNN)
  • Text Classification: Transformer (BERT)
  • Machine Translation: Transformer (T5, mBART)
  • Time Series (short): LSTM, GRU
  • Time Series (long): Transformer

Common Interview Questions

Q: "What's the vanishing gradient problem?" A: In deep networks with sigmoid/tanh, gradients become very small during backprop, preventing early layers from learning. Solutions: ReLU, ResNet skip connections, better initialization.

Q: "Batch size trade-offs?" A:

  • Large batch: faster training (better GPU utilization), more stable gradient estimates, often worse generalization
  • Small batch: better generalization, noisier gradients, slower training

Q: "How to handle class imbalance in deep learning?" A:

  • Weighted loss function
  • Oversampling minority / Undersampling majority
  • Data augmentation for minority class
  • Focal loss (from RetinaNet paper)
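
For the weighted-loss option above, a minimal sketch of binary cross-entropy with a heavier weight on the minority (positive) class (the pos_weight value is illustrative):

def weighted_bce(y_true, y_pred, pos_weight=5.0, eps=1e-7):
    """Binary cross-entropy that penalizes errors on the positive class more heavily."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    loss = -(pos_weight * y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()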

Q: "Transfer learning - when and how?" A:

  • When: Limited data for your task, related domain exists
  • How: Use pre-trained model (ImageNet, BERT), freeze early layers, fine-tune later layers
  • Similar task → Fine-tune fewer layers
  • Different task → Fine-tune more layers
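
A minimal PyTorch-style sketch of the freeze-then-fine-tune recipe, assuming torchvision's pre-trained ResNet-18 (num_classes is illustrative):

import torch.nn as nn
from torchvision import models

num_classes = 10                                  # your task's number of classes
model = models.resnet18(weights="IMAGENET1K_V1")  # ImageNet pre-trained backbone

for param in model.parameters():
    param.requires_grad = False                   # freeze all pre-trained layers

# Replace the classification head; only its parameters remain trainable
model.fc = nn.Linear(model.fc.in_features, num_classes)
# For a less similar task, also unfreeze later blocks (e.g. model.layer4) before fine-tuning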

Key Takeaways

  1. ReLU is default - Use for hidden layers unless you have a specific reason not to
  2. Regularization is crucial - Dropout, batch norm, L2 for preventing overfitting
  3. Architecture choice matters - CNN for images, Transformers for text/sequences
  4. Know the trade-offs - Be able to explain when to use each technique
  5. Transformers dominate NLP - Understand attention mechanism

What's Next?

Next lesson covers optimization techniques: gradient descent variants, learning rate schedules, and training best practices.
