ML Fundamentals Questions
Neural Networks & Deep Learning
Core Concepts
1. Activation Functions
Interview Question: "Explain common activation functions and when to use each."
ReLU (Rectified Linear Unit):
f(x) = max(0, x)
- Pros: Fast, helps with vanishing gradient, sparse activation
- Cons: Dead neurons (neurons that always output 0)
- Use: Hidden layers in most architectures
Sigmoid:
f(x) = 1 / (1 + e^(-x))
- Pros: Smooth gradient, outputs in (0,1)
- Cons: Vanishing gradient problem, not zero-centered
- Use: Binary classification output layer
Tanh:
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- Pros: Zero-centered, stronger gradients than sigmoid
- Cons: Still vanishing gradient
- Use: RNNs, sometimes hidden layers
Softmax (multi-class output):
f(x_i) = e^(x_i) / Σ(e^(x_j))
- Use: Multi-class classification output layer
Implementation:
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def softmax(x):
    # Subtract the row-wise max for numerical stability
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
```
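Tanh (listed above but not in the snippet) follows the same pattern; a minimal sketch using NumPy's built-in np.tanh:
```python
def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    t = np.tanh(x)
    return 1 - t ** 2
```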
2. Backpropagation
Interview Question: "Explain backpropagation."
Simple Answer:
- Forward pass: Calculate predictions
- Loss calculation: Compare with true labels
- Backward pass: Calculate gradients using chain rule
- Weight update: Adjust weights proportional to gradients
Formula (chain rule):
∂Loss/∂w = ∂Loss/∂a × ∂a/∂z × ∂z/∂w
where z = weighted sum of inputs (z = w·x) and a = activation(z)
Conceptual Code:
```python
def backprop_step(x, y_true, weights):
    """
    Single forward/backward pass for one ReLU layer with MSE loss.
    Returns: loss and the gradient of the loss w.r.t. the weights.
    """
    # Forward pass
    z = x @ weights                     # Weighted sum
    a = relu(z)                         # Activation
    loss = np.mean((a - y_true) ** 2)   # MSE loss

    # Backward pass (chain rule)
    d_loss = 2 * (a - y_true) / y_true.size  # ∂Loss/∂a for MSE
    d_relu = relu_derivative(z)              # ∂a/∂z
    d_weights = x.T @ (d_loss * d_relu)      # ∂z/∂w = x, summed over the batch
    return loss, d_weights
```
Key Points:
- Efficiently computes all gradients in one backward pass
- Chain rule propagates error from output to input
- Vanishing/exploding gradients can occur in deep networks
3. Regularization Techniques
Interview Question: "How do you prevent overfitting in neural networks?"
Dropout:
- Randomly set neurons to 0 during training (e.g., 50%)
- Forces network to learn robust features
- Only during training, not inference
```python
def dropout(x, keep_prob=0.5, training=True):
    if not training:
        return x
    # Inverted dropout: scale kept activations by 1/keep_prob so no rescaling is needed at inference
    mask = np.random.binomial(1, keep_prob, size=x.shape) / keep_prob
    return x * mask
```
Batch Normalization:
- Normalize layer inputs to zero mean, unit variance
- Reduces internal covariate shift
- Allows higher learning rates
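Conceptual sketch of the batch-norm forward pass (training mode only; the learnable scale/shift gamma and beta and the epsilon for numerical stability are illustrative additions):
```python
def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features) -- normalize each feature over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable scale and shift restore representational power
    return gamma * x_hat + beta
```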
L2 Regularization (Weight Decay):
- Add penalty: Loss = MSE + λ * Σ(w²)
- Encourages smaller weights
- Prevents overfitting
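In code, L2 regularization adds the penalty to the loss and a matching term to the gradient; a minimal sketch (lambda_ is the regularization strength, an illustrative value):
```python
def l2_loss(base_loss, weights, lambda_=1e-4):
    return base_loss + lambda_ * np.sum(weights ** 2)

def l2_gradient(weights, lambda_=1e-4):
    # Added to the data gradient before the weight update
    return 2 * lambda_ * weights
```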
Early Stopping:
- Monitor validation loss
- Stop training when validation loss stops improving
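A minimal early-stopping loop (train_one_epoch and evaluate are placeholder functions, not defined in this lesson):
```python
def train_with_early_stopping(model, patience=5, max_epochs=100):
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)      # placeholder: one pass over the training data
        val_loss = evaluate(model)  # placeholder: loss on the validation set
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving
    return model
```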
4. Convolutional Neural Networks (CNNs)
Interview Question: "How do CNNs work and why for images?"
Key Components:
- Convolution Layer: Applies filters to detect patterns
- Pooling Layer: Downsamples (max/average pooling)
- Fully Connected Layer: Final classification
Why CNNs for Images:
- Parameter sharing: Same filter across image (fewer parameters)
- Translation equivariance: The same filter detects a feature wherever it appears in the image (pooling adds a degree of translation invariance)
- Hierarchical features: Early layers → edges; deep layers → objects
Convolution Example:
def conv2d_single_filter(image, filter):
"""
Simple 2D convolution
image: (H, W)
filter: (FH, FW)
output: (H-FH+1, W-FW+1)
"""
FH, FW = filter.shape
H, W = image.shape
output = np.zeros((H - FH + 1, W - FW + 1))
for i in range(H - FH + 1):
for j in range(W - FW + 1):
region = image[i:i+FH, j:j+FW]
output[i, j] = np.sum(region * filter)
return output
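Pooling (the downsampling component listed above) can be sketched the same way; here, 2x2 max pooling:
```python
def max_pool2d(feature_map, size=2, stride=2):
    H, W = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            output[i, j] = np.max(region)
    return output
```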
Common Architectures:
- VGG: Stacking 3x3 convolutions
- ResNet: Skip connections mitigate vanishing gradients (see the sketch after this list)
- MobileNet: Depthwise separable convolutions for efficiency
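The skip-connection idea behind ResNet can be sketched in a couple of lines (layer stands in for any shape-preserving conv/linear transform; this is a conceptual sketch, not the full ResNet block):
```python
def residual_block(x, layer):
    # The block learns a residual that is added back to its input;
    # gradients flow directly through the "+ x" shortcut path.
    return relu(layer(x) + x)
```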
5. Recurrent Neural Networks (RNNs)
Interview Question: "Explain RNNs and their limitations."
How RNNs Work:
- Process sequences by maintaining hidden state
- Hidden state updated at each time step
- Formula: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
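That update as a minimal code sketch (assumed shapes: W_h is (hidden, hidden), W_x is (hidden, input), b is (hidden,)):
```python
def rnn_cell(x_t, h_prev, W_h, W_x, b):
    # One time step: the new hidden state mixes the previous state and the current input
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)
```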
Limitations:
- Vanishing gradient: Gradients diminish over long sequences
- Exploding gradient: Gradients grow exponentially
- Long-term dependencies: Struggles to remember distant information
Solution: LSTM (Long Short-Term Memory):
- Gates control information flow (forget, input, output)
- Cell state carries long-term information
- Better at long sequences
Conceptual LSTM:
```python
def lstm_cell(x_t, h_prev, c_prev, weights):
    """
    Simplified LSTM cell (biases omitted).
    h_prev: previous hidden state
    c_prev: previous cell state
    weights: dict of gate weight matrices, each (hidden, hidden + input)
    """
    concat = np.concatenate([h_prev, x_t])
    # Forget gate: what to discard from the cell state
    f_t = sigmoid(weights['Wf'] @ concat)
    # Input gate: what new information to add
    i_t = sigmoid(weights['Wi'] @ concat)
    c_tilde = np.tanh(weights['Wc'] @ concat)
    # Update cell state
    c_t = f_t * c_prev + i_t * c_tilde
    # Output gate: what to expose as the new hidden state
    o_t = sigmoid(weights['Wo'] @ concat)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```
6. Transformers
Interview Question: "Why do transformers work better than RNNs for NLP?"
Key Innovation: Attention Mechanism
- Allows model to focus on relevant parts of input
- Parallel processing (vs sequential in RNNs)
- Better long-range dependencies
Self-Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Q = Query (what I'm looking for)
K = Key (what I have)
V = Value (what I return)
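A minimal NumPy sketch of scaled dot-product self-attention (single head, no masking; Q, K, V are (seq_len, d_k) matrices):
```python
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights
    attn_weights = softmax(scores)
    # Each output is an attention-weighted sum of the values
    return attn_weights @ V
```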
Why Better Than RNNs:
- Parallelization: Process entire sequence at once
- Long-range dependencies: Direct connections via attention
- No vanishing gradient: Shorter paths for gradients
Popular Models:
- BERT: Bidirectional, masked language modeling, pre-trained
- GPT: Autoregressive, next-token prediction
- T5: Text-to-text framework
When to Use:
| Task | Best Architecture |
|---|---|
| Image Classification | CNN (ResNet, EfficientNet) |
| Object Detection | CNN (YOLO, Faster R-CNN) |
| Text Classification | Transformer (BERT) |
| Machine Translation | Transformer (T5, mBART) |
| Time Series (short) | LSTM, GRU |
| Time Series (long) | Transformer |
Common Interview Questions
Q: "What's the vanishing gradient problem?" A: In deep networks with sigmoid/tanh, gradients become very small during backprop, preventing early layers from learning. Solutions: ReLU, ResNet skip connections, better initialization.
Q: "Batch size trade-offs?" A:
- Large batch: Faster training (GPU), more stable gradients, worse generalization
- Small batch: Better generalization, more noise in gradients, slower
Q: "How to handle class imbalance in deep learning?" A:
- Weighted loss function (see the sketch after this list)
- Oversampling minority / Undersampling majority
- Data augmentation for minority class
- Focal loss (from RetinaNet paper)
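A minimal sketch of a class-weighted binary cross-entropy (w_pos/w_neg are illustrative; in practice set them roughly inversely proportional to class frequency):
```python
def weighted_binary_cross_entropy(y_true, y_pred, w_pos=10.0, w_neg=1.0, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Up-weight errors on the rare positive class
    loss = -(w_pos * y_true * np.log(y_pred) + w_neg * (1 - y_true) * np.log(1 - y_pred))
    return np.mean(loss)
```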
Q: "Transfer learning - when and how?" A:
- When: Limited data for your task, related domain exists
- How: Use pre-trained model (ImageNet, BERT), freeze early layers, fine-tune later layers (see the sketch after this list)
- Similar task → Fine-tune fewer layers
- Different task → Fine-tune more layers
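A minimal sketch of the freeze/fine-tune recipe, assuming PyTorch and torchvision are available (the model choice, layer names, and num_classes are illustrative):
```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder for your task

model = models.resnet18(weights="IMAGENET1K_V1")  # ImageNet pre-trained backbone
for param in model.parameters():
    param.requires_grad = False                   # freeze all layers
for param in model.layer4.parameters():
    param.requires_grad = True                    # fine-tune the last block
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trained from scratch
```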
Key Takeaways
- ReLU is default - Use for hidden layers unless you have a specific reason not to
- Regularization is crucial - Dropout, batch norm, L2 for preventing overfitting
- Architecture choice matters - CNN for images, Transformers for text/sequences
- Know the trade-offs - Be able to explain when to use each technique
- Transformers dominate NLP - Understand attention mechanism
What's Next?
Next lesson covers optimization techniques: gradient descent variants, learning rate schedules, and training best practices.
:::