ML Fundamentals Questions
Neural Networks & Deep Learning
Core Concepts
1. Activation Functions
Interview Question: "Explain common activation functions and when to use each."
ReLU (Rectified Linear Unit):
f(x) = max(0, x)
- Pros: Fast, helps with vanishing gradient, sparse activation
- Cons: Dead neurons (neurons that always output 0)
- Use: Hidden layers in most architectures
Sigmoid:
f(x) = 1 / (1 + e^(-x))
- Pros: Smooth gradient, outputs in (0,1)
- Cons: Vanishing gradient problem, not zero-centered
- Use: Binary classification output layer
Tanh:
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- Pros: Zero-centered, stronger gradients than sigmoid
- Cons: Still vanishing gradient
- Use: RNNs, sometimes hidden layers
Softmax (multi-class output):
f(x_i) = e^(x_i) / Σ(e^(x_j))
- Use: Multi-class classification output layer
Implementation:
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def softmax(x):
    # Subtract the row-wise max for numerical stability
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
```
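Tanh (listed above but not in the snippet) follows the same pattern; a minimal sketch using NumPy's built-in np.tanh:
```python
def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    t = np.tanh(x)
    return 1 - t ** 2
```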
2. Backpropagation
Interview Question: "Explain backpropagation."
Simple Answer:
- Forward pass: Calculate predictions
- Loss calculation: Compare with true labels
- Backward pass: Calculate gradients using chain rule
- Weight update: Adjust weights proportional to gradients
Formula (chain rule):
∂Loss/∂w = ∂Loss/∂a × ∂a/∂z × ∂z/∂w
where z = weighted sum of inputs (z = w·x) and a = activation(z)
Conceptual Code:
```python
def backprop_step(x, y_true, weights):
    """
    Single forward/backward pass for one ReLU layer with MSE loss.
    Returns: loss and the gradient of the loss w.r.t. the weights.
    """
    # Forward pass
    z = x @ weights                     # Weighted sum
    a = relu(z)                         # Activation
    loss = np.mean((a - y_true) ** 2)   # MSE loss

    # Backward pass (chain rule)
    d_loss = 2 * (a - y_true) / y_true.size  # ∂Loss/∂a for MSE
    d_relu = relu_derivative(z)              # ∂a/∂z
    d_weights = x.T @ (d_loss * d_relu)      # ∂z/∂w = x, summed over the batch
    return loss, d_weights
```
Key Points:
- Efficiently computes all gradients in one backward pass
- Chain rule propagates error from output to input
- Vanishing/exploding gradients can occur in deep networks
3. Regularization Techniques
Interview Question: "How do you prevent overfitting in neural networks?"
Dropout:
- Randomly set neurons to 0 during training (e.g., 50%)
- Forces network to learn robust features
- Only during training, not inference
```python
def dropout(x, keep_prob=0.5, training=True):
    if not training:
        return x
    # Inverted dropout: scale kept activations by 1/keep_prob so no rescaling is needed at inference
    mask = np.random.binomial(1, keep_prob, size=x.shape) / keep_prob
    return x * mask
```
Batch Normalization:
- Normalize layer inputs to zero mean, unit variance
- Reduces internal covariate shift
- Allows higher learning rates
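Conceptual sketch of the batch-norm forward pass (training mode only; the learnable scale/shift gamma and beta and the epsilon for numerical stability are illustrative additions):
```python
def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features) -- normalize each feature over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable scale and shift restore representational power
    return gamma * x_hat + beta
```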
L2 Regularization (Weight Decay):
- Add penalty: Loss = MSE + λ * Σ(w²)
- Encourages smaller weights
- Prevents overfitting
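In code, L2 regularization adds the penalty to the loss and a matching term to the gradient; a minimal sketch (lambda_ is the regularization strength, an illustrative value):
```python
def l2_loss(base_loss, weights, lambda_=1e-4):
    return base_loss + lambda_ * np.sum(weights ** 2)

def l2_gradient(weights, lambda_=1e-4):
    # Added to the data gradient before the weight update
    return 2 * lambda_ * weights
```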
Early Stopping:
- Monitor validation loss
- Stop training when validation loss stops improving
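A minimal early-stopping loop (train_one_epoch and evaluate are placeholder functions, not defined in this lesson):
```python
def train_with_early_stopping(model, patience=5, max_epochs=100):
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)      # placeholder: one pass over the training data
        val_loss = evaluate(model)  # placeholder: loss on the validation set
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving
    return model
```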
4. Convolutional Neural Networks (CNNs)
Interview Question: "How do CNNs work and why for images?"
Key Components:
- Convolution Layer: Applies filters to detect patterns
- Pooling Layer: Downsamples (max/average pooling)
- Fully Connected Layer: Final classification
Why CNNs for Images:
- Parameter sharing: Same filter across image (fewer parameters)
- Translation equivariance: The same filter detects a feature wherever it appears in the image (pooling adds a degree of translation invariance)
- Hierarchical features: Early layers → edges; deep layers → objects
Convolution Example:
def conv2d_single_filter(image, filter):
"""
Simple 2D convolution
image: (H, W)
filter: (FH, FW)
output: (H-FH+1, W-FW+1)
"""
FH, FW = filter.shape
H, W = image.shape
output = np.zeros((H - FH + 1, W - FW + 1))
for i in range(H - FH + 1):
for j in range(W - FW + 1):
region = image[i:i+FH, j:j+FW]
output[i, j] = np.sum(region * filter)
return output
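Pooling (the downsampling component listed above) can be sketched the same way; here, 2x2 max pooling:
```python
def max_pool2d(feature_map, size=2, stride=2):
    H, W = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            output[i, j] = np.max(region)
    return output
```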
Common Architectures:
- VGG: Stacking 3x3 convolutions
- ResNet: Skip connections mitigate vanishing gradients (see the sketch after this list)
- MobileNet: Depthwise separable convolutions for efficiency
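The skip-connection idea behind ResNet can be sketched in a couple of lines (layer stands in for any shape-preserving conv/linear transform; this is a conceptual sketch, not the full ResNet block):
```python
def residual_block(x, layer):
    # The block learns a residual that is added back to its input;
    # gradients flow directly through the "+ x" shortcut path.
    return relu(layer(x) + x)
```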
5. Recurrent Neural Networks (RNNs)
Interview Question: "Explain RNNs and their limitations."
How RNNs Work:
- Process sequences by maintaining hidden state
- Hidden state updated at each time step
- Formula: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
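That update as a minimal code sketch (assumed shapes: W_h is (hidden, hidden), W_x is (hidden, input), b is (hidden,)):
```python
def rnn_cell(x_t, h_prev, W_h, W_x, b):
    # One time step: the new hidden state mixes the previous state and the current input
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)
```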
Limitations:
- Vanishing gradient: Gradients diminish over long sequences
- Exploding gradient: Gradients grow exponentially
- Long-term dependencies: Struggles to remember distant information
Solution: LSTM (Long Short-Term Memory):
- Gates control information flow (forget, input, output)
- Cell state carries long-term information
- Better at long sequences
Conceptual LSTM:
```python
def lstm_cell(x_t, h_prev, c_prev, weights):
    """
    Simplified LSTM cell (biases omitted).
    h_prev: previous hidden state
    c_prev: previous cell state
    weights: dict of gate weight matrices, each (hidden, hidden + input)
    """
    concat = np.concatenate([h_prev, x_t])
    # Forget gate: what to discard from the cell state
    f_t = sigmoid(weights['Wf'] @ concat)
    # Input gate: what new information to add
    i_t = sigmoid(weights['Wi'] @ concat)
    c_tilde = np.tanh(weights['Wc'] @ concat)
    # Update cell state
    c_t = f_t * c_prev + i_t * c_tilde
    # Output gate: what to expose as the new hidden state
    o_t = sigmoid(weights['Wo'] @ concat)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```
6. Transformers
Interview Question: "Why do transformers work better than RNNs for NLP?"
Key Innovation: Attention Mechanism
- Allows model to focus on relevant parts of input
- Parallel processing (vs sequential in RNNs)
- Better long-range dependencies
Self-Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Q = Query (what I'm looking for)
K = Key (what I have)
V = Value (what I return)
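A minimal NumPy sketch of scaled dot-product self-attention (single head, no masking; Q, K, V are (seq_len, d_k) matrices):
```python
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights
    attn_weights = softmax(scores)
    # Each output is an attention-weighted sum of the values
    return attn_weights @ V
```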
Why Better Than RNNs:
- Parallelization: Process entire sequence at once
- Long-range dependencies: Direct connections via attention
- No vanishing gradient: Shorter paths for gradients
Popular Models:
- BERT: Bidirectional, masked language modeling, pre-trained
- GPT: Autoregressive, next-token prediction
- T5: Text-to-text framework
When to Use:
| Task | Best Architecture |
|---|---|
| Image Classification | CNN (ResNet, EfficientNet) |
| Object Detection | CNN (YOLO, Faster R-CNN) |
| Text Classification | Transformer (BERT) |
| Machine Translation | Transformer (T5, mBART) |
| Time Series (short) | LSTM, GRU |
| Time Series (long) | Transformer |
Common Interview Questions
Q: "What's the vanishing gradient problem?" A: In deep networks with sigmoid/tanh, gradients become very small during backprop, preventing early layers from learning. Solutions: ReLU, ResNet skip connections, better initialization.
Q: "Batch size trade-offs?" A:
- Large batch: Faster training (GPU), more stable gradients, worse generalization
- Small batch: Better generalization, more noise in gradients, slower
Q: "How to handle class imbalance in deep learning?" A:
- Weighted loss function (see the sketch after this list)
- Oversampling minority / Undersampling majority
- Data augmentation for minority class
- Focal loss (from RetinaNet paper)
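A minimal sketch of a class-weighted binary cross-entropy (w_pos/w_neg are illustrative; in practice set them roughly inversely proportional to class frequency):
```python
def weighted_binary_cross_entropy(y_true, y_pred, w_pos=10.0, w_neg=1.0, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Up-weight errors on the rare positive class
    loss = -(w_pos * y_true * np.log(y_pred) + w_neg * (1 - y_true) * np.log(1 - y_pred))
    return np.mean(loss)
```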
Q: "Transfer learning - when and how?" A:
- When: Limited data for your task, related domain exists
- How: Use pre-trained model (ImageNet, BERT), freeze early layers, fine-tune later layers (see the sketch after this list)
- Similar task → Fine-tune fewer layers
- Different task → Fine-tune more layers
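A minimal sketch of the freeze/fine-tune recipe, assuming PyTorch and torchvision are available (the model choice, layer names, and num_classes are illustrative):
```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder for your task

model = models.resnet18(weights="IMAGENET1K_V1")  # ImageNet pre-trained backbone
for param in model.parameters():
    param.requires_grad = False                   # freeze all layers
for param in model.layer4.parameters():
    param.requires_grad = True                    # fine-tune the last block
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trained from scratch
```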
Key Takeaways
- ReLU is default - Use for hidden layers unless you have a specific reason not to
- Regularization is crucial - Dropout, batch norm, L2 for preventing overfitting
- Architecture choice matters - CNN for images, Transformers for text/sequences
- Know the trade-offs - Be able to explain when to use each technique
- Transformers dominate NLP - Understand attention mechanism
What's Next?
Next lesson covers optimization techniques: gradient descent variants, learning rate schedules, and training best practices.
:::