Neural Networks From Scratch: A Developer's Guide

March 28, 2026

TL;DR

  • This guide walks you through building a complete, working neural network from scratch in Python using only NumPy.
  • You'll implement backpropagation, understand how gradients flow, and see every line of the training loop.
  • We include working code for the Adam optimizer, L2 regularization, dropout, and mini-batch training.
  • You'll learn to evaluate models with precision, recall, F1-score, and ROC AUC — not just accuracy.
  • Understanding these internals makes you far more effective when you move to frameworks like PyTorch or TensorFlow.

What You'll Learn

  1. Implement a single neuron and a multi-layer neural network from scratch in Python.
  2. Understand the math behind backpropagation and implement it step by step.
  3. Implement the Adam optimizer and compare it with vanilla gradient descent.
  4. Apply L2 regularization and dropout to prevent overfitting.
  5. Implement mini-batch gradient descent for efficient training.
  6. Evaluate model performance with precision, recall, F1-score, and ROC AUC.
  7. Trace gradient values through the network to debug vanishing/exploding gradients.

Prerequisites

Before diving in, you should be comfortable with:

  • Python Programming: Basic syntax, data structures (lists, dictionaries), functions, and classes.
  • NumPy: Array operations, broadcasting, and matrix multiplication (np.dot).
  • Linear Algebra: Vectors, matrices, dot products, and transpose operations.
  • Calculus: Derivatives, the chain rule, and the concept of a gradient.

Neural networks power image recognition, natural language processing, code generation, and more. Libraries like PyTorch and TensorFlow make building them straightforward, but they also hide the mechanics. When your model doesn't converge, when gradients explode, or when a custom layer behaves unexpectedly — that's when understanding the internals matters.

Building from scratch isn't just an academic exercise. It forces you to confront every detail: how weight initialization affects convergence, why certain activation functions cause vanishing gradients, and how optimizers like Adam actually work under the hood.

This guide gives you a fully working implementation you can run, modify, and learn from.

Building the Foundation: A Single Neuron

The fundamental building block of a neural network is the neuron. A neuron receives inputs, applies weights to those inputs, adds a bias, and passes the result through an activation function.

Mathematically:

output = activation_function(Σ(weight_i * input_i) + bias)

Let's implement this in Python:

import numpy as np

def sigmoid(x):
    """The sigmoid activation function."""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(output):
    """Derivative of sigmoid, given the sigmoid output."""
    return output * (1 - output)

def neuron(inputs, weights, bias):
    """Calculates the output of a single neuron."""
    weighted_sum = np.dot(inputs, weights) + bias
    return sigmoid(weighted_sum)

# Example
inputs = np.array([0.5, 0.2, 0.9])
weights = np.array([0.1, 0.3, 0.2])
bias = 0.1

output = neuron(inputs, weights, bias)
print(f"Neuron output: {output:.4f}")  # 0.5963

Note the np.clip in sigmoid — this prevents numerical overflow with very large or very small values, which is a real issue you'll hit in practice.

Activation Functions: Choosing the Right One

Sigmoid isn't the only option. Different activation functions suit different problems:

def relu(x):
    """ReLU — most common for hidden layers."""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU."""
    return (x > 0).astype(float)

def softmax(x):
    """Softmax — used for multi-class output layers."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

When to use what:

  • ReLU for hidden layers — fast to compute, avoids the vanishing gradient problem that plagues sigmoid/tanh in deep networks.
  • Sigmoid for binary classification output layers (outputs a probability between 0 and 1).
  • Softmax for multi-class classification output layers (outputs sum to 1).
  • Tanh is sometimes used in hidden layers but has largely been replaced by ReLU variants.

The vanishing gradient problem is why sigmoid fell out of favor for hidden layers: when inputs are very large or very small, sigmoid's derivative approaches zero, which kills gradient flow in deep networks.
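
You can verify the saturation directly. The snippet below (repeating the sigmoid helpers from earlier so it runs standalone) shows the derivative collapsing as inputs grow: at x = 0 it is 0.25, its maximum; by x = 10 it has already shrunk to roughly 4.5e-5.

```python
import numpy as np

def sigmoid(x):
    """The sigmoid activation function."""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(output):
    """Derivative of sigmoid, given the sigmoid output."""
    return output * (1 - output)

# The derivative shrinks rapidly as |x| grows, starving deep layers of gradient
for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(np.array(x))
    print(f"x = {x:5.1f} | sigmoid = {s:.6f} | derivative = {sigmoid_derivative(s):.6f}")
```

In a deep network these near-zero derivatives get multiplied together layer after layer, which is the mechanism behind vanishing gradients.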

A Complete Multi-Layer Neural Network

Now let's build a full neural network class with forward pass, backpropagation, and training:

import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes, learning_rate=0.01, l2_lambda=0.0, dropout_rate=0.0):
        """
        layer_sizes: list of ints, e.g. [784, 128, 64, 10]
        learning_rate: step size for weight updates
        l2_lambda: L2 regularization strength (0 = no regularization)
        dropout_rate: fraction of neurons to drop during training (0 = no dropout)
        """
        self.layer_sizes = layer_sizes
        self.learning_rate = learning_rate
        self.l2_lambda = l2_lambda
        self.dropout_rate = dropout_rate
        self.num_layers = len(layer_sizes) - 1

        # He initialization — better than random for ReLU networks
        self.weights = []
        self.biases = []
        for i in range(self.num_layers):
            scale = np.sqrt(2.0 / layer_sizes[i])  # He init
            w = np.random.randn(layer_sizes[i], layer_sizes[i + 1]) * scale
            b = np.zeros((1, layer_sizes[i + 1]))
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X, training=False):
        """Forward pass through the network."""
        self.activations = [X]
        self.z_values = []
        self.dropout_masks = []

        current = X
        for i in range(self.num_layers):
            z = current @ self.weights[i] + self.biases[i]
            self.z_values.append(z)

            # ReLU for hidden layers, sigmoid for output
            if i < self.num_layers - 1:
                a = relu(z)
                # Apply dropout during training only
                if training and self.dropout_rate > 0:
                    mask = (np.random.rand(*a.shape) > self.dropout_rate).astype(float)
                    a *= mask / (1 - self.dropout_rate)  # Inverted dropout scaling
                    self.dropout_masks.append(mask)
                else:
                    self.dropout_masks.append(np.ones_like(a))
            else:
                a = sigmoid(z)

            self.activations.append(a)
            current = a

        return current

    def backward(self, X, y):
        """Backpropagation — compute gradients for all weights and biases."""
        m = X.shape[0]  # batch size
        output = self.activations[-1]

        # Output layer error (binary cross-entropy derivative with sigmoid)
        delta = output - y

        weight_grads = []
        bias_grads = []

        for i in range(self.num_layers - 1, -1, -1):
            # Gradients for this layer
            dw = (self.activations[i].T @ delta) / m
            db = np.sum(delta, axis=0, keepdims=True) / m

            # Add L2 regularization gradient
            if self.l2_lambda > 0:
                dw += (self.l2_lambda / m) * self.weights[i]

            weight_grads.insert(0, dw)
            bias_grads.insert(0, db)

            # Propagate error to previous layer (skip for input layer)
            if i > 0:
                delta = (delta @ self.weights[i].T) * relu_derivative(self.z_values[i - 1])
                # Apply dropout mask
                if self.dropout_rate > 0:
                    delta *= self.dropout_masks[i - 1] / (1 - self.dropout_rate)

        return weight_grads, bias_grads

    def compute_loss(self, y_true, y_pred):
        """Binary cross-entropy loss with optional L2 regularization."""
        m = y_true.shape[0]
        # Clip to avoid log(0)
        y_pred = np.clip(y_pred, 1e-8, 1 - 1e-8)
        bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

        # L2 penalty
        if self.l2_lambda > 0:
            l2_penalty = sum(np.sum(w ** 2) for w in self.weights)
            bce += (self.l2_lambda / (2 * m)) * l2_penalty

        return bce

    def train_step(self, X, y):
        """One training step: forward, backward, update."""
        output = self.forward(X, training=True)
        weight_grads, bias_grads = self.backward(X, y)

        # Update weights with vanilla gradient descent
        for i in range(self.num_layers):
            self.weights[i] -= self.learning_rate * weight_grads[i]
            self.biases[i] -= self.learning_rate * bias_grads[i]

        return self.compute_loss(y, output)

    def predict(self, X, threshold=0.5):
        """Forward pass without dropout, returns class predictions."""
        probs = self.forward(X, training=False)
        return (probs >= threshold).astype(int)

Key implementation details:

  • He initialization (np.sqrt(2.0 / fan_in)) — using plain random values causes poor convergence in deep networks. He initialization is designed for ReLU activations and keeps the variance stable across layers.
  • Inverted dropout — we scale activations by 1 / (1 - dropout_rate) during training so that we don't need to scale during inference. This is the same approach PyTorch uses.
  • Overflow protection via np.clip in sigmoid prevents NaN values during training. Note that this clips the pre-activation input, not the gradient; gradient clipping proper is a separate technique.
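
The effect of He initialization is easy to demonstrate: push random data through a deep stack of ReLU layers and compare activation scales. With a small fixed scale the signal shrinks toward zero within a few layers; with He initialization it stays roughly stable. A minimal sketch (the depth and width here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_stack(scale_fn, depth=20, width=256):
    """Push random data through `depth` ReLU layers; return the final activation std."""
    x = rng.standard_normal((64, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width)
        x = np.maximum(0, x @ W)
    return x.std()

naive_std = forward_stack(lambda fan_in: 0.01)                 # small fixed scale
he_std = forward_stack(lambda fan_in: np.sqrt(2.0 / fan_in))   # He init

print(f"std after 20 ReLU layers, fixed 0.01 scale: {naive_std:.2e}")
print(f"std after 20 ReLU layers, He init:          {he_std:.2e}")
```

With the fixed scale, activations collapse to essentially zero; with He initialization they keep a usable magnitude, which is why the gradients in the backward pass survive as well.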

Backpropagation: Step by Step

Backpropagation computes the gradient of the loss function with respect to every weight and bias in the network using the chain rule. Let's trace through exactly how it works.

For a network with layers [input → hidden → output]:

1. Forward pass — compute each layer's output:

z_hidden = X @ W1 + b1
a_hidden = relu(z_hidden)
z_output = a_hidden @ W2 + b2
a_output = sigmoid(z_output)

2. Output layer gradient — for binary cross-entropy loss with sigmoid, the derivative simplifies to:

delta_output = a_output - y_true

3. Weight gradients for output layer:

dW2 = (a_hidden.T @ delta_output) / batch_size
db2 = sum(delta_output, axis=0) / batch_size

4. Propagate error backward:

delta_hidden = (delta_output @ W2.T) * relu_derivative(z_hidden)

5. Weight gradients for hidden layer:

dW1 = (X.T @ delta_hidden) / batch_size
db1 = sum(delta_hidden, axis=0) / batch_size

6. Update all weights and biases:

W1 -= learning_rate * dW1
b1 -= learning_rate * db1
W2 -= learning_rate * dW2
b2 -= learning_rate * db2

The chain rule is what makes this work. The gradient for W1 depends on how W1 affects z_hidden, how z_hidden affects a_hidden, how a_hidden affects z_output, and how z_output affects the loss. Each of these partial derivatives gets multiplied together — that's the chain rule in action.
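
The cleanest way to convince yourself the delta formulas are right is a finite-difference check: nudge one weight, recompute the loss, and compare the measured slope to the analytic gradient. A standalone sketch for a single sigmoid layer with binary cross-entropy (the toy data and shapes are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
W = rng.standard_normal((3, 1)) * 0.5
b = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(W, b):
    """Binary cross-entropy for a single sigmoid layer."""
    a = np.clip(sigmoid(X @ W + b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# Analytic gradient from the backprop formulas: delta = a - y, dW = X.T @ delta / m
a = sigmoid(X @ W + b)
dW = X.T @ (a - y) / X.shape[0]

# Finite-difference estimate for each weight
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[i, 0] += eps
    W_minus[i, 0] -= eps
    dW_num[i, 0] = (loss(W_plus, b) - loss(W_minus, b)) / (2 * eps)

print("max abs difference:", np.max(np.abs(dW - dW_num)))
```

If the two gradients agree to several decimal places, your backprop is almost certainly correct; the same trick scales to checking any layer of the full network.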

Optimization: The Adam Optimizer

Vanilla gradient descent uses a fixed learning rate for all parameters. This is inefficient — some weights need large updates while others need small ones. Adam (Adaptive Moment Estimation) [1] solves this by maintaining per-parameter learning rates based on first and second moment estimates of the gradients.

class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        """
        Default hyperparameters from the original Adam paper.
        These defaults work well for most problems — start here.
        """
        self.lr = learning_rate
        self.beta1 = beta1   # Exponential decay rate for first moment (mean)
        self.beta2 = beta2   # Exponential decay rate for second moment (variance)
        self.epsilon = epsilon  # Prevents division by zero
        self.t = 0  # Timestep counter
        self.m_w = None  # First moment estimates for weights
        self.v_w = None  # Second moment estimates for weights
        self.m_b = None
        self.v_b = None

    def initialize(self, weights, biases):
        """Initialize moment estimates to zeros matching weight shapes."""
        self.m_w = [np.zeros_like(w) for w in weights]
        self.v_w = [np.zeros_like(w) for w in weights]
        self.m_b = [np.zeros_like(b) for b in biases]
        self.v_b = [np.zeros_like(b) for b in biases]

    def update(self, weights, biases, weight_grads, bias_grads):
        """Perform one Adam update step."""
        if self.m_w is None:
            self.initialize(weights, biases)

        self.t += 1

        for i in range(len(weights)):
            # Update first moment (mean of gradients)
            self.m_w[i] = self.beta1 * self.m_w[i] + (1 - self.beta1) * weight_grads[i]
            self.m_b[i] = self.beta1 * self.m_b[i] + (1 - self.beta1) * bias_grads[i]

            # Update second moment (mean of squared gradients)
            self.v_w[i] = self.beta2 * self.v_w[i] + (1 - self.beta2) * (weight_grads[i] ** 2)
            self.v_b[i] = self.beta2 * self.v_b[i] + (1 - self.beta2) * (bias_grads[i] ** 2)

            # Bias correction — critical in early steps when moments are near zero
            m_w_hat = self.m_w[i] / (1 - self.beta1 ** self.t)
            v_w_hat = self.v_w[i] / (1 - self.beta2 ** self.t)
            m_b_hat = self.m_b[i] / (1 - self.beta1 ** self.t)
            v_b_hat = self.v_b[i] / (1 - self.beta2 ** self.t)

            # Update parameters
            weights[i] -= self.lr * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)
            biases[i] -= self.lr * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)

Why Adam works well: Vanilla gradient descent struggles with loss landscapes that have different curvatures in different directions (common in neural networks). Adam adapts the effective learning rate for each parameter individually. Parameters with consistently large gradients get smaller effective learning rates, and vice versa. The bias correction terms are important because without them, the moment estimates are biased toward zero in the first few iterations.

A note on "local minima": You'll often hear that optimizers help "escape local minima." In practice, research has shown that the bigger problem in high-dimensional neural networks is saddle points — places where the gradient is zero but the point is neither a minimum nor a maximum [2]. Adam and momentum-based methods help navigate through saddle points efficiently.
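
You can watch the per-parameter scaling at work on a deliberately ill-conditioned toy problem, f(w) = 100·w0² + w1², where the curvatures of the two directions differ by 100x. At the same learning rate, vanilla gradient descent diverges along the steep direction while Adam stays stable (this is a scalar re-implementation of the update rule above; the problem itself is invented for illustration):

```python
import numpy as np

def f(w):
    """Ill-conditioned quadratic: curvature 100x larger along w0 than along w1."""
    return 100 * w[0] ** 2 + w[1] ** 2

def grad(w):
    return np.array([200 * w[0], 2 * w[1]])

lr = 0.1

# Vanilla gradient descent at lr=0.1 diverges: each step multiplies
# the steep coordinate by (1 - lr * 200) = -19.
w_gd = np.array([1.0, 1.0])
for _ in range(10):
    w_gd = w_gd - lr * grad(w_gd)

# Adam at the same lr stays stable: the effective step is roughly
# lr * sign(gradient), independent of the curvature.
w_adam = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
beta1, beta2, eps = 0.9, 0.999, 1e-8
for t in range(1, 201):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w_adam = w_adam - lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"GD   (lr=0.1, 10 steps):  f = {f(w_gd):.3e}")
print(f"Adam (lr=0.1, 200 steps): f = {f(w_adam):.3e}")
```

Gradient descent's loss explodes within a handful of steps, while Adam drives it toward the minimum, which is the adaptive scaling doing its job.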

Regularization: Preventing Overfitting

When your model achieves 99% training accuracy but only 70% on test data, it's overfitting — memorizing the training data rather than learning generalizable patterns.

L2 Regularization adds a penalty to the loss function proportional to the squared magnitude of the weights:

total_loss = cross_entropy_loss + (lambda / 2m) * Σ(w²)

This discourages any single weight from becoming too large. The gradient contribution is simply (lambda / m) * w, which is already included in our NeuralNetwork class above.

Dropout randomly zeroes out a fraction of neuron activations during each training step. This forces the network to learn redundant representations — no single neuron can become a critical point of failure. At inference time, all neurons are active but we don't need to scale because we used inverted dropout during training.
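
A quick sanity check on the inverted-dropout scaling: averaged over many random masks, the scaled activations match the original activations, which is exactly why no rescaling is needed at inference time. The activation values here are arbitrary toy numbers:

```python
import numpy as np

rng = np.random.default_rng(42)
a = np.array([1.0, 2.0, 3.0, 4.0])  # toy activations
rate = 0.5
n_trials = 100_000

# One row per trial: an independent inverted-dropout mask each time
masks = (rng.random((n_trials, a.size)) > rate).astype(float)
dropped = a * masks / (1 - rate)  # inverted dropout: zero out, then rescale

print("original activations:  ", a)
print("mean over random masks:", np.round(dropped.mean(axis=0), 3))
```

The per-element mean comes out at the original values (up to sampling noise), so the network sees the same expected activation scale during training and inference.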

How to choose regularization strength:

  • Start with l2_lambda=0.001 and dropout_rate=0.2
  • If still overfitting, increase them gradually
  • If underfitting (low training accuracy), reduce them
  • Monitor the gap between training and validation loss — regularization should narrow this gap

Mini-Batch Training

Training on the entire dataset per update (batch gradient descent) is slow. Updating after every single sample (stochastic gradient descent) is noisy. Mini-batch gradient descent is the practical middle ground:

def train(network, X, y, epochs=100, batch_size=32, optimizer=None):
    """Train with mini-batches and optional Adam optimizer."""
    m = X.shape[0]
    history = []

    for epoch in range(epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        epoch_loss = 0
        num_batches = 0

        for start in range(0, m, batch_size):
            end = min(start + batch_size, m)
            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]

            # Forward and backward pass
            output = network.forward(X_batch, training=True)
            weight_grads, bias_grads = network.backward(X_batch, y_batch)

            # Update weights
            if optimizer:
                optimizer.update(network.weights, network.biases, weight_grads, bias_grads)
            else:
                for i in range(network.num_layers):
                    network.weights[i] -= network.learning_rate * weight_grads[i]
                    network.biases[i] -= network.learning_rate * bias_grads[i]

            epoch_loss += network.compute_loss(y_batch, output)
            num_batches += 1

        avg_loss = epoch_loss / num_batches
        history.append(avg_loss)

        if epoch % 10 == 0:
            predictions = network.predict(X)
            accuracy = np.mean(predictions == y)
            print(f"Epoch {epoch:4d} | Loss: {avg_loss:.4f} | Accuracy: {accuracy:.4f}")

    return history

Batch size guidelines:

  • 32 — good default for most problems
  • 64–128 — slightly faster training, often works just as well
  • 256+ — may need learning rate warmup to converge properly
  • Powers of 2 are conventional (for GPU memory alignment) but not required for CPU training

Putting It All Together

Here's a complete working example that trains on a synthetic dataset:

# Generate synthetic dataset — two interleaving half-moons
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
y = y.reshape(-1, 1).astype(float)

# Split into train/test
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Create network: 2 inputs → 64 hidden → 32 hidden → 1 output
network = NeuralNetwork(
    layer_sizes=[2, 64, 32, 1],
    learning_rate=0.001,
    l2_lambda=0.001,
    dropout_rate=0.1
)

optimizer = AdamOptimizer(learning_rate=0.001)

# Train
history = train(network, X_train, y_train, epochs=200, batch_size=32, optimizer=optimizer)

# Evaluate
test_preds = network.predict(X_test)
test_accuracy = np.mean(test_preds == y_test)
print(f"\nTest Accuracy: {test_accuracy:.4f}")

Evaluating Model Performance: Beyond Accuracy

Accuracy alone can be misleading. If 95% of your data belongs to class A, a model that always predicts class A gets 95% accuracy while being useless.

def evaluate(y_true, y_pred, y_probs=None):
    """Compute classification metrics."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    print(f"Accuracy:  {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1-score:  {f1:.4f}")

    # ROC AUC (simplified — requires probability scores)
    if y_probs is not None:
        thresholds = np.linspace(0, 1, 100)
        tpr_list, fpr_list = [], []
        for t in thresholds:
            preds = (y_probs >= t).astype(int)
            tp_t = np.sum((preds == 1) & (y_true == 1))
            fp_t = np.sum((preds == 1) & (y_true == 0))
            fn_t = np.sum((preds == 0) & (y_true == 1))
            tn_t = np.sum((preds == 0) & (y_true == 0))
            tpr_list.append(tp_t / (tp_t + fn_t) if (tp_t + fn_t) > 0 else 0)
            fpr_list.append(fp_t / (fp_t + tn_t) if (fp_t + tn_t) > 0 else 0)
        auc = -np.trapz(tpr_list, fpr_list)  # Negated: FPR decreases as threshold rises
        print(f"ROC AUC:   {auc:.4f}")

# Usage
test_probs = network.forward(X_test, training=False)
test_preds = (test_probs >= 0.5).astype(int)
evaluate(y_test, test_preds, test_probs)

  • Precision — of all positive predictions, how many were actually positive? Important when false positives are costly (spam detection).
  • Recall — of all actual positives, how many did we catch? Important when false negatives are costly (disease detection).
  • F1-score — harmonic mean of precision and recall. Useful when you need to balance both.
  • ROC AUC — measures how well the model separates classes across all thresholds. 0.5 = random guessing, 1.0 = perfect separation.
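
A tiny worked example makes these definitions concrete. With hypothetical toy labels whose confusion counts you can tally by hand, the formulas give exact fractions:

```python
import numpy as np

# Toy labels chosen so the confusion counts are easy to check by hand:
# TP = 3, FP = 1, FN = 1, TN = 5
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # 3: positives we caught
fp = np.sum((y_pred == 1) & (y_true == 0))  # 1: false alarm
fn = np.sum((y_pred == 0) & (y_true == 1))  # 1: positive we missed
tn = np.sum((y_pred == 0) & (y_true == 0))  # 5: correctly rejected

precision = tp / (tp + fp)  # 3/4 = 0.75: of 4 positive calls, 3 were right
recall = tp / (tp + fn)     # 3/4 = 0.75: of 4 actual positives, we caught 3
f1 = 2 * precision * recall / (precision + recall)  # 0.75

print(f"precision={precision}, recall={recall}, f1={f1}")
```

Note that accuracy here is 8/10 = 0.8, higher than precision and recall, a small-scale version of why accuracy alone can flatter a model.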

Debugging: Tracing Gradients Through the Network

When your network isn't training, the problem is almost always in the gradients. Here's how to inspect them:

def debug_gradients(network, X_sample, y_sample):
    """Print gradient statistics per layer to diagnose training issues."""
    network.forward(X_sample, training=True)
    weight_grads, bias_grads = network.backward(X_sample, y_sample)

    print("Layer | Weight Grad Mean | Weight Grad Std | Max Abs Grad")
    print("-" * 62)
    for i, (wg, bg) in enumerate(zip(weight_grads, bias_grads)):
        print(f"  {i:2d}  | {np.mean(np.abs(wg)):14.8f} | {np.std(wg):14.8f} | {np.max(np.abs(wg)):12.8f}")

    # Warnings
    for i, wg in enumerate(weight_grads):
        mean_abs = np.mean(np.abs(wg))
        if mean_abs < 1e-7:
            print(f"\n  WARNING: Layer {i} gradients are near zero (vanishing gradient).")
            print(f"  Try: ReLU activation, He initialization, or fewer layers.")
        elif mean_abs > 100:
            print(f"\n  WARNING: Layer {i} gradients are very large (exploding gradient).")
            print(f"  Try: gradient clipping, lower learning rate, or batch normalization.")

Common gradient problems:

  • Vanishing gradients (all near zero) — gradients shrink as they flow backward through many layers. Fix: use ReLU instead of sigmoid for hidden layers, use He initialization.
  • Exploding gradients (very large values) — gradients grow exponentially. Fix: gradient clipping, lower learning rate, or batch normalization.
  • Dead ReLU neurons (gradient is exactly zero for many neurons) — happens when neurons output zero for all inputs. Fix: use Leaky ReLU or lower learning rate.
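
If you do hit dead ReLU neurons, Leaky ReLU is a two-line change to the activation functions defined earlier (0.01 is the conventional default slope, though it is tunable):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but passes a small fraction of negative inputs through."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    """Gradient is alpha (not zero) for negative inputs, so dead neurons can recover."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))             # negatives scaled by 0.01 instead of zeroed
print(leaky_relu_derivative(x))  # 0.01 where x <= 0, 1.0 where x > 0
```

Swap these in for relu and relu_derivative in the NeuralNetwork class and every neuron keeps a nonzero gradient path, at the cost of a slightly leakier nonlinearity.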

From Scratch vs. Libraries: When to Use What

Building from scratch teaches you the mechanics. For production, use a framework:

Aspect                    | From Scratch           | PyTorch / TensorFlow
--------------------------|------------------------|------------------------------
GPU acceleration          | Manual (hard)          | Automatic
Automatic differentiation | Manual backprop        | Built-in autograd
Pre-built layers          | Write everything       | Conv, LSTM, Transformer, etc.
Training speed            | Slow (CPU-bound NumPy) | 10-100x faster with GPU
Debugging understanding   | Excellent              | Requires framework knowledge
Custom operations         | Full control           | Possible but more complex

The knowledge from this guide directly transfers. When PyTorch's loss.backward() runs, it's doing exactly what our backward() method does — just faster and with automatic differentiation instead of manual gradient computation.

Conclusion

You now have a working neural network built entirely from scratch — no framework magic, just NumPy and math. You've implemented:

  • Forward and backward passes with the chain rule
  • The Adam optimizer with bias correction
  • L2 regularization and inverted dropout
  • Mini-batch training with shuffling
  • Evaluation metrics beyond accuracy
  • Gradient debugging tools

The next step is to take this understanding into PyTorch or TensorFlow. You'll find that every concept — loss functions, optimizers, regularization, batch training — maps directly to framework APIs. The difference is that you'll now understand what's happening underneath.

Footnotes

  1. Kingma, D.P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR).

  2. Dauphin, Y. et al. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems (NeurIPS).

