PyTorch Guide: Tensors to Production in 2026

May 4, 2026

TL;DR

  • PyTorch is an open-source deep learning framework built around dynamic ("eager") computation graphs and a Pythonic API. It was first released in January 2017 by Facebook AI Research and now lives under the PyTorch Foundation at the Linux Foundation.12
  • The current stable release is PyTorch 2.11.0 (March 2026), supports Python 3.10 through 3.14, and ships installable wheels with bundled CUDA runtime libraries via pip.34
  • torch.compile is the headline PyTorch 2.x feature — across 165 open-source models in the official benchmark, it produced an average speedup of 20% at float32 and 36% at AMP precision while working on 93% of models tested.5
  • For computer vision transfer learning, use the current weights= API (torchvision.models.resnet18(weights=ResNet18_Weights.DEFAULT)). The legacy pretrained=True parameter has been deprecated since torchvision 0.13 (June 2022).6
  • This guide walks through tensors, autograd, building and training a model, transfer learning with the current API, GPU acceleration, torch.compile, and deployment paths (TorchScript, ONNX, TorchServe).

What You'll Learn

  1. Install PyTorch correctly for CPU or CUDA on a current Python (3.10–3.14) environment.
  2. Work with tensors — creation, arithmetic, reshaping, and moving data between CPU and GPU.
  3. Use autograd to compute gradients automatically and understand how it powers training.
  4. Build neural networks using nn.Module and the nn.functional API.
  5. Load and batch data with Dataset and DataLoader, including built-in torchvision.datasets.
  6. Write a complete training loop with validation, learning-rate scheduling, and gradient clipping.
  7. Do transfer learning with the current torchvision multi-weight API.
  8. Speed up training with mixed precision (torch.amp) and torch.compile.
  9. Deploy a trained model via TorchScript or ONNX, and serve it with TorchServe.
  10. Troubleshoot the common failure modes (CUDA OOM, device mismatches, mode bugs, vanishing gradients).

Prerequisites

You should be comfortable with Python (functions, classes, imports, virtual environments) and have at least skimmed NumPy. You don't need a math PhD, but knowing what a vector, matrix, derivative, and chain rule are will make autograd feel intuitive rather than magical.

For hardware: a CPU-only laptop is fine for the examples in this guide. To do real training at scale, you'll want an NVIDIA GPU with a recent driver — PyTorch's pip wheels bundle the CUDA runtime libraries so you don't need a separate system-wide CUDA Toolkit installation, only a compatible NVIDIA driver.4

Background

PyTorch grew out of the older Lua-based Torch framework. Facebook AI Research (now Meta AI) ported the core ideas to Python and released PyTorch 0.1 on January 19, 2017.12 Two design choices set it apart from contemporaries: dynamic ("define-by-run") computation graphs that build as your Python code executes, and a deliberately Pythonic API that maps cleanly to NumPy.

The framework hit its first production-stable milestone with PyTorch 1.0 on December 7, 2018, which introduced TorchScript for production deployment.7 In September 2022, Meta transferred PyTorch to the Linux Foundation under a new vendor-neutral PyTorch Foundation, with founding members AMD, AWS, Google Cloud, Meta, Microsoft Azure, and NVIDIA.2 The Foundation expanded again in February 2026, adding members across academia and AI infrastructure including Carnegie Mellon University and Monash University.8

The most consequential recent release is PyTorch 2.0 (March 15, 2023), which introduced torch.compile — a graph-capture-and-compile path that wraps an eager model and produces faster code with no source rewrite.5 As of this writing the latest stable release is PyTorch 2.11.0 (March 2026).3

The framework is genuinely large now: the official pytorch/pytorch repository has roughly 99,000 GitHub stars,9 and PyTorch underpins a wide ecosystem including Hugging Face Transformers, PyTorch Lightning, fast.ai, PyTorch Geometric, and the official integrations layered into ONNX and TorchServe.

Installation

The recommended install path for nearly everyone is pip from the official PyTorch index. Pick the right index URL for your CUDA version (or use the CPU-only build):

# CPU only (any platform)
pip install torch torchvision torchaudio

# CUDA 12.1 wheels (Linux / Windows with NVIDIA GPU + driver)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Always pull the current command from pytorch.org/get-started/locally — the supported CUDA versions and Python range change with each release. The current PyTorch supports Python 3.10–3.14.4

Verify the install:

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
print(f"CUDA version:    {torch.version.cuda}")
if torch.cuda.is_available():
    print(f"Device:          {torch.cuda.get_device_name(0)}")

If torch.cuda.is_available() returns False on a machine you expect to have a GPU, the most common cause is a driver that's too old for the bundled CUDA runtime — update the NVIDIA driver, not the CUDA Toolkit.

Tensors

Tensors are PyTorch's core data structure: an N-dimensional array, like a NumPy ndarray, with optional GPU placement and gradient tracking.

import torch

# Creation
scalar = torch.tensor(3.14)
vector = torch.tensor([1.0, 2.0, 3.0])
matrix = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

zeros  = torch.zeros(2, 3)         # 2x3 of zeros
ones   = torch.ones(2, 3)          # 2x3 of ones
randn  = torch.randn(2, 3)         # standard normal
arange = torch.arange(12).view(3, 4)  # 0..11 reshaped to 3x4

# Element-wise and matmul
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b)          # tensor([5., 7., 9.])
print(a * b)          # element-wise
print(torch.dot(a, b))  # scalar dot product

m1 = torch.randn(2, 3)
m2 = torch.randn(3, 4)
print(torch.matmul(m1, m2).shape)  # torch.Size([2, 4])

# Shapes and dtype
t = torch.randn(2, 3)
print(t.shape, t.dtype, t.device)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_gpu = t.to(device)

Two practical points that catch beginners:

  • view vs reshape: view requires the tensor to be contiguous in memory; reshape will fall back to copying if needed. Prefer reshape unless you specifically know the tensor is contiguous.
  • NumPy interop is zero-copy on CPU: t.numpy() and torch.from_numpy(arr) share memory. Modify one and you modify the other. This breaks once a tensor moves to GPU.
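Both points are easy to see in a few lines. A minimal sketch (any recent PyTorch behaves this way):

```python
import numpy as np
import torch

# view vs reshape: a transpose makes the tensor non-contiguous
t = torch.arange(6).view(2, 3)
tt = t.t()                      # transposed view, non-contiguous
print(tt.is_contiguous())       # False
# tt.view(6) would raise a RuntimeError here; reshape copies instead
flat = tt.reshape(6)
print(flat.tolist())            # [0, 3, 1, 4, 2, 5]

# NumPy interop shares memory on CPU
arr = np.zeros(3, dtype=np.float32)
t2 = torch.from_numpy(arr)
arr[0] = 42.0
print(t2[0].item())             # 42.0: the tensor sees the change
```

The same sharing applies in the other direction with t.numpy(); call .clone() first when you need an independent copy.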

Autograd

PyTorch tracks operations on tensors that have requires_grad=True and computes gradients with .backward(). This is the engine behind every training loop you'll write.

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()
print(x.grad)  # tensor(7.) — d/dx(x² + 3x + 1) at x=2 is 2*2 + 3 = 7

For training, you typically don't manage requires_grad by hand — nn.Module parameters have it set automatically, and you call .backward() on the loss to populate .grad on every learnable parameter.

# Hand-rolled linear regression with autograd
x_data = torch.tensor([[1.0], [2.0], [3.0]])
y_data = torch.tensor([[2.0], [4.0], [6.0]])

w = torch.tensor([[0.5]], requires_grad=True)
b = torch.tensor([[0.0]], requires_grad=True)

lr = 0.05
for _ in range(200):
    pred = x_data @ w + b
    loss = ((pred - y_data) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(f"w ≈ {w.item():.3f}, b ≈ {b.item():.3f}")  # ~ w=2, b=0

Two patterns to remember:

  • Wrap parameter updates in with torch.no_grad(): so they don't get tracked.
  • Zero gradients between steps. Forgetting optimizer.zero_grad() (or, in the manual case above, w.grad.zero_()) is the most common autograd bug — gradients accumulate across .backward() calls by design, which is useful for gradient accumulation but a bug when you don't want it.
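The accumulation behavior is easy to verify directly — a small sketch:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Two backward passes without zeroing: gradients add up
(x ** 2).backward()             # d/dx x^2 = 2x = 6
print(x.grad)                   # tensor(6.)
(x ** 2).backward()
print(x.grad)                   # tensor(12.) - accumulated, not replaced

x.grad.zero_()                  # reset before the next step
(x ** 2).backward()
print(x.grad)                   # tensor(6.) again
```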

Building Neural Networks

In practice you don't manage tensors and gradients yourself — you subclass nn.Module, declare layers in __init__, and define the forward pass:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MNISTClassifier(nn.Module):
    def __init__(self, hidden: int = 128, num_classes: int = 10):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.view(x.size(0), -1)   # flatten N×1×28×28 → N×784
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)          # raw logits

model = MNISTClassifier()
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

A few notes on conventions:

  • The forward pass returns logits, not softmax probabilities. nn.CrossEntropyLoss expects logits and applies log-softmax internally.
  • nn.Sequential is fine for linear stacks but doesn't let you express skip connections, branches, or anything dynamic — once your model has any conditional logic, prefer a real nn.Module.
  • Initialize bias to a sensible value if you care: PyTorch defaults are often fine, but for deeper networks you'll want explicit nn.init.kaiming_normal_ (for ReLU) or nn.init.xavier_uniform_ (for tanh/sigmoid) calls inside __init__.
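As an illustration, explicit initialization inside __init__ might look like this (a sketch — the class name and layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class InitDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        # ReLU network: He (Kaiming) initialization for weights,
        # zeros for biases
        for layer in (self.fc1, self.fc2):
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            nn.init.zeros_(layer.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))
```

The trailing underscore convention (kaiming_normal_, zeros_) marks in-place operations — these overwrite the default initialization the layers already received.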

Datasets and DataLoaders

PyTorch separates "what is one example" (Dataset) from "how do I batch and shuffle" (DataLoader). Most computer-vision tutorials use a built-in torchvision.datasets class:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),  # MNIST mean/std
])

train_set = datasets.MNIST("./data", train=True,  download=True, transform=transform)
test_set  = datasets.MNIST("./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True,  num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_set,  batch_size=512, shuffle=False, num_workers=2, pin_memory=True)

For your own data, subclass Dataset:

from torch.utils.data import Dataset

class CSVDataset(Dataset):
    def __init__(self, df, feature_cols, label_col, transform=None):
        self.X = df[feature_cols].to_numpy(dtype="float32")
        self.y = df[label_col].to_numpy(dtype="int64")
        self.transform = transform

    def __len__(self) -> int:
        return len(self.y)

    def __getitem__(self, idx: int):
        x = torch.from_numpy(self.X[idx])
        y = int(self.y[idx])
        if self.transform is not None:
            x = self.transform(x)
        return x, y

Practical tips: set num_workers to a small positive number (2–8) on Linux/macOS for parallel data loading, and use pin_memory=True when training on GPU to speed up host-to-device transfers.
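To see the pattern end to end without downloading anything, here is a sketch with a synthetic in-memory dataset (the class name and sizes are made up for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """100 random feature vectors with integer class labels."""
    def __init__(self, n: int = 100, dim: int = 4):
        self.X = torch.randn(n, dim)
        self.y = torch.randint(0, 3, (n,))

    def __len__(self) -> int:
        return len(self.y)

    def __getitem__(self, idx: int):
        return self.X[idx], self.y[idx]

loader = DataLoader(ToyDataset(), batch_size=32, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)   # torch.Size([32, 4]) torch.Size([32])
```

Note that the DataLoader stacks individual (x, y) pairs into batched tensors for you — the default collate function handles tensors, numbers, and nested tuples.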

A Complete Training Loop

Putting the pieces together for our MNIST classifier:

import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MNISTClassifier().to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

def evaluate(model, loader, device) -> tuple[float, float]:
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            total_loss += criterion(logits, y).item() * x.size(0)
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += x.size(0)
    return total_loss / total, correct / total

EPOCHS = 5
for epoch in range(EPOCHS):
    model.train()
    running_loss = 0.0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        # Optional: clip gradients to stabilize deeper networks
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        running_loss += loss.item() * x.size(0)

    scheduler.step()
    train_loss = running_loss / len(train_loader.dataset)
    val_loss, val_acc = evaluate(model, test_loader, device)
    print(
        f"epoch {epoch+1:>2}/{EPOCHS} | "
        f"train_loss={train_loss:.4f} | "
        f"val_loss={val_loss:.4f} | "
        f"val_acc={val_acc:.4f}"
    )

torch.save(model.state_dict(), "mnist_classifier.pt")

What this loop covers in one place: device placement, training/eval mode flips, optimizer zero-grad + step, gradient clipping, learning-rate scheduling, and a clean eval function with torch.no_grad(). The state_dict save format is the recommended way to persist weights — saving the whole model object pickles class definitions and tends to break across refactors.
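Loading follows the mirror-image pattern: rebuild the architecture, then load the state_dict into it. A self-contained sketch with a stand-in model (an in-memory buffer plays the role of the .pt file):

```python
import io
import torch
import torch.nn as nn

# A stand-in model; in the guide this would be MNISTClassifier
model = nn.Linear(4, 2)
buf = io.BytesIO()
torch.save(model.state_dict(), buf)        # same as saving to a .pt file

# Loading: rebuild the architecture, then load the weights.
# map_location="cpu" makes GPU-saved checkpoints load on a CPU-only box.
buf.seek(0)
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buf, map_location="cpu"))
restored.eval()                            # flip to eval before inference

x = torch.randn(1, 4)
assert torch.equal(model(x), restored(x))  # identical outputs
```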

Transfer Learning (Current API)

For computer-vision work you almost never train from scratch — you load a model pre-trained on ImageNet and fine-tune. The current torchvision API uses weights=, not pretrained=True. The latter has been deprecated since torchvision 0.13 (June 2022) and emits a UserWarning.6

import torch
import torch.nn as nn
import torchvision
from torchvision.models import ResNet18_Weights

# Load pre-trained ResNet18 with the multi-weight API
weights = ResNet18_Weights.DEFAULT          # current best weights
model = torchvision.models.resnet18(weights=weights)

# The weights enum bundles the matching preprocessing transforms — no
# more guessing the right Normalize() means and stds.
preprocess = weights.transforms()

# Replace the final classifier head for our 10-class task
in_features = model.fc.in_features
model.fc = nn.Linear(in_features, 10)

# Optionally freeze everything except the new head for fast fine-tuning
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc.")

# Now `preprocess` is the right transform to put inside your dataset
# (or apply on each tensor before forward). Train as usual.

Two things this API gets right that the old one didn't:

  • weights.transforms() returns the exact preprocessing pipeline the network was trained with. You don't have to copy ImageNet means and stds from a Stack Overflow answer.
  • weights.meta exposes class labels and other metadata, so you can map model output indices to human-readable labels without a separate file.
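The freeze-everything-but-the-head pattern is worth sanity-checking: count trainable parameters before training. A sketch with a stand-in two-layer model (the real code would use the ResNet and its fc head):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 2))
# Pretend the last Linear is the new classification head; the "1."
# prefix here plays the role of "fc." in the ResNet example
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("1.")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")  # only the head: 16*2 + 2 = 34
```

When you build the optimizer, pass it only the trainable parameters, e.g. optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3), so it doesn't carry state for frozen weights.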

GPU Acceleration and Mixed Precision

Moving to GPU is mechanical: send the model and each batch to the device. Mixed-precision training is the easiest way to make that GPU faster without touching your model code.

import torch
from torch.amp import autocast, GradScaler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
use_amp = device.type == "cuda"
scaler = GradScaler(device="cuda", enabled=use_amp)

for x, y in train_loader:
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    optimizer.zero_grad()
    with autocast(device_type=device.type, dtype=torch.float16, enabled=use_amp):
        logits = model(x)
        loss = criterion(logits, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

The enabled=use_amp guard turns autocast and the gradient scaler into no-ops when you fall back to CPU, so the same training loop works on either device.

autocast runs ops in lower precision (FP16 or BF16) where it's safe, and GradScaler rescales the loss so small gradient values don't underflow to zero. On modern NVIDIA cards (Ampere/Hopper), AMP usually delivers a noticeable training speedup with negligible accuracy loss. PyTorch's official benchmark for torch.compile reported roughly 20% faster on average at float32 and 36% faster on average at AMP precision across 165 open-source models.5

torch.compile (PyTorch 2.x)

torch.compile wraps a model and compiles its forward pass with a graph-capture-and-compile backend. It's additive — your eager code keeps working — and it's the recommended optimization step for any production-bound PyTorch model in 2.x.

model = MNISTClassifier().to(device)
compiled_model = torch.compile(model)  # one line

# Use compiled_model exactly like model
logits = compiled_model(x)

The first call after compilation is slow (it captures and compiles the graph), then subsequent calls are fast. You can pick a compile mode for different trade-offs:

compiled_model = torch.compile(model, mode="reduce-overhead")  # cuts Python interpreter overhead
compiled_model = torch.compile(model, mode="max-autotune")     # best runtime speed, longest compile

Two practical caveats: torch.compile doesn't yet match every model architecture (the official 2.0 announcement reports 93% success across 165 tested models),5 and any Python-side control flow that depends on runtime tensor values may trigger graph breaks that hurt performance. Profile with torch._dynamo.explain(model)(example_input) if you suspect breaks.

Deployment

Three production paths, each with a different trade-off:

TorchScript keeps you in the PyTorch ecosystem. torch.jit.script parses your nn.Module into a serializable IR; torch.jit.trace runs an example through and records the operations.

model.eval()
scripted = torch.jit.script(model)          # preserves control flow
# or
example = torch.randn(1, 1, 28, 28).to(device)
traced = torch.jit.trace(model, example)    # simpler but doesn't preserve control flow

scripted.save("mnist_classifier.ts")

ONNX lets you serve the model from a non-PyTorch runtime — ONNX Runtime, TensorRT, or another framework that ingests ONNX:

example = torch.randn(1, 1, 28, 28).to(device)
torch.onnx.export(
    model,
    example,
    "mnist_classifier.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)

TorchServe is the official PyTorch model server. It packages a model into a .mar archive and serves it over HTTP/gRPC with versioning, batching, and metrics out of the box. The simplest path is to feed it the TorchScript artifact you saved above — --model-file is only required when archiving an eager-mode state_dict:

torch-model-archiver \
    --model-name mnist \
    --version 1.0 \
    --serialized-file mnist_classifier.ts \
    --handler image_classifier \
    --export-path model_store

torchserve --start --ncs --model-store model_store --models mnist=mnist.mar

Pick the path that fits your stack. ONNX is the most portable, TorchServe is the lowest-effort production server, and TorchScript sits between them.

Troubleshooting

CUDA out of memory. Reduce batch size first, then turn on AMP, then turn on gradient checkpointing (torch.utils.checkpoint.checkpoint). torch.cuda.empty_cache() is rarely the answer — it returns cached memory to the driver but doesn't free what's still referenced by Python.

Tensors on different devices. The classic RuntimeError: expected all tensors on same device error means a model is on GPU and a batch is still on CPU (or vice-versa). Move both: x = x.to(device), y = y.to(device), model = model.to(device). Print x.device if you're not sure where something lives.

Model not learning. Three usual suspects, in order: forgot optimizer.zero_grad() (gradients accumulate); forgot to flip model.train() / model.eval() (dropout and batchnorm behave differently); learning rate too high (training loss explodes) or too low (no movement).

Reproducibility. Set seeds and disable nondeterministic CUDA kernels for runs you need to compare:

torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Note that fully bit-exact reproducibility across hardware and PyTorch versions is not guaranteed — small numerical differences are normal.

Vanishing or exploding gradients. Add torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) between loss.backward() and optimizer.step(), switch ReLU initialization to nn.init.kaiming_normal_, or add normalization layers (nn.BatchNorm2d, nn.LayerNorm) or residual connections. For RNNs specifically, gradient clipping is essentially mandatory.

Slow data loading. If GPU utilization sits below 50% and the bottleneck is the input pipeline, raise num_workers on your DataLoader, set pin_memory=True, and avoid Python-level work in __getitem__ — preprocess offline if you can.

Frequently Asked Questions

Should I pick PyTorch or TensorFlow in 2026?

They've converged a lot. PyTorch dominates research (most state-of-the-art papers ship reference implementations in PyTorch first) and has caught up on production tooling. Google's on-device runtime — renamed from TensorFlow Lite to LiteRT in September 2024 10 — still has strong tooling for mobile and embedded targets, and Google Cloud has some TF-specific deployment paths. For a new project today, either is a defensible choice; PyTorch's larger research ecosystem usually tips the scale.
