PyTorch Guide: Tensors to Production in 2026
May 4, 2026
TL;DR
- PyTorch is an open-source deep learning framework built around dynamic ("eager") computation graphs and a Pythonic API. It was first released in January 2017 by Facebook AI Research and now lives under the PyTorch Foundation at the Linux Foundation.12
- The current stable release is PyTorch 2.11.0 (March 2026), supports Python 3.10 through 3.14, and ships installable wheels with bundled CUDA runtime libraries via pip.34
- torch.compile is the headline PyTorch 2.x feature — across 165 open-source models in the official benchmark, it produced an average speedup of 20% at float32 and 36% at AMP precision while working on 93% of models tested.5
- For computer vision transfer learning, use the current weights= API (torchvision.models.resnet18(weights=ResNet18_Weights.DEFAULT)). The legacy pretrained=True parameter has been deprecated since torchvision 0.13 (June 2022).6
- This guide walks through tensors, autograd, building and training a model, transfer learning with the current API, GPU acceleration, torch.compile, and deployment paths (TorchScript, ONNX, TorchServe).
What You'll Learn
- Install PyTorch correctly for CPU or CUDA on a current Python (3.10–3.14) environment.
- Work with tensors — creation, arithmetic, reshaping, and moving data between CPU and GPU.
- Use autograd to compute gradients automatically and understand how it powers training.
- Build neural networks using nn.Module and the nn.functional API.
- Load and batch data with Dataset and DataLoader, including built-in torchvision.datasets.
- Write a complete training loop with validation, learning-rate scheduling, and gradient clipping.
- Do transfer learning with the current torchvision multi-weight API.
- Speed up training with mixed precision (torch.amp) and torch.compile.
- Deploy a trained model via TorchScript or ONNX, and serve it with TorchServe.
- Troubleshoot the common failure modes (CUDA OOM, device mismatches, mode bugs, vanishing gradients).
Prerequisites
You should be comfortable with Python (functions, classes, imports, virtual environments) and have at least skimmed NumPy. You don't need a math PhD, but knowing what a vector, matrix, derivative, and chain rule are will make autograd feel intuitive rather than magical.
For hardware: a CPU-only laptop is fine for the examples in this guide. To do real training at scale, you'll want an NVIDIA GPU with a recent driver — PyTorch's pip wheels bundle the CUDA runtime libraries so you don't need a separate system-wide CUDA Toolkit installation, only a compatible NVIDIA driver.4
Background
PyTorch grew out of the older Lua-based Torch framework. Facebook AI Research (now Meta AI) ported the core ideas to Python and released PyTorch 0.1 on January 19, 2017.12 Two design choices set it apart from contemporaries: dynamic ("define-by-run") computation graphs that build as your Python code executes, and a deliberately Pythonic API that maps cleanly to NumPy.
The framework hit its first production-stable milestone with PyTorch 1.0 on December 7, 2018, which introduced TorchScript for production deployment.7 In September 2022, Meta transferred PyTorch to the Linux Foundation under a new vendor-neutral PyTorch Foundation, with founding members AMD, AWS, Google Cloud, Meta, Microsoft Azure, and NVIDIA.2 The Foundation expanded again in February 2026, adding members across academia and AI infrastructure including Carnegie Mellon University and Monash University.8
The most consequential recent release is PyTorch 2.0 (March 15, 2023), which introduced torch.compile — a graph-capture-and-compile path that wraps an eager model and produces faster code with no source rewrite.5 As of this writing the latest stable release is PyTorch 2.11.0 (March 2026).3
The framework is genuinely large now: the official pytorch/pytorch repository has roughly 99,000 GitHub stars,9 and PyTorch underpins a wide ecosystem including Hugging Face Transformers, PyTorch Lightning, fast.ai, and PyTorch Geometric, alongside official integrations with ONNX and TorchServe.
Installation
The recommended install path for nearly everyone is pip from the official PyTorch index. Pick the right index URL for your CUDA version (or use the CPU-only build):
# CPU only (any platform)
pip install torch torchvision torchaudio
# CUDA 12.1 wheels (Linux / Windows with NVIDIA GPU + driver)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Always pull the current command from pytorch.org/get-started/locally — the supported CUDA versions and Python range change with each release. The current PyTorch supports Python 3.10–3.14.4
Verify the install:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
if torch.cuda.is_available():
print(f"Device: {torch.cuda.get_device_name(0)}")
If torch.cuda.is_available() returns False on a machine you expect to have a GPU, the most common cause is a driver that's too old for the bundled CUDA runtime — update the NVIDIA driver, not the CUDA Toolkit.
Tensors
Tensors are PyTorch's core data structure: an N-dimensional array, like a NumPy ndarray, with optional GPU placement and gradient tracking.
import torch
# Creation
scalar = torch.tensor(3.14)
vector = torch.tensor([1.0, 2.0, 3.0])
matrix = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
zeros = torch.zeros(2, 3) # 2x3 of zeros
ones = torch.ones(2, 3) # 2x3 of ones
randn = torch.randn(2, 3) # standard normal
arange = torch.arange(12).view(3, 4) # 0..11 reshaped to 3x4
# Element-wise and matmul
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b) # tensor([5., 7., 9.])
print(a * b) # element-wise
print(torch.dot(a, b)) # scalar dot product
m1 = torch.randn(2, 3)
m2 = torch.randn(3, 4)
print(torch.matmul(m1, m2).shape) # torch.Size([2, 4])
# Shapes and dtype
t = torch.randn(2, 3)
print(t.shape, t.dtype, t.device)
# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_gpu = t.to(device)
Two practical points that catch beginners:
- view vs reshape: view requires the tensor to be contiguous in memory; reshape will fall back to copying if needed. Prefer reshape unless you specifically know the tensor is contiguous.
- NumPy interop is zero-copy on CPU: t.numpy() and torch.from_numpy(arr) share memory, so modifying one modifies the other. This breaks once a tensor moves to GPU. Both points are illustrated in the sketch below.
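A minimal sketch of both gotchas (the printed values are what you should see on CPU):
import torch
t = torch.arange(6).reshape(2, 3)
tt = t.t()                     # the transpose is a non-contiguous view
# tt.view(6) would raise a RuntimeError here; reshape copies when it has to
flat = tt.reshape(6)
print(flat)                    # tensor([0, 3, 1, 4, 2, 5])
a = torch.ones(3)
n = a.numpy()                  # zero-copy: shares memory with `a`
n[0] = 99.0
print(a)                       # tensor([99., 1., 1.])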
Autograd
PyTorch tracks operations on tensors that have requires_grad=True and computes gradients with .backward(). This is the engine behind every training loop you'll write.
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()
print(x.grad) # tensor(7.) — d/dx(x² + 3x + 1) at x=2 is 2*2 + 3
For training, you typically don't manage requires_grad by hand — nn.Module parameters have it set automatically, and you call .backward() on the loss to populate .grad on every learnable parameter.
# Hand-rolled linear regression with autograd
x_data = torch.tensor([[1.0], [2.0], [3.0]])
y_data = torch.tensor([[2.0], [4.0], [6.0]])
w = torch.tensor([[0.5]], requires_grad=True)
b = torch.tensor([[0.0]], requires_grad=True)
lr = 0.05
for _ in range(200):
pred = x_data @ w + b
loss = ((pred - y_data) ** 2).mean()
loss.backward()
with torch.no_grad():
w -= lr * w.grad
b -= lr * b.grad
w.grad.zero_()
b.grad.zero_()
print(f"w ≈ {w.item():.3f}, b ≈ {b.item():.3f}") # ~ w=2, b=0
Two patterns to remember:
- Wrap parameter updates in with torch.no_grad(): so they don't get tracked.
- Zero gradients between steps. Forgetting optimizer.zero_grad() (or, in the manual case above, w.grad.zero_()) is the most common autograd bug — gradients accumulate across .backward() calls by design, which is useful for deliberate gradient accumulation (see the sketch below) but a bug when you don't want it.
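When you do want accumulation, the pattern is to divide the loss and step the optimizer every few micro-batches. A minimal sketch, assuming a model, criterion, optimizer, train_loader, and device like the ones set up in the training-loop section later in this guide:
ACCUM_STEPS = 4  # effective batch = ACCUM_STEPS * DataLoader batch size
optimizer.zero_grad()
for i, (x, y) in enumerate(train_loader):
    x, y = x.to(device), y.to(device)
    loss = criterion(model(x), y) / ACCUM_STEPS  # scale so the sum matches one big batch
    loss.backward()                              # gradients add up in .grad
    if (i + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()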
Building Neural Networks
In practice you don't manage tensors and gradients yourself — you subclass nn.Module, declare layers in __init__, and define the forward pass:
import torch
import torch.nn as nn
import torch.nn.functional as F
class MNISTClassifier(nn.Module):
def __init__(self, hidden: int = 128, num_classes: int = 10):
super().__init__()
self.fc1 = nn.Linear(28 * 28, hidden)
self.fc2 = nn.Linear(hidden, num_classes)
self.dropout = nn.Dropout(0.3)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = x.view(x.size(0), -1) # flatten N×1×28×28 → N×784
x = F.relu(self.fc1(x))
x = self.dropout(x)
return self.fc2(x) # raw logits
model = MNISTClassifier()
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
A few notes on conventions:
- The forward pass returns logits, not softmax probabilities. nn.CrossEntropyLoss expects logits and applies log-softmax internally.
- nn.Sequential is fine for linear stacks but doesn't let you express skip connections, branches, or anything dynamic — once your model has any conditional logic, prefer a real nn.Module.
- Initialize weights explicitly if you care: PyTorch defaults are often fine, but for deeper networks you'll want explicit nn.init.kaiming_normal_ (for ReLU) or nn.init.xavier_uniform_ (for tanh/sigmoid) calls inside __init__, as in the sketch below.
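One way to do that, shown as a subclass of the MNISTClassifier defined above (the subclass name is just for illustration):
class MNISTClassifierKaiming(MNISTClassifier):
    def __init__(self, hidden: int = 128, num_classes: int = 10):
        super().__init__(hidden, num_classes)
        # Kaiming (He) init matches the ReLU used in forward(); zero the biases
        for layer in (self.fc1, self.fc2):
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            nn.init.zeros_(layer.bias)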
Datasets and DataLoaders
PyTorch separates "what is one example" (Dataset) from "how do I batch and shuffle" (DataLoader). Most computer-vision tutorials use a built-in torchvision.datasets class:
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,)), # MNIST mean/std
])
train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=512, shuffle=False, num_workers=2, pin_memory=True)
For your own data, subclass Dataset:
import torch
from torch.utils.data import Dataset
class CSVDataset(Dataset):
def __init__(self, df, feature_cols, label_col, transform=None):
self.X = df[feature_cols].to_numpy(dtype="float32")
self.y = df[label_col].to_numpy(dtype="int64")
self.transform = transform
def __len__(self) -> int:
return len(self.y)
def __getitem__(self, idx: int):
x = torch.from_numpy(self.X[idx])
y = int(self.y[idx])
if self.transform is not None:
x = self.transform(x)
return x, y
Practical tips: set num_workers to a small positive number (2–8) on Linux/macOS for parallel data loading, and use pin_memory=True when training on GPU to speed up host-to-device transfers.
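Using the custom dataset is the same DataLoader call as before. A sketch, where the CSV path and column names are placeholders:
import pandas as pd
from torch.utils.data import DataLoader
df = pd.read_csv("my_table.csv")  # hypothetical file
csv_set = CSVDataset(df, feature_cols=["f1", "f2", "f3"], label_col="target")
loader = DataLoader(csv_set, batch_size=32, shuffle=True, num_workers=2)
x_batch, y_batch = next(iter(loader))
print(x_batch.shape, y_batch.shape)  # e.g. torch.Size([32, 3]) torch.Size([32])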
A Complete Training Loop
Putting the pieces together for our MNIST classifier:
import torch
import torch.nn as nn
import torch.optim as optim
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MNISTClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
def evaluate(model, loader, device) -> tuple[float, float]:
model.eval()
total_loss, correct, total = 0.0, 0, 0
with torch.no_grad():
for x, y in loader:
x, y = x.to(device), y.to(device)
logits = model(x)
total_loss += criterion(logits, y).item() * x.size(0)
correct += (logits.argmax(dim=1) == y).sum().item()
total += x.size(0)
return total_loss / total, correct / total
EPOCHS = 5
for epoch in range(EPOCHS):
model.train()
running_loss = 0.0
for x, y in train_loader:
x, y = x.to(device), y.to(device)
optimizer.zero_grad()
logits = model(x)
loss = criterion(logits, y)
loss.backward()
# Optional: clip gradients to stabilize deeper networks
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
running_loss += loss.item() * x.size(0)
scheduler.step()
train_loss = running_loss / len(train_loader.dataset)
val_loss, val_acc = evaluate(model, test_loader, device)
print(
f"epoch {epoch+1:>2}/{EPOCHS} | "
f"train_loss={train_loss:.4f} | "
f"val_loss={val_loss:.4f} | "
f"val_acc={val_acc:.4f}"
)
torch.save(model.state_dict(), "mnist_classifier.pt")
What this loop covers in one place: device placement, training/eval mode flips, optimizer zero-grad + step, gradient clipping, learning-rate scheduling, and a clean eval function with torch.no_grad(). The state_dict save format is the recommended way to persist weights — saving the whole model object pickles class definitions and tends to break across refactors.
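Loading the weights back mirrors the save: build the architecture, then load the state dict (map_location keeps this working on CPU-only machines). A minimal sketch:
model = MNISTClassifier()
state = torch.load("mnist_classifier.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()  # switch dropout/batchnorm to inference behavior before serving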
Transfer Learning (Current API)
For computer-vision work you almost never train from scratch — you load a model pre-trained on ImageNet and fine-tune. The current torchvision API uses weights=, not pretrained=True. The latter has been deprecated since torchvision 0.13 (June 2022) and emits a UserWarning.6
import torch
import torch.nn as nn
import torchvision
from torchvision.models import ResNet18_Weights
# Load pre-trained ResNet18 with the multi-weight API
weights = ResNet18_Weights.DEFAULT # current best weights
model = torchvision.models.resnet18(weights=weights)
# The weights enum bundles the matching preprocessing transforms — no
# more guessing the right Normalize() means and stds.
preprocess = weights.transforms()
# Replace the final classifier head for our 10-class task
in_features = model.fc.in_features
model.fc = nn.Linear(in_features, 10)
# Optionally freeze everything except the new head for fast fine-tuning
for name, param in model.named_parameters():
param.requires_grad = name.startswith("fc.")
# Now `preprocess` is the right transform to put inside your dataset
# (or apply on each tensor before forward). Train as usual.
Two things this API gets right that the old one didn't:
- weights.transforms() returns the exact preprocessing pipeline the network was trained with. You don't have to copy ImageNet means and stds from a Stack Overflow answer.
- weights.meta exposes class labels and other metadata, so you can map model output indices to human-readable labels without a separate file (see the sketch below).
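For example, a quick sketch of reading the ImageNet labels off the enum and classifying one (already preprocessed) input with the unmodified pre-trained network:
weights = ResNet18_Weights.DEFAULT
categories = weights.meta["categories"]        # 1,000 ImageNet class names
pretrained = torchvision.models.resnet18(weights=weights).eval()
img = torch.randn(1, 3, 224, 224)              # stand-in for a preprocessed image
with torch.no_grad():
    idx = pretrained(img).argmax(dim=1).item()
print(categories[idx])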
GPU Acceleration and Mixed Precision
Moving to GPU is mechanical: send the model and each batch to the device. Mixed-precision training is the easiest way to make that GPU faster without touching your model code.
import torch
from torch.amp import autocast, GradScaler
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
use_amp = device.type == "cuda"
scaler = GradScaler(device="cuda", enabled=use_amp)
for x, y in train_loader:
x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
optimizer.zero_grad()
with autocast(device_type=device.type, dtype=torch.float16, enabled=use_amp):
logits = model(x)
loss = criterion(logits, y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
The enabled=use_amp guard turns autocast and the gradient scaler into no-ops when you fall back to CPU, so the same training loop works on either device.
autocast runs ops in lower precision (FP16 or BF16) where it's safe, and GradScaler rescales the loss so small gradient values don't underflow to zero. On modern NVIDIA cards (Ampere/Hopper), AMP usually delivers a noticeable training speedup with negligible accuracy loss. PyTorch's official benchmark for torch.compile reported roughly 20% faster on average at float32 and 36% faster on average at AMP precision across 165 open-source models.5
torch.compile (PyTorch 2.x)
torch.compile wraps a model and compiles its forward pass with a graph-capture-and-replay backend. It's additive — your eager code keeps working — and it's the recommended optimization step for any production-bound PyTorch model in 2.x.
model = MNISTClassifier().to(device)
compiled_model = torch.compile(model) # one line
# Use compiled_model exactly like model
logits = compiled_model(x)
The first call after compilation is slow (it captures and compiles the graph), then subsequent calls are fast. You can pick a compile mode for different trade-offs:
compiled_model = torch.compile(model, mode="reduce-overhead") # cuts Python interpreter overhead
compiled_model = torch.compile(model, mode="max-autotune") # best runtime speed, longest compile
Two practical caveats: torch.compile doesn't yet work on every model (the official 2.0 announcement reports 93% success across 165 tested models),5 and any Python-side control flow that depends on runtime tensor values may trigger graph breaks that hurt performance. Profile with torch._dynamo.explain(model)(example_input) if you suspect breaks.
Deployment
Three production paths, each with a different trade-off:
TorchScript keeps you in the PyTorch ecosystem. torch.jit.script parses your nn.Module into a serializable IR; torch.jit.trace runs an example through and records the operations.
model.eval()
scripted = torch.jit.script(model) # preserves control flow
# or
example = torch.randn(1, 1, 28, 28).to(device)
traced = torch.jit.trace(model, example) # simpler but doesn't preserve control flow
scripted.save("mnist_classifier.ts")
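Loading the saved artifact later needs no Python class definition, which is the point. A minimal sketch:
loaded = torch.jit.load("mnist_classifier.ts", map_location="cpu")
loaded.eval()
with torch.no_grad():
    out = loaded(torch.randn(1, 1, 28, 28))
print(out.shape)  # torch.Size([1, 10])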
ONNX lets you serve the model from a non-PyTorch runtime — ONNX Runtime, TensorRT, or another framework that ingests ONNX:
example = torch.randn(1, 1, 28, 28).to(device)
torch.onnx.export(
model,
example,
"mnist_classifier.onnx",
opset_version=17,
input_names=["input"],
output_names=["logits"],
dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)
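To sanity-check the export, you can run it under ONNX Runtime (a separate pip install onnxruntime; shown here as a sketch):
import numpy as np
import onnxruntime as ort
session = ort.InferenceSession("mnist_classifier.onnx", providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 1, 28, 28).astype(np.float32)
(logits,) = session.run(["logits"], {"input": dummy})
print(logits.shape)  # (1, 10)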
TorchServe is the official PyTorch model server. It packages a model into a .mar archive and serves it over HTTP/gRPC with versioning, batching, and metrics out of the box. The simplest path is to feed it the TorchScript artifact you saved above — --model-file is only required when archiving an eager-mode state_dict:
torch-model-archiver \
--model-name mnist \
--version 1.0 \
--serialized-file mnist_classifier.ts \
--handler image_classifier \
--export-path model_store
torchserve --start --ncs --model-store model_store --models mnist=mnist.mar
Pick the path that fits your stack. ONNX is the most portable, TorchServe is the lowest-effort production server, and TorchScript sits between them.
Troubleshooting
CUDA out of memory. Reduce batch size first, then turn on AMP, then turn on gradient checkpointing (torch.utils.checkpoint.checkpoint). torch.cuda.empty_cache() is rarely the answer — it returns cached memory to the driver but doesn't free what's still referenced by Python.
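A hedged sketch of gradient checkpointing, wrapping one expensive block so its activations are recomputed during backward instead of stored (this module is illustrative, not part of the guide's model):
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
class CheckpointedMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(784, 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU(),
        )
        self.head = nn.Linear(2048, 10)
    def forward(self, x):
        # Activations inside `block` are recomputed in backward, trading compute for memory
        x = checkpoint(self.block, x, use_reentrant=False)
        return self.head(x)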
Tensors on different devices. The classic RuntimeError: Expected all tensors to be on the same device means the model is on GPU while a batch is still on CPU (or vice versa). Move both: x = x.to(device), y = y.to(device), model = model.to(device). Print x.device if you're not sure where something lives.
Model not learning. Three usual suspects, in order: forgot optimizer.zero_grad() (gradients accumulate); forgot to flip model.train() / model.eval() (dropout and batchnorm behave differently); learning rate too high (training loss explodes) or too low (no movement).
Reproducibility. Set seeds and disable nondeterministic CUDA kernels for runs you need to compare:
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Note that fully bit-exact reproducibility across hardware and PyTorch versions is not guaranteed — small numerical differences are normal.
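If your DataLoader uses worker processes, seed those too. This follows the pattern in PyTorch's reproducibility notes; the helper name is arbitrary:
import random
import numpy as np
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
g = torch.Generator()
g.manual_seed(42)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True,
                          num_workers=2, worker_init_fn=seed_worker, generator=g)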
Vanishing or exploding gradients. Add torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) between loss.backward() and optimizer.step(), switch ReLU initialization to nn.init.kaiming_normal_, or add nn.BatchNorm / residual connections. For RNNs specifically, gradient clipping is essentially mandatory.
Slow data loading. If GPU utilization sits below 50% and the bottleneck is the input pipeline, raise num_workers on your DataLoader, set pin_memory=True, and avoid Python-level work in __getitem__ — preprocess offline if you can.