TensorFlow Guide: From Zero to Hero (2026 Edition)

May 3, 2026

TL;DR

  • TensorFlow 2 is eager-by-default, Keras-first, and the same code that runs on a laptop scales to multi-GPU and production.[1]
  • Build models with Keras Sequential (linear stacks), Functional (branches and shared layers), or by subclassing tf.keras.Model for full control.[2][3]
  • Pipe data with tf.data and always end the pipeline with .prefetch(tf.data.AUTOTUNE) — feeding NumPy arrays directly leaves GPUs idle.[4][5]
  • Train with model.fit() for 90% of cases. Reach for tf.GradientTape only when you need a non-standard loss, multiple optimizers, or custom logging.[6]
  • Distribute with tf.distribute.MirroredStrategy (single host, multi-GPU) or MultiWorkerMirroredStrategy (multi-host) — same model code, just wrap construction in with strategy.scope():.[7]
  • Ship through a SavedModel: TF Serving for servers, TF Lite for mobile/edge, TF.js for browsers.[8][9][10]
  • This guide assumes TensorFlow 2.x on Python 3.10–3.12. Check current API docs at tensorflow.org/api_docs for the exact version.

What You'll Learn

  1. Set up TensorFlow locally or in Google Colab and verify GPU acceleration in one line.
  2. Manipulate tensors — the core data structure — and understand the eager-vs-graph trade-off.
  3. Build neural networks with all three Keras APIs and pick the right one for the job.
  4. Stream large datasets efficiently with tf.data so the GPU stays saturated.
  5. Train and evaluate models with model.fit() plus callbacks for early stopping, learning-rate schedules, and TensorBoard logging.
  6. Write custom training loops with tf.GradientTape when the high-level API isn't enough.
  7. Scale training across multiple GPUs and machines using tf.distribute.
  8. Export and serve models in production through TF Serving (servers) or TF Lite (devices).

Each section ends with a runnable code block and a link to the relevant page in the official TensorFlow documentation.


Prerequisites

You'll get the most out of this guide if you have:

  • Python: comfortable with functions, classes, list comprehensions, and basic NumPy. Python 3.10+ recommended; TensorFlow 2.x supports Python 3.9–3.12.[11]
  • Math: linear algebra (vectors, matrices, dot products), calculus (gradients, partial derivatives), and basic probability. You don't need a PhD; you need to know what a gradient is and why it points uphill.
  • Machine learning basics: training/validation/test split, what overfitting looks like, why we use a held-out test set. If those terms are new, work through Andrew Ng's Machine Learning course on Coursera before this guide — it's still the best free intro.

You don't need a GPU to start. Google Colab gives you a free Tesla T4 for short sessions, and most of this guide's examples run there in under a minute.


Setting Up TensorFlow

Local install

# Create a clean virtual environment first
python3 -m venv tf-env
source tf-env/bin/activate     # Windows: tf-env\Scripts\activate

# Install TensorFlow (latest stable)
pip install --upgrade pip
pip install tensorflow

# For GPU support on Linux (CUDA-enabled wheel includes the GPU pieces)
pip install tensorflow[and-cuda]

pip install tensorflow ships the CPU-only wheel on macOS (Apple Silicon uses the Metal plugin, installed separately) and the CPU+GPU wheel on Linux when you append [and-cuda].[11] On Windows, native GPU support was dropped in TF 2.11 — use WSL2 + the Linux wheel for CUDA.

Verify the install

import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
print("Built with CUDA:", tf.test.is_built_with_cuda())

If "GPUs visible" returns an empty list and you expected one, your driver/CUDA versions don't match the TF wheel. The mapping table on tensorflow.org/install/source#gpu is authoritative.[11]
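
One optional setup tweak: by default TensorFlow reserves most of the GPU's memory up front. If you share the GPU with other processes, you can make allocation grow on demand instead:

# Must run before any op touches the GPU
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)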

Google Colab (zero-install path)

Skip everything above and open a notebook at colab.research.google.com. TensorFlow is pre-installed; switch to a GPU runtime via Runtime → Change runtime type → GPU. Then:

import tensorflow as tf
tf.config.list_physical_devices("GPU")
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Colab is the friction-free way to follow this guide. Every example below runs there unchanged.


TensorFlow Fundamentals

Tensors

A tensor is an n-dimensional array with a dtype (e.g. float32) and a shape (e.g. (32, 224, 224, 3)). It's the only data structure TensorFlow operations consume.

import tensorflow as tf

# Scalar (rank 0)
scalar = tf.constant(3.14)

# Vector (rank 1)
vector = tf.constant([1.0, 2.0, 3.0])

# Matrix (rank 2)
matrix = tf.constant([[1, 2, 3],
                      [4, 5, 6]], dtype=tf.float32)

# 4D tensor — typical image batch (batch, height, width, channels)
images = tf.zeros((32, 224, 224, 3), dtype=tf.float32)

print(matrix.shape, matrix.dtype)
# (2, 3) <dtype: 'float32'>

Element-wise math, reductions, and broadcasting work the way they do in NumPy:

a = tf.constant([1.0, 2.0, 3.0])
b = tf.constant([10.0, 20.0, 30.0])

a + b          # [11., 22., 33.]
a * b          # [10., 40., 90.]
tf.reduce_sum(a * b)  # 140.0  (dot product)
tf.linalg.matmul(matrix, tf.transpose(matrix))  # 2x2 result

Two things tensors have that NumPy arrays don't — both demonstrated below:

  1. Tensors are immutable. Every operation produces a new tensor. To mutate, use tf.Variable (next section).
  2. Tensors live on a device (CPU or GPU). They get placed automatically. tensor.device tells you where, and with tf.device("/GPU:0"): forces placement.
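
Both are easy to see in a couple of lines:

a = tf.constant([1.0, 2.0])
b = a + 1.0          # a new tensor; a itself is unchanged (and a[0] = 5.0 would raise TypeError)

print(b.device)      # e.g. '/job:localhost/replica:0/task:0/device:GPU:0' if a GPU is visible
with tf.device("/CPU:0"):
    c = tf.ones((2, 2))
print(c.device)      # pinned to the CPU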

→ Full tensor reference: tensorflow.org/api_docs/python/tf/Tensor [1]

Variables

tf.Variable is mutable state — used for model weights and any value that gets updated during training.

# Initialize from a tensor
w = tf.Variable(tf.random.normal((4, 3)), name="weights")

# Update in place — assign(), assign_add(), assign_sub()
w.assign(tf.zeros_like(w))     # set to zeros
w.assign_add(tf.ones_like(w))  # increment by 1

# Read like a tensor
print(w.numpy())

When you build a Keras layer, the layer's parameters are tf.Variable objects under the hood. You rarely create them by hand; the framework does.
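
You can see this on any built layer:

dense = tf.keras.layers.Dense(3)
dense.build(input_shape=(None, 4))     # creates the kernel and bias Variables

print(type(dense.kernel))              # a ResourceVariable, i.e. a tf.Variable
print([v.shape for v in dense.trainable_variables])   # [(4, 3), (3,)]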

tensorflow.org/guide/variable

Eager vs. graph execution

TensorFlow 2 runs eagerly by default: every Python line executes immediately and returns a tensor you can print(). This is what made TF 2 feel like a normal Python library — TF 1's build-then-run graph mode was notoriously hard to debug.[12]

But eager mode has overhead. For production training, you wrap a function with @tf.function and TF traces it into a graph the first time it's called, then replays the graph on every subsequent call:

@tf.function
def train_step(x, y, model, loss_fn, optimizer):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

The same function works without @tf.function (eager) or with it (graph). Use eager while debugging; add @tf.function once it works to get the speedup.
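
If you need to poke at a function that's already decorated, you can temporarily force everything eager — a handy debugging toggle:

# Run all @tf.function-decorated functions eagerly (slow, but breakpoints and print() work)
tf.config.run_functions_eagerly(True)

# ... debug ...

tf.config.run_functions_eagerly(False)   # restore graph execution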

tensorflow.org/guide/intro_to_graphs [12]


Building Models with Keras

Keras is the high-level API in tf.keras. It has three styles, each with a clear sweet spot.

1. Sequential — for linear stacks

Use when your model is a list of layers, one after another. Most beginner-level networks fit this shape.

import tensorflow as tf
from tensorflow.keras import layers, Sequential

model = Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPool2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPool2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.summary()

tensorflow.org/guide/keras/sequential_model [2]

2. Functional — for branches, multi-input, residual connections

Use when your model has more than one input, multiple outputs, or a layer feeds two paths (skip connections, branching).

inputs = tf.keras.Input(shape=(224, 224, 3))

x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPool2D()(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)

# Skip connection: pool the input, project it to 64 channels, add it back
# (padding="same" above keeps both paths at 112x112 so Add() works)
skip = layers.Conv2D(64, 1)(layers.MaxPool2D()(inputs))
x = layers.Add()([x, skip])

x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs, name="mini_resnet")
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

tensorflow.org/guide/keras/functional [3]

3. Subclassing — for full control

Use when your forward pass has dynamic structure (e.g. a loop whose length depends on input data). It's the most flexible and the hardest to debug — reach for it last.

class TwoTowerModel(tf.keras.Model):
    def __init__(self, num_classes):
        super().__init__()
        self.text_dense = layers.Dense(64, activation="relu")
        self.image_dense = layers.Dense(64, activation="relu")
        self.combine = layers.Concatenate()
        self.classifier = layers.Dense(num_classes, activation="softmax")

    def call(self, inputs, training=False):
        text, image = inputs
        t = self.text_dense(text)
        i = self.image_dense(image)
        return self.classifier(self.combine([t, i]))


model = TwoTowerModel(num_classes=5)

tensorflow.org/guide/keras/custom_layers_and_models

Picking between the three

Model shape                                      API
Linear stack of layers                           Sequential
Multi-input / multi-output / skip connections    Functional
Loops, conditionals, dynamic shape in call()     Subclassing

Default to Functional. It handles everything Sequential can plus most of what Subclassing can, while still serializing cleanly to a SavedModel.


Working with Data: tf.data

Loading data is where most TensorFlow projects waste time. The single biggest performance lever is using tf.data end-to-end and never feeding NumPy arrays directly to model.fit() for anything larger than a toy dataset.[4]

From in-memory arrays

import numpy as np

X = np.random.rand(10000, 32).astype("float32")
y = np.random.randint(0, 10, size=10000)

ds = (tf.data.Dataset.from_tensor_slices((X, y))
      .shuffle(buffer_size=10000, seed=42)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))

for batch_x, batch_y in ds.take(1):
    print(batch_x.shape, batch_y.shape)
# (32, 32) (32,)

From files (images)

ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",
    image_size=(224, 224),
    batch_size=32,
    label_mode="int",
)

# Always finish with prefetch — overlaps data prep with training step
ds = ds.prefetch(tf.data.AUTOTUNE)

From the catalog: tensorflow_datasets

import tensorflow_datasets as tfds

(train_ds, val_ds), info = tfds.load(
    "cifar10",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

def normalize(image, label):
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

train_ds = (train_ds
            .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
            .cache()
            .shuffle(10000)
            .batch(64)
            .prefetch(tf.data.AUTOTUNE))

val_ds = (val_ds
          .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
          .batch(64)
          .cache()
          .prefetch(tf.data.AUTOTUNE))

→ TFDS catalog: tensorflow.org/datasets/catalog/overview [13]

The pipeline rules of thumb

  1. .shuffle(buffer) before .batch() so each batch is mixed.
  2. .cache() if the data fits in memory after preprocessing — saves re-doing .map() every epoch.
  3. .map(..., num_parallel_calls=tf.data.AUTOTUNE) to parallelize preprocessing across CPU cores.
  4. .prefetch(tf.data.AUTOTUNE) as the last op — overlaps the next batch's prep with the current batch's training step.
  5. Augment inside the model (using layers.RandomFlip, layers.RandomRotation, etc.), not in the dataset, so augmentation runs on the GPU and only during training.

→ Performance guide: tensorflow.org/guide/data_performance [5]


Training and Evaluation

model.fit() and callbacks

callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=5,
        restore_best_weights=True,
    ),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss",
        factor=0.5,
        patience=2,
        min_lr=1e-6,
    ),
    tf.keras.callbacks.TensorBoard(log_dir="./logs"),
    tf.keras.callbacks.ModelCheckpoint(
        filepath="best_model.keras",
        save_best_only=True,
        monitor="val_loss",
    ),
]

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,
    callbacks=callbacks,
    verbose=2,
)

Three patterns that pay off every time:

  • Always use EarlyStopping with restore_best_weights=True. Otherwise you keep whichever weights the model had at the last epoch — usually overfit.
  • Always log to TensorBoard. It's free and lets you compare runs visually.
  • Always save the best checkpoint (not the last one). The two diverge by epoch 5–10 in any non-trivial run.

tensorflow.org/api_docs/python/tf/keras/callbacks

Custom training with tf.GradientTape

When model.fit() doesn't fit — multiple optimizers, GAN-style adversarial loops, custom gradient manipulation, RL — write your own training step.

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
train_acc = tf.keras.metrics.SparseCategoricalAccuracy()


@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        probs = model(x, training=True)   # the model ends in softmax, so these are probabilities
        loss = loss_fn(y, probs)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    train_acc.update_state(y, probs)
    return loss


for epoch in range(10):
    train_acc.reset_state()
    for x, y in train_ds:
        loss = train_step(x, y)
    print(f"Epoch {epoch+1}: loss={loss.numpy():.4f} acc={train_acc.result().numpy():.4f}")

Two things to know:

  • The @tf.function decorator is what makes this fast. Without it, you're back to eager mode and lose ~10× throughput on a GPU.
  • tape.gradient must be called before the tape goes out of scope; once the with block ends, the tape is gone — see the sketch below.
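
A corollary of that second point: a non-persistent tape can compute gradients only once. If you need several gradient calls from one forward pass — say, to log two quantities — create the tape with persistent=True. A minimal sketch:

x = tf.Variable(3.0)
with tf.GradientTape(persistent=True) as tape:
    y = x * x
    z = y * y

print(tape.gradient(y, x))   # 2x   -> 6.0
print(tape.gradient(z, x))   # 4x^3 -> 108.0
del tape                     # free the tape's resources once you're done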

tensorflow.org/api_docs/python/tf/GradientTape [6]

Mixed precision

Mixed precision keeps weights in float32 (for stability) but does the math in float16 or bfloat16 (for speed and memory). On Ampere-or-newer NVIDIA GPUs, it's roughly a 2–3× training speedup with negligible accuracy loss.[14]

from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy("mixed_float16")
# Build the model AFTER setting the policy

# Output layer should stay in float32 for numerical stability:
outputs = layers.Dense(10, activation="softmax", dtype="float32")(x)

That's the whole change. The framework handles loss scaling automatically when you compile with the default model.compile(optimizer="adam", ...) — it wraps the optimizer with LossScaleOptimizer for you.
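
In a custom tf.GradientTape loop, though, you do the wrapping and scaling yourself. A sketch, reusing the model and loss_fn from the custom-loop section above:

optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam(1e-3))

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = loss_fn(y, preds)
        scaled_loss = optimizer.get_scaled_loss(loss)        # scale up so float16 gradients don't underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)   # undo the scaling before applying
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss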

tensorflow.org/guide/mixed_precision [14]


Project: CIFAR-10 Image Classifier with Transfer Learning

A complete runnable example. Open Colab, paste this in, switch to a GPU runtime, and you'll have a 90%+ test accuracy classifier in about 5 minutes.

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import layers

# 1. Data
(train_ds, test_ds), info = tfds.load(
    "cifar10",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

NUM_CLASSES = info.features["label"].num_classes
IMG_SIZE = 224  # MobileNetV2's preferred input size
BATCH = 64

def preprocess(image, label):
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    image = tf.keras.applications.mobilenet_v2.preprocess_input(image)
    return image, label

train_ds = (train_ds
            .cache()   # cache the raw 32x32 images; the resized 224x224 floats would not fit in memory
            .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(10000)
            .batch(BATCH)
            .prefetch(tf.data.AUTOTUNE))

test_ds = (test_ds
           .cache()
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(BATCH)
           .prefetch(tf.data.AUTOTUNE))

# 2. Model — frozen MobileNetV2 backbone + new classifier head
base = tf.keras.applications.MobileNetV2(
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
    include_top=False,
    weights="imagenet",
)
base.trainable = False

inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),
])(inputs)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# 3. Train head
history = model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=5,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
    ],
)

# 4. Fine-tune: unfreeze the top 30 layers of the backbone, very small LR
base.trainable = True
for layer in base.layers[:-30]:
    layer.trainable = False

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history_ft = model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=5,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
    ],
)

# 5. Evaluate and save
loss, acc = model.evaluate(test_ds, verbose=0)
print(f"Test accuracy: {acc:.4f}")

model.save("cifar10_mobilenet.keras")

Why this works:

  • Training with the backbone frozen first fits only the classifier head on general-purpose ImageNet features — fast and stable.
  • Unfreezing and fine-tuning at 1e-5 then lets the upper layers specialize without destroying the pretrained features.
  • Augmentation lives in the model (RandomFlip, RandomRotation) so it runs on the GPU and only during training (training=True).

→ Transfer learning guide: tensorflow.org/guide/keras/transfer_learning


Distributed Training

The TF distribution API lets you scale training across devices and machines without changing your model code. You wrap construction in a strategy scope; the framework handles the rest.[7]

Multi-GPU on a single machine: MirroredStrategy

strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")

with strategy.scope():
    model = build_my_model()         # same code as before
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Important: scale the global batch size with the number of replicas
GLOBAL_BATCH = 64 * strategy.num_replicas_in_sync
ds = ds.batch(GLOBAL_BATCH).prefetch(tf.data.AUTOTUNE)

model.fit(ds, epochs=10)

Multi-host: MultiWorkerMirroredStrategy

Same pattern, plus a TF_CONFIG environment variable on each worker that lists every worker's address and the index of the current one:

# On worker 0:
export TF_CONFIG='{
  "cluster": {"worker": ["worker0.example:12345", "worker1.example:12345"]},
  "task": {"type": "worker", "index": 0}
}'
python train.py

# On worker 1: same JSON, but "index": 1

Then train.py is identical on every worker:

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = build_my_model()
    model.compile(...)
model.fit(...)

TPU: TPUStrategy

On Colab, switch to a TPU runtime, then:

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

Strategy selection

Hardware                               Strategy
Single GPU                             None — TensorFlow uses it automatically
Multiple GPUs, one machine             MirroredStrategy
Multiple machines                      MultiWorkerMirroredStrategy
Cloud TPU                              TPUStrategy
Massive embedding tables (recsys)      ParameterServerStrategy

tensorflow.org/guide/distributed_training [7]


Going to Production

The SavedModel format is the universal export. Both TF Serving and TF Lite consume it.[8][9]

Export

# Keras-native (preferred for TF 2.x):
model.save("export/model.keras")          # single-file format

# Or the SavedModel directory format (what TF Serving expects):
model.export("export/saved_model")

model.export() produces a directory like:

export/saved_model/
├── saved_model.pb
├── variables/
│   ├── variables.data-00000-of-00001
│   └── variables.index
└── assets/

Reload with:

loaded = tf.saved_model.load("export/saved_model")
predictions = loaded.serve(input_tensor)   # "serve" is the default endpoint model.export() registers

TF Serving (HTTP/gRPC servers)

Run TF Serving in a Docker container, mount your SavedModel directory, hit it over REST.

# Save the model to a versioned directory: export/saved_model/1/...
docker pull tensorflow/serving

docker run -p 8501:8501 \
  --mount type=bind,source="$(pwd)/export/saved_model",target=/models/my_model \
  -e MODEL_NAME=my_model \
  -t tensorflow/serving

Send a prediction:

curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'

tensorflow.org/tfx/serving/serving_basic [9]

TF Lite (mobile and edge)

converter = tf.lite.TFLiteConverter.from_saved_model("export/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # dynamic-range quantization by default

# For full int8 (smallest + fastest), provide a representative dataset
# so the converter can calibrate activation ranges:
def representative_dataset():
    for images, _ in train_ds.take(100):
        yield [images]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

Loading on-device (Python is also one of the supported runtimes):

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]["index"])

The same .tflite file deploys to Android (via the AAR), iOS (CocoaPods), Raspberry Pi, and microcontrollers (TF Lite Micro, a separate runtime).

tensorflow.org/lite/guide [10]

TF.js (browser inference)

pip install tensorflowjs
tensorflowjs_converter \
  --input_format=tf_saved_model \
  export/saved_model \
  web_model

Then in the browser:

import * as tf from '@tensorflow/tfjs';
const model = await tf.loadGraphModel('/web_model/model.json');
const prediction = model.predict(tf.tensor(input));

tensorflow.org/js


TensorFlow Ecosystem at a Glance

You'll meet these in real projects:

Library                       What it's for
tf.keras                      The standard model API. You'll use it daily.
tf.data                       Dataset pipelines. Same.
tensorflow_datasets (TFDS)    Catalog of pre-prepared public datasets (MNIST, CIFAR, ImageNet, GLUE, COCO, etc.).[13]
tensorflow_hub                Pre-trained models you can drop into a Keras model with one line.
tf.distribute                 Multi-GPU, multi-host, TPU training without changing model code.[7]
TFX                           End-to-end production pipelines: validation, training, serving, monitoring.[15]
TensorBoard                   Training visualization, profiling, embedding projections.
tf.lite                       On-device inference (mobile, embedded, edge).[10]
tensorflow_serving            HTTP/gRPC server for batch and online inference.[9]
tensorflow_js                 Run models in the browser.

You don't need all of them. Most projects ship with tf.keras + tf.data + one of tf.lite / tensorflow_serving for production.


Common Patterns and Pitfalls

A few things that trip up everyone the first time.

Don't feed NumPy arrays to model.fit() for anything serious

It works for toy examples but the GPU will sit at 30% utilization. Wrap the arrays in a tf.data.Dataset and add .prefetch(tf.data.AUTOTUNE).

training=True matters

Some layers (Dropout, BatchNormalization) behave differently during training vs. inference. model.fit() handles this for you. In a custom training loop, always pass training=True to model(...) during training and False during evaluation.
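
In a custom loop, for example, the evaluation step is the mirror image of the training step with the flag flipped (val_acc here stands for any metric you've created, like train_acc earlier):

@tf.function
def eval_step(x, y):
    preds = model(x, training=False)   # Dropout off; BatchNorm uses its moving averages
    val_acc.update_state(y, preds)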

tf.function retraces are expensive

Every time you call a @tf.function-decorated function with a new input shape or dtype, TF traces a new graph. If your batch size varies, pad to a fixed shape or use input_signature to fix the trace.
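
Pinning the signature looks like this — a sketch assuming a model that takes (batch, 32) float32 inputs; the None batch dimension lets every batch size reuse one trace:

@tf.function(input_signature=[tf.TensorSpec(shape=(None, 32), dtype=tf.float32)])
def predict(x):
    return model(x, training=False)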

Save the .keras file, not just the weights

model.save_weights(...) only saves the parameters. model.save("file.keras") saves architecture + weights + optimizer state — the only one you can reload without rebuilding the model.
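
The round trip, for reference:

model.save("file.keras")
restored = tf.keras.models.load_model("file.keras")   # architecture, weights, optimizer state all come back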

Don't tf.constant in a loop

tf.constant allocates a new tensor on every call. If you're calling it inside a training loop with the same value, hoist it out — or pass the value as a function argument.
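
A toy illustration of the hoist:

xs = tf.random.normal((1000, 8))

# Bad: a fresh scalar tensor is allocated on every iteration
total = tf.zeros(8)
for i in range(1000):
    total += xs[i] * tf.constant(2.0)

# Good: create the constant once, reuse it inside the loop
two = tf.constant(2.0)
total = tf.zeros(8)
for i in range(1000):
    total += xs[i] * two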

Use tf.device sparingly

Manual device placement (with tf.device("/GPU:0")) is rarely needed in TF 2 — automatic placement is reliable. Reach for it only when you're squeezing the last 10% out of a multi-GPU pipeline.


The Bottom Line

TensorFlow rewards a small set of habits and punishes deviation:

  1. Build with Keras Functional, fall back to subclassing only when you need dynamic structure.
  2. Pipe data with tf.data + AUTOTUNE prefetch, never NumPy arrays.
  3. Train with model.fit() + callbacks, drop into tf.GradientTape only for genuinely custom needs.
  4. Use @tf.function for any hot path — it's a 5–10× speedup for free.
  5. Export to SavedModel, then deploy through TF Serving or TF Lite.
  6. Trust the official guides — TensorFlow's documentation is unusually good. Every link in this article goes to a page worth reading.

Once those six are second nature, the rest of the framework — distributed training, mixed precision, custom layers, TFX pipelines — is incremental. You're already most of the way there.

Now go open Colab and train something.



Footnotes

  1. TensorFlow API reference — tensorflow.org/api_docs/python/tf. Authoritative source for every public symbol mentioned in this guide.

  2. TensorFlow guide — The Sequential model.

  3. TensorFlow guide — The Functional API.

  4. TensorFlow guide — tf.data: Build TensorFlow input pipelines.

  5. TensorFlow guide — Better performance with the tf.data API.

  6. TensorFlow API reference — tf.GradientTape.

  7. TensorFlow guide — Distributed training with TensorFlow.

  8. TensorFlow guide — Using the SavedModel format.

  9. TensorFlow Serving — Architecture overview and serving basics.

  10. TensorFlow Lite — LiteRT (TF Lite) developer guide.

  11. TensorFlow install — Pip install instructions, platform support, and the version-compatibility table.

  12. TensorFlow guide — Introduction to graphs and tf.function.

  13. TensorFlow Datasets — Catalog overview and the TFDS API reference.

  14. TensorFlow guide — Mixed precision.

  15. TFX — The TFX user guide.

Frequently Asked Questions

Q: Should I learn TensorFlow or PyTorch?

A: Both are excellent. TensorFlow has the edge for production deployment (TF Serving, TF Lite, TFX), mobile (TF Lite is more mature than PyTorch Mobile/ExecuTorch), and TPUs (TF is first-class on Google Cloud TPUs). PyTorch has the edge for research velocity and is more popular in academia. If you're unsure, pick the one your team uses.
