TensorFlow Guide: From Zero to Hero (2026 Edition)

May 3, 2026

TL;DR

  • TensorFlow 2 is eager-by-default, Keras-first, and the same code that runs on a laptop scales to multi-GPU and production.[1]
  • Build models with Keras Sequential (linear stacks), Functional (branches and shared layers), or by subclassing tf.keras.Model for full control.[2][3]
  • Pipe data with tf.data and always end the pipeline with .prefetch(tf.data.AUTOTUNE) — feeding NumPy arrays directly leaves GPUs idle.[4][5]
  • Train with model.fit() for 90% of cases. Reach for tf.GradientTape only when you need a non-standard loss, multiple optimizers, or custom logging.[6]
  • Distribute with tf.distribute.MirroredStrategy (single host, multi-GPU) or MultiWorkerMirroredStrategy (multi-host) — same model code, just wrap construction in with strategy.scope():.[7]
  • Ship through a SavedModel: TF Serving for servers, TF Lite for mobile/edge, TF.js for browsers.[8][9][10]
  • This guide assumes TensorFlow 2.x on Python 3.10–3.12. Check current API docs at tensorflow.org/api_docs for the exact version.

What You'll Learn

  1. Set up TensorFlow locally or in Google Colab and verify GPU acceleration in one line.
  2. Manipulate tensors — the core data structure — and understand the eager-vs-graph trade-off.
  3. Build neural networks with all three Keras APIs and pick the right one for the job.
  4. Stream large datasets efficiently with tf.data so the GPU stays saturated.
  5. Train and evaluate models with model.fit() plus callbacks for early stopping, learning-rate schedules, and TensorBoard logging.
  6. Write custom training loops with tf.GradientTape when the high-level API isn't enough.
  7. Scale training across multiple GPUs and machines using tf.distribute.
  8. Export and serve models in production through TF Serving (servers) or TF Lite (devices).

Each section ends with a runnable code block and a link to the relevant page in the official TensorFlow documentation.


Prerequisites

You'll get the most out of this guide if you have:

  • Python: comfortable with functions, classes, list comprehensions, and basic NumPy. Python 3.10+ recommended; TensorFlow 2.x supports Python 3.9–3.12.[11]
  • Math: linear algebra (vectors, matrices, dot products), calculus (gradients, partial derivatives), and basic probability. You don't need a PhD; you need to know what a gradient is and why it points uphill.
  • Machine learning basics: training/validation/test split, what overfitting looks like, why we use a held-out test set. If those terms are new, work through Andrew Ng's Machine Learning course on Coursera before this guide — it's still the best free intro.

You don't need a GPU to start. Google Colab gives you a free Tesla T4 for short sessions, and most of this guide's examples run there in under a minute.


Setting Up TensorFlow

Local install

# Create a clean virtual environment first
python3 -m venv tf-env
source tf-env/bin/activate     # Windows: tf-env\Scripts\activate

# Install TensorFlow (latest stable)
pip install --upgrade pip
pip install tensorflow

# For GPU support on Linux (CUDA-enabled wheel includes the GPU pieces)
pip install tensorflow[and-cuda]

pip install tensorflow ships the CPU-only wheel on macOS (Apple Silicon uses the Metal plugin, installed separately) and the CPU+GPU wheel on Linux when you append [and-cuda].[11] On Windows, native GPU support was dropped in TF 2.11 — use WSL2 + the Linux wheel for CUDA.

Verify the install

import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
print("Built with CUDA:", tf.test.is_built_with_cuda())

If "GPUs visible" returns an empty list and you expected one, your driver/CUDA versions don't match the TF wheel. The mapping table on tensorflow.org/install/source#gpu is authoritative.[11]
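
One optional setup tweak: by default TensorFlow reserves most of the GPU's memory up front. If you share the GPU with other processes, you can make allocation grow on demand instead:

# Must run before any op touches the GPU
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)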

Google Colab (zero-install path)

Skip everything above and open a notebook at colab.research.google.com. TensorFlow is pre-installed; switch to a GPU runtime via Runtime → Change runtime type → GPU. Then:

import tensorflow as tf
tf.config.list_physical_devices("GPU")
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Colab is the friction-free way to follow this guide. Every example below runs there unchanged.


TensorFlow Fundamentals

Tensors

A tensor is an n-dimensional array with a dtype (e.g. float32) and a shape (e.g. (32, 224, 224, 3)). It's the only data structure TensorFlow operations consume.

import tensorflow as tf

# Scalar (rank 0)
scalar = tf.constant(3.14)

# Vector (rank 1)
vector = tf.constant([1.0, 2.0, 3.0])

# Matrix (rank 2)
matrix = tf.constant([[1, 2, 3],
                      [4, 5, 6]], dtype=tf.float32)

# 4D tensor — typical image batch (batch, height, width, channels)
images = tf.zeros((32, 224, 224, 3), dtype=tf.float32)

print(matrix.shape, matrix.dtype)
# (2, 3) <dtype: 'float32'>

Element-wise math, reductions, and broadcasting work the way they do in NumPy:

a = tf.constant([1.0, 2.0, 3.0])
b = tf.constant([10.0, 20.0, 30.0])

a + b          # [11., 22., 33.]
a * b          # [10., 40., 90.]
tf.reduce_sum(a * b)  # 140.0  (dot product)
tf.linalg.matmul(matrix, tf.transpose(matrix))  # 2x2 result

Two things tensors have that NumPy arrays don't — both demonstrated below:

  1. Tensors are immutable. Every operation produces a new tensor. To mutate, use tf.Variable (next section).
  2. Tensors live on a device (CPU or GPU). They get placed automatically. tensor.device tells you where, and with tf.device("/GPU:0"): forces placement.
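
Both are easy to see in a couple of lines:

a = tf.constant([1.0, 2.0])
b = a + 1.0          # a new tensor; a itself is unchanged (and a[0] = 5.0 would raise TypeError)

print(b.device)      # e.g. '/job:localhost/replica:0/task:0/device:GPU:0' if a GPU is visible
with tf.device("/CPU:0"):
    c = tf.ones((2, 2))
print(c.device)      # pinned to the CPU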

→ Full tensor reference: tensorflow.org/api_docs/python/tf/Tensor [1]

Variables

tf.Variable is mutable state — used for model weights and any value that gets updated during training.

# Initialize from a tensor
w = tf.Variable(tf.random.normal((4, 3)), name="weights")

# Update in place — assign(), assign_add(), assign_sub()
w.assign(tf.zeros_like(w))     # set to zeros
w.assign_add(tf.ones_like(w))  # increment by 1

# Read like a tensor
print(w.numpy())

When you build a Keras layer, the layer's parameters are tf.Variable objects under the hood. You rarely create them by hand; the framework does.
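
You can see this on any built layer:

dense = tf.keras.layers.Dense(3)
dense.build(input_shape=(None, 4))     # creates the kernel and bias Variables

print(type(dense.kernel))              # a ResourceVariable, i.e. a tf.Variable
print([v.shape for v in dense.trainable_variables])   # [(4, 3), (3,)]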

tensorflow.org/guide/variable

Eager vs. graph execution

TensorFlow 2 runs eagerly by default: every Python line executes immediately and returns a tensor you can print(). This is what made TF 2 feel like a normal Python library — TF 1's build-then-run graph mode was notoriously hard to debug.[12]

But eager mode has overhead. For production training, you wrap a function with @tf.function and TF traces it into a graph the first time it's called, then replays the graph on every subsequent call:

@tf.function
def train_step(x, y, model, loss_fn, optimizer):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

The same function works without @tf.function (eager) or with it (graph). Use eager while debugging; add @tf.function once it works to get the speedup.
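
If you need to poke at a function that's already decorated, you can temporarily force everything eager — a handy debugging toggle:

# Run all @tf.function-decorated functions eagerly (slow, but breakpoints and print() work)
tf.config.run_functions_eagerly(True)

# ... debug ...

tf.config.run_functions_eagerly(False)   # restore graph execution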

tensorflow.org/guide/intro_to_graphs [12]


Building Models with Keras

Keras is the high-level API in tf.keras. It has three styles, each with a clear sweet spot.

1. Sequential — for linear stacks

Use when your model is a list of layers, one after another. Most beginner-level networks fit this shape.

import tensorflow as tf
from tensorflow.keras import layers, Sequential

model = Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPool2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPool2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.summary()

tensorflow.org/guide/keras/sequential_model [2]

2. Functional — for branches, multi-input, residual connections

Use when your model has more than one input, multiple outputs, or a layer feeds two paths (skip connections, branching).

inputs = tf.keras.Input(shape=(224, 224, 3))

x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPool2D()(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)

# Skip connection: pool the input, project it to 64 channels, add it back
# (padding="same" above keeps both paths at 112x112 so Add() works)
skip = layers.Conv2D(64, 1)(layers.MaxPool2D()(inputs))
x = layers.Add()([x, skip])

x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs, name="mini_resnet")
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

tensorflow.org/guide/keras/functional [3]

3. Subclassing — for full control

Use when your forward pass has dynamic structure (e.g. a loop whose length depends on input data). It's the most flexible and the hardest to debug — reach for it last.

class TwoTowerModel(tf.keras.Model):
    def __init__(self, num_classes):
        super().__init__()
        self.text_dense = layers.Dense(64, activation="relu")
        self.image_dense = layers.Dense(64, activation="relu")
        self.combine = layers.Concatenate()
        self.classifier = layers.Dense(num_classes, activation="softmax")

    def call(self, inputs, training=False):
        text, image = inputs
        t = self.text_dense(text)
        i = self.image_dense(image)
        return self.classifier(self.combine([t, i]))


model = TwoTowerModel(num_classes=5)

tensorflow.org/guide/keras/custom_layers_and_models

Picking between the three

Model shape                                      API
Linear stack of layers                           Sequential
Multi-input / multi-output / skip connections    Functional
Loops, conditionals, dynamic shape in call()     Subclassing

Default to Functional. It handles everything Sequential can plus most of what Subclassing can, while still serializing cleanly to a SavedModel.


Working with Data: tf.data

Loading data is where most TensorFlow projects waste time. The single biggest performance lever is using tf.data end-to-end and never feeding NumPy arrays directly to model.fit() for anything larger than a toy dataset.[4]

From in-memory arrays

import numpy as np

X = np.random.rand(10000, 32).astype("float32")
y = np.random.randint(0, 10, size=10000)

ds = (tf.data.Dataset.from_tensor_slices((X, y))
      .shuffle(buffer_size=10000, seed=42)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))

for batch_x, batch_y in ds.take(1):
    print(batch_x.shape, batch_y.shape)
# (32, 32) (32,)

From files (images)

ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",
    image_size=(224, 224),
    batch_size=32,
    label_mode="int",
)

# Always finish with prefetch — overlaps data prep with training step
ds = ds.prefetch(tf.data.AUTOTUNE)

From the catalog: tensorflow_datasets

import tensorflow_datasets as tfds

(train_ds, val_ds), info = tfds.load(
    "cifar10",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

def normalize(image, label):
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

train_ds = (train_ds
            .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
            .cache()
            .shuffle(10000)
            .batch(64)
            .prefetch(tf.data.AUTOTUNE))

val_ds = (val_ds
          .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
          .batch(64)
          .cache()
          .prefetch(tf.data.AUTOTUNE))

→ TFDS catalog: tensorflow.org/datasets/catalog/overview [13]

The pipeline rules of thumb

  1. .shuffle(buffer) before .batch() so each batch is mixed.
  2. .cache() if the data fits in memory after preprocessing — saves re-doing .map() every epoch.
  3. .map(..., num_parallel_calls=tf.data.AUTOTUNE) to parallelize preprocessing across CPU cores.
  4. .prefetch(tf.data.AUTOTUNE) as the last op — overlaps the next batch's prep with the current batch's training step.
  5. Augment inside the model (using layers.RandomFlip, layers.RandomRotation, etc.), not in the dataset, so augmentation runs on the GPU and only during training.

→ Performance guide: tensorflow.org/guide/data_performance [5]


Training and Evaluation

model.fit() and callbacks

callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=5,
        restore_best_weights=True,
    ),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss",
        factor=0.5,
        patience=2,
        min_lr=1e-6,
    ),
    tf.keras.callbacks.TensorBoard(log_dir="./logs"),
    tf.keras.callbacks.ModelCheckpoint(
        filepath="best_model.keras",
        save_best_only=True,
        monitor="val_loss",
    ),
]

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,
    callbacks=callbacks,
    verbose=2,
)

Three patterns that pay off every time:

  • Always use EarlyStopping with restore_best_weights=True. Otherwise you keep whichever weights the model had at the last epoch — usually overfit.
  • Always log to TensorBoard. It's free and lets you compare runs visually.
  • Always save the best checkpoint (not the last one). The two diverge by epoch 5–10 in any non-trivial run.

tensorflow.org/api_docs/python/tf/keras/callbacks

Custom training with tf.GradientTape

When model.fit() doesn't fit — multiple optimizers, GAN-style adversarial loops, custom gradient manipulation, RL — write your own training step.

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
train_acc = tf.keras.metrics.SparseCategoricalAccuracy()


@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        probs = model(x, training=True)   # the model ends in softmax, so these are probabilities
        loss = loss_fn(y, probs)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    train_acc.update_state(y, probs)
    return loss


for epoch in range(10):
    train_acc.reset_state()
    for x, y in train_ds:
        loss = train_step(x, y)
    print(f"Epoch {epoch+1}: loss={loss.numpy():.4f} acc={train_acc.result().numpy():.4f}")

Two things to know:

  • The @tf.function decorator is what makes this fast. Without it, you're back to eager mode and lose ~10× throughput on a GPU.
  • tape.gradient must be called before the tape goes out of scope; once the with block ends, the tape is gone — see the sketch below.
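
A corollary of that second point: a non-persistent tape can compute gradients only once. If you need several gradient calls from one forward pass — say, to log two quantities — create the tape with persistent=True. A minimal sketch:

x = tf.Variable(3.0)
with tf.GradientTape(persistent=True) as tape:
    y = x * x
    z = y * y

print(tape.gradient(y, x))   # 2x   -> 6.0
print(tape.gradient(z, x))   # 4x^3 -> 108.0
del tape                     # free the tape's resources once you're done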

tensorflow.org/api_docs/python/tf/GradientTape [6]

Mixed precision

Mixed precision keeps weights in float32 (for stability) but does the math in float16 or bfloat16 (for speed and memory). On Ampere-or-newer NVIDIA GPUs, it's roughly a 2–3× training speedup with negligible accuracy loss.[14]

from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy("mixed_float16")
# Build the model AFTER setting the policy

# Output layer should stay in float32 for numerical stability:
outputs = layers.Dense(10, activation="softmax", dtype="float32")(x)

That's the whole change. The framework handles loss scaling automatically when you compile with the default model.compile(optimizer="adam", ...) — it wraps the optimizer with LossScaleOptimizer for you.
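
In a custom tf.GradientTape loop, though, you do the wrapping and scaling yourself. A sketch, reusing the model and loss_fn from the custom-loop section above:

optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam(1e-3))

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = loss_fn(y, preds)
        scaled_loss = optimizer.get_scaled_loss(loss)        # scale up so float16 gradients don't underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)   # undo the scaling before applying
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss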

tensorflow.org/guide/mixed_precision [14]


Project: CIFAR-10 Image Classifier with Transfer Learning

A complete runnable example. Open Colab, paste this in, switch to a GPU runtime, and you'll have a 90%+ test accuracy classifier in about 5 minutes.

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import layers

# 1. Data
(train_ds, test_ds), info = tfds.load(
    "cifar10",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

NUM_CLASSES = info.features["label"].num_classes
IMG_SIZE = 224  # MobileNetV2's preferred input size
BATCH = 64

def preprocess(image, label):
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    image = tf.keras.applications.mobilenet_v2.preprocess_input(image)
    return image, label

train_ds = (train_ds
            .cache()   # cache the raw 32x32 images; the resized 224x224 floats would not fit in memory
            .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(10000)
            .batch(BATCH)
            .prefetch(tf.data.AUTOTUNE))

test_ds = (test_ds
           .cache()
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(BATCH)
           .prefetch(tf.data.AUTOTUNE))

# 2. Model — frozen MobileNetV2 backbone + new classifier head
base = tf.keras.applications.MobileNetV2(
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
    include_top=False,
    weights="imagenet",
)
base.trainable = False

inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),
])(inputs)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# 3. Train head
history = model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=5,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
    ],
)

# 4. Fine-tune: unfreeze the top 30 layers of the backbone, very small LR
base.trainable = True
for layer in base.layers[:-30]:
    layer.trainable = False

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history_ft = model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=5,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
    ],
)

# 5. Evaluate and save
loss, acc = model.evaluate(test_ds, verbose=0)
print(f"Test accuracy: {acc:.4f}")

model.save("cifar10_mobilenet.keras")

Why this works:

  • Training with the backbone frozen first fits only the classifier head on general-purpose ImageNet features — fast and stable.
  • Unfreezing and fine-tuning at 1e-5 then lets the upper layers specialize without destroying the pretrained features.
  • Augmentation lives in the model (RandomFlip, RandomRotation) so it runs on the GPU and only during training (training=True).

→ Transfer learning guide: tensorflow.org/guide/keras/transfer_learning


Distributed Training

The TF distribution API lets you scale training across devices and machines without changing your model code. You wrap construction in a strategy scope; the framework handles the rest.[7]

Multi-GPU on a single machine: MirroredStrategy

strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")

with strategy.scope():
    model = build_my_model()         # same code as before
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Important: scale the global batch size with the number of replicas
GLOBAL_BATCH = 64 * strategy.num_replicas_in_sync
ds = ds.batch(GLOBAL_BATCH).prefetch(tf.data.AUTOTUNE)

model.fit(ds, epochs=10)

Multi-host: MultiWorkerMirroredStrategy

Same pattern, plus a TF_CONFIG environment variable on each worker that lists every worker's address and the index of the current one:

# On worker 0:
export TF_CONFIG='{
  "cluster": {"worker": ["worker0.example:12345", "worker1.example:12345"]},
  "task": {"type": "worker", "index": 0}
}'
python train.py

# On worker 1: same JSON, but "index": 1

Then train.py is identical on every worker:

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = build_my_model()
    model.compile(...)
model.fit(...)

TPU: TPUStrategy

On Colab, switch to a TPU runtime, then:

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

Strategy selection

Hardware                               Strategy
Single GPU                             None — TensorFlow uses it automatically
Multiple GPUs, one machine             MirroredStrategy
Multiple machines                      MultiWorkerMirroredStrategy
Cloud TPU                              TPUStrategy
Massive embedding tables (recsys)      ParameterServerStrategy

tensorflow.org/guide/distributed_training [7]


Going to Production

The SavedModel format is the universal export. Both TF Serving and TF Lite consume it.[8][9]

Export

# Keras-native (preferred for TF 2.x):
model.save("export/model.keras")          # single-file format

# Or the SavedModel directory format (what TF Serving expects):
model.export("export/saved_model")

model.export() produces a directory like:

export/saved_model/
├── saved_model.pb
├── variables/
│   ├── variables.data-00000-of-00001
│   └── variables.index
└── assets/

Reload with:

loaded = tf.saved_model.load("export/saved_model")
predictions = loaded.serve(input_tensor)   # "serve" is the default endpoint model.export() registers

TF Serving (HTTP/gRPC servers)

Run TF Serving in a Docker container, mount your SavedModel directory, hit it over REST.

# Save the model to a versioned directory: export/saved_model/1/...
docker pull tensorflow/serving

docker run -p 8501:8501 \
  --mount type=bind,source="$(pwd)/export/saved_model",target=/models/my_model \
  -e MODEL_NAME=my_model \
  -t tensorflow/serving

Send a prediction:

curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'

tensorflow.org/tfx/serving/serving_basic [9]

TF Lite (mobile and edge)

converter = tf.lite.TFLiteConverter.from_saved_model("export/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # dynamic-range quantization by default

# For full int8 (smallest + fastest), provide a representative dataset
# so the converter can calibrate activation ranges:
def representative_dataset():
    for images, _ in train_ds.take(100):
        yield [images]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

Loading on-device (Python is also one of the supported runtimes):

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]["index"])

The same .tflite file deploys to Android (via the AAR), iOS (CocoaPods), Raspberry Pi, and microcontrollers (TF Lite Micro, a separate runtime).

tensorflow.org/lite/guide [10]

TF.js (browser inference)

pip install tensorflowjs
tensorflowjs_converter \
  --input_format=tf_saved_model \
  export/saved_model \
  web_model

Then in the browser:

import * as tf from '@tensorflow/tfjs';
const model = await tf.loadGraphModel('/web_model/model.json');
const prediction = model.predict(tf.tensor(input));

tensorflow.org/js


TensorFlow Ecosystem at a Glance

You'll meet these in real projects:

Library                       What it's for
tf.keras                      The standard model API. You'll use it daily.
tf.data                       Dataset pipelines. Same.
tensorflow_datasets (TFDS)    Catalog of pre-prepared public datasets (MNIST, CIFAR, ImageNet, GLUE, COCO, etc.).[13]
tensorflow_hub                Pre-trained models you can drop into a Keras model with one line.
tf.distribute                 Multi-GPU, multi-host, TPU training without changing model code.[7]
TFX                           End-to-end production pipelines: validation, training, serving, monitoring.[15]
TensorBoard                   Training visualization, profiling, embedding projections.
tf.lite                       On-device inference (mobile, embedded, edge).[10]
tensorflow_serving            HTTP/gRPC server for batch and online inference.[9]
tensorflow_js                 Run models in the browser.

You don't need all of them. Most projects ship with tf.keras + tf.data + one of tf.lite / tensorflow_serving for production.


Common Patterns and Pitfalls

A few things that trip up everyone the first time.

Don't feed NumPy arrays to model.fit() for anything serious

It works for toy examples but the GPU will sit at 30% utilization. Wrap the arrays in a tf.data.Dataset and add .prefetch(tf.data.AUTOTUNE).

training=True matters

Some layers (Dropout, BatchNormalization) behave differently during training vs. inference. model.fit() handles this for you. In a custom training loop, always pass training=True to model(...) during training and False during evaluation.
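
In a custom loop, for example, the evaluation step is the mirror image of the training step with the flag flipped (val_acc here stands for any metric you've created, like train_acc earlier):

@tf.function
def eval_step(x, y):
    preds = model(x, training=False)   # Dropout off; BatchNorm uses its moving averages
    val_acc.update_state(y, preds)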

tf.function retraces are expensive

Every time you call a @tf.function-decorated function with a new input shape or dtype, TF traces a new graph. If your batch size varies, pad to a fixed shape or use input_signature to fix the trace.
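
Pinning the signature looks like this — a sketch assuming a model that takes (batch, 32) float32 inputs; the None batch dimension lets every batch size reuse one trace:

@tf.function(input_signature=[tf.TensorSpec(shape=(None, 32), dtype=tf.float32)])
def predict(x):
    return model(x, training=False)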

Save the .keras file, not just the weights

model.save_weights(...) only saves the parameters. model.save("file.keras") saves architecture + weights + optimizer state — the only one you can reload without rebuilding the model.
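
The round trip, for reference:

model.save("file.keras")
restored = tf.keras.models.load_model("file.keras")   # architecture, weights, optimizer state all come back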

Don't tf.constant in a loop

tf.constant allocates a new tensor on every call. If you're calling it inside a training loop with the same value, hoist it out — or pass the value as a function argument.
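
A toy illustration of the hoist:

xs = tf.random.normal((1000, 8))

# Bad: a fresh scalar tensor is allocated on every iteration
total = tf.zeros(8)
for i in range(1000):
    total += xs[i] * tf.constant(2.0)

# Good: create the constant once, reuse it inside the loop
two = tf.constant(2.0)
total = tf.zeros(8)
for i in range(1000):
    total += xs[i] * two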

Use tf.device sparingly

Manual device placement (with tf.device("/GPU:0")) is rarely needed in TF 2 — automatic placement is reliable. Reach for it only when you're squeezing the last 10% out of a multi-GPU pipeline.


The Bottom Line

TensorFlow rewards a small set of habits and punishes deviation:

  1. Build with Keras Functional, fall back to subclassing only when you need dynamic structure.
  2. Pipe data with tf.data + AUTOTUNE prefetch, never NumPy arrays.
  3. Train with model.fit() + callbacks, drop into tf.GradientTape only for genuinely custom needs.
  4. Use @tf.function for any hot path — it's a 5–10× speedup for free.
  5. Export to SavedModel, then deploy through TF Serving or TF Lite.
  6. Trust the official guides — TensorFlow's documentation is unusually good. Every link in this article goes to a page worth reading.

Once those six are second nature, the rest of the framework — distributed training, mixed precision, custom layers, TFX pipelines — is incremental. You're already most of the way there.

Now go open Colab and train something.



Footnotes

  1. TensorFlow API reference — tensorflow.org/api_docs/python/tf. Authoritative source for every public symbol mentioned in this guide.

  2. TensorFlow guide — The Sequential model.

  3. TensorFlow guide — The Functional API.

  4. TensorFlow guide — tf.data: Build TensorFlow input pipelines.

  5. TensorFlow guide — Better performance with the tf.data API.

  6. TensorFlow API reference — tf.GradientTape.

  7. TensorFlow guide — Distributed training with TensorFlow.

  8. TensorFlow guide — Using the SavedModel format.

  9. TensorFlow Serving — Architecture overview and serving basics.

  10. TensorFlow Lite — LiteRT (TF Lite) developer guide.

  11. TensorFlow install — Pip install instructions, platform support, and the version-compatibility table.

  12. TensorFlow guide — Introduction to graphs and tf.function.

  13. TensorFlow Datasets — Catalog overview and the TFDS API reference.

  14. TensorFlow guide — Mixed precision.

  15. TFX — The TFX user guide.

Frequently Asked Questions

Q: Should I learn TensorFlow or PyTorch?

A: Both are excellent. TensorFlow has the edge for production deployment (TF Serving, TF Lite, TFX), mobile (TF Lite is more mature than PyTorch Mobile/ExecuTorch), and TPUs (TF is first-class on Google Cloud TPUs). PyTorch has the edge for research velocity and is more popular in academia. If you're unsure, pick the one your team uses.
