TensorFlow Guide: From Zero to Hero (2026 Edition)
May 3, 2026
TL;DR
- TensorFlow 2 is eager-by-default, Keras-first, and the same code that runs on a laptop scales to multi-GPU and production. [1]
- Build models with Keras Sequential (linear stacks), Functional (branches and shared layers), or by subclassing tf.keras.Model for full control. [2][3]
- Pipe data with tf.data and always end the pipeline with .prefetch(tf.data.AUTOTUNE) — feeding NumPy arrays directly leaves GPUs idle. [4][5]
- Train with model.fit() for 90% of cases. Reach for tf.GradientTape only when you need a non-standard loss, multiple optimizers, or custom logging. [6]
- Distribute with tf.distribute.MirroredStrategy (single host, multi-GPU) or MultiWorkerMirroredStrategy (multi-host) — same model code, just wrap construction in with strategy.scope():. [7]
- Ship through a SavedModel: TF Serving for servers, TF Lite for mobile/edge, TF.js for browsers. [8][9][10]
- This guide assumes TensorFlow 2.x on Python 3.10–3.12. Check current API docs at tensorflow.org/api_docs for the exact version.
What You'll Learn
- Set up TensorFlow locally or in Google Colab and verify GPU acceleration in one line.
- Manipulate tensors — the core data structure — and understand the eager-vs-graph trade-off.
- Build neural networks with all three Keras APIs and pick the right one for the job.
- Stream large datasets efficiently with tf.data so the GPU stays saturated.
- Train and evaluate models with model.fit() plus callbacks for early stopping, learning-rate schedules, and TensorBoard logging.
- Write custom training loops with tf.GradientTape when the high-level API isn't enough.
- Scale training across multiple GPUs and machines using tf.distribute.
- Export and serve models in production through TF Serving (servers) or TF Lite (devices).
Each section ends with a runnable code block and a link to the relevant page in the official TensorFlow documentation.
Prerequisites
You'll get the most out of this guide if you have:
- Python: comfortable with functions, classes, list comprehensions, and basic NumPy. Python 3.10+ recommended; TensorFlow 2.x supports Python 3.9–3.12. [11]
- Math: linear algebra (vectors, matrices, dot products), calculus (gradients, partial derivatives), and basic probability. You don't need a PhD; you need to know what a gradient is and why it points uphill.
- Machine learning basics: training/validation/test split, what overfitting looks like, why we use a held-out test set. If those terms are new, work through Andrew Ng's Machine Learning course on Coursera before this guide — it's still the best free intro.
You don't need a GPU to start. Google Colab gives you a free Tesla T4 for short sessions, and most of this guide's examples run there in under a minute.
Setting Up TensorFlow
Local install
# Create a clean virtual environment first
python3 -m venv tf-env
source tf-env/bin/activate # Windows: tf-env\Scripts\activate
# Install TensorFlow (latest stable)
pip install --upgrade pip
pip install tensorflow
# For GPU support on Linux (CUDA-enabled wheel includes the GPU pieces)
pip install tensorflow[and-cuda]
A plain pip install tensorflow ships the CPU-only wheel on macOS (Apple Silicon uses the Metal plugin, installed separately); on Linux, appending [and-cuda] pulls in the CUDA pieces for GPU support. [11] On Windows, native GPU support was dropped in TF 2.11 — use WSL2 + the Linux wheel for CUDA.
Verify the install
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
print("Built with CUDA:", tf.test.is_built_with_cuda())
If GPUs visible returns an empty list and you expected one, your driver/CUDA versions don't match the TF wheel. The mapping table on tensorflow.org/install/source#gpu is authoritative. [11]
Google Colab (zero-install path)
Skip everything above and open a notebook at colab.research.google.com. TensorFlow is pre-installed; switch to a GPU runtime via Runtime → Change runtime type → GPU. Then:
import tensorflow as tf
tf.config.list_physical_devices("GPU")
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Colab is the friction-free way to follow this guide. Every example below runs there unchanged.
TensorFlow Fundamentals
Tensors
A tensor is an n-dimensional array with a dtype (e.g. float32) and a shape (e.g. (32, 224, 224, 3)). It's the only data structure TensorFlow operations consume.
import tensorflow as tf
# Scalar (rank 0)
scalar = tf.constant(3.14)
# Vector (rank 1)
vector = tf.constant([1.0, 2.0, 3.0])
# Matrix (rank 2)
matrix = tf.constant([[1, 2, 3],
[4, 5, 6]], dtype=tf.float32)
# 4D tensor — typical image batch (batch, height, width, channels)
images = tf.zeros((32, 224, 224, 3), dtype=tf.float32)
print(matrix.shape, matrix.dtype)
# (2, 3) <dtype: 'float32'>
Element-wise math, reductions, and broadcasting work the way they do in NumPy:
a = tf.constant([1.0, 2.0, 3.0])
b = tf.constant([10.0, 20.0, 30.0])
a + b # [11., 22., 33.]
a * b # [10., 40., 90.]
tf.reduce_sum(a * b) # 140.0 (dot product)
tf.linalg.matmul(matrix, tf.transpose(matrix)) # 2x2 result
Two things to know that NumPy doesn't have:
- Tensors are immutable. Every operation produces a new tensor. To mutate, use tf.Variable (next section).
- Tensors live on a device (CPU or GPU). They get placed automatically. tensor.device tells you where, and with tf.device(...): forces placement — see the sketch below.
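A quick way to see placement in action — a minimal sketch (the exact device string you get depends on your hardware):
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(x.device)  # e.g. ".../device:GPU:0" if a GPU is visible, else ".../device:CPU:0"

# Pin an op to the CPU — occasionally useful when a tensor won't fit in GPU memory
with tf.device("/CPU:0"):
    y = tf.reduce_sum(x)
print(y.device)  # ".../device:CPU:0"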
→ Full tensor reference: tensorflow.org/api_docs/python/tf/Tensor [1]
Variables
tf.Variable is mutable state — used for model weights and any value that gets updated during training.
# Initialize from a tensor
w = tf.Variable(tf.random.normal((4, 3)), name="weights")
# Update in place — assign(), assign_add(), assign_sub()
w.assign(tf.zeros_like(w)) # set to zeros
w.assign_add(tf.ones_like(w)) # increment by 1
# Read like a tensor
print(w.numpy())
When you build a Keras layer, the layer's parameters are tf.Variable objects under the hood. You rarely create them by hand; the framework does.
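You can verify that for yourself — a minimal sketch that builds a Dense layer and inspects its weights:
dense = tf.keras.layers.Dense(3)
dense.build((None, 4))  # force weight creation without calling the layer

for v in dense.weights:
    # Each entry — the kernel and the bias — is a tf.Variable
    print(v.name, v.shape, isinstance(v, tf.Variable))  # prints True for both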
→ tensorflow.org/guide/variable
Eager vs. graph execution
TensorFlow 2 runs eagerly by default: every Python line executes immediately and returns a tensor you can print(). This is what made TF 2 a usable library — TF 1's lazy graph mode was notorious. [12]
But eager mode has overhead. For production training, you wrap a function with @tf.function and TF traces it into a graph the first time it's called, then replays the graph on every subsequent call:
@tf.function
def train_step(x, y, model, loss_fn, optimizer):
with tf.GradientTape() as tape:
predictions = model(x, training=True)
loss = loss_fn(y, predictions)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
return loss
The same function works without @tf.function (eager) or with it (graph). Use eager while debugging; add @tf.function once it works to get the speedup.
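One debugging trick worth knowing: tf.config.run_functions_eagerly(True) disables graph tracing globally, so you can step through decorated code with print() or pdb without touching the decorator. A minimal sketch:
import tensorflow as tf

@tf.function
def square_sum(a, b):
    total = a * a + b * b
    tf.print("inside:", total)  # tf.print works in both eager and graph mode
    return total

square_sum(tf.constant(2.0), tf.constant(3.0))  # traced once, then run as a graph

# Flip to eager to debug line by line, then flip back for speed
tf.config.run_functions_eagerly(True)
square_sum(tf.constant(2.0), tf.constant(3.0))  # plain Python execution
tf.config.run_functions_eagerly(False)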
→ tensorflow.org/guide/intro_to_graphs [12]
Building Models with Keras
Keras is the high-level API in tf.keras. It has three styles, each with a clear sweet spot.
1. Sequential — for linear stacks
Use when your model is a list of layers, one after another. Most beginner-level networks fit this shape.
import tensorflow as tf
from tensorflow.keras import layers, Sequential
model = Sequential([
layers.Input(shape=(28, 28, 1)),
layers.Conv2D(32, 3, activation="relu"),
layers.MaxPool2D(),
layers.Conv2D(64, 3, activation="relu"),
layers.MaxPool2D(),
layers.Flatten(),
layers.Dense(128, activation="relu"),
layers.Dropout(0.3),
layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
loss="sparse_categorical_crossentropy",
metrics=["accuracy"])
model.summary()
→ tensorflow.org/guide/keras/sequential_model [2]
2. Functional — for branches, multi-input, residual connections
Use when your model has more than one input, multiple outputs, or a layer feeds two paths (skip connections, branching).
inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPool2D()(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
# Skip connection: pool the input and project it to 64 channels so the shapes match
skip = layers.Conv2D(64, 1)(layers.MaxPool2D()(inputs))
x = layers.Add()([x, skip])
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs, name="mini_resnet")
model.compile(optimizer="adam",
loss="sparse_categorical_crossentropy",
metrics=["accuracy"])
→ tensorflow.org/guide/keras/functional [3]
3. Subclassing — for full control
Use when your forward pass has dynamic structure (e.g. a loop whose length depends on input data). It's the most flexible and the hardest to debug — reach for it last.
class TwoTowerModel(tf.keras.Model):
def __init__(self, num_classes):
super().__init__()
self.text_dense = layers.Dense(64, activation="relu")
self.image_dense = layers.Dense(64, activation="relu")
self.combine = layers.Concatenate()
self.classifier = layers.Dense(num_classes, activation="softmax")
def call(self, inputs, training=False):
text, image = inputs
t = self.text_dense(text)
i = self.image_dense(image)
return self.classifier(self.combine([t, i]))
model = TwoTowerModel(num_classes=5)
→ tensorflow.org/guide/keras/custom_layers_and_models
Picking between the three
| Model shape | API |
|---|---|
| Linear stack of layers | Sequential |
| Multi-input / multi-output / skip connections | Functional |
| Loops, conditionals, dynamic shape in call() | Subclassing |
Default to Functional. It handles everything Sequential can plus most of what Subclassing can, while still serializing cleanly to a SavedModel.
Working with Data: tf.data
Loading data is where most TensorFlow projects waste time. The single biggest performance lever is using tf.data end-to-end and never feeding NumPy arrays directly to model.fit() for anything larger than a toy dataset. [4]
From in-memory arrays
import numpy as np
X = np.random.rand(10000, 32).astype("float32")
y = np.random.randint(0, 10, size=10000)
ds = (tf.data.Dataset.from_tensor_slices((X, y))
.shuffle(buffer_size=10000, seed=42)
.batch(32)
.prefetch(tf.data.AUTOTUNE))
for batch_x, batch_y in ds.take(1):
print(batch_x.shape, batch_y.shape)
# (32, 32) (32,)
From files (images)
ds = tf.keras.utils.image_dataset_from_directory(
"data/train",
image_size=(224, 224),
batch_size=32,
label_mode="int",
)
# Always finish with prefetch — overlaps data prep with training step
ds = ds.prefetch(tf.data.AUTOTUNE)
From the catalog: tensorflow_datasets
import tensorflow_datasets as tfds
(train_ds, val_ds), info = tfds.load(
"cifar10",
split=["train", "test"],
as_supervised=True,
with_info=True,
)
def normalize(image, label):
image = tf.cast(image, tf.float32) / 255.0
return image, label
train_ds = (train_ds
.map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
.cache()
.shuffle(10000)
.batch(64)
.prefetch(tf.data.AUTOTUNE))
val_ds = (val_ds
.map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
.batch(64)
.cache()
.prefetch(tf.data.AUTOTUNE))
→ TFDS catalog: tensorflow.org/datasets/catalog/overview [13]
The pipeline rules of thumb
- .shuffle(buffer) before .batch() so each batch is mixed.
- .cache() if the data fits in memory after preprocessing — saves re-doing .map() every epoch.
- .map(..., num_parallel_calls=tf.data.AUTOTUNE) to parallelize preprocessing across CPU cores.
- .prefetch(tf.data.AUTOTUNE) as the last op — overlaps the next batch's prep with the current batch's training step.
- Augment inside the model (using layers.RandomFlip, layers.RandomRotation, etc.), not in the dataset, so augmentation runs on the GPU and only during training — see the sketch below.
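Here's what that last rule looks like in practice — a minimal sketch of augmentation layers living at the front of the model (the CIFAR-10 project later in this guide uses the same pattern):
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

inputs = tf.keras.Input(shape=(32, 32, 3))
x = augment(inputs)  # active only when training=True; a no-op at inference
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)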
→ Performance guide: tensorflow.org/guide/data_performance [5]
Training and Evaluation
model.fit() and callbacks
callbacks = [
tf.keras.callbacks.EarlyStopping(
monitor="val_loss",
patience=5,
restore_best_weights=True,
),
tf.keras.callbacks.ReduceLROnPlateau(
monitor="val_loss",
factor=0.5,
patience=2,
min_lr=1e-6,
),
tf.keras.callbacks.TensorBoard(log_dir="./logs"),
tf.keras.callbacks.ModelCheckpoint(
filepath="best_model.keras",
save_best_only=True,
monitor="val_loss",
),
]
history = model.fit(
train_ds,
validation_data=val_ds,
epochs=50,
callbacks=callbacks,
verbose=2,
)
Three patterns that pay off every time:
- Always use EarlyStopping with restore_best_weights=True. Otherwise you keep whichever weights the model had at the last epoch — usually overfit.
- Always log to TensorBoard. It's free and lets you compare runs visually (launch command below).
- Always save the best checkpoint (not the last one). The two diverge by epoch 5–10 in any non-trivial run.
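Viewing the logs is one command — point TensorBoard at the directory the callback writes to (./logs in the example above):
# From a terminal — the tensorboard CLI ships with TensorFlow:
tensorboard --logdir ./logs

# Or inside Colab / Jupyter:
%load_ext tensorboard
%tensorboard --logdir ./logs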
→ tensorflow.org/api_docs/python/tf/keras/callbacks
Custom training with tf.GradientTape
When model.fit() doesn't fit — multiple optimizers, GAN-style adversarial loops, custom gradient manipulation, RL — write your own training step.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
train_acc = tf.keras.metrics.SparseCategoricalAccuracy()
@tf.function
def train_step(x, y):
with tf.GradientTape() as tape:
logits = model(x, training=True)
loss = loss_fn(y, logits)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
train_acc.update_state(y, logits)
return loss
for epoch in range(10):
train_acc.reset_state()
for x, y in train_ds:
loss = train_step(x, y)
print(f"Epoch {epoch+1}: loss={loss.numpy():.4f} acc={train_acc.result().numpy():.4f}")
Two things to know:
- The @tf.function decorator is what makes this fast. Without it, you're back to eager mode and lose ~10× throughput on a GPU.
- tape.gradient must be called before the tape goes out of scope; once the with block ends, the tape is gone — unless you create it with persistent=True, as in the sketch below.
→ tensorflow.org/api_docs/python/tf/GradientTape [6]
Mixed precision
Mixed precision keeps weights in float32 (for stability) but does the math in float16 or bfloat16 (for speed and memory). On Ampere-or-newer NVIDIA GPUs, it's roughly a 2–3× training speedup with negligible accuracy loss. [14]
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy("mixed_float16")
# Build the model AFTER setting the policy
# Output layer should stay in float32 for numerical stability:
outputs = layers.Dense(10, activation="softmax", dtype="float32")(x)
That's the whole change. The framework handles loss scaling automatically when you compile with the default model.compile(optimizer="adam", ...) — it wraps the optimizer with LossScaleOptimizer for you.
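In a custom tf.GradientTape loop, though, you wrap the optimizer yourself and scale the loss explicitly. A sketch reusing model and loss_fn from the custom-loop section above (API names per recent TF 2.x releases — check the mixed-precision guide for your version):
from tensorflow.keras import mixed_precision

optimizer = mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(1e-3))

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
        scaled_loss = optimizer.get_scaled_loss(loss)  # scale up to avoid float16 underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)  # undo the scaling
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss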
→ tensorflow.org/guide/mixed_precision [14]
Project: CIFAR-10 Image Classifier with Transfer Learning
A complete runnable example. Open Colab, paste this in, switch to a GPU runtime, and you'll have a 90%+ test accuracy classifier in about 5 minutes.
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import layers
# 1. Data
(train_ds, test_ds), info = tfds.load(
"cifar10",
split=["train", "test"],
as_supervised=True,
with_info=True,
)
NUM_CLASSES = info.features["label"].num_classes
IMG_SIZE = 224 # MobileNetV2's preferred input size
BATCH = 64
def preprocess(image, label):
image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
image = tf.keras.applications.mobilenet_v2.preprocess_input(image)
return image, label
# Cache the raw 32x32 images, not the resized 224x224 ones —
# caching after the resize would need ~30 GB of RAM for the train split
train_ds = (train_ds
            .cache()
            .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(10000)
            .batch(BATCH)
            .prefetch(tf.data.AUTOTUNE))
test_ds = (test_ds
           .cache()
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(BATCH)
           .prefetch(tf.data.AUTOTUNE))
# 2. Model — frozen MobileNetV2 backbone + new classifier head
base = tf.keras.applications.MobileNetV2(
input_shape=(IMG_SIZE, IMG_SIZE, 3),
include_top=False,
weights="imagenet",
)
base.trainable = False
inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = tf.keras.Sequential([
layers.RandomFlip("horizontal"),
layers.RandomRotation(0.05),
])(inputs)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer=tf.keras.optimizers.Adam(1e-3),
loss="sparse_categorical_crossentropy",
metrics=["accuracy"],
)
# 3. Train head
history = model.fit(
train_ds,
validation_data=test_ds,
epochs=5,
callbacks=[
tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
],
)
# 4. Fine-tune: unfreeze the top 30 layers of the backbone, very small LR
base.trainable = True
for layer in base.layers[:-30]:
layer.trainable = False
model.compile(
optimizer=tf.keras.optimizers.Adam(1e-5),
loss="sparse_categorical_crossentropy",
metrics=["accuracy"],
)
history_ft = model.fit(
train_ds,
validation_data=test_ds,
epochs=5,
callbacks=[
tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
],
)
# 5. Evaluate and save
loss, acc = model.evaluate(test_ds, verbose=0)
print(f"Test accuracy: {acc:.4f}")
model.save("cifar10_mobilenet.keras")
Why this works:
- Frozen backbone first: training only the classifier head against general-purpose ImageNet features is fast and stable.
- Unfreezing and fine-tuning at 1e-5 then lets the upper layers specialize without destroying the pretrained features.
- Augmentation lives in the model (RandomFlip, RandomRotation), so it runs on the GPU and only during training (training=True).
→ Transfer learning guide: tensorflow.org/guide/keras/transfer_learning
Distributed Training
The TF distribution API lets you scale training across devices and machines without changing your model code. You wrap construction in a strategy scope; the framework handles the rest. [7]
Multi-GPU on a single machine: MirroredStrategy
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")
with strategy.scope():
model = build_my_model() # same code as before
model.compile(optimizer="adam",
loss="sparse_categorical_crossentropy",
metrics=["accuracy"])
# Important: scale the global batch size with the number of replicas
GLOBAL_BATCH = 64 * strategy.num_replicas_in_sync
ds = ds.batch(GLOBAL_BATCH).prefetch(tf.data.AUTOTUNE)
model.fit(ds, epochs=10)
Multi-host: MultiWorkerMirroredStrategy
Same pattern, plus a TF_CONFIG environment variable on each worker that lists every worker's address and the index of the current one:
# On worker 0:
export TF_CONFIG='{
"cluster": {"worker": ["worker0.example:12345", "worker1.example:12345"]},
"task": {"type": "worker", "index": 0}
}'
python train.py
# On worker 1: same JSON, but "index": 1
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
model = build_my_model()
model.compile(...)
model.fit(...)
TPU: TPUStrategy
On Colab, switch to a TPU runtime, then:
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
Strategy selection
| Hardware | Strategy |
|---|---|
| Single GPU | None — TensorFlow uses it automatically |
| Multiple GPUs, one machine | MirroredStrategy |
| Multiple machines | MultiWorkerMirroredStrategy |
| Cloud TPU | TPUStrategy |
| Massive embedding tables (recsys) | ParameterServerStrategy |
→ tensorflow.org/guide/distributed_training [7]
Going to Production
The SavedModel format is the universal export. Both TF Serving and TF Lite consume it. [8][9]
Export
# Keras-native (preferred for TF 2.x):
model.save("export/model.keras") # single-file format
# Or the SavedModel directory format (what TF Serving expects):
model.export("export/saved_model")
model.export() produces a directory like:
export/saved_model/
├── saved_model.pb
├── variables/
│ ├── variables.data-00000-of-00001
│ └── variables.index
└── assets/
Reload with:
loaded = tf.saved_model.load("export/saved_model")
predictions = loaded(input_tensor)
TF Serving (HTTP/gRPC servers)
Run TF Serving in a Docker container, mount your SavedModel directory, hit it over REST.
# Save the model to a versioned directory: export/saved_model/1/...
docker pull tensorflow/serving
docker run -p 8501:8501 \
--mount type=bind,source="$(pwd)/export/saved_model",target=/models/my_model \
-e MODEL_NAME=my_model \
-t tensorflow/serving
Send a prediction:
curl -X POST http://localhost:8501/v1/models/my_model:predict \
-d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
→ tensorflow.org/tfx/serving/serving_basic [9]
TF Lite (mobile and edge)
converter = tf.lite.TFLiteConverter.from_saved_model("export/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic-range quantization by default
# For full int8 (smallest + fastest), provide a representative dataset:
def representative_dataset():
for sample in train_ds.take(100):
yield [sample[0]]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
f.write(tflite_model)
Loading on-device (Python is also one of the supported runtimes):
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]["index"])
The same .tflite file deploys to Android (via the AAR), iOS (CocoaPods), Raspberry Pi, and microcontrollers (TF Lite Micro, a separate runtime).
TF.js (browser inference)
pip install tensorflowjs
tensorflowjs_converter \
--input_format=tf_saved_model \
export/saved_model \
web_model
Then in the browser:
import * as tf from '@tensorflow/tfjs';
const model = await tf.loadGraphModel('/web_model/model.json');
const prediction = model.predict(tf.tensor(input));
TensorFlow Ecosystem at a Glance
You'll meet these in real projects:
| Library | What it's for |
|---|---|
| tf.keras | The standard model API. You'll use it daily. |
| tf.data | Dataset pipelines. Same. |
| tensorflow_datasets (TFDS) | Catalog of pre-prepared public datasets (MNIST, CIFAR, ImageNet, GLUE, COCO, etc.). [13] |
| tensorflow_hub | Pre-trained models you can drop into a Keras model with one line. |
| tf.distribute | Multi-GPU, multi-host, TPU training without changing model code. [7] |
| TFX | End-to-end production pipelines: validation, training, serving, monitoring. [15] |
| TensorBoard | Training visualization, profiling, embedding projections. |
| tf.lite | On-device inference (mobile, embedded, edge). [10] |
| tensorflow_serving | HTTP/gRPC server for batch and online inference. [9] |
| tensorflow_js | Run models in the browser. |
You don't need all of them. Most projects ship with tf.keras + tf.data + one of tf.lite / tensorflow_serving for production.
Common Patterns and Pitfalls
A few things that trip up everyone the first time.
Don't feed NumPy arrays to model.fit() for anything serious
It works for toy examples but the GPU will sit at 30% utilization. Wrap the arrays in a tf.data.Dataset and add .prefetch(tf.data.AUTOTUNE).
training=True matters
Some layers (Dropout, BatchNormalization) behave differently during training vs. inference. model.fit() handles this for you. In a custom training loop, always pass training=True to model(...) during training and False during evaluation.
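You can watch the difference directly — a minimal sketch with a Dropout layer:
import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 8))

print(drop(x, training=False).numpy())  # inference: identity — all ones
print(drop(x, training=True).numpy())   # training: ~half zeroed, survivors scaled by 2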
tf.function retraces are expensive
Every time you call a @tf.function-decorated function with a new input shape or dtype, TF traces a new graph. If your batch size varies, pad to a fixed shape or use input_signature to fix the trace.
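A minimal sketch of pinning the trace with input_signature — one graph handles every batch size:
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=[None, 32], dtype=tf.float32)])
def forward(x):
    return tf.reduce_sum(x, axis=-1)

forward(tf.zeros((8, 32)))   # traces once
forward(tf.zeros((64, 32)))  # different batch size, same graph — no retrace
# forward(tf.zeros((8, 16)))  # would raise: incompatible with the declared signature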
Save the .keras file, not just the weights
model.save_weights(...) only saves the parameters. model.save("file.keras") saves architecture + weights + optimizer state — the only one you can reload without rebuilding the model.
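The round trip is one line each way — assuming a compiled model from the earlier examples:
model.save("file.keras")  # architecture + weights + optimizer state
restored = tf.keras.models.load_model("file.keras")
restored.summary()  # identical architecture, ready to keep training or predict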
Don't tf.constant in a loop
tf.constant allocates a new tensor on every call. If you're calling it inside a training loop with the same value, hoist it out — or pass the value as a function argument.
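A before-and-after sketch of the hoist:
import tensorflow as tf

# Bad: allocates a fresh tensor on every call
def step_bad(x):
    return x * tf.constant(0.5)

# Good: create the constant once and let the function capture it
SCALE = tf.constant(0.5)
def step_good(x):
    return x * SCALE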
Use tf.device sparingly
Manual device placement (with tf.device("/GPU:0")) is rarely needed in TF 2 — automatic placement is reliable. Reach for it only when you're squeezing the last 10% out of a multi-GPU pipeline.
The Bottom Line
TensorFlow rewards a small set of habits and punishes deviation:
- Build with Keras Functional, fall back to subclassing only when you need dynamic structure.
- Pipe data with tf.data + AUTOTUNE prefetch, never NumPy arrays.
- Train with model.fit() + callbacks, drop into tf.GradientTape only for genuinely custom needs.
- Use @tf.function for any hot path — it's a 5–10× speedup for free.
- Export to SavedModel, then deploy through TF Serving or TF Lite.
- Trust the official guides — TensorFlow's documentation is unusually good. Every link in this article goes to a page worth reading.
Once those six are second nature, the rest of the framework — distributed training, mixed precision, custom layers, TFX pipelines — is incremental. You're already most of the way there.
Now go open Colab and train something.
Related reads
- Deep Learning Fundamentals: A Practical Guide to Neural Networks
- Mastering Scikit-learn: A Complete 2026 Tutorial for Machine Learning
- Mastering CNN Image Classification: From Basics to Production
- How GPUs Power the AI Revolution
Footnotes
1. TensorFlow API reference — tensorflow.org/api_docs/python/tf. Authoritative source for every public symbol mentioned in this guide.
2. TensorFlow guide — The Sequential model.
3. TensorFlow guide — The Functional API.
4. TensorFlow guide — tf.data: Build TensorFlow input pipelines.
5. TensorFlow guide — Better performance with the tf.data API.
6. TensorFlow API reference — tf.GradientTape.
7. TensorFlow guide — Distributed training with TensorFlow.
8. TensorFlow guide — Using the SavedModel format.
9. TensorFlow Serving — Architecture overview and serving basics.
10. TensorFlow Lite — LiteRT (TF Lite) developer guide.
11. TensorFlow install — Pip install instructions, platform support, and the version-compatibility table.
12. TensorFlow guide — Introduction to graphs and tf.function.
13. TensorFlow Datasets — Catalog overview and the TFDS API reference.
14. TensorFlow guide — Mixed precision.
15. TFX — The TFX user guide.