Data & Model Versioning with DVC

Versioning Models

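Models are the output of your ML experiments. Like datasets, they evolve: you'll train many versions before finding the best one. DVC versions models the same way it versions data: the model file goes into DVC's cache and remote storage, while Git tracks a small .dvc pointer file.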

Why Version Models?

Challenge          Solution
Reproducibility    Recreate any model version
Comparison         Compare metrics across versions
Rollback           Quickly revert to a working model
Audit              Track who trained what, when

Tracking Model Files

Single Model File

# Track a trained model
dvc add models/random_forest.pkl

# Creates models/random_forest.pkl.dvc (a small pointer file to commit to Git)
# Adds /random_forest.pkl to models/.gitignore so Git ignores the binary

Model with Artifacts

Many frameworks save multiple files:

# Example: a PyTorch model saved as a directory (weights, config, tokenizer)
dvc add models/pytorch_model/

# This tracks the entire directory:
# models/pytorch_model/
# ├── model.pt
# ├── config.json
# └── tokenizer/

Large Model Checkpoints

# Track training checkpoints
dvc add checkpoints/

# Directory structure:
# checkpoints/
# ├── epoch_10.pt
# ├── epoch_20.pt
# └── epoch_30.pt

Model Versioning Workflow

Complete Workflow Example

# 1. Train your model (Python script)
python train.py --epochs 50 --lr 0.001

# 2. Track the output model
dvc add models/classifier.pkl

# 3. Record metadata in Git
git add models/classifier.pkl.dvc models/.gitignore
git commit -m "Model v1: baseline random forest, acc=0.82"
git tag model-v1.0

# 4. Push model to remote storage
dvc push

# 5. Train improved model
python train.py --epochs 100 --lr 0.0005

# 6. Update tracking
dvc add models/classifier.pkl
git add models/classifier.pkl.dvc
git commit -m "Model v2: tuned hyperparameters, acc=0.87"
git tag model-v2.0
dvc push

Switching Model Versions

# Go back to v1
git checkout model-v1.0
dvc checkout

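# Verify the model changed
ls -la models/classifier.pkl   # size and timestamp only; the hash is not shown here
dvc status                     # reports no changes when the file matches models/classifier.pkl.dvc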

# Return to latest
git checkout main
dvc checkout
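
To confirm that the workspace file really is the version the pointer file describes, you can compare hashes yourself. The sketch below assumes a single-file output and DVC's default MD5 hashing; it may not hold for tracked directories or for older DVC versions that normalized text files. The helper names file_md5, recorded_md5, and matches_pointer are hypothetical, and the regular expression is a shortcut in place of a real YAML parser.

```python
import hashlib
import re
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recorded_md5(dvc_file):
    """Extract the first md5 value recorded in a .dvc pointer file."""
    match = re.search(r"md5:\s*([0-9a-f]{32})", Path(dvc_file).read_text())
    if match is None:
        raise ValueError(f"no md5 entry found in {dvc_file}")
    return match.group(1)


def matches_pointer(model_path):
    """True when the workspace file's hash equals the one in <model_path>.dvc."""
    return file_md5(model_path) == recorded_md5(f"{model_path}.dvc")
```

After git checkout model-v1.0 and dvc checkout, matches_pointer("models/classifier.pkl") should return True under these assumptions.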

Tracking Multiple Models

Project with Multiple Models

# Track each model separately
dvc add models/preprocessor.pkl
dvc add models/classifier.pkl
dvc add models/postprocessor.pkl

# Or track the entire models directory instead
# (pick one approach: DVC refuses outputs whose paths overlap)
dvc add models/

Model Naming Conventions

models/
├── classifier_v1_baseline.pkl
├── classifier_v2_tuned.pkl
├── classifier_v3_production.pkl
└── metadata/
    ├── v1_metrics.json
    ├── v2_metrics.json
    └── v3_metrics.json

Associating Models with Metrics

Manual Metrics Tracking

# Save metrics alongside model
echo '{"accuracy": 0.87, "f1": 0.85}' > models/metrics.json

# Track both
dvc add models/classifier.pkl
git add models/classifier.pkl.dvc models/metrics.json
git commit -m "Model v2 with metrics"
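
In practice the training script usually writes the metrics file itself rather than relying on a hand-typed echo. A minimal sketch, assuming the script already holds its metrics in a dictionary; save_metrics is a hypothetical helper name:

```python
import json
from pathlib import Path


def save_metrics(metrics, path="models/metrics.json"):
    """Write metrics as sorted, indented JSON so Git diffs stay readable."""
    target = Path(path)
    # Create the parent directory if the script runs in a fresh checkout
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
    return target
```

Calling save_metrics({"accuracy": 0.87, "f1": 0.85}) at the end of train.py would write the same values as the echo command above, ready to be committed with the .dvc file.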

Using DVC Metrics

# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
      - data/train.csv
      - train.py
    outs:
      - models/classifier.pkl
    metrics:
      - models/metrics.json:
          cache: false

# Run the pipeline (DVC tracks the outs itself, so no separate dvc add is needed)
dvc repro

# View metrics
dvc metrics show

# Compare across commits
dvc metrics diff
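
Conceptually, comparing metrics across commits means loading the metrics file from two revisions and reporting the change per metric. The sketch below illustrates that idea on two in-memory dictionaries; it is not how DVC implements dvc metrics diff, and metrics_diff is a hypothetical name. The values reuse the accuracies from the workflow example earlier.

```python
def metrics_diff(old, new):
    """Return {name: (old, new, change)} for metrics present in both versions."""
    return {
        name: (old[name], new[name], round(new[name] - old[name], 6))
        for name in sorted(old.keys() & new.keys())
    }


v1 = {"accuracy": 0.82}
v2 = {"accuracy": 0.87, "f1": 0.85}

# Only metrics recorded in both versions can be compared, so f1 is skipped
for name, (before, after, change) in metrics_diff(v1, v2).items():
    print(f"{name}: {before} -> {after} ({change:+.2f})")
```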

Best Practices for Model Versioning

What to Track

Track                      Don't Track
Final model weights        Intermediate optimizer states
Model configuration        Training logs (use MLflow)
Inference artifacts        Temporary cache files
Preprocessing pipelines    Debug outputs

Commit Message Guidelines

# Good: Descriptive with metrics
git commit -m "Model v3: XGBoost with feature selection
- Accuracy: 0.91 (↑0.04)
- F1 Score: 0.89 (↑0.03)
- Training time: 45min
- Features: 25 (reduced from 100)"

# Bad: Vague
git commit -m "Updated model"
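
If metrics already live in a dictionary, a descriptive message can be generated rather than typed by hand. A minimal sketch under that assumption; commit_message is a hypothetical helper, and the previous values are the ones implied by the changes in the example message above:

```python
def commit_message(title, metrics, previous=None):
    """Format a multi-line commit message listing each metric and its change."""
    previous = previous or {}
    lines = [title]
    for name, value in metrics.items():
        line = f"- {name}: {value:.2f}"
        if name in previous:
            # Signed difference against the previous model version
            line += f" ({value - previous[name]:+.2f})"
        lines.append(line)
    return "\n".join(lines)


print(commit_message(
    "Model v3: XGBoost with feature selection",
    {"Accuracy": 0.91, "F1 Score": 0.89},
    previous={"Accuracy": 0.87, "F1 Score": 0.86},
))
```

The returned string can be passed to git commit -m, for example through subprocess.run.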

Tagging Strategy

# Semantic versioning for models
git tag model-v1.0.0  # Major: architecture change
git tag model-v1.1.0  # Minor: new features/data
git tag model-v1.1.1  # Patch: bug fixes
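
The bump rules above are mechanical enough to script. The sketch below assumes tags follow the model-vMAJOR.MINOR.PATCH pattern shown here; bump_tag is a hypothetical helper, not part of Git or DVC.

```python
import re

TAG_PATTERN = re.compile(r"^model-v(\d+)\.(\d+)\.(\d+)$")


def bump_tag(tag, part):
    """Return the next model tag after bumping 'major', 'minor', or 'patch'."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a model-vMAJOR.MINOR.PATCH tag: {tag!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if part == "major":      # architecture change: reset minor and patch
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":    # new features or data: reset patch
        minor, patch = minor + 1, 0
    elif part == "patch":    # bug fix
        patch += 1
    else:
        raise ValueError(f"unknown part: {part!r}")
    return f"model-v{major}.{minor}.{patch}"


print(bump_tag("model-v1.0.0", "minor"))
```

Resetting the lower components on a major or minor bump follows the usual semantic versioning convention.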

Framework-Specific Patterns

Scikit-learn

import joblib

# Save the fitted estimator (model) to disk
joblib.dump(model, 'models/sklearn_model.pkl')

# Then, from the shell, track the file
dvc add models/sklearn_model.pkl

PyTorch

import torch

# Save only the weights (state dict)
torch.save(model.state_dict(), 'models/pytorch_model.pt')

# Save a full training checkpoint for resuming later
torch.save({
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'epoch': epoch,
    'loss': loss
}, 'models/checkpoint.pt')

# Then, from the shell, track both files
dvc add models/pytorch_model.pt
dvc add models/checkpoint.pt

TensorFlow/Keras

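# SavedModel format (recommended)
# Note: with Keras 3 (TensorFlow 2.16+), model.save() expects a .keras or .h5
# extension; use model.export('models/tf_model') to write a SavedModel instead
model.save('models/tf_model')

# Or HDF5 format
model.save('models/keras_model.h5')

# Then, from the shell, track the outputs
dvc add models/tf_model/   # Directory
dvc add models/keras_model.h5  # Single file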

Retrieving Specific Model Versions

# List all model versions (via Git tags)
git tag | grep model

# Get a specific version
git checkout model-v1.0
dvc checkout

# Or fetch without switching branches
dvc get . models/classifier.pkl --rev model-v1.0 -o models/v1_model.pkl
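
The same lookup is available from Python through dvc.api.read, which returns a tracked file's contents at a given revision without changing the workspace. A minimal sketch, assuming DVC is installed and the script runs inside the repository; read_model_bytes and its reader parameter are hypothetical, with reader existing only so the function can be exercised without DVC.

```python
def read_model_bytes(path, rev, reader=None):
    """Return the raw bytes of a DVC-tracked file as of a Git revision or tag."""
    if reader is None:
        # Imported lazily so this module still loads where DVC is not installed
        import dvc.api
        reader = dvc.api.read
    # mode="rb" asks for bytes, which is what serialized models need
    return reader(path, rev=rev, mode="rb")
```

For example, read_model_bytes("models/classifier.pkl", "model-v1.0") would return the v1 model's bytes, which can then be deserialized with whatever library saved them. Only deserialize pickle-based files from sources you trust.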

Key insight: Version models like you version code—with meaningful commits, tags, and the ability to reproduce any version at any time.

Next, we'll learn how to create reproducible ML pipelines with DVC.
