Data & Model Versioning with DVC
Versioning Models
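Models are the output of your ML experiments. Like datasets, they evolve: you'll train many versions before finding the best one. DVC versions each model by committing a small pointer file to Git while the model file itself lives in the DVC cache and remote storage.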
Why Version Models?
| Challenge | Solution |
|---|---|
| Reproducibility | Recreate any model version |
| Comparison | Compare metrics across versions |
| Rollback | Quickly revert to a working model |
| Audit | Track who trained what, when |
Tracking Model Files
Single Model File
# Track a trained model
dvc add models/random_forest.pkl
# Creates models/random_forest.pkl.dvc
# Adds /random_forest.pkl to models/.gitignore so Git ignores the binary
Model with Artifacts
Many frameworks save multiple files:
# Some frameworks write a whole directory (weights, config, tokenizer files)
dvc add models/pytorch_model/
# This tracks the entire directory:
# models/pytorch_model/
# ├── model.pt
# ├── config.json
# └── tokenizer/
Large Model Checkpoints
# Track training checkpoints
dvc add checkpoints/
# Directory structure:
# checkpoints/
# ├── epoch_10.pt
# ├── epoch_20.pt
# └── epoch_30.pt
Model Versioning Workflow
Complete Workflow Example
# 1. Train your model (Python script)
python train.py --epochs 50 --lr 0.001
# 2. Track the output model
dvc add models/classifier.pkl
# 3. Record metadata in Git
git add models/classifier.pkl.dvc models/.gitignore
git commit -m "Model v1: baseline random forest, acc=0.82"
git tag model-v1.0
# 4. Push model to remote storage
dvc push
# 5. Train improved model
python train.py --epochs 100 --lr 0.0005
# 6. Update tracking
dvc add models/classifier.pkl
git add models/classifier.pkl.dvc
git commit -m "Model v2: tuned hyperparameters, acc=0.87"
git tag model-v2.0
dvc push
Switching Model Versions
# Go back to v1
git checkout model-v1.0
dvc checkout
# Verify the workspace matches the v1.0 pointer file
dvc status
# Should report no changes when models/classifier.pkl matches models/classifier.pkl.dvc
# Return to latest
git checkout main
dvc checkout
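To confirm that the file in the workspace really is the version the pointer file describes, one option is to compare hashes directly. The sketch below assumes a single-file pointer whose `outs` entry carries an `md5:` field holding the plain MD5 of the file's bytes (the default for binary files); the helper names `file_md5`, `pointer_md5`, and `workspace_matches_pointer` are hypothetical, not part of DVC.

```python
import hashlib
import re
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks so large models fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pointer_md5(dvc_file):
    """Extract the first 32-character md5 value from a .dvc pointer file.

    Assumes a line such as '- md5: <hash>'; directory pointers
    (hashes ending in '.dir') are deliberately not handled here.
    """
    text = Path(dvc_file).read_text()
    match = re.search(r"^\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", text, re.MULTILINE)
    if match is None:
        raise ValueError(f"no single-file md5 entry found in {dvc_file}")
    return match.group(1)


def workspace_matches_pointer(model_path):
    """True when the model file's hash equals the one recorded in <model_path>.dvc."""
    return file_md5(model_path) == pointer_md5(f"{model_path}.dvc")
```

Calling `workspace_matches_pointer("models/classifier.pkl")` after `git checkout model-v1.0` and `dvc checkout` should return `True` if the checkout succeeded.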
Tracking Multiple Models
Project with Multiple Models
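# Option A: track each model separately (one .dvc file per model)
dvc add models/preprocessor.pkl
dvc add models/classifier.pkl
dvc add models/postprocessor.pkl
# Option B: track the entire models directory as one unit (creates models.dvc)
# Choose one option: DVC does not allow a tracked directory to overlap with individually tracked files
dvc add models/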
Model Naming Conventions
models/
├── classifier_v1_baseline.pkl
├── classifier_v2_tuned.pkl
├── classifier_v3_production.pkl
└── metadata/
    ├── v1_metrics.json
    ├── v2_metrics.json
    └── v3_metrics.json
Associating Models with Metrics
Manual Metrics Tracking
# Save metrics alongside model
echo '{"accuracy": 0.87, "f1": 0.85}' > models/metrics.json
# Track both
dvc add models/classifier.pkl
git add models/classifier.pkl.dvc models/metrics.json
git commit -m "Model v2 with metrics"
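In practice the metrics file is usually written by the training script rather than by hand. A minimal sketch of such a helper follows; the function name `save_metrics` is hypothetical, and the default path simply mirrors the `models/metrics.json` used in this lesson.

```python
import json
from pathlib import Path


def save_metrics(metrics, path="models/metrics.json"):
    """Write a metrics dict as JSON next to the model and return the path.

    Keys are sorted so that successive commits produce small, stable Git diffs.
    """
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
    return out
```

A training script could end with `save_metrics({"accuracy": 0.87, "f1": 0.85})` so the JSON file is always regenerated together with the model it describes.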
Using DVC Metrics
# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
      - data/train.csv
      - train.py
    outs:
      - models/classifier.pkl
    metrics:
      - models/metrics.json:
          cache: false
# View metrics
dvc metrics show
# Compare across commits
dvc metrics diff
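Conceptually, comparing metrics across commits means loading the metrics file from two revisions and reporting the change per metric. The sketch below illustrates that idea on two plain dictionaries; it is not DVC's implementation, and `metrics_diff` is a hypothetical name.

```python
def metrics_diff(old, new):
    """Compare two metrics dicts and report old value, new value, and change.

    Metrics present on only one side get a change of None.
    """
    rows = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name)
        after = new.get(name)
        if before is not None and after is not None:
            change = after - before
        else:
            change = None
        rows[name] = {"old": before, "new": after, "change": change}
    return rows
```

For example, `metrics_diff({"accuracy": 0.82}, {"accuracy": 0.87, "f1": 0.85})` reports an accuracy change of roughly 0.05 and no change value for `f1`, since the older revision did not record it.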
Best Practices for Model Versioning
What to Track
| Track | Don't Track |
|---|---|
| Final model weights | Intermediate optimizer states |
| Model configuration | Training logs (use MLflow) |
| Inference artifacts | Temporary cache files |
| Preprocessing pipelines | Debug outputs |
Commit Message Guidelines
# Good: Descriptive with metrics
git commit -m "Model v3: XGBoost with feature selection
- Accuracy: 0.91 (↑0.04)
- F1 Score: 0.89 (↑0.03)
- Training time: 45min
- Features: 25 (reduced from 100)"
# Bad: Vague
git commit -m "Updated model"
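Descriptive messages are easier to keep consistent when they are generated from the metrics themselves. The following is a sketch only: `format_commit_message` is a hypothetical helper, and it places a blank line after the title, following the usual Git convention of separating subject and body.

```python
def format_commit_message(title, metrics, previous=None):
    """Build a commit message with one bullet per metric.

    When `previous` holds the same metric, the change is appended
    with an up or down arrow.
    """
    lines = [title, ""]
    for name, value in metrics.items():
        line = f"- {name}: {value:.2f}"
        if previous and name in previous:
            delta = value - previous[name]
            arrow = "↑" if delta >= 0 else "↓"
            line += f" ({arrow}{abs(delta):.2f})"
        lines.append(line)
    return "\n".join(lines)
```

The returned string can be passed to `git commit -F -` on standard input or written to a file for `git commit -F <file>`.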
Tagging Strategy
# Semantic versioning for models
git tag model-v1.0.0 # Major: architecture change
git tag model-v1.1.0 # Minor: new features/data
git tag model-v1.1.1 # Patch: bug fixes
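The next tag can be derived mechanically from the previous one. This sketch assumes tags follow the `model-vMAJOR.MINOR.PATCH` pattern shown here; `bump_model_tag` is a hypothetical helper, not a Git or DVC command.

```python
import re

TAG_PATTERN = re.compile(r"^model-v(\d+)\.(\d+)\.(\d+)$")


def bump_model_tag(tag, part):
    """Return the next tag after bumping 'major', 'minor', or 'patch'.

    Lower-order components reset to zero, as in semantic versioning.
    """
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    major, minor, patch = (int(n) for n in match.groups())
    if part == "major":
        return f"model-v{major + 1}.0.0"
    if part == "minor":
        return f"model-v{major}.{minor + 1}.0"
    if part == "patch":
        return f"model-v{major}.{minor}.{patch + 1}"
    raise ValueError(f"part must be 'major', 'minor', or 'patch', got {part!r}")
```

For instance, `bump_model_tag("model-v1.1.0", "patch")` yields `model-v1.1.1`, and bumping `major` on the same tag yields `model-v2.0.0`.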
Framework-Specific Patterns
Scikit-learn
import joblib
# Save model
joblib.dump(model, 'models/sklearn_model.pkl')
# Then, from the shell:
dvc add models/sklearn_model.pkl
PyTorch
import torch
# Save only the weights (state dict) for inference
torch.save(model.state_dict(), 'models/pytorch_model.pt')
# Save a resumable training checkpoint (weights, optimizer state, progress)
torch.save({
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'epoch': epoch,
    'loss': loss
}, 'models/checkpoint.pt')
# Then, from the shell:
dvc add models/pytorch_model.pt
dvc add models/checkpoint.pt
TensorFlow/Keras
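# SavedModel directory (TensorFlow 2.15 and earlier, Keras 2)
model.save('models/tf_model')
# Or HDF5 format (legacy single file)
model.save('models/keras_model.h5')
# With Keras 3 (TensorFlow 2.16+), model.save() expects a .keras or .h5 path; use model.export('models/tf_model') for a SavedModel
# Then, from the shell:
dvc add models/tf_model/ # Directory
dvc add models/keras_model.h5 # Single file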
Retrieving Specific Model Versions
# List all model versions (via Git tags)
git tag --list 'model-*'
# Get a specific version
git checkout model-v1.0
dvc checkout
# Or fetch without switching branches
dvc get . models/classifier.pkl --rev model-v1.0 -o models/v1_model.pkl
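A specific version can also be loaded from Python without touching the workspace, using `dvc.api.open` from the `dvc` package. The sketch below assumes the model was written with `pickle`; the wrapper `load_model_at` and its `opener` and `loader` parameters are hypothetical conveniences (the `opener` hook exists only so the function can be exercised without a DVC repository).

```python
import pickle


def load_model_at(path, rev, repo=".", opener=None, loader=pickle.load):
    """Load a model file as it existed at a given Git revision or tag.

    opener defaults to dvc.api.open, which requires the `dvc` package.
    loader defaults to pickle.load; pass joblib.load for joblib files.
    Only unpickle files you trust, since pickle can execute code on load.
    """
    if opener is None:
        import dvc.api  # imported lazily so the module loads without dvc

        opener = dvc.api.open
    with opener(path, repo=repo, rev=rev, mode="rb") as f:
        return loader(f)
```

A call such as `load_model_at("models/classifier.pkl", rev="model-v1.0")` would read the v1.0 model from the cache or remote while the working directory stays on the current branch.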
Key insight: Version models like you version code—with meaningful commits, tags, and the ability to reproduce any version at any time.
Next, we'll learn how to create reproducible ML pipelines with DVC.