Data & Model Versioning with DVC
Experiment Reproducibility
4 min read
The true power of DVC lies in pipelines—defining your entire ML workflow as code. Run dvc repro and recreate any experiment exactly.
The Reproducibility Problem
Without pipelines:
# "How did I train this model?"
python preprocess.py --input raw.csv --output clean.csv # What params?
python train.py --data clean.csv --model rf # Which hyperparameters?
python evaluate.py --model model.pkl # What metrics?
# Order? Dependencies? Versions?
With DVC pipelines:
dvc repro # Recreates everything, automatically
DVC Pipeline Basics
The dvc.yaml File
Pipelines are defined in dvc.yaml:
# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    outs:
      - data/processed.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed.csv
    params:
      - train.epochs
      - train.learning_rate
    outs:
      - models/model.pkl

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test.csv
    metrics:
      - metrics.json:
          cache: false
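Note that DVC doesn't pass these paths into your script; each stage's cmd is responsible for reading its own deps and writing its declared outs. As a rough sketch (assuming pandas and a purely hypothetical cleaning step), src/preprocess.py for this stage might look like:
# src/preprocess.py (illustrative sketch, not prescribed by DVC)
import pandas as pd

df = pd.read_csv('data/raw.csv')              # declared dep
df = df.dropna()                              # hypothetical cleaning step
df.to_csv('data/processed.csv', index=False)  # declared out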
Key Components
| Component | Purpose | Example |
|---|---|---|
| cmd | Command to run | python train.py |
| deps | Input dependencies | Source files, data |
| outs | Output artifacts | Models, processed data |
| params | Hyperparameters | Learning rate, epochs |
| metrics | Evaluation metrics | Accuracy, F1 score |
Parameters File
Store hyperparameters in params.yaml:
# params.yaml
preprocess:
  test_split: 0.2
  random_seed: 42

train:
  epochs: 100
  learning_rate: 0.001
  batch_size: 32
  model_type: random_forest

evaluate:
  threshold: 0.5
Reference in your code:
# src/train.py
import yaml

with open('params.yaml') as f:
    params = yaml.safe_load(f)

epochs = params['train']['epochs']
lr = params['train']['learning_rate']
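Putting it together, src/train.py might look roughly like this. This is a sketch assuming scikit-learn (matching the model_type: random_forest above) and a hypothetical target column in the processed data; with a random forest, the epochs and learning_rate parameters would simply go unused, since they apply to iterative models.
# src/train.py (illustrative sketch)
import os
import pickle

import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier

with open('params.yaml') as f:
    train_params = yaml.safe_load(f)['train']

data = pd.read_csv('data/processed.csv')               # declared dep
X, y = data.drop(columns=['target']), data['target']   # 'target' is a hypothetical column name

if train_params['model_type'] != 'random_forest':
    raise ValueError(f"unsupported model_type: {train_params['model_type']}")
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

os.makedirs('models', exist_ok=True)
with open('models/model.pkl', 'wb') as f:               # declared out
    pickle.dump(model, f)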
Running Pipelines
Basic Execution
# Run entire pipeline
dvc repro
# Run specific stage
dvc repro train
# Force re-run (ignore cache)
dvc repro --force
Smart Caching
DVC only re-runs stages when something changes:
$ dvc repro
Stage 'preprocess' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.
# Now modify learning_rate in params.yaml
$ dvc repro
Stage 'preprocess' didn't change, skipping
Running stage 'train'... # Re-runs because param changed
Running stage 'evaluate'... # Re-runs because dependency changed
Tracking Metrics
Metrics File
# src/evaluate.py
import json

metrics = {
    "accuracy": 0.87,
    "precision": 0.85,
    "recall": 0.89,
    "f1_score": 0.87
}

with open('metrics.json', 'w') as f:
    json.dump(metrics, f, indent=2)
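In practice these numbers are computed from model predictions rather than hard-coded. A sketch using scikit-learn, assuming a binary classification problem and the same hypothetical target column as before:
# src/evaluate.py (illustrative sketch)
import json
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

with open('models/model.pkl', 'rb') as f:
    model = pickle.load(f)

test = pd.read_csv('data/test.csv')
X_test, y_test = test.drop(columns=['target']), test['target']  # 'target' is hypothetical
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),   # binary average by default
    "recall": recall_score(y_test, y_pred),
    "f1_score": f1_score(y_test, y_pred),
}

with open('metrics.json', 'w') as f:
    json.dump(metrics, f, indent=2)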
Viewing Metrics
# Show current metrics
dvc metrics show
# Output:
# Path accuracy precision recall f1_score
# metrics.json 0.87 0.85 0.89 0.87
Comparing Experiments
# Compare metrics across Git commits
dvc metrics diff HEAD~1
# Output:
# Path Metric HEAD HEAD~1 Change
# metrics.json accuracy 0.87 0.82 +0.05
# metrics.json f1_score 0.87 0.81 +0.06
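Because metrics.json is just a file in the workspace, it is also easy to consume programmatically, for example as a simple quality gate in CI. The script name and the 0.85 threshold below are purely illustrative:
# check_metrics.py (illustrative)
import json
import sys

with open('metrics.json') as f:
    metrics = json.load(f)

if metrics['accuracy'] < 0.85:   # illustrative threshold
    sys.exit(f"Accuracy {metrics['accuracy']:.2f} is below the 0.85 gate")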
Plots and Visualizations
Defining Plots
# dvc.yaml
stages:
  train:
    cmd: python src/train.py
    plots:
      - training_history.csv:
          x: epoch
          y: loss
Training History File
# src/train.py
import csv

history = []
for epoch in range(epochs):
    loss = train_epoch(model, data)   # train_epoch, model, data defined elsewhere in the script
    history.append({'epoch': epoch, 'loss': loss})

with open('training_history.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['epoch', 'loss'])
    writer.writeheader()
    writer.writerows(history)
Viewing Plots
# Generate HTML plots
dvc plots show
# Compare plots across experiments
dvc plots diff HEAD~1
Complete Pipeline Example
Project Structure
my-project/
├── data/
│   ├── raw.csv
│   ├── processed.csv
│   └── test.csv
├── src/
│   ├── preprocess.py
│   ├── train.py
│   └── evaluate.py
├── models/
│   └── model.pkl
├── dvc.yaml
├── params.yaml
├── metrics.json
├── training_history.csv
└── dvc.lock
Full dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    params:
      - preprocess.test_split
      - preprocess.random_seed
    outs:
      - data/processed.csv
      - data/test.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed.csv
    params:
      - train.epochs
      - train.learning_rate
      - train.model_type
    outs:
      - models/model.pkl
    plots:
      - training_history.csv:
          x: epoch
          y: loss

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test.csv
    metrics:
      - metrics.json:
          cache: false
The Lock File
dvc.lock records the exact state of each run:
# dvc.lock (auto-generated)
schema: '2.0'
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - path: data/raw.csv
        hash: md5
        md5: a1b2c3d4e5...
      - path: src/preprocess.py
        hash: md5
        md5: f6a7b8c9d0...
    outs:
      - path: data/processed.csv
        hash: md5
        md5: e1f2a3b4c5...
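These hashes are what makes the smart caching described earlier possible: before running a stage, DVC hashes each dependency and compares it against what dvc.lock recorded, skipping the stage when nothing differs. Conceptually the check boils down to content hashing, along these lines (a simplification for intuition, not DVC's actual implementation):
# Conceptual illustration only: has this dependency changed since the last run?
import hashlib

def file_md5(path):
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            md5.update(chunk)
    return md5.hexdigest()

recorded = 'a1b2c3d4e5...'                      # value stored in dvc.lock (truncated)
changed = file_md5('data/raw.csv') != recorded  # True means the stage must re-run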
Experiment Workflow
# 1. Create experiment branch
git checkout -b experiment/new-features
# 2. Modify parameters
vim params.yaml
# 3. Run pipeline
dvc repro
# 4. Compare results
dvc metrics diff main
# 5. If better, commit and merge
git add dvc.yaml dvc.lock params.yaml metrics.json
git commit -m "Experiment: added feature engineering, acc +5%"
git checkout main
git merge experiment/new-features
Best Practices
| Practice | Why |
|---|---|
| Keep stages granular | Easier caching and debugging |
| Use params.yaml | Separate code from configuration |
| Commit dvc.lock | Ensures exact reproducibility |
| Document experiments | Git commit messages matter |
| Use branches | One experiment per branch |
Key insight: With dvc.yaml and dvc.lock in Git, anyone can run dvc repro and get exactly the same results, months or years later.
Next module: We'll explore ML workflow orchestration with Kubeflow, Airflow, and Prefect.