Data & Model Versioning with DVC

Experiment Reproducibility


The true power of DVC lies in pipelines: your entire ML workflow defined as code. Running dvc repro re-executes the stages recorded in that definition, so an experiment can be recreated from its tracked code, data, and parameters.

The Reproducibility Problem

Without pipelines:

# "How did I train this model?"
python preprocess.py --input raw.csv --output clean.csv  # What params?
python train.py --data clean.csv --model rf  # Which hyperparameters?
python evaluate.py --model model.pkl  # What metrics?
# Order? Dependencies? Versions?

With DVC pipelines:

dvc repro  # Recreates everything, automatically

DVC Pipeline Basics

The dvc.yaml File

Pipelines are defined in dvc.yaml:

# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    outs:
      - data/processed.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed.csv
    params:
      - train.epochs
      - train.learning_rate
    outs:
      - models/model.pkl

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test.csv
    metrics:
      - metrics.json:
          cache: false

Key Components

| Component | Purpose | Example |
|-----------|---------|---------|
| cmd | Command to run | python train.py |
| deps | Input dependencies | Source files, data |
| outs | Output artifacts | Models, processed data |
| params | Hyperparameters | Learning rate, epochs |
| metrics | Evaluation metrics | Accuracy, F1 score |
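
DVC works out the execution order from how one stage's outs feed another stage's deps. The following is a minimal sketch of that idea, not DVC's actual implementation: the stages dict loosely mirrors the dvc.yaml above (with metrics.json treated as an ordinary output), and execution_order is a hypothetical helper name.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hand-written stand-in for the parsed dvc.yaml (assumption: metrics count as outputs).
stages = {
    "preprocess": {
        "deps": ["src/preprocess.py", "data/raw.csv"],
        "outs": ["data/processed.csv"],
    },
    "train": {
        "deps": ["src/train.py", "data/processed.csv"],
        "outs": ["models/model.pkl"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "models/model.pkl", "data/test.csv"],
        "outs": ["metrics.json"],
    },
}


def execution_order(stages):
    """Order stages so that every stage runs after the stages producing its deps."""
    # Map each output path to the stage that produces it.
    producer = {
        out: name for name, spec in stages.items() for out in spec.get("outs", [])
    }
    # A stage depends on another stage only when one of its deps is that stage's output;
    # deps that no stage produces (source files, raw data) add no edge.
    graph = {
        name: {producer[dep] for dep in spec.get("deps", []) if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

Because train consumes data/processed.csv and evaluate consumes models/model.pkl, the three stages form a chain and can only run in one order.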

Parameters File

Store hyperparameters in params.yaml:

# params.yaml
preprocess:
  test_split: 0.2
  random_seed: 42

train:
  epochs: 100
  learning_rate: 0.001
  batch_size: 32
  model_type: random_forest

evaluate:
  threshold: 0.5

Reference in your code:

# src/train.py
import yaml

with open('params.yaml') as f:
    params = yaml.safe_load(f)

epochs = params['train']['epochs']
lr = params['train']['learning_rate']
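
A typo in a parameter name otherwise surfaces only as a bare KeyError deep in the script. One possible guard is sketched below, assuming the file has already been loaded into a dict with yaml.safe_load as above; get_stage_params is a hypothetical helper, not part of DVC, and the dict is inlined here so the example runs on its own.

```python
from typing import Any, Iterable


def get_stage_params(
    params: dict[str, Any], stage: str, required: Iterable[str]
) -> dict[str, Any]:
    """Return the section for `stage`, raising KeyError if it or a required key is absent."""
    section = params.get(stage)
    if not isinstance(section, dict):
        raise KeyError(f"params.yaml has no '{stage}' section")
    missing = [key for key in required if key not in section]
    if missing:
        raise KeyError(f"missing keys in '{stage}': {', '.join(missing)}")
    return section


# Same shape as the train section of params.yaml, inlined for the example.
params = {
    "train": {
        "epochs": 100,
        "learning_rate": 0.001,
        "batch_size": 32,
        "model_type": "random_forest",
    }
}

train_params = get_stage_params(params, "train", ["epochs", "learning_rate"])
epochs = train_params["epochs"]
lr = train_params["learning_rate"]
```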

Running Pipelines

Basic Execution

# Run entire pipeline
dvc repro

# Run a specific stage (plus any upstream stages it depends on that changed)
dvc repro train

# Force re-run even if nothing changed
dvc repro --force

Smart Caching

DVC re-runs a stage only when its command, dependencies, or parameters have changed:

$ dvc repro
Stage 'preprocess' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

# Now modify learning_rate in params.yaml
$ dvc repro
Stage 'preprocess' didn't change, skipping
Running stage 'train'...  # Re-runs because param changed
Running stage 'evaluate'...  # Re-runs because dependency changed
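
The skip-or-run decision can be pictured as a comparison between what a stage depends on now and what was recorded after its last run. The sketch below is a simplified conceptual model, not DVC's code: md5_of and stage_changed are hypothetical names, contents are passed as in-memory bytes rather than read from disk, and the parameter is hashed like a file for brevity (DVC stores parameter values directly in dvc.lock).

```python
import hashlib


def md5_of(content: bytes) -> str:
    """Hex MD5 digest of some content."""
    return hashlib.md5(content).hexdigest()


def stage_changed(current: dict[str, bytes], recorded: dict[str, str]) -> bool:
    """True if any dependency's hash differs from the one recorded on the last run."""
    current_hashes = {path: md5_of(data) for path, data in current.items()}
    return current_hashes != recorded


# Dependencies of a hypothetical 'train' stage, with made-up contents.
deps = {
    "src/train.py": b"print('training')",
    "params.yaml:train.learning_rate": b"0.001",
}

# State recorded after the last successful run.
recorded = {path: md5_of(data) for path, data in deps.items()}

unchanged = stage_changed(deps, recorded)  # nothing edited yet

# Edit the learning rate, as in the session above.
deps["params.yaml:train.learning_rate"] = b"0.01"
changed = stage_changed(deps, recorded)

print(unchanged, changed)
```

A stage that re-runs produces new outputs, whose hashes then differ for every downstream stage; that is why evaluate also re-runs after train.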

Tracking Metrics

Metrics File

# src/evaluate.py
import json

# Placeholder values; a real script computes these from model predictions.
metrics = {
    "accuracy": 0.87,
    "precision": 0.85,
    "recall": 0.89,
    "f1_score": 0.87
}

with open('metrics.json', 'w') as f:
    json.dump(metrics, f, indent=2)
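
In practice the numbers come from comparing predictions with labels. A minimal sketch for binary classification follows, using only the standard library; binary_metrics is a hypothetical helper, and the labels and predictions are made-up toy data rather than results from any real model.

```python
import json


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 is the positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Toy data for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(json.dumps(metrics, indent=2))
```

The resulting dict has the same keys as the metrics file above, so it can be written to metrics.json with the same json.dump call.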

Viewing Metrics

# Show current metrics
dvc metrics show

# Output:
# Path          accuracy    precision    recall    f1_score
# metrics.json  0.87        0.85         0.89      0.87

Comparing Experiments

# Compare the previous commit against the current workspace
dvc metrics diff HEAD~1

# Output (illustrative):
# Path          Metric     HEAD~1    workspace    Change
# metrics.json  accuracy   0.82      0.87         0.05
# metrics.json  f1_score   0.81      0.87         0.06
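
Conceptually, the comparison loads the metrics file from each revision and subtracts matching entries. The sketch below illustrates only that arithmetic, using the accuracy and F1 values from the example output; diff_metrics is a hypothetical helper and does not reflect how DVC reads revisions from Git.

```python
def diff_metrics(old: dict, new: dict) -> list[dict]:
    """List metrics present in both revisions whose values differ, with the change."""
    rows = []
    for name in sorted(old.keys() & new.keys()):
        change = new[name] - old[name]
        if change != 0:
            rows.append(
                {
                    "metric": name,
                    "old": old[name],
                    "new": new[name],
                    # Round to hide floating-point noise such as 0.05000000000000004.
                    "change": round(change, 4),
                }
            )
    return rows


old = {"accuracy": 0.82, "f1_score": 0.81}  # metrics.json at the earlier commit
new = {"accuracy": 0.87, "f1_score": 0.87}  # metrics.json in the workspace

rows = diff_metrics(old, new)
for row in rows:
    print(row["metric"], row["old"], row["new"], row["change"])
```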

Plots and Visualizations

Defining Plots

# dvc.yaml
stages:
  train:
    cmd: python src/train.py
    plots:
      - training_history.csv:
          x: epoch
          y: loss

Training History File

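# src/train.py (excerpt)
import csv

# epochs is read from params.yaml; model, data, and train_epoch are defined elsewhere in the script.
history = []
for epoch in range(epochs):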
    loss = train_epoch(model, data)
    history.append({'epoch': epoch, 'loss': loss})

with open('training_history.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['epoch', 'loss'])
    writer.writeheader()
    writer.writerows(history)

Viewing Plots

# Generate HTML plots
dvc plots show

# Compare plots between the previous commit and the workspace
dvc plots diff HEAD~1

Complete Pipeline Example

Project Structure

my-project/
├── data/
│   ├── raw.csv
│   ├── processed.csv
│   └── test.csv
├── src/
│   ├── preprocess.py
│   ├── train.py
│   └── evaluate.py
├── models/
│   └── model.pkl
├── dvc.yaml
├── params.yaml
├── metrics.json
├── training_history.csv
└── dvc.lock

Full dvc.yaml

stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    params:
      - preprocess.test_split
      - preprocess.random_seed
    outs:
      - data/processed.csv
      - data/test.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed.csv
    params:
      - train.epochs
      - train.learning_rate
      - train.model_type
    outs:
      - models/model.pkl
    plots:
      - training_history.csv:
          x: epoch
          y: loss

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test.csv
    metrics:
      - metrics.json:
          cache: false

The Lock File

dvc.lock records the exact state of each stage's last run: the command plus content hashes of its dependencies and outputs.

# dvc.lock (auto-generated)
schema: '2.0'
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - path: data/raw.csv
        hash: md5
        md5: a1b2c3d4e5...
      - path: src/preprocess.py
        hash: md5
        md5: f6g7h8i9j0...
    outs:
      - path: data/processed.csv
        hash: md5
        md5: k1l2m3n4o5...

Experiment Workflow

# 1. Create experiment branch
git checkout -b experiment/new-features

# 2. Modify parameters
vim params.yaml

# 3. Run pipeline
dvc repro

# 4. Compare results
dvc metrics diff main

# 5. If better, commit and merge
git add dvc.yaml dvc.lock params.yaml metrics.json
git commit -m "Experiment: added feature engineering, acc +5%"
git checkout main
git merge experiment/new-features

Best Practices

| Practice | Why |
|----------|-----|
| Keep stages granular | Easier caching and debugging |
| Use params.yaml | Separate code from configuration |
| Commit dvc.lock | Ensures exact reproducibility |
| Document experiments | Git commit messages matter |
| Use branches | One experiment per branch |

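Key insight: With dvc.yaml and dvc.lock committed to Git, and the tracked data still retrievable (for example from a DVC remote), anyone can check out a commit and run dvc repro to rebuild the same pipeline months or years later. Identical results also depend on fixed random seeds and a matching software environment.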

Next module: We'll explore ML workflow orchestration with Kubeflow, Airflow, and Prefect.
