Data & Model Versioning with DVC

DVC Fundamentals

DVC (Data Version Control) extends Git to handle large files, datasets, and ML models. It gives you version control for data without storing everything in Git.

Why DVC?

Problem           Without DVC                With DVC
Large files       Git slows down, fails      Tracked via metadata
Dataset versions  Copy folders manually      git checkout any version
Collaboration     Share via Dropbox/Drive    Push/pull like Git
Reproducibility   "Which data did I use?"    Exact version tracked

Installation

DVC requires Python 3.9+:

# Basic installation
pip install dvc

# With S3 support (quotes stop shells such as zsh from expanding the brackets)
pip install "dvc[s3]"

# With all cloud storage support
pip install "dvc[all]"

# Verify installation
dvc version

Initializing DVC

Set up DVC in an existing Git repository:

# Navigate to your project
cd my-ml-project

# Initialize Git (if not already)
git init

# Initialize DVC
dvc init

# This creates:
# .dvc/            - DVC configuration and internal files
# .dvc/.gitignore  - Keeps the cache and temporary files out of Git
# .dvcignore       - Files for DVC to ignore

After initialization:

# Commit DVC initialization
git add .dvc .dvcignore
git commit -m "Initialize DVC"

How DVC Works

DVC doesn't store your data in Git. Instead:

┌─────────────────────────────────────────────────────┐
│                  How DVC Works                       │
├─────────────────────────────────────────────────────┤
│                                                      │
│  Your Data File          .dvc File (in Git)         │
│  ┌────────────┐         ┌────────────────┐          │
│  │ data.csv   │ ──────▶ │ data.csv.dvc   │          │
│  │ (100 MB)   │         │ (hash + meta)  │          │
│  └────────────┘         └────────────────┘          │
│       │                         │                    │
│       ▼                         ▼                    │
│  DVC Cache              Git Repository              │
│  (.dvc/cache)           (Remote origin)             │
│       │                                              │
│       ▼                                              │
│  Remote Storage                                      │
│  (S3, GCS, etc.)                                    │
│                                                      │
└─────────────────────────────────────────────────────┘
  1. Large file → DVC calculates hash, stores in cache
  2. Metadata file (.dvc) → Contains hash, stored in Git
  3. Remote storage → Push cache to S3/GCS for sharing
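
Step 1 and step 2 can be illustrated with a minimal Python sketch. It hashes a file in chunks and builds a small dictionary shaped like the metadata DVC keeps in Git. The helper names `file_md5` and `describe` are hypothetical, and DVC's own implementation may differ in its details; the point is only that the metadata stays tiny however large the data file is.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(path: Path) -> dict:
    """Build a dict shaped like one tracked-file entry (illustrative only)."""
    return {
        "md5": file_md5(path),
        "size": path.stat().st_size,
        "hash": "md5",
        "path": path.name,
    }


# Demo on a throwaway file; a real dataset would be far larger.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.csv"
    sample.write_bytes(b"hello")
    meta = describe(sample)

print(meta)
```

Because the hash depends only on the file's bytes, two collaborators who hold the same data end up with the same metadata, which is what lets Git stand in for the data itself.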

Tracking Your First File

# Add a data file to DVC tracking
dvc add data/training_data.csv

# This creates:
# data/training_data.csv.dvc  - Metadata (track in Git)
# data/.gitignore             - Excludes the actual file

# Commit the changes
git add data/training_data.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

Example .dvc file content:

outs:
- md5: a3b2c1d4e5f6...
  size: 104857600
  hash: md5
  path: training_data.csv
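
To show why these few fields are enough to detect a changed dataset, here is a rough Python sketch that compares a file against a recorded entry. The function `matches_entry` is a hypothetical helper, not part of DVC, and the entry is written as a plain dict rather than parsed from the YAML file; the hash and size belong to a five-byte demo file, not to the truncated example above.

```python
import hashlib
import tempfile
from pathlib import Path


def matches_entry(data_file: Path, entry: dict) -> bool:
    """Return True if the file's size and MD5 match the recorded entry."""
    # Size is a cheap first check; hashing is only needed when sizes agree.
    if data_file.stat().st_size != entry["size"]:
        return False
    return hashlib.md5(data_file.read_bytes()).hexdigest() == entry["md5"]


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "training_data.csv"
    data_file.write_bytes(b"hello")

    # Entry recorded when the file was first tracked (values for b"hello").
    entry = {
        "md5": "5d41402abc4b2a76b9719d911017c592",
        "size": 5,
        "hash": "md5",
        "path": "training_data.csv",
    }

    unchanged = matches_entry(data_file, entry)

    # Editing the data makes it diverge from the recorded entry.
    data_file.write_bytes(b"hello, world")
    modified = not matches_entry(data_file, entry)

print(unchanged, modified)
```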

DVC Cache

DVC stores file contents in a local cache:

# Cache location (DVC 3.x layout)
ls .dvc/cache/files/md5/

# Cache structure (content-addressed)
.dvc/cache/files/md5/
├── a3/
│   └── b2c1d4e5f6...  # First 2 chars of hash = folder
└── d7/
    └── e8f9a0b1c2...
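
The content-addressed layout can be imitated in a few lines of Python. This is a simplified sketch under stated assumptions, not DVC's code: `store_in_cache` is a hypothetical helper, and the cache root is just a temporary directory. It shows the two properties that matter: the first two hash characters pick the folder, and identical content is stored only once.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store_in_cache(cache_root: Path, source: Path) -> Path:
    """Copy a file into the cache under <first 2 hash chars>/<remaining chars>."""
    md5_hex = hashlib.md5(source.read_bytes()).hexdigest()
    target = cache_root / md5_hex[:2] / md5_hex[2:]
    if not target.exists():
        # Only new content is copied; known content is already in place.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache_root = root / "cache"
    (root / "a.csv").write_bytes(b"hello")
    (root / "copy_of_a.csv").write_bytes(b"hello")  # same bytes as a.csv
    (root / "b.csv").write_bytes(b"goodbye")

    first = store_in_cache(cache_root, root / "a.csv")
    second = store_in_cache(cache_root, root / "copy_of_a.csv")
    third = store_in_cache(cache_root, root / "b.csv")

    # Three source files, but only two distinct contents end up cached.
    cached_files = sorted(p for p in cache_root.rglob("*") if p.is_file())

print(first.relative_to(cache_root))
print(len(cached_files))
```

Addressing files by content rather than by name is why renaming or duplicating a dataset costs no extra cache space.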

Key Commands Summary

Command          Purpose
dvc init         Initialize DVC in project
dvc add <file>   Start tracking a file
dvc push         Upload to remote storage
dvc pull         Download from remote storage
dvc checkout     Restore files from cache
dvc status       Show tracked file status

.dvcignore

Like .gitignore, but for DVC:

# .dvcignore
*.tmp
*.log
__pycache__
.ipynb_checkpoints
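
To get a feel for which paths these patterns catch, the sketch below approximates the matching with Python's `fnmatch`. This is an assumption-laden simplification: DVC follows gitignore-style rules, which also cover negation, anchoring, and `**`, none of which are modeled here. The helper `is_ignored` is hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# The same patterns as in the .dvcignore example.
PATTERNS = ["*.tmp", "*.log", "__pycache__", ".ipynb_checkpoints"]


def is_ignored(path: str, patterns=PATTERNS) -> bool:
    """Return True if any component of the path matches any pattern.

    Simplification: a pattern without a slash is tested against every
    directory and file name in the path.
    """
    parts = PurePosixPath(path).parts
    return any(fnmatch(part, pattern) for part in parts for pattern in patterns)


for candidate in [
    "data/run.log",
    "src/__pycache__/module.pyc",
    "notebooks/.ipynb_checkpoints/eda.ipynb",
    "data/training_data.csv",
]:
    print(candidate, is_ignored(candidate))
```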

Key insight: DVC is "Git for data". It uses the same mental model of add, commit, push, and pull, but keeps large files out of the Git repository and tracks them through small metadata files.

Next, we'll learn how to version datasets and work with multiple data versions.
