Data & Model Versioning with DVC
DVC Fundamentals
DVC (Data Version Control) extends Git to handle large files, datasets, and ML models. It gives you version control for data without storing everything in Git.
Why DVC?
| Problem | Without DVC | With DVC |
|---|---|---|
| Large files | Git slows down, fails | Tracked via metadata |
| Dataset versions | Copy folders manually | git checkout + dvc checkout any version |
| Collaboration | Share via Dropbox/Drive | Push/pull like Git |
| Reproducibility | "Which data did I use?" | Exact version tracked |
Installation
DVC requires Python 3.9+:
# Basic installation
pip install dvc
# With S3 support (quotes stop shells such as zsh from expanding the brackets)
pip install "dvc[s3]"
# With all cloud storage support
pip install "dvc[all]"
# Verify installation
dvc version
Initializing DVC
Set up DVC in an existing Git repository:
# Navigate to your project
cd my-ml-project
# Initialize Git (if not already)
git init
# Initialize DVC
dvc init
# This creates:
# .dvc/ - DVC configuration
# .dvcignore - Files to ignore
# .dvc/.gitignore - Keeps the cache and temp files out of Git
After initialization:
# Commit DVC initialization
git add .dvc .dvcignore
git commit -m "Initialize DVC"
How DVC Works
DVC doesn't store your data in Git. Instead:
┌─────────────────────────────────────────────────────┐
│                    How DVC Works                    │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Your Data File            .dvc File (in Git)       │
│  ┌────────────┐            ┌────────────────┐       │
│  │ data.csv   │  ──────▶   │ data.csv.dvc   │       │
│  │ (100 MB)   │            │ (hash + meta)  │       │
│  └────────────┘            └────────────────┘       │
│        │                           │                │
│        ▼                           ▼                │
│    DVC Cache                Git Repository          │
│  (.dvc/cache)               (Remote origin)         │
│        │                                            │
│        ▼                                            │
│ Remote Storage                                      │
│ (S3, GCS, etc.)                                     │
│                                                     │
└─────────────────────────────────────────────────────┘
- Large file → DVC calculates hash, stores in cache
- Metadata file (.dvc) → Contains hash, stored in Git
- Remote storage → Push cache to S3/GCS for sharing
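To make the flow concrete, here is a minimal sketch that imitates the first two steps by hand with standard tools. It is not DVC's implementation: the temporary directory, the `mock-cache` folder, and the `.pointer` file are hypothetical names for illustration, and it assumes GNU `md5sum` is installed.

```shell
set -eu

# Hypothetical scratch area; nothing here touches a real DVC project.
workdir="$(mktemp -d)"
mkdir -p "$workdir/data" "$workdir/mock-cache"
printf 'id,label\n1,cat\n2,dog\n' > "$workdir/data/data.csv"

# Step 1: hash the file contents.
hash="$(md5sum "$workdir/data/data.csv" | cut -d' ' -f1)"

# Step 2: copy the file to a path derived from its hash (content-addressed).
prefix="$(printf '%s' "$hash" | cut -c1-2)"
rest="$(printf '%s' "$hash" | cut -c3-)"
cache_path="$workdir/mock-cache/$prefix/$rest"
mkdir -p "$workdir/mock-cache/$prefix"
cp "$workdir/data/data.csv" "$cache_path"

# Step 3: write a small pointer file; this is the only part Git would track.
size="$(wc -c < "$workdir/data/data.csv" | tr -d ' ')"
cat > "$workdir/data/data.csv.pointer" <<EOF
md5: $hash
size: $size
path: data.csv
EOF

cat "$workdir/data/data.csv.pointer"
```

The pointer file stays a few lines long no matter how large the data file grows, which is why Git can track it comfortably.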
Tracking Your First File
# Add a data file to DVC tracking
dvc add data/training_data.csv
# This creates:
# data/training_data.csv.dvc - Metadata (track in Git)
# data/.gitignore - Excludes the actual file
# Commit the changes
git add data/training_data.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"
Example .dvc file content:
outs:
- md5: a3b2c1d4e5f6...
  size: 104857600
  hash: md5
  path: training_data.csv
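Because the .dvc file records a content hash, a changed data file can be detected by re-hashing it and comparing. The sketch below imitates that check by hand. It assumes GNU `md5sum` and that the recorded value is the plain MD5 of the file's bytes (my understanding of DVC 3.x behavior for single files); the metadata file is hand-built in a temporary directory rather than produced by `dvc add`.

```shell
set -eu

workdir="$(mktemp -d)"
data="$workdir/training_data.csv"
printf 'id,label\n1,cat\n' > "$data"

# Hand-built stand-in for the metadata file that `dvc add` would write.
tracked_hash="$(md5sum "$data" | cut -d' ' -f1)"
cat > "$data.dvc" <<EOF
outs:
- md5: $tracked_hash
  hash: md5
  path: training_data.csv
EOF

# Simulate an edit made after the file was tracked.
printf '2,dog\n' >> "$data"

# Compare the recorded hash with the hash of the file as it is now.
recorded="$(sed -n 's/^- md5: //p' "$data.dvc")"
actual="$(md5sum "$data" | cut -d' ' -f1)"

if [ "$recorded" = "$actual" ]; then
  status="unchanged"
else
  status="modified"
fi
echo "training_data.csv: $status"
```

In a real project, `dvc status` performs this kind of comparison for you.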
DVC Cache
DVC stores file contents in a local cache:
# Cache location (DVC 3.x; 2.x caches put the hash folders directly under .dvc/cache/)
ls .dvc/cache/files/md5/
# Cache structure (content-addressed)
.dvc/cache/files/md5/
├── a3/
│   └── b2c1d4e5f6... # First 2 chars of hash = folder
├── d7/
│   └── e8f9a0b1c2...
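One consequence of content addressing is that identical content is stored only once, however many files contain it. The following sketch shows the idea with a mock cache. The `store` helper and the directory layout are hypothetical simplifications, not DVC internals, and GNU `md5sum` is assumed.

```shell
set -eu

cache="$(mktemp -d)"   # hypothetical stand-in for a cache directory
src="$(mktemp -d)"

# store FILE: copy FILE into the mock cache under <first 2 hash chars>/<rest>.
store() {
  h="$(md5sum "$1" | cut -d' ' -f1)"
  dir="$cache/$(printf '%s' "$h" | cut -c1-2)"
  mkdir -p "$dir"
  cp "$1" "$dir/$(printf '%s' "$h" | cut -c3-)"
}

printf 'same bytes\n' > "$src/a.csv"
printf 'same bytes\n' > "$src/copy_of_a.csv"
printf 'different bytes\n' > "$src/b.csv"

store "$src/a.csv"
store "$src/copy_of_a.csv"   # same content, so it lands on the same cache path
store "$src/b.csv"

entries="$(find "$cache" -type f | wc -l | tr -d ' ')"
echo "files stored: 3, cache entries: $entries"
```

Three files go in but only two cache entries come out, because the two identical files hash to the same path.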
Key Commands Summary
| Command | Purpose |
|---|---|
| dvc init | Initialize DVC in project |
| dvc add <file> | Start tracking a file |
| dvc push | Upload to remote storage |
| dvc pull | Download from remote storage |
| dvc checkout | Restore files from cache |
| dvc status | Show tracked file status |
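The commands above can be strung together into one round trip. This is a sketch under stated assumptions rather than a recommended setup: it uses throwaway temporary directories, a local folder standing in for S3/GCS, and the hypothetical remote name `localstore`. It skips itself if `git` or `dvc` is not installed.

```shell
set -eu

# Skip cleanly when the tools are missing rather than failing halfway.
if ! command -v git >/dev/null 2>&1 || ! command -v dvc >/dev/null 2>&1; then
  echo "git and dvc are both required for this walkthrough; skipping"
else
  project="$(mktemp -d)"   # hypothetical throwaway project
  remote="$(mktemp -d)"    # local directory standing in for S3/GCS
  cd "$project"

  git init -q
  dvc init -q
  mkdir data
  printf 'id,label\n1,cat\n' > data/training_data.csv

  # Track the file, then commit only the small metadata files to Git.
  dvc add -q data/training_data.csv
  git add data/training_data.csv.dvc data/.gitignore
  git -c user.name="Demo" -c user.email="demo@example.com" \
    commit -q -m "Track training data with DVC"

  # Register a default remote and upload the cached data to it.
  dvc remote add -q -d localstore "$remote"
  dvc push -q

  # Simulate a fresh clone: the data file and the local cache are gone.
  rm -rf data/training_data.csv .dvc/cache
  dvc pull -q
  dvc status

  cat data/training_data.csv
fi
```

After `dvc pull`, the data file is back in the workspace even though Git never stored its contents.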
.dvcignore
Like .gitignore, but for DVC:
# .dvcignore
*.tmp
*.log
__pycache__
.ipynb_checkpoints
Key insight: DVC is "Git for data"—it uses the same mental model of add, commit, push, and pull, but handles large files efficiently.
Next, we'll learn how to version datasets and work with multiple data versions.