DVC Fundamentals

DVC (Data Version Control) works alongside Git to version large files, datasets, and ML models. Git tracks small metadata files, while the data itself is stored outside the repository.

Why DVC?

Problem	Without DVC	With DVC
Large files	Git slows down, fails	Tracked via metadata
Dataset versions	Copy folders manually	`git checkout` plus `dvc checkout` restores any version
Collaboration	Share via Dropbox/Drive	Push/pull like Git
Reproducibility	"Which data did I use?"	Exact version tracked

Installation

DVC requires Python 3.9+:

# Basic installation
pip install dvc

# With S3 support (the quotes stop shells such as zsh from expanding the brackets)
pip install "dvc[s3]"

# With all cloud storage support
pip install "dvc[all]"

# Verify installation
dvc version

Initializing DVC

Set up DVC in an existing Git repository:

# Navigate to your project
cd my-ml-project

# Initialize Git (if not already)
git init

# Initialize DVC
dvc init

# This creates:
# .dvc/            - DVC configuration
# .dvc/.gitignore  - Keeps DVC's cache and temporary files out of Git
# .dvcignore       - Patterns for files DVC should ignore

After initialization:

# Commit DVC initialization
git add .dvc .dvcignore
git commit -m "Initialize DVC"

How DVC Works

DVC doesn't store your data in Git. Instead:

┌─────────────────────────────────────────────────────┐
│                  How DVC Works                       │
├─────────────────────────────────────────────────────┤
│                                                      │
│  Your Data File          .dvc File (in Git)         │
│  ┌────────────┐         ┌────────────────┐          │
│  │ data.csv   │ ──────▶ │ data.csv.dvc   │          │
│  │ (100 MB)   │         │ (hash + meta)  │          │
│  └────────────┘         └────────────────┘          │
│       │                         │                    │
│       ▼                         ▼                    │
│  DVC Cache              Git Repository              │
│  (.dvc/cache)           (Remote origin)             │
│       │                                              │
│       ▼                                              │
│  Remote Storage                                      │
│  (S3, GCS, etc.)                                    │
│                                                      │
└─────────────────────────────────────────────────────┘

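Large file → DVC computes a hash of the contents and stores the file in the local cache
Metadata file (.dvc) → Records that hash; this small file is what Git tracks
Remote storage → `dvc push` uploads cached files to S3, GCS, etc. so others can `dvc pull` them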
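
The flow above can be imitated with standard tools to make the idea concrete. The following is a toy sketch, not DVC's actual implementation: the `toy-cache` directory, the `.pointer` extension, and the sample data are all hypothetical, and it assumes GNU `md5sum` is available (macOS ships `md5` instead).

```shell
# Toy model of content-addressed tracking; NOT how DVC is implemented internally.
set -eu

workdir="$(mktemp -d)"        # scratch area so no real project is touched
cache="$workdir/toy-cache"    # hypothetical stand-in for the DVC cache
data="$workdir/data.csv"

printf 'id,label\n1,cat\n2,dog\n' > "$data"

# 1. Hash the file contents.
digest="$(md5sum "$data" | cut -d' ' -f1)"

# 2. Store a copy under a path derived from the hash
#    (first two characters = folder, remainder = file name).
object="$cache/$(printf '%s' "$digest" | cut -c1-2)/$(printf '%s' "$digest" | cut -c3-)"
mkdir -p "$(dirname "$object")"
cp "$data" "$object"

# 3. Write a small pointer file; this is the only part Git would need to track.
pointer="$data.pointer"
printf 'md5: %s\nsize: %s\npath: %s\n' \
  "$digest" "$(wc -c < "$data" | tr -d ' ')" "$(basename "$data")" > "$pointer"

cat "$pointer"
```

Because the storage path depends only on the contents, two identical files map to the same cache entry, and any change to the data produces a new entry instead of overwriting the old one.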

Tracking Your First File

# Add a data file to DVC tracking
dvc add data/training_data.csv

# This creates:
# data/training_data.csv.dvc  - Metadata (track in Git)
# data/.gitignore             - Excludes the actual file

# Commit the changes
git add data/training_data.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

Example .dvc file content:

outs:
- md5: a3b2c1d4e5f6...
  size: 104857600
  hash: md5
  path: training_data.csv
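
To see what the recorded hash is good for, the sketch below compares the `md5` value in a `.dvc` file with a fresh hash of the data. It is only an illustration of the idea behind change detection, under two assumptions: the `.dvc` file here is written by hand for the demo (not produced by DVC), and the `md5` field is taken to be the plain MD5 of the file contents, which is how DVC 3.x files marked `hash: md5` are understood to work. The helper name `check_file` is hypothetical.

```shell
# Compare the hash recorded in a .dvc file with the file's current hash.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in data plus a hand-written .dvc file (demo only).
printf 'id,label\n1,cat\n' > training_data.csv
cat > training_data.csv.dvc <<EOF
outs:
- md5: $(md5sum training_data.csv | cut -d' ' -f1)
  size: $(wc -c < training_data.csv | tr -d ' ')
  hash: md5
  path: training_data.csv
EOF

# Hypothetical helper: prints "unchanged" or "modified".
check_file() {
  recorded="$(awk '$1 == "-" && $2 == "md5:" {print $3}' "$1")"
  actual="$(md5sum "$2" | cut -d' ' -f1)"
  if [ "$recorded" = "$actual" ]; then echo unchanged; else echo modified; fi
}

before="$(check_file training_data.csv.dvc training_data.csv)"
echo '2,dog' >> training_data.csv      # edit the data without updating the .dvc file
after="$(check_file training_data.csv.dvc training_data.csv)"

echo "before edit: $before"
echo "after edit:  $after"
```

In a real project you would not do this by hand; `dvc status` reports tracked files whose contents no longer match their `.dvc` files.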

DVC Cache

DVC stores file contents in a local cache:

# Cache location
ls .dvc/cache/

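# Cache structure (content-addressed; DVC 3.x layout shown,
# DVC 2.x placed the two-character folders directly under .dvc/cache/)
.dvc/cache/files/md5/
├── a3/
│   └── b2c1d4e5f6...  # Folder = first 2 chars of hash, file name = the rest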
├── d7/
│   └── e8f9a0b1c2...

Key Commands Summary

Command	Purpose
`dvc init`	Initialize DVC in project
`dvc add <file>`	Start tracking a file
`dvc push`	Upload to remote storage
`dvc pull`	Download from remote storage
`dvc checkout`	Sync workspace files with the current .dvc files (from cache)
`dvc status`	Show which tracked files have changed
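
The commands in the table can be tried end to end without any cloud account by pointing a remote at a local folder. This is a minimal sketch under stated assumptions: the remote name `localstore`, the `remote-storage` folder, the sample data, and the demo Git identity are all hypothetical, and the script skips itself if `dvc` or `git` is not installed.

```shell
# End-to-end demo: init, add, push, then pull into a "fresh" workspace.
set -eu

# Skip when the tools are missing so the sketch is safe to paste anywhere.
if ! command -v dvc >/dev/null 2>&1 || ! command -v git >/dev/null 2>&1; then
  echo "dvc or git not installed; skipping demo"
  exit 0
fi

workdir="$(mktemp -d)"
remote_dir="$workdir/remote-storage"   # local folder standing in for S3/GCS
mkdir -p "$workdir/project/data" "$remote_dir"
cd "$workdir/project"

git init -q
dvc init -q

# Track a small sample file and commit its metadata to Git.
printf 'id,label\n1,cat\n' > data/training_data.csv
dvc add -q data/training_data.csv
git add .dvc .dvcignore data/training_data.csv.dvc data/.gitignore
git -c user.name="Demo" -c user.email="demo@example.com" \
  commit -q -m "Track training data with DVC"

# Register the folder as the default remote and upload the cached data.
dvc remote add -q -d localstore "$remote_dir"
dvc push -q

# Simulate a fresh clone: delete the working copy and the local cache.
rm -rf data/training_data.csv .dvc/cache
dvc pull -q

cat data/training_data.csv
```

Swapping the local folder for an `s3://` or `gs://` URL in `dvc remote add` is the usual next step, provided the matching extra (for example `dvc[s3]`) is installed and credentials are configured.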

.dvcignore

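`.dvcignore` uses the same pattern syntax as `.gitignore` and tells DVC which files to skip, for example when you add a whole directory: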

# .dvcignore
*.tmp
*.log
__pycache__
.ipynb_checkpoints

Key insight: DVC is "Git for data". It follows the same mental model of add, commit, push, and pull, but keeps large files out of the Git repository and tracks them through small metadata files.

Next, we'll learn how to version datasets and work with multiple data versions.