Data & Model Versioning with DVC
Versioning Datasets
Datasets evolve over time: new data arrives, errors are corrected, features are added. DVC records each version in a small .dvc metadata file that you commit to Git, so switching between dataset versions takes a Git checkout followed by a DVC checkout.
Adding Datasets to DVC
Single Files
# Track a single data file
dvc add data/customers.csv
# Track a model file
dvc add models/classifier.pkl
Directories
# Track an entire directory
dvc add data/images/
# This creates a single .dvc file for the whole directory
# data/images.dvc
Multiple Files at Once
# Track multiple files
dvc add data/train.csv data/test.csv data/validation.csv
# Or let your shell expand a glob pattern
dvc add data/*.parquet
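Each dvc add writes a .dvc file next to the data and lists the data path in a .gitignore in the same directory; both belong in Git. As a minimal sketch, the helper below runs the two steps together for each path. The name track_and_stage is hypothetical (it is not a DVC command), and the function assumes it is called inside a repository where git init and dvc init have already been run.

```shell
# Hypothetical helper (not part of DVC): track each path with DVC,
# then stage the resulting .dvc file and .gitignore in Git.
track_and_stage() {
  local path
  for path in "$@"; do
    path="${path%/}"   # drop a trailing slash from directory arguments
    dvc add "$path" || return 1
    git add "$path.dvc" "$(dirname "$path")/.gitignore" || return 1
  done
}

# Usage (inside an initialized Git + DVC repository):
#   track_and_stage data/train.csv data/images/
#   git commit -m "Track training data with DVC"
```

Stopping at the first failure keeps Git from staging a .dvc file for data that DVC could not add.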
Setting Up Remote Storage
Remote storage is where DVC pushes and pulls actual file contents.
AWS S3
# Add S3 remote
dvc remote add -d myremote s3://my-bucket/dvc-storage
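# Configure credentials (if not using an AWS CLI profile)
# --local writes to .dvc/config.local, which Git ignores, so secrets are not committed
dvc remote modify --local myremote access_key_id YOUR_KEY
dvc remote modify --local myremote secret_access_key YOUR_SECRET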
Google Cloud Storage
# Add GCS remote
dvc remote add -d myremote gs://my-bucket/dvc-storage
# Uses default application credentials
Azure Blob Storage
# Add Azure remote
dvc remote add -d myremote azure://my-container/dvc-storage
# Configure connection string (--local keeps it out of Git)
dvc remote modify --local myremote connection_string 'your-connection-string'
Local/Network Storage
# Local directory (for testing or network drives)
dvc remote add -d myremote /path/to/storage
# SSH remote
dvc remote add -d myremote ssh://user@server/path/to/storage
Push and Pull Workflow
# After adding files and committing to Git
dvc push # Uploads data to remote storage
# On another machine (after git clone)
dvc pull # Downloads data from remote storage
Selective Push/Pull
# Push only specific files
dvc push data/train.csv.dvc
# Pull only what you need
dvc pull data/test.csv.dvc
Working with Data Versions
Switching Versions
# Check out a previous Git commit
git checkout v1.0
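# Sync the data files in your workspace to match the .dvc files at that commit
# (use dvc pull instead if that version is not in your local cache)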
dvc checkout
# Your data files now match the v1.0 state
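Because the two commands always travel together, it can help to wrap them. The following is a minimal sketch, assuming a Bash-compatible shell; checkout_data_version is a hypothetical name, not a DVC command.

```shell
# Hypothetical helper (not part of DVC): move code and data to one revision together.
checkout_data_version() {
  if [ "$#" -ne 1 ]; then
    echo "usage: checkout_data_version <git-revision>" >&2
    return 2
  fi
  # Move Git (and with it the .dvc files) to the requested revision.
  git checkout "$1" || return 1
  # Restore the data files that those .dvc files describe.
  dvc checkout
}

# Usage:
#   checkout_data_version v1.0
```

The function stops if git checkout fails, so the data is never synced against a revision that Git could not reach.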
Viewing History
# See DVC file changes in Git history
git log --oneline data/train.csv.dvc
# Compare tracked data between the previous commit and your current workspace
dvc diff HEAD~1
Practical Example: Dataset Versioning
Let's walk through a complete workflow, from the first commit of a dataset to switching between two tagged versions:
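# Initial setup
mkdir my-project && cd my-project
git init && dvc init
# Configure remote and commit it so every later version knows where the data lives
dvc remote add -d storage s3://my-bucket/customer-data
git add .dvc/config
git commit -m "Initialize DVC with remote storage"
# Add initial dataset (assumes customers.csv has been copied into data/)
dvc add data/customers.csv
git add data/customers.csv.dvc data/.gitignore
git commit -m "Add initial customer dataset v1"
git tag v1.0
dvc push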
# Later: Update dataset with new records
# (After modifying customers.csv)
dvc add data/customers.csv
git add data/customers.csv.dvc
git commit -m "Update customer dataset - added Q4 data"
git tag v2.0
dvc push
# Switch between versions anytime
git checkout v1.0
dvc checkout
# Now you have v1.0 of the data
git checkout v2.0
dvc checkout
# Now you have v2.0 of the data
Data Status and Tracking
# Check status of tracked files
dvc status
# Example output:
# data/customers.csv.dvc:
# changed outs:
# modified: data/customers.csv
# See what's tracked
dvc list . --dvc-only
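dvc status reports a file as modified when its current contents no longer match the hash recorded in its .dvc file. The sketch below imitates that comparison with md5sum alone. It does not call DVC, it assumes GNU coreutils (md5sum, mktemp) are available, and the file name and rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Illustration only: content-hash change detection, without DVC itself.
set -euo pipefail

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

printf 'id,name\n1,Ada\n' > "$workdir/customers.csv"

# Record the hash of the current contents (a stand-in for what a .dvc file stores).
recorded="$(md5sum "$workdir/customers.csv" | cut -d' ' -f1)"

# Append a row, then hash the file again.
printf '2,Grace\n' >> "$workdir/customers.csv"
current="$(md5sum "$workdir/customers.csv" | cut -d' ' -f1)"

if [ "$recorded" != "$current" ]; then
  echo "modified: customers.csv"
else
  echo "unchanged: customers.csv"
fi
```

Running dvc add again after an edit plays the role of updating the recorded hash, which is why the updated .dvc file then needs a new Git commit.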
Importing External Data
DVC can import datasets from other DVC or Git repositories, or from a plain URL, and it records where each import came from:
# Import from another DVC repo
dvc import https://github.com/iterative/dataset-registry \
    get-started/data.xml
# Import a file from a URL and track it
dvc import-url https://data.example.com/dataset.zip data/
Best Practices
| Practice | Why |
|---|---|
| Version on meaningful changes | Not every edit needs a commit |
| Use tags for releases | Easy to reference specific versions |
| Document data changes | Commit messages should explain what changed |
| Keep .dvc files in Git | They're small, always commit them |
| Don't edit .dvc files manually | Let DVC manage them |
Key insight: DVC data versions behave like Git commits. As long as the data has been pushed to a remote (or is still in your local cache), you can return to any point in your data's history.
Next, we'll learn how to version models and track model artifacts.