Versioning Datasets

Datasets evolve over time: new data arrives, errors are corrected, features are added. DVC records each version as a small pointer file committed to Git, so you can return to any earlier version of the data with a Git checkout followed by a DVC checkout.

Adding Datasets to DVC

Single Files

# Track a single data file
dvc add data/customers.csv

# Track a model file
dvc add models/classifier.pkl
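
Each dvc add writes a small .dvc pointer file that records a content hash of the data, and Git tracks that pointer instead of the data itself. The sketch below only illustrates the idea of content addressing. It does not call DVC. The sample file is hypothetical, md5sum is assumed to be available (GNU coreutils), and the cache path shown is an assumption based on the DVC 3.x layout, so check your own .dvc/cache directory before relying on it.

```shell
# Work in a throwaway directory so nothing in the project is touched
workdir=$(mktemp -d)
printf 'id,name\n1,Ada\n' > "$workdir/sample.csv"   # hypothetical sample data

# Hash of the file contents; identical contents always give the same hash
hash=$(md5sum "$workdir/sample.csv" | cut -d ' ' -f 1)
echo "content hash: $hash"

# Assumed DVC 3.x cache layout: the first two hex characters name a subdirectory
prefix=$(printf '%s' "$hash" | cut -c 1-2)
rest=$(printf '%s' "$hash" | cut -c 3-)
echo "expected cache path: .dvc/cache/files/md5/$prefix/$rest"
```

Because the hash depends only on the contents, re-adding an unchanged file leaves its .dvc file unchanged, and Git sees a new data version only when the contents actually change.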

Directories

# Track an entire directory
dvc add data/images/

# This creates a single .dvc file for the whole directory
# data/images.dvc

Multiple Files at Once

# Track multiple files
dvc add data/train.csv data/test.csv data/validation.csv

# Or use a glob pattern (your shell expands it before DVC runs)
dvc add data/*.parquet

Setting Up Remote Storage

Remote storage is where DVC pushes and pulls actual file contents.

AWS S3

# Add S3 remote
dvc remote add -d myremote s3://my-bucket/dvc-storage

# Configure credentials (if not using an AWS CLI profile)
# --local writes to .dvc/config.local, which is not committed to Git
dvc remote modify --local myremote access_key_id YOUR_KEY
dvc remote modify --local myremote secret_access_key YOUR_SECRET

Google Cloud Storage

# Add GCS remote
dvc remote add -d myremote gs://my-bucket/dvc-storage

# Uses Google Application Default Credentials
# (for example, after running: gcloud auth application-default login)

Azure Blob Storage

# Add Azure remote
dvc remote add -d myremote azure://my-container/dvc-storage

# Configure connection string (--local keeps the secret out of Git)
dvc remote modify --local myremote connection_string 'your-connection-string'

Local/Network Storage

# Local directory (for testing or network drives)
dvc remote add -d myremote /path/to/storage

# SSH remote
dvc remote add -d myremote ssh://user@server/path/to/storage

Push and Pull Workflow

# After adding files and committing to Git
dvc push  # Uploads data to remote storage

# On another machine (after git clone)
dvc pull  # Downloads data from remote storage
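
The "other machine" half of this workflow can be wrapped in a small helper. This is a minimal sketch: clone_with_data is a hypothetical name, and it assumes the repository already has a default DVC remote configured and that the machine has credentials for it.

```shell
# Hypothetical helper: clone a repository, then fetch its DVC-tracked data.
# Runs in a subshell (note the parentheses) so the cd does not leak out.
clone_with_data() (
    repo_url="$1"
    target_dir="$2"
    [ -n "$repo_url" ] && [ -n "$target_dir" ] || {
        echo "usage: clone_with_data <repo-url> <target-dir>" >&2
        exit 2
    }
    # Git brings the code and the small .dvc pointer files
    git clone "$repo_url" "$target_dir" || exit 1
    cd "$target_dir" || exit 1
    # DVC downloads the file contents those pointers refer to
    dvc pull
)
```

A call would look like clone_with_data followed by the repository URL and a directory name; the URL is whatever your project uses.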

Selective Push/Pull

# Push only specific files
dvc push data/train.csv.dvc

# Pull only what you need
dvc pull data/test.csv.dvc

Working with Data Versions

Switching Versions

# Check out a previous Git commit
git checkout v1.0

# Sync DVC files to match that version
dvc checkout

# Your data files now match the v1.0 state
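
Since the two commands always go together, it can help to pair them in one function so the data never lags behind the code. The following is a sketch only; use_data_version is a hypothetical name, not a DVC command.

```shell
# Hypothetical helper: switch code and data to the same Git revision.
# Runs in a subshell so "exit" only leaves the function.
use_data_version() (
    rev="$1"
    [ -n "$rev" ] || {
        echo "usage: use_data_version <git-revision>" >&2
        exit 2
    }
    # Move the .dvc pointer files to the requested revision
    git checkout "$rev" || exit 1
    # Replace workspace data with the versions those pointers reference
    dvc checkout
)
```

For example, use_data_version v1.0 restores the v1.0 data. If that version's files are not in the local cache, dvc checkout will report them as missing and a dvc pull is needed first.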

Viewing History

# See DVC file changes in Git history
git log --oneline data/train.csv.dvc

# Compare data versions
dvc diff HEAD~1
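
The same two commands can be parameterized to inspect any dataset or any pair of revisions. The function names below are hypothetical; the underlying git log and dvc diff invocations are standard, with dvc diff accepting two revisions to compare.

```shell
# Hypothetical helper: list the commits that changed a dataset's pointer file.
dataset_history() {
    # %h = short commit hash, %ad = author date, %s = commit subject
    git log --date=short --format='%h %ad %s' -- "$1"
}

# Hypothetical helper: show which tracked data differs between two revisions.
data_changes_between() {
    dvc diff "$1" "$2"
}
```

For instance, dataset_history data/train.csv.dvc lists every commit that produced a new version of the training set, and data_changes_between v1.0 v2.0 summarizes the files added, modified, or deleted between the two tags.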

Practical Example: Dataset Versioning

Let's walk through a real workflow:

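# Initial setup
mkdir my-project && cd my-project
git init && dvc init
git commit -m "Initialize DVC"

# Configure remote and commit its settings so collaborators can use it
dvc remote add -d storage s3://my-bucket/customer-data
git add .dvc/config
git commit -m "Configure DVC remote storage"

# Add initial dataset (assumes data/customers.csv has been copied into the project)
dvc add data/customers.csv
git add data/customers.csv.dvc data/.gitignore
git commit -m "Add initial customer dataset v1"
git tag v1.0
dvc push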

# Later: Update dataset with new records
# (After modifying customers.csv)
dvc add data/customers.csv
git add data/customers.csv.dvc
git commit -m "Update customer dataset - added Q4 data"
git tag v2.0
dvc push

# Switch between versions anytime
git checkout v1.0
dvc checkout
# Now you have v1.0 of the data

git checkout v2.0
dvc checkout
# Now you have v2.0 of the data
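
The update step in this walkthrough repeats the same five commands each time the dataset changes, so it is a natural candidate for a helper. This is a sketch under assumptions: snapshot_dataset is a hypothetical name, a default remote is assumed to be configured, and the .gitignore that dvc add maintains is assumed to sit next to the data file.

```shell
# Hypothetical helper: record a new dataset version and upload it.
# Usage: snapshot_dataset <data-path> <commit-message> [tag]
snapshot_dataset() (
    data_path="${1%/}"        # drop a trailing slash from directory paths
    message="$2"
    tag="$3"
    [ -n "$data_path" ] && [ -n "$message" ] || {
        echo "usage: snapshot_dataset <data-path> <commit-message> [tag]" >&2
        exit 2
    }
    # Hash the data and refresh its .dvc pointer file
    dvc add "$data_path" || exit 1
    # Commit the pointer (and the .gitignore entry) to Git
    git add "$data_path.dvc" "$(dirname "$data_path")/.gitignore" || exit 1
    git commit -m "$message" || exit 1
    # Optionally label this version for easy checkout later
    if [ -n "$tag" ]; then
        git tag "$tag" || exit 1
    fi
    # Upload the file contents to remote storage
    dvc push "$data_path.dvc"
)
```

With this in place, the Q4 update above becomes snapshot_dataset data/customers.csv "Update customer dataset - added Q4 data" v2.0. The function stops at the first failing step, so an unchanged dataset (nothing to commit) does not get tagged or pushed.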

Data Status and Tracking

# Check status of tracked files
dvc status

# Example output:
# data/customers.csv.dvc:
#     changed outs:
#         modified: data/customers.csv

# See what's tracked by DVC (-R lists files recursively)
dvc list . -R --dvc-only
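
The status check can also serve as a guard in a pre-commit hook or CI job. The sketch below assumes the documented behavior of the quiet flag, where dvc status --quiet prints nothing and exits non-zero when tracked data is out of date; verify this against your installed DVC version. The function name require_clean_data is hypothetical.

```shell
# Hypothetical guard: fail when tracked data no longer matches its .dvc file.
require_clean_data() {
    # Assumption: with --quiet, dvc status signals changes via its exit code
    if dvc status --quiet; then
        echo "DVC-tracked data matches its .dvc files"
    else
        echo "DVC-tracked data has changed; run 'dvc add' to record it or 'dvc checkout' to discard it" >&2
        return 1
    fi
}
```

Calling require_clean_data before git commit helps catch the common mistake of editing a dataset and committing code without refreshing the pointer file.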

Importing External Data

DVC can import datasets from other DVC repositories or from plain URLs. In both cases it records the source, so the data can be refreshed later:

# Import from another DVC repo
dvc import https://github.com/iterative/dataset-registry \
    get-started/data.xml

# Import a file from a URL and track it
dvc import-url https://data.example.com/dataset.zip data/dataset.zip
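
Imported data keeps a link to its source, and dvc update re-fetches it when the source changes. The loop below is a minimal sketch for refreshing several imports at once; refresh_imports is a hypothetical name, and the .dvc file names passed to it are whatever your imports produced.

```shell
# Hypothetical helper: refresh one or more imported datasets from their sources.
refresh_imports() {
    for import_file in "$@"; do
        # dvc update re-downloads the data if the source has changed
        dvc update "$import_file" || return 1
    done
}
```

For example, refresh_imports data.xml.dvc updates the registry import. If a .dvc file changes as a result, commit it to Git so the new version is recorded like any other dataset update.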

Best Practices

Practice	Why
Version on meaningful changes	Not every edit needs a commit
Use tags for releases	Easy to reference specific versions
Document data changes	Commit messages should explain what changed
Keep .dvc files in Git	They're small, always commit them
Don't edit .dvc files manually	Let DVC manage them

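Key insight: Think of each DVC data version like a Git commit. Because the .dvc files live in Git history, you can return to any point in your data's history, as long as that version's files are still in your cache or remote storage.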

Next, we'll learn how to version models and track model artifacts.