DVC + CML for ML Automation
DVC in CI/CD Pipelines
What is DVC?
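DVC (Data Version Control) is an open-source tool for versioning ML data and models. It works alongside Git: Git tracks small metadata files (.dvc files and dvc.lock), while DVC stores the large files, datasets, and models they point to in a local cache and in remote storage, since Git can't handle such files efficiently.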
Key DVC concepts:
- Data versioning: Track large files without storing them in Git
- Remote storage: Store data in S3, GCS, Azure, or local storage
- Pipelines: Define reproducible ML workflows
- Metrics & Plots: Track experiment results
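The first two concepts can be illustrated with a small sketch. It is not DVC's implementation; it only shows the general idea of content-addressed tracking, where the large file is copied into a cache under its hash and a tiny pointer file (the part that would go into Git) records that hash. The function name track, the .pointer.json suffix, and the pointer layout are assumptions made for this illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into a hash-addressed cache and write a small pointer file.

    Hypothetical helper: the pointer layout here is illustrative only.
    """
    md5 = file_md5(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, cache_dir / md5)  # the large file lives in the cache
    pointer = path.with_name(path.name + ".pointer.json")
    pointer.write_text(
        json.dumps({"path": path.name, "md5": md5, "size": path.stat().st_size})
    )
    return pointer  # this small file is what version control would store
```

Because the pointer only holds a hash and a size, committing it stays cheap no matter how large the data file is, and two versions of the data differ in Git by a single line.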
DVC Setup for CI/CD
# Install DVC with the extra that matches your remote storage (one is enough)
pip install 'dvc[s3]'     # For AWS S3
pip install 'dvc[gs]'     # For Google Cloud Storage
pip install 'dvc[azure]'  # For Azure Blob Storage
# Initialize DVC in your repo
dvc init
# Configure remote storage
dvc remote add -d myremote s3://my-bucket/dvc-storage
dvc.yaml Pipeline Definition
Define your ML pipeline in dvc.yaml:
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    outs:
      - data/prepared/
  featurize:
    cmd: python src/featurize.py
    deps:
      - src/featurize.py
      - data/prepared/
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features/
    params:
      - train.n_estimators
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/features/
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - metrics/plots/:
          cache: false
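The stages never name each other; the order follows from which stage produces the paths another stage lists under deps. The sketch below shows that idea with the standard library, using the deps and outs from the dvc.yaml above (script dependencies left out for brevity). It is an illustration of the principle, not DVC's own graph code, and execution_order is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Data deps and outs copied from the dvc.yaml above.
STAGES = {
    "prepare": {"deps": ["data/raw/"], "outs": ["data/prepared/"]},
    "featurize": {"deps": ["data/prepared/"], "outs": ["data/features/"]},
    "train": {"deps": ["data/features/"], "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["models/model.pkl", "data/features/"], "outs": []},
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so each one runs after the stages that produce its deps."""
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    # Map every stage to the set of stages it must wait for.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))
# ['prepare', 'featurize', 'train', 'evaluate']
```

A dependency such as data/raw/ has no producing stage, so it is treated as an external input rather than an edge in the graph.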
params.yaml for Hyperparameters
# params.yaml
featurize:
  max_features: 200
  ngrams: 2
train:
  n_estimators: 100
  learning_rate: 0.1
  max_depth: 5
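The params entries in dvc.yaml are dotted paths into this file. The following sketch resolves such paths against a nested dictionary that mirrors params.yaml; in a real script the dictionary would come from a YAML loader, which is left out here to keep the example dependency-free. The helper names resolve and tracked_values are assumptions for illustration.

```python
from functools import reduce

# Mirrors params.yaml above; normally loaded from the file with a YAML parser.
PARAMS = {
    "featurize": {"max_features": 200, "ngrams": 2},
    "train": {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 5},
}


def resolve(params: dict, dotted_key: str):
    """Look up a key such as 'train.n_estimators' in a nested dict."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), params)


def tracked_values(params: dict, keys: list[str]) -> dict:
    """Return only the values a stage declares under `params` in dvc.yaml."""
    return {key: resolve(params, key) for key in keys}


# The train stage lists two of the three train.* parameters.
print(tracked_values(PARAMS, ["train.n_estimators", "train.learning_rate"]))
# {'train.n_estimators': 100, 'train.learning_rate': 0.1}
```

Note that train.max_depth is defined in params.yaml but not listed under the train stage, so a change to it alone would not mark that stage as changed. Adding it to the stage's params list would close that gap.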
Running DVC Pipelines in CI/CD
# .github/workflows/ml-pipeline.yml
name: ML Pipeline
on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'params.yaml'
      - 'dvc.yaml'
jobs:
  train:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # Required for the final git push
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install 'dvc[s3]' pandas scikit-learn
      - name: Configure DVC remote
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          # --local writes to .dvc/config.local, which Git ignores,
          # so the credentials cannot end up in a commit
          dvc remote modify --local myremote access_key_id "$AWS_ACCESS_KEY_ID"
          dvc remote modify --local myremote secret_access_key "$AWS_SECRET_ACCESS_KEY"
      - name: Pull data
        run: dvc pull
      - name: Run pipeline
        run: dvc repro
      - name: Push results
        run: dvc push
      - name: Commit DVC files
        run: |
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add dvc.lock metrics/
          # Commit only if something is staged
          git diff --staged --quiet || git commit -m "Update DVC lock and metrics"
          git push
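The commit step adds metrics/ to Git, which works because the metrics files are declared with cache: false and therefore stay as ordinary files in the repository. A stage script only needs to write a flat JSON file for this to work. The sketch below shows one way src/evaluate.py could do that; the accuracy helper and the toy labels in the usage note are placeholders, not the tutorial's actual evaluation code.

```python
import json
from pathlib import Path


def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def write_metrics(metrics: dict, path: Path) -> None:
    """Write metrics as JSON, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and a trailing newline keep Git diffs small and stable.
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
```

For example, write_metrics({"accuracy": accuracy([1, 0, 1, 1], [1, 0, 0, 1])}, Path("metrics/eval_metrics.json")) would store an accuracy of 0.75 for those toy labels at the path the evaluate stage declares.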
Data-Triggered Pipelines
Run pipelines when data changes:
# .github/workflows/data-update.yml
name: Data Update Pipeline
on:
  push:
    paths:
      - '**/*.dvc'  # Trigger when any .dvc file changes, at any depth
jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        # The pipeline scripts need the same packages as in the training workflow
        run: pip install 'dvc[s3]' pandas scikit-learn
      - name: Pull new data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull
      - name: Check if data changed
        id: check
        run: |
          # dvc status lists "changed deps" / "changed outs" for stale stages
          if dvc status | grep -q "changed"; then
            echo "data_changed=true" >> "$GITHUB_OUTPUT"
          else
            echo "data_changed=false" >> "$GITHUB_OUTPUT"
          fi
      - name: Run pipeline if data changed
        if: steps.check.outputs.data_changed == 'true'
        run: dvc repro
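Matching the word "changed" in human-readable output is fragile. A possible alternative, sketched below, is to parse machine-readable status output instead. It assumes the installed DVC version supports dvc status --json and prints an empty JSON object when everything is up to date; the exact shape of a non-empty report is treated as opaque here. The function names are hypothetical.

```python
import json
import subprocess


def current_status() -> str:
    """Run `dvc status --json` (assumes the dvc CLI is on PATH and supports --json)."""
    result = subprocess.run(
        ["dvc", "status", "--json"], check=True, capture_output=True, text=True
    )
    return result.stdout


def pipeline_is_stale(status_json: str) -> bool:
    """True when the report lists at least one stage or file; empty means up to date."""
    report = json.loads(status_json.strip() or "{}")
    return bool(report)


def github_output_line(stale: bool) -> str:
    """Format the value in the key=value form that $GITHUB_OUTPUT expects."""
    return f"data_changed={'true' if stale else 'false'}"
```

In the workflow, a short script could append github_output_line(pipeline_is_stale(current_status())) to the file named by the GITHUB_OUTPUT environment variable, replacing the grep-based check.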
DVC Metrics Comparison
Compare a pull request's metrics against the main branch:
# .github/workflows/compare-metrics.yml
name: Compare Model Metrics
on:
  pull_request:
    branches: [main]
jobs:
  compare:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history, so origin/main is available for comparison
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install 'dvc[s3]' pandas scikit-learn
      - name: Pull data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull
      - name: Run pipeline
        run: dvc repro
      - name: Compare metrics
        run: |
          echo "## Model Metrics Comparison" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          # The PR checkout has no local main branch, so compare against origin/main
          dvc metrics diff origin/main >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
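Conceptually, the comparison reads the same metrics file at two revisions and reports the difference per metric. The sketch below reproduces that idea for two already-loaded metric dictionaries and renders a Markdown table suitable for a job summary. It is an illustration only, not how dvc metrics diff is implemented, and both function names are assumptions.

```python
from typing import Optional


def diff_metrics(base: dict, head: dict) -> list[tuple[str, Optional[float], Optional[float], Optional[float]]]:
    """Return (name, base value, head value, change) for every metric in either dict."""
    rows = []
    for name in sorted(set(base) | set(head)):
        old, new = base.get(name), head.get(name)
        # A change is only defined when the metric exists on both sides.
        change = new - old if old is not None and new is not None else None
        rows.append((name, old, new, change))
    return rows


def to_markdown(rows) -> str:
    """Render the rows as a Markdown table; missing values are shown as '-'."""
    lines = ["| Metric | Base | Head | Change |", "|---|---|---|---|"]
    for name, old, new, change in rows:
        cells = ["-" if value is None else f"{value:.4f}" for value in (old, new, change)]
        lines.append("| " + " | ".join([name, *cells]) + " |")
    return "\n".join(lines)
```

Here base would be the contents of metrics/eval_metrics.json on the main branch and head the contents produced by the pull request; a metric that exists on only one side gets no change value rather than a misleading number.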
Key Takeaways
| DVC Feature | CI/CD Usage |
|---|---|
| dvc pull | Fetch data before training |
| dvc repro | Run pipeline stages |
| dvc push | Store outputs to remote |
| dvc metrics diff | Compare experiment metrics |
| .dvc files | Trigger data change pipelines |