DVC + CML for ML Automation

DVC in CI/CD Pipelines

What is DVC?

DVC (Data Version Control) is an open-source tool for ML data and model versioning. It works alongside Git to version large files, datasets, and ML models that Git can't handle efficiently.

Key DVC concepts:

  • Data versioning: Track large files without storing them in Git
  • Remote storage: Store data in S3, GCS, Azure, or local storage
  • Pipelines: Define reproducible ML workflows
  • Metrics & Plots: Track experiment results
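When you track a file with `dvc add`, DVC writes a small pointer file that is committed to Git in place of the data itself. As an illustrative sketch (the hash and size below are made up), a pointer file such as `data.dvc` looks roughly like this:

```yaml
# data.dvc -- committed to Git; the actual data lives in the DVC cache/remote
outs:
- md5: a304afb96060aad90176268345e10355
  size: 14445097
  path: data.csv
```

Git versions this tiny pointer, while `dvc push`/`dvc pull` move the real file to and from remote storage.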

DVC Setup for CI/CD

# Install DVC with cloud storage support
pip install 'dvc[s3]'  # For AWS S3
pip install 'dvc[gs]'  # For Google Cloud Storage
pip install 'dvc[azure]'  # For Azure Blob Storage

# Initialize DVC in your repo
dvc init

# Configure remote storage
dvc remote add -d myremote s3://my-bucket/dvc-storage
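`dvc init` and `dvc remote add -d` record their settings in `.dvc/config`, which is committed to Git so CI runners see the same remote. After the commands above it should look roughly like:

```ini
# .dvc/config (committed to Git)
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-bucket/dvc-storage
```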

dvc.yaml Pipeline Definition

Define your ML pipeline in dvc.yaml:

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    outs:
      - data/prepared/

  featurize:
    cmd: python src/featurize.py
    deps:
      - src/featurize.py
      - data/prepared/
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features/

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features/
    params:
      - train.n_estimators
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/features/
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - metrics/plots/:
          cache: false
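Each stage's `cmd` is an ordinary script. As a minimal, hypothetical sketch of what `src/evaluate.py` might look like (a real script would load `models/model.pkl` and `data/features/`; here the predictions are stubbed with stdlib-only code), note how it writes the JSON file that `dvc.yaml` declares under `metrics:`:

```python
# src/evaluate.py -- hypothetical sketch: score a model and write the metrics
# file declared in dvc.yaml (cache: false means the JSON is committed to Git,
# not stored in the DVC cache).
import json
from pathlib import Path


def evaluate(predictions, labels):
    """Compute simple classification metrics from parallel lists."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels), "n_samples": len(labels)}


def main():
    # In a real pipeline these would come from models/model.pkl and data/features/.
    predictions = [1, 0, 1, 1, 0]
    labels = [1, 0, 1, 0, 0]

    metrics = evaluate(predictions, labels)
    out = Path("metrics/eval_metrics.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(metrics, indent=2))


if __name__ == "__main__":
    main()
```

Because the metrics file is plain JSON in Git, `dvc metrics diff` can compare it across commits without touching remote storage.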

params.yaml for Hyperparameters

# params.yaml
featurize:
  max_features: 200
  ngrams: 2

train:
  n_estimators: 100
  learning_rate: 0.1
  max_depth: 5
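DVC does not inject these values into your scripts; by convention each stage reads `params.yaml` itself (DVC also offers `dvc.api.params_show()` for this). A minimal sketch, assuming PyYAML is installed, with the file content inlined as a string for illustration:

```python
# Hypothetical sketch: how src/train.py might read its section of params.yaml.
import yaml  # PyYAML; assumed installed (pip install pyyaml)

# Inlined here for illustration; a real script would read the file:
#   params = yaml.safe_load(open("params.yaml"))
PARAMS_TEXT = """\
featurize:
  max_features: 200
  ngrams: 2
train:
  n_estimators: 100
  learning_rate: 0.1
  max_depth: 5
"""


def load_params(text, section):
    """Parse params.yaml content and return one stage's parameter dict."""
    return yaml.safe_load(text)[section]


train_params = load_params(PARAMS_TEXT, "train")
# train_params["n_estimators"] -> 100
```

Because `dvc.yaml` lists `train.n_estimators` etc. under `params:`, DVC re-runs the stage only when those specific keys change.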

Running DVC Pipelines in CI/CD

# .github/workflows/ml-pipeline.yml
name: ML Pipeline
on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'params.yaml'
      - 'dvc.yaml'

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install 'dvc[s3]' pandas scikit-learn

      - name: Configure DVC remote
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          # --local writes credentials to .dvc/config.local (gitignored),
          # so secrets never end up in the committed .dvc/config
          dvc remote modify --local myremote access_key_id $AWS_ACCESS_KEY_ID
          dvc remote modify --local myremote secret_access_key $AWS_SECRET_ACCESS_KEY

      - name: Pull data
        run: dvc pull

      - name: Run pipeline
        run: dvc repro

      - name: Push results
        run: dvc push

      - name: Commit DVC files
        run: |
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add dvc.lock metrics/
          git diff --staged --quiet || git commit -m "Update DVC lock and metrics"
          git push
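One caveat with the final `git push`: it authenticates with the workflow's `GITHUB_TOKEN`, and on repositories where that token defaults to read-only, the job needs explicit write permission. A hedged fragment showing the relevant addition:

```yaml
# Add to the job definition if your repo's default GITHUB_TOKEN is read-only
jobs:
  train:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # allow the job's GITHUB_TOKEN to push commits
```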

Data-Triggered Pipelines

Run pipelines when data changes:

# .github/workflows/data-update.yml
name: Data Update Pipeline
on:
  push:
    paths:
      - '*.dvc'  # Trigger when .dvc files change
      - 'data.dvc'

jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install 'dvc[s3]'

      - name: Pull new data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull

      - name: Check if data changed
        id: check
        run: |
          if dvc status | grep -q "changed"; then
            echo "data_changed=true" >> $GITHUB_OUTPUT
          else
            echo "data_changed=false" >> $GITHUB_OUTPUT
          fi

      - name: Run pipeline if data changed
        if: steps.check.outputs.data_changed == 'true'
        run: dvc repro

DVC Metrics Comparison

Compare metrics across experiments:

# .github/workflows/compare-metrics.yml
name: Compare Model Metrics
on:
  pull_request:
    branches: [main]

jobs:
  compare:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for comparison

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install DVC
        run: pip install 'dvc[s3]'

      - name: Pull data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull

      - name: Run pipeline
        run: dvc repro

      - name: Compare metrics
        run: |
          echo "## Model Metrics Comparison" >> $GITHUB_STEP_SUMMARY
          echo '```' >> $GITHUB_STEP_SUMMARY
          dvc metrics diff main >> $GITHUB_STEP_SUMMARY
          echo '```' >> $GITHUB_STEP_SUMMARY

Key Takeaways

DVC Feature         CI/CD Usage
------------------  ---------------------------------
dvc pull            Fetch data before training
dvc repro           Run pipeline stages
dvc push            Store outputs in remote storage
dvc metrics diff    Compare experiment metrics
.dvc files          Trigger pipelines on data changes
