DVC + CML for ML Automation

Self-Hosted Runners for Training

Why Self-Hosted Runners?

The free hosted runners on GitHub Actions and GitLab CI have significant limitations for ML workloads:

  • No GPU support
  • Limited memory (around 7 GB on standard runners)
  • Job time limits (6 hours on GitHub, 1 hour by default on GitLab)
  • Shared resources

CML (Continuous Machine Learning) addresses this by provisioning on-demand cloud runners, including GPU instances, and terminating them when they are no longer needed.

CML Runner Basics

# Launch a runner on AWS
cml runner launch \
  --cloud aws \
  --cloud-region us-west-2 \
  --cloud-type g4dn.xlarge \
  --labels cml-gpu

# Launch a runner on GCP
cml runner launch \
  --cloud gcp \
  --cloud-region us-central1-a \
  --cloud-type n1-standard-4 \
  --cloud-gpu nvidia-tesla-t4 \
  --labels cml-gpu

# Launch a runner on Azure
cml runner launch \
  --cloud azure \
  --cloud-region eastus \
  --cloud-type Standard_NC6 \
  --labels cml-gpu

GitHub Actions with CML Runner

# .github/workflows/train-gpu.yml
name: Train on GPU
on:
  workflow_dispatch:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'params.yaml'

jobs:
  launch-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: iterative/setup-cml@v2

      - name: Launch GPU runner
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
            --cloud aws \
            --cloud-region us-west-2 \
            --cloud-type g4dn.xlarge \
            --cloud-spot \
            --cloud-spot-price 0.50 \
            --idle-timeout 600 \
            --labels cml-gpu \
            --single

  train:
    needs: launch-runner
    runs-on: [self-hosted, cml-gpu]
    timeout-minutes: 360
    steps:
      - uses: actions/checkout@v4

      - name: Setup environment
        run: |
          pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
          pip install "dvc[s3]" pandas scikit-learn

      - name: Verify GPU
        run: nvidia-smi

      - name: Pull data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull

      - name: Train model
        run: python train.py --device cuda

      - name: Push results
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc push
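Note that nvidia-smi proves the driver is up, but not that the CUDA build of PyTorch can actually see the device. A stricter check can be added as an extra workflow step; this is a sketch, assuming the torch install from the setup step above:

```yaml
- name: Verify CUDA visible to PyTorch
  run: |
    python -c "import torch; assert torch.cuda.is_available(), 'CUDA not visible to torch'"
    python -c "import torch; print(torch.cuda.get_device_name(0))"
```

Failing fast here is cheaper than discovering mid-training that the job silently fell back to CPU.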

Spot Instance Configuration

Spot instances can reduce compute costs by up to 90% compared to on-demand pricing, at the risk of interruption:

launch-runner:
  runs-on: ubuntu-latest
  steps:
    - uses: iterative/setup-cml@v2

    - name: Launch spot GPU runner
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        cml runner launch \
          --cloud aws \
          --cloud-region us-west-2 \
          --cloud-type g4dn.xlarge \
          --cloud-spot \
          --cloud-spot-price 0.30 \
          --idle-timeout 300 \
          --labels cml-gpu-spot \
          --reuse-idle

Runner Lifecycle Management

# .github/workflows/managed-training.yml
name: Managed GPU Training
on:
  workflow_dispatch:
    inputs:
      instance_type:
        description: 'AWS instance type'
        default: 'g4dn.xlarge'
        type: choice
        options:
          - g4dn.xlarge
          - g4dn.2xlarge
          - p3.2xlarge

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: iterative/setup-cml@v2

      - name: Launch and train
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          # Launch runner with auto-termination
          cml runner launch \
            --cloud aws \
            --cloud-region us-west-2 \
            --cloud-type ${{ inputs.instance_type }} \
            --cloud-spot \
            --idle-timeout 180 \
            --single &

          # Wait for runner
          sleep 60

          # The runner will execute subsequent jobs
          # and terminate after idle timeout

Cost Optimization Strategies

Strategy             Savings      Trade-off
Spot instances       Up to 90%    May be interrupted
Idle timeout         Variable     Restart delay
Right-sizing         20-50%       Less performance headroom
Regional selection   10-30%       Latency / data transfer
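The savings figures above can be sanity-checked with a little arithmetic. The rates below are illustrative placeholders, not current AWS prices:

```shell
# Compare on-demand vs. spot cost for a training run (hypothetical rates)
ONDEMAND_RATE=0.526   # $/hour, placeholder g4dn.xlarge on-demand rate
SPOT_RATE=0.16        # $/hour, placeholder spot rate
HOURS=10              # length of the training run

awk -v od="$ONDEMAND_RATE" -v sp="$SPOT_RATE" -v h="$HOURS" 'BEGIN {
  printf "on-demand: $%.2f  spot: $%.2f  savings: %.0f%%\n",
         od * h, sp * h, (1 - sp / od) * 100
}'
# → on-demand: $5.26  spot: $1.60  savings: 70%
```

Re-run with your region's actual prices before picking a --cloud-spot-price bid; a bid below the prevailing spot price means the runner never launches.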

GitLab CI Configuration

# .gitlab-ci.yml
stages:
  - launch
  - train

launch-runner:
  stage: launch
  image: iterative/cml:latest
  script:
    - |
      cml runner launch \
        --cloud aws \
        --cloud-region us-west-2 \
        --cloud-type g4dn.xlarge \
        --cloud-spot \
        --idle-timeout 600 \
        --labels gitlab-gpu \
        --single

train-model:
  stage: train
  tags:
    - gitlab-gpu
  script:
    - nvidia-smi
    - pip install -r requirements.txt
    - dvc pull
    - python train.py --device cuda
    - dvc push
  timeout: 6h

Multi-GPU Training Setup

launch-multi-gpu:
  runs-on: ubuntu-latest
  steps:
    - uses: iterative/setup-cml@v2

    - name: Launch multi-GPU runner
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        cml runner launch \
          --cloud aws \
          --cloud-region us-west-2 \
          --cloud-type p3.8xlarge \
          --idle-timeout 600 \
          --labels cml-multi-gpu

train-distributed:
  needs: launch-multi-gpu
  runs-on: [self-hosted, cml-multi-gpu]
  steps:
    - uses: actions/checkout@v4

    - name: Distributed training
      run: |
        NUM_GPUS=$(nvidia-smi -L | wc -l)
        torchrun --nproc_per_node=$NUM_GPUS train.py
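The nvidia-smi -L | wc -l trick above counts one output line per GPU. A slightly more defensive version guards against nvidia-smi being absent; treating "no nvidia-smi" as "no GPUs" is an assumption here, adjust to taste:

```shell
# Count GPUs, tolerating machines where nvidia-smi is not installed
if command -v nvidia-smi >/dev/null 2>&1; then
  NUM_GPUS=$(nvidia-smi -L | wc -l)
else
  NUM_GPUS=0   # assumption: no nvidia-smi means no usable GPUs
fi
echo "detected $NUM_GPUS GPU(s)"
```

With this in place the training step can branch: run torchrun --nproc_per_node="$NUM_GPUS" train.py when NUM_GPUS is greater than 1, and plain python train.py otherwise, so the same workflow works on single-GPU and CPU runners.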

Key Takeaways

CML Option           Purpose
--cloud-spot         Use spot/preemptible instances
--cloud-spot-price   Maximum bid for the spot instance
--idle-timeout       Auto-terminate after the idle period
--single             Terminate after one job
--reuse-idle         Reuse an existing idle runner


Module 5: DVC + CML for ML Automation