DVC + CML for ML Automation
Self-Hosted Runners for Training
Why Self-Hosted Runners?
The free hosted runners on GitHub Actions and GitLab CI have several limitations for ML workloads:
- No GPU support
- Limited memory (7GB)
- Job time limits (6 hours GitHub, 1 hour GitLab)
- Shared resources
CML (Continuous Machine Learning) solves this by provisioning on-demand cloud runners with GPUs.
CML Runner Basics
# Launch a runner on AWS
cml runner launch \
  --cloud aws \
  --cloud-region us-west-2 \
  --cloud-type g4dn.xlarge \
  --labels cml-gpu

# Launch a runner on GCP
cml runner launch \
  --cloud gcp \
  --cloud-region us-central1-a \
  --cloud-type n1-standard-4 \
  --cloud-gpu nvidia-tesla-t4 \
  --labels cml-gpu

# Launch a runner on Azure
cml runner launch \
  --cloud azure \
  --cloud-region eastus \
  --cloud-type Standard_NC6 \
  --labels cml-gpu
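The three invocations above differ only in region, instance type, and (for GCP) an explicit GPU flag. A small helper can make that mapping explicit; this wrapper is purely illustrative and not part of CML itself:

```python
import shlex

# Per-cloud defaults mirroring the examples above (illustrative only)
CLOUD_DEFAULTS = {
    "aws":   {"region": "us-west-2",     "type": "g4dn.xlarge"},
    "gcp":   {"region": "us-central1-a", "type": "n1-standard-4",
              "gpu": "nvidia-tesla-t4"},
    "azure": {"region": "eastus",        "type": "Standard_NC6"},
}

def build_launch_command(cloud: str, labels: str = "cml-gpu") -> list:
    """Return the argv for `cml runner launch` on the given cloud."""
    cfg = CLOUD_DEFAULTS[cloud]
    argv = [
        "cml", "runner", "launch",
        "--cloud", cloud,
        "--cloud-region", cfg["region"],
        "--cloud-type", cfg["type"],
    ]
    if "gpu" in cfg:  # GCP attaches the GPU via a separate flag
        argv += ["--cloud-gpu", cfg["gpu"]]
    argv += ["--labels", labels]
    return argv

print(shlex.join(build_launch_command("gcp")))
```

This also shows why the GCP example is one line longer: on AWS and Azure the GPU comes bundled with the instance type, while on GCP it is attached separately.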
GitHub Actions with CML Runner
# .github/workflows/train-gpu.yml
name: Train on GPU
on:
  workflow_dispatch:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'params.yaml'
jobs:
  launch-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-cml@v2
      - name: Launch GPU runner
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
            --cloud aws \
            --cloud-region us-west-2 \
            --cloud-type g4dn.xlarge \
            --cloud-spot \
            --cloud-spot-price 0.50 \
            --idle-timeout 600 \
            --labels cml-gpu \
            --single
  train:
    needs: launch-runner
    runs-on: [self-hosted, cml-gpu]
    timeout-minutes: 360
    steps:
      - uses: actions/checkout@v4
      - name: Setup environment
        run: |
          pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
          pip install dvc pandas scikit-learn
      - name: Verify GPU
        run: nvidia-smi
      - name: Pull data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull
      - name: Train model
        run: python train.py --device cuda
      - name: Push results
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc push
Spot Instance Configuration
Use spot instances to reduce costs by up to 90%:
launch-runner:
  runs-on: ubuntu-latest
  steps:
    - uses: iterative/setup-cml@v2
    - name: Launch spot GPU runner
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        cml runner launch \
          --cloud aws \
          --cloud-region us-west-2 \
          --cloud-type g4dn.xlarge \
          --cloud-spot \
          --cloud-spot-price 0.30 \
          --idle-timeout 300 \
          --labels cml-gpu-spot \
          --reuse-idle
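To put the "up to 90%" figure in perspective, the arithmetic is straightforward. The hourly rates below are assumptions for illustration, not current AWS quotes; actual spot savings vary with the spot market and rarely sit at the 90% extreme:

```python
# Assumed hourly rates for g4dn.xlarge in us-west-2 (illustrative, not quotes)
on_demand_per_hour = 0.526
spot_per_hour = 0.16          # assumed spot rate, under the 0.30 bid cap above

training_hours = 8
on_demand_cost = on_demand_per_hour * training_hours
spot_cost = spot_per_hour * training_hours
savings = 1 - spot_cost / on_demand_cost  # fraction saved vs. on-demand

print(f"on-demand: ${on_demand_cost:.2f}, spot: ${spot_cost:.2f}, "
      f"savings: {savings:.0%}")
```

The --cloud-spot-price flag caps your bid: if the market price rises above it, the instance is reclaimed, which is the interruption trade-off listed in the table below.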
Runner Lifecycle Management
# .github/workflows/managed-training.yml
name: Managed GPU Training
on:
  workflow_dispatch:
    inputs:
      instance_type:
        description: 'AWS instance type'
        default: 'g4dn.xlarge'
        type: choice
        options:
          - g4dn.xlarge
          - g4dn.2xlarge
          - p3.2xlarge
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-cml@v2
      - name: Launch and train
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          # Launch a runner that terminates itself after one job
          cml runner launch \
            --cloud aws \
            --cloud-region us-west-2 \
            --cloud-type ${{ inputs.instance_type }} \
            --cloud-spot \
            --idle-timeout 180 \
            --single &
          # Crude fixed wait for provisioning to get underway
          sleep 60
          # The cloud runner picks up jobs that target its labels
          # and terminates itself after the idle timeout
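The fixed sleep 60 is a guess; a sturdier approach is to poll GitHub's "List self-hosted runners for a repository" endpoint (GET /repos/{owner}/{repo}/actions/runners) until a runner with the expected label reports online. The helper below operates on the parsed JSON of that response; the HTTP fetch itself is omitted, and the payload shape follows GitHub's REST docs:

```python
def runner_online(response: dict, label: str) -> bool:
    """True if any registered runner carries `label` and reports 'online'."""
    for runner in response.get("runners", []):
        names = {lbl["name"] for lbl in runner.get("labels", [])}
        if label in names and runner.get("status") == "online":
            return True
    return False

# Sample payload shaped like the GitHub REST API response (values invented)
sample = {
    "total_count": 1,
    "runners": [
        {"id": 7, "name": "cml-abc123", "os": "linux", "status": "online",
         "labels": [{"name": "self-hosted"}, {"name": "cml-gpu"}]},
    ],
}
print(runner_online(sample, "cml-gpu"))
```

A polling loop would call the endpoint every few seconds with this check, giving up after a deadline, instead of hoping sixty seconds is enough.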
Cost Optimization Strategies
| Strategy | Savings | Trade-off |
|---|---|---|
| Spot instances | Up to 90% | May be interrupted |
| Idle timeout | Variable | Restart delay |
| Right-sizing | 20-50% | Possible slowdown |
| Regional selection | 10-30% | Data-transfer latency |
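The idle-timeout row is the easiest to quantify: every job leaves the runner billing for up to the full timeout before it terminates. A quick worst-case estimate (the spot rate is an assumption carried over from the examples above):

```python
# Assumed spot rate in USD/hour (illustrative)
rate_per_hour = 0.30

def idle_cost(idle_timeout_s: int, jobs_per_day: int) -> float:
    """Worst-case daily cost of idle time: each job can leave the
    runner billing for the full idle timeout before shutdown."""
    return rate_per_hour * (idle_timeout_s / 3600) * jobs_per_day

# 600 s timeout, 10 jobs/day -> up to 100 paid idle minutes per day
print(f"${idle_cost(600, 10):.2f}/day")
```

A shorter timeout shrinks this number but means more cold starts, which is the "restart delay" trade-off in the table.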
GitLab CI Configuration
# .gitlab-ci.yml
stages:
  - launch
  - train

launch-runner:
  stage: launch
  image: iterative/cml:latest
  script:
    - |
      cml runner launch \
        --cloud aws \
        --cloud-region us-west-2 \
        --cloud-type g4dn.xlarge \
        --cloud-spot \
        --idle-timeout 600 \
        --labels gitlab-gpu \
        --single

train-model:
  stage: train
  tags:
    - gitlab-gpu
  script:
    - nvidia-smi
    - pip install -r requirements.txt
    - dvc pull
    - python train.py --device cuda
    - dvc push
  timeout: 6h
Multi-GPU Training Setup
launch-multi-gpu:
  runs-on: ubuntu-latest
  steps:
    - uses: iterative/setup-cml@v2
    - name: Launch multi-GPU runner
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        cml runner launch \
          --cloud aws \
          --cloud-region us-west-2 \
          --cloud-type p3.8xlarge \
          --idle-timeout 600 \
          --labels cml-multi-gpu
train-distributed:
  needs: launch-multi-gpu
  runs-on: [self-hosted, cml-multi-gpu]
  steps:
    - uses: actions/checkout@v4
    - name: Distributed training
      run: |
        NUM_GPUS=$(nvidia-smi -L | wc -l)
        torchrun --nproc_per_node=$NUM_GPUS train.py
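The NUM_GPUS line works because nvidia-smi -L prints exactly one "GPU &lt;n&gt;: ..." line per device, so counting lines counts GPUs. The same parse can be done defensively in Python; the sample output below is illustrative:

```python
def count_gpus(nvidia_smi_l_output: str) -> int:
    """Count GPUs in `nvidia-smi -L` output: one 'GPU <n>: ...' line each."""
    return sum(
        1 for line in nvidia_smi_l_output.splitlines()
        if line.startswith("GPU ")
    )

# Illustrative output for a p3.8xlarge (4x V100); UUIDs are invented
sample = "\n".join(
    f"GPU {i}: Tesla V100-SXM2-16GB (UUID: GPU-{i:08d})" for i in range(4)
)
print(count_gpus(sample))  # 4
```

Filtering on the "GPU " prefix (rather than counting raw lines) also keeps the count correct if the tool ever emits warnings or blank lines alongside the device list.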
Key Takeaways
| CML Option | Purpose |
|---|---|
| --cloud-spot | Use spot/preemptible instances |
| --cloud-spot-price | Maximum bid (USD/hour) for the spot instance |
| --idle-timeout | Auto-terminate after the given idle period (seconds) |
| --single | Terminate after one job |
| --reuse-idle | Reuse an existing idle runner instead of launching a new one |