GitLab CI/CD & Alternative Platforms
GPU Runners & Large Model Training
GPU Runner Options in GitLab
GitLab offers multiple approaches for GPU-accelerated ML training:
1. GitLab SaaS GPU Runners (Available for Premium/Ultimate)
- Pre-configured NVIDIA GPU instances
- Runner tag: saas-linux-medium-amd64-gpu-standard
- No infrastructure management required
2. Self-Hosted GPU Runners
- Full control over hardware and configuration
- Cost-effective for high utilization
- Supports any GPU hardware
Using GitLab SaaS GPU Runners
For GitLab Premium/Ultimate users, GPU runners are available:
```yaml
# .gitlab-ci.yml
train-with-gpu:
  stage: train
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: nvidia/cuda:12.0-runtime-ubuntu22.04
  before_script:
    - nvidia-smi  # Verify GPU availability
    - apt-get update && apt-get install -y python3-pip python-is-python3  # the CUDA runtime image ships without Python/pip
    - pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
  script:
    - python scripts/train_gpu.py
  artifacts:
    paths:
      - models/
```
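The job above assumes a scripts/train_gpu.py entry point, which is not shown in this section. A minimal hypothetical sketch of such a script, verifying that the GPU is visible to PyTorch and writing its output into the models/ path collected as an artifact:

```python
# Hypothetical sketch of scripts/train_gpu.py: check the GPU, run a tiny
# training loop, and save the model where the job's artifacts expect it.
import os
import torch
import torch.nn as nn

def main():
    assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
    device = torch.device("cuda")
    print(f"Training on {torch.cuda.get_device_name(0)}")

    model = nn.Linear(128, 10).to(device)      # stand-in for a real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(100):                       # stand-in for a real data loader
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    os.makedirs("models", exist_ok=True)
    torch.save(model.state_dict(), "models/model.pt")

if __name__ == "__main__":
    main()
```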
Setting Up Self-Hosted GPU Runners
For self-hosted GPU runners, follow this setup process:
```bash
# On your GPU machine (Ubuntu 22.04 with NVIDIA GPU)

# 1. Install NVIDIA drivers and CUDA
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit

# 2. Install Docker with the NVIDIA runtime
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker

# 3. Install GitLab Runner
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt install gitlab-runner

# 4. Register the runner with GPU tags
sudo gitlab-runner register \
  --url "https://gitlab.com/" \
  --registration-token "YOUR_TOKEN" \
  --executor "docker" \
  --docker-image "nvidia/cuda:12.0-runtime-ubuntu22.04" \
  --docker-gpus all \
  --tag-list "gpu,cuda,ml-training" \
  --description "GPU Runner for ML Training"
```
Configure the runner for GPU access:
```toml
# /etc/gitlab-runner/config.toml
[[runners]]
  name = "GPU Runner for ML Training"
  url = "https://gitlab.com/"
  token = "YOUR_RUNNER_TOKEN"
  executor = "docker"
  [runners.docker]
    image = "nvidia/cuda:12.0-runtime-ubuntu22.04"
    gpus = "all"
    privileged = true
    volumes = ["/cache", "/var/run/docker.sock:/var/run/docker.sock"]
    shm_size = 8589934592  # 8 GB shared memory for the PyTorch DataLoader
```
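The shm_size setting matters because PyTorch DataLoader worker processes pass batches to the training process through shared memory (/dev/shm); with Docker's small default allocation, data loading commonly fails with bus errors or killed workers. A small illustrative snippet (the dataset here is only a stand-in):

```python
# Any DataLoader with num_workers > 0 relies on the container's /dev/shm,
# which is why the runner config raises shm_size.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 3, 224, 224))  # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,     # worker processes hand batches over via shared memory
    pin_memory=True,
)
for (batch,) in loader:
    pass               # training step would go here
```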
Optimizing Large Model Training
For training large models, implement resource-aware pipelines:
```yaml
# .gitlab-ci.yml
variables:
  # Model configuration
  MODEL_SIZE: "large"  # small, medium, large
  BATCH_SIZE: "32"
  GRADIENT_ACCUMULATION: "4"

stages:
  - prepare
  - train
  - evaluate

.gpu-template:
  tags:
    - gpu
    - cuda
  image: nvidia/cuda:12.0-runtime-ubuntu22.04
  before_script:
    - nvidia-smi
    - apt-get update && apt-get install -y python3-pip python-is-python3  # the CUDA runtime image ships without Python/pip
    - pip install -r requirements-gpu.txt

prepare-data:
  stage: prepare
  script:
    - python scripts/prepare_data.py
  artifacts:
    paths:
      - data/processed/
    expire_in: 1 day
train-distributed:
  extends: .gpu-template
  stage: train
  script:
    - |
      # Configure for available GPU memory
      export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
      # Enable mixed precision for memory efficiency
      python scripts/train.py \
        --model-size $MODEL_SIZE \
        --batch-size $BATCH_SIZE \
        --gradient-accumulation $GRADIENT_ACCUMULATION \
        --mixed-precision \
        --checkpoint-dir models/checkpoints/
  artifacts:
    paths:
      - models/
      - logs/
  timeout: 6h
  needs:
    - prepare-data
# Parallel hyperparameter search
hyperparam-search:
  extends: .gpu-template
  stage: train
  script:
    - python scripts/train.py --lr $LEARNING_RATE --batch-size $BATCH_SIZE
  artifacts:
    paths:
      - models/${LEARNING_RATE}_${BATCH_SIZE}/
    reports:
      metrics: metrics.txt
  parallel:
    matrix:
      - LEARNING_RATE: ["0.001", "0.0001", "0.00001"]
        BATCH_SIZE: ["16", "32"]
```
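The train-distributed job passes --mixed-precision and --gradient-accumulation to scripts/train.py, which is not shown here. A hypothetical sketch of the training-loop pattern those flags imply, using PyTorch automatic mixed precision and an optimizer step every N micro-batches:

```python
# Mixed precision (autocast + GradScaler) reduces activation memory, while
# gradient accumulation reaches the effective batch size in smaller chunks.
# This is a sketch of an assumed scripts/train.py internals, not the actual file.
import torch
import torch.nn as nn

def train(model, loader, accumulation_steps=4, lr=1e-4):
    device = torch.device("cuda")
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scaler = torch.cuda.amp.GradScaler()    # keeps fp16 gradients numerically stable
    loss_fn = nn.CrossEntropyLoss()

    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        with torch.cuda.amp.autocast():      # forward pass in mixed precision
            loss = loss_fn(model(x), y) / accumulation_steps
        scaler.scale(loss).backward()        # accumulate scaled gradients

        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)           # optimizer step every N micro-batches
            scaler.update()
            optimizer.zero_grad()
```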
Memory Management for Large Models
Handle out-of-memory scenarios gracefully:
```yaml
train-with-memory-fallback:
  extends: .gpu-template
  stage: train
  script:
    - |
      # Try the full batch first, fall back to smaller batches
      python scripts/train.py --batch-size 64 || \
      python scripts/train.py --batch-size 32 --gradient-accumulation 2 || \
      python scripts/train.py --batch-size 16 --gradient-accumulation 4
  variables:
    CUDA_VISIBLE_DEVICES: "0"
    PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:128"
```
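The shell-level fallback above restarts the script from scratch on each failure. A complementary, hypothetical pattern is to catch the out-of-memory error inside the training script itself and retry with a smaller batch size (run_with_oom_fallback and train_fn are illustrative names, not part of the repository shown):

```python
# In-script fallback: catch CUDA OOM and retry with a smaller batch size
# instead of letting the whole CI job fail.
import torch

def run_with_oom_fallback(train_fn, batch_sizes=(64, 32, 16)):
    """Try train_fn(batch_size=...) with progressively smaller batch sizes."""
    for bs in batch_sizes:
        try:
            return train_fn(batch_size=bs)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # release cached blocks before retrying
            print(f"OOM with batch size {bs}, retrying with a smaller one")
    raise RuntimeError("Out of memory even at the smallest batch size")
```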
Multi-GPU Training Configuration
For distributed training across multiple GPUs:
```yaml
train-multi-gpu:
  tags:
    - multi-gpu
  image: nvidia/cuda:12.0-runtime-ubuntu22.04
  before_script:
    - apt-get update && apt-get install -y python3-pip python-is-python3
    - pip install -r requirements-gpu.txt  # must include torch, which provides torchrun
  script:
    - |
      # Detect available GPUs
      NUM_GPUS=$(nvidia-smi -L | wc -l)
      echo "Training on $NUM_GPUS GPUs"
      # Launch distributed training
      torchrun \
        --nproc_per_node=$NUM_GPUS \
        --master_port=29500 \
        scripts/train_distributed.py \
        --batch-size-per-gpu 32 \
        --epochs 100
  artifacts:
    paths:
      - models/
```
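torchrun starts one process per GPU and exports RANK, LOCAL_RANK, and WORLD_SIZE for each of them. The scripts/train_distributed.py it invokes is not shown in this section; a hypothetical sketch of the DistributedDataParallel setup such a script would need:

```python
# Hypothetical sketch of scripts/train_distributed.py: initialize the process
# group from the environment variables set by torchrun, pin each process to
# its GPU, and wrap the model in DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # reads torchrun's env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])

    # ... build a DistributedSampler-backed DataLoader and run the usual loop;
    # DDP all-reduces gradients across GPUs automatically.

    if dist.get_rank() == 0:                            # only rank 0 writes the artifact
        torch.save(model.module.state_dict(), "models/model.pt")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```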
Cost Optimization Strategies
Implement cost-aware training pipelines:
```yaml
variables:
  USE_SPOT_INSTANCE: "true"

train-cost-optimized:
  extends: .gpu-template
  stage: train
  script:
    - |
      # Save checkpoints frequently for spot-instance resilience
      python scripts/train.py \
        --checkpoint-frequency 100 \
        --resume-from-checkpoint \
        --checkpoint-dir checkpoints/
  cache:
    key: "training-checkpoints-${CI_COMMIT_REF_SLUG}"
    paths:
      - checkpoints/  # cache paths must be relative to the project directory
    policy: pull-push
  retry:
    max: 2  # GitLab CI allows at most 2 retries
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
  interruptible: true  # Allow cancellation for spot instances
```
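The --checkpoint-frequency, --resume-from-checkpoint, and --checkpoint-dir flags refer to logic inside scripts/train.py that is not shown here. A hypothetical sketch of that save/resume pattern, so a reclaimed spot instance loses at most the work since the last checkpoint:

```python
# Hypothetical checkpoint save/resume logic behind the CLI flags above.
import os
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_latest_checkpoint(ckpt_dir, model, optimizer):
    if not os.path.isdir(ckpt_dir):
        return 0
    files = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".pt"))
    if not files:
        return 0                                        # nothing to resume from
    state = torch.load(os.path.join(ckpt_dir, files[-1]), map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1                            # continue after the saved step

def train(model, optimizer, ckpt_dir="checkpoints", frequency=100, total_steps=10_000):
    os.makedirs(ckpt_dir, exist_ok=True)
    step = load_latest_checkpoint(ckpt_dir, model, optimizer)
    while step < total_steps:
        # ... fetch a batch, forward / backward / optimizer.step() ...
        if step % frequency == 0:                       # zero-padded names keep files sorted
            save_checkpoint(os.path.join(ckpt_dir, f"step_{step:08d}.pt"),
                            model, optimizer, step)
        step += 1
```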
Key Takeaways
| Aspect | Recommendation |
|---|---|
| SaaS Runners | Use saas-linux-medium-amd64-gpu-standard tag |
| Self-Hosted | Configure Docker with --gpus all flag |
| Memory | Enable mixed precision and gradient accumulation |
| Large Models | Use checkpointing and distributed training |
| Cost | Implement spot instance resilience with retries |