GitLab CI/CD and Alternative Platforms

GPU Runners and Training Large Models

5 minute read


GPU Runner Options in GitLab

GitLab offers two main approaches to GPU-accelerated ML training:

1. GitLab SaaS GPU Runners (Available for Premium/Ultimate)

  • Pre-configured NVIDIA GPU instances
  • Runner tag: saas-linux-medium-amd64-gpu-standard
  • No infrastructure management required

2. Self-Hosted GPU Runners

  • Full control over hardware and configuration
  • Cost-effective for high utilization
  • Supports any GPU hardware

Using GitLab SaaS GPU Runners

GitLab Premium and Ultimate users can target a hosted GPU runner directly through the job's tags:

# .gitlab-ci.yml
train-with-gpu:
  stage: train
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: nvidia/cuda:12.0-runtime-ubuntu22.04
  before_script:
    - nvidia-smi  # Verify GPU availability
    - apt-get update && apt-get install -y python3-pip  # the CUDA runtime image ships without Python
    - pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
  script:
    - python scripts/train_gpu.py
  artifacts:
    paths:
      - models/
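
The job above assumes scripts/train_gpu.py picks up the GPU when present and writes its output under models/, the path collected as an artifact. A minimal sketch of such a script, assuming a placeholder model and random data (not part of the GitLab configuration):

# scripts/train_gpu.py -- illustrative sketch with a placeholder model and data
import os
import torch
import torch.nn as nn

def main():
    # Fall back to CPU so the same script also runs on non-GPU runners
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Training on {device}")

    model = nn.Linear(128, 10).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):  # placeholder loop with random data
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # The artifacts:paths entry collects models/, so save the weights there
    os.makedirs("models", exist_ok=True)
    torch.save(model.state_dict(), "models/model.pt")

if __name__ == "__main__":
    main()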

Setting Up Self-Hosted GPU Runners

For self-hosted GPU runners, follow this setup process:

# On your GPU machine (Ubuntu 22.04 with NVIDIA GPU)

# 1. Install NVIDIA drivers and CUDA
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit

# 2. Install Docker with NVIDIA runtime
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker

# 3. Install GitLab Runner
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt install gitlab-runner

# 4. Register the runner with GPU tags
sudo gitlab-runner register \
  --url "https://gitlab.com/" \
  --registration-token "YOUR_TOKEN" \
  --executor "docker" \
  --docker-image "nvidia/cuda:12.0-runtime-ubuntu22.04" \
  --docker-gpus all \
  --tag-list "gpu,cuda,ml-training" \
  --description "GPU Runner for ML Training"

Configure the runner for GPU access:

# /etc/gitlab-runner/config.toml
[[runners]]
  name = "GPU Runner for ML Training"
  url = "https://gitlab.com/"
  token = "YOUR_RUNNER_TOKEN"
  executor = "docker"
  [runners.docker]
    image = "nvidia/cuda:12.0-runtime-ubuntu22.04"
    gpus = "all"
    privileged = true
    volumes = ["/cache", "/var/run/docker.sock:/var/run/docker.sock"]
    shm_size = 8589934592  # 8GB shared memory for PyTorch DataLoader
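
The shm_size setting matters because PyTorch DataLoader workers hand batches to the main process through /dev/shm, and Docker's default 64 MB of shared memory is commonly too small, causing multi-worker loading to crash. A small sketch of the kind of loader that depends on it (dataset shape and sizes are illustrative):

# Illustrative: multi-worker data loading relies on the container's /dev/shm,
# which is what shm_size in config.toml enlarges.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1_000, 3, 64, 64),
                        torch.randint(0, 10, (1_000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,    # each worker ships tensors to the main process via shared memory
    pin_memory=True,  # page-locked buffers for faster host-to-GPU copies
    shuffle=True,
)

for images, labels in loader:
    pass  # training step would go here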

Optimizing Large Model Training

For training large models, implement resource-aware pipelines:

# .gitlab-ci.yml
variables:
  # Model configuration
  MODEL_SIZE: "large"  # small, medium, large
  BATCH_SIZE: "32"
  GRADIENT_ACCUMULATION: "4"

stages:
  - prepare
  - train
  - evaluate

.gpu-template:
  tags:
    - gpu
    - cuda
  image: nvidia/cuda:12.0-runtime-ubuntu22.04
  before_script:
    - nvidia-smi
    - apt-get update && apt-get install -y python3-pip  # the CUDA runtime image ships without Python
    - pip install -r requirements-gpu.txt

prepare-data:
  stage: prepare
  script:
    - python scripts/prepare_data.py
  artifacts:
    paths:
      - data/processed/
    expire_in: 1 day

train-distributed:
  extends: .gpu-template
  stage: train
  script:
    - |
      # Configure for available GPU memory
      export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

      # Enable mixed precision for memory efficiency
      python scripts/train.py \
        --model-size $MODEL_SIZE \
        --batch-size $BATCH_SIZE \
        --gradient-accumulation $GRADIENT_ACCUMULATION \
        --mixed-precision \
        --checkpoint-dir models/checkpoints/
  artifacts:
    paths:
      - models/
      - logs/
  timeout: 6h
  needs:
    - prepare-data

# Parallel hyperparameter search
hyperparam-search:
  extends: .gpu-template
  stage: train
  script:
    - python scripts/train.py --lr $LEARNING_RATE --batch-size $BATCH_SIZE
  artifacts:
    paths:
      - models/${LEARNING_RATE}_${BATCH_SIZE}/
    reports:
      metrics: metrics.txt
  parallel:
    matrix:
      - LEARNING_RATE: ["0.001", "0.0001", "0.00001"]
        BATCH_SIZE: ["16", "32"]
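
The --mixed-precision and --gradient-accumulation flags passed to scripts/train.py above are not GitLab features; the training script has to implement them. A minimal sketch of how those two techniques typically look in PyTorch (the function signature, model, and loader are illustrative assumptions, not the actual train.py):

# Illustrative sketch only: mixed precision + gradient accumulation in PyTorch.
import torch

def train_epoch(model, loader, optimizer, accumulation_steps=4, mixed_precision=True):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    # GradScaler keeps float16 gradients from underflowing
    scaler = torch.cuda.amp.GradScaler(enabled=mixed_precision and device.type == "cuda")
    loss_fn = torch.nn.CrossEntropyLoss()

    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        # autocast runs the forward pass in half precision where it is safe,
        # roughly halving activation memory for large models
        with torch.autocast(device_type=device.type, enabled=mixed_precision):
            loss = loss_fn(model(x), y) / accumulation_steps  # scale for accumulation
        scaler.scale(loss).backward()

        # Step only every accumulation_steps micro-batches: the effective
        # batch size becomes BATCH_SIZE * GRADIENT_ACCUMULATION
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()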

Memory Management for Large Models

Handle out-of-memory scenarios gracefully:

train-with-memory-fallback:
  extends: .gpu-template
  stage: train
  script:
    - |
      # Try full batch first, fall back to smaller batches
      python scripts/train.py --batch-size 64 || \
      python scripts/train.py --batch-size 32 --gradient-accumulation 2 || \
      python scripts/train.py --batch-size 16 --gradient-accumulation 4
  variables:
    CUDA_VISIBLE_DEVICES: "0"
    PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:128"
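
The shell fallback above re-runs the whole script on any non-zero exit, not only on out-of-memory errors. A complementary pattern is to catch the OOM inside the training code and shrink the batch while keeping the effective batch size constant. A rough sketch, where run_training stands in for the project's own training entry point (a hypothetical callable, not defined in this lesson):

# Illustrative in-process fallback: halve the batch size on CUDA OOM.
import torch

def train_with_fallback(run_training, batch_size=64, grad_accum=1, min_batch_size=8):
    while batch_size >= min_batch_size:
        try:
            return run_training(batch_size=batch_size, grad_accum=grad_accum)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # release cached blocks before retrying
            batch_size //= 2           # smaller micro-batches...
            grad_accum *= 2            # ...same effective batch size
            print(f"OOM, retrying with batch_size={batch_size}, grad_accum={grad_accum}")
    raise RuntimeError("Model does not fit even at the minimum batch size")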

Multi-GPU Training Configuration

For distributed training across multiple GPUs:

train-multi-gpu:
  extends: .gpu-template  # reuse the CUDA image and dependency install
  tags:
    - multi-gpu
  script:
    - |
      # Detect available GPUs
      NUM_GPUS=$(nvidia-smi -L | wc -l)
      echo "Training on $NUM_GPUS GPUs"

      # Launch distributed training
      torchrun \
        --nproc_per_node=$NUM_GPUS \
        --master_port=29500 \
        scripts/train_distributed.py \
        --batch-size-per-gpu 32 \
        --epochs 100
  artifacts:
    paths:
      - models/
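
torchrun only spawns one process per GPU and sets environment variables such as LOCAL_RANK; the launched script still has to initialize the process group and wrap the model itself. A minimal skeleton of what scripts/train_distributed.py could contain (model, data, and hyperparameters are placeholders):

# Illustrative skeleton of a torchrun-launched DDP training script.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every spawned process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = DDP(torch.nn.Linear(128, 10).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):  # placeholder loop with random data
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()      # DDP all-reduces gradients across GPUs here
        optimizer.step()

    # Only rank 0 writes the artifact so processes do not overwrite each other
    if dist.get_rank() == 0:
        os.makedirs("models", exist_ok=True)
        torch.save(model.module.state_dict(), "models/model.pt")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()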

Cost Optimization Strategies

Implement cost-aware training pipelines:

variables:
  USE_SPOT_INSTANCE: "true"

train-cost-optimized:
  extends: .gpu-template
  stage: train
  script:
    - |
      # Save checkpoints frequently for spot instance resilience
      python scripts/train.py \
        --checkpoint-frequency 100 \
        --resume-from-checkpoint \
        --checkpoint-dir checkpoints/
  cache:
    key: "training-checkpoints-${CI_COMMIT_REF_SLUG}"
    paths:
      - checkpoints/  # cache paths must be relative to the project directory
    policy: pull-push
  retry:
    max: 2  # GitLab caps retry:max at 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
  interruptible: true  # Allow cancellation for spot instances
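
This pattern only works if the training script actually honors --checkpoint-frequency and --resume-from-checkpoint: it must save regularly and look for an existing checkpoint at startup. A rough sketch of that logic (the training_step callable and file names are illustrative assumptions):

# Illustrative checkpoint save/resume loop for spot-instance resilience.
import os
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def train(model, optimizer, data_iter, training_step, total_steps,
          checkpoint_dir="checkpoints", checkpoint_frequency=100):
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "latest.pt")

    # Resume if an interrupted run left a checkpoint in the CI cache
    start_step = 0
    if os.path.exists(path):
        ckpt = torch.load(path, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_step = ckpt["step"]

    for step in range(start_step, total_steps):
        training_step(model, optimizer, next(data_iter))  # project-specific step
        if (step + 1) % checkpoint_frequency == 0:
            save_checkpoint(path, model, optimizer, step + 1)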

Key Takeaways

Aspect       | Recommendation
SaaS Runners | Use the saas-linux-medium-amd64-gpu-standard tag
Self-Hosted  | Configure Docker with the --gpus all flag
Memory       | Enable mixed precision and gradient accumulation
Large Models | Use checkpointing and distributed training
Cost         | Implement spot-instance resilience with retries
