GitLab CI/CD & Alternative Platforms
GPU Runners & Large Model Training
GPU Runner Options in GitLab
GitLab offers multiple approaches for GPU-accelerated ML training:
1. GitLab SaaS GPU Runners (Available for Premium/Ultimate)
- Pre-configured NVIDIA GPU instances
- Runner tag: saas-linux-medium-amd64-gpu-standard
- No infrastructure management required
2. Self-Hosted GPU Runners
- Full control over hardware and configuration
- Cost-effective for high utilization
- Supports any GPU hardware
Using GitLab SaaS GPU Runners
For GitLab Premium/Ultimate users, GPU runners are available:
```yaml
# .gitlab-ci.yml
train-with-gpu:
  stage: train
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: nvidia/cuda:12.0-runtime-ubuntu22.04
  before_script:
    - nvidia-smi  # Verify GPU availability
    - apt-get update && apt-get install -y python3-pip python-is-python3  # the CUDA runtime image ships without Python/pip
    - pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
  script:
    - python scripts/train_gpu.py
  artifacts:
    paths:
      - models/
```
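The job above assumes a scripts/train_gpu.py entry point, which is not shown in this section. A minimal hypothetical sketch of such a script, verifying that the GPU is visible to PyTorch and writing its output into the models/ path collected as an artifact:

```python
# Hypothetical sketch of scripts/train_gpu.py: check the GPU, run a tiny
# training loop, and save the model where the job's artifacts expect it.
import os
import torch
import torch.nn as nn

def main():
    assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
    device = torch.device("cuda")
    print(f"Training on {torch.cuda.get_device_name(0)}")

    model = nn.Linear(128, 10).to(device)      # stand-in for a real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(100):                       # stand-in for a real data loader
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    os.makedirs("models", exist_ok=True)
    torch.save(model.state_dict(), "models/model.pt")

if __name__ == "__main__":
    main()
```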
Setting Up Self-Hosted GPU Runners
For self-hosted GPU runners, follow this setup process:
```bash
# On your GPU machine (Ubuntu 22.04 with NVIDIA GPU)

# 1. Install NVIDIA drivers and CUDA
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit

# 2. Install Docker with the NVIDIA runtime
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker

# 3. Install GitLab Runner
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt install gitlab-runner

# 4. Register the runner with GPU tags
sudo gitlab-runner register \
  --url "https://gitlab.com/" \
  --registration-token "YOUR_TOKEN" \
  --executor "docker" \
  --docker-image "nvidia/cuda:12.0-runtime-ubuntu22.04" \
  --docker-gpus all \
  --tag-list "gpu,cuda,ml-training" \
  --description "GPU Runner for ML Training"
```
Configure the runner for GPU access:
```toml
# /etc/gitlab-runner/config.toml
[[runners]]
  name = "GPU Runner for ML Training"
  url = "https://gitlab.com/"
  token = "YOUR_RUNNER_TOKEN"
  executor = "docker"
  [runners.docker]
    image = "nvidia/cuda:12.0-runtime-ubuntu22.04"
    gpus = "all"
    privileged = true
    volumes = ["/cache", "/var/run/docker.sock:/var/run/docker.sock"]
    shm_size = 8589934592  # 8 GB shared memory for the PyTorch DataLoader
```
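The shm_size setting matters because PyTorch DataLoader worker processes pass batches to the training process through shared memory (/dev/shm); with Docker's small default allocation, data loading commonly fails with bus errors or killed workers. A small illustrative snippet (the dataset here is only a stand-in):

```python
# Any DataLoader with num_workers > 0 relies on the container's /dev/shm,
# which is why the runner config raises shm_size.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 3, 224, 224))  # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,     # worker processes hand batches over via shared memory
    pin_memory=True,
)
for (batch,) in loader:
    pass               # training step would go here
```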
Optimizing Large Model Training
For training large models, implement resource-aware pipelines:
```yaml
# .gitlab-ci.yml
variables:
  # Model configuration
  MODEL_SIZE: "large"  # small, medium, large
  BATCH_SIZE: "32"
  GRADIENT_ACCUMULATION: "4"

stages:
  - prepare
  - train
  - evaluate

.gpu-template:
  tags:
    - gpu
    - cuda
  image: nvidia/cuda:12.0-runtime-ubuntu22.04
  before_script:
    - nvidia-smi
    - apt-get update && apt-get install -y python3-pip python-is-python3  # the CUDA runtime image ships without Python/pip
    - pip install -r requirements-gpu.txt

prepare-data:
  stage: prepare
  script:
    - python scripts/prepare_data.py
  artifacts:
    paths:
      - data/processed/
    expire_in: 1 day
train-distributed:
  extends: .gpu-template
  stage: train
  script:
    - |
      # Configure for available GPU memory
      export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
      # Enable mixed precision for memory efficiency
      python scripts/train.py \
        --model-size $MODEL_SIZE \
        --batch-size $BATCH_SIZE \
        --gradient-accumulation $GRADIENT_ACCUMULATION \
        --mixed-precision \
        --checkpoint-dir models/checkpoints/
  artifacts:
    paths:
      - models/
      - logs/
  timeout: 6h
  needs:
    - prepare-data
# Parallel hyperparameter search
hyperparam-search:
  extends: .gpu-template
  stage: train
  script:
    - python scripts/train.py --lr $LEARNING_RATE --batch-size $BATCH_SIZE
  artifacts:
    paths:
      - models/${LEARNING_RATE}_${BATCH_SIZE}/
    reports:
      metrics: metrics.txt
  parallel:
    matrix:
      - LEARNING_RATE: ["0.001", "0.0001", "0.00001"]
        BATCH_SIZE: ["16", "32"]
```
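The train-distributed job passes --mixed-precision and --gradient-accumulation to scripts/train.py, which is not shown here. A hypothetical sketch of the training-loop pattern those flags imply, using PyTorch automatic mixed precision and an optimizer step every N micro-batches:

```python
# Mixed precision (autocast + GradScaler) reduces activation memory, while
# gradient accumulation reaches the effective batch size in smaller chunks.
# This is a sketch of an assumed scripts/train.py internals, not the actual file.
import torch
import torch.nn as nn

def train(model, loader, accumulation_steps=4, lr=1e-4):
    device = torch.device("cuda")
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scaler = torch.cuda.amp.GradScaler()    # keeps fp16 gradients numerically stable
    loss_fn = nn.CrossEntropyLoss()

    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        with torch.cuda.amp.autocast():      # forward pass in mixed precision
            loss = loss_fn(model(x), y) / accumulation_steps
        scaler.scale(loss).backward()        # accumulate scaled gradients

        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)           # optimizer step every N micro-batches
            scaler.update()
            optimizer.zero_grad()
```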
Memory Management for Large Models
Handle out-of-memory scenarios gracefully:
```yaml
train-with-memory-fallback:
  extends: .gpu-template
  stage: train
  script:
    - |
      # Try the full batch first, fall back to smaller batches
      python scripts/train.py --batch-size 64 || \
      python scripts/train.py --batch-size 32 --gradient-accumulation 2 || \
      python scripts/train.py --batch-size 16 --gradient-accumulation 4
  variables:
    CUDA_VISIBLE_DEVICES: "0"
    PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:128"
```
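The shell-level fallback above restarts the script from scratch on each failure. A complementary, hypothetical pattern is to catch the out-of-memory error inside the training script itself and retry with a smaller batch size (run_with_oom_fallback and train_fn are illustrative names, not part of the repository shown):

```python
# In-script fallback: catch CUDA OOM and retry with a smaller batch size
# instead of letting the whole CI job fail.
import torch

def run_with_oom_fallback(train_fn, batch_sizes=(64, 32, 16)):
    """Try train_fn(batch_size=...) with progressively smaller batch sizes."""
    for bs in batch_sizes:
        try:
            return train_fn(batch_size=bs)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # release cached blocks before retrying
            print(f"OOM with batch size {bs}, retrying with a smaller one")
    raise RuntimeError("Out of memory even at the smallest batch size")
```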
Multi-GPU Training Configuration
For distributed training across multiple GPUs:
```yaml
train-multi-gpu:
  tags:
    - multi-gpu
  image: nvidia/cuda:12.0-runtime-ubuntu22.04
  before_script:
    - apt-get update && apt-get install -y python3-pip python-is-python3
    - pip install -r requirements-gpu.txt  # must include torch, which provides torchrun
  script:
    - |
      # Detect available GPUs
      NUM_GPUS=$(nvidia-smi -L | wc -l)
      echo "Training on $NUM_GPUS GPUs"
      # Launch distributed training
      torchrun \
        --nproc_per_node=$NUM_GPUS \
        --master_port=29500 \
        scripts/train_distributed.py \
        --batch-size-per-gpu 32 \
        --epochs 100
  artifacts:
    paths:
      - models/
```
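torchrun starts one process per GPU and exports RANK, LOCAL_RANK, and WORLD_SIZE for each of them. The scripts/train_distributed.py it invokes is not shown in this section; a hypothetical sketch of the DistributedDataParallel setup such a script would need:

```python
# Hypothetical sketch of scripts/train_distributed.py: initialize the process
# group from the environment variables set by torchrun, pin each process to
# its GPU, and wrap the model in DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # reads torchrun's env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])

    # ... build a DistributedSampler-backed DataLoader and run the usual loop;
    # DDP all-reduces gradients across GPUs automatically.

    if dist.get_rank() == 0:                            # only rank 0 writes the artifact
        torch.save(model.module.state_dict(), "models/model.pt")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```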
Cost Optimization Strategies
Implement cost-aware training pipelines:
```yaml
variables:
  USE_SPOT_INSTANCE: "true"

train-cost-optimized:
  extends: .gpu-template
  stage: train
  script:
    - |
      # Save checkpoints frequently for spot-instance resilience
      python scripts/train.py \
        --checkpoint-frequency 100 \
        --resume-from-checkpoint \
        --checkpoint-dir checkpoints/
  cache:
    key: "training-checkpoints-${CI_COMMIT_REF_SLUG}"
    paths:
      - checkpoints/  # cache paths must be relative to the project directory
    policy: pull-push
  retry:
    max: 2  # GitLab CI allows at most 2 retries
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
  interruptible: true  # Allow cancellation for spot instances
```
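The --checkpoint-frequency, --resume-from-checkpoint, and --checkpoint-dir flags refer to logic inside scripts/train.py that is not shown here. A hypothetical sketch of that save/resume pattern, so a reclaimed spot instance loses at most the work since the last checkpoint:

```python
# Hypothetical checkpoint save/resume logic behind the CLI flags above.
import os
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_latest_checkpoint(ckpt_dir, model, optimizer):
    if not os.path.isdir(ckpt_dir):
        return 0
    files = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".pt"))
    if not files:
        return 0                                        # nothing to resume from
    state = torch.load(os.path.join(ckpt_dir, files[-1]), map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1                            # continue after the saved step

def train(model, optimizer, ckpt_dir="checkpoints", frequency=100, total_steps=10_000):
    os.makedirs(ckpt_dir, exist_ok=True)
    step = load_latest_checkpoint(ckpt_dir, model, optimizer)
    while step < total_steps:
        # ... fetch a batch, forward / backward / optimizer.step() ...
        if step % frequency == 0:                       # zero-padded names keep files sorted
            save_checkpoint(os.path.join(ckpt_dir, f"step_{step:08d}.pt"),
                            model, optimizer, step)
        step += 1
```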
Key Takeaways
| Aspect | Recommendation |
|---|---|
| SaaS Runners | Use saas-linux-medium-amd64-gpu-standard tag |
| Self-Hosted | Configure Docker with --gpus all flag |
| Memory | Enable mixed precision and gradient accumulation |
| Large Models | Use checkpointing and distributed training |
| Cost | Implement spot instance resilience with retries |