GitHub Actions for ML Workflows

Training Pipelines in Actions

ML training jobs have requirements that typical CI jobs don't: heavy compute, long runtimes, and expensive resources. Let's build training pipelines that handle these constraints efficiently.

Matrix builds let you test multiple configurations in parallel:

name: Hyperparameter Search

on: workflow_dispatch

jobs:
  train-matrix:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        learning_rate: [0.001, 0.01, 0.1]
        batch_size: [32, 64, 128]
      fail-fast: false  # Don't cancel other jobs if one fails

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Train with config
        run: |
          python train.py \
            --learning-rate ${{ matrix.learning_rate }} \
            --batch-size ${{ matrix.batch_size }}

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: results-lr${{ matrix.learning_rate }}-bs${{ matrix.batch_size }}
          path: results/

This creates 9 parallel jobs (3 learning rates × 3 batch sizes).
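
If your runner capacity can't absorb all nine jobs at once, you can cap concurrency with max-parallel. A minimal sketch of the same strategy block:

strategy:
  matrix:
    learning_rate: [0.001, 0.01, 0.1]
    batch_size: [32, 64, 128]
  fail-fast: false
  max-parallel: 3  # Run at most 3 of the 9 jobs at a time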

Matrix with Include and Exclude

Fine-tune which combinations to run:

strategy:
  matrix:
    model: [xgboost, lightgbm, catboost]
    dataset: [small, medium, large]
    exclude:
      # Skip expensive combinations
      - model: catboost
        dataset: large
    include:
      # Add specific test case
      - model: xgboost
        dataset: large
        special_flag: "--early-stopping 10"
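
Keys added through include behave like any other matrix value inside steps. A sketch of a step that forwards them to the training script, assuming train.py accepts --model and --dataset flags (matrix.special_flag simply expands to an empty string for combinations that don't define it):

steps:
  - name: Train combination
    run: |
      python train.py \
        --model ${{ matrix.model }} \
        --dataset ${{ matrix.dataset }} \
        ${{ matrix.special_flag }}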

Caching Dependencies

Speed up workflows with caching:

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Cache pip packages
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      # Cache model artifacts from previous runs
      - name: Cache trained models
        uses: actions/cache@v4
        with:
          path: ~/.cache/models
          key: models-${{ hashFiles('data/train.parquet') }}
          restore-keys: |
            models-

      # Cache HuggingFace models
      - name: Cache HuggingFace
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface
          key: hf-${{ hashFiles('requirements.txt') }}

      - run: pip install -r requirements.txt
      - run: python train.py
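
actions/cache also exposes a cache-hit output, which is 'true' only when the primary key matched exactly, so expensive steps can be skipped entirely on an exact hit. A sketch, giving the cache step an id:

steps:
  - name: Cache trained models
    id: model-cache
    uses: actions/cache@v4
    with:
      path: ~/.cache/models
      key: models-${{ hashFiles('data/train.parquet') }}

  - name: Train only on cache miss
    if: steps.model-cache.outputs.cache-hit != 'true'
    run: python train.py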

Self-Hosted GPU Runners

For GPU training, set up self-hosted runners:

Option 1: Self-Hosted on Your Infrastructure

jobs:
  train-gpu:
    runs-on: [self-hosted, linux, gpu]  # Custom labels
    steps:
      - uses: actions/checkout@v4

      - name: Check GPU availability
        run: nvidia-smi

      - name: Train on GPU
        run: |
          python train.py --device cuda
        env:
          CUDA_VISIBLE_DEVICES: "0"
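
If Docker and the NVIDIA Container Toolkit are installed on the runner host, the job can also run inside a CUDA container so the toolchain stays pinned. A sketch, assuming that setup (the image tag is illustrative):

jobs:
  train-gpu:
    runs-on: [self-hosted, linux, gpu]
    container:
      image: nvidia/cuda:12.1.1-runtime-ubuntu22.04  # Illustrative CUDA base image
      options: --gpus all  # Forwarded to docker run so the container can see the GPU
    steps:
      - uses: actions/checkout@v4
      - run: nvidia-smi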

Option 2: Cloud GPU Runners (Third-Party)

Several services provide GPU runners:

jobs:
  train-gpu:
    runs-on: gpu-runner-nvidia-t4  # Example label
    steps:
      - name: Train on cloud GPU
        run: python train.py --device cuda

Setting Up a Self-Hosted Runner

# On your GPU server
mkdir actions-runner && cd actions-runner
curl -o actions-runner.tar.gz -L https://github.com/actions/runner/releases/download/v2.311.0/actions-runner-linux-x64-2.311.0.tar.gz
tar xzf actions-runner.tar.gz

# Configure with your repo
./config.sh --url https://github.com/OWNER/REPO --token YOUR_TOKEN --labels gpu,linux

# Install and start as service
sudo ./svc.sh install
sudo ./svc.sh start

Handling Long-Running Jobs

Training can take hours. Handle timeouts properly:

jobs:
  train:
    runs-on: [self-hosted, gpu]
    timeout-minutes: 360  # 6 hours max

    steps:
      - name: Train with checkpointing
        id: train
        run: |
          python train.py \
            --checkpoint-dir checkpoints/ \
            --save-every 1000
        continue-on-error: true  # Don't fail the job immediately

      - name: Upload checkpoints on any outcome
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: checkpoints
          path: checkpoints/

      - name: Resume from checkpoint
        if: steps.train.outcome == 'failure'  # failure() is never true here because continue-on-error masks the failed step
        run: |
          python train.py --resume checkpoints/latest.pt
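
Artifacts cover downloads within a run; to resume across workflow runs you can also stash checkpoints in the cache, using the separate restore and save actions so the save step runs even when training fails. A sketch:

steps:
  - name: Restore checkpoints from a previous run
    uses: actions/cache/restore@v4
    with:
      path: checkpoints/
      key: checkpoints-${{ github.run_id }}
      restore-keys: |
        checkpoints-

  - name: Train, resuming if a checkpoint exists
    run: |
      if [ -f checkpoints/latest.pt ]; then
        python train.py --resume checkpoints/latest.pt
      else
        python train.py --checkpoint-dir checkpoints/ --save-every 1000
      fi

  - name: Save checkpoints even if training failed
    if: always()
    uses: actions/cache/save@v4
    with:
      path: checkpoints/
      key: checkpoints-${{ github.run_id }}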

Parallel Data Processing

Split data processing across jobs:

jobs:
  prepare-splits:
    runs-on: ubuntu-latest
    outputs:
      splits: ${{ steps.split.outputs.splits }}
    steps:
      - name: Calculate splits
        id: split
        run: |
          echo 'splits=["0-1000", "1000-2000", "2000-3000"]' >> $GITHUB_OUTPUT

  process:
    needs: prepare-splits
    runs-on: ubuntu-latest
    strategy:
      matrix:
        split: ${{ fromJson(needs.prepare-splits.outputs.splits) }}

    steps:
      - name: Process data split
        run: |
          python process.py --range "${{ matrix.split }}"

  aggregate:
    needs: process
    runs-on: ubuntu-latest
    steps:
      - name: Combine results
        run: python combine.py
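
Each job runs on a fresh machine, so the process jobs need to hand their partial outputs to aggregate explicitly, typically as artifacts. A sketch of that hand-off, assuming process.py and combine.py take --output and --input flags:

  process:
    needs: prepare-splits
    runs-on: ubuntu-latest
    strategy:
      matrix:
        split: ${{ fromJson(needs.prepare-splits.outputs.splits) }}
    steps:
      - name: Process data split
        run: python process.py --range "${{ matrix.split }}" --output out/
      - uses: actions/upload-artifact@v4
        with:
          name: partial-${{ matrix.split }}
          path: out/

  aggregate:
    needs: process
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: partial-*
          merge-multiple: true  # Merge all partial results into one directory
          path: out/
      - name: Combine results
        run: python combine.py --input out/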

Resource Management

Control resource usage:

jobs:
  train:
    runs-on: ubuntu-latest
    concurrency:
      group: training-${{ github.ref }}
      cancel-in-progress: true  # Cancel previous runs

    steps:
      - name: Train with resource limits
        run: |
          python train.py
        env:
          OMP_NUM_THREADS: 4
          MKL_NUM_THREADS: 4
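
cancel-in-progress: true suits experimentation where only the newest commit matters. For scheduled retraining you may instead want new runs to queue behind the one already in flight:

concurrency:
  group: training-${{ github.ref }}
  cancel-in-progress: false  # Queue the newest run instead of cancelling the one in progress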

Complete Training Pipeline

name: Full Training Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 0'  # Weekly retraining

jobs:
  prepare:
    runs-on: ubuntu-latest
    outputs:
      data_hash: ${{ steps.hash.outputs.hash }}
    steps:
      - uses: actions/checkout@v4
      - id: hash
        run: echo "hash=$(sha256sum data/train.parquet | cut -d' ' -f1)" >> $GITHUB_OUTPUT

  train:
    needs: prepare
    runs-on: [self-hosted, gpu]
    timeout-minutes: 240
    strategy:
      matrix:
        model: [baseline, improved]

    steps:
      - uses: actions/checkout@v4

      - name: Check cache
        uses: actions/cache@v4
        with:
          path: models/
          key: model-${{ matrix.model }}-${{ needs.prepare.outputs.data_hash }}

      - name: Train if not cached
        run: |
          if [ ! -f models/${{ matrix.model }}.pkl ]; then
            python train.py --model ${{ matrix.model }}
          fi

      - uses: actions/upload-artifact@v4
        with:
          name: model-${{ matrix.model }}
          path: models/

  compare:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4

      - name: Compare models
        run: python compare_models.py

      - name: Select best model
        run: python select_best.py >> $GITHUB_STEP_SUMMARY

Key Insight: Use matrix builds for experimentation, caching for speed, and self-hosted runners for GPU access. Always checkpoint long-running jobs.

Next, we'll explore model validation workflows.
