GitHub Actions for ML Workflows
Training Pipelines in Actions
ML training jobs bring requirements that typical CI jobs don't: heavy compute, long runtimes, and expensive resources. Let's build training pipelines that handle these challenges efficiently.
Matrix Builds for Hyperparameter Search
Test multiple configurations in parallel:
name: Hyperparameter Search

on: workflow_dispatch

jobs:
  train-matrix:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        learning_rate: [0.001, 0.01, 0.1]
        batch_size: [32, 64, 128]
      fail-fast: false  # Don't cancel other jobs if one fails
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Train with config
        run: |
          python train.py \
            --learning-rate ${{ matrix.learning_rate }} \
            --batch-size ${{ matrix.batch_size }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: results-lr${{ matrix.learning_rate }}-bs${{ matrix.batch_size }}
          path: results/
This creates 9 parallel jobs (3 learning rates × 3 batch sizes).
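Every matrix job runs the same training script with different arguments. A minimal sketch of what train.py might look like on the other side of that contract — the flag names and the results/ output directory mirror the workflow above, while the metrics themselves are placeholders:

# train.py -- hypothetical sketch; flag names match the workflow above
import argparse
import json
import pathlib

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, required=True)
    parser.add_argument("--batch-size", type=int, required=True)
    args = parser.parse_args()

    # ... train the model here using args.learning_rate / args.batch_size ...
    metrics = {
        "learning_rate": args.learning_rate,
        "batch_size": args.batch_size,
        "val_loss": 0.0,  # placeholder value
    }

    # Write results where the upload-artifact step expects them
    out_dir = pathlib.Path("results")
    out_dir.mkdir(exist_ok=True)
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))

if __name__ == "__main__":
    main()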
Matrix with Include and Exclude
Fine-tune which combinations to run:
strategy:
  matrix:
    model: [xgboost, lightgbm, catboost]
    dataset: [small, medium, large]
    exclude:
      # Skip expensive combinations
      - model: catboost
        dataset: large
    include:
      # Add specific test case
      - model: xgboost
        dataset: large
        special_flag: "--early-stopping 10"
Caching Dependencies
Speed up workflows with caching:
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Cache pip packages
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      # Cache model artifacts from previous runs
      - name: Cache trained models
        uses: actions/cache@v4
        with:
          path: ~/.cache/models
          key: models-${{ hashFiles('data/train.parquet') }}
          restore-keys: |
            models-

      # Cache HuggingFace models
      - name: Cache HuggingFace
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface
          key: hf-${{ hashFiles('requirements.txt') }}

      - run: pip install -r requirements.txt
      - run: python train.py
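The cache step only restores files; the training script still decides whether to reuse them. A hedged sketch of that check — the ~/.cache/models path matches the workflow above, while the hashing scheme and file naming are illustrative assumptions:

# Sketch: skip training when a model for the current data hash is already cached.
import hashlib
import pathlib

CACHE_DIR = pathlib.Path.home() / ".cache" / "models"

def data_hash(path: str = "data/train.parquet") -> str:
    # Hash the training data so the cached model is tied to a specific dataset version
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()[:16]

def main():
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached = CACHE_DIR / f"model-{data_hash()}.pkl"
    if cached.exists():
        print(f"Reusing cached model: {cached}")
        return
    print(f"No cached model found; training and saving to {cached}")
    # ... train and save the model to `cached` here ...

if __name__ == "__main__":
    main()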
Self-Hosted GPU Runners
For GPU training, set up self-hosted runners:
Option 1: Self-Hosted on Your Infrastructure
jobs:
  train-gpu:
    runs-on: [self-hosted, linux, gpu]  # Custom labels
    steps:
      - uses: actions/checkout@v4
      - name: Check GPU availability
        run: nvidia-smi
      - name: Train on GPU
        run: |
          python train.py --device cuda
        env:
          CUDA_VISIBLE_DEVICES: "0"
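Inside train.py, honoring the --device flag usually means falling back to CPU when no GPU is visible, so the same script runs on both hosted and GPU runners. A minimal PyTorch-flavored sketch (assuming torch is already in your requirements; the flag name matches the workflow above):

# Sketch: pick the training device from the --device flag and what is actually available.
import argparse
import torch

def resolve_device(requested: str) -> torch.device:
    if requested == "cuda" and not torch.cuda.is_available():
        print("CUDA requested but not available; falling back to CPU")
        return torch.device("cpu")
    return torch.device(requested)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", default="cpu")
    args = parser.parse_args()
    device = resolve_device(args.device)
    print(f"Training on {device}")  # model.to(device) would follow in real code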
Option 2: Cloud GPU Runners (Third-Party)
Several third-party services provide hosted GPU runners; you target them with the runner label the provider assigns:
jobs:
  train-gpu:
    runs-on: gpu-runner-nvidia-t4  # Example label
    steps:
      - name: Train on cloud GPU
        run: python train.py --device cuda
Setting Up a Self-Hosted Runner
# On your GPU server
mkdir actions-runner && cd actions-runner
curl -o actions-runner.tar.gz -L https://github.com/actions/runner/releases/download/v2.311.0/actions-runner-linux-x64-2.311.0.tar.gz
tar xzf actions-runner.tar.gz
# Configure with your repo
./config.sh --url https://github.com/OWNER/REPO --token YOUR_TOKEN --labels gpu,linux
# Install and start as service
sudo ./svc.sh install
sudo ./svc.sh start
Handling Long-Running Jobs
Training can take hours. Handle timeouts properly:
jobs:
  train:
    runs-on: [self-hosted, gpu]
    timeout-minutes: 360  # 6 hours max
    steps:
      - name: Train with checkpointing
        id: train
        run: |
          python train.py \
            --checkpoint-dir checkpoints/ \
            --save-every 1000
        continue-on-error: true  # Don't fail the job immediately
      - name: Upload checkpoints on any outcome
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: checkpoints
          path: checkpoints/
      - name: Resume from checkpoint
        if: steps.train.outcome == 'failure'  # failure() won't trigger after continue-on-error
        run: |
          python train.py --resume checkpoints/latest.pt
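This workflow assumes train.py can save checkpoints periodically and resume from the latest one. One way that contract might look inside the script — the flag names match the workflow, while the training loop and checkpoint contents are placeholders:

# Sketch of the checkpoint/resume contract assumed by the workflow above.
import argparse
import pathlib
import pickle

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint-dir", default="checkpoints/")
    parser.add_argument("--save-every", type=int, default=1000)
    parser.add_argument("--resume", default=None)
    args = parser.parse_args()

    ckpt_dir = pathlib.Path(args.checkpoint_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)

    start_step = 0
    if args.resume:
        state = pickle.loads(pathlib.Path(args.resume).read_bytes())
        start_step = state["step"]
        print(f"Resuming from step {start_step}")

    for step in range(start_step, 10_000):  # placeholder training loop
        # ... one optimization step ...
        if step % args.save_every == 0:
            state = {"step": step}  # real code would also include model/optimizer state
            (ckpt_dir / "latest.pt").write_bytes(pickle.dumps(state))

if __name__ == "__main__":
    main()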
Parallel Data Processing
Split data processing across jobs:
jobs:
  prepare-splits:
    runs-on: ubuntu-latest
    outputs:
      splits: ${{ steps.split.outputs.splits }}
    steps:
      - name: Calculate splits
        id: split
        run: |
          echo 'splits=["0-1000", "1000-2000", "2000-3000"]' >> $GITHUB_OUTPUT

  process:
    needs: prepare-splits
    runs-on: ubuntu-latest
    strategy:
      matrix:
        split: ${{ fromJson(needs.prepare-splits.outputs.splits) }}
    steps:
      - name: Process data split
        run: |
          python process.py --range "${{ matrix.split }}"

  aggregate:
    needs: process
    runs-on: ubuntu-latest
    steps:
      - name: Combine results
        run: python combine.py
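Each process job receives one range string such as "0-1000". A sketch of how process.py might parse that range and write a shard for the aggregate step — the output layout and file names are illustrative assumptions:

# Sketch: process one row range of the dataset, e.g. `python process.py --range 0-1000`.
import argparse
import pathlib

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--range", required=True, help="start-end row range, e.g. 0-1000")
    args = parser.parse_args()

    start, end = (int(x) for x in args.range.split("-"))
    # ... load rows [start, end) and transform them ...

    out = pathlib.Path("processed") / f"part-{start}-{end}.parquet"
    out.parent.mkdir(exist_ok=True)
    print(f"Would write rows {start}..{end} to {out}")

if __name__ == "__main__":
    main()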
Resource Management
Control resource usage:
jobs:
  train:
    runs-on: ubuntu-latest
    concurrency:
      group: training-${{ github.ref }}
      cancel-in-progress: true  # Cancel previous runs
    steps:
      - name: Train with resource limits
        run: |
          python train.py
        env:
          OMP_NUM_THREADS: 4
          MKL_NUM_THREADS: 4
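OMP_NUM_THREADS and MKL_NUM_THREADS cap the thread pools used by NumPy, BLAS, and other OpenMP-backed libraries. Frameworks that manage their own parallelism can also be told explicitly; a small sketch assuming PyTorch (the fallback value of 4 is an assumption matching the workflow env above):

# Sketch: respect the thread limit set by the workflow's env block.
import os
import torch

num_threads = int(os.environ.get("OMP_NUM_THREADS", "4"))
torch.set_num_threads(num_threads)  # cap intra-op CPU parallelism to the runner's quota
print(f"Using {num_threads} CPU threads")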
Complete Training Pipeline
name: Full Training Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 0'  # Weekly retraining

jobs:
  prepare:
    runs-on: ubuntu-latest
    outputs:
      data_hash: ${{ steps.hash.outputs.hash }}
    steps:
      - uses: actions/checkout@v4
      - id: hash
        run: echo "hash=$(sha256sum data/train.parquet | cut -d' ' -f1)" >> $GITHUB_OUTPUT

  train:
    needs: prepare
    runs-on: [self-hosted, gpu]
    timeout-minutes: 240
    strategy:
      matrix:
        model: [baseline, improved]
    steps:
      - uses: actions/checkout@v4
      - name: Check cache
        uses: actions/cache@v4
        with:
          path: models/
          key: model-${{ matrix.model }}-${{ needs.prepare.outputs.data_hash }}
      - name: Train if not cached
        run: |
          if [ ! -f models/${{ matrix.model }}.pkl ]; then
            python train.py --model ${{ matrix.model }}
          fi
      - uses: actions/upload-artifact@v4
        with:
          name: model-${{ matrix.model }}
          path: models/

  compare:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
      - name: Compare models
        run: python compare_models.py
      - name: Select best model
        run: python select_best.py >> $GITHUB_STEP_SUMMARY
Key Insight: Use matrix builds for experimentation, caching for speed, and self-hosted runners for GPU access. Always checkpoint long-running jobs.
Next, we'll explore model validation workflows.