CI/CD for ML Systems
GitHub Actions for ML Pipelines
GitHub Actions is one of the most widely used CI/CD tools in ML teams. Interviewers expect you to be able to design workflows for training, testing, and deployment.
Interview Question: Design ML CI/CD
Question: "Design a GitHub Actions workflow for an ML project that includes data validation, model training, and deployment."
Complete Workflow:
```yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'data/**'
      - 'models/**'
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      force_retrain:
        description: 'Force model retraining'
        type: boolean
        default: false

env:
  PYTHON_VERSION: '3.11'
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'pip'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Lint code
        run: |
          ruff check src/
          mypy src/
      - name: Run unit tests
        run: pytest tests/unit/ -v --cov=src/

  data-validation:
    runs-on: ubuntu-latest
    needs: lint-and-test
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'pip'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Validate training data
        run: python src/validate_data.py --config configs/data_expectations.yaml
      - name: Check for data drift
        run: python src/check_drift.py --baseline data/baseline_stats.json

  train-model:
    runs-on: ubuntu-latest
    needs: data-validation
    # Parentheses make the precedence explicit: retrain on a forced manual run,
    # or on any push to main.
    if: |
      (github.event_name == 'workflow_dispatch' && github.event.inputs.force_retrain == 'true') ||
      github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Train model
        # train.py is expected to append "MLFLOW_RUN_ID=<run id>" to $GITHUB_ENV
        # so the registration step below can read env.MLFLOW_RUN_ID.
        run: |
          python src/train.py \
            --config configs/training_config.yaml \
            --experiment-name "fraud-detection-${{ github.sha }}"
      - name: Run model tests
        run: pytest tests/model/ -v
      - name: Register model to MLflow
        if: success()
        run: |
          python src/register_model.py \
            --run-id ${{ env.MLFLOW_RUN_ID }} \
            --model-name fraud-detector

  deploy-staging:
    runs-on: ubuntu-latest
    needs: train-model
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        # Assumes the runner's kubeconfig already defines the staging-cluster context.
        run: |
          kubectl apply -f k8s/staging/ --context staging-cluster
          kubectl rollout status deployment/model-serving -n ml-staging
      - name: Run integration tests
        # --env is a custom pytest option (defined in conftest.py), not a built-in flag.
        run: pytest tests/integration/ --env staging
      - name: Run smoke tests
        run: python tests/smoke_test.py --endpoint ${{ vars.STAGING_ENDPOINT }}

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy canary (10%)
        run: |
          kubectl apply -f k8s/production/canary.yaml
          sleep 300  # Wait 5 minutes for canary metrics to accumulate
      - name: Check canary metrics
        run: python src/check_canary_metrics.py --threshold 0.01
      - name: Full rollout
        if: success()
        run: |
          kubectl apply -f k8s/production/
          kubectl rollout status deployment/model-serving -n ml-production
```
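The canary gate hinges on the metrics-check script failing the step when the canary misbehaves. A minimal sketch of what the core of `src/check_canary_metrics.py` might look like — the `fetch_error_rates` stub and the absolute-difference comparison are assumptions for illustration, not part of the workflow above:

```python
"""Sketch of a canary gate: compare canary vs. baseline error rates."""


def fetch_error_rates() -> dict[str, float]:
    # Placeholder: in production, query your metrics backend here
    # (Prometheus, CloudWatch, Datadog, ...). Values are illustrative.
    return {"baseline": 0.004, "canary": 0.005}


def canary_healthy(rates: dict[str, float], threshold: float = 0.01) -> bool:
    """Pass the gate unless the canary's error rate exceeds the baseline's
    by more than `threshold` (an absolute gap, e.g. 0.01 = 1 percentage point)."""
    return rates["canary"] - rates["baseline"] <= threshold


if canary_healthy(fetch_error_rates()):
    print("canary OK: proceed to full rollout")
else:
    # In CI, exit non-zero here so the step fails and the rollout stops.
    print("canary degraded: abort rollout")
```

The key interview point is that the script's exit code is the gate: a non-zero exit fails the step, which skips the `Full rollout` step because of `if: success()`.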
Key Workflow Patterns
Pattern 1: Conditional Training
```yaml
# Only train when training code or data changes
on:
  push:
    paths:
      - 'src/train.py'
      - 'src/model/**'
      - 'data/training/**'
      - 'configs/training_config.yaml'
```
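The same filter logic can be mirrored in a small script, e.g. to decide locally whether a change set warrants retraining. This is an illustrative sketch, not GitHub's own matcher — note that `fnmatch`'s `*` also crosses `/`, unlike the Actions glob, where nested paths need `**`:

```python
import fnmatch

# Mirrors the paths: filter above; patterns are illustrative.
TRAIN_TRIGGERS = [
    "src/train.py",
    "src/model/*",          # fnmatch '*' matches across '/', unlike Actions '*'
    "data/training/*",
    "configs/training_config.yaml",
]


def needs_retrain(changed_files: list[str]) -> bool:
    """True if any changed file matches a training-relevant pattern."""
    return any(
        fnmatch.fnmatch(path, pattern)
        for path in changed_files
        for pattern in TRAIN_TRIGGERS
    )


print(needs_retrain(["README.md"]))             # False
print(needs_retrain(["src/model/encoder.py"]))  # True
```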
Pattern 2: GPU Training with Self-Hosted Runners
```yaml
train-gpu:
  runs-on: [self-hosted, gpu, linux]
  steps:
    - name: Train with GPU
      run: |
        nvidia-smi  # Verify a GPU is available before training
        python src/train.py --device cuda
```
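On a mixed runner fleet it is worth guarding against a job landing on a box without a working GPU. A framework-agnostic sketch of that guard (in practice you would usually just check `torch.cuda.is_available()`; this version only probes for `nvidia-smi`):

```python
import shutil
import subprocess


def pick_device() -> str:
    """Return 'cuda' if nvidia-smi exists and reports a healthy GPU, else 'cpu'."""
    if shutil.which("nvidia-smi") is None:
        return "cpu"  # driver tooling not installed on this runner
    result = subprocess.run(["nvidia-smi"], capture_output=True)
    # nvidia-smi exits non-zero when no GPU is usable
    return "cuda" if result.returncode == 0 else "cpu"


print(f"training on {pick_device()}")
```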
Pattern 3: Caching ML Dependencies
```yaml
- name: Cache pip packages
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}

- name: Cache model artifacts
  uses: actions/cache@v4
  with:
    path: models/
    key: model-${{ hashFiles('data/training/**') }}
```
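`hashFiles('data/training/**')` makes the model cache key a function of the training data's content: same data, same key, cache hit. The idea in script form — a sketch of a deterministic content hash, not GitHub's exact algorithm (which combines per-file SHA-256 digests):

```python
import hashlib
from pathlib import Path


def content_key(root: str) -> str:
    """Hash every file under `root` (sorted for determinism) into one cache key."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Include the relative path so renames also change the key.
            digest.update(str(path.relative_to(root)).encode())
            digest.update(path.read_bytes())
    return f"model-{digest.hexdigest()[:16]}"
```

Any edit to any file under the directory produces a different key, which invalidates the cached model and forces a retrain.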
Interview Discussion Points
| Topic | What to Say |
|---|---|
| GPU runners | "We use self-hosted runners with GPUs for training, GitHub-hosted for tests" |
| Secrets | "Model registry credentials in GitHub Secrets, rotated quarterly" |
| Artifacts | "Models stored in S3, only pointers in artifacts" |
| Environments | "GitHub Environments for staging/prod with required reviewers" |
| Caching | "Cache pip packages and model checkpoints to speed up workflows" |
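The "only pointers in artifacts" row is worth being able to sketch: rather than uploading multi-gigabyte weights as a workflow artifact, upload a small manifest recording where the real artifact lives. The field names and S3 URI here are illustrative, not a standard format:

```python
import hashlib
import json


def model_pointer(s3_uri: str, model_bytes: bytes, run_id: str) -> str:
    """Build a small JSON manifest that points at the real model in S3."""
    return json.dumps(
        {
            "s3_uri": s3_uri,  # where the weights actually live
            "sha256": hashlib.sha256(model_bytes).hexdigest(),  # integrity check
            "mlflow_run_id": run_id,  # provenance back to the training run
        },
        indent=2,
    )


print(model_pointer("s3://ml-models/fraud-detector/v42/model.pt",
                    b"...weights...", "abc123"))
```

Downstream jobs download the few-hundred-byte manifest, verify the checksum after fetching from S3, and trace the model back to its training run.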
Pro Tip: Mention that you separate training from deployment: "Training is expensive and shouldn't block every PR, so we make it conditional or scheduled."
Next, we'll cover deployment strategies for ML models.