CI/CD for ML Systems

GitHub Actions for ML Pipelines

GitHub Actions is the most widely used CI/CD platform for ML projects. Interviewers expect you to be able to design workflows that cover training, testing, and deployment.

Interview Question: Design ML CI/CD

Question: "Design a GitHub Actions workflow for an ML project that includes data validation, model training, and deployment."

Complete Workflow:

# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'data/**'
      - 'models/**'
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      force_retrain:
        description: 'Force model retraining'
        type: boolean
        default: false

env:
  PYTHON_VERSION: '3.11'
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt

      - name: Lint code
        run: |
          ruff check src/
          mypy src/

      - name: Run unit tests
        run: pytest tests/unit/ -v --cov=src/

  data-validation:
    runs-on: ubuntu-latest
    needs: lint-and-test
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'pip'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Validate training data
        run: python src/validate_data.py --config configs/data_expectations.yaml

      - name: Check for data drift
        run: python src/check_drift.py --baseline data/baseline_stats.json

  train-model:
    runs-on: ubuntu-latest
    needs: data-validation
    if: |
      (github.event_name == 'workflow_dispatch' && github.event.inputs.force_retrain == 'true') ||
      github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Train model
        run: |
          python src/train.py \
            --config configs/training_config.yaml \
            --experiment-name "fraud-detection-${{ github.sha }}"
          # train.py is assumed to append "MLFLOW_RUN_ID=<run-id>" to
          # $GITHUB_ENV so the register step below can read env.MLFLOW_RUN_ID

      - name: Run model tests
        run: pytest tests/model/ -v

      - name: Register model to MLflow
        if: success()
        run: |
          python src/register_model.py \
            --run-id ${{ env.MLFLOW_RUN_ID }} \
            --model-name fraud-detector

  deploy-staging:
    runs-on: ubuntu-latest
    needs: train-model
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to staging
        run: |
          # assumes cluster credentials were configured in an earlier step
          # and 'staging-cluster' is a known kubeconfig context
          kubectl apply -f k8s/staging/ --context staging-cluster
          kubectl rollout status deployment/model-serving -n ml-staging --context staging-cluster

      - name: Run integration tests
        run: pytest tests/integration/ --env staging  # --env assumes a custom pytest option defined in conftest.py

      - name: Run smoke tests
        run: python tests/smoke_test.py --endpoint ${{ vars.STAGING_ENDPOINT }}

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Deploy canary (10%)
        run: |
          kubectl apply -f k8s/production/canary.yaml
          sleep 300  # Wait 5 minutes for canary metrics

      - name: Check canary metrics
        run: python src/check_canary_metrics.py --threshold 0.01

      - name: Full rollout
        if: success()
        run: |
          kubectl apply -f k8s/production/
          kubectl rollout status deployment/model-serving -n ml-production
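
The data-validation job above calls src/validate_data.py. A minimal sketch of what such a script might do, assuming the expectations boil down to required columns and a maximum null rate (the function and sample data here are illustrative, not the actual script):

```python
"""Data-validation sketch: required columns and null-rate limits.

Illustrative only -- a real validate_data.py would load rows and
expectations (e.g. from the YAML config) rather than hard-coding them.
"""

def validate(rows, required_columns, max_null_rate=0.05):
    """Return a list of human-readable violations (empty list == pass)."""
    violations = []
    if not rows:
        return ["dataset is empty"]
    for col in required_columns:
        missing = sum(1 for r in rows if col not in r)
        if missing:
            violations.append(f"column '{col}' absent in {missing} rows")
            continue
        null_rate = sum(1 for r in rows if r[col] is None) / len(rows)
        if null_rate > max_null_rate:
            violations.append(
                f"column '{col}' null rate {null_rate:.2%} "
                f"exceeds {max_null_rate:.2%}"
            )
    return violations
```

In the workflow, the script would exit nonzero whenever violations is non-empty, which is what fails the data-validation job and blocks training.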

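The canary gate relies on src/check_canary_metrics.py exiting nonzero when metrics regress. A minimal sketch of the comparison it might make, assuming an absolute error-rate threshold (the function name and metric source are illustrative; a real script would pull rates from a metrics backend such as Prometheus):

```python
"""Canary-gate sketch: compare canary error rate against baseline.

Illustrative only -- real metrics would come from a monitoring backend,
not function arguments.
"""

def canary_passes(baseline_error_rate, canary_error_rate, threshold=0.01):
    """True if the canary's error rate is within `threshold` of baseline."""
    return (canary_error_rate - baseline_error_rate) <= threshold
```

In CI the script would exit 1 when this returns False, failing the "Check canary metrics" step so the full rollout never runs.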
Key Workflow Patterns

Pattern 1: Conditional Training

# Only train when training code or data changes
on:
  push:
    paths:
      - 'src/train.py'
      - 'src/model/**'
      - 'data/training/**'
      - 'configs/training_config.yaml'

Pattern 2: GPU Training with Self-Hosted Runners

train-gpu:
  runs-on: [self-hosted, gpu, linux]
  steps:
    - name: Train with GPU
      run: |
        nvidia-smi  # Verify GPU available
        python src/train.py --device cuda

Pattern 3: Caching ML Dependencies

- name: Cache pip packages
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}

- name: Cache model artifacts
  uses: actions/cache@v4
  with:
    path: models/
    key: model-${{ hashFiles('data/training/**') }}
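The model-cache key above changes whenever the training data changes because hashFiles() is a content hash. A stdlib sketch of the same idea (not GitHub's exact algorithm; the function name is illustrative):

```python
"""Content-hash cache-key sketch, mirroring the hashFiles() idea."""
import hashlib
from pathlib import Path

def cache_key(paths, prefix="model"):
    """Stable key: same file contents -> same key, any change -> new key."""
    digest = hashlib.sha256()
    for path in sorted(paths):  # sort so the key is independent of listing order
        digest.update(Path(path).read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"
```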

Interview Discussion Points

| Topic | What to say |
| --- | --- |
| GPU runners | "We use self-hosted runners with GPUs for training, GitHub-hosted for tests" |
| Secrets | "Model registry credentials in GitHub Secrets, rotated quarterly" |
| Artifacts | "Models stored in S3, only pointers in artifacts" |
| Environments | "GitHub Environments for staging/prod with required reviewers" |
| Caching | "Cache pip packages and model checkpoints to speed up workflows" |

Pro Tip: Mention that you separate training from deployment: "Training is expensive and shouldn't block every PR, so we make it conditional or scheduled."

Next, we'll cover deployment strategies for ML models.
