CI/CD Fundamentals for ML

Continuous Training (CT) Concept

CI/CD gets your model to production. But what happens when that model degrades? Continuous Training (CT) automatically retrains models when needed—completing the ML automation loop.

Beyond CI/CD: The CT Extension

Traditional CI/CD:
  Code Change → Build → Test → Deploy

ML CI/CD + CT:
  Code Change  ─┐
  Data Change  ─┼─→ Build → Test → Deploy ─┐
  Schedule     ─┤                          │
  Drift Alert  ─┘                          │
  ┌────────────────────────────────────────┘
  Monitor → Drift Detected → Retrain Trigger → CI/CD Pipeline

CT Trigger Strategies

Different triggers suit different scenarios:

Trigger            When to Use                 Example
-----------------  --------------------------  --------------------------------------
Scheduled          Predictable data patterns   Retrain weekly for retail demand
Data-based         New data available          Retrain when 10K new samples arrive
Performance-based  Model degradation           Retrain when accuracy drops below 0.80
Drift-based        Distribution change         Retrain when PSI > 0.2
On-demand          Business requirement        Retrain after a major product launch

Implementing CT with GitHub Actions

Scheduled Retraining

name: Scheduled Retraining

on:
  schedule:
    - cron: '0 2 * * 0'  # Every Sunday at 2 AM
  workflow_dispatch:      # Allow manual trigger

jobs:
  check-and-retrain:
    runs-on: ubuntu-latest
    permissions:
      actions: write   # lets GITHUB_TOKEN dispatch train.yml via `gh workflow run`
    steps:
      - uses: actions/checkout@v4

      - name: Check if retraining needed
        id: check
        run: |
          python scripts/check_retrain_needed.py
          echo "needed=$(cat retrain_needed.txt)" >> $GITHUB_OUTPUT

      - name: Trigger training pipeline
        if: steps.check.outputs.needed == 'true'
        run: |
          gh workflow run train.yml
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
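
The check step above expects scripts/check_retrain_needed.py to write the string true or false into retrain_needed.txt. A minimal sketch of what that script might look like, assuming a simple file-based metadata store (the paths, threshold values, and helper names are illustrative, not part of the course material):

# scripts/check_retrain_needed.py -- illustrative sketch
# Writes 'true' or 'false' to retrain_needed.txt for the workflow step above.
# The metadata files and thresholds are assumptions; wire in your own store.
from datetime import datetime, timezone
from pathlib import Path

MAX_MODEL_AGE_DAYS = 7      # assumed staleness budget for a weekly schedule
MIN_NEW_SAMPLES = 10_000    # assumed new-data threshold (see the trigger table)

def model_age_days() -> float:
    """Days since the last run, from an assumed timezone-aware ISO-8601 stamp."""
    stamp = Path("metadata/last_train_utc.txt").read_text().strip()
    delta = datetime.now(timezone.utc) - datetime.fromisoformat(stamp)
    return delta.total_seconds() / 86400

def new_sample_count() -> int:
    """New labeled samples since the last run, from an assumed counter file."""
    return int(Path("metadata/new_sample_count.txt").read_text().strip())

if __name__ == "__main__":
    needed = (model_age_days() > MAX_MODEL_AGE_DAYS
              or new_sample_count() >= MIN_NEW_SAMPLES)
    Path("retrain_needed.txt").write_text("true" if needed else "false")
    print(f"Retraining needed: {needed}")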

Performance-Based Trigger

name: Performance Monitor

on:
  schedule:
    - cron: '0 * * * *'  # Every hour

jobs:
  monitor:
    runs-on: ubuntu-latest
    permissions:
      actions: write   # lets GITHUB_TOKEN dispatch train.yml via `gh workflow run`
    steps:
      - uses: actions/checkout@v4   # gh needs a repository context to run

      - name: Check model performance
        run: |
          python scripts/check_performance.py \
            --model-name "fraud-detector" \
            --min-accuracy 0.80

      - name: Trigger retraining if degraded
        if: failure()   # fires only when the check above exits nonzero
        run: |
          gh workflow run train.yml \
            --field reason="performance_degradation"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Data Drift Trigger

# scripts/check_drift.py
import numpy as np
import pandas as pd

def calculate_psi(baseline, current, bins=10):
    """Calculate Population Stability Index."""
    baseline_hist, edges = np.histogram(baseline, bins=bins)
    current_hist, _ = np.histogram(current, bins=edges)

    baseline_pct = baseline_hist / len(baseline)
    current_pct = current_hist / len(current)

    # Avoid division by zero
    baseline_pct = np.where(baseline_pct == 0, 0.0001, baseline_pct)
    current_pct = np.where(current_pct == 0, 0.0001, current_pct)

    psi = np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct))
    return psi

def check_drift(feature_name, threshold=0.2):
    baseline = pd.read_parquet(f"data/baseline/{feature_name}.parquet")
    current = pd.read_parquet(f"data/current/{feature_name}.parquet")

    psi = calculate_psi(baseline[feature_name], current[feature_name])

    if psi > threshold:
        print(f"Drift detected in {feature_name}: PSI={psi:.3f}")
        return True
    return False
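
To turn this check into a trigger, the same failure-based pattern applies: run it over every monitored feature and exit nonzero when any of them drifts. A hypothetical entry point appended to the script (the feature names are illustrative):

# Hypothetical entry point for scripts/check_drift.py: exit nonzero when any
# monitored feature drifts, so a workflow step can trigger retraining on failure.
import sys

MONITORED_FEATURES = ["transaction_amount", "merchant_category"]  # illustrative

if __name__ == "__main__":
    drifted = [name for name in MONITORED_FEATURES if check_drift(name)]
    sys.exit(1 if drifted else 0)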

CT Architecture Patterns

Pattern 1: Simple Scheduled CT

┌─────────────────────────────────────────┐
│           Simple Scheduled CT           │
├─────────────────────────────────────────┤
│                                         │
│  Cron ──▶ Pull Data ──▶ Train ──▶ Deploy│
│  (Weekly)                               │
│                                         │
└─────────────────────────────────────────┘

Pros: Simple, predictable.
Cons: May retrain unnecessarily or miss drift.

Pattern 2: Monitored CT

┌─────────────────────────────────────────┐
│            Monitored CT                 │
├─────────────────────────────────────────┤
│                                         │
│  Production ──▶ Monitor ──▶ Drift?      │
│       │                        │        │
│       │                        ▼ Yes    │
│       │                    Retrain      │
│       │                        │        │
│       └────────────────────────┘        │
│                   Deploy                │
└─────────────────────────────────────────┘

Pros: Retrains only when needed.
Cons: Requires monitoring infrastructure.

Pattern 3: Shadow CT

┌─────────────────────────────────────────┐
│             Shadow CT                   │
├─────────────────────────────────────────┤
│                                         │
│  Production Model ──▶ Serve Traffic     │
│         │                               │
│         ▼                               │
│  Retrain Continuously (Shadow)          │
│         │                               │
│         ▼                               │
│  Shadow Better? ──Yes──▶ Promote        │
│                                         │
└─────────────────────────────────────────┘

Pros: Always has a fresh model ready.
Cons: Higher compute costs.
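
The promotion decision in this pattern can be a simple champion/challenger comparison on the same evaluation window. A minimal sketch, assuming both models expose predict() and using a margin so evaluation noise alone never triggers a promotion (the margin value and function names are illustrative):

# Illustrative shadow-promotion gate. Scoring both models on the same recent
# evaluation window keeps the comparison fair; the margin guards against
# promoting on noise rather than real improvement.
from sklearn.metrics import accuracy_score

PROMOTION_MARGIN = 0.01  # assumed minimum improvement worth a promotion

def should_promote(prod_model, shadow_model, X_eval, y_eval) -> bool:
    prod_acc = accuracy_score(y_eval, prod_model.predict(X_eval))
    shadow_acc = accuracy_score(y_eval, shadow_model.predict(X_eval))
    print(f"production={prod_acc:.3f} shadow={shadow_acc:.3f}")
    return shadow_acc >= prod_acc + PROMOTION_MARGIN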

CT Best Practices

Practice                       Why
-----------------------------  ----------------------------------------------
Version everything             Know exactly what changed between retrains
Automate validation            Never deploy a degraded model
Set alerts, not just triggers  Humans should know when CT runs (sketch below)
Budget for compute             CT increases training costs
A/B test new models            Don't assume the retrained model is better
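
As a concrete shape for the alerting practice above, a retrain trigger can post to a chat webhook before kicking off the pipeline. A minimal sketch, assuming a Slack-style incoming webhook stored as a CI secret (the SLACK_WEBHOOK_URL variable name is an assumption; requests is a third-party dependency):

# Illustrative CT alert: notify a channel whenever retraining is triggered,
# so an automated run never happens silently.
import os
import requests

def alert_retrain_started(reason: str) -> None:
    webhook = os.environ["SLACK_WEBHOOK_URL"]  # assumed CI secret
    requests.post(
        webhook,
        json={"text": f"CT pipeline triggered: {reason}"},
        timeout=10,
    )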

Google's MLOps Levels and CT

Google defines MLOps maturity levels based on CT automation:

Level    CT Approach
-------  -------------------------------------------
Level 0  Manual retraining
Level 1  Automated training pipeline, manual trigger
Level 2  Automated CT with monitoring triggers

Key Metrics for CT Decisions

# Example: CT decision logic. Thresholds mirror the trigger table above;
# get_current_metrics(), days_since_last_train(), and new_samples_count()
# are project-specific hooks into your metrics store and training metadata.
ACCURACY_THRESHOLD = 0.80   # from the performance-based trigger
PSI_THRESHOLD = 0.2         # from the drift-based trigger
MAX_MODEL_AGE = 30          # days; illustrative staleness budget
MIN_NEW_SAMPLES = 10_000    # from the data-based trigger

def should_retrain():
    """Return (retrain, reason) from the first trigger that fires."""
    metrics = get_current_metrics()

    # Performance degradation
    if metrics['accuracy'] < ACCURACY_THRESHOLD:
        return True, "accuracy_drop"

    # Data drift
    if metrics['psi'] > PSI_THRESHOLD:
        return True, "data_drift"

    # Staleness
    if days_since_last_train() > MAX_MODEL_AGE:
        return True, "model_staleness"

    # New data volume
    if new_samples_count() > MIN_NEW_SAMPLES:
        return True, "new_data_available"

    return False, None

Key Insight: CT isn't about retraining more often—it's about retraining at the right time with the right data.

Next, we'll survey the tool landscape for ML CI/CD.
Take Quiz