CI/CD Fundamentals for ML
Continuous Training (CT) Concept
3 min read
CI/CD gets your model to production. But what happens when that model degrades? Continuous Training (CT) automatically retrains models when needed—completing the ML automation loop.
Beyond CI/CD: The CT Extension
Traditional CI/CD:
Code Change → Build → Test → Deploy
ML CI/CD + CT:
Code Change ─┐
Data Change ─┼─→ Build → Test → Deploy ─┐
Schedule    ─┤                          │
Drift Alert ─┘                          │
                                        │
  ┌─────────────────────────────────────┘
  │
  ▼
Monitor → Drift Detected → Retrain Trigger → CI/CD Pipeline
CT Trigger Strategies
Different triggers suit different scenarios:
| Trigger | When to Use | Example |
|---|---|---|
| Scheduled | Predictable data patterns | Retrain weekly for retail demand |
| Data-based | New data available | Retrain when 10K new samples arrive |
| Performance-based | Model degradation | Retrain when accuracy drops below 0.80 |
| Drift-based | Distribution change | Retrain when PSI > 0.2 |
| On-demand | Business requirement | Retrain after major product launch |
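In practice several of these triggers coexist and share one set of thresholds. As a rough sketch, the trigger scripts in the rest of this lesson could read from a single configuration object; the RetrainTriggerConfig name, fields, and defaults below are illustrative assumptions, not part of any framework.
# Hypothetical trigger configuration mirroring the table above;
# names and defaults are illustrative, not a library API.
from dataclasses import dataclass

@dataclass
class RetrainTriggerConfig:
    schedule_cron: str = "0 2 * * 0"   # scheduled: weekly retrain
    min_new_samples: int = 10_000      # data-based: enough new samples arrived
    min_accuracy: float = 0.80         # performance-based: retrain below this
    max_psi: float = 0.2               # drift-based: retrain above this PSI
    allow_manual: bool = True          # on-demand trigger for business events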
Implementing CT with GitHub Actions
Scheduled Retraining
name: Scheduled Retraining
on:
  schedule:
    - cron: '0 2 * * 0'  # Every Sunday at 2 AM
  workflow_dispatch:     # Allow manual trigger
jobs:
  check-and-retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check if retraining needed
        id: check
        run: |
          python scripts/check_retrain_needed.py
          echo "needed=$(cat retrain_needed.txt)" >> $GITHUB_OUTPUT
      - name: Trigger training pipeline
        if: steps.check.outputs.needed == 'true'
        run: |
          gh workflow run train.yml
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
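The workflow above expects a repository script that writes its decision to retrain_needed.txt. What that script checks is up to you; here is a minimal sketch assuming a simple staleness rule and a hypothetical model_metadata.json file that records the last training time.
# scripts/check_retrain_needed.py -- minimal sketch (assumed contents)
# Writes "true"/"false" to retrain_needed.txt for the workflow's check step.
import json
from datetime import datetime, timezone
from pathlib import Path

MAX_MODEL_AGE_DAYS = 30  # assumed staleness cutoff

def model_age_days(metadata_path: str = "model_metadata.json") -> int:
    """Read the last training timestamp from a hypothetical metadata file."""
    meta = json.loads(Path(metadata_path).read_text())
    trained_at = datetime.fromisoformat(meta["trained_at"])
    if trained_at.tzinfo is None:
        trained_at = trained_at.replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - trained_at).days

if __name__ == "__main__":
    needed = model_age_days() > MAX_MODEL_AGE_DAYS
    Path("retrain_needed.txt").write_text("true" if needed else "false")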
Performance-Based Trigger
name: Performance Monitor
on:
  schedule:
    - cron: '0 * * * *'  # Every hour
jobs:
  monitor:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check model performance
        run: |
          python scripts/check_performance.py \
            --model-name "fraud-detector" \
            --min-accuracy 0.80
      - name: Trigger retraining if degraded
        if: failure()
        run: |
          gh workflow run train.yml \
            --field reason="performance_degradation"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
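This workflow relies on the check script exiting with a non-zero status when accuracy falls below the threshold, which is what makes the `if: failure()` step fire. A minimal sketch of scripts/check_performance.py under that assumption; get_production_accuracy is a hypothetical hook into your monitoring backend.
# scripts/check_performance.py -- minimal sketch (assumed contents)
# Exits non-zero when accuracy is below threshold so `if: failure()` fires.
import argparse
import sys

def get_production_accuracy(model_name: str) -> float:
    """Hypothetical hook: query your monitoring backend for recent accuracy."""
    raise NotImplementedError("wire this to your metrics store")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--min-accuracy", type=float, default=0.80)
    args = parser.parse_args()

    accuracy = get_production_accuracy(args.model_name)
    print(f"{args.model_name}: accuracy={accuracy:.3f} (threshold {args.min_accuracy})")
    sys.exit(0 if accuracy >= args.min_accuracy else 1)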
Data Drift Trigger
# scripts/check_drift.py
import numpy as np
import pandas as pd

def calculate_psi(baseline, current, bins=10):
    """Calculate Population Stability Index."""
    baseline_hist, edges = np.histogram(baseline, bins=bins)
    current_hist, _ = np.histogram(current, bins=edges)
    baseline_pct = baseline_hist / len(baseline)
    current_pct = current_hist / len(current)
    # Avoid division by zero
    baseline_pct = np.where(baseline_pct == 0, 0.0001, baseline_pct)
    current_pct = np.where(current_pct == 0, 0.0001, current_pct)
    psi = np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct))
    return psi

def check_drift(feature_name, threshold=0.2):
    baseline = pd.read_parquet(f"data/baseline/{feature_name}.parquet")
    current = pd.read_parquet(f"data/current/{feature_name}.parquet")
    psi = calculate_psi(baseline[feature_name], current[feature_name])
    if psi > threshold:
        print(f"Drift detected in {feature_name}: PSI={psi:.3f}")
        return True
    return False
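To turn this into a CT trigger, the script can check each monitored feature and record the decision using the same retrain_needed.txt convention as the scheduled workflow. The feature names below are hypothetical.
# Hypothetical driver appended to scripts/check_drift.py
from pathlib import Path

MONITORED_FEATURES = ["transaction_amount", "merchant_category"]  # illustrative

if __name__ == "__main__":
    drifted = any(check_drift(name) for name in MONITORED_FEATURES)
    Path("retrain_needed.txt").write_text("true" if drifted else "false")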
CT Architecture Patterns
Pattern 1: Simple Scheduled CT
┌─────────────────────────────────────────┐
│           Simple Scheduled CT           │
├─────────────────────────────────────────┤
│                                         │
│  Cron ──▶ Pull Data ──▶ Train ──▶ Deploy│
│ (Weekly)                                │
│                                         │
└─────────────────────────────────────────┘
Pros: Simple, predictable. Cons: May retrain unnecessarily or miss drift.
Pattern 2: Monitored CT
┌─────────────────────────────────────────┐
│              Monitored CT               │
├─────────────────────────────────────────┤
│                                         │
│  Production ──▶ Monitor ──▶ Drift?      │
│       │                       │         │
│       │                       ▼ Yes     │
│       │                    Retrain      │
│       │                       │         │
│       └───────────────────────┘         │
│                Deploy                   │
└─────────────────────────────────────────┘
Pros: Retrains only when needed. Cons: Requires monitoring infrastructure.
Pattern 3: Shadow CT
┌─────────────────────────────────────────┐
│                Shadow CT                │
├─────────────────────────────────────────┤
│                                         │
│  Production Model ──▶ Serve Traffic     │
│         │                               │
│         ▼                               │
│  Retrain Continuously (Shadow)          │
│         │                               │
│         ▼                               │
│  Shadow Better? ──Yes──▶ Promote        │
│                                         │
└─────────────────────────────────────────┘
Pros: Always has a fresh model ready. Cons: Higher compute costs.
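The promotion step in Pattern 3 needs a gate that compares the shadow model against production on the same recent data. A hedged sketch follows, where evaluate_model and promote_model are placeholders for your metrics and registry tooling rather than real APIs.
# Hypothetical shadow-promotion gate for Pattern 3.
from typing import Any, Callable

def maybe_promote_shadow(
    production_model: Any,
    shadow_model: Any,
    eval_data: Any,
    evaluate_model: Callable[[Any, Any], float],  # placeholder metric hook
    promote_model: Callable[[Any], None],         # placeholder registry hook
    margin: float = 0.01,
) -> bool:
    """Promote the shadow model only if it beats production by a clear margin."""
    prod_score = evaluate_model(production_model, eval_data)
    shadow_score = evaluate_model(shadow_model, eval_data)
    if shadow_score > prod_score + margin:
        promote_model(shadow_model)
        return True
    return False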
CT Best Practices
| Practice | Why |
|---|---|
| Version everything | Know exactly what changed between retrains |
| Automate validation | Never deploy a degraded model |
| Set alerts, not just triggers | Humans should know when CT runs |
| Budget for compute | CT increases training costs |
| A/B test new models | Don't assume retrained is better |
Google's MLOps Levels and CT
Google defines MLOps maturity levels based on CT automation:
| Level | CT Approach |
|---|---|
| Level 0 | Manual process: training and deployment run by hand |
| Level 1 | ML pipeline automation: continuous training triggered automatically (schedule, new data, drift) |
| Level 2 | CI/CD pipeline automation: the training pipeline itself is built, tested, and deployed automatically, on top of CT |
Key Metrics for CT Decisions
# Example: CT decision logic
def should_retrain():
    metrics = get_current_metrics()
    # Performance degradation
    if metrics['accuracy'] < ACCURACY_THRESHOLD:
        return True, "accuracy_drop"
    # Data drift
    if metrics['psi'] > PSI_THRESHOLD:
        return True, "data_drift"
    # Staleness
    if days_since_last_train() > MAX_MODEL_AGE:
        return True, "model_staleness"
    # New data volume
    if new_samples_count() > MIN_NEW_SAMPLES:
        return True, "new_data_available"
    return False, None
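The decision can then feed the same trigger conventions used earlier, for example by writing retrain_needed.txt for the scheduled workflow and logging the reason passed to `gh workflow run`. A hedged sketch of that wiring:
# Hypothetical wiring of should_retrain() into the trigger scripts above.
from pathlib import Path

if __name__ == "__main__":
    needed, reason = should_retrain()
    Path("retrain_needed.txt").write_text("true" if needed else "false")
    if needed:
        print(f"Retraining triggered: {reason}")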
Key Insight: CT isn't about retraining more often—it's about retraining at the right time with the right data.
Next, we'll survey the tool landscape for ML CI/CD.