GitHub Actions for ML Workflows
Model Validation Workflows
Training a model is just the first step. Before deployment, you need to validate that the model meets quality standards. Let's build validation workflows that act as quality gates.
The Validation Pipeline
┌─────────────────────────────────────────────────────────────┐
│                  Model Validation Pipeline                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Train ──▶ Accuracy ──▶ Latency ──▶ Fairness ──▶ Deploy    │
│              Check        Check       Check                 │
│                │            │           │                   │
│                ▼            ▼           ▼                   │
│              >0.85?       <100ms?     Bias OK?              │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Accuracy Threshold Gate
name: Model Validation
on:
  workflow_run:
    workflows: ["Training Pipeline"]
    types: [completed]
jobs:
  validate-accuracy:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    outputs:
      accuracy: ${{ steps.evaluate.outputs.accuracy }}
    steps:
      - uses: actions/checkout@v4
      - name: Download trained model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: models/
          # run-id plus a token are required to pull artifacts from the triggering run
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Evaluate model
        id: evaluate
        run: |
          python scripts/evaluate.py \
            --model models/model.pkl \
            --test-data data/test.parquet \
            --output metrics.json
          ACCURACY=$(jq .accuracy metrics.json)
          echo "accuracy=$ACCURACY" >> $GITHUB_OUTPUT
      - name: Check accuracy threshold
        run: |
          ACCURACY=${{ steps.evaluate.outputs.accuracy }}
          THRESHOLD=0.85
          if (( $(echo "$ACCURACY < $THRESHOLD" | bc -l) )); then
            echo "::error::Accuracy $ACCURACY below threshold $THRESHOLD"
            exit 1
          fi
          echo "::notice::Accuracy $ACCURACY meets threshold $THRESHOLD"
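The gate reads metrics.json with jq, so scripts/evaluate.py only has to write a small JSON file. Here's a minimal sketch of what that script might look like, assuming a pickled scikit-learn-style model and a test set whose label column is named target (both are illustrative assumptions, not something the workflow above requires):
# scripts/evaluate.py -- illustrative sketch, not a drop-in implementation
import argparse
import json
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--test-data", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    # Load the pickled model and the held-out test set
    with open(args.model, "rb") as f:
        model = pickle.load(f)
    df = pd.read_parquet(args.test_data)

    # Assumes the label column is named "target"; everything else is a feature
    X, y = df.drop(columns=["target"]), df["target"]
    accuracy = accuracy_score(y, model.predict(X))

    # Emit metrics as JSON so the workflow can read them with jq
    with open(args.output, "w") as f:
        json.dump({"accuracy": round(float(accuracy), 4)}, f)


if __name__ == "__main__":
    main()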
Latency Validation
  validate-latency:
    needs: validate-accuracy
    runs-on: ubuntu-latest
    outputs:
      p50: ${{ steps.latency.outputs.p50 }}
      p99: ${{ steps.latency.outputs.p99 }}
    steps:
      - uses: actions/checkout@v4
      - name: Download model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: models/
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Measure inference latency
        id: latency
        run: |
          python scripts/benchmark.py \
            --model models/model.pkl \
            --samples 1000 \
            --output latency.json
          P50=$(jq .p50_ms latency.json)
          P99=$(jq .p99_ms latency.json)
          echo "p50=$P50" >> $GITHUB_OUTPUT
          echo "p99=$P99" >> $GITHUB_OUTPUT
      - name: Check latency threshold
        run: |
          P99=${{ steps.latency.outputs.p99 }}
          MAX_LATENCY=100  # 100ms
          if (( $(echo "$P99 > $MAX_LATENCY" | bc -l) )); then
            echo "::error::P99 latency ${P99}ms exceeds ${MAX_LATENCY}ms"
            exit 1
          fi
          echo "::notice::P99 latency ${P99}ms within threshold"
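Likewise, scripts/benchmark.py only needs to report p50_ms and p99_ms as JSON. A rough sketch, assuming single-row predictions over the test set to approximate per-request inference latency (the data path, target column, and timing approach are all illustrative assumptions):
# scripts/benchmark.py -- illustrative sketch
import argparse
import json
import pickle
import time

import numpy as np
import pandas as pd


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--samples", type=int, default=1000)
    parser.add_argument("--test-data", default="data/test.parquet")
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    with open(args.model, "rb") as f:
        model = pickle.load(f)
    X = pd.read_parquet(args.test_data).drop(columns=["target"]).head(args.samples)

    # Time single-row predictions to approximate online serving latency
    timings_ms = []
    for i in range(len(X)):
        row = X.iloc[[i]]
        start = time.perf_counter()
        model.predict(row)
        timings_ms.append((time.perf_counter() - start) * 1000)

    with open(args.output, "w") as f:
        json.dump({
            "p50_ms": round(float(np.percentile(timings_ms, 50)), 2),
            "p99_ms": round(float(np.percentile(timings_ms, 99)), 2),
        }, f)


if __name__ == "__main__":
    main()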
Fairness and Bias Testing
  validate-fairness:
    needs: validate-accuracy
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: models/
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Run fairness analysis
        id: fairness
        run: |
          python scripts/fairness_check.py \
            --model models/model.pkl \
            --test-data data/test.parquet \
            --protected-attributes gender,age_group \
            --output fairness.json
      - name: Check demographic parity
        run: |
          # Fail the job if any group's outcomes deviate too far from parity
          python -c "
          import json, sys
          with open('fairness.json') as f:
              results = json.load(f)
          for attr, metrics in results['demographic_parity'].items():
              ratio = metrics['ratio']
              if ratio < 0.8 or ratio > 1.25:
                  print(f'::error::Demographic parity violation for {attr}: {ratio}')
                  sys.exit(1)
          print('::notice::All fairness checks passed')
          "
Regression Testing Against Baseline
  validate-regression:
    needs: validate-accuracy
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download new model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: models/new/
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Download baseline model
        run: |
          # Fetch the current production model from the registry
          python scripts/fetch_baseline.py --output models/baseline/
      - name: Compare models
        id: compare
        run: |
          python scripts/compare_models.py \
            --new models/new/model.pkl \
            --baseline models/baseline/model.pkl \
            --test-data data/test.parquet \
            --output comparison.json
          IMPROVEMENT=$(jq .accuracy_improvement comparison.json)
          echo "improvement=$IMPROVEMENT" >> $GITHUB_OUTPUT
      - name: Check for regression
        run: |
          IMPROVEMENT=${{ steps.compare.outputs.improvement }}
          MIN_IMPROVEMENT=-0.01  # allow at most a 0.01 drop in accuracy
          if (( $(echo "$IMPROVEMENT < $MIN_IMPROVEMENT" | bc -l) )); then
            echo "::error::Model regression detected: ${IMPROVEMENT}"
            exit 1
          fi
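scripts/compare_models.py just has to score both models on the same test set and report the delta the gate reads as accuracy_improvement. A minimal sketch under the same illustrative assumptions as the earlier scripts (pickled models, a target label column):
# scripts/compare_models.py -- illustrative sketch
import argparse
import json
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score


def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--new", required=True)
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--test-data", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    df = pd.read_parquet(args.test_data)
    X, y = df.drop(columns=["target"]), df["target"]

    # Score the candidate and the production baseline on identical data
    new_acc = accuracy_score(y, load_model(args.new).predict(X))
    base_acc = accuracy_score(y, load_model(args.baseline).predict(X))

    # A positive improvement means the candidate beats the baseline
    with open(args.output, "w") as f:
        json.dump({
            "new_accuracy": round(float(new_acc), 4),
            "baseline_accuracy": round(float(base_acc), 4),
            "accuracy_improvement": round(float(new_acc - base_acc), 4),
        }, f)


if __name__ == "__main__":
    main()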
PR Comment with Results
  report-results:
    needs: [validate-accuracy, validate-latency, validate-fairness]
    runs-on: ubuntu-latest
    # Only comment when the triggering training run was for a pull request
    if: ${{ github.event.workflow_run.event == 'pull_request' }}
    steps:
      - name: Create validation report
        uses: actions/github-script@v7
        with:
          script: |
            const accuracy = '${{ needs.validate-accuracy.outputs.accuracy }}';
            const p99 = '${{ needs.validate-latency.outputs.p99 }}';
            const body = `## Model Validation Report

            | Metric | Value | Threshold | Status |
            |--------|-------|-----------|--------|
            | Accuracy | ${accuracy} | ≥ 0.85 | ✅ |
            | P99 Latency | ${p99}ms | ≤ 100ms | ✅ |
            | Fairness | Passed | - | ✅ |

            **Ready for deployment**`;
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.payload.workflow_run.pull_requests[0].number,
              body: body
            });
Because this workflow is triggered by workflow_run rather than pull_request, the PR number comes from the triggering run's payload, and the accuracy and p99 values flow in through the job outputs declared above.
Complete Validation Workflow
name: Model Validation
on:
  workflow_run:
    workflows: ["Training"]
    types: [completed]
jobs:
  gate-accuracy:
    runs-on: ubuntu-latest
    outputs:
      accuracy: ${{ steps.eval.outputs.accuracy }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: model
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - id: eval
        run: |
          python evaluate.py --output metrics.json
          echo "accuracy=$(jq .accuracy metrics.json)" >> $GITHUB_OUTPUT
      - name: Gate check
        run: |
          if (( $(echo "${{ steps.eval.outputs.accuracy }} < 0.85" | bc -l) )); then
            exit 1
          fi
  gate-latency:
    needs: gate-accuracy
    runs-on: ubuntu-latest
    steps:
      - run: python benchmark.py --max-p99 100
  gate-fairness:
    needs: gate-accuracy
    runs-on: ubuntu-latest
    steps:
      - run: python fairness.py --threshold 0.8
  deploy:
    needs: [gate-accuracy, gate-latency, gate-fairness]
    runs-on: ubuntu-latest
    steps:
      - run: echo "All gates passed, deploying..."
      - run: ./deploy.sh
Key Insight: Validation gates should cover accuracy, latency, fairness, and regression. A model that's accurate but slow or biased shouldn't reach production.
Next, we'll explore reusable workflows and best practices.