Why ML CI/CD Is Different
Software CI gates check: does the code compile, do the tests pass, does the linter approve?
ML CI gates need to check additional things:
- Does the training pipeline complete without errors?
- Does the resulting model meet quality thresholds (accuracy, latency, fairness metrics)?
- Has the model been signed and its provenance recorded?
- Does the serving container pass security scanning?
The last two points are where SRE habits pay dividends: ML teams that come from research backgrounds rarely think about supply chain security.
The Pipeline
```yaml
# .github/workflows/model-release.yml
name: Model Release
on:
  push:
    paths: ['src/model/**', 'configs/**']
jobs:
  train-and-evaluate:
    runs-on: self-hosted
    steps:
      - name: Run training pipeline
        run: argo submit workflows/train.yaml --wait
      - name: Check quality gates
        run: |
          ACCURACY=$(python scripts/get_metric.py accuracy)
          if (( $(echo "$ACCURACY < 0.90" | bc -l) )); then
            echo "Accuracy $ACCURACY below threshold"
            exit 1
          fi
      - name: Sign model artifact
        run: cosign sign ${{ env.MODEL_IMAGE }}
      - name: Promote to staging
        run: |
          # MLflow has no transition-to-stage CLI command;
          # stage transitions go through the registry client API
          python - <<'PY'
          from mlflow.tracking import MlflowClient
          client = MlflowClient()
          version = client.get_latest_versions("classifier", stages=["None"])[0].version
          client.transition_model_version_stage("classifier", version, stage="Staging")
          PY
```
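The workflow above gates only on accuracy, but the checklist at the top also names latency and fairness. Those gates follow the same pattern; here is a sketch of an extra step reusing get_metric.py, where the metric names (latency_p95_ms, demographic_parity_gap) and the thresholds are illustrative assumptions, not values from a real pipeline:

```yaml
# Hypothetical additional gate step; metric names and thresholds are placeholders
- name: Check latency and fairness gates
  run: |
    P95_MS=$(python scripts/get_metric.py latency_p95_ms)
    GAP=$(python scripts/get_metric.py demographic_parity_gap)
    # Fail the job if p95 latency exceeds the serving budget
    if (( $(echo "$P95_MS > 50" | bc -l) )); then
      echo "p95 latency ${P95_MS}ms above 50ms threshold"
      exit 1
    fi
    # Fail the job if the fairness gap exceeds the allowed disparity
    if (( $(echo "$GAP > 0.05" | bc -l) )); then
      echo "Demographic parity gap $GAP above 0.05 threshold"
      exit 1
    fi
```

Keeping each gate as its own step makes the failing check obvious in the Actions UI.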
Supply Chain Security
The cosign sign step is often skipped in ML pipelines. It shouldn’t be. A signed model artifact gives you:
- Proof that the model was produced by a specific pipeline run
- Tamper evidence — if the artifact is modified after signing, verification fails
- An audit trail when something goes wrong in production
The ML supply chain (training data → code → artifact → deployment) has the same attack surface as software supply chains, and deserves the same treatment.
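Signing only pays off if something verifies the signature before the model serves traffic. A minimal sketch of a verification step in the deploy job, assuming key-based signing (cosign also supports keyless); the cosign.pub path is a placeholder:

```yaml
# Hypothetical deploy-job step; cosign.pub is a placeholder key path
- name: Verify model signature before deploy
  run: |
    # Exits non-zero (failing the job) if the artifact
    # was modified after signing
    cosign verify --key cosign.pub ${{ env.MODEL_IMAGE }}
```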
Self-Hosted Runners
For pipelines that submit to Argo Workflows in a homelab cluster, you need a self-hosted GitHub Actions runner with cluster access. I run the runner as a Kubernetes Deployment with a service account that has limited permissions: just enough to submit workflows and read pod logs.
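A minimal sketch of what that service account's RBAC could look like, assuming the Argo Workflows CRDs live in the same namespace as the runner; the ci-runner name and ml-pipelines namespace are illustrative:

```yaml
# Hypothetical least-privilege RBAC for the runner's service account
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-runner
  namespace: ml-pipelines
rules:
  # Submit and watch Argo Workflows (argo submit --wait)
  - apiGroups: ["argoproj.io"]
    resources: ["workflows"]
    verbs: ["create", "get", "list", "watch"]
  # Read pod logs to debug failed workflow steps
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-runner
  namespace: ml-pipelines
subjects:
  - kind: ServiceAccount
    name: ci-runner
    namespace: ml-pipelines
roleRef:
  kind: Role
  name: ci-runner
  apiGroup: rbac.authorization.k8s.io
```

A namespaced Role (rather than a ClusterRole) keeps the blast radius small if the runner is ever compromised.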