Why ML CI/CD Is Different

Software CI gates check: does the code compile, do the tests pass, does the linter approve?

ML CI gates need to check additional things:

  1. Did the candidate model clear its quality gates (e.g. accuracy above an agreed floor)?
  2. Does it pass through staging before serving production traffic?
  3. Is the artifact signed, so you can prove which pipeline run produced it?
  4. Is there an audit trail from training code and data to the deployed model?

The last two points are where SRE habits pay dividends — ML teams that came from research backgrounds rarely think about supply chain security.

The Pipeline

# .github/workflows/model-release.yml
name: Model Release

on:
  push:
    paths: ['src/model/**', 'configs/**']

jobs:
  train-and-evaluate:
    runs-on: self-hosted
    steps:
      - name: Run training pipeline
        run: argo submit workflows/train.yaml --wait

      - name: Check quality gates
        run: |
          ACCURACY=$(python scripts/get_metric.py accuracy)
          if (( $(echo "$ACCURACY < 0.90" | bc -l) )); then
            echo "Accuracy $ACCURACY below threshold"
            exit 1
          fi

      - name: Sign model artifact
        run: cosign sign ${{ env.MODEL_IMAGE }}

      - name: Promote to staging
        run: |
          python - <<'PY'
          # MLflow's CLI has no stage-transition subcommand; use the client API.
          import os
          from mlflow.tracking import MlflowClient
          MlflowClient().transition_model_version_stage(
              name="classifier",
              # MODEL_VERSION is assumed to be exported by an earlier step
              version=os.environ["MODEL_VERSION"],
              stage="Staging",
          )
          PY
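The single-metric check above works, but bc float parsing gets fragile once more gates are added, and bc is missing from many minimal CI images. A sketch of a reusable comparison helper using awk instead — the 0.93 value is a stub standing in for the scripts/get_metric.py call from the workflow:

```shell
# check_gate VALUE FLOOR -> exit 0 if VALUE >= FLOOR.
# awk does the float comparison; unlike bc, it ships on most minimal images.
check_gate() {
  awk -v v="$1" -v f="$2" 'BEGIN { exit !(v >= f) }'
}

# Stub value; in the workflow this comes from scripts/get_metric.py.
ACCURACY=0.93
if check_gate "$ACCURACY" 0.90; then
  echo "accuracy gate passed"
else
  echo "accuracy $ACCURACY below threshold" >&2
  exit 1
fi
```

Adding a second gate is then one more check_gate call rather than another copy of the bc pipeline.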

Supply Chain Security

The cosign sign step is often skipped in ML pipelines. It shouldn’t be. A signed model artifact gives you:

  1. Proof that the model was produced by a specific pipeline run
  2. Tamper evidence — if the artifact is modified after signing, verification fails
  3. An audit trail when something goes wrong in production
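Point 2 is worth unpacking: tamper evidence in miniature is just a content digest recorded at signing time, which any later modification invalidates. A sketch with plain sha256sum — cosign layers a signing identity and a transparency log on top of the same idea:

```shell
# Record a digest of the artifact at "signing" time, then modify the
# artifact and show that the digests no longer match.
ARTIFACT=$(mktemp)
printf 'model-weights-v1' > "$ARTIFACT"
RECORDED=$(sha256sum "$ARTIFACT" | cut -d' ' -f1)   # captured at signing time

printf 'tampered' >> "$ARTIFACT"                    # modified after signing
CURRENT=$(sha256sum "$ARTIFACT" | cut -d' ' -f1)

if [ "$RECORDED" != "$CURRENT" ]; then
  echo "verification fails: artifact changed after signing"
fi
rm -f "$ARTIFACT"
```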

The ML supply chain (training data → code → artifact → deployment) has the same attack surface as software supply chains, and deserves the same treatment.
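Signing is only half the loop: the deployment side should refuse anything it cannot verify. A sketch of the verify step, assuming keyless signing through GitHub's OIDC provider — the repo path in the identity regexp is a placeholder, and MODEL_IMAGE is the image signed in the workflow above:

```shell
# Verify before deploying. Keyless verification checks that the signature
# was produced by this repo's model-release workflow via GitHub's OIDC
# issuer; OWNER/REPO is a placeholder for the actual repository.
cosign verify \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --certificate-identity-regexp 'https://github\.com/OWNER/REPO/\.github/workflows/model-release\.yml@.*' \
  "$MODEL_IMAGE"
```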

Self-Hosted Runners

For pipelines that submit to Argo Workflows in a homelab cluster, you need a self-hosted GitHub Actions runner with kubectl access. I run the runner as a Kubernetes Deployment with a service account that has limited permissions — just enough to submit workflows and read pod logs.
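The "limited permissions" part can be pinned down with a namespaced Role and RoleBinding. A sketch, assuming the runner's service account is named actions-runner in a ci namespace — the names and the exact resource list are assumptions, not the author's actual manifests:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: runner-minimal
  namespace: ci
rules:
  # submit and watch Argo Workflows
  - apiGroups: ['argoproj.io']
    resources: ['workflows']
    verbs: ['create', 'get', 'list', 'watch']
  # read pod logs from workflow steps
  - apiGroups: ['']
    resources: ['pods', 'pods/log']
    verbs: ['get', 'list']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: runner-minimal
  namespace: ci
subjects:
  - kind: ServiceAccount
    name: actions-runner
    namespace: ci
roleRef:
  kind: Role
  name: runner-minimal
  apiGroup: rbac.authorization.k8s.io
```

A Role rather than a ClusterRole keeps the runner scoped to one namespace, so a compromised CI job cannot touch workloads elsewhere in the cluster.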