The SRE Runbook Problem
SRE teams write runbooks: step-by-step procedures for responding to incidents. “When alert X fires, do steps 1, 2, 3.” The good teams automate those runbooks into code.
ML systems have an analogous runbook: “When model drift is detected, retrain the model on fresh data, evaluate it against quality gates, promote it if it passes, notify the team.” Most ML teams do this manually. Automating it is the goal of Phase 9.
Argo Events Architecture
Argo Events is a separate system from Argo Workflows. Its job is to listen for events and trigger workflows in response.
[Prometheus Alert] → [Argo Events EventSource] → [Sensor] → [Trigger] → [Argo Workflow]
The EventSource listens on a webhook endpoint. Alertmanager (configured to send webhook notifications) POSTs to that endpoint when a drift alert fires. The Sensor evaluates conditions and fires the Trigger, which submits the retraining Workflow.
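The Alertmanager side of this handoff is a webhook receiver pointing at the EventSource. A minimal sketch, assuming the EventSource is exposed through a Service named `drift-alert-eventsource-svc` in an `argo-events` namespace and that the drift alert is named `ModelDriftDetected` — all three names are illustrative, not defaults:

```yaml
# alertmanager.yml (fragment): route drift alerts to the EventSource webhook.
route:
  routes:
    - matchers:
        - alertname = ModelDriftDetected   # illustrative alert name
      receiver: argo-events-drift
receivers:
  - name: argo-events-drift
    webhook_configs:
      # Service name/namespace are assumptions; use whatever exposes
      # the EventSource's port 12000 in your cluster.
      - url: http://drift-alert-eventsource-svc.argo-events.svc:12000/drift
        send_resolved: false
```

`send_resolved: false` keeps Alertmanager from POSTing again when the alert clears, which would otherwise trigger a second, pointless retrain.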
The EventSource
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: drift-alert
spec:
  webhook:
    drift-detected:
      port: "12000"
      endpoint: /drift
      method: POST
The Sensor and Trigger
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: model-retrain
spec:
  dependencies:
    - name: drift-event
      eventSourceName: drift-alert
      eventName: drift-detected
  triggers:
    - template:
        name: retrain-workflow
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: retrain-
              spec:
                workflowTemplateRef:
                  name: model-retrain-pipeline
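The trigger above submits the same Workflow regardless of which model drifted. Argo Events triggers also support parameter extraction from the event payload, which lets the Sensor forward details into the Workflow. A sketch of the `triggers` section with that wired up — the `dataKey` path assumes Alertmanager's webhook payload shape and a `model` label on the alert, so adjust it to your labels:

```yaml
# Sketch: pass the drifting model's name from the alert payload into the
# Workflow as an argument.
triggers:
  - template:
      name: retrain-workflow
      argoWorkflow:
        operation: submit
        source:
          resource:
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            metadata:
              generateName: retrain-
            spec:
              arguments:
                parameters:
                  - name: model-name
                    value: placeholder   # overwritten by the parameter below
              workflowTemplateRef:
                name: model-retrain-pipeline
        parameters:
          - src:
              dependencyName: drift-event
              dataKey: body.alerts.0.labels.model
            dest: spec.arguments.parameters.0.value
```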
Quality Gates Still Apply
Self-healing doesn’t mean unconditional retraining. The retrain workflow includes the same quality gates as the CI pipeline. If the new model scores worse than the current production model on the evaluation set, it fails the gate and the current model stays deployed. A human gets notified either way.
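Inside the `model-retrain-pipeline` WorkflowTemplate, that gate is just a conditional step. A minimal sketch — the step and template names are illustrative, and it assumes the evaluation step emits an output parameter named `passed`:

```yaml
# Sketch of the gate inside the WorkflowTemplate. `promote` only runs when
# the evaluation step reports the candidate beat the production model.
templates:
  - name: main
    steps:
      - - name: train
          template: train-model
      - - name: evaluate
          template: evaluate-model   # emits output parameter `passed`
      - - name: promote
          template: promote-model
          when: "{{steps.evaluate.outputs.parameters.passed}} == true"
      - - name: notify
          template: notify-team      # runs whether promote ran or was skipped
```

When the gate fails, `promote` is skipped rather than failed, so the workflow still completes and `notify` reports the outcome.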
Automation that can make things worse without human review is not self-healing — it’s chaos engineering on yourself.