The SRE Runbook Problem
SRE teams write runbooks: step-by-step procedures for responding to incidents. “When alert X fires, do steps 1, 2, 3.” The good teams automate those runbooks into code.
ML systems have an analogous runbook: “When model drift is detected, retrain the model on fresh data, evaluate it against quality gates, promote it if it passes, notify the team.” Most ML teams do this manually. Automating it is the goal of Phase 9.
Argo Events Architecture
Argo Events is a separate system from Argo Workflows. Its job is to listen for events and trigger workflows in response.
[Prometheus Alert] → [Argo Events EventSource] → [Sensor] → [Trigger] → [Argo Workflow]
The EventSource listens on a webhook endpoint. Alertmanager (configured to send webhook notifications) POSTs to that endpoint when a drift alert fires. The Sensor evaluates conditions and fires the Trigger, which submits the retraining Workflow.
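The Alertmanager side of this handoff is a webhook receiver pointing at the EventSource. A minimal sketch, assuming the EventSource is exposed through a Service named `drift-alert-eventsource-svc` in an `argo-events` namespace and that the drift alert is named `ModelDriftDetected` — all three names are illustrative, not defaults:

```yaml
# alertmanager.yml (fragment): route drift alerts to the EventSource webhook.
route:
  routes:
    - matchers:
        - alertname = ModelDriftDetected   # illustrative alert name
      receiver: argo-events-drift
receivers:
  - name: argo-events-drift
    webhook_configs:
      # Service name/namespace are assumptions; use whatever exposes
      # the EventSource's port 12000 in your cluster.
      - url: http://drift-alert-eventsource-svc.argo-events.svc:12000/drift
        send_resolved: false
```

`send_resolved: false` keeps Alertmanager from POSTing again when the alert clears, which would otherwise trigger a second, pointless retrain.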
The EventSource
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: drift-alert
spec:
  webhook:
    drift-detected:
      port: "12000"
      endpoint: /drift
      method: POST
The Sensor and Trigger
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: model-retrain
spec:
  dependencies:
    - name: drift-event
      eventSourceName: drift-alert
      eventName: drift-detected
  triggers:
    - template:
        name: retrain-workflow
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: retrain-
              spec:
                workflowTemplateRef:
                  name: model-retrain-pipeline
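The trigger above submits the same Workflow regardless of which model drifted. Argo Events triggers also support parameter extraction from the event payload, which lets the Sensor forward details into the Workflow. A sketch of the `triggers` section with that wired up — the `dataKey` path assumes Alertmanager's webhook payload shape and a `model` label on the alert, so adjust it to your labels:

```yaml
# Sketch: pass the drifting model's name from the alert payload into the
# Workflow as an argument.
triggers:
  - template:
      name: retrain-workflow
      argoWorkflow:
        operation: submit
        source:
          resource:
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            metadata:
              generateName: retrain-
            spec:
              arguments:
                parameters:
                  - name: model-name
                    value: placeholder   # overwritten by the parameter below
              workflowTemplateRef:
                name: model-retrain-pipeline
        parameters:
          - src:
              dependencyName: drift-event
              dataKey: body.alerts.0.labels.model
            dest: spec.arguments.parameters.0.value
```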
Quality Gates Still Apply
Self-healing doesn’t mean unconditional retraining. The retrain workflow includes the same quality gates as the CI pipeline. If the new model scores worse than the current production model on the evaluation set, it fails the gate and the current model stays deployed. A human gets notified either way.
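Inside the `model-retrain-pipeline` WorkflowTemplate, that gate is just a conditional step. A minimal sketch — the step and template names are illustrative, and it assumes the evaluation step emits an output parameter named `passed`:

```yaml
# Sketch of the gate inside the WorkflowTemplate. `promote` only runs when
# the evaluation step reports the candidate beat the production model.
templates:
  - name: main
    steps:
      - - name: train
          template: train-model
      - - name: evaluate
          template: evaluate-model   # emits output parameter `passed`
      - - name: promote
          template: promote-model
          when: "{{steps.evaluate.outputs.parameters.passed}} == true"
      - - name: notify
          template: notify-team      # runs whether promote ran or was skipped
```

When the gate fails, `promote` is skipped rather than failed, so the workflow still completes and `notify` reports the outcome.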
Automation that can make things worse without human review is not self-healing — it’s chaos engineering on yourself.