The Familiar Part
ML model serving endpoints are, first and foremost, services: they respond to HTTP requests and have latency, error rates, and throughput. Monitor them with Prometheus + Grafana exactly as you would any other microservice.
```yaml
# Standard service SLOs
- alert: ModelHighLatency
  expr: histogram_quantile(0.99, rate(inference_duration_seconds_bucket[5m])) > 0.5
  for: 5m
- alert: ModelErrorRate
  expr: rate(inference_errors_total[5m]) / rate(inference_requests_total[5m]) > 0.01
  for: 5m
```
These fire when 99th-percentile latency stays above 500 ms, or the error rate stays above 1%, for five minutes. Standard SRE stuff.
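If `histogram_quantile` feels opaque, here is a simplified mental model of what that PromQL function computes from the cumulative `le` buckets of a histogram metric. This is an illustrative sketch, not Prometheus's actual implementation (which has extra edge-case handling); the bucket counts are made up:

```python
import math

def histogram_quantile(q, buckets):
    """Simplified PromQL histogram_quantile().

    buckets: ascending list of (upper_bound, cumulative_count) pairs,
    ending with (math.inf, total_count), like Prometheus "le" buckets.
    Linearly interpolates inside the bucket containing the q-th observation.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: fall back to the
                # upper bound of the last finite bucket.
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 100 ms, 90 under 500 ms, 99 under 1 s, 1 slower.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.99, buckets))  # 1.0 -> p99 is ~1 s, alert fires
```

The interpolation is why Prometheus p99 values are estimates: accuracy depends on how finely you define your histogram buckets around the SLO threshold.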
The New Part: Data and Model Drift
Here’s where ML monitoring diverges from service monitoring. Your model might have perfect latency and zero errors, and still be delivering wrong predictions — because the data it’s seeing in production has drifted from the data it was trained on.
Input drift: The statistical distribution of features has changed. Users are submitting different kinds of inputs than your training set represented.
Concept drift: The relationship between inputs and correct outputs has changed. The model’s internal mapping is now stale even if inputs look similar.
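Input drift on a single numeric feature can be sketched with a two-sample Kolmogorov–Smirnov (KS) statistic, one of the standard drift tests. The synthetic data and the implied threshold below are purely illustrative:

```python
import bisect
import random

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(reference), sorted(current)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

random.seed(42)
reference = [random.gauss(0.0, 1.0) for _ in range(2000)]  # training-time feature
no_drift = [random.gauss(0.0, 1.0) for _ in range(2000)]   # production, same distribution
drifted = [random.gauss(1.5, 1.0) for _ in range(2000)]    # production, shifted mean

print(ks_statistic(reference, no_drift))  # small: distributions match
print(ks_statistic(reference, drifted))   # large: input drift
```

Note what this does and doesn't catch: a shifted feature distribution (input drift) moves the statistic, but concept drift can leave every input distribution untouched, which is why it ultimately requires ground-truth labels to detect.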
Evidently AI generates drift reports by comparing a reference dataset (your training distribution) to production data collected over a window:
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Compare the training-time reference data against a production window
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html("drift_report.html")
```
Connecting Drift to Action
The real value is connecting drift detection to your alerting pipeline. When Evidently detects significant drift (using statistical tests such as Kolmogorov–Smirnov or the Population Stability Index), it emits a metric that Prometheus scrapes, which fires an alert, which triggers an Argo Events workflow that kicks off retraining.
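A minimal sketch of the hand-off: whatever job computes drift exposes a gauge in Prometheus's text exposition format, and an alert rule thresholds it. The metric names (`model_drift_share`, `model_dataset_drift`) are hypothetical, not Evidently output, and a real deployment would use `prometheus_client` behind a `/metrics` endpoint (or the Pushgateway for batch jobs) rather than hand-rolling the format:

```python
def render_drift_metrics(share_drifted: float, dataset_drift: bool) -> str:
    """Render drift results in the Prometheus text exposition format.

    This sketch only shows the payload that crosses the monitoring
    boundary; metric names here are illustrative.
    """
    lines = [
        "# HELP model_drift_share Share of features flagged as drifted",
        "# TYPE model_drift_share gauge",
        f"model_drift_share {share_drifted}",
        "# HELP model_dataset_drift 1 if dataset-level drift was detected",
        "# TYPE model_dataset_drift gauge",
        f"model_dataset_drift {int(dataset_drift)}",
    ]
    return "\n".join(lines) + "\n"

print(render_drift_metrics(0.4, True))
```

From there, an alert rule thresholding the gauge (e.g. `expr: model_drift_share > 0.3`, with the threshold tuned to your tolerance for retraining churn) is all Prometheus needs to hand the event to Argo Events.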
That closed loop — detect → alert → retrain → redeploy — is what Phase 9 is about.