The Familiar Part
ML model serving endpoints are, first and foremost, services: they respond to HTTP requests and have latency, error rates, and throughput. Monitor them with Prometheus + Grafana exactly as you would any other microservice.
```yaml
# Standard service SLOs
- alert: ModelHighLatency
  expr: histogram_quantile(0.99, rate(inference_duration_seconds_bucket[5m])) > 0.5
  for: 5m
- alert: ModelErrorRate
  expr: rate(inference_errors_total[5m]) / rate(inference_requests_total[5m]) > 0.01
  for: 5m
```
These fire when 99th-percentile latency stays above 500 ms, or the error rate stays above 1%, for five minutes. Standard SRE stuff.
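If `histogram_quantile` feels opaque, here is a simplified mental model of what that PromQL function computes from the cumulative `le` buckets of a histogram metric. This is an illustrative sketch, not Prometheus's actual implementation (which has extra edge-case handling); the bucket counts are made up:

```python
import math

def histogram_quantile(q, buckets):
    """Simplified PromQL histogram_quantile().

    buckets: ascending list of (upper_bound, cumulative_count) pairs,
    ending with (math.inf, total_count), like Prometheus "le" buckets.
    Linearly interpolates inside the bucket containing the q-th observation.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: fall back to the
                # upper bound of the last finite bucket.
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 100 ms, 90 under 500 ms, 99 under 1 s, 1 slower.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.99, buckets))  # 1.0 -> p99 is ~1 s, alert fires
```

The interpolation is why Prometheus p99 values are estimates: accuracy depends on how finely you define your histogram buckets around the SLO threshold.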
The New Part: Data and Model Drift
Here’s where ML monitoring diverges from service monitoring. Your model might have perfect latency and zero errors, and still be delivering wrong predictions — because the data it’s seeing in production has drifted from the data it was trained on.
Input drift: The statistical distribution of features has changed. Users are submitting different kinds of inputs than your training set represented.
Concept drift: The relationship between inputs and correct outputs has changed. The model’s internal mapping is now stale even if inputs look similar.
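Input drift on a single numeric feature can be sketched with a two-sample Kolmogorov–Smirnov (KS) statistic, one of the standard drift tests. The synthetic data and the implied threshold below are purely illustrative:

```python
import bisect
import random

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(reference), sorted(current)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

random.seed(42)
reference = [random.gauss(0.0, 1.0) for _ in range(2000)]  # training-time feature
no_drift = [random.gauss(0.0, 1.0) for _ in range(2000)]   # production, same distribution
drifted = [random.gauss(1.5, 1.0) for _ in range(2000)]    # production, shifted mean

print(ks_statistic(reference, no_drift))  # small: distributions match
print(ks_statistic(reference, drifted))   # large: input drift
```

Note what this does and doesn't catch: a shifted feature distribution (input drift) moves the statistic, but concept drift can leave every input distribution untouched, which is why it ultimately requires ground-truth labels to detect.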
Evidently AI generates drift reports by comparing a reference dataset (your training distribution) to production data collected over a window:
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Compare the training-time reference data against a production window
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html("drift_report.html")
```
Connecting Drift to Action
The real value is connecting drift detection to your alerting pipeline. When Evidently detects significant drift (using statistical tests such as Kolmogorov–Smirnov or the Population Stability Index), it emits a metric that Prometheus scrapes, which fires an alert, which triggers an Argo Events workflow that kicks off retraining.
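A minimal sketch of the hand-off: whatever job computes drift exposes a gauge in Prometheus's text exposition format, and an alert rule thresholds it. The metric names (`model_drift_share`, `model_dataset_drift`) are hypothetical, not Evidently output, and a real deployment would use `prometheus_client` behind a `/metrics` endpoint (or the Pushgateway for batch jobs) rather than hand-rolling the format:

```python
def render_drift_metrics(share_drifted: float, dataset_drift: bool) -> str:
    """Render drift results in the Prometheus text exposition format.

    This sketch only shows the payload that crosses the monitoring
    boundary; metric names here are illustrative.
    """
    lines = [
        "# HELP model_drift_share Share of features flagged as drifted",
        "# TYPE model_drift_share gauge",
        f"model_drift_share {share_drifted}",
        "# HELP model_dataset_drift 1 if dataset-level drift was detected",
        "# TYPE model_dataset_drift gauge",
        f"model_dataset_drift {int(dataset_drift)}",
    ]
    return "\n".join(lines) + "\n"

print(render_drift_metrics(0.4, True))
```

From there, an alert rule thresholding the gauge (e.g. `expr: model_drift_share > 0.3`, with the threshold tuned to your tolerance for retraining churn) is all Prometheus needs to hand the event to Argo Events.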
That closed loop — detect → alert → retrain → redeploy — is what Phase 9 is about.