Why Chaos Engineering for ML?

SREs use chaos engineering to validate that systems behave correctly under failure conditions. You inject failures in a controlled way to find gaps in your reliability story before production finds them for you.

ML systems have failure modes that traditional software systems don't. The most dangerous are the silent ones, where every conventional health signal stays green while the model's outputs go wrong. The incident below is the clearest example I hit.

The Strangest Failure

In Phase 11, I used Chaos Mesh to inject network latency between the inference service and the feature store. The result was not what I expected.

The inference service didn’t error. It didn’t time out (the timeout was 2 seconds, and latency injection was 800ms). It slowed down — and started serving predictions that were silently wrong.

Why? The feature computation layer had its own, shorter timeout on feature store calls. Under the injected 800ms latency, those calls timed out, and the inference service silently fell back to default feature values (zeros, in this case). The model was running on garbage input and returning confident-looking predictions.
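The bug, reduced to a runnable sketch. The store class, feature names, and timeout value are hypothetical stand-ins, not the actual service code:

```python
FEATURE_NAMES = ["clicks_7d", "spend_30d", "account_age_days"]  # hypothetical features
FEATURE_TIMEOUT_S = 0.5  # the feature layer's own timeout, well under the 2s service timeout


class SlowStore:
    """Stands in for the feature store under injected network latency."""

    def get(self, entity_id, timeout):
        raise TimeoutError("feature store call exceeded timeout")


def fetch_features(entity_id, store):
    """Anti-pattern: swallow the timeout and fall back to zeros silently."""
    try:
        return store.get(entity_id, timeout=FEATURE_TIMEOUT_S)
    except TimeoutError:
        # The model now runs on garbage input, and nothing in the
        # logs or metrics records that anything went wrong.
        return {name: 0.0 for name in FEATURE_NAMES}
```

The caller gets a well-formed feature dict either way, which is exactly why nothing downstream noticed.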

My latency alert never caught this. P99 latency stayed within SLO because most requests were fast; even the requests that hit the slow feature path completed within the timeout window, so nothing tripped the threshold.
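Some illustrative numbers show why the percentile alert stays quiet. The SLO, traffic mix, and latency values here are made up for the demonstration:

```python
import random

random.seed(0)

SLO_MS = 1000  # hypothetical p99 latency SLO

# 90% of requests never touch the slow path; the other 10% wait out a
# ~500ms feature-layer timeout and then take the fast silent fallback.
latencies = sorted(
    random.gauss(60, 10) + (500 if random.random() < 0.1 else 0)
    for _ in range(10_000)
)
p99 = latencies[int(0.99 * len(latencies)) - 1]
# p99 lands around ~570ms: comfortably inside a 1s SLO, even though
# roughly 10% of responses were built from all-zero fallback features.
```

The alert is measuring the wrong thing: it sees response times, not response correctness.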

What I Fixed

  1. Fail loudly: If any feature computation fails or times out, return an error rather than silently using default values.
  2. Add input validation: Log and alert when input features are outside expected ranges (all-zeros is a red flag for a real request).
  3. Separate SLOs: Track feature store availability as its own SLO, not just downstream inference latency.
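Fixes 1 and 2 together, as a sketch. The error type, range table, and function names are my own illustrations, not code from the project:

```python
import logging

logger = logging.getLogger("inference")

# Hypothetical expected ranges per feature; in practice, derive these
# from training data statistics.
FEATURE_RANGES = {
    "clicks_7d": (0.0, 10_000.0),
    "spend_30d": (0.0, 1_000_000.0),
}


class FeatureUnavailableError(RuntimeError):
    """Fix 1: raised instead of silently substituting default values."""


def fetch_features_strict(entity_id, store, timeout_s=0.5):
    try:
        feats = store.get(entity_id, timeout=timeout_s)
    except TimeoutError as exc:
        # Fix 1: fail loudly so callers (and alerts) see the outage.
        raise FeatureUnavailableError(f"feature fetch failed for {entity_id}") from exc
    for problem in validate_features(feats):
        # Fix 2: log (and, in practice, alert) on suspicious inputs.
        logger.warning("input validation: %s", problem)
    return feats


def validate_features(feats):
    """Return a list of violations; an all-zero vector is a red flag."""
    problems = []
    if feats and all(v == 0.0 for v in feats.values()):
        problems.append("all-zero feature vector: likely a fallback, not a real request")
    for name, value in feats.items():
        lo, hi = FEATURE_RANGES.get(name, (float("-inf"), float("inf")))
        if not lo <= value <= hi:
            problems.append(f"{name}={value} outside expected range [{lo}, {hi}]")
    return problems
```

Whether an out-of-range input should block the prediction or only page someone is a policy choice; the sketch logs and continues, while a hard timeout raises.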

The Meta-Lesson

This failure mode — a system that appears healthy by traditional metrics while producing wrong outputs — is specific to ML. Traditional service monitoring won’t catch it. You need model-aware monitoring, and you need to chaos-test the failure modes that are unique to your ML architecture.
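One concrete shape model-aware monitoring can take, sketched with hypothetical names and untuned thresholds: watch the prediction distribution itself, which would have flagged the confident garbage even while latency looked fine.

```python
from collections import deque


class PredictionMonitor:
    """Minimal model-aware check: fire when the rolling mean of
    prediction scores drifts away from a training-time baseline.
    Window size and tolerance here are illustrative, not tuned."""

    def __init__(self, baseline_mean, tolerance=0.15, window=1000):
        self.baseline = baseline_mean
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def observe(self, score):
        """Record one prediction; return True when an alert should fire."""
        self.scores.append(score)
        rolling_mean = sum(self.scores) / len(self.scores)
        return abs(rolling_mean - self.baseline) > self.tolerance
```

A real deployment would track more than the mean (score histograms, per-segment rates), but even this much looks at what the model says rather than how fast it says it.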

Phase 11 validated that the monitoring from Phase 8 actually worked, and revealed a gap that I hadn’t anticipated. That’s the value of chaos engineering: it finds the surprises before users do.