Models Are Just Services
The ML community often treats model deployment as a special case. SREs know better: a model serving endpoint is a microservice that takes a request and returns a response. The same deployment patterns apply.
Blue/green: Run two identical environments (blue = current, green = new). Switch traffic 100% to green when you’re confident. Rollback is instant — flip back to blue.
Canary: Gradually shift traffic from old to new. 5% → 25% → 50% → 100%, with automated rollback if error rate or latency degrades.
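The staged rollout with automated rollback can be sketched as a simple control loop. This is illustrative only: `shift_traffic` and `error_rate` are hypothetical hooks standing in for whatever your infrastructure provides (an ingress API, a metrics query), not real library calls.

```python
# Sketch of an automated canary rollout. `shift_traffic(pct)` and
# `error_rate()` are hypothetical hooks supplied by your infrastructure.

CANARY_STAGES = [5, 25, 50, 100]   # percent of traffic on the new version
ERROR_BUDGET = 0.01                # abort if canary error rate exceeds 1%

def run_canary(shift_traffic, error_rate) -> bool:
    """Walk through the stages; roll back on the first bad reading."""
    for pct in CANARY_STAGES:
        shift_traffic(pct)
        if error_rate() > ERROR_BUDGET:
            shift_traffic(0)       # automated rollback to the old version
            return False
    return True                    # canary promoted to 100%
```

A production version would also gate on latency percentiles and wait for a soak period at each stage before advancing.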
Why Models Need Canary More Than Most Services
For software services, a canary catches bugs. For ML models, it also catches performance regressions on the production data distribution that didn't appear in offline evaluation. Real users submit inputs that your test set didn't cover.
A model that achieves 94% accuracy on your holdout set might perform at 87% on production traffic by week three as the input distribution drifts. A canary with real-time monitoring lets you catch this before 100% of users see the degraded model.
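One way to detect this kind of degradation, assuming labeled feedback eventually arrives for some fraction of predictions, is to track rolling production accuracy against the offline holdout number. A minimal sketch (window size and tolerance are illustrative, not recommendations):

```python
from collections import deque

HOLDOUT_ACCURACY = 0.94   # measured offline on the holdout set
TOLERANCE = 0.05          # alert if production drops more than 5 points

class AccuracyMonitor:
    """Rolling-window accuracy over labeled production outcomes."""

    def __init__(self, window: int = 1000):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, correct: bool) -> None:
        self.outcomes.append(1 if correct else 0)

    def degraded(self) -> bool:
        if not self.outcomes:
            return False
        acc = sum(self.outcomes) / len(self.outcomes)
        return acc < HOLDOUT_ACCURACY - TOLERANCE
```

Wiring `degraded()` into the canary's rollback trigger closes the loop: the new model is promoted only if production accuracy stays near its offline number.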
BentoML for Serving
BentoML packages a model + inference code into a Docker image with a standardized HTTP API. Deploy it as a Kubernetes Deployment behind a Service, and you get a standard microservice.
```python
import bentoml

@bentoml.service(
    resources={"cpu": "2", "memory": "1Gi"},  # per-replica resource request
    traffic={"timeout": 10},                  # request timeout in seconds
)
class ClassifierService:
    # Pull the latest model version from the BentoML model store
    model = bentoml.models.get("classifier:latest")

    @bentoml.api
    def predict(self, input: dict) -> dict:
        return {"prediction": self.model.run(input)}
```
Canary with Kubernetes
With Kubernetes Services and two Deployments (v1 and v2), you can implement canary by controlling replica counts:
- 9 replicas of v1 + 1 replica of v2 = 10% canary
- The Service load-balances roughly evenly across ready pods via kube-proxy, so the replica ratio approximates the traffic ratio
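The replica arithmetic generalizes. A small helper (illustrative only, not a Kubernetes API) computes the split for a target canary percentage:

```python
def replica_split(total: int, canary_pct: int) -> tuple[int, int]:
    """Return (stable, canary) replica counts that approximate
    canary_pct of traffic, given kube-proxy's roughly even
    load-balancing across ready pods."""
    if canary_pct <= 0:
        return total, 0
    # At least one canary replica, otherwise no traffic reaches v2
    canary = max(1, round(total * canary_pct / 100))
    return total - canary, canary
```

Note the granularity limit: with 10 total replicas you can only express splits in ~10% steps, which is one reason to move to ingress-level traffic splitting for finer control.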
For more sophisticated traffic splitting (header-based, sticky sessions, percentage-based at ingress), use a service mesh or Argo Rollouts.