Series

MLOps Journey

Building a production ML platform on Kubernetes from scratch — infrastructure to chaos engineering.

11 episodes

Bootstrapping a Kubernetes Cluster for ML

Why k3s over full K8s for a constrained homelab — and what you give up

GitOps Is Not Just Git + Ops

What FluxCD taught me about declarative infrastructure and why it matters for ML

Designing Storage for ML Pipelines

Operational vs artifact storage — fast/ephemeral vs durable/versioned, and why it matters

Tracking ML Experiments Like Production Systems

MLflow as a release manager — the analogy every SRE immediately gets

Running ML Pipelines on Kubernetes with Argo Workflows

Comparing ML pipeline orchestration to job schedulers your SRE team already knows

Deploying ML Models Like Microservices

Blue/green and canary for models — the SRE playbook applied to inference

CI/CD for Machine Learning Systems

Quality gates for models differ from software release gates — here's how and why

GitHub Actions · Supply Chain

Monitoring ML Models in Production

What's the same as traditional service monitoring, and what's entirely new

Building Self-Healing ML Systems

An SRE runbook that executes itself — event-driven retraining explained

Argo Events · Automation

Securing ML Systems on Kubernetes

Walking through CIS benchmark findings and how each relates to ML-specific risk

What Happens When ML Systems Fail?

The strangest failure I induced with Chaos Mesh — and what it revealed