Series

MLOps Journey

Building a production ML platform on Kubernetes from scratch — infrastructure to chaos engineering.

11 episodes
E01
Bootstrapping a Kubernetes Cluster for ML
Why k3s over full K8s for a constrained homelab — and what you give up
Mar 23, 2026
K8s, GitOps
E02
GitOps Is Not Just Git + Ops
What FluxCD taught me about declarative infrastructure and why it matters for ML
Mar 23, 2025
FluxCD, GitOps
E03
Designing Storage for ML Pipelines
Operational vs artifact storage — fast/ephemeral vs durable/versioned, and why it matters
Mar 23, 2025
MinIO, Secrets
E04
Tracking ML Experiments Like Production Systems
MLflow as a release manager: the analogy every SRE immediately gets
Mar 23, 2025
MLflow, DVC
E05
Running ML Pipelines on Kubernetes with Argo Workflows
Comparing ML pipeline orchestration to job schedulers your SRE team already knows
Mar 23, 2025
Argo, DAGs
E06
Deploying ML Models Like Microservices
Blue/green and canary for models — the SRE playbook applied to inference
Mar 23, 2025
BentoML, Canary
E07
CI/CD for Machine Learning Systems
Quality gates for models differ from software release gates — here's how and why
Mar 23, 2025
GitHub Actions, Supply Chain
E08
Monitoring ML Models in Production
What's the same as traditional service monitoring, and what's entirely new
Mar 23, 2025
Evidently, Drift
E09
Building Self-Healing ML Systems
An SRE runbook that executes itself — event-driven retraining explained
Mar 23, 2025
Argo Events, Automation
E10
Securing ML Systems on Kubernetes
Walking through CIS benchmark findings and how each relates to ML-specific risk
Mar 23, 2025
OPA, Vault
E11
What Happens When ML Systems Fail?
The strangest failure I induced with Chaos Mesh — and what it revealed
Mar 23, 2025
Chaos Mesh, SLOs