E01
Bootstrapping a Kubernetes Cluster for ML
Why k3s over full K8s for a constrained homelab — and what you give up
E02
GitOps Is Not Just Git + Ops
What FluxCD taught me about declarative infrastructure and why it matters for ML
Mar 23, 2025
FluxCD, GitOps
E03
Designing Storage for ML Pipelines
Operational vs artifact storage — fast/ephemeral vs durable/versioned, and why it matters
Mar 23, 2025
MinIO, Secrets
E04
Tracking ML Experiments Like Production Systems
MLflow as a release manager: the analogy every SRE immediately gets
E05
Running ML Pipelines on Kubernetes with Argo Workflows
Comparing ML pipeline orchestration to job schedulers your SRE team already knows
E06
Deploying ML Models Like Microservices
Blue/green and canary for models — the SRE playbook applied to inference
Mar 23, 2025
BentoML, Canary
E07
CI/CD for Machine Learning Systems
Quality gates for models differ from software release gates — here's how and why
Mar 23, 2025
GitHub Actions, Supply Chain
E08
Monitoring ML Models in Production
What's the same as traditional service monitoring, and what's entirely new
Mar 23, 2025
Evidently, Drift
E09
Building Self-Healing ML Systems
An SRE runbook that executes itself — event-driven retraining explained
Mar 23, 2025
Argo Events, Automation
E10
Securing ML Systems on Kubernetes
Walking through CIS benchmark findings and how each relates to ML-specific risk
E11
What Happens When ML Systems Fail?
The strangest failure I induced with Chaos Mesh — and what it revealed
Mar 23, 2025
Chaos Mesh, SLOs