Blog

Things that work in production

Practical writing on ML infrastructure, SRE, and Kubernetes. Opinions earned from clusters that failed, models that drifted, and runbooks that saved the day.

16 posts
1 series
series
MLOps Journey
Building a production ML platform on Kubernetes from scratch — infrastructure to chaos engineering.
11 episodes · Mar 2025 – Mar 2026
post
SLOs as a Conversation Tool, Not a Metric
The most valuable thing about Service Level Objectives isn't the number — it's what defining one forces you to discuss
Mar 10, 2026
SRE · Observability · Culture
post
Python Dependency Hell in ML Projects
Why your ML environment works on your laptop and breaks in production — and how to fix it for good
Feb 24, 2026
MLOps · Python · Containers
post
Writing Runbooks That Actually Help
Most runbooks are useless at 3 a.m. Here's how to write ones that aren't
Feb 10, 2026
SRE · Culture · Observability
post
Requests, Limits, and the Lies We Tell the Scheduler
Why misconfigured resource requests are the root cause of half your mysterious cluster problems
Jan 28, 2026
K8s · Performance · SRE
post
The On-Call Tax
What pager fatigue actually costs you — and how to measure it
Jan 15, 2026
SRE · Observability · Culture