Blog

Things that work in production

Practical writing on ML infrastructure, SRE, and Kubernetes. Opinions earned from clusters that failed, models that drifted, and runbooks that saved the day.

16 posts

1 series

Building a production ML platform on Kubernetes from scratch — infrastructure to chaos engineering.

11 episodes · Mar 2025 – Mar 2026

SLOs as a Conversation Tool, Not a Metric

The most valuable thing about Service Level Objectives isn't the number — it's what defining one forces you to discuss

SRE · Observability · Culture

Python Dependency Hell in ML Projects

Why your ML environment works on your laptop and breaks in production — and how to fix it for good

MLOps · Python · Containers

Writing Runbooks That Actually Help

Most runbooks are useless at 3 a.m. Here's how to write ones that aren't

SRE · Culture · Observability

Requests, Limits, and the Lies We Tell the Scheduler

Why misconfigured resource requests are the root cause of half your mysterious cluster problems

K8s · Performance · SRE

The On-Call Tax

What pager fatigue actually costs you — and how to measure it

SRE · Observability · Culture