SRE → MLOps Engineer

Thomas Nyambati

Building production-grade ML systems on Kubernetes. I care about reliability, observability, and security — not just model accuracy.


About

Senior Platform & SRE Engineer with 9+ years of experience designing large-scale cloud platforms, Kubernetes infrastructure, and high-volume observability stacks.

At Delivery Hero I architected a cloud bootstrapping platform that automated AWS account provisioning across the org, and maintained an observability stack handling 3M+ active series, with SLO monitoring that brought incident response down to an average of 3 minutes.

I’m drawn to the intersection of reliability and machine learning: where production discipline meets model chaos. I write about the trade-offs, the failures, and the tooling that actually holds up under load.

Current stack
  • Kubernetes / EKS / GKE
  • ArgoCD / GitOps
  • Prometheus + Mimir
  • Grafana / Loki / Tempo
  • Terraform / Terragrunt
  • Go / Python
  • Karpenter / HPA / VPA
  • Helm / Helmfile
  • GitHub Actions
  • AWS / GCP

Blog
Bootstrapping a Kubernetes Cluster for ML
Why k3s over full K8s for a constrained homelab — and what you give up
Mar 23, 2026
K8s, GitOps
SLOs as a Conversation Tool, Not a Metric
The most valuable thing about Service Level Objectives isn't the number — it's what defining one forces you to discuss
Mar 10, 2026
SRE, Observability, Culture
Python Dependency Hell in ML Projects
Why your ML environment works on your laptop and breaks in production — and how to fix it for good
Feb 24, 2026
MLOps, Python, Containers
Writing Runbooks That Actually Help
Most runbooks are useless at 3 a.m. Here's how to write ones that aren't
Feb 10, 2026
SRE, Culture, Observability
Requests, Limits, and the Lies We Tell the Scheduler
Why misconfigured resource requests are the root cause of half your mysterious cluster problems
Jan 28, 2026
K8s, Performance, SRE
View all 16 posts →

Projects
🏗️ In progress
MLOps Homelab Platform
Production-grade MLOps stack on Kubernetes — 16 GB homelab, 11 phases, real security constraints. GitOps-first, observable by default.
k3s, FluxCD, MLflow, Argo
🔄 In progress
Self-Healing ML Pipelines
Event-driven retraining loop — drift alert triggers Argo Events, quality gates control promotion. An SRE runbook that runs itself.
Argo Events, Evidently, Prometheus
🔒 In progress
Zero-Trust ML Infrastructure
OPA/Gatekeeper policies, Vault-backed secrets, default-deny NetworkPolicies, and a CIS Kubernetes Benchmark self-assessment log.
OPA, Vault, kube-bench
📡 In progress
ML Observability Stack
Custom Prometheus metrics from inference services, Grafana dashboards for model drift, and scheduled Evidently reports.
Evidently, Grafana, Fluent Bit
View all 4 projects →

Contact

Let's talk.

Open to conversations about MLOps, platform engineering, SRE, or just building stuff on Kubernetes. Find me on GitHub or send me an email.