Introduction
Every engineer who has carried a pager knows the feeling: 3 a.m., the phone screams, you stumble to your laptop half-asleep and stare at a dashboard that tells you something is on fire but not why. You fix it, you write a perfunctory incident report, and you go back to bed — if sleep comes at all.
This is the on-call tax. It is rarely measured, rarely discussed at planning time, and almost never factored into velocity estimates. But it is real, and it compounds.
What the tax looks like
The direct cost is obvious: interrupted sleep, broken focus, time spent on reactive work instead of planned work. A single 3 a.m. page costs you more than the 20 minutes it takes to resolve — it costs you the next morning too, the context-switch recovery, the low-grade anxiety of carrying the device again tomorrow night.
The indirect cost is subtler. Engineers on heavy rotation stop proposing ambitious work. They stop refactoring the thing that causes half the alerts, because that requires the kind of deep concentration that a pager makes impossible. They start designing conservatively, adding complexity to avoid risk — and complexity is itself a source of future incidents.
Measuring it honestly
Start here: pull your alerting history for the last 90 days and bucket every page by hour. Count how many fired outside business hours. Now count unique engineers who were woken up. That number, divided by your team size, is your interruption rate: the fraction of the team that lost sleep this quarter.
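The bucketing above is a few lines of scripting once you have an export. A minimal sketch, assuming you have dumped pages as (timestamp, engineer) pairs from your paging tool; the sample data, team size, and business-hours window are all placeholders to adjust:

```python
from datetime import datetime

# Placeholder export: (ISO timestamp, engineer paged). In practice,
# pull this from your paging tool's API or CSV export.
pages = [
    ("2024-05-01T03:12:00", "alice"),
    ("2024-05-01T14:30:00", "bob"),
    ("2024-05-02T02:45:00", "alice"),
    ("2024-05-03T23:10:00", "carol"),
]

TEAM_SIZE = 6                   # adjust to your rotation
BUSINESS_HOURS = range(9, 18)   # 09:00-17:59, adjust to your team

by_hour: dict[int, int] = {}
after_hours = []
for ts, engineer in pages:
    hour = datetime.fromisoformat(ts).hour
    by_hour[hour] = by_hour.get(hour, 0) + 1
    if hour not in BUSINESS_HOURS:
        after_hours.append(engineer)

woken = set(after_hours)
interruption_rate = len(woken) / TEAM_SIZE
print(f"{len(after_hours)} after-hours pages hit {len(woken)} engineers")
print(f"interruption rate: {interruption_rate:.0%}")
```

The hour-of-day histogram in `by_hour` is worth keeping too: a spike at 02:00–04:00 usually points at a batch job or a timezone-blind threshold.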
Then look at what each alert resolved to:
- Actionable — alert fired, engineer made a change, service recovered
- Transient — alert fired, engineer watched, it self-resolved
- Noise — alert fired, nothing was wrong
In most teams I have worked with, transient and noise categories account for 40–60% of after-hours pages. That means roughly half the cost of your on-call rotation is pure waste.
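Tallying the three categories is the same kind of quick script. A sketch with made-up resolution data; in practice you would read the resolution field from your incident records:

```python
from collections import Counter

# Hypothetical resolutions from 90 days of after-hours incidents,
# labelled with the three categories above.
resolutions = [
    "actionable", "actionable", "actionable", "actionable", "actionable",
    "transient", "transient",
    "noise", "noise", "noise",
]

counts = Counter(resolutions)
total = len(resolutions)
waste = (counts["transient"] + counts["noise"]) / total
print(f"actionable={counts['actionable']} "
      f"transient={counts['transient']} noise={counts['noise']}")
print(f"wasted after-hours pages: {waste:.0%}")
```

The single number `waste` is the headline figure for the conversation at the end of this piece.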
Reducing it
The best on-call rotations I have seen share a few properties.
Alerts are tied to user impact. If a pod restarts and the service is healthy, nobody gets paged. The question is always: is a user failing right now? Everything else is a dashboard metric, not an alert.
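The "is a user failing right now?" test can be made concrete in the paging condition itself. A minimal sketch, not a real alerting DSL; the thresholds and parameter names are assumptions, and the point is only that the decision reads user-facing symptoms and deliberately ignores internal events like pod restarts:

```python
def should_page(error_rate: float, p99_latency_ms: float,
                pod_restarts: int) -> bool:
    """Page only when users are failing right now.

    error_rate and p99_latency_ms are user-facing symptoms;
    pod_restarts is internal machinery and is deliberately ignored
    (it belongs on a dashboard, not in a pager).
    """
    del pod_restarts  # visible in the signature, unused on purpose
    return error_rate > 0.01 or p99_latency_ms > 2000


# Users failing: page.
print(should_page(error_rate=0.05, p99_latency_ms=300, pod_restarts=0))
# Pods flapping but users fine: dashboard metric, no page.
print(should_page(error_rate=0.001, p99_latency_ms=300, pod_restarts=4))
```

In a real system this logic lives in your alert rules rather than application code, but the shape of the condition is the same: symptoms in, internals out.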
Runbooks are written before the alert fires. Not after. If your incident response requires tribal knowledge, you are creating a knowledge bottleneck and burning out your most senior engineers.
The rotation is short and well-compensated. One week on, several weeks off. And engineers are actually compensated for it — not in vague promises of comp time, but in explicit, consistent policy.
Postmortems are blameless and actioned. The only point of a postmortem is to make the same incident impossible or cheaper. If your postmortems produce action items that nobody picks up, you are doing retrospective theatre.
The honest conversation to have
The on-call tax is a leadership conversation as much as a technical one. If your on-call rotation is burning out engineers, no amount of better tooling will fix it permanently — because the root cause is either insufficient staffing, insufficient automation, or insufficient prioritisation of reliability work.
The measurement above gives you the data to have that conversation. Bring the numbers to your manager. Make the invisible cost visible. That is the first step toward paying it down.