Introduction

Most teams that adopt SLOs focus on the wrong thing: the percentage. They debate whether 99.9% or 99.95% is the right target, argue about measurement windows, and implement alerting on error budget burn rate. Then they check the number weekly and call it done.

The percentage matters. But the conversation that produces it matters more.

What an SLO definition actually requires you to agree on

To write an SLO, you must answer questions that most teams have never explicitly discussed:

What does “available” mean for this service?

For a synchronous API, this is usually “responding within X ms with a 2xx status.” But what counts as a valid request? Should health check endpoints be included? What about requests from internal services vs. external users? What about requests that fail due to caller error (4xx)?

Each of these decisions changes your measured availability by potentially several percentage points. Teams that have never discussed them are measuring different things even when they’re looking at the same dashboard.
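To make those decisions concrete, here is a minimal sketch of an SLI classifier that bakes each one in as an explicit policy choice. The endpoint name, latency budget, and `Request` fields are hypothetical; the point is that every exclusion is a line of code someone chose to write, not a property of the service.

```python
from dataclasses import dataclass

@dataclass
class Request:
    path: str
    status: int        # HTTP status code
    latency_ms: float
    internal: bool     # True if the caller is another internal service

def counts_toward_sli(req: Request, latency_budget_ms: float = 300) -> tuple[bool, bool]:
    """Return (is_valid_event, is_good_event) for the availability SLI.

    Each rule below is a team decision, not a given:
    - health check traffic is excluded entirely
    - internal callers are excluded
    - 4xx caller errors are not valid events under this policy
    - "good" means 2xx within the latency budget
    """
    if req.path == "/healthz" or req.internal:
        return (False, False)              # not a valid event at all
    if 400 <= req.status < 500:
        return (False, False)              # caller error: excluded by this policy
    good = 200 <= req.status < 300 and req.latency_ms <= latency_budget_ms
    return (True, good)

def availability(requests: list[Request]) -> float:
    """Good events over valid events -- the measured SLI."""
    classified = [counts_toward_sli(r) for r in requests]
    valid = sum(1 for v, _ in classified if v)
    good = sum(1 for v, g in classified if v and g)
    return good / valid if valid else 1.0
```

Flipping any one of those rules (say, counting 4xx as valid-but-bad) changes the measured number, which is exactly why the rules need to be written down rather than assumed.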

Who is the customer, and what do they actually care about?

An availability SLO protects uptime. But if your service processes async jobs, uptime is less relevant than throughput and latency at the p95. If you’re serving a search index, a user cares that results are fresh — a staleness SLO might be more meaningful than an availability one.

The right SLO type follows directly from asking “what would a user notice if we got worse?” This question forces a product conversation, not an infrastructure one.

What is an acceptable failure rate?

“99.9% availability” sounds precise, but 0.1% of what? Per request? Per user session? Per distinct feature invocation? An error that affects 0.1% of API calls might affect 40% of users if those errors cluster. Error budget calculations can look very different depending on the denominator.
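The clustering effect is easy to demonstrate with synthetic numbers. In this sketch (the event log and user counts are invented for illustration), four failed calls out of 4,000 land on four different users out of ten:

```python
from collections import defaultdict

def error_rates(events: list[tuple[str, bool]]) -> tuple[float, float]:
    """events: (user_id, ok) per call. Returns (per-request, per-user) error rates."""
    failed_users: set[str] = set()
    users: set[str] = set()
    failures = 0
    for user, ok in events:
        users.add(user)
        if not ok:
            failures += 1
            failed_users.add(user)
    return failures / len(events), len(failed_users) / len(users)

# 4,000 calls from 10 users; 4 failures, each hitting a different user.
events = []
for u in range(10):
    for i in range(400):
        ok = not (u < 4 and i == 0)    # first call fails for users 0-3
        events.append((f"user{u}", ok))

per_request, per_user = error_rates(events)
# per_request is 0.1% of calls; per_user is 40% of users -- same incident, two very
# different headlines depending on which denominator the SLO was written against.
```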

How much tolerance do we have?

This is the budget conversation. A 0.1% error budget over 30 days is 43.2 minutes of total allowed downtime. Is that enough? The answer depends on your deployment frequency, your incident resolution speed, and your users’ actual pain tolerance. A team deploying 20 times a day needs a different error budget than a team deploying once a week.
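The budget arithmetic itself is one line, which makes it easy to put in front of the room during the conversation:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed full-outage minutes in the window for a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% over 30 days -> 43.2 minutes, the figure above.
# 99.95% over 30 days -> 21.6 minutes: half the room for the same deploy cadence.
```

Running it for a few candidate targets makes the deployment-frequency tradeoff tangible: each extra nine halves or worse the downtime a bad deploy is allowed to cost.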

The value of the disagreement

In my experience, the most useful thing that happens during an SLO definition exercise is discovering that people on the same team have incompatible mental models of what the service is supposed to do.

An engineer thinks the service should retry failed upstream calls transparently. The product manager thinks failures should surface immediately so users can make informed decisions. The SRE thinks the latency budget should accommodate retries. None of them are wrong — they’re answering different questions, and nobody realised it until the SLO definition forced the question.

That disagreement, surfaced and resolved before it becomes an incident, is worth more than any specific percentage.

How to run the conversation

Bring three groups: whoever runs the service (SRE or platform team), whoever builds it (dev team), and whoever defines success for it (product or business owner). Ask these questions in order:

  1. What is the user-visible action this service enables?
  2. What does “that action failed” look like from the user’s perspective?
  3. What rate of failure is noticeable? Annoying? Unacceptable?
  4. What is our current actual failure rate?

Question 4 is often the most clarifying. Teams that have never measured their actual error rate are usually surprised. Setting an SLO target that is stricter than your current actual performance means you’re already out of budget before you’ve written a line of alerting config.
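One way to make question 4 bite is to express the measured baseline as a fraction of the proposed budget. The traffic numbers below are hypothetical:

```python
def budget_consumed(target_slo: float, measured_good: int, measured_total: int) -> float:
    """Fraction of the error budget the current failure rate already spends.

    A value above 1.0 means the proposed target is stricter than measured
    reality: the team would be out of budget on day one.
    """
    allowed_bad = (1 - target_slo) * measured_total
    actual_bad = measured_total - measured_good
    return actual_bad / allowed_bad if allowed_bad else float("inf")

# 1,000,000 requests last month, 998,700 succeeded, proposing 99.9%:
# budget_consumed(0.999, 998_700, 1_000_000) comes out at 1.3 --
# 130% of the budget gone before any alerting config exists.
```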

Start with a target that is achievable given your current baseline, then tighten it as you improve. An SLO that nobody believes in — because it’s perpetually breached — is worse than no SLO.

What changes after you have one

Done well, an SLO becomes the primary lens through which the team makes tradeoffs. Should we take this maintenance window? Check the error budget. Should we prioritise this reliability fix over this feature? Check the burn rate. Should we invest in a faster deployment pipeline? The SLO tells you what headroom you have.

It also changes the on-call conversation. Instead of “the service is degraded,” the question becomes “are we burning error budget?” A service that is technically degraded but well within budget might not require an engineer’s weekend. A service that is nominally healthy but burning budget at 10× normal rate needs immediate attention even if no alert has fired.
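The "10× normal rate" framing corresponds to a burn rate: the observed error rate divided by the rate that would exactly exhaust the budget over the full window. A minimal sketch, with illustrative traffic numbers:

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate relative to the budget-exhausting rate.

    1.0 means the budget lasts exactly the window; 10.0 means it is
    gone in a tenth of the window.
    """
    observed = bad_events / total_events
    allowed = 1 - slo
    return observed / allowed

# 99.9% SLO; the last hour saw 50 failures out of 5,000 requests:
# burn_rate(50, 5000, 0.999) is 10.0 -- page someone, even though 99%
# of requests are still succeeding and no availability alert has fired.
```

This is the quantity multi-window burn-rate alerts are built on: a high rate over a short window catches fast outages, a lower rate over a long window catches slow leaks.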

This reframe — from “is something broken” to “are we meeting our commitments” — is the real payoff of doing SLOs properly.