SLOs and Error Budgets That Engineers Actually Use
June 5, 2024 · 7 min read · by Muhammad Amal

TL;DR — An SLO is only useful if it changes behavior. Pick three SLIs per service, set targets you’d actually defend, alert on burn rate not absolute values, and tie the budget to a clear policy. Skip any of those and you have dashboards, not a reliability program.

I’ve helped four orgs roll out SLOs in the last six years. The first three failed quietly. The fourth stuck, and the difference wasn’t the math or the tooling. It was that the team treated the error budget as a real artifact, the way they’d treat a sprint capacity number, not as a number that lives in a Grafana panel during the QBR.

This post is the playbook that came out of those failures. It assumes you’ve read the Google SRE workbook chapter on SLOs and want practical scaffolding for actually using them. If you haven’t read it yet, skim the workbook first; it’s free and worth a Saturday.

A quick reminder before we get into mechanics. Digital Immune Systems need SLOs as their feedback loop. Without them you can’t decide when to remediate, when to deploy, or when to stop. They’re not optional.

The three-SLI rule

Every service can have a long list of things you’d like to measure. Resist. Pick three SLIs maximum per service. The reason is that you’ll use SLOs during incidents, and an on-caller can hold three numbers in their head at 3am. They cannot hold seven.

For an HTTP service, the canonical three are:

  1. Availability — fraction of requests that returned a non-5xx response.
  2. Latency — fraction of requests that returned under some threshold (e.g. 300ms at p95).
  3. Correctness — fraction of requests that returned the right answer. This is the one most teams skip and most regret.

Correctness is service-specific. For a payments API it’s “did the right charge happen.” For a search API it’s “did relevant results return.” It’s hard to measure, and that’s the point. If you can’t define it, you don’t understand what your service is for.

Writing an SLI in PromQL

A useful SLI is a ratio of good events to total events, measured over a window. The window is usually 28 or 30 days. Here’s an availability SLI for a service with Prometheus metrics:

sum(rate(http_requests_total{service="checkout", code!~"5.."}[28d]))
/
sum(rate(http_requests_total{service="checkout"}[28d]))

Two things to notice. First, we exclude 5xx as “bad” but count 4xx as “good.” Client errors aren’t your fault and shouldn’t burn your budget. Second, this is a recording rule candidate. Don’t query 28d windows live; pre-aggregate.

A latency SLI uses a histogram:

sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.3"}[28d]))
/
sum(rate(http_request_duration_seconds_count{service="checkout"}[28d]))

This gives you “fraction of requests under 300ms over 28 days.” Your SLO might be that this ratio stays above 0.95. The error budget is then 5% of requests allowed to be slow.
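To make the budget arithmetic concrete, here’s a minimal Python sketch that turns raw good/total counts into an SLI and the share of budget left. The function name, the dictionary keys, and the sample counts are made up for illustration; in practice the counts would come from whatever your recording rules produce.

```python
def error_budget_report(good: int, total: int, slo_target: float) -> dict:
    """Summarise an SLI and its error budget from event counts (illustrative helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good / total                          # fraction of good events
    allowed_bad = (1.0 - slo_target) * total    # error budget, in events
    actual_bad = total - good
    consumed = actual_bad / allowed_bad         # 1.0 means the budget is gone
    return {
        "sli": sli,
        "allowed_bad": allowed_bad,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,     # negative once the SLO is breached
    }


# Hypothetical numbers: 1M requests, 970k of them under 300ms, latency SLO of 0.95.
report = error_budget_report(good=970_000, total=1_000_000, slo_target=0.95)
print({key: round(value, 4) for key, value in report.items()})
```

The useful output is `budget_remaining` rather than the SLI itself: it’s the number the policy later in this post keys off.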

Burn-rate alerts, not threshold alerts

The mistake I see most often is alerting when the SLI crosses the SLO. By then the budget for the whole window is already spent. You want to alert when you’re burning the budget faster than the window allows.

A 30-day budget of 1% means you have 1% of requests to spend over 30 days. Burning that in an hour means a 720x burn rate. The multi-window multi-burn-rate alert from the workbook codifies this:

groups:
  - name: checkout-slo-burn
    rules:
      - alert: CheckoutAvailabilityBurnHigh
        expr: |
          (
            sum(rate(http_requests_total{service="checkout", code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="checkout"}[5m]))
          ) > (14.4 * 0.01)
          and
          (
            sum(rate(http_requests_total{service="checkout", code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.01)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout burning 30d error budget at >14.4x"

The 14.4 is the burn-rate multiplier that consumes 2% of a 30-day budget in one hour (14.4 / 720 hours = 0.02), and the 0.01 is the error budget for a 99% SLO, so the alert fires when the 5xx ratio exceeds 14.4%. The dual-window pattern requires the burn to hold over both 5m and 1h to fire, which kills spurious blips. A companion alert at a lower multiplier (6x) over longer windows (30m and 6h) catches slower bleeds at ticket severity rather than page.
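If the multipliers feel like magic numbers, the arithmetic behind them fits in a few lines. This is a rough Python sketch, not a replacement for the alert rule; the function names are mine, and the 99% target in the examples is an assumption that matches the 0.01 budget used in the rule.

```python
WINDOW_HOURS = 30 * 24  # a 30-day SLO window is 720 hours


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(burn: float, hours: float, window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the window's budget spent by holding `burn` for `hours`."""
    return burn * hours / window_hours


def should_page(short_ratio: float, long_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Dual-window check: both the short and the long window must exceed the threshold."""
    return (burn_rate(short_ratio, slo_target) > threshold
            and burn_rate(long_ratio, slo_target) > threshold)


print(round(budget_consumed(14.4, hours=1), 4))   # share of budget burned by 14.4x for 1h
print(round(budget_consumed(6.0, hours=6), 4))    # share burned by 6x for 6h
print(should_page(short_ratio=0.20, long_ratio=0.15, slo_target=0.99))
print(should_page(short_ratio=0.50, long_ratio=0.05, slo_target=0.99))
```

The last call is the blip case: the short window is on fire, but the long window hasn’t moved enough, so nothing pages.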

The error budget policy

This is the part everybody skips. Numbers don’t change behavior. Policies do. Write down, before any of this is in production, what happens when:

  • The budget is healthy (>50% remaining): ship freely, take risks.
  • The budget is low (<25% remaining): no risky deploys, no migrations, on-call gets a heads-up.
  • The budget is exhausted: feature freeze, reliability work only, until the window rolls.

Put this in a doc your team reviewed and your manager signed. It will be tested within three months. The first time the budget is exhausted and someone wants to ship anyway is the moment that decides whether you actually have an SLO program or just dashboards.

The policy is also where you handle nuance. “Feature freeze” doesn’t mean nothing ships; it means no new features. Security patches always ship. Reliability fixes always ship. Write that out so on-calls don’t relitigate it under pressure.
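One way to keep the policy from being relitigated is to write it down as something executable. The sketch below is only an illustration of the three tiers above. The names (`BudgetState`, `may_ship`, the change-kind strings) are hypothetical, and the tiers leave the 25% to 50% band undefined, so the `CAUTION` state and its “ship as normal” behavior are my placeholder, not part of the policy.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "healthy"      # more than 50% of budget remaining
    CAUTION = "caution"      # 25% to 50%: placeholder band, not defined in the policy
    LOW = "low"              # less than 25% remaining
    EXHAUSTED = "exhausted"  # nothing left


ALWAYS_SHIP = {"security", "reliability"}       # exempt from any freeze
BLOCKED_WHEN_LOW = {"risky_deploy", "migration"}


def budget_state(remaining: float) -> BudgetState:
    """Map the remaining budget fraction (1.0 = untouched) to a policy tier."""
    if remaining <= 0.0:
        return BudgetState.EXHAUSTED
    if remaining < 0.25:
        return BudgetState.LOW
    if remaining > 0.50:
        return BudgetState.HEALTHY
    return BudgetState.CAUTION


def may_ship(change_kind: str, remaining: float) -> bool:
    """Decide whether a change of the given kind ships under the policy."""
    if change_kind in ALWAYS_SHIP:
        return True
    state = budget_state(remaining)
    if state is BudgetState.EXHAUSTED:
        return False                             # feature freeze
    if state is BudgetState.LOW:
        return change_kind not in BLOCKED_WHEN_LOW
    return True                                  # HEALTHY, and CAUTION by assumption


print(may_ship("feature", remaining=-0.1))
print(may_ship("security", remaining=-0.1))
print(may_ship("migration", remaining=0.2))
```

Whatever form yours takes, the point is that the exemptions are written into the rule, not decided in the incident channel.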

The review cadence

SLOs that aren’t reviewed are dead. The cadence I run:

  • Weekly: a 10-minute scan of burn rates per SLO. Anything trending toward exhaustion goes on the agenda for the next planning meeting.
  • Monthly: a 30-minute SLO review. Are the targets still right? Have any SLIs become noisy? Should any be retired?
  • Quarterly: a deeper review tied to OKRs. The error budget consumed last quarter should be a line item in the next quarter’s plan.

The monthly review is where most adjustments happen. SLIs drift as the service evolves. A latency threshold of 300ms might have been right when the service did one thing; if it’s grown to call three downstreams, the threshold needs revisiting. Reviews catch this. Without them, the SLO slowly stops describing the service.
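For the weekly scan, “trending toward exhaustion” can be a simple projection. Here’s a rough Python sketch that assumes the recent burn rate holds steady and ignores old errors aging out of a rolling window, so treat the result as a conversation starter rather than a forecast. The function names and the 14-day horizon are hypothetical.

```python
import math


def days_to_exhaustion(remaining: float, recent_burn_rate: float,
                       window_days: float = 30.0) -> float:
    """Days until the budget hits zero if the recent burn rate continues.

    A burn rate of 1.0 spends exactly one full budget per window.
    """
    if remaining <= 0.0:
        return 0.0
    if recent_burn_rate <= 0.0:
        return math.inf
    return remaining * window_days / recent_burn_rate


def needs_agenda_item(remaining: float, recent_burn_rate: float,
                      horizon_days: float = 14.0) -> bool:
    """Flag SLOs projected to run out within the (hypothetical) planning horizon."""
    return days_to_exhaustion(remaining, recent_burn_rate) <= horizon_days


# 40% of budget left, burning at 2x over the last week.
print(round(days_to_exhaustion(0.4, 2.0), 1))
print(needs_agenda_item(0.4, 2.0))
```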

Don’t SLO everything

A reasonable default is: SLO your user-facing services and your top three internal services. Everything else gets monitoring but not SLOs. The reason is that SLOs are expensive. They need owners, reviews, and burn-rate alerts. Spreading them across 80 microservices means none of them get attention.

If you have a service mesh, the Linkerd SLO docs are a decent starting point because the mesh emits the metrics for you. But mesh metrics aren’t enough on their own; they only see what the proxy sees.

A worked example

For a checkout service handling 1M requests per 30-day window with a 99.5% availability SLO:

  • Budget per 30 days: 0.5% of 1M = 5,000 failed requests.
  • Daily burn: 5,000 / 30 ≈ 167 failed requests is “on track.”
  • A bad deploy that adds 1,000 5xx in 10 minutes burns 20% of the budget (1,000 / 5,000). That should page.
  • A flaky downstream that adds 50 5xx/hour for a day burns 24% of the budget (50 × 24 = 1,200 of 5,000). That should ticket.

Both happen monthly in real systems. The SLO program is what tells you which one to chase first.
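The numbers in this example are easy to re-derive, and it’s worth doing once for your own service. Here’s a small Python sketch of the arithmetic. It assumes the 1M requests are the total over the 30-day window, which is the reading under which the 5,000 figure holds; the variable and function names are mine.

```python
REQUESTS_PER_WINDOW = 1_000_000   # assumed: total requests over the 30-day window
SLO_TARGET = 0.995
WINDOW_DAYS = 30

# Error budget in failed requests for the whole window.
budget_events = round((1 - SLO_TARGET) * REQUESTS_PER_WINDOW)

# Failures per day that spend the budget evenly across the window.
on_track_per_day = budget_events / WINDOW_DAYS


def budget_share(failed_requests: int) -> float:
    """Fraction of the window's budget consumed by a number of failed requests."""
    return failed_requests / budget_events


bad_deploy = budget_share(1_000)            # 1,000 5xx in 10 minutes
flaky_downstream = budget_share(50 * 24)    # 50 5xx/hour for a day

print(budget_events, round(on_track_per_day, 1))
print(f"bad deploy: {bad_deploy:.0%}, flaky downstream: {flaky_downstream:.0%}")
```

The flaky downstream costs slightly more budget than the bad deploy. What separates a page from a ticket is how fast the budget goes, not how much.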

Gotchas

A few patterns that have bitten me:

  • Counting 4xx as bad. Don’t. Your users will deliberately send malformed requests; you can’t fix that by code-reviewing harder.
  • SLOs on background jobs measured the same way as APIs. Use throughput or freshness SLIs for async work, not availability.
  • No SLO ownership. If “the team” owns the SLO, nobody does. Name a single engineer per SLO. Rotate quarterly.
  • Resetting the budget when it breaches. No. Let it stay underwater. That’s the signal.
  • Multiplying SLIs to compound them. A “composite” SLO that ANDs availability and latency reads worse than its parts. Keep them separate.
  • Alerting on raw error rate as a backup. If you do this, your team will ignore the SLO alerts in favor of the raw ones. Pick one.
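On the background-jobs point, a freshness SLI keeps the same good-over-total shape as the API SLIs. This is a minimal Python sketch that assumes you periodically sample how old the job’s latest output is; the function name, the sampling approach, and the 300-second threshold are all illustrative.

```python
def freshness_sli(ages_seconds: list[float], max_age_seconds: float) -> float:
    """Fraction of samples where the job's output was no older than the threshold."""
    if not ages_seconds:
        raise ValueError("need at least one sample")
    fresh = sum(1 for age in ages_seconds if age <= max_age_seconds)
    return fresh / len(ages_seconds)


# Hypothetical samples of "seconds since last successful run".
samples = [60, 120, 900, 30]
print(freshness_sli(samples, max_age_seconds=300))
```

Because the result is a ratio, the same budget and burn-rate machinery applies to it unchanged.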

Communicating SLOs outside engineering

A common failure: engineering treats the SLO seriously, product treats it as a number, leadership ignores it. Three habits that help:

  1. Report SLO health in the same forum you report shipping velocity. Same slide, same audience. If reliability isn’t on the slide, it doesn’t compete.
  2. Translate budget burn into product language. “5% of checkouts were slow this month” beats “0.95 SLI against a 0.99 SLO target” every time.
  3. Make the budget policy visible. When you do freeze features, the announcement should reference the SLO, not just say “we’re stopping work.” This trains the rest of the org to take the budget seriously.

If you can’t get leadership to engage with these, the SLO program will reach a ceiling. Engineering can buy itself reliability up to a point. Beyond that, product trade-offs need leadership air cover.

Wrapping Up

SLOs are a discipline, not a tool. The math is straightforward. The hard part is the policy and the cultural commitment to actually freeze features when the budget is gone. If your leadership won’t sign that policy, you don’t have SLOs; you have telemetry with extra steps. Make leadership sign first, then build the dashboards.

In the next reliability post I’ll get into chaos engineering on Kubernetes, which is the cheapest way to discover which of your SLOs are actually defended by your architecture. Spoiler: usually fewer than you think.