Error Budgets and Burn Rates

September 26, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Error budget = (1 - SLO target) × total events. Burn rate = current bad-event rate / sustainable rate. Multi-window alerts (5m + 1h, 30m + 6h) catch both fast outages and slow degradations. Action policy: pause risk-introducing changes when budget burns hot.

This follows SLOs in practice and covers the alerting half. Burn rate is the practical lever for deciding when to act on the SLO budget.

The math

For a 99.9% availability SLO over 30 days on a service doing 1,000 req/min:

Total requests/30 days = 43.2M
Allowed errors = 0.001 × 43.2M = 43,200 errors
Allowed error rate (avg) = 0.1% = 1 per 1000 requests
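
The same arithmetic as a minimal Python sketch, useful for plugging in your own traffic and target. The names `error_budget`, `SLO_TARGET`, `REQUESTS_PER_MINUTE`, and `WINDOW_DAYS` are made up for this illustration; the numbers are the ones from the example above.

```python
SLO_TARGET = 0.999           # 99.9% availability
REQUESTS_PER_MINUTE = 1000   # traffic figure from the example
WINDOW_DAYS = 30             # SLO window length


def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    return (1 - slo_target) * total_events


total_requests = REQUESTS_PER_MINUTE * 60 * 24 * WINDOW_DAYS
allowed_errors = error_budget(SLO_TARGET, total_requests)

print(f"total requests: {total_requests:,}")    # total requests: 43,200,000
print(f"allowed errors: {allowed_errors:,.0f}")  # allowed errors: 43,200
```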

Burn rate = how fast you’re using the budget relative to the sustainable rate:

burn rate = (errors / total) / SLO error rate
         = current error rate / 0.001

Current 0.001 (0.1%): burn rate 1× — exactly sustainable
Current 0.005 (0.5%): burn rate 5× — using up 30-day budget in 6 days
Current 0.05 (5%): burn rate 50× — full budget in 14 hours
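
The three cases above can be reproduced with a short sketch. It assumes a constant error rate and a 30-day window; `burn_rate` and `days_to_exhaustion` are illustrative helper names, not part of any library.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Current error rate divided by the error rate the SLO allows."""
    return error_rate / (1 - slo_target)


def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # no errors means the budget never runs out
    return window_days / rate


for current in (0.001, 0.005, 0.05):
    rate = burn_rate(current, slo_target=0.999)
    days = days_to_exhaustion(rate)
    print(f"{current:.1%} errors: {rate:.1f}x burn, "
          f"budget gone in {days:.1f} days ({days * 24:.1f} hours)")
```

The 50× case comes out at 0.6 days, or 14.4 hours, which the list above rounds to 14.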

The higher the burn rate, the more urgent.

Multi-window alerts

A single-threshold alert (“burn rate > 10”) flaps. The multi-window pattern from the Google SRE Workbook pairs a short window with a long one:

Fast burn:

- alert: SLOFastBurn
  expr: |
    (
      slo:burn_rate:5m{service="api"} > 14.4
    AND
      slo:burn_rate:1h{service="api"} > 14.4
    )
  for: 2m
  labels: { severity: critical }

A 14.4× burn rate over both the 5m AND 1h windows = consuming 2% of the 30-day budget in 1 hour (14.4 × 1h / 720h). Real fire.

Slow burn:

- alert: SLOSlowBurn
  expr: |
    (
      slo:burn_rate:30m{service="api"} > 6
    AND
      slo:burn_rate:6h{service="api"} > 6
    )
  for: 15m
  labels: { severity: warning }

A 6× burn rate over both the 30m and 6h windows = consuming 5% of the 30-day budget in 6 hours (6 × 6h / 720h). Trending bad.

The two windows do different jobs: the short window confirms “still bad right now,” and the long window confirms “not just a blip.”
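
To see where thresholds like 14.4 and 6 come from, and how the two-window condition behaves, here is a small sketch. It assumes a 30-day (720-hour) SLO window; the function names and the sample burn values are hypothetical, chosen only to illustrate the logic of the two rules.

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO window expressed in hours (720)


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * PERIOD_HOURS / window_hours


def budget_spent(burn: float, window_hours: float) -> float:
    """Fraction of the whole budget consumed at `burn` over `window_hours`."""
    return burn * window_hours / PERIOD_HOURS


def should_fire(short_burn: float, long_burn: float, threshold: float) -> bool:
    """Multi-window condition: both windows must exceed the threshold."""
    return short_burn > threshold and long_burn > threshold


print(round(burn_rate_threshold(0.02, 1), 1))  # 14.4 (2% of budget in 1 hour)
print(round(burn_rate_threshold(0.05, 6), 1))  # 6.0  (5% of budget in 6 hours)

# Hypothetical readings for the fast-burn rule (5m and 1h windows):
print(should_fire(20.0, 3.0, 14.4))   # False: short spike, long window calm
print(should_fire(0.5, 16.0, 14.4))   # False: already recovered
print(should_fire(20.0, 20.0, 14.4))  # True: sustained and still happening
```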

Recording rules for burn rate

Computing burn rate directly in alert expressions is slow, because every evaluation rescans the raw series. Pre-compute with recording rules:

- name: api_slo
  interval: 30s
  rules:
    - record: slo:errors:rate5m
      expr: |
        sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{service="api"}[5m]))

    - record: slo:errors:rate1h
      expr: |
        sum(rate(http_requests_total{service="api",status=~"5.."}[1h]))
        /
        sum(rate(http_requests_total{service="api"}[1h]))

    - record: slo:errors:rate30m
      expr: |
        sum(rate(http_requests_total{service="api",status=~"5.."}[30m]))
        /
        sum(rate(http_requests_total{service="api"}[30m]))

    - record: slo:errors:rate6h
      expr: |
        sum(rate(http_requests_total{service="api",status=~"5.."}[6h]))
        /
        sum(rate(http_requests_total{service="api"}[6h]))

    - record: slo:burn_rate:5m
      expr: slo:errors:rate5m / 0.001
    - record: slo:burn_rate:1h
      expr: slo:errors:rate1h / 0.001
    - record: slo:burn_rate:30m
      expr: slo:errors:rate30m / 0.001
    - record: slo:burn_rate:6h
      expr: slo:errors:rate6h / 0.001

Alerts query the recording rules, not the raw counters.

Dashboard

Per service, an SLO panel:

┌──────────────────────────────────────────────────────┐
│ API Availability SLO                                  │
├──────────────────────────────────────────────────────┤
│ Current 30d performance:    99.965%  (target 99.9%)  │
│ Error budget remaining:     65%      (good)          │
│ Burn rate (1h):             0.4×     (sustainable)   │
│ Burn rate (6h):             0.6×     (sustainable)   │
└──────────────────────────────────────────────────────┘

Add a 30-day error budget burndown chart. It shows whether budget is being used linearly; spikes in consumption correlate with incidents.
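
The "error budget remaining" figure on such a panel can be derived from two counts over the window. A minimal sketch, assuming you already have total and bad request counts for the last 30 days; `budget_remaining` and the sample counts are hypothetical.

```python
def budget_remaining(total: int, bad: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = (1 - slo_target) * total
    if allowed == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1 - bad / allowed


total, bad = 43_200_000, 15_120  # hypothetical 30-day counts
availability = 1 - bad / total
remaining = budget_remaining(total, bad, slo_target=0.999)

print(f"availability: {availability:.3%}")  # availability: 99.965%
print(f"budget remaining: {remaining:.0%}")  # budget remaining: 65%
```

Note how small the gap is: 99.965% measured against a 99.9% target already means 35% of the budget is spent.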

Action policy

SLOs without action policies are decoration. Three concrete tiers:

Budget healthy (> 50%): normal operation. Ship features. Take risks.

Budget tight (10-50%): review risky changes carefully. No code freezes; just elevated scrutiny.

Budget nearly exhausted (< 10%): pause risky feature work. Focus on stability. Postmortem the incidents that ate the budget.

This is the SRE-book version. Real adoption varies:

Most teams: informal, “we should be careful when budget’s low”
Mature SRE shops: explicit code freeze policy when budget gone
Big tech: PRs blocked from deploy automatically when budget violated

Even informal awareness changes behavior. A monthly “here’s our SLO status” review surfaces patterns.
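
If you want the tiers to be more than a convention, they are easy to encode. The sketch below maps remaining budget to the three tiers described above; `policy_tier` and the tier labels are illustrative names, and how the result gates deploys is left to your own tooling.

```python
def policy_tier(remaining: float) -> str:
    """Map remaining error budget (0.0 to 1.0) to an action tier."""
    if remaining > 0.50:
        return "healthy"    # normal operation, ship features
    if remaining >= 0.10:
        return "tight"      # elevated scrutiny on risky changes
    return "exhausted"      # pause risky work, focus on stability


for remaining in (0.65, 0.30, 0.05, -0.20):
    print(f"{remaining:>5.0%} remaining: {policy_tier(remaining)}")
```

A negative value means the budget is overspent and lands in the last tier along with anything under 10%.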

Resetting budget

Two reset patterns:

Rolling 30-day window: budget continuously calculated over last 30 days. New errors enter; old errors leave. Smooth.

Calendar-aligned (monthly): budget resets the 1st of each month. Discontinuity but easier to communicate.

Most engineering teams use rolling. Customer-facing SLAs often use calendar.
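
The difference between the two patterns is only where the measurement window starts. A minimal sketch using the standard library, assuming UTC timestamps; the function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def rolling_window_start(now: datetime, days: int = 30) -> datetime:
    """Rolling window: always the last `days` days, sliding with `now`."""
    return now - timedelta(days=days)


def calendar_window_start(now: datetime) -> datetime:
    """Calendar-aligned window: midnight on the 1st of the current month."""
    return now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


now = datetime(2022, 9, 26, 12, 0, tzinfo=timezone.utc)
print(rolling_window_start(now))   # 2022-08-27 12:00:00+00:00
print(calendar_window_start(now))  # 2022-09-01 00:00:00+00:00
```

With the calendar pattern, an incident on the 30th stops counting two days later; with the rolling pattern it stays in the budget for a full 30 days.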

When SLOs and alerts conflict

You might have an SLO alert AND a separate “high error rate” alert. Don’t.

Pick one source of truth for “is this service OK.” Usually: SLO burn rate is the alert.

Separately, you can have alerts for:

“Something we want to know but isn’t user-facing” (cron job failed)
“Predictive” (disk filling, expiry coming up)
“Anomaly detection” (unusual traffic pattern)

Reserve “page someone” for SLO burn rate alerts on user-facing services.

Common Pitfalls

Burn rate alerts on too-short window. 5-minute burn rate alone is noisy. Always pair with longer window.

Alerting on SLO breach instead of burn rate. By the time you’ve broken the SLO, it’s too late.

No exclusion for planned events. Maintenance windows shouldn’t count. Implement via silences or recording-rule filters.

Different SLOs per region without rollup. Customers don’t care about us-east-1 vs eu-west-1. Roll the regions up into a combined SLO.

Ignoring the shape of the burndown. Budget consumed evenly is healthy; spikes mean incidents. Spikes should be unusual: if every week has them, your SLO is too tight or your reliability is too low.
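
On the regional rollup point, one detail worth getting right is how the regions are combined. The sketch below assumes a request-weighted rollup (sum the counts, then divide) and contrasts it with averaging the per-region percentages; the traffic numbers are hypothetical.

```python
regions = {
    "us-east-1": {"total": 9_000_000, "bad": 4_500},  # 99.95% on its own
    "eu-west-1": {"total": 1_000_000, "bad": 3_000},  # 99.70% on its own
}


def combined_availability(stats: dict) -> float:
    """Request-weighted rollup: sum counts across regions, then divide."""
    total = sum(r["total"] for r in stats.values())
    bad = sum(r["bad"] for r in stats.values())
    return 1 - bad / total


def mean_of_regional_availability(stats: dict) -> float:
    """Unweighted average of per-region ratios, shown for contrast."""
    ratios = [1 - r["bad"] / r["total"] for r in stats.values()]
    return sum(ratios) / len(ratios)


print(f"combined:         {combined_availability(regions):.3%}")
print(f"mean of regions:  {mean_of_regional_availability(regions):.3%}")
```

With these numbers the request-weighted figure is 99.925% and the unweighted mean is 99.825%, so the two methods land on opposite sides of a 99.9% target. Summing counts reflects what the average request experienced, which is usually what a combined SLO is meant to measure.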

Wrapping Up

Multi-window burn rate alerts on pre-computed recording rules. Action policy that ties budget to behavior. Wednesday: Prometheus cardinality + cost control.
