SLOs and Burn Rate Alerting in 2025, A Practical Guide

May 19, 2025 · 7 min read · by Muhammad Amal

TL;DR — Pick three SLIs per service, set SLOs from real user pain, write multi-window multi-burn-rate alerts using the Google SRE workbook formula, and let Sloth generate the rules so you stop hand-editing PromQL.

The SLO conversation in 2025 finally settled on something that works. The Google SRE workbook’s multi-window multi-burn-rate approach won. Vendors stopped inventing competing formulas. Sloth and OpenSLO are stable enough that you can generate the rules from declarative YAML and trust the output. There’s no excuse left for handcrafting burn rate alerts.

What hasn’t been solved is the easy mistake of writing SLOs that don’t reflect user pain. A 99.9% SLO defined as successful http_requests_total over total requests is a number, not a promise. If your real users care about checkout completing in under 2 seconds, that’s the SLI, and the SLO needs to be defined on it. The math comes second.

This guide walks through SLO design from the user backwards: pick SLIs, set targets, generate the rules, and tune the alerts so the on-call rotation actually trusts them. We’ll use Prometheus 3.0 for the metrics and Sloth for rule generation. The output is a small bundle of recording and alerting rules that a senior SRE will recognize and trust.

1. Picking the Right SLI

The SLI is a ratio of good events to total events, observed from the user’s side. Three pitfalls:

Measuring at the wrong layer. Latency from inside the pod ignores the load balancer and the network. Measure at the edge.
Treating all requests equally. Health check /healthz requests don’t count. Filter them out.
Counting 4xx as bad. Most 4xx are client errors. They count as availability events but not as availability failures. The error ratio should be 5xx / total, with 4xx left in the denominator as good events, not (4xx + 5xx) / total.

The three SLIs that fit most user-facing services:

Availability: fraction of requests that returned a non-5xx response.
Latency: fraction of requests served under a target threshold.
Correctness (for write paths): fraction of requests whose downstream effect succeeded.

You don’t need more than three. Adding a fourth dilutes attention.
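
As a rough illustration of the availability definition above, here is a minimal Python sketch that computes the SLI from (path, status) pairs. The function name, the sample data, and the choice to exclude only /healthz are assumptions made for the example. 4xx responses stay in the denominator as good events.

```python
def availability_sli(requests, excluded_paths=("/healthz",)):
    """Fraction of user-facing requests that returned a non-5xx response.

    `requests` is an iterable of (path, status_code) pairs (hypothetical shape).
    Returns None when no user-facing request was seen.
    """
    good = 0
    total = 0
    for path, status in requests:
        if path in excluded_paths:
            continue  # health checks are not user traffic
        total += 1
        if status < 500:
            good += 1  # 2xx, 3xx and 4xx all count as good events
    if total == 0:
        return None
    return good / total


sample = [
    ("/checkout", 200),
    ("/checkout", 404),  # client error: an event, not a failure
    ("/checkout", 503),  # server error: the only failure here
    ("/healthz", 200),   # filtered out
    ("/cart", 200),
]
print(availability_sli(sample))  # 0.75
```

Three of the four user-facing requests are good, so the SLI is 0.75. The health check changes neither side of the ratio.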

2. Defining the Targets

Targets come from the product, not from the SRE team. The right conversation is:

“If we serve fewer than 99.5% of checkouts in under 2 seconds for a full day, the business reads this as a degraded experience. Anything above 99.5% over 30 days is acceptable.”

That sentence is your SLO. It maps to:

SLI: http_request_duration_seconds_bucket{route="/checkout",le="2.0"} / total checkout requests (Prometheus 3.0 normalizes le values on ingestion, so the 2-second bucket is stored as le="2.0")
SLO: 99.5% over a 30-day window
Error budget: 0.5% of total requests, distributed over 30 days

A 99.5% SLO on a 30-day window at 10 RPS works out to about 130k requests of budget. That’s a lot. Most teams aim too high. 99.9% at the same traffic is about 26k requests of budget, which one bad deploy can burn through.
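
The budget arithmetic is easy to check by hand. The sketch below assumes a steady request rate. The function names are made up for the example.

```python
SECONDS_PER_DAY = 86_400


def error_budget_requests(rps, slo, window_days=30):
    """Requests that may fail within the window before the SLO is missed."""
    total_requests = rps * SECONDS_PER_DAY * window_days
    return total_requests * (1 - slo)


def error_budget_minutes(slo, window_days=30):
    """Equivalent minutes of total outage, assuming evenly spread traffic."""
    return window_days * 24 * 60 * (1 - slo)


print(round(error_budget_requests(rps=10, slo=0.995)))  # 129600
print(round(error_budget_minutes(slo=0.995)))           # 216
```

At 10 RPS the 99.5% target leaves roughly 130k failed requests, or about three and a half hours of full outage, per 30 days.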

3. Declaring SLOs in Sloth

Sloth turns this YAML into Prometheus rules. It’s the cleanest way to keep SLO definitions in version control.

# slos/checkout.yaml
version: prometheus/v1
service: checkout
labels:
  team: payments
  tier: critical
slos:
  - name: availability
    objective: 99.5
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="checkout",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutAvailability
      labels:
        severity: page
      annotations:
        summary: "Checkout availability SLO burn"
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

  - name: latency-2s
    objective: 99.5
    sli:
      events:
        error_query: |
          sum(rate(http_request_duration_seconds_count{job="checkout"}[{{.window}}]))
          -
          sum(rate(http_request_duration_seconds_bucket{job="checkout",le="2.0"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_seconds_count{job="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutLatency
      labels:
        severity: page
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

Generate the rules:

sloth generate -i slos/checkout.yaml -o rules/checkout.yaml

The output is 20+ recording rules and 4 alerting rules. We’ll look at the alerting structure next.

4. The Multi-Window Multi-Burn-Rate Pattern

This is the formula that finally killed flapping SLO alerts. The idea: fire a page only when the error budget is burning fast enough to be exhausted in a meaningful window, and require a long window and a short window to agree. The long window keeps a 30-second blip from paging; the short window clears the alert soon after the burn stops.

For a 30-day SLO with budget B:

Critical page: 14.4x burn rate, sustained for 1h, confirmed by 5m
High page: 6x burn rate, sustained for 6h, confirmed by 30m
Ticket: 1x burn rate, sustained for 3 days, confirmed by 6h

The 14.4x is calibrated to consume 2% of the 30-day budget in 1 hour. The 6x consumes 5% in 6 hours, and the 1x consumes 10% in 3 days. The numbers come from the SRE workbook; don’t second-guess them on day one.
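
The multipliers follow from one line of arithmetic: burn rate = budget fraction × SLO period / alert window. The sketch below reproduces them. The function names are illustrative, and the ticket tier’s 10% over 3 days is the SRE workbook’s figure.

```python
def burn_rate(budget_fraction, alert_window_hours, slo_period_days=30):
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    period_hours = slo_period_days * 24
    return budget_fraction * period_hours / alert_window_hours


def hours_to_exhaustion(rate, slo_period_days=30):
    """Hours until the whole budget is gone if the burn rate holds."""
    return slo_period_days * 24 / rate


print(round(burn_rate(0.02, 1), 2))         # 14.4  (2% in 1 hour)
print(round(burn_rate(0.05, 6), 2))         # 6.0   (5% in 6 hours)
print(round(burn_rate(0.10, 72), 2))        # 1.0   (10% in 3 days)
print(round(hours_to_exhaustion(14.4), 1))  # 50.0
```

A sustained 14.4x burn empties a 30-day budget in about 50 hours, which is why it pages rather than files a ticket.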

Here’s the generated alert (cleaned up for readability):

- alert: CheckoutAvailability-PageCritical
  expr: |
    (
      slo:sli_error:ratio_rate5m{slo="availability"} > (14.4 * 0.005)
      and
      slo:sli_error:ratio_rate1h{slo="availability"} > (14.4 * 0.005)
    )
    or
    (
      slo:sli_error:ratio_rate30m{slo="availability"} > (6 * 0.005)
      and
      slo:sli_error:ratio_rate6h{slo="availability"} > (6 * 0.005)
    )
  labels:
    severity: critical
    slo: availability
  annotations:
    summary: "Checkout availability SLO burning fast"
    runbook: "https://runbooks.internal/slo/checkout-availability"

The 0.005 is 1 - 0.995, the allowed error fraction. Multiply by the burn rate to get the threshold ratio.
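
To see how the two window pairs behave, here is a small Python model of the alert expression above. It is only a sketch: the dictionary keys and sample ratios are invented for illustration, and real evaluation happens in Prometheus, not in application code.

```python
def page_fires(error_ratios, slo=0.995):
    """Mirror the alert: (5m AND 1h above 14.4x) OR (30m AND 6h above 6x)."""
    budget = 1 - slo  # allowed error fraction, 0.005 for 99.5%
    fast = (error_ratios["5m"] > 14.4 * budget
            and error_ratios["1h"] > 14.4 * budget)
    slow = (error_ratios["30m"] > 6 * budget
            and error_ratios["6h"] > 6 * budget)
    return fast or slow


# A short blip: the 5m window spikes, nothing else has moved.
blip = {"5m": 0.20, "30m": 0.02, "1h": 0.01, "6h": 0.002}
# A sustained outage: both fast windows are above 0.072.
outage = {"5m": 0.10, "30m": 0.09, "1h": 0.08, "6h": 0.02}
# A slow leak: below 0.072 everywhere, but above 0.03 on 30m and 6h.
leak = {"5m": 0.04, "30m": 0.04, "1h": 0.04, "6h": 0.035}

print(page_fires(blip))    # False
print(page_fires(outage))  # True
print(page_fires(leak))    # True
```

The blip never pages because the 1h ratio stays low. The leak pages through the 6x pair even though it never reaches the 14.4x threshold.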

5. Recording Rules That Don’t Kill Prometheus

Burn rate alerts query the same ratio at four windows. Without recording rules you’d evaluate that expression repeatedly. Sloth generates them for you, but it helps to know what they look like.

- record: slo:sli_error:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="checkout"}[5m]))
  labels:
    slo: availability
    service: checkout

- record: slo:sli_error:ratio_rate1h
  expr: |
    sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
    /
    sum(rate(http_requests_total{job="checkout"}[1h]))
  labels:
    slo: availability
    service: checkout

Repeat for 30m, 6h, 1d, 3d. The alert then references the recorded series and is cheap to evaluate.
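
If you want to see the repetition spelled out, the loop below builds the same rule for each window as plain Python dictionaries. This is a hand-rolled sketch of the pattern and not Sloth’s generator. The function name and the dictionary layout are assumptions, while the metric and label names are the ones used in the rules above.

```python
WINDOWS = ["5m", "30m", "1h", "6h", "1d", "3d"]


def recording_rules(job, slo, service, windows=WINDOWS):
    """Build one error-ratio recording rule per window as plain dicts."""
    rules = []
    for window in windows:
        errors = f'sum(rate(http_requests_total{{job="{job}",code=~"5.."}}[{window}]))'
        total = f'sum(rate(http_requests_total{{job="{job}"}}[{window}]))'
        rules.append({
            "record": f"slo:sli_error:ratio_rate{window}",
            "expr": f"{errors}\n/\n{total}",
            "labels": {"slo": slo, "service": service},
        })
    return rules


rules = recording_rules(job="checkout", slo="availability", service="checkout")
for rule in rules:
    print(rule["record"])
```

Each dictionary maps directly onto one entry of a Prometheus rule group, so the output can be serialized with whatever YAML library you already use.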

6. The Error Budget Burndown Panel

Grafana 11.4 makes this trivial once the recording rules exist. The panel shows remaining error budget as a percentage.

# mean 5m error ratio over 30 days, independent of the rule evaluation interval
1 - (
  avg_over_time(slo:sli_error:ratio_rate5m{slo="availability"}[30d])
  /
  0.005  # allowed error fraction for a 99.5% SLO
)

That’s a rough approximation: it weights every five-minute sample equally regardless of traffic. Sloth ships slo:current_burn_rate:ratio and slo:period_error_budget_remaining:ratio directly. Use those.

+-------------------------------------------+
|  Error Budget Remaining (30d)             |
|                                           |
|   [######################__________] 68%  |
|                                           |
|   Last incident: 14 days ago              |
|   Current burn rate: 0.4x                 |
+-------------------------------------------+

Put this panel on the team’s TV. Watching the budget tick down is the right cultural feedback loop.
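
The two numbers on the panel come from simple ratios. The sketch below computes them from request counts. The function names and sample figures are made up, chosen so the result matches the mock-up above.

```python
def budget_remaining(failed, total, slo):
    """Fraction of the error budget left, from request counts over the SLO window.

    Goes negative once the budget is overspent.
    """
    if total == 0:
        return 1.0  # no traffic, nothing consumed
    consumed = (failed / total) / (1 - slo)
    return 1 - consumed


def current_burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - slo)


print(round(budget_remaining(failed=41_472, total=25_920_000, slo=0.995), 2))  # 0.68
print(round(current_burn_rate(error_ratio=0.002, slo=0.995), 2))               # 0.4
```

A burn rate below 1x means that, at the current error ratio, the budget would outlast the window.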

7. Notification Routing

The alert labels drive routing. In Alertmanager:

route:
  group_by: ['service', 'slo']
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-oncall
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h

    - matchers:
        - severity = warning
      receiver: ticket-queue
      group_wait: 5m
      group_interval: 1h
      repeat_interval: 24h

Critical pages the on-call. Warning files a ticket. There is no third tier. If you find yourself wanting “loud Slack notification but not a page”, that’s a critical alert with a paging tool that supports Slack — not a separate severity.

8. Common Pitfalls

Four mistakes I’ve seen kill SLO programs.

Setting the SLO before measuring the SLI. Measure for 30 days first. Pick a target that’s slightly better than your current measured performance. Aspirational SLOs guarantee a flapping alert from week one.
Treating SLO violations as outages. An SLO violation is a budget signal, not an incident. The on-call doesn’t get paged on the SLO itself, they get paged on burn rate. Don’t conflate the two.
Including synthetic checks in the SLI. Synthetic monitoring is great for redundancy and dead-canary detection, but it’s not user pain. Keep synthetics on a separate alerting path.
Ignoring the long-window confirmation in burn rate alerts. The 1h confirmation on the 5m fast burn is the entire reason this approach doesn’t flap. People delete it because the alert “didn’t fire fast enough” during an obvious outage. Leave it alone.

9. Troubleshooting

Three failure modes you’ll hit.

9.1 Alert fires every Monday morning

After a quiet weekend the 1h window holds little earlier traffic, so the first Monday burst dominates both the 5m and the 1h ratio, and a handful of errors clears the threshold. This is a feature, not a bug, but if it’s noisy add a 10-minute for: clause to debounce.

9.2 Burn rate is always zero

Your SLI denominator includes traffic that doesn’t have errors. Common culprit: health check endpoints with code=200 dominate. Filter them out of both the numerator and denominator.

9.3 Error budget claims 100% remaining but service was down yesterday

The 30-day window is forgiving. A 1-hour outage at 10 RPS costs 36k failed requests. If your total 30-day traffic is 100M requests, that’s 0.036% of all requests, or about 7% of the 500k-request budget a 99.5% SLO allows, so the burndown panel moves only a few points. This is correct. An outage in a quiet hour of a busy service is cheap; the same hour on a service that only ever sees 10 RPS would burn over a quarter of its budget. Pick SLOs that punish what actually matters.
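
It helps to separate an outage’s share of total traffic from its share of the error budget, since the two differ by a factor of 1 / (1 - SLO). A minimal sketch, assuming every request during the outage fails and a 99.5% target (the function name and return keys are illustrative):

```python
def outage_cost(outage_rps, outage_hours, total_requests, slo):
    """Cost of a full outage in failed requests, traffic share and budget share."""
    failed = outage_rps * outage_hours * 3600
    budget = total_requests * (1 - slo)
    return {
        "failed_requests": failed,
        "share_of_traffic": failed / total_requests,
        "share_of_budget": failed / budget,
    }


cost = outage_cost(outage_rps=10, outage_hours=1,
                   total_requests=100_000_000, slo=0.995)
print(cost["failed_requests"])                   # 36000
print(round(cost["share_of_traffic"] * 100, 3))  # 0.036
print(round(cost["share_of_budget"] * 100, 1))   # 7.2
```

The same 36k failures against a service that only sees 10 RPS all month (about 25.9M requests) would be close to 28% of the budget.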

10. Wrapping Up

SLOs in 2025 are a solved problem if you accept the solution: declarative SLO definitions in version control, Sloth to generate the rules, multi-window multi-burn-rate alerts, and a single error budget panel in Grafana on the team’s wall. Stop arguing about the math and focus on picking SLIs that match what users feel.

For the alerting layer’s downstream side, see Incident Response Automation with LangGraph, A Step by Step Tutorial for how the page becomes a workflow. For the canonical reference on the math, the Google SRE workbook chapter on alerting on SLOs is still the best read.
