Blameless Postmortems That Actually Change Behavior


June 24, 2024 · 7 min read · by Muhammad Amal

TL;DR — A blameless postmortem is not a writeup; it’s a decision-making meeting. The artifact that matters is the action item list, owned by named humans, with deadlines, tracked weekly. Skip any of those and your postmortems will be repeated documents about repeated incidents.

I’ve sat through somewhere north of 200 postmortems. The good ones I remember because they changed how a team worked. The bad ones blur together because they all looked the same: a slide deck about a P1, a five-whys diagram, a list of action items, and three months later the same incident.

The variable that predicts whether a postmortem matters isn’t the document. It isn’t even the meeting. It’s whether the actions get done. Everything in this post is in service of that.

I’m assuming you’ve already read the Etsy Code As Craft post on blameless postmortems or the Google SRE workbook chapter. If not, do that first. This post is the operational layer on top.

A related prerequisite. Postmortems without SLOs and error budgets are easier to write but harder to act on, because you don’t have a budget that the incident dented. The budget is the receipt that says this incident cost something measurable. Without it, every action item competes with feature work as a matter of opinion.
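To make the "receipt" concrete, here is a minimal sketch of the arithmetic for a request-based availability SLO. The function name and the example figures are illustrative assumptions, not numbers from a real incident.

```python
def budget_consumed_pct(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Percent of the window's error budget burned by failed requests.

    Assumes a request-based availability SLO, e.g. slo_target=0.999 means
    99.9% of requests in the window must succeed.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of requests allowed to fail in the window.
    allowed_failures = (1 - slo_target) * total_requests
    return 100 * failed_requests / allowed_failures


# Hypothetical incident: 2,500 failed requests against a 99.9% SLO
# over a window of 10 million requests (a budget of 10,000 failures).
burn = budget_consumed_pct(0.999, 10_000_000, 2_500)
print(f"{burn:.1f}% of the error budget consumed")
```

A number like this in the Impact line is what lets an action item argue with feature work on more than opinion.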

What blameless actually means

“Blameless” doesn’t mean “no one is responsible.” It means the postmortem assumes everyone involved made the best decision they could with the information they had at the time. The investigation focuses on the system, not on who pushed the bad button.

This matters because the alternative (blame) is psychologically expensive and epistemically wrong. The person who pushed the bad button usually pushed it because the system made it easy to push. If you punish them, the next person hides the next mistake, and your incident learning rate goes to zero.

Concretely:

“Alice deployed without running tests” → wrong framing.
“Our deploy tool doesn’t require tests for hotfixes, and Alice used the hotfix path because the on-call playbook for this scenario recommended it” → right framing.

The first leads to a Slack message to Alice. The second leads to a change in the deploy tool and the playbook.

The five questions

Skip the five-whys ritual. It’s a Procrustean bed. Replace it with five questions that I’ve found generate better discussion:

What did the responder believe was happening, and what was actually happening? This is the gap that defines the incident. Closing it is the work.
What signal did we have, and how long did it take to surface? Latency to detection is the highest-leverage improvement target.
What action shortened the incident? What action lengthened it? Both questions. The second is the one nobody asks.
What property of the system made this incident possible? What property would have prevented it? Property-level thinking, not action-level.
What will we test to verify this can’t recur? If nothing, the incident isn’t resolved; it’s deferred.

Five questions, 90 minutes, one facilitator who isn’t on the affected team. The outsider keeps the conversation honest.

The template

A postmortem document I’d run today, in markdown:

# Postmortem: <one-line headline>

- Severity: SEV-N
- Duration: HH:MM (start) → HH:MM (end), TZ
- Detection: how, by whom
- Impact: users affected, requests failed, SLO budget consumed (%)
- Authors: <names>
- Status: draft | review | resolved

## TL;DR
Two sentences. What broke, what we did, what we changed.

## Timeline
A bulleted log in chronological order. Real timestamps. Quote chat messages verbatim. No editorialization.

## What was actually happening
The technical narrative of root cause. Diagrams welcome.

## What we believed was happening (during)
Where the responders' mental model diverged from reality. This is the gap.

## What helped, what hurt
Two lists. Specific actions, not feelings.

## Action items
Each with: owner (name), deadline (date), success criterion (testable), priority (P0/P1/P2).

## Verification
How we'll confirm this can't recur. A test, a chaos experiment, a monitoring change.

## Open questions
Things we don't know. Worth listing because they sometimes become the next incident.

Two design choices worth flagging. First, the “what we believed vs what was actually happening” split. Most postmortems collapse these. Keeping them separate forces you to identify the mental-model gap, which is usually where the real learning is.

Second, the verification section. An action item without a verification step is wishful thinking. “We added a retry” is not verification. “We added a chaos experiment that drops 30% of payments traffic and verifies checkout latency stays under 500ms” is verification.

The action item discipline

The whole point. Here’s the discipline that turns action items into changes:

Every action item has a named human owner. Not “the team.” A person.
Every action item has a deadline. Default two weeks for P0, four for P1, eight for P2.
P0s block other work for the owner until done. This is a hard rule, not a guideline.
A weekly postmortem-action standup reviews open items. Five minutes per item. Are you blocked? Is the deadline still real?
Past-due items escalate to manager. Not as a punishment, as a signal that something is wrong with the prioritization.

If you do all five, the median time-to-close on P0 action items drops from “never” to about three weeks. I have data from two organizations to back this up, and I’m happy to share the numbers offline.
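The first two rules and the escalation rule are simple enough to encode. A minimal sketch, assuming action items live somewhere you can script against; the `ActionItem` class and `needs_escalation` helper are hypothetical names, not part of any tracker's API.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Default deadlines from the rules above: two weeks for P0, four for P1, eight for P2.
DEFAULT_WEEKS = {"P0": 2, "P1": 4, "P2": 8}


@dataclass
class ActionItem:
    title: str
    owner: str  # a named person, never "the team"
    priority: str  # "P0", "P1", or "P2"
    opened: date
    deadline: Optional[date] = None
    closed: Optional[date] = None

    def __post_init__(self) -> None:
        if self.priority not in DEFAULT_WEEKS:
            raise ValueError(f"unknown priority: {self.priority!r}")
        if self.deadline is None:
            # No explicit deadline given, so apply the default for the priority.
            self.deadline = self.opened + timedelta(weeks=DEFAULT_WEEKS[self.priority])


def needs_escalation(item: ActionItem, today: date) -> bool:
    """True when the item is still open after its deadline has passed."""
    return item.closed is None and today > item.deadline


item = ActionItem("Require tests on the hotfix deploy path", "alice", "P0", date(2024, 6, 3))
print(item.deadline, needs_escalation(item, date(2024, 6, 18)))
```

The weekly standup then becomes a filter over open items, and escalation stops depending on anyone's memory.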

Anti-patterns I keep seeing

A few patterns that make postmortems theater:

“Better monitoring” as an action item. Vague. Replace with “alert on X metric crossing Y, owned by Z, deployed by date.”
“Improve documentation.” Same problem. Replace with “the on-call playbook has a runbook for failure mode F by date.”
Action items assigned to “the team.” Means nobody. Pick a human.
No deadline. It will never get done.
Postmortem authored by the incident commander only. Bias. Get a co-author who wasn’t running the incident.
No timeline in the doc. The timeline is the most-reread section. Skipping it is a waste of the work the responders already did.
Posting the postmortem and never revisiting it. Bring it to the next retro. Ask if the actions held.
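Several of these can be caught mechanically before the review meeting. A rough sketch of a linter for a single action item; the phrase list and the set of placeholder owners are assumptions you would tune to your own team's vocabulary.

```python
import re
from typing import Optional

# Hypothetical phrase list: extend it with the vague wording your team actually produces.
VAGUE_PATTERNS = [r"\bbetter monitoring\b", r"\bimprove (the )?documentation\b"]
# Owner values that mean nobody owns the item (also an assumption).
PLACEHOLDER_OWNERS = {"", "team", "the team", "tbd"}


def lint_action_item(title: str, owner: str, deadline: Optional[str]) -> list[str]:
    """Return the anti-patterns found in one action item (empty list if clean)."""
    problems = []
    if any(re.search(p, title, re.IGNORECASE) for p in VAGUE_PATTERNS):
        problems.append("vague title: name the metric, threshold, or runbook")
    if owner.strip().lower() in PLACEHOLDER_OWNERS:
        problems.append("no named human owner")
    if not deadline:
        problems.append("no deadline")
    return problems


for problem in lint_action_item("Better monitoring", "the team", None):
    print(problem)
```

A check like this only catches wording. It can't tell you whether the alert threshold is the right one, so it supplements the facilitator rather than replacing them.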

When to call a postmortem

Not every incident needs one. My rule:

Sev 1 / Sev 2: always.
Sev 3 with novel root cause: yes.
Sev 3 with familiar root cause: short writeup, no meeting.
Near-miss that would have been Sev 2: yes.
“We had to manually intervene but no users were affected”: yes; these are the cheapest learning opportunities.

The last category is the most underused. Near-misses are gifts. They tell you about a failure mode without the user pain. Treat them with the same seriousness as a real incident.
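The rule above fits in a small decision function, which is handy if your incident tooling opens postmortem tickets automatically. This is a sketch under two assumptions the list doesn't state: incidents below Sev 3 that match no other rule get no postmortem, and a near-miss that would have been worse than Sev 2 is treated like one that would have been Sev 2. All names are illustrative.

```python
from enum import Enum
from typing import Optional


class Review(Enum):
    FULL = "postmortem with meeting"
    WRITEUP = "short writeup, no meeting"
    NONE = "no postmortem"


def review_for(
    severity: Optional[int] = None,
    *,
    novel_root_cause: bool = False,
    near_miss_severity: Optional[int] = None,
    manual_intervention: bool = False,
) -> Review:
    """Map an incident or near-miss to the kind of review it gets."""
    if severity in (1, 2):
        return Review.FULL
    # Near-miss that would have been Sev 2 (or worse, by assumption).
    if near_miss_severity is not None and near_miss_severity <= 2:
        return Review.FULL
    # Manual intervention with no user impact: the cheapest learning opportunity.
    if manual_intervention:
        return Review.FULL
    if severity == 3:
        return Review.FULL if novel_root_cause else Review.WRITEUP
    # Not covered by the rule in the text; defaulting to no postmortem is an assumption.
    return Review.NONE


print(review_for(3, novel_root_cause=False).value)
```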

The 30/60/90 ritual

A small ritual that has worked for me. Every postmortem gets revisited at 30, 60, and 90 days:

30 days: are the P0 action items done? If not, escalate.
60 days: did the verification step actually run? Did it pass?
90 days: have we had a recurrence? If yes, the postmortem was insufficient. The new postmortem on the recurrence has to address why the first one didn’t catch the property.

This ritual is a calendar invite, owned by an SRE manager, ten minutes long. It sounds boring. It is. Boring is what works.
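Generating the three dates when the postmortem is filed removes one more thing to remember. A minimal sketch that counts calendar days from the postmortem date; the function name is hypothetical, and creating the actual invite is left to whatever calendar tooling you use.

```python
from datetime import date, timedelta

# The three checkpoints and the question each one asks.
CHECKPOINTS = {
    30: "Are the P0 action items done? If not, escalate.",
    60: "Did the verification step actually run? Did it pass?",
    90: "Have we had a recurrence?",
}


def review_schedule(postmortem_date: date) -> list[tuple[date, str]]:
    """Return (date, question) pairs for the 30, 60, and 90 day reviews."""
    return [
        (postmortem_date + timedelta(days=days), question)
        for days, question in sorted(CHECKPOINTS.items())
    ]


for when, question in review_schedule(date(2024, 6, 24)):
    print(when.isoformat(), question)
```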

Gotchas

Blameless in the document, blameful in the hallway. Doesn’t work. If a manager is privately blaming people, the blameless framing fails. Leadership behavior trumps process.
No facilitator. Self-facilitated postmortems by the affected team are biased toward the team’s narrative. Bring an outsider.
Postmortem as performance. Don’t read the postmortem aloud in the meeting. Pre-read. Use the meeting for questions and action items.
Confusing root cause with first cause. A root cause is a property of the system. A first cause is the first event in the timeline. They’re not the same.
Hiding postmortems. Keeping them company-internal is fine. Restricting them to your own org is a mistake. Other teams want to learn from your failures. Let them.
No followup metric. Track “% of P0 action items closed within deadline” as a quarterly number. Make it visible. It will move.
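The followup metric is easy to compute once action items carry a priority, a deadline, and a close date. A sketch under two assumptions: the denominator is P0 items whose deadline falls inside the quarter, and items still open count as missed. The `Item` tuple is a stand-in for whatever your tracker exports.

```python
from datetime import date
from typing import NamedTuple, Optional


class Item(NamedTuple):
    priority: str
    deadline: date
    closed: Optional[date]  # None while the item is still open


def p0_on_time_pct(items: list[Item], quarter_start: date, quarter_end: date) -> Optional[float]:
    """Percent of P0 items due in the quarter that closed on or before their deadline.

    Returns None when no P0 items were due, so an empty quarter is not
    reported as 0% or 100%.
    """
    due = [
        i for i in items
        if i.priority == "P0" and quarter_start <= i.deadline <= quarter_end
    ]
    if not due:
        return None
    on_time = sum(1 for i in due if i.closed is not None and i.closed <= i.deadline)
    return 100 * on_time / len(due)


# Hypothetical export: two P0s closed on time, one closed late, one still open.
items = [
    Item("P0", date(2024, 4, 15), date(2024, 4, 12)),
    Item("P0", date(2024, 5, 6), date(2024, 5, 6)),
    Item("P0", date(2024, 5, 20), date(2024, 6, 3)),
    Item("P0", date(2024, 6, 10), None),
    Item("P1", date(2024, 6, 10), date(2024, 6, 1)),
]
print(p0_on_time_pct(items, date(2024, 4, 1), date(2024, 6, 30)))
```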

Wrapping Up

Blameless postmortems work when the artifact is the action item list and the discipline is the followup ritual. They fail when the writeup is the deliverable. If your team is producing documents but not changes, the process is theater. Switch to the action-item-first model, name owners, set deadlines, escalate misses.

This post closes the human-process side of the Digital Immune System pillars. Next, and last for this month, I’ll cover synthetic monitoring paired with canary deploys, which is where you catch failure modes before they become postmortems at all.
