Postmortem Automation with LLMs, Drafts That Don't Lie

May 23, 2025 · 9 min read · by Muhammad Amal programming

TL;DR — Pull timeline from Slack, PagerDuty, and Argo CD into a structured record, let claude-3.7-sonnet draft only the narrative sections, hard-anchor every claim to a source URL, and never let the model invent a root cause.

The first generation of LLM postmortem tools all made the same mistake. They fed the model a Slack channel transcript and asked for a postmortem. The output was confident, well-written, and frequently wrong about timestamps, cause, and impact. A senior SRE would catch the errors in five minutes of checking, but only after spending 30 minutes reading the draft.

The right model is narrower. Treat the postmortem as a structured document with two kinds of fields: timeline data (objective, pulled from systems) and narrative data (subjective, drafted by the model from the timeline data). The model never authors the timeline. It only writes prose that’s traceable back to a structured entry. Every paragraph the model produces has a footnote pointing to a Slack permalink, an Argo CD revision, or a PagerDuty event ID.

The result is a draft that takes the on-call from a blank page to a 30-minute review. They edit the prose, fix the model’s interpretation, and ship the postmortem. They don’t waste time arguing with hallucinated facts.

1. The Postmortem Schema

Treat the postmortem as YAML first. Markdown is the rendering, not the source of truth.

# schema/postmortem.yaml
incident_id: PD-12345
title: "Checkout 5xx spike, 14 May 2025"
status: draft  # draft | review | published
severity: sev2
window:
  detected_at: 2025-05-14T14:23:11Z
  acknowledged_at: 2025-05-14T14:24:02Z
  resolved_at: 2025-05-14T15:07:33Z
impact:
  users_affected: "approximately 2.1% of checkout requests"
  duration_minutes: 44
  revenue_estimate_usd: 8400
timeline:
  - at: 2025-05-14T14:18:00Z
    source: argocd
    source_url: https://argocd.acme.com/applications/checkout?revision=a3f...
    event: "Argo CD synced checkout to a3f1c9 (PR #4421)"
  - at: 2025-05-14T14:23:11Z
    source: pagerduty
    source_url: https://acme.pagerduty.com/incidents/Q5...
    event: "PagerDuty incident triggered: CheckoutP99Latency"
  - at: 2025-05-14T14:25:42Z
    source: slack
    source_url: https://acme.slack.com/archives/C12/p17158...
    event: "Alice: rolling back to previous revision"
narrative:
  summary: null  # to be drafted
  what_happened: null
  why_it_happened: null
  what_we_did: null
  what_we_learned: null
action_items: []

The model only writes into narrative. It never touches timeline, window, or impact. Those are facts.
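
One way to enforce that boundary in code is to merge the model's output through a guard that can only write the narrative keys. This is a minimal sketch, not part of the original pipeline: apply_narrative and NARRATIVE_FIELDS are hypothetical names, and the document is assumed to be a plain dict loaded from the YAML above.

```python
import copy

# The five narrative keys from the schema; everything else is read-only.
NARRATIVE_FIELDS = ("summary", "what_happened", "why_it_happened",
                    "what_we_did", "what_we_learned")

def apply_narrative(doc: dict, drafted: dict) -> dict:
    """Return a copy of doc with only the narrative section filled in."""
    unknown = set(drafted) - set(NARRATIVE_FIELDS)
    if unknown:
        # The model tried to write outside the narrative section.
        raise ValueError(f"non-narrative keys in model output: {sorted(unknown)}")
    updated = copy.deepcopy(doc)  # leave the caller's document untouched
    updated["narrative"] = {f: drafted.get(f) for f in NARRATIVE_FIELDS}
    return updated
```

Because the function copies the document and rejects unexpected keys, a model response that includes an impact or timeline key fails loudly instead of silently overwriting facts.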

2. Pulling the Timeline

Three sources matter for most incidents: Slack, PagerDuty, and the deploy system. Pull from all three with parallel async calls.

# timeline/collect.py
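import asyncio
from datetime import datetime, timedelta, timezone

# fetch_pagerduty, fetch_slack_messages, fetch_argo_deploys, and fetch_pd_events
# live in their own modules. fetch_pagerduty is assumed to return created_at and
# resolved_at as timezone-aware UTC datetimes.

async def collect_timeline(pd_incident_id: str) -> list[dict]:
    pd = await fetch_pagerduty(pd_incident_id)
    window_start = pd["created_at"] - timedelta(minutes=15)
    # datetime.utcnow() is naive and deprecated; use an aware UTC "now" for open incidents
    window_end = (pd["resolved_at"] or datetime.now(timezone.utc)) + timedelta(minutes=15)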
    channel = pd["custom_details"].get("slack_channel")
    affected_services = pd["custom_details"].get("services", [])

    slack, deploys, pd_events = await asyncio.gather(
        fetch_slack_messages(channel, window_start, window_end),
        fetch_argo_deploys(affected_services, window_start, window_end),
        fetch_pd_events(pd_incident_id),
    )

    events = []
    events.extend([{"at": m["ts"], "source": "slack",
                    "source_url": m["permalink"],
                    "event": f"{m['user']}: {m['text'][:200]}"}
                   for m in slack])
    events.extend([{"at": d["finished_at"], "source": "argocd",
                    "source_url": d["url"],
                    "event": f"Argo CD synced {d['app']} to {d['revision'][:8]}"}
                   for d in deploys])
    events.extend([{"at": e["created_at"], "source": "pagerduty",
                    "source_url": e["html_url"],
                    "event": e["summary"]}
                   for e in pd_events])

    events.sort(key=lambda x: x["at"])
    return events

This produces a clean chronological event list with permalinks. Every event is verifiable. The model will read this and produce prose; the human reviewer can click any link to verify a claim.
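
The sort only gives a true chronological order if every source reports time in the same shape. Slack uses epoch strings, while other APIs typically return ISO 8601 strings, so a normalization step before sorting is worth having. The sketch below is one way to do it; to_utc and sort_events are hypothetical helpers, and treating naive datetimes as UTC is an assumption you should check against your own sources.

```python
from datetime import datetime, timezone

def to_utc(value) -> datetime:
    """Coerce an epoch number/string, ISO 8601 string, or datetime to aware UTC."""
    if isinstance(value, datetime):
        dt = value
    elif isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        text = str(value).strip()
        try:
            # Slack-style epoch strings such as "1747232280.000100"
            dt = datetime.fromtimestamp(float(text), tz=timezone.utc)
        except ValueError:
            # ISO 8601; older Python versions do not parse a trailing "Z"
            dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # Assumption: naive values are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

def sort_events(events: list[dict]) -> list[dict]:
    """Sort timeline events by their normalized UTC time."""
    return sorted(events, key=lambda e: to_utc(e["at"]))
```

Sorting on the normalized value avoids comparing strings against datetimes, which would raise a TypeError, and avoids lexicographic surprises between differently formatted strings.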

3. The Slack Fetcher

Slack’s conversations.history returns paginated messages. Walk the cursor.

# slack/fetch.py
import os
from datetime import datetime, timezone

import httpx

SLACK_TOKEN = os.environ["SLACK_TOKEN"]

async def fetch_slack_messages(channel: str, start, end) -> list[dict]:
    out = []
    cursor = None
    async with httpx.AsyncClient(timeout=15) as c:
        while True:
            params = {
                "channel": channel,
                "oldest": str(int(start.timestamp())),
                "latest": str(int(end.timestamp())),
                "limit": 200,
            }
            if cursor:
                params["cursor"] = cursor
            r = await c.get("https://slack.com/api/conversations.history",
                            headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
                            params=params)
            data = r.json()
            if not data.get("ok"):
                raise RuntimeError(data.get("error"))
            for m in data["messages"]:
                # drop join notices and bot posts before paying for a permalink call
                if m.get("subtype") in ("channel_join", "bot_message"):
                    continue
                # one extra request per message; watch rate limits on busy channels
                pr = await c.get("https://slack.com/api/chat.getPermalink",
                                 headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
                                 params={"channel": channel, "message_ts": m["ts"]})
                # Slack ts is epoch seconds; convert explicitly to UTC so the "Z" suffix is true
                at = datetime.fromtimestamp(float(m["ts"]), tz=timezone.utc)
                out.append({
                    "ts": at.isoformat().replace("+00:00", "Z"),
                    "user": m.get("user", "unknown"),
                    "text": m.get("text", ""),
                    "permalink": pr.json().get("permalink"),
                })
            # an empty or missing next_cursor means the last page was reached
            cursor = data.get("response_metadata", {}).get("next_cursor")
            if not cursor:
                break
    return out

Skipping bot_message and channel_join cuts the noise by 70%. If you have status update bots posting every 5 minutes, you’ll want to skip those too.
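
Skipping specific status bots can be expressed as a small predicate that the fetch loop calls before requesting a permalink. The sketch below assumes you keep a set of bot IDs to ignore; is_noise and the example ID are hypothetical, and it relies on bot-posted messages carrying a bot_id field.

```python
# Subtypes the fetcher already skips.
SKIPPED_SUBTYPES = {"channel_join", "bot_message"}

def is_noise(message: dict, ignored_bot_ids: set[str]) -> bool:
    """Return True for messages that should not enter the timeline."""
    if message.get("subtype") in SKIPPED_SUBTYPES:
        return True
    # Messages posted by apps and integrations carry a bot_id;
    # human messages have none, so .get() yields None and the check is False.
    return message.get("bot_id") in ignored_bot_ids
```

Keeping the ignore list explicit, rather than dropping every message with a bot_id, preserves bots whose output is evidence, such as a deploy notifier.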

4. The Narrative Drafter

The drafting prompt is strict. No invention, footnote everything, blameless tone.

# narrative/draft.py
from anthropic import AsyncAnthropic
from pydantic import BaseModel
import json

class Narrative(BaseModel):
    summary: str
    what_happened: str
    why_it_happened: str
    what_we_did: str
    what_we_learned: str

SYSTEM = """You draft postmortem narrative sections from a structured timeline.
Rules:
1. Use only facts from the provided timeline. Never invent timestamps, names,
   or technical details.
2. Every claim must be footnotable to a timeline entry. Inline footnotes like
   [^1], [^2] where [^N] is the timeline index (0-indexed).
3. Blameless tone. No names. Use "the on-call engineer", "the deploy".
4. Past tense.
5. "Why it happened" is the proximate cause only. Do not speculate about
   contributing factors. Set what_we_learned to a sentence acknowledging the
   draft is partial if the timeline doesn't support a confident conclusion.
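6. Each section under 200 words.
7. Respond with a single JSON object whose keys are summary, what_happened,
   why_it_happened, what_we_did, and what_we_learned. No text outside the JSON."""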

claude = AsyncAnthropic()

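async def draft(timeline: list[dict], impact: dict,
                retry_issues: list[str] | None = None) -> Narrative:
    # attach the 0-based index the model must use in its [^N] footnotes
    indexed = [{"index": i, **e} for i, e in enumerate(timeline)]
    payload = {"timeline": indexed, "impact": impact}
    if retry_issues:
        # second pass: tell the model what the validator rejected
        payload["previous_draft_issues"] = retry_issues
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=2048,
        system=SYSTEM,
        # default=str keeps json.dumps from failing on datetime values
        messages=[{"role": "user", "content": json.dumps(payload, default=str)}],
    )
    return Narrative.model_validate_json(msg.content[0].text)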

Two prompt rules carry the weight: footnote everything, and don’t speculate about why-it-happened beyond the proximate cause. The first kills hallucination. The second kills “the team didn’t follow process” boilerplate that everyone hates.

5. Footnote Validation

Trust nothing. After the model returns the narrative, validate that every footnote points to a real timeline index.

# validate/footnotes.py
import re

# assumed module path, following the file comments in this post
from narrative.draft import Narrative

FOOTNOTE_RE = re.compile(r"\[\^(\d+)\]")

def validate_footnotes(narrative: Narrative, timeline: list[dict]) -> list[str]:
    issues = []
    max_idx = len(timeline) - 1
    # read field names from the class; instance access is deprecated in recent Pydantic releases
    for field in Narrative.model_fields:
        text = getattr(narrative, field)
        for match in FOOTNOTE_RE.finditer(text):
            idx = int(match.group(1))
            # \d+ never matches a sign, so only the upper bound needs checking
            if idx > max_idx:
                issues.append(f"{field}: footnote [^{idx}] out of range")
    # also check the model didn't leave the factual sections without any footnote
    for field in ("what_happened", "why_it_happened", "what_we_did"):
        text = getattr(narrative, field)
        if not FOOTNOTE_RE.search(text):
            issues.append(f"{field}: no footnotes, likely hallucinated")
    return issues

If validation produces issues, retry the draft once with the issues listed in the user message. Don’t retry more than once — the model rarely improves on a third pass.
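
The check above works per field, so a section with one footnoted sentence and three unsupported ones still passes. A stricter variant flags individual sentences that carry no footnote. The sketch below is one possible approach with hypothetical helper names; the sentence splitter is deliberately naive, so abbreviations such as "e.g." will produce false positives, and it assumes footnote markers trail the sentence they support.

```python
import re

FOOTNOTE_RE = re.compile(r"\[\^(\d+)\]")
# A sentence ends at ., !, or ? plus any trailing footnote markers, followed by
# whitespace or end of text. Decimals such as "2.1%" do not match.
SENTENCE_END_RE = re.compile(r"[.!?](?:\s*\[\^\d+\])*(?=\s|$)")

def split_sentences(text: str) -> list[str]:
    """Split prose into sentences, keeping trailing footnote markers attached."""
    sentences, start = [], 0
    for m in SENTENCE_END_RE.finditer(text):
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]

def unfootnoted_sentences(text: str) -> list[str]:
    """Return the sentences that cite no timeline entry."""
    return [s for s in split_sentences(text) if not FOOTNOTE_RE.search(s)]
```

The returned sentences can be appended to the issues list for the retry, or surfaced to the reviewer as TODO markers.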

6. Rendering the Markdown

Final rendering is mechanical. Timeline is a table, narrative has its footnotes hyperlinked back to the timeline.

# render/markdown.py
def render(doc: dict) -> str:
    timeline = doc["timeline"]
    n = doc["narrative"]
    out = [
        f"# {doc['title']}",
        "",
        f"**Status**: {doc['status']}  |  **Severity**: {doc['severity']}",
        f"**Detected**: {doc['window']['detected_at']}  |  "
        f"**Resolved**: {doc['window']['resolved_at']}",
        "",
        "## Impact",
        f"- Users affected: {doc['impact']['users_affected']}",
        f"- Duration: {doc['impact']['duration_minutes']} minutes",
        f"- Revenue estimate: ${doc['impact']['revenue_estimate_usd']:,}",
        "",
        "## Summary", "", n["summary"], "",
        "## What happened", "", n["what_happened"], "",
        "## Why it happened (proximate)", "", n["why_it_happened"], "",
        "## What we did", "", n["what_we_did"], "",
        "## What we learned (draft)", "", n["what_we_learned"], "",
        "## Timeline", "",
        "| Time | Source | Event |",
        "|------|--------|-------|",
    ]
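    for i, e in enumerate(timeline):
        link = f"[link]({e['source_url']})" if e.get("source_url") else ""
        # pipes and newlines in Slack text would break the table row
        event = e["event"].replace("|", "\\|").replace("\n", " ")
        out.append(f"| [{i}] {e['at']} | {e['source']} {link} | {event} |")
    return "\n".join(out)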

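The [N] index in the timeline table matches the [^N] footnotes in the narrative, so a reviewer can go from any claim to its timeline row and source link. For the markers to render as clickable footnotes, the document also needs a matching [^N]: definition for each one; without definitions, most renderers show the markers as literal text.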
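
Markdown footnote references only become links when the document also contains a definition line for each marker. A sketch of generating those lines from the timeline follows; footnote_definitions is a hypothetical helper, and it assumes a renderer with footnote support, which GitHub has but plain CommonMark does not.

```python
import re

FOOTNOTE_RE = re.compile(r"\[\^(\d+)\]")

def footnote_definitions(narrative: dict, timeline: list[dict]) -> list[str]:
    """Build one '[^N]: ...' line per timeline index cited in the narrative."""
    used = set()
    for text in narrative.values():
        used.update(int(n) for n in FOOTNOTE_RE.findall(text or ""))
    lines = []
    for idx in sorted(used):
        if idx >= len(timeline):
            # Out-of-range markers are a validation problem, not a rendering one.
            continue
        entry = timeline[idx]
        url = entry.get("source_url")
        target = f"[{entry['source']}]({url})" if url else entry["source"]
        lines.append(f"[^{idx}]: {entry['at']} {target}")
    return lines
```

Appending the returned lines to the rendered output, after a blank line, lets each marker jump straight to its source.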

7. The Pull Request

The pipeline writes the postmortem to a git repo as a PR. The reviewer is the incident commander.

# pipeline/run.py
async def run(pd_incident_id: str):
    timeline = await collect_timeline(pd_incident_id)
    impact = await estimate_impact(pd_incident_id)
    narrative = await draft(timeline, impact)
    issues = validate_footnotes(narrative, timeline)
    if issues:
        narrative = await draft(timeline, impact, retry_issues=issues)
        # second pass; if still bad, ship with TODOs
    doc = {
        "incident_id": pd_incident_id,
        "title": f"{summarize_title(timeline)}, {timeline[0]['at'][:10]}",
        "status": "draft",
        "severity": detect_severity(pd_incident_id),
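        "window": {
            # the timeline starts 15 minutes before the page, so timeline[0] may be
            # a deploy or chat message; prefer the first PagerDuty event
            "detected_at": next((e["at"] for e in timeline
                                 if e["source"] == "pagerduty"),
                                timeline[0]["at"]),
            "resolved_at": find_resolved_at(timeline),
        },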
        "impact": impact,
        "timeline": timeline,
        "narrative": narrative.model_dump(),
    }
    md = render(doc)
    await open_pr(
        path=f"postmortems/{pd_incident_id}.md",
        content=md,
        title=f"Postmortem draft: {doc['title']}",
        body="Auto-generated draft. Please review timeline accuracy and "
             "rewrite the narrative as needed.",
        reviewers=[commander_for(pd_incident_id)],
    )

The PR description sets the right expectation. This is a draft. Verify the timeline. Rewrite the narrative. The model is a starting point, not the final word.

8. The Action Items Section

Action items are too consequential for the model to author. Leave them blank. The commander fills them in during the review. If you really want the model to suggest candidates, structure it as a separate “candidate” list that’s clearly not committed.

action_items_candidates:
  - description: "Add circuit breaker on inventory-cache client"
    evidence_footnote: 7
    status: candidate  # candidate | proposed | committed
  - description: "Add SLO alert on inventory-cache latency"
    evidence_footnote: 12
    status: candidate

The commander promotes candidates to committed with an owner and due date. This is the only place I’d accept AI suggestions for action items, and only with an explicit candidate status.
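
The promotion step can be made explicit so that nothing becomes committed without an owner and a due date. A minimal sketch follows; promote is a hypothetical helper, and the owner and due_date keys are assumptions that do not appear in the YAML above.

```python
from datetime import date

def promote(candidate: dict, owner: str, due: date) -> dict:
    """Return a committed copy of a candidate action item."""
    if candidate.get("status") not in ("candidate", "proposed"):
        raise ValueError("only candidate or proposed items can be promoted")
    if not owner.strip():
        raise ValueError("a committed action item needs an owner")
    # Copy rather than mutate, so the model's suggestion stays on record.
    return {**candidate, "status": "committed",
            "owner": owner, "due_date": due.isoformat()}
```

Usage is a single call by the commander during review, for example promote(item, "payments-oncall", date(2025, 6, 1)), where the owner name is illustrative.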

9. Common Pitfalls

Four mistakes that bite.

Letting the model infer cause from absence. If the timeline doesn’t show a deploy, the model will sometimes invent one (“a recent deploy likely caused…”). Force the model to cite a footnote for any causal claim. Reject sentences without footnotes during validation.
No deduplication of Slack noise. PagerDuty webhooks and Slack bots often post the same status events. Dedupe by message content similarity, not just by source.
Trying to draft action items. Models are particularly bad at this. They write generic “improve monitoring” items that nobody will own. Leave the section blank.
Skipping the human review. The draft is a draft. If you publish without review you’re back to the bad version of the tool. The 30-minute review is the whole point.
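
For the deduplication pitfall, content similarity can be approximated with the standard library. The sketch below is one possible approach, not the author's implementation: dedupe_events is a hypothetical helper, the 0.9 threshold and 120-second window are arbitrary starting points, and events are assumed to be sorted with ISO 8601 "at" strings.

```python
from datetime import datetime
from difflib import SequenceMatcher

def _parse(at: str) -> datetime:
    # Older Python versions do not parse a trailing "Z".
    return datetime.fromisoformat(at.replace("Z", "+00:00"))

def dedupe_events(events: list[dict], threshold: float = 0.9,
                  window_seconds: int = 120) -> list[dict]:
    """Drop events whose text nearly matches a recent, already-kept event."""
    kept: list[dict] = []
    for e in events:
        text = e["event"].strip().lower()
        duplicate = False
        for k in reversed(kept):
            age = (_parse(e["at"]) - _parse(k["at"])).total_seconds()
            if age > window_seconds:
                break  # events are sorted, so earlier ones are even older
            if SequenceMatcher(None, text, k["event"].strip().lower()).ratio() >= threshold:
                duplicate = True
                break
        if not duplicate:
            kept.append(e)
    return kept
```

The time window matters: two identical "rolling back" messages twenty minutes apart are separate events, while the same alert text from PagerDuty and a Slack bot seconds apart is one.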

10. Troubleshooting

Three failures you’ll hit.

10.1 Timeline has gaps

The Slack channel didn’t exist before the incident, or PagerDuty events weren’t piped to your collector. Pre-create an incident channel at PagerDuty trigger time, and make sure your PagerDuty integration logs every state change.
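
A cheap guard is to check, before drafting, that every expected source contributed at least one event. This is a sketch with hypothetical names (missing_sources, EXPECTED_SOURCES); an empty deploy list can be legitimate, so treat the result as a warning for the reviewer, not a hard failure.

```python
from collections import Counter

# Source labels used by the collector in this post.
EXPECTED_SOURCES = ("slack", "pagerduty", "argocd")

def missing_sources(timeline: list[dict],
                    expected: tuple[str, ...] = EXPECTED_SOURCES) -> list[str]:
    """Return the expected sources that produced no timeline events."""
    counts = Counter(e["source"] for e in timeline)
    return [s for s in expected if counts[s] == 0]
```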

10.2 Model writes “we don’t know” everywhere

Your timeline is too sparse. The model is being honest. Add more sources — Argo CD revisions, GitHub deploy events, Kubernetes events for the affected services. The richer the timeline, the better the narrative.

10.3 Footnotes point to the wrong entries

The model sometimes shifts indices by one. Always 0-index the timeline you send to the model and 0-index the validation. Don’t switch to 1-indexed for display — keep both internal and display 0-indexed to avoid the off-by-one.

11. Wrapping Up

The hardest part of postmortem automation is restraint. Don’t ask the model to do the parts where it’s bad — root cause, action items, impact estimation. Ask it only to write narrative prose footnoted to facts you’ve already pulled. The human review then becomes editing prose, not arguing with hallucinations.

For the incident response state machine that feeds this pipeline, see incident response automation with LangGraph, a step by step tutorial. For the canonical reference on blameless writing, the Etsy debriefing facilitation guide is still the right read after all these years.
