Incident Response Automation with LangGraph, A Step by Step Tutorial

May 16, 2025 · 8 min read · by Muhammad Amal

TL;DR — Model the incident as a state machine with seven nodes, persist state every transition, let the LLM author messages but never decide escalation, and put the human approval gate before any external comms.

Incident response is a process, not a conversation. Most incident bots are conversation-shaped and that’s why they fall over the moment the incident lasts more than ten minutes. If you model the incident as a typed state machine, the right LLM-friendly primitives become obvious: each state has a small set of allowed transitions, each transition produces an artifact, and every artifact is auditable.

LangGraph 0.2 was built for exactly this shape. It’s a small library that gives you typed state, conditional edges, persistence, and human-in-the-loop interrupts. Pair it with claude-3.7-sonnet for the language work — drafting status updates, summarizing investigation findings — and you get a system that runs the boring parts of incident response without surprising anyone.

This tutorial builds the full graph end-to-end. The persistence layer is Postgres. The comms layer is Slack and Statuspage. PagerDuty is the trigger. The graph has seven nodes, four of which are LLM-assisted and three of which are pure plumbing.

1. The State Machine

Seven nodes, with explicit transitions. Draw this on a whiteboard before writing code.

   triggered
      |
      v
   triaging  <----+
      |           |
      v           |  (new info)
   investigating -+
      |
      v
   stabilizing
      |
      v
   resolving
      |
      v
   resolved
      |
      v
   handoff_postmortem

There’s only one cycle, between triaging and investigating. Everything else is forward-only. If you find yourself adding more cycles, your process is wrong, not your graph.
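The diagram can also be written down as data, which makes the "one cycle only" rule checkable in a unit test. This is a minimal sketch: the names ALLOWED_TRANSITIONS and can_transition are illustrative and not part of LangGraph.

```python
# Hypothetical transition table mirroring the diagram above.
# Keys are phases; values are the phases reachable in one step.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "triggered": {"triaging"},
    "triaging": {"investigating"},
    "investigating": {"triaging", "stabilizing"},  # the only cycle (new info)
    "stabilizing": {"resolving"},
    "resolving": {"resolved"},
    "resolved": {"handoff_postmortem"},
    "handoff_postmortem": set(),  # terminal phase
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def backward_edges() -> list[tuple[str, str]]:
    """List every transition that points to an earlier phase in the diagram."""
    order = list(ALLOWED_TRANSITIONS)  # dicts keep insertion order
    return [
        (src, dst)
        for src, targets in ALLOWED_TRANSITIONS.items()
        for dst in sorted(targets)
        if order.index(dst) < order.index(src)
    ]
```

If backward_edges() ever returns more than the single investigating-to-triaging pair, that is the signal to revisit the process rather than the graph.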

2. The State Object

A typed dict with explicit fields. No Any, no implicit globals.

# graph/state.py
from datetime import datetime
from typing import Literal, NotRequired, TypedDict  # NotRequired needs Python 3.11+

class Action(TypedDict):
    name: str
    at: datetime
    actor: str  # "human:alice" or "agent:triage"
    result: str

class StatusUpdate(TypedDict):
    at: datetime
    text: str
    audience: Literal["internal", "public"]
    posted: bool

class IncidentState(TypedDict):
    incident_id: str
    pd_id: str
    started_at: datetime
    services: list[str]
    severity: Literal["sev1", "sev2", "sev3"]
    phase: Literal["triggered", "triaging", "investigating",
                   "stabilizing", "resolving", "resolved",
                   "handoff_postmortem"]
    findings: list[str]
    actions: list[Action]
    updates: list[StatusUpdate]
    commander: str | None
    needs_human: bool
    resolved_at: datetime | None
    # Later sections read these with state.get(...), so they may be absent.
    alert: NotRequired[dict]       # raw alert payload read by the triage node
    stabilized: NotRequired[bool]  # read by the from_stabilize edge

That’s the contract. Every node reads this, transforms it, returns the next version. LangGraph 0.2 handles persistence between transitions.

3. Persisting State

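LangGraph 0.2 has a Postgres checkpointer, published as the separate langgraph-checkpoint-postgres package. Use it. Don’t try to roll your own.

# graph/persist.py
import os
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from psycopg.rows import dict_row
from psycopg_pool import AsyncConnectionPool

async def make_saver() -> AsyncPostgresSaver:
    # from_conn_string() is an async context manager, so the saver is built
    # from a long-lived pool instead. The kwargs are what the saver expects.
    kwargs = {"autocommit": True, "prepare_threshold": 0, "row_factory": dict_row}
    pool = AsyncConnectionPool(os.environ["DATABASE_URL"], open=False, kwargs=kwargs)
    await pool.open()
    saver = AsyncPostgresSaver(pool)
    await saver.setup()
    return saver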

saver.setup() creates the tables on first run. The schema is opinionated but sensible — checkpoints, thread metadata, and writes are separated.

4. The Triage Node

The triage node is the first LLM-assisted node. It reads the alert, calls a small set of read-only tools, and produces a structured triage report.

# graph/nodes/triage.py
import json
from datetime import datetime, timezone

from anthropic import AsyncAnthropic

from graph.state import IncidentState

claude = AsyncAnthropic()

TRIAGE_SYSTEM = """You are an incident triage assistant. Given an alert and recent
context, produce a brief assessment. Be specific and quote evidence. If you are
under 0.6 confidence on severity or scope, set needs_human=true.
Reply with a single JSON object and nothing else, with the keys "assessment"
(string), "severity" ("sev1", "sev2" or "sev3") and "needs_human" (boolean)."""

async def triage_node(state: IncidentState) -> IncidentState:
    # build_triage_context wraps the read-only tools; it is not shown in this post.
    context = await build_triage_context(state)
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1024,
        system=TRIAGE_SYSTEM,
        messages=[{"role": "user", "content": json.dumps({
            "alert": state.get("alert"),
            "context": context,
        })}],
    )
    # json.loads raises if the model ignores the JSON-only instruction.
    report = json.loads(msg.content[0].text)
    state["findings"].append(report["assessment"])
    state["severity"] = report.get("severity", state["severity"])
    state["needs_human"] = report.get("needs_human", False)
    state["actions"].append({
        "name": "triage.assess",
        "at": datetime.now(timezone.utc),  # utcnow() is deprecated since Python 3.12
        "actor": "agent:triage",
        "result": report["assessment"][:200],
    })
    if state["phase"] == "triggered":
        state["phase"] = "triaging"
    return state

The triage report goes into findings. The phase advances to triaging. The model doesn’t decide the next node — the conditional edge does.
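Because the node trusts the model to return JSON, it helps to parse the reply defensively. The sketch below is one possible approach, not part of the original node: parse_triage_report is a hypothetical helper that falls back to human review whenever the reply is not a usable report.

```python
import json

VALID_SEVERITIES = {"sev1", "sev2", "sev3"}


def parse_triage_report(text: str, current_severity: str) -> dict:
    """Parse the model reply into a report dict, failing safe to human review."""
    # Used whenever the reply cannot be trusted: keep the severity we already
    # had and ask for a human instead of guessing.
    fallback = {
        "assessment": text.strip()[:200],
        "severity": current_severity,
        "needs_human": True,
    }
    try:
        raw = json.loads(text)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(raw, dict) or not isinstance(raw.get("assessment"), str):
        return fallback

    severity = raw.get("severity", current_severity)
    needs_human = bool(raw.get("needs_human", False))
    if severity not in VALID_SEVERITIES:
        # An unknown severity label is treated as low confidence.
        severity, needs_human = current_severity, True
    return {
        "assessment": raw["assessment"],
        "severity": severity,
        "needs_human": needs_human,
    }
```

In the triage node this would replace the bare json.loads call, for example report = parse_triage_report(msg.content[0].text, state["severity"]), so a malformed reply routes to human_review instead of raising.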

5. Conditional Edges

This is the heart of LangGraph. Edges decide where to go based on state.

# graph/edges.py
def from_triage(state: IncidentState) -> str:
    if state["needs_human"]:
        return "human_review"
    if state["severity"] == "sev1":
        return "investigate"
    if len(state["findings"]) < 2:
        return "investigate"
    return "stabilize"

def from_investigate(state: IncidentState) -> str:
    if state["needs_human"]:
        return "human_review"
    # if we have a hypothesis and evidence, move on
    if any("hypothesis:" in f for f in state["findings"][-3:]):
        return "stabilize"
    if len(state["actions"]) > 15:
        return "human_review"
    return "investigate"  # keep going

def from_stabilize(state: IncidentState) -> str:
    if state.get("stabilized"):
        return "resolve"
    if len(state["actions"]) > 25:
        return "human_review"
    return "stabilize"

The edges are dumb on purpose. They check state, not reasoning. The LLM never says “let’s move on” — it sets state["stabilized"] = True and the edge handles the routing.
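Since the edges are pure functions of state, they can be tested without LangGraph, a database, or a model. The following is a sketch of a table-driven check for from_investigate. The function body is copied from the listing above so the example runs on its own, and make_state and run_cases are hypothetical helpers.

```python
def from_investigate(state: dict) -> str:
    # Copied from graph/edges.py above so this sketch is self-contained.
    if state["needs_human"]:
        return "human_review"
    if any("hypothesis:" in f for f in state["findings"][-3:]):
        return "stabilize"
    if len(state["actions"]) > 15:
        return "human_review"
    return "investigate"


def make_state(**overrides) -> dict:
    """Build a minimal state holding only the fields the edge reads."""
    base = {"needs_human": False, "findings": [], "actions": []}
    return {**base, **overrides}


# (state, expected route) pairs covering each branch of the edge.
CASES = [
    (make_state(needs_human=True), "human_review"),
    (make_state(findings=["hypothesis: pool exhaustion"]), "stabilize"),
    (make_state(actions=[{}] * 16), "human_review"),
    (make_state(actions=[{}] * 15), "investigate"),
    # A hypothesis older than the last three findings does not count.
    (make_state(findings=["hypothesis: old", "a", "b", "c"]), "investigate"),
]


def run_cases() -> list[str]:
    """Return one message per failing case; an empty list means all passed."""
    failures = []
    for index, (state, expected) in enumerate(CASES):
        actual = from_investigate(state)
        if actual != expected:
            failures.append(f"case {index}: expected {expected}, got {actual}")
    return failures
```

The same pattern applies to from_triage and from_stabilize. Keeping the cases in a table makes it obvious when a new branch has been added without a test.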

6. The Communications Node

Status updates are LLM-drafted but human-approved. The human_review node is a LangGraph 0.2 interrupt — execution pauses until a human resumes the graph.

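# graph/nodes/comms.py
import json
from datetime import datetime, timezone

from langgraph.types import interrupt, Command

from graph.nodes.triage import claude  # reuse the client from the triage module
from graph.state import IncidentState

async def draft_status_update(state: IncidentState) -> IncidentState:
    last_findings = state["findings"][-3:]
    # elapsed_minutes and post_to_statuspage are helpers not shown in this post.
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=400,
        system="""Draft a brief incident status update. No speculation. No
        promises. Past tense for what happened, present tense for current
        state, no future tense unless the commander has committed. Under 100 words.""",
        messages=[{"role": "user", "content": json.dumps({
            "phase": state["phase"],
            "severity": state["severity"],
            "findings": last_findings,
            "minutes_elapsed": elapsed_minutes(state),
        })}],
    )
    draft = msg.content[0].text
    # Interrupt for human approval. On resume the node re-runs from the top and
    # interrupt() returns the resume value, so only the approved text is posted.
    approval = interrupt({
        "type": "status_update_approval",
        "draft": draft,
        "audience": "public",
    })
    approved = bool(approval["approved"])
    text = approval.get("text", draft)
    if approved:
        await post_to_statuspage(text)
    # Record the outcome either way; posted is False when the commander
    # chose to post manually.
    state["updates"].append({
        "at": datetime.now(timezone.utc),
        "text": text,
        "audience": "public",
        "posted": approved,
    })
    return state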

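interrupt() is LangGraph 0.2’s mechanism for human-in-the-loop. The graph pauses, returns to the caller, and waits for Command(resume=...). State is persisted while paused, so the wait survives a process crash or restart. On resume the node re-runs from its first line and interrupt() returns the resume value, which means everything before the interrupt() call, including the drafting request, executes a second time. Keep that code cheap and free of external side effects.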
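The drafting prompt asks for "under 100 words" and no future tense, but a prompt is a request, not a guarantee. One option is a cheap lint pass over the draft before it reaches the commander. This is a heuristic sketch with hypothetical names (lint_draft, FUTURE_MARKERS); the word list is an assumption and will produce false positives.

```python
import re

# Rough markers of a forward-looking commitment. Both straight and curly
# apostrophes are matched because drafts may contain either.
FUTURE_MARKERS = re.compile(
    r"\b(?:will|shall|going to)\b|\b\w+['’]ll\b",
    re.IGNORECASE,
)


def lint_draft(text: str, max_words: int = 100) -> list[str]:
    """Return human-readable warnings about a draft; empty means no findings."""
    problems: list[str] = []
    word_count = len(text.split())
    if word_count >= max_words:
        problems.append(f"{word_count} words; the prompt asks for under {max_words}")
    match = FUTURE_MARKERS.search(text)
    if match:
        problems.append(f"possible future-tense commitment: {match.group(0)!r}")
    return problems
```

The warnings are advisory. A reasonable place for them is alongside the draft in the payload passed to interrupt(), so the commander sees them while editing, and the human still makes the call.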

7. The Slack Interface

The commander interacts via Slack. A few slash commands and interactive blocks.

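# slack/handlers.py
from langgraph.types import Command

# app is the Bolt AsyncApp and graph is the compiled graph; both are created elsewhere.

@app.command("/incident-status")
async def incident_status(ack, command, respond):
    await ack()
    state = await load_state(command["text"].strip())
    # Returning blocks from a Bolt handler sends nothing; respond() posts them.
    await respond(blocks=render_state(state))

@app.action("approve_update")
async def approve_update(ack, action, body):
    await ack()
    incident_id = action["value"].split(":")[0]
    # The path is block_id "edit", then action_id "text", of the input block.
    text = body["state"]["values"]["edit"]["text"]["value"]
    config = {"configurable": {"thread_id": incident_id}}
    await graph.ainvoke(Command(resume={
        "approved": True,
        "text": text,
    }), config=config)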

The approval block has an editable text field pre-filled with the draft. The commander tweaks and clicks approve. The graph resumes with the edited text. If the commander clicks “edit and post manually”, approved=False and the state update reflects it.
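For reference, here is a sketch of what the approval message could look like as Block Kit JSON. The block_id "edit" and action_id "text" are chosen to match the lookup in approve_update above, and the button value starts with the incident ID because the handler splits on ":". The function name approval_blocks, the label text, and the manual_update action ID are assumptions for illustration.

```python
def approval_blocks(incident_id: str, draft: str) -> list[dict]:
    """Build Block Kit blocks for approving or declining a drafted update."""
    return [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Status update for {incident_id}* needs approval.",
            },
        },
        {
            # Editable field pre-filled with the draft. The handler reads it at
            # body["state"]["values"]["edit"]["text"]["value"].
            "type": "input",
            "block_id": "edit",
            "label": {"type": "plain_text", "text": "Public update"},
            "element": {
                "type": "plain_text_input",
                "action_id": "text",
                "multiline": True,
                "initial_value": draft,
            },
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "action_id": "approve_update",
                    "style": "primary",
                    "text": {"type": "plain_text", "text": "Approve and post"},
                    "value": f"{incident_id}:approve",
                },
                {
                    # Hypothetical second button for the manual path.
                    "type": "button",
                    "action_id": "manual_update",
                    "text": {"type": "plain_text", "text": "Edit and post manually"},
                    "value": f"{incident_id}:manual",
                },
            ],
        },
    ]
```

A handler for the manual button would mirror approve_update but resume the graph with approved set to False.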

8. Bringing It All Together

The full graph wiring:

# graph/build.py
from langgraph.graph import StateGraph, END

async def build_graph():
    saver = await make_saver()
    g = StateGraph(IncidentState)
    g.add_node("triage", triage_node)
    g.add_node("investigate", investigate_node)
    g.add_node("stabilize", stabilize_node)
    g.add_node("resolve", resolve_node)
    g.add_node("comms", draft_status_update)
    g.add_node("human_review", human_review_node)
    g.add_node("postmortem_handoff", postmortem_handoff_node)

    g.set_entry_point("triage")
    g.add_conditional_edges("triage", from_triage, {
        "investigate": "investigate",
        "stabilize": "stabilize",
        "human_review": "human_review",
    })
    g.add_conditional_edges("investigate", from_investigate, {
        "investigate": "investigate",
        "stabilize": "stabilize",
        "human_review": "human_review",
    })
    g.add_conditional_edges("stabilize", from_stabilize, {
        "stabilize": "stabilize",
        "resolve": "resolve",
        "human_review": "human_review",
    })
    # comms can be triggered from any node based on time elapsed; that trigger
    # is not shown here, so "comms" has no edges in this listing
    g.add_edge("resolve", "postmortem_handoff")
    g.add_edge("postmortem_handoff", END)
    g.add_edge("human_review", "investigate")

    return g.compile(checkpointer=saver, interrupt_before=["comms"])

interrupt_before=["comms"] makes every external comms attempt require explicit human approval. This is the rule that keeps the bot from posting nonsense to your public statuspage.

9. The PagerDuty Entry Point

# api/pd_webhook.py
import asyncio
from datetime import datetime, timezone

from fastapi import FastAPI, Request

from graph.build import build_graph
from graph.state import IncidentState

app = FastAPI()
graph = None
# Strong references so in-flight graph runs are not garbage-collected.
background_tasks: set[asyncio.Task] = set()

@app.on_event("startup")
async def startup():
    global graph
    graph = await build_graph()

@app.post("/pd/incident")
async def pd_incident(req: Request):
    event = await req.json()
    if event["event"]["event_type"] != "incident.triggered":
        return {"ok": True}
    pd = event["event"]["data"]
    state: IncidentState = {
        "incident_id": pd["id"],
        "pd_id": pd["id"],
        "started_at": datetime.now(timezone.utc),
        "services": [pd.get("service", {}).get("summary", "unknown")],
        "severity": "sev2",
        "phase": "triggered",
        "findings": [],
        "actions": [],
        "updates": [],
        "commander": None,
        "needs_human": False,
        "resolved_at": None,
        "alert": pd,  # raw payload for the triage node
    }
    config = {"configurable": {"thread_id": pd["id"]}}
    # fire and forget; the graph persists itself
    task = asyncio.create_task(graph.ainvoke(state, config=config))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return {"ok": True}

The thread_id is the PagerDuty incident ID. Resuming, querying, or interrupting any incident is just that thread ID.

10. Common Pitfalls

Four mistakes that bite.

Letting the LLM advance the state machine. The LLM produces findings and drafts. The conditional edges advance the state. Mixing the two means the model can decide to skip steps, and it will.
Forgetting to set interrupt_before on comms. Public status updates need human review. Internal Slack updates can be auto-posted, but anything customers see goes through approval.
Storing transient context in state. State is persisted and audited. Don’t store 500 KB of tool results in it. Store findings (the conclusions), not raw evidence.
No timeout on the whole incident. Set a 4-hour ceiling. Past 4 hours, force a human_review and freeze auto-comms. Long incidents need fully human-driven communication.
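The 4-hour ceiling from the last pitfall can be enforced in one place by wrapping the conditional-edge functions. This is a sketch under stated assumptions: past_ceiling and with_ceiling are hypothetical helpers, started_at is assumed to be a timezone-aware datetime, and only the forced route to human_review is covered, not the freezing of auto-comms.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

INCIDENT_CEILING = timedelta(hours=4)


def past_ceiling(
    started_at: datetime,
    now: datetime,
    ceiling: timedelta = INCIDENT_CEILING,
) -> bool:
    """Return True once the incident has run strictly longer than the ceiling."""
    return now - started_at > ceiling


def with_ceiling(
    edge: Callable[[dict], str],
    clock: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
) -> Callable[[dict], str]:
    """Wrap an edge so incidents past the ceiling always route to human_review."""

    def guarded(state: dict) -> str:
        if past_ceiling(state["started_at"], clock()):
            return "human_review"
        return edge(state)

    return guarded
```

Usage would look like g.add_conditional_edges("investigate", with_ceiling(from_investigate), {...}). Every edge mapping in the wiring above already includes human_review, so no mapping changes are needed. The clock parameter exists only so tests can supply a fixed time.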

11. Troubleshooting

Three failures you’ll hit.

11.1 Graph hangs after an interrupt

The graph is still waiting for its resume value. The value you pass as Command(resume=...) becomes the return value of interrupt() inside the paused node. Invoking the graph with a plain dict is treated as fresh input, not as an answer to the pending interrupt, so the paused node never receives it. Pass Command(resume=...) to ainvoke, with a config whose thread_id matches the paused incident.

11.2 Postgres saver tables missing

You forgot await saver.setup() on first run. It’s idempotent, so just call it at startup. Don’t try to create the tables by hand — the schema changes between LangGraph minor versions.

11.3 Investigate loop never terminates

Your conditional edge isn’t catching the exit condition. Add a hard limit on len(state["actions"]) that forces a transition to human_review; from_investigate in section 5 uses 15. Models can chase their tails for a long time.

12. Wrapping Up

Treating incident response as a state machine makes the boundary between human and automation explicit. The graph runs the process. The LLM authors the words. The human approves anything customers see. Once you have this skeleton, adding capabilities is just adding nodes.

For the upstream alert pipeline that feeds this graph, see auto remediation pipelines with LLM agents and Argo Events. For the postmortem handoff, see postmortem automation with LLMs, drafts that don’t lie. The LangGraph team’s own docs cover the persistence and interrupt model in more detail than I have room for here.
