Incident Response Automation with LangGraph: A Step-by-Step Tutorial
TL;DR — Model the incident as a state machine with seven nodes, persist state every transition, let the LLM author messages but never decide escalation, and put the human approval gate before any external comms.
Incident response is a process, not a conversation. Most incident bots are conversation-shaped and that’s why they fall over the moment the incident lasts more than ten minutes. If you model the incident as a typed state machine, the right LLM-friendly primitives become obvious: each state has a small set of allowed transitions, each transition produces an artifact, and every artifact is auditable.
LangGraph 0.2 was built for exactly this shape. It’s a small library that gives you typed state, conditional edges, persistence, and human-in-the-loop interrupts. Pair it with claude-3.7-sonnet for the language work — drafting status updates, summarizing investigation findings — and you get a system that runs the boring parts of incident response without surprising anyone.
This tutorial builds the full graph end-to-end. The persistence layer is Postgres. The comms layer is Slack and Statuspage. PagerDuty is the trigger. The graph has seven nodes, four of which are LLM-assisted and three of which are pure plumbing.
1. The State Machine
Seven phases, with explicit transitions. These are the values the phase field can take, not the graph nodes that come later. Draw this on a whiteboard before writing code.
triggered
    |
    v
triaging <-----+
    |          |
    v          | (new info)
investigating -+
    |
    v
stabilizing
    |
    v
resolving
    |
    v
resolved
    |
    v
handoff_postmortem
There’s only one cycle, between triaging and investigating. Everything else is forward-only. If you find yourself adding more cycles, your process is wrong, not your graph.
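The diagram can be written down as a lookup table before any LangGraph code exists. The sketch below is an illustration rather than part of the tutorial's graph: the names ALLOWED, can_transition, and assert_transition are hypothetical, and the table follows the diagram exactly, including its single backward edge.

```python
# Allowed phase transitions, read straight off the diagram.
# The only backward edge is investigating -> triaging (new info).
ALLOWED: dict[str, set[str]] = {
    "triggered": {"triaging"},
    "triaging": {"investigating"},
    "investigating": {"triaging", "stabilizing"},
    "stabilizing": {"resolving"},
    "resolving": {"resolved"},
    "resolved": {"handoff_postmortem"},
    "handoff_postmortem": set(),
}


def can_transition(current: str, nxt: str) -> bool:
    """True if the diagram has an arrow from `current` to `nxt`."""
    return nxt in ALLOWED.get(current, set())


def assert_transition(current: str, nxt: str) -> None:
    """Raise instead of silently accepting a jump the process does not allow."""
    if not can_transition(current, nxt):
        raise ValueError(f"illegal phase transition: {current} -> {nxt}")
```

A node that changes phase could call assert_transition first, so a skipped step fails loudly in tests rather than showing up in a postmortem.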
2. The State Object
A typed dict with explicit fields. No Any, no implicit globals.
# graph/state.py
from datetime import datetime
from typing import Literal, NotRequired, TypedDict  # NotRequired needs Python 3.11+

class Action(TypedDict):
    name: str
    at: datetime
    actor: str  # "human:alice" or "agent:triage"
    result: str

class StatusUpdate(TypedDict):
    at: datetime
    text: str
    audience: Literal["internal", "public"]
    posted: bool

class IncidentState(TypedDict):
    incident_id: str
    pd_id: str
    started_at: datetime
    services: list[str]
    severity: Literal["sev1", "sev2", "sev3"]
    phase: Literal["triggered", "triaging", "investigating",
                   "stabilizing", "resolving", "resolved",
                   "handoff_postmortem"]
    findings: list[str]
    actions: list[Action]
    updates: list[StatusUpdate]
    commander: str | None
    needs_human: bool
    resolved_at: datetime | None
    alert: NotRequired[dict]       # raw alert payload, read by the triage node
    stabilized: NotRequired[bool]  # set during stabilization, read by from_stabilize
That’s the contract. Every node reads this, transforms it, returns the next version. LangGraph 0.2 handles persistence between transitions.
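"Returns the next version" is easiest to keep honest when helpers build a new dict instead of editing the old one. Here is a minimal sketch of that idea, assuming a hypothetical helper named with_action. It is not a LangGraph API, just plain Python over the state shape above.

```python
from datetime import datetime, timezone
from typing import TypedDict


class Action(TypedDict):
    name: str
    at: datetime
    actor: str
    result: str


def with_action(state: dict, name: str, actor: str, result: str) -> dict:
    """Return a new state with one more audit entry; the input is left untouched."""
    entry: Action = {
        "name": name,
        "at": datetime.now(timezone.utc),
        "actor": actor,
        "result": result[:200],  # keep audit entries short
    }
    # Copy the top-level dict and build a fresh actions list.
    return {**state, "actions": [*state["actions"], entry]}
```

Because the input is never modified, the previous checkpoint and the next one stay distinct objects, which makes before-and-after comparisons in tests trivial.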
3. Persisting State
LangGraph 0.2 ships with a PostgresSaver checkpointer. Use it. Don’t try to roll your own.
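# graph/persist.py
import os
from psycopg.rows import dict_row
from psycopg_pool import AsyncConnectionPool
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

async def make_saver() -> AsyncPostgresSaver:
    # from_conn_string() is an async context manager, so a long-lived
    # service keeps its own pool open and hands it to the saver.
    pool = AsyncConnectionPool(os.environ["DATABASE_URL"], open=False,
                               kwargs={"autocommit": True, "row_factory": dict_row})
    await pool.open()
    saver = AsyncPostgresSaver(pool)
    await saver.setup()  # idempotent; creates the tables on first run
    return saver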
saver.setup() creates the tables on first run. The schema is opinionated but sensible — checkpoints, thread metadata, and writes are separated.
4. The Triage Node
The triage node is the first LLM-assisted node. It reads the alert, calls a small set of read-only tools, and produces a structured triage report.
# graph/nodes/triage.py
import json
from datetime import datetime, timezone

from anthropic import AsyncAnthropic

from graph.state import IncidentState

claude = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

TRIAGE_SYSTEM = """You are an incident triage assistant. Given an alert and recent
context, produce a brief assessment. Be specific and quote evidence. If you are
under 0.6 confidence on severity or scope, set needs_human=true.
Reply with one JSON object and nothing else, with the keys "assessment" (string),
"severity" ("sev1", "sev2", or "sev3"), and "needs_human" (boolean)."""

async def triage_node(state: IncidentState) -> IncidentState:
    # build_triage_context (not shown) runs the read-only tools
    context = await build_triage_context(state)
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1024,
        system=TRIAGE_SYSTEM,
        messages=[{"role": "user", "content": json.dumps({
            "alert": state.get("alert"),
            "context": context,
        })}],
    )
    # The prompt asks for bare JSON; json.loads raises if the model strays
    report = json.loads(msg.content[0].text)
    state["findings"].append(report["assessment"])
    state["severity"] = report.get("severity", state["severity"])
    state["needs_human"] = report.get("needs_human", False)
    state["actions"].append({
        "name": "triage.assess",
        "at": datetime.now(timezone.utc),
        "actor": "agent:triage",
        "result": report["assessment"][:200],
    })
    if state["phase"] == "triggered":
        state["phase"] = "triaging"
    return state
The triage report goes into findings. The phase advances to triaging. The model doesn’t decide the next node — the conditional edge does.
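Model output is text first and JSON second, so the parsing step deserves its own guard. The following is a sketch under stated assumptions: parse_triage_report is a hypothetical helper, the 500-character cap is arbitrary, and the policy of failing toward human review is my reading of the tutorial's intent rather than something the original code does.

```python
import json

VALID_SEVERITIES = ("sev1", "sev2", "sev3")


def parse_triage_report(text: str, current_severity: str) -> dict:
    """Turn raw model text into a report dict, failing toward human review."""
    fallback = {
        "assessment": text.strip()[:500],
        "severity": current_severity,
        "needs_human": True,
    }
    try:
        # Tolerate prose or code fences around the JSON object.
        raw = json.loads(text[text.index("{"): text.rindex("}") + 1])
    except ValueError:  # no braces, or invalid JSON (JSONDecodeError is a ValueError)
        return fallback

    assessment = raw.get("assessment")
    severity = raw.get("severity")
    if not isinstance(assessment, str) or not assessment.strip():
        return fallback
    severity_ok = severity in VALID_SEVERITIES
    return {
        "assessment": assessment.strip(),
        "severity": severity if severity_ok else current_severity,
        # An unknown severity forces a human look, whatever the model claimed.
        "needs_human": bool(raw.get("needs_human", False)) or not severity_ok,
    }
```

With a helper like this, a reply that is not valid JSON becomes a finding plus needs_human=True instead of an exception in the middle of an incident.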
5. Conditional Edges
This is the heart of LangGraph. Edges decide where to go based on state.
# graph/edges.py
from graph.state import IncidentState

def from_triage(state: IncidentState) -> str:
    if state["needs_human"]:
        return "human_review"
    if state["severity"] == "sev1":
        return "investigate"
    if len(state["findings"]) < 2:
        return "investigate"
    return "stabilize"

def from_investigate(state: IncidentState) -> str:
    if state["needs_human"]:
        return "human_review"
    # if we have a hypothesis and evidence, move on
    if any("hypothesis:" in f for f in state["findings"][-3:]):
        return "stabilize"
    if len(state["actions"]) > 15:
        return "human_review"
    return "investigate"  # keep going

def from_stabilize(state: IncidentState) -> str:
    if state.get("stabilized"):
        return "resolve"
    if len(state["actions"]) > 25:
        return "human_review"
    return "stabilize"
The edges are dumb on purpose. They check state, not reasoning. The LLM never says “let’s move on” — it sets state["stabilized"] = True and the edge handles the routing.
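Whatever sets the flag, the edge only needs a boolean. One way to keep that boolean grounded is to derive it from a metric window rather than from model prose. This is a swapped-in technique for illustration, not the original design: is_stabilized and stabilize_update are hypothetical names, and the 1% threshold and five-sample window are made-up defaults.

```python
def is_stabilized(error_rates: list[float],
                  threshold: float = 0.01,
                  window: int = 5) -> bool:
    """True when the last `window` samples are all at or below `threshold`."""
    if len(error_rates) < window:
        return False  # not enough evidence yet
    return all(rate <= threshold for rate in error_rates[-window:])


def stabilize_update(error_rates: list[float]) -> dict:
    """Partial state update a stabilize node could return."""
    return {"stabilized": is_stabilized(error_rates)}
```

The edge stays exactly as written: it reads state["stabilized"] and routes, with no knowledge of how the value was produced.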
6. The Communications Node
Status updates are LLM-drafted but human-approved. The approval step is a LangGraph 0.2 interrupt: execution pauses until a human resumes the graph.
# graph/nodes/comms.py
import json
from datetime import datetime, timezone

from anthropic import AsyncAnthropic
from langgraph.types import interrupt

from graph.state import IncidentState

claude = AsyncAnthropic()

# elapsed_minutes and post_to_statuspage are helpers defined elsewhere

async def draft_status_update(state: IncidentState) -> IncidentState:
    last_findings = state["findings"][-3:]
    # On resume, LangGraph re-runs this node from the top, so the drafting
    # call repeats; the text that gets posted is the approved one.
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=400,
        system="""Draft a brief incident status update. No speculation. No
promises. Past tense for what happened, present tense for current
state, no future tense unless the commander has committed. Under 100 words.""",
        messages=[{"role": "user", "content": json.dumps({
            "phase": state["phase"],
            "severity": state["severity"],
            "findings": last_findings,
            "minutes_elapsed": elapsed_minutes(state),
        })}],
    )
    draft = msg.content[0].text
    # interrupt for human approval; returns the value from Command(resume=...)
    approval = interrupt({
        "type": "status_update_approval",
        "draft": draft,
        "audience": "public",
    })
    if approval["approved"]:
        await post_to_statuspage(approval["text"])
    # record the outcome either way; posted=False covers manual posting
    state["updates"].append({
        "at": datetime.now(timezone.utc),
        "text": approval["text"],
        "audience": "public",
        "posted": approval["approved"],
    })
    return state
interrupt() is LangGraph 0.2’s mechanism for human-in-the-loop. The graph pauses, returns to the caller, and waits for Command(resume=...). State is persisted while paused. If the process crashes, the graph resumes from the same point.
7. The Slack Interface
The commander interacts via Slack. A few slash commands and interactive blocks.
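# slack/handlers.py
from langgraph.types import Command
from slack_bolt.async_app import AsyncApp

app = AsyncApp()  # reads SLACK_BOT_TOKEN and SLACK_SIGNING_SECRET from the environment

# load_state, render_state, and the compiled graph are defined elsewhere

@app.command("/incident-status")
async def incident_status(ack, command, respond):
    await ack()
    state = await load_state(command["text"])
    blocks = render_state(state)
    # Bolt does not send a handler's return value; reply through respond()
    await respond(blocks=blocks)

@app.action("approve_update")
async def approve_update(ack, action, body):
    await ack()
    incident_id = action["value"].split(":")[0]
    # block_id "edit", action_id "text" on the input block
    text = body["state"]["values"]["edit"]["text"]["value"]
    config = {"configurable": {"thread_id": incident_id}}
    await graph.ainvoke(Command(resume={
        "approved": True,
        "text": text,
    }), config=config)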
The approval block has an editable text field pre-filled with the draft. The commander tweaks and clicks approve. The graph resumes with the edited text. If the commander clicks “edit and post manually”, approved=False and the state update reflects it.
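The handler reads body["state"]["values"]["edit"]["text"]["value"], which implies an input block with block_id "edit" and an element with action_id "text". A sketch of a Block Kit payload that would satisfy it is below. The function name approval_blocks, the post_manually action ID, and the suffix after the colon in each button value are assumptions made for illustration.

```python
def approval_blocks(incident_id: str, draft: str) -> list[dict]:
    """Build the Slack blocks for a status update approval message."""
    return [
        {
            "type": "section",
            "text": {"type": "mrkdwn",
                     "text": f"*Status update draft for {incident_id}*"},
        },
        {
            "type": "input",
            "block_id": "edit",  # read by the approve_update handler
            "label": {"type": "plain_text", "text": "Status update"},
            "element": {
                "type": "plain_text_input",
                "action_id": "text",  # read by the approve_update handler
                "multiline": True,
                "initial_value": draft,
            },
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "action_id": "approve_update",
                    "style": "primary",
                    "text": {"type": "plain_text", "text": "Approve and post"},
                    # the handler takes everything before the first colon
                    "value": f"{incident_id}:approve",
                },
                {
                    "type": "button",
                    "action_id": "post_manually",  # hypothetical second action
                    "text": {"type": "plain_text", "text": "Edit and post manually"},
                    "value": f"{incident_id}:manual",
                },
            ],
        },
    ]
```

A second handler registered for post_manually would resume the graph with approved=False, which is the branch the paragraph above describes.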
8. Bringing It All Together
The full graph wiring:
# graph/build.py
from langgraph.graph import StateGraph, END

from graph.edges import from_triage, from_investigate, from_stabilize
from graph.persist import make_saver
from graph.state import IncidentState
# the node functions are imported from graph/nodes/ (not all shown here)

async def build_graph():
    saver = await make_saver()
    g = StateGraph(IncidentState)
    g.add_node("triage", triage_node)
    g.add_node("investigate", investigate_node)
    g.add_node("stabilize", stabilize_node)
    g.add_node("resolve", resolve_node)
    g.add_node("comms", draft_status_update)
    g.add_node("human_review", human_review_node)
    g.add_node("postmortem_handoff", postmortem_handoff_node)
    g.set_entry_point("triage")
    g.add_conditional_edges("triage", from_triage, {
        "investigate": "investigate",
        "stabilize": "stabilize",
        "human_review": "human_review",
    })
    g.add_conditional_edges("investigate", from_investigate, {
        "investigate": "investigate",
        "stabilize": "stabilize",
        "human_review": "human_review",
    })
    g.add_conditional_edges("stabilize", from_stabilize, {
        "stabilize": "stabilize",
        "resolve": "resolve",
        "human_review": "human_review",
    })
    # comms can be triggered from any node based on time elapsed
    # (the edges into and out of comms are not shown here)
    g.add_edge("resolve", "postmortem_handoff")
    g.add_edge("postmortem_handoff", END)
    g.add_edge("human_review", "investigate")
    return g.compile(checkpointer=saver, interrupt_before=["comms"])
interrupt_before=["comms"] makes every external comms attempt require explicit human approval. This is the rule that keeps the bot from posting nonsense to your public statuspage.
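One wiring detail is easy to miss: human_review always routes back to investigate, and from_investigate checks needs_human first. If the review step leaves the flag set, the graph bounces between the two nodes. The sketch below shows the state update a human_review node could apply once a commander responds. apply_review is a hypothetical helper, and the shape of the decision dict (commander, severity, note) is an assumption.

```python
def apply_review(state: dict, decision: dict) -> dict:
    """Fold a commander's decision into a partial state update."""
    # Clear the flag so the edge out of investigate can move on.
    update: dict = {"needs_human": False}
    if decision.get("commander"):
        update["commander"] = decision["commander"]
    if decision.get("severity") in ("sev1", "sev2", "sev3"):
        update["severity"] = decision["severity"]
    if decision.get("note"):
        # Human input is recorded as a finding, prefixed so it is auditable.
        update["findings"] = [*state["findings"], f"human: {decision['note']}"]
    return update
```

Escalation stays with the human here: severity changes only when the commander's decision says so, and an unrecognized value is ignored.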
9. The PagerDuty Entry Point
# api/pd_webhook.py
import asyncio
from datetime import datetime, timezone

from fastapi import FastAPI, Request

from graph.build import build_graph
from graph.state import IncidentState

app = FastAPI()
graph = None
_tasks: set[asyncio.Task] = set()  # strong references keep running tasks alive

@app.on_event("startup")
async def startup():
    global graph
    graph = await build_graph()

@app.post("/pd/incident")
async def pd_incident(req: Request):
    event = await req.json()
    if event["event"]["event_type"] != "incident.triggered":
        return {"ok": True}
    pd = event["event"]["data"]
    state: IncidentState = {
        "incident_id": pd["id"],
        "pd_id": pd["id"],
        "started_at": datetime.now(timezone.utc),
        "services": [pd.get("service", {}).get("summary", "unknown")],
        "severity": "sev2",
        "phase": "triggered",
        "findings": [],
        "actions": [],
        "updates": [],
        "commander": None,
        "needs_human": False,
        "resolved_at": None,
        "alert": pd,  # read by the triage node via state.get("alert")
    }
    config = {"configurable": {"thread_id": pd["id"]}}
    # fire and forget; the graph persists itself
    task = asyncio.create_task(graph.ainvoke(state, config=config))
    _tasks.add(task)
    task.add_done_callback(_tasks.discard)
    return {"ok": True}
The thread_id is the PagerDuty incident ID. Resuming, querying, or interrupting any incident is just that thread ID.
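Querying an incident by thread ID can be sketched with the compiled graph's aget_state method, which returns a snapshot carrying values (the persisted state) and next (the nodes due to run). The helper names thread_config and load_state below are illustrative. The Slack handler earlier calls a load_state with a single argument, so treat this two-argument version as one possible shape rather than the tutorial's own helper.

```python
def thread_config(incident_id: str) -> dict:
    """Build the config that addresses one incident's checkpoints."""
    return {"configurable": {"thread_id": incident_id}}


async def load_state(graph, incident_id: str) -> dict:
    """Fetch the latest checkpoint for an incident from a compiled graph."""
    snapshot = await graph.aget_state(thread_config(incident_id))
    return {
        "values": snapshot.values,          # the persisted IncidentState
        "waiting_on": list(snapshot.next),  # non-empty while the graph is paused
    }
```

A non-empty waiting_on list is a quick way to tell a paused incident from a finished one when rendering /incident-status.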
10. Common Pitfalls
Four mistakes that bite.
- Letting the LLM advance the state machine. The LLM produces findings and drafts. The conditional edges advance the state. Mixing the two means the model can decide to skip steps, and it will.
- Forgetting to set interrupt_before on comms. Public status updates need human review. Internal Slack updates can be auto-posted, but anything customers see goes through approval.
- Storing transient context in state. State is persisted and audited. Don’t store 500 KB of tool results in it. Store findings (the conclusions), not raw evidence.
- No timeout on the whole incident. Set a 4-hour ceiling. Past 4 hours, force a human_review and freeze auto-comms. Long incidents need fully human-driven communication.
11. Troubleshooting
Three failures you’ll hit.
11.1 Graph hangs after an interrupt
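You’re not calling Command(resume=...) correctly. The resume value becomes the return value of interrupt() inside the paused node, and LangGraph 0.2 finds that node through the checkpoint for the thread. Pass the Command itself as the first argument to ainvoke, with the same thread_id in config that started the run. If you pass the approval dict as plain input, or use a different thread_id, the original interrupt is never answered.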
11.2 Postgres saver tables missing
You forgot await saver.setup() on first run. It’s idempotent, so just call it at startup. Don’t try to create the tables by hand — the schema changes between LangGraph minor versions.
11.3 Investigate loop never terminates
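Your conditional edge isn’t catching the exit condition. Add a hard limit, as from_investigate does: once len(state["actions"]) > 15, force a transition to human_review. The limit only works if every pass through investigate appends to actions, so check that first. Models can chase their tails for a long time.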
12. Wrapping Up
Treating incident response as a state machine makes the boundary between human and automation explicit. The graph runs the process. The LLM authors the words. The human approves anything customers see. Once you have this skeleton, adding capabilities is just adding nodes.
For the upstream alert pipeline that feeds this graph, see “Auto remediation pipelines with LLM agents and Argo Events”. For the postmortem handoff, see “Postmortem automation with LLMs, drafts that don’t lie”. The LangGraph team’s own docs cover the persistence and interrupt model in more detail than I have room for here.