Stateful Agent Graphs, Checkpointing and Human in the Loop

February 10, 2026 · by Muhammad Amal

TL;DR: Checkpointers make every super-step durable and resumable. interrupt() pauses the graph and surfaces a payload for a human. Resume by re-invoking the same thread with a Command.

An agent that runs to completion in one shot is the easy case. The hard case — and the realistic one — is an agent that needs to pause. Pause because a human must approve a risky action. Pause because the process crashed and you want to resume, not restart. Pause because a long workflow has to survive a deploy.

All three needs reduce to one capability: durable state. LangGraph’s checkpointer system gives you that. After every super-step, the framework snapshots the entire graph state to a backend you choose. The graph becomes resumable from any point, inspectable after the fact, and — combined with the interrupt() primitive — safely pausable for human review.

I treat checkpointing as non-negotiable for anything that touches production. The in-memory saver is fine for a unit test; everything else needs a real backend. This article wires SQLite and Postgres checkpointers into a LangGraph 0.3 graph and builds a human-in-the-loop approval gate on top. If you want the agent-team context first, see Building an Autonomous Engineering Squad with LangGraph.

What a Checkpointer Actually Does

A LangGraph run advances in super-steps: one tick of the graph, during which a set of nodes execute. After each super-step, if a checkpointer is attached, LangGraph writes a checkpoint containing the full channel state, the next nodes to run, and metadata. Each checkpoint is keyed by a thread_id and a checkpoint id. The id is a unique, time-ordered string (a UUID) rather than an integer counter, so ids within a thread still sort from oldest to newest.

That gives you three things for free. Durability — a crash loses at most the current super-step. Time travel — you can list a thread’s checkpoints and rewind to any of them. Human-in-the-loop — the graph can stop, persist, and wait indefinitely, because the state is safely on disk rather than in a process that has to stay alive.

super-step 1 ──checkpoint──▶ super-step 2 ──checkpoint──▶ [interrupt]
                                                              │
                                              state persisted, process free
                                                              │
                              human responds ──▶ resume from checkpoint
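
To make the snapshot-per-super-step idea concrete, here is a minimal conceptual sketch built on the standard-library sqlite3 module. It is not LangGraph's schema or API: the ToyCheckpointer class, its table, and the integer step counter (standing in for the real checkpoint id) are all assumptions made for illustration.

```python
import json
import sqlite3


class ToyCheckpointer:
    """Conceptual model only: one row per (thread_id, step). Not LangGraph's schema."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS toy_checkpoints ("
            "thread_id TEXT, step INTEGER, state TEXT, next_nodes TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def put(self, thread_id, state, next_nodes):
        """Append a snapshot after a super-step and return its step number."""
        row = self.conn.execute(
            "SELECT COALESCE(MAX(step), 0) FROM toy_checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        step = row[0] + 1
        self.conn.execute(
            "INSERT INTO toy_checkpoints VALUES (?, ?, ?, ?)",
            (thread_id, step, json.dumps(state), json.dumps(next_nodes)),
        )
        self.conn.commit()
        return step

    def history(self, thread_id):
        """All snapshots for a thread as (step, state, next_nodes), newest first."""
        rows = self.conn.execute(
            "SELECT step, state, next_nodes FROM toy_checkpoints "
            "WHERE thread_id = ? ORDER BY step DESC",
            (thread_id,),
        ).fetchall()
        return [(s, json.loads(st), json.loads(nx)) for s, st, nx in rows]

    def latest(self, thread_id):
        """Newest snapshot for a thread, or None if the thread is unknown."""
        hist = self.history(thread_id)
        return hist[0] if hist else None


saver = ToyCheckpointer(sqlite3.connect(":memory:"))
saver.put("deploy-42", {"proposed_action": "roll out to 10%"}, ["gate"])
saver.put("deploy-42", {"proposed_action": "roll out to 10%", "approved": True}, ["apply"])
step, state, next_nodes = saver.latest("deploy-42")
```

Swap ":memory:" for a file path and the rows outlive the process, which is the property the rest of this article relies on: a resumed run only needs the thread_id to find its latest snapshot.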

Project Setup

# pyproject.toml
[project]
name = "durable-agent"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "langgraph==0.3.5",
    "langgraph-checkpoint-sqlite==2.0.5",
    "langgraph-checkpoint-postgres==2.0.18",
    "langchain-anthropic==0.3.9",
    "psycopg[binary,pool]==3.2.4",
]

python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .
export ANTHROPIC_API_KEY="sk-ant-..."

The SQLite and Postgres savers ship as separate packages — they are not bundled with langgraph itself. Install the one you need.

A Graph Worth Checkpointing

We’ll build a small deployment-decision graph: it drafts a change, then pauses for human approval before “applying” it. Realistic enough to show every primitive.

# state.py
from typing import Annotated, Literal, TypedDict
from operator import add


class DeployState(TypedDict):
    change_request: str
    proposed_action: str
    approved: bool
    log: Annotated[list[str], add]
    outcome: Literal["applied", "rejected", "pending", ""]
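
The Annotated[list[str], add] annotation on log attaches a reducer: when a node returns a log entry, the new list is combined with the existing one via operator.add instead of replacing it, while keys without a reducer are simply overwritten. A plain-Python sketch of that merge rule follows; apply_update and REDUCERS are hypothetical names for illustration, not LangGraph APIs.

```python
from operator import add

# Hypothetical reducer table mirroring DeployState: only "log" accumulates.
REDUCERS = {"log": add}


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state, honoring per-key reducers."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in state:
            merged[key] = reducer(state[key], value)  # accumulate
        else:
            merged[key] = value  # overwrite
    return merged


state = {"proposed_action": "", "log": []}
state = apply_update(state, {"proposed_action": "deploy v2", "log": ["proposed: deploy v2"]})
state = apply_update(state, {"log": ["applied: deploy v2"]})
```

This is why each node in the next listing returns a one-element log list: the reducer appends it to whatever earlier nodes already wrote.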

# nodes.py
from langchain_anthropic import ChatAnthropic
from state import DeployState

_llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


def propose(state: DeployState) -> dict:
    """Draft a concrete deployment action from the change request."""
    prompt = (
        "You are a release engineer. Propose one concrete deployment action "
        f"for this change request, in a single sentence:\n{state['change_request']}"
    )
    try:
        result = _llm.invoke([("human", prompt)])
    except Exception as exc:
        raise RuntimeError(f"propose node failed: {exc}") from exc
    action = result.content if isinstance(result.content, str) else str(result.content)
    return {"proposed_action": action, "log": ["proposed: " + action]}


def apply_change(state: DeployState) -> dict:
    """Execute the approved action. Real impl would call your deploy API."""
    return {"outcome": "applied", "log": ["applied: " + state["proposed_action"]]}


def reject_change(state: DeployState) -> dict:
    return {"outcome": "rejected", "log": ["rejected by reviewer"]}

Adding the Human-in-the-Loop Gate

The interrupt() function is the heart of human-in-the-loop in LangGraph 0.3. Called inside a node, it stops the graph, persists state via the checkpointer, and surfaces a payload to the caller. The graph does not resume until you re-invoke the thread with a Command(resume=...).

# gate.py
from langgraph.types import interrupt
from state import DeployState


def approval_gate(state: DeployState) -> dict:
    """Pause for a human decision on the proposed action."""
    decision = interrupt(
        {
            "question": "Approve this deployment action?",
            "proposed_action": state["proposed_action"],
            "change_request": state["change_request"],
        }
    )
    # Execution reaches here only AFTER a resume.
    # `decision` is whatever value the caller passed to Command(resume=...).
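    # Fail closed: approve only on an explicit boolean True. A bare bool() cast
    # would treat any non-empty string, including "no" or "false", as approval.
    raw = decision.get("approved") if isinstance(decision, dict) else decision
    approved = raw is True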
    return {"approved": approved, "log": [f"human decision: approved={approved}"]}

Two facts about interrupt() that trip people up. First, the node re-runs from the top on resume — code before interrupt() executes twice, so keep it side-effect-free. Second, interrupt() requires a checkpointer; without one there is nowhere to persist the paused state and the call raises.
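
The re-run behavior is easier to see in a simulation. The sketch below uses no LangGraph code at all: Paused, run_node, and both gate functions are hypothetical stand-ins that mimic the pattern of pausing on the first pass and returning the resume value on the second. It shows an unguarded side effect firing twice and a guarded one firing once.

```python
class Paused(Exception):
    """Stand-in for the pause signal; carries the payload shown to the human."""


_NO_RESUME = object()


def run_node(node, resume=_NO_RESUME):
    """Run `node` from the top. Without a resume value the toy interrupt() pauses."""

    def interrupt(payload):
        if resume is _NO_RESUME:
            raise Paused(payload)
        return resume

    return node(interrupt)


side_effects = []
sent_keys = set()  # in a real system this set would live in durable storage


def naive_gate(interrupt):
    side_effects.append("notify reviewer")  # runs on the first pass AND on resume
    return interrupt({"question": "approve?"})


def guarded_gate(interrupt):
    key = "notify:deploy-42"  # hypothetical idempotency key
    if key not in sent_keys:
        sent_keys.add(key)
        side_effects.append("notify reviewer (guarded)")
    return interrupt({"question": "approve?"})


for gate in (naive_gate, guarded_gate):
    try:
        run_node(gate)  # first pass: pauses at interrupt()
    except Paused:
        pass
    decision = run_node(gate, resume=True)  # resume: the node starts again from the top
```

After the loop, side_effects holds two "notify reviewer" entries but only one guarded entry. The guard only helps if the key store survives a restart, so an in-process set like this one is for demonstration only.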

Wiring the Graph

# graph.py
from langgraph.graph import StateGraph, START, END
from state import DeployState
from nodes import propose, apply_change, reject_change
from gate import approval_gate


def route_on_approval(state: DeployState) -> str:
    return "apply" if state["approved"] else "reject"


def build_graph() -> StateGraph:
    g = StateGraph(DeployState)
    g.add_node("propose", propose)
    g.add_node("gate", approval_gate)
    g.add_node("apply", apply_change)
    g.add_node("reject", reject_change)

    g.add_edge(START, "propose")
    g.add_edge("propose", "gate")
    g.add_conditional_edges("gate", route_on_approval, {"apply": "apply", "reject": "reject"})
    g.add_edge("apply", END)
    g.add_edge("reject", END)
    return g

The SQLite Checkpointer

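For single-process apps and local development, SQLite is the right call. The saver opens with a context manager so the connection is cleaned up properly. The script calls .setup() explicitly to create the checkpoint tables; as far as I can tell the SQLite saver also creates them on first use, so the call is a safeguard rather than a requirement.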

# run_sqlite.py
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.types import Command
from graph import build_graph


def main() -> None:
    with SqliteSaver.from_conn_string("checkpoints.db") as saver:
        saver.setup()  # idempotent; creates tables if absent
        app = build_graph().compile(checkpointer=saver)
        config = {"configurable": {"thread_id": "deploy-42"}}

        initial = {
            "change_request": "Roll out the new caching layer to 10% of traffic.",
            "proposed_action": "",
            "approved": False,
            "log": [],
            "outcome": "pending",
        }

        # First invocation runs until interrupt(), then returns.
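        result = app.invoke(initial, config=config)
        # Some LangGraph releases only surface "__interrupt__" when streaming, so fall
        # back to the pending interrupts recorded on the persisted snapshot.
        interrupts = result.get("__interrupt__") or [
            intr for task in app.get_state(config).tasks for intr in task.interrupts
        ]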
        if interrupts:
            payload = interrupts[0].value
            print("PAUSED for approval:")
            print(f"  action: {payload['proposed_action']}")

        # ... time passes, process can exit entirely, state is on disk ...

        # Resume the SAME thread_id with the human's decision.
        final = app.invoke(
            Command(resume={"approved": True}),
            config=config,
        )
        print(f"outcome: {final['outcome']}")
        for line in final["log"]:
            print(f"  {line}")


if __name__ == "__main__":
    main()

The flow is the whole point. The first invoke runs to the interrupt and returns; the pending interrupt is stored in the checkpoint and, on LangGraph releases that report it from invoke, also appears as an __interrupt__ entry in the result. The process could exit here because the state is in checkpoints.db. The second invoke, with a Command(resume=...) and the same thread_id, picks up at the paused node.

The Postgres Checkpointer

For concurrent workloads or multi-instance deployments, SQLite’s single-writer model becomes a bottleneck. Postgres is the production answer. The Postgres saver wants a connection pool.

# run_postgres.py
from psycopg_pool import ConnectionPool
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.types import Command
from graph import build_graph

DB_URI = "postgresql://agent:secret@localhost:5432/agentdb"


def main() -> None:
    pool = ConnectionPool(
        conninfo=DB_URI,
        max_size=20,
        kwargs={"autocommit": True, "prepare_threshold": 0},
    )
    try:
        saver = PostgresSaver(pool)
        saver.setup()  # run once per database; creates checkpoint tables
        app = build_graph().compile(checkpointer=saver)
        config = {"configurable": {"thread_id": "deploy-99"}}

        result = app.invoke(
            {
                "change_request": "Enable the new auth flow for beta users.",
                "proposed_action": "",
                "approved": False,
                "log": [],
                "outcome": "pending",
            },
            config=config,
        )
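        # A non-empty `next` on the snapshot means the thread is paused, even on
        # releases where invoke() does not return an "__interrupt__" key.
        if result.get("__interrupt__") or app.get_state(config).next: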
            print("paused; resuming with rejection")
            final = app.invoke(Command(resume={"approved": False}), config=config)
            print(f"outcome: {final['outcome']}")
    finally:
        pool.close()


if __name__ == "__main__":
    main()

autocommit=True and prepare_threshold=0 in the pool kwargs matter — they avoid prepared-statement conflicts that otherwise surface as cryptic errors under concurrency. The LangGraph persistence guide documents the saver contract in full.
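
The connection string in run_postgres.py embeds a password for brevity. One way to keep it out of source control is to read it from the environment and build the pool arguments in one place. This is a sketch under assumptions: the AGENT_DB_URI and AGENT_DB_POOL_SIZE variable names and the pool_settings helper are hypothetical, and only the kwargs values come from the listing above.

```python
import os
from typing import Mapping


def pool_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for ConnectionPool from environment variables."""
    conninfo = env.get("AGENT_DB_URI")  # hypothetical variable name
    if not conninfo:
        raise RuntimeError("AGENT_DB_URI is not set")
    return {
        "conninfo": conninfo,
        "max_size": int(env.get("AGENT_DB_POOL_SIZE", "20")),
        # Same connection options as in run_postgres.py.
        "kwargs": {"autocommit": True, "prepare_threshold": 0},
    }


settings = pool_settings({"AGENT_DB_URI": "postgresql://agent@localhost:5432/agentdb"})
# pool = ConnectionPool(**settings)  # then proceed as in run_postgres.py
```

Failing fast on a missing variable keeps a misconfigured deploy from silently connecting to the wrong database.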

Inspecting and Time-Traveling State

Because every super-step is checkpointed, you can audit a thread after it runs and even rewind it.

# audit.py  (not inspect.py: that name would shadow the standard-library inspect module)
from langgraph.checkpoint.sqlite import SqliteSaver
from graph import build_graph


def audit(thread_id: str) -> None:
    with SqliteSaver.from_conn_string("checkpoints.db") as saver:
        app = build_graph().compile(checkpointer=saver)
        config = {"configurable": {"thread_id": thread_id}}

        # Current state snapshot.
        snap = app.get_state(config)
        print(f"next nodes: {snap.next}")
        print(f"outcome: {snap.values.get('outcome')}")

        # Full checkpoint history, newest first.
        for state in app.get_state_history(config):
            cid = state.config["configurable"]["checkpoint_id"]
            print(f"  checkpoint {cid[:8]} -> next={state.next}")


if __name__ == "__main__":
    audit("deploy-42")

get_state returns the latest snapshot; get_state_history yields every checkpoint. To resume from an older checkpoint, pass its checkpoint_id in the config — that’s time travel, and it’s invaluable for debugging a run that went sideways.
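
Picking the checkpoint to rewind to is usually a search over the history for the last snapshot that was about to run a given node. The helper below is a sketch: pick_checkpoint is a hypothetical name, and the fake snapshots only imitate the two attributes used in the audit script (next and config).

```python
from types import SimpleNamespace


def pick_checkpoint(history, node: str):
    """Return the config of the newest snapshot whose `next` includes `node`, else None."""
    for snap in history:  # history is expected newest first
        if node in snap.next:
            return snap.config
    return None


def _cfg(checkpoint_id: str) -> dict:
    return {"configurable": {"thread_id": "deploy-42", "checkpoint_id": checkpoint_id}}


# Fake snapshots standing in for what get_state_history would yield.
fake_history = [
    SimpleNamespace(next=(), config=_cfg("c3")),
    SimpleNamespace(next=("gate",), config=_cfg("c2")),
    SimpleNamespace(next=("propose",), config=_cfg("c1")),
]
target = pick_checkpoint(fake_history, "gate")
```

With the real graph the call would be pick_checkpoint(app.get_state_history(config), "gate"). My understanding is that passing the returned config to app.invoke(None, config=target) replays from that point, but verify that against the LangGraph version you have installed.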

Common Pitfalls

Side effects before interrupt(). The node re-executes from the top on resume. Any API call, write, or mutation above the interrupt() line runs twice. Keep the pre-interrupt section pure, or guard side effects so they’re idempotent.

Resuming with a different thread_id. Resume targets the original thread. A new thread_id starts a fresh run with empty state and the human decision is silently lost.

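Skipping .setup(). The Postgres saver needs its tables created before the first checkpoint write; forget .setup() and that write fails with a missing-table error. The SQLite saver appears to create its tables on first use, but an explicit, idempotent .setup() call costs nothing.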

Using InMemorySaver in production. It works, until the process restarts and every paused thread vanishes. In-memory is for tests only.

Treating interrupt()’s return as fixed. The return value is whatever the caller passes to Command(resume=...). Validate it inside the node — never assume it’s the shape you expected.
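
Validation matters because truthiness is a poor proxy for approval: bool("no") is True in Python, so a permissive cast can turn a typed-out refusal into a deploy. A stricter parser might look like the sketch below, where parse_decision is a hypothetical helper and the accepted shapes (a bare boolean, or a dict with an "approved" boolean) are an assumption based on the resume values used in this article.

```python
def parse_decision(raw: object) -> bool:
    """Accept only an explicit boolean, bare or under the "approved" key."""
    if isinstance(raw, dict):
        raw = raw.get("approved")
    if isinstance(raw, bool):
        return raw
    raise ValueError(f"unrecognized approval payload: {raw!r}")


approved = parse_decision({"approved": True})
rejected = parse_decision(False)
```

Whether to raise on a malformed payload or treat it as a rejection is a policy choice. Raising makes the bad input visible; rejecting fails closed without interrupting the run.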

Troubleshooting

Symptom: interrupt() raises about a missing checkpointer. Cause: the graph was compiled without one. Fix: pass checkpointer=saver to .compile(); interrupt() cannot persist paused state otherwise.

Symptom: resume restarts the graph from the beginning. Cause: the thread_id in the resume config differs from the original, or it changed between calls. Fix: reuse the exact same thread_id.

Symptom: a side effect happens twice per run. Cause: the side effect sits before interrupt() in a node that gets re-executed on resume. Fix: move it after interrupt() or make it idempotent.

Symptom: Postgres errors mention prepared statements under load. Cause: the pool isn’t configured for the saver’s access pattern. Fix: set prepare_threshold=0 and autocommit=True in the pool kwargs.

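Symptom: a missing-table error mentioning checkpoints on first write (SQLite phrases it as no such table; Postgres reports that the relation does not exist). Cause: .setup() was never run against this database. Fix: call saver.setup() once before compiling the graph.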

Symptom: an old run’s state leaks into a new one. Cause: thread_id reuse across logically separate runs. Fix: generate a unique thread_id per run, or scope it by request id.
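
For the last symptom, a small helper keeps thread ids unique per run while still letting you tie a thread to an upstream request. The new_thread_id function and its naming scheme are hypothetical; any scheme works as long as two logically separate runs never share an id.

```python
import uuid
from typing import Optional


def new_thread_id(prefix: str, request_id: Optional[str] = None) -> str:
    """Return a thread id scoped by request id, or a random one if none is given."""
    suffix = request_id if request_id else uuid.uuid4().hex
    return f"{prefix}-{suffix}"


scoped = new_thread_id("deploy", "req-7")
random_a = new_thread_id("deploy")
random_b = new_thread_id("deploy")
config = {"configurable": {"thread_id": scoped}}
```

Store the generated id alongside the approval request, since the resume call has to present exactly the same value.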

What’s Next

Durable checkpointing turns a fragile one-shot agent into a system that survives crashes, deploys, and human review cycles. From here, add an expiry policy so paused threads don’t wait forever, expose the __interrupt__ payload through an API for a real approval UI, and layer time-travel debugging into your incident response. Once state is durable, the rest of agent reliability gets a lot easier.
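
As a starting point for the expiry policy, the check below decides whether a paused thread has waited too long. It is a sketch under assumptions: is_expired is a hypothetical helper, and it expects the snapshot timestamp as an ISO 8601 string with a UTC offset, which is how I understand the created_at field on a state snapshot to be formatted.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_expired(created_at: str, ttl: timedelta, now: Optional[datetime] = None) -> bool:
    """True if the snapshot timestamp is older than `ttl` relative to `now`."""
    paused_at = datetime.fromisoformat(created_at)
    if paused_at.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        paused_at = paused_at.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return current - paused_at > ttl


stale = is_expired(
    "2026-02-10T12:00:00+00:00",
    ttl=timedelta(hours=24),
    now=datetime(2026, 2, 12, 12, 0, tzinfo=timezone.utc),
)
```

A periodic job could run this over paused threads and resume the expired ones with an explicit rejection, so the graph still reaches a terminal state and the log records why.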
