Long Running Autonomous Agent Workflows, Checkpoints and Retries
TL;DR: Long-running agent workflows need four things: durable checkpoints, idempotent steps, narrow retry policies, and circuit breakers. Skip any one and you’ll lose work or burn through tokens overnight.
The longest agent run I’ve shipped to production took eleven minutes. The cheapest failure mode I’ve seen lost a forty-minute research run because the host got patched mid-run. The most expensive cost a client about $400 in API charges because a buggy retry loop hammered Claude in a tight cycle. All three came down to the same set of patterns done wrong.
If your agents are doing anything more interesting than synchronous request-response, you’re now building durable workflows. Same problems as Temporal or Airflow, just with LLM costs attached. This post is the playbook I use to make those workflows survive operations realities.
I’ll cover four pillars: checkpointing so runs survive restarts, idempotency so retries don’t double-charge, retry policies that fail fast on bugs and slow on transients, and circuit breakers that cap blast radius. Code targets Python 3.12, langgraph==0.2.74, psycopg[binary,pool]==3.2.4, tenacity==9.0.0, and langgraph-checkpoint-postgres==2.0.13.
1. Checkpoint after every step that costs money
The rule is straightforward. Anything that costs money or has side effects must be checkpointed before the next step starts. If your process dies after the LLM call but before the checkpoint, you’ll re-call the LLM on restart. That’s how you burn $400.
LangGraph’s PostgresSaver handles this if you wire it correctly.
# checkpointed.py
import os
from typing import TypedDict, Annotated

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage
from psycopg.rows import dict_row
from psycopg_pool import ConnectionPool


class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    topic: str
    research: str
    draft: str
    iteration: int


pool = ConnectionPool(
    conninfo=os.environ["DATABASE_URL"],
    max_size=20,
    # PostgresSaver expects autocommit connections that return dict rows
    kwargs={"autocommit": True, "prepare_threshold": 0, "row_factory": dict_row},
)
saver = PostgresSaver(pool)
saver.setup()  # run once at deploy time

llm = ChatOpenAI(model="gpt-4o", temperature=0)


def research(state: State) -> dict:
    notes = llm.invoke([HumanMessage(content=f"Research {state['topic']}")]).content
    return {"research": notes, "messages": [AIMessage(content="research done")]}


def draft(state: State) -> dict:
    text = llm.invoke([HumanMessage(content=f"Draft from: {state['research']}")]).content
    return {"draft": text, "iteration": state.get("iteration", 0) + 1}


builder = StateGraph(State)
builder.add_node("research", research)
builder.add_node("draft", draft)
builder.set_entry_point("research")
builder.add_edge("research", "draft")
builder.add_edge("draft", END)
graph = builder.compile(checkpointer=saver)
The checkpoint happens automatically between nodes. State after research is persisted with the thread ID before draft starts. Kill the process during draft, restart, call graph.invoke(None, config={"configurable": {"thread_id": "t1"}}), and it resumes from the saved state.
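To make the resume behavior concrete without a database, here is a minimal stdlib-only sketch of the same idea. It is not how LangGraph implements checkpointing; TinyCheckpointer and run are hypothetical names, and SQLite stands in for Postgres. The point it shows is the ordering: a step’s result is committed before the next step starts, so a crash in step two never replays step one.

```python
import json
import sqlite3
from typing import Callable

Step = Callable[[dict], dict]


class TinyCheckpointer:
    """Hypothetical helper: stores the latest state and next step index per thread."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ckpt ("
            " thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
        )

    def load(self, thread_id: str) -> tuple[int, dict]:
        row = self.conn.execute(
            "SELECT next_step, state FROM ckpt WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else (0, {})

    def save(self, thread_id: str, next_step: int, state: dict) -> None:
        with self.conn:  # commits before the next step is allowed to start
            self.conn.execute(
                "INSERT OR REPLACE INTO ckpt VALUES (?, ?, ?)",
                (thread_id, next_step, json.dumps(state)),
            )


def run(steps: list[Step], ckpt: TinyCheckpointer, thread_id: str, initial: dict) -> dict:
    """Run steps in order, skipping any that already have a checkpoint."""
    next_step, state = ckpt.load(thread_id)
    if next_step == 0:
        state = dict(initial)  # fresh thread: start from the caller's input
    for i in range(next_step, len(steps)):
        state = {**state, **steps[i](state)}
        ckpt.save(thread_id, i + 1, state)
    return state
```

Calling run a second time with the same thread_id picks up at the first step that has no checkpoint, which mirrors what invoking the graph with None as input does for a saved thread.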
The non-obvious part: checkpointing has a cost. Each node adds a Postgres write, typically 5 to 30 ms. For workflows with hundreds of fast nodes, this adds up. Two ways to manage it.
- Coarse-grained nodes. Do more work per node so you write less frequently. The trade-off is bigger replay cost on failure, but the steady-state savings are real.
- Async checkpointing. LangGraph 0.2 ships AsyncPostgresSaver (in langgraph.checkpoint.postgres.aio) for graphs driven with ainvoke or astream. Where persistence runs as a background task instead of blocking the next step, steady-state is faster but there is a crash window of unwritten checkpoints. I default to sync for anything financially significant and async for things where a few seconds of replay is fine.
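A quick back-of-the-envelope check of the write overhead, using the 5 to 30 ms range quoted above. The node counts (300 fine-grained nodes merged into 30 coarse ones) are assumptions picked for illustration, not measurements.

```python
def checkpoint_overhead_seconds(node_count: int, write_ms: float) -> float:
    """Total time one run spends on checkpoint writes, in seconds."""
    return node_count * write_ms / 1000.0


# Assumed workload: 300 fast nodes, versus the same work merged into 30 nodes.
fine = [checkpoint_overhead_seconds(300, ms) for ms in (5, 30)]
coarse = [checkpoint_overhead_seconds(30, ms) for ms in (5, 30)]

print(f"fine-grained:   {fine[0]:.2f}s to {fine[1]:.2f}s per run")
print(f"coarse-grained: {coarse[0]:.2f}s to {coarse[1]:.2f}s per run")
```

Under those assumptions the fine-grained graph spends 1.5 to 9 seconds per run on writes and the coarse one a tenth of that, at the price of replaying up to ten times more work after a crash.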
2. Idempotency, the unloved superpower
A retry is only safe if the step is idempotent. Pure LLM calls are usually idempotent enough, but anything with a side effect, a database write, an email send, an external API mutation, must be designed for it.
The pattern is an idempotency key derived from the request, threaded through every side effect.
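import hashlib

from langchain_core.runnables import RunnableConfig


def idempotency_key(thread_id: str, node_name: str, step: int) -> str:
    raw = f"{thread_id}:{node_name}:{step}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]


def send_summary_email(state: State, config: RunnableConfig) -> dict:
    # thread_id lives in the run config, not in State
    thread_id = config["configurable"]["thread_id"]
    key = idempotency_key(thread_id, "send_summary_email", state["iteration"])
    # `sg` stands in for your email client; pass the key only if your provider
    # documents idempotency support. Assumes State also declares `recipient` and `sent`.
    sg.send(
        to=state["recipient"],
        subject="Research summary",
        body=state["draft"],
        idempotency_key=key,
    )
    return {"sent": True}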
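The key idea: same input means same key, and the downstream API dedupes. Stripe documents this behavior, but support varies widely between providers, so confirm in the provider’s docs that the key is actually honored before you rely on it. Where it isn’t, dedupe on your own side.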
For your own services, expose an idempotency table.
CREATE TABLE idempotency_records (
    key           TEXT PRIMARY KEY,
    response_body JSONB NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);
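A handler checks the table first, returns the cached response if present, otherwise does the work and writes the row. Do this inside a transaction, and note that SELECT ... FOR UPDATE only locks rows that already exist, so on its own it can’t stop two first-time requests from racing. Serialize on the key (for example with pg_advisory_xact_lock on a hash of the key), or let the primary key arbitrate with INSERT ... ON CONFLICT DO NOTHING and treat a conflict as a duplicate.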
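A minimal sketch of that handler, assuming stdlib sqlite3 as a stand-in for Postgres so it runs anywhere. make_store and run_once are hypothetical names, and INSERT OR IGNORE plays the role that ON CONFLICT DO NOTHING would play in Postgres. It covers sequential retries only; it does not stop two concurrent first attempts from both doing the work, it only guarantees that the first stored response is the one everyone gets back.

```python
import json
import sqlite3
from typing import Callable


def make_store() -> sqlite3.Connection:
    """In-memory stand-in for the idempotency_records table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency_records ("
        " key TEXT PRIMARY KEY, response_body TEXT NOT NULL)"
    )
    return conn


def _lookup(conn: sqlite3.Connection, key: str):
    row = conn.execute(
        "SELECT response_body FROM idempotency_records WHERE key = ?", (key,)
    ).fetchone()
    return None if row is None else json.loads(row[0])


def run_once(conn: sqlite3.Connection, key: str, work: Callable[[], dict]) -> dict:
    """Return the stored response for key, doing the work only if none exists."""
    cached = _lookup(conn, key)
    if cached is not None:
        return cached
    result = work()
    with conn:  # commit the row; the primary key makes the first writer win
        conn.execute(
            "INSERT OR IGNORE INTO idempotency_records (key, response_body) VALUES (?, ?)",
            (key, json.dumps(result)),
        )
    return _lookup(conn, key)  # read back so a losing writer returns the winner's body
```

Reading the row back after the insert is deliberate: if another writer got there first, the caller still returns the response that was actually recorded.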
3. Retry policies, narrow and tiered
The retry mistakes I see most.
Too broad. Retrying on Exception means your KeyError from a typo retries three times and costs you three LLM calls before the bug surfaces.
Too aggressive. Retrying with no backoff or no jitter creates a thundering herd against a flaky upstream.
Too symmetric. Different failure types want different retry behavior. A 429 wants slow backoff, a 500 wants quick retry, a 400 wants no retry at all.
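One way to keep those three behaviors separate is to decide the policy from the status code before any retry machinery runs. This is a sketch under assumptions: RetryDecision and classify_status are hypothetical names, and the delays and attempt counts are placeholders to tune, not recommendations from any SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    initial_delay: float  # seconds to wait before the first retry
    max_attempts: int     # total attempts, including the first call


def classify_status(status: int) -> RetryDecision:
    """Map an HTTP status code to a retry policy (numbers are illustrative)."""
    if status == 429:
        # Rate limited: back off slowly so we stop adding to the pressure.
        return RetryDecision(retry=True, initial_delay=5.0, max_attempts=4)
    if 500 <= status < 600:
        # Server error: often transient, so a quick retry is reasonable.
        return RetryDecision(retry=True, initial_delay=0.5, max_attempts=3)
    # 4xx and anything unexpected: the request itself is wrong, retrying won't help.
    return RetryDecision(retry=False, initial_delay=0.0, max_attempts=1)
```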
tenacity gives you the primitives. Combine them.
from tenacity import (
    retry, stop_after_attempt, wait_exponential_jitter,
    retry_if_exception_type, before_sleep_log,
)
import openai
import logging

log = logging.getLogger(__name__)
# max_retries=0 turns off the SDK's built-in retries so tenacity is the only retry layer
openai_client = openai.OpenAI(max_retries=0)


class TransientLLMError(Exception):
    """Wraps retryable upstream errors."""


@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=30, jitter=2),
    retry=retry_if_exception_type(TransientLLMError),
    before_sleep=before_sleep_log(log, logging.WARNING),
)
def call_llm(prompt: str) -> str:
    try:
        resp = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            timeout=30.0,
        )
        return resp.choices[0].message.content
    except openai.RateLimitError as e:
        raise TransientLLMError("rate limited") from e
    except openai.APITimeoutError as e:
        raise TransientLLMError("timeout") from e
    except openai.InternalServerError as e:
        raise TransientLLMError(f"server error {e.status_code}") from e
    # everything else, no retry
The pattern, classify errors at the boundary. Only the ones you’ve decided are retryable get wrapped in TransientLLMError. Everything else, including your bugs, propagates immediately. The tenacity config retries only the wrapped class, with exponential backoff and jitter.
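To see what that wait configuration means in seconds, here is a stdlib sketch that approximates the delay calculation: exponential growth from the initial value, plus a uniform random jitter, capped at the maximum. backoff_delay is a hypothetical helper written for illustration, not tenacity’s code, so treat the exact formula as an assumption.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    initial: float = 1.0,
    exp_base: float = 2.0,
    max_delay: float = 30.0,
    jitter: float = 2.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to sleep after the given failed attempt (1-based)."""
    rng = rng or random.Random()
    raw = initial * exp_base ** (attempt - 1) + rng.uniform(0, jitter)
    return max(0.0, min(raw, max_delay))


if __name__ == "__main__":
    # With four attempts there are at most three sleeps between them.
    for attempt in (1, 2, 3):
        print(attempt, round(backoff_delay(attempt), 2))
```

With initial=1 and jitter=2 the three sleeps fall in the ranges 1 to 3, 2 to 4, and 4 to 6 seconds, so a fully exhausted call spends somewhere between 7 and 13 seconds waiting on top of the request time.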
For LangGraph’s per-node retries, the same principle applies.
from langgraph.pregel.types import RetryPolicy

llm_retry = RetryPolicy(
    max_attempts=4,
    initial_interval=2.0,
    backoff_factor=2.0,
    jitter=True,
    retry_on=(TransientLLMError,),
)
builder.add_node("research", research, retry=llm_retry)
I cover the LangGraph specifics more in the production tutorial.
4. Deadlines, the missing primitive
LLM calls without deadlines are the most expensive bugs you can ship. The model might hang, the network might stall, your code might be in a retry loop. Without a deadline, your worker sits forever.
Set deadlines at three layers.
HTTP timeout. The SDK has one, configure it. OpenAI’s default is 600 seconds, which is wildly generous for chat completions. I set it to 30 for most calls, 90 for long generations.
from openai import OpenAI
client = OpenAI(timeout=30.0, max_retries=0) # we do our own retries
Per-step deadline. Wrap the step in asyncio.wait_for or concurrent.futures with a timeout.
import asyncio


async def with_deadline(coro, seconds: float):
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        raise TransientLLMError(f"step exceeded {seconds}s deadline")
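A small usage sketch of the per-step deadline, runnable on its own. slow_model_call is a hypothetical stand-in for an LLM call that just sleeps, and TransientLLMError and with_deadline are repeated here only so the block is self-contained.

```python
import asyncio


class TransientLLMError(Exception):
    """Wraps retryable upstream errors."""


async def with_deadline(coro, seconds: float):
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        raise TransientLLMError(f"step exceeded {seconds}s deadline")


async def slow_model_call(delay: float) -> str:
    """Stand-in for an LLM call: sleeps, then answers."""
    await asyncio.sleep(delay)
    return "done"


async def main() -> tuple[str, str]:
    # Finishes well inside its deadline.
    fast = await with_deadline(slow_model_call(0.01), seconds=0.5)
    # Exceeds its deadline, so the coroutine is cancelled and the error is raised.
    message = ""
    try:
        await with_deadline(slow_model_call(1.0), seconds=0.05)
    except TransientLLMError as exc:
        message = str(exc)
    return fast, message


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because the timeout surfaces as TransientLLMError, a retry layer that only retries that class will treat a blown deadline like any other transient failure.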
Whole-workflow deadline. Pass a started_at timestamp through state, and check it at each node entry.
import time


class State(TypedDict):
    # ...
    started_at: float
    deadline_seconds: float


def check_deadline(state: State) -> None:
    elapsed = time.time() - state["started_at"]
    if elapsed > state["deadline_seconds"]:
        raise RuntimeError(f"workflow exceeded {state['deadline_seconds']}s")
I’d recommend all three. The HTTP timeout catches network hangs, the per-step deadline catches slow model behavior, and the whole-workflow deadline catches runaway loops.
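The layers can also feed each other. As a sketch, and as an extension of the pattern rather than something the code above does, a step can cap its own timeout by whatever is left of the workflow budget. DeadlineState, remaining_seconds, and step_timeout are hypothetical names; wall-clock time is used on purpose, because started_at has to stay meaningful across process restarts.

```python
import time
from typing import Optional, TypedDict


class DeadlineState(TypedDict):
    started_at: float        # wall-clock time.time() when the workflow began
    deadline_seconds: float  # total budget for the whole workflow


def remaining_seconds(state: DeadlineState, now: Optional[float] = None) -> float:
    """Seconds left in the workflow budget (negative once it is blown)."""
    now = time.time() if now is None else now
    return state["deadline_seconds"] - (now - state["started_at"])


def step_timeout(state: DeadlineState, per_step_cap: float, now: Optional[float] = None) -> float:
    """Timeout for the next step: the per-step cap, or less if the budget is nearly gone."""
    left = remaining_seconds(state, now)
    if left <= 0:
        raise RuntimeError(f"workflow exceeded {state['deadline_seconds']}s")
    return min(per_step_cap, left)
```

A node would call step_timeout at entry and pass the result to its per-step deadline, so the last step of a nearly expired run fails quickly instead of using its full allowance.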
5. Circuit breakers, the blast radius cap
The scenario: your retry policy is sensible for any single call, but the upstream is degraded and 95% of calls fail. Your workers retry, eat API quotas, log errors, and don’t make progress. A circuit breaker says “if the upstream is mostly failing, stop calling it for a while.”
pybreaker does this cleanly.
# pip install "pybreaker==1.2.0"
import pybreaker

llm_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=60,
    exclude=[ValueError],  # don't trip on validation errors
)


@llm_breaker
def call_llm_protected(prompt: str) -> str:
    return call_llm(prompt)  # the tenacity-decorated version


try:
    answer = call_llm_protected("hello")
except pybreaker.CircuitBreakerError:
    # breaker is open, upstream is degraded
    raise TransientLLMError("circuit open, upstream degraded")
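Five consecutive failures trip it. pybreaker counts failures in a row and resets the counter on any success, so there is no one-minute window. For the next 60 seconds, every call fails immediately with CircuitBreakerError, no API call made. After 60 seconds, the breaker enters half-open, and one trial call decides whether to close it. Because the breaker wraps the tenacity-decorated function, one counted failure is a whole exhausted retry sequence, so tripping can take up to 20 upstream requests.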
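If the closed, open, half-open cycle is new to you, this stdlib sketch shows the state machine in about thirty lines. TinyBreaker and CircuitOpenError are hypothetical names and this is not pybreaker’s implementation; it is simplified (no excluded exceptions, no listeners, no thread safety) and it takes an injectable clock so the timing can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the breaker is open."""


class TinyBreaker:
    def __init__(self, fail_max: int = 5, reset_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0                        # consecutive failures so far
        self.opened_at: Optional[float] = None   # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn: Callable, *args):
        state = self.state
        if state == "open":
            raise CircuitOpenError("upstream degraded, call skipped")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the breaker.
            if state == "half-open" or self.failures >= self.fail_max:
                self.opened_at = self.clock()
            raise
        # Any success resets the count and closes the breaker.
        self.failures = 0
        self.opened_at = None
        return result
```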
I run a separate breaker per upstream and per model class. OpenAI’s gpt-4o and Anthropic’s claude-3-7-sonnet get their own breakers, because a degraded gpt-4o doesn’t mean Claude is down. The agent layer can then fall back to the alternate model if its breaker is closed.
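The fallback decision itself can stay tiny. A sketch, assuming you track which upstreams currently have an open breaker as a set of names; pick_upstream is a hypothetical helper and the model names are just the two mentioned above.

```python
from typing import Optional


def pick_upstream(preference: list[str], open_breakers: set[str]) -> Optional[str]:
    """Return the first preferred upstream whose breaker is not open, else None."""
    for name in preference:
        if name not in open_breakers:
            return name
    return None  # everything is degraded: fail the step and let the workflow retry later


preferred = ["gpt-4o", "claude-3-7-sonnet"]
print(pick_upstream(preferred, open_breakers=set()))
print(pick_upstream(preferred, open_breakers={"gpt-4o"}))
```

Returning None rather than raising keeps the choice of what to do when every upstream is down (pause, requeue, or fail) in the agent layer.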
6. Putting it together, a durable agent step
A single agent step with all four pillars wired in.
import asyncio, hashlib
import pybreaker
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type
import openai
from openai import OpenAI

client = OpenAI(timeout=30.0, max_retries=0)
llm_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=60)


class TransientLLMError(Exception): ...


@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=15, jitter=2),
    retry=retry_if_exception_type(TransientLLMError),
)
def _do_call(prompt: str, idem_key: str) -> str:
    try:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            # Same key on every retry of this step; it only dedupes if the upstream honors it.
            extra_headers={"Idempotency-Key": idem_key},
        )
        return resp.choices[0].message.content
    except (openai.RateLimitError, openai.APITimeoutError, openai.InternalServerError) as e:
        raise TransientLLMError(str(e)) from e


@llm_breaker
def call_llm_durable(prompt: str, thread_id: str, step: str) -> str:
    idem = hashlib.sha256(f"{thread_id}:{step}:{prompt}".encode()).hexdigest()[:32]
    return _do_call(prompt, idem)


async def step_with_deadline(prompt: str, thread_id: str, step: str, deadline: float = 60.0) -> str:
    # wait_for stops waiting at the deadline, but the worker thread can't be cancelled;
    # it keeps running until the HTTP timeout fires.
    return await asyncio.wait_for(
        asyncio.to_thread(call_llm_durable, prompt, thread_id, step),
        timeout=deadline,
    )
That’s the production wrapper. The idempotency key guards against double-charging on retry, provided the upstream honors it. Tenacity narrows what’s retried. The circuit breaker caps blast radius. The deadline catches hangs. Combined with LangGraph’s checkpointing, you have a step that’s safe to call inside a workflow that might run for an hour.
7. Visibility into the running state
Long-running workflows produce a question you’ll get from product and ops alike, “what’s it doing right now?” The answer needs to be cheap to fetch and accurate to the second.
The pattern that works for me, write a heartbeat row to a status table on every node entry. Frontend or status page reads from this table directly, never from the checkpointer.
CREATE TABLE workflow_status (
    thread_id        TEXT PRIMARY KEY,
    workflow_name    TEXT NOT NULL,
    current_step     TEXT NOT NULL,
    started_at       TIMESTAMPTZ NOT NULL,
    last_heartbeat   TIMESTAMPTZ NOT NULL,
    progress_percent INT,
    status           TEXT NOT NULL, -- running, succeeded, failed, paused
    error_message    TEXT
);

def heartbeat(thread_id: str, step: str, progress: int | None = None,
              workflow_name: str = "research") -> None:
    with pool.connection() as conn:
        conn.execute(
            """
            INSERT INTO workflow_status (thread_id, workflow_name, current_step,
                                         started_at, last_heartbeat, progress_percent, status)
            VALUES (%s, %s, %s, now(), now(), %s, 'running')
            ON CONFLICT (thread_id) DO UPDATE
            SET current_step = EXCLUDED.current_step,
                last_heartbeat = now(),
                progress_percent = EXCLUDED.progress_percent
            """,
            (thread_id, workflow_name, step, progress),
        )
Call it at the top of each node. The 2 to 5 ms overhead is dwarfed by the LLM cost. The payoff is a status endpoint that’s just SELECT * FROM workflow_status WHERE thread_id = ? and gives you everything support and product want.
For stuck-workflow detection, a periodic job that flags rows with last_heartbeat < now() - interval '5 minutes' AND status = 'running' catches workers that died without writing failure status.
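The same check is easy to express as a pure function over the rows, which makes the threshold testable without a database. A sketch: find_stuck is a hypothetical helper, and the rows are assumed to be dicts carrying the thread_id, status, and last_heartbeat columns from the table above, with timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone


def find_stuck(rows: list[dict], now: datetime,
               max_silence: timedelta = timedelta(minutes=5)) -> list[str]:
    """Thread IDs still marked running whose last heartbeat is older than max_silence."""
    cutoff = now - max_silence
    return [
        row["thread_id"]
        for row in rows
        if row["status"] == "running" and row["last_heartbeat"] < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    rows = [
        {"thread_id": "t1", "status": "running", "last_heartbeat": now - timedelta(minutes=1)},
        {"thread_id": "t2", "status": "running", "last_heartbeat": now - timedelta(minutes=9)},
        {"thread_id": "t3", "status": "failed", "last_heartbeat": now - timedelta(hours=2)},
    ]
    print(find_stuck(rows, now))
```

Pick the silence threshold to be comfortably longer than your longest legitimate node, otherwise slow but healthy steps will be flagged.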
Common Pitfalls
The ones that bit me.
- Checkpointing inside a transaction. If your node opens a DB transaction and the checkpoint write fails, the transaction rolls back but the LangGraph state thinks the node succeeded. Always commit your side effects before the node returns, never share a transaction with the checkpointer.
- Retry policies that retry indefinitely. stop=stop_after_attempt(100) is not a retry policy, it’s denial-of-service against your own budget. Cap attempts at 3 to 5 and let the outer system handle persistent failures.
- Treating all errors as transient. If your code throws KeyError, your retry will run twice and produce the same error twice. Classify errors carefully so bugs fail on the first attempt.
- No backoff between retries. Hot loops against a rate-limited API just guarantee more rate limits. Exponential backoff with jitter is table stakes.
Troubleshooting
Three real failures.
Workflow appears to hang at a single node. Usually a missing timeout on the LLM call combined with a slow model response. Set timeout=30.0 on the OpenAI client and add an outer asyncio.wait_for. Inspect tracing for the actual model response time, often it’s not the model, it’s a tool call that hung.
Same step runs twice on restart. Checkpoint didn’t persist before the crash. Check that autocommit=True is set on your Postgres pool, and that you’re not in a transaction that rolled back. Run a manual SELECT * FROM checkpoints WHERE thread_id = '...' to see what’s there.
Token usage spikes during outage recovery. Your retries are working too well, and after the upstream recovers, a backlog of pending workflows all retry at once. Add a token bucket or a global concurrency limiter at the worker level, not just per-call.
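For that last case, a worker-level token bucket is one option. This is a minimal single-process sketch: TokenBucket is a hypothetical class, it is not thread-safe, and a fleet of workers would need the bucket in shared storage. The clock is injectable so the refill logic can be tested without sleeping.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity       # start full
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A worker would call try_acquire before starting a workflow step and requeue the step on False, so a recovering upstream sees a steady trickle instead of the whole backlog at once.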
Wrapping Up
Long-running agent workflows are durable workflow problems, dressed up. The patterns that make them reliable (checkpointing, idempotency, narrow retries, deadlines, circuit breakers) are the same patterns that make any distributed system reliable. The new thing is the failure mode of LLM calls, which fail in surprising ways and cost real money to retry.
If you build with these four pillars from day one, you spend your time on agent design instead of postmortems. Skip them and you’ll add them later anyway, in production, with a missing eyebrow.
The tenacity docs are the best reference for retry shaping. For workflow durability concepts beyond LangGraph, Temporal’s documentation is worth reading even if you never use Temporal, the mental model translates directly.