Building an Autonomous Engineering Squad with LangGraph
TL;DR — Model your dev team as a typed state graph / give each agent one job and a clear handoff / route on structured output, not vibes.
A single chat-loop agent can write a function. It cannot ship a feature. The moment a task spans planning, multi-file edits, and a review gate, the one-prompt-to-rule-them-all approach collapses into a context window full of half-finished reasoning. The model forgets the plan it wrote four turns ago and starts solving a different problem.
The fix is not a bigger model. It’s structure. An autonomous engineering squad splits the work across specialized agents — a planner, a coder, a reviewer — each with a narrow role, a focused prompt, and explicit handoffs. LangGraph gives you the substrate for that: a directed graph where nodes are agents, edges are control flow, and a shared typed state object is the only thing that crosses node boundaries.
I’ve run this pattern on real internal tooling work, and the difference is stark. When the reviewer rejects a change, control loops back to the coder with a concrete critique instead of the whole conversation history. State stays small, roles stay sharp, and you can actually debug what happened. This article builds that squad from scratch on LangGraph 0.3 and Python 3.12.
The Squad Topology
Three agents and one routing step. The planner decomposes a feature request into an ordered task list. The coder executes one task at a time. The reviewer inspects the diff and either approves or sends it back. A conditional edge after the reviewer decides whether to loop back to the coder, move on to the next task, or finish.
┌──────────┐
│ planner  │
└────┬─────┘
     │
┌────▼─────┐     reject
│  coder   │◄──────────────┐
└────┬─────┘               │
     │                     │
┌────▼─────┐               │
│ reviewer │───────────────┘
└────┬─────┘
     │  approve + tasks remain → coder
     │  approve + done → END
The contract that makes this work is the state schema. Every node reads and writes the same TypedDict. Nothing else is shared.
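The control flow in the diagram can be dry-run in plain Python before any model is involved. The sketch below is not LangGraph code: `trace_squad` and its scripted verdict list are hypothetical stand-ins that only mimic the order in which nodes would be visited, assuming the reviewer's verdicts are known in advance and a per-task retry cap of four.

```python
from collections.abc import Iterator


def trace_squad(num_tasks: int, verdicts: Iterator[str], max_revisions: int = 4) -> list[str]:
    """Return the order in which nodes are visited for scripted review verdicts."""
    visited = ["planner"]
    for _task in range(num_tasks):
        attempts = 0
        while True:
            visited.append("coder")
            visited.append("reviewer")
            attempts += 1
            verdict = next(verdicts)
            # Leave the inner loop on approval, or once the retry budget is spent.
            if verdict == "approve" or attempts >= max_revisions:
                break
    visited.append("END")
    return visited


# Two tasks: the first is rejected once, the second is approved immediately.
order = trace_squad(2, iter(["reject", "approve", "approve"]))
print(" -> ".join(order))
```

Tracing a few scripted scenarios like this is a cheap way to check that the loop structure terminates the way you expect before spending tokens on it.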
Project Setup
Pin everything. Agent frameworks move fast and an unpinned langgraph will break your graph between a Tuesday and a Wednesday.
# pyproject.toml
[project]
name = "engineering-squad"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "langgraph==0.3.5",
    "langchain-anthropic==0.3.9",
    "langchain-core==0.3.40",
    "pydantic==2.10.6",
]

python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .
export ANTHROPIC_API_KEY="sk-ant-..."
Defining the Shared State
The state object is the spine of the whole graph. Keep it flat and explicit. The Annotated reducers on diffs and messages tell LangGraph to append to those lists rather than overwrite them when a node writes to them; every other key is replaced by the latest write.
# state.py
from __future__ import annotations
from typing import Annotated, Literal, TypedDict
from operator import add
from langchain_core.messages import BaseMessage

class Task(TypedDict):
    id: int
    description: str
    status: Literal["pending", "done"]

class SquadState(TypedDict):
    feature_request: str
    plan: list[Task]
    current_task_id: int | None
    diffs: Annotated[list[str], add]
    review_verdict: Literal["approve", "reject", ""]
    review_notes: str
    revision_count: int
    messages: Annotated[list[BaseMessage], add]
revision_count is not cosmetic. Without a hard ceiling, a stubborn reviewer and an equally stubborn coder will ping-pong until you run out of API budget. We enforce the cap in the router later.
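To make the reducer semantics concrete, here is a small pure-Python model of how a node's partial update might be merged into state. `merge_update` and `DemoState` are hypothetical names written for illustration, not LangGraph's implementation; the sketch assumes only the rule described above, that keys annotated with a reducer are combined through it and all other keys are overwritten.

```python
from operator import add
from typing import Annotated, TypedDict, get_type_hints


class DemoState(TypedDict):
    diffs: Annotated[list[str], add]  # reducer: new values are appended
    review_notes: str                 # no reducer: last write wins


def merge_update(schema: type, state: dict, update: dict) -> dict:
    """Merge a partial update into state, honoring Annotated reducers."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated[...] types carry their extra arguments in __metadata__.
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata and callable(metadata[0]) else None
        merged[key] = reducer(state[key], value) if reducer else value
    return merged


state = {"diffs": ["diff-1"], "review_notes": "fix the off-by-one"}
state = merge_update(DemoState, state, {"diffs": ["diff-2"], "review_notes": ""})
print(state)
```

One consequence worth noticing: because diffs only ever grows, a rejected attempt stays in the list next to the attempt that replaced it. If downstream code should see accepted changes only, that needs separate handling.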
The Planner Node
The planner runs once. It turns a vague request into a concrete, ordered task list. The critical move is forcing structured output — never parse a task list out of free-form prose.
# nodes/planner.py
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field
from state import SquadState, Task

_llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)

class PlanItem(BaseModel):
    description: str = Field(description="A single, self-contained engineering task")

class Plan(BaseModel):
    tasks: list[PlanItem] = Field(min_length=1, max_length=8)

_PLANNER_SYSTEM = """You are a staff engineer breaking a feature request into
an ordered list of small, independently shippable tasks. Each task should be
completable in a single focused edit. Do not include testing or deployment
tasks; assume a separate pipeline handles those. Return 3 to 6 tasks."""

def planner_node(state: SquadState) -> dict:
    structured = _llm.with_structured_output(Plan)
    try:
        plan: Plan = structured.invoke(
            [
                ("system", _PLANNER_SYSTEM),
                ("human", state["feature_request"]),
            ]
        )
    except Exception as exc:  # network, schema-validation, rate limit
        raise RuntimeError(f"planner failed to produce a plan: {exc}") from exc
    tasks: list[Task] = [
        {"id": i, "description": item.description, "status": "pending"}
        for i, item in enumerate(plan.tasks)
    ]
    return {
        "plan": tasks,
        "current_task_id": tasks[0]["id"],
        "revision_count": 0,
    }
with_structured_output is the workhorse. It hands the model a JSON schema derived from the Pydantic class and validates the response. If the model returns malformed JSON, the call raises — and we let it, because a squad with no plan has nothing to do.
The Coder Node
The coder picks up current_task_id, looks at any prior diffs, and produces a unified diff. In production you’d wire this to a real file-edit tool; here we keep the output as a diff string so the example stays self-contained and testable.
# nodes/coder.py
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AIMessage
from state import SquadState

_llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)

_CODER_SYSTEM = """You are a senior engineer. Implement exactly one task as a
unified diff. Output only the diff, no prose. If the reviewer left notes,
address every point. Keep changes minimal and focused on the current task."""

def _current_task(state: SquadState):
    for task in state["plan"]:
        if task["id"] == state["current_task_id"]:
            return task
    raise RuntimeError(f"no task with id {state['current_task_id']}")

def coder_node(state: SquadState) -> dict:
    task = _current_task(state)
    context = "\n\n".join(state["diffs"]) or "(no prior changes)"
    notes = state["review_notes"] or "(first attempt)"
    prompt = (
        f"Task: {task['description']}\n\n"
        f"Existing changes so far:\n{context}\n\n"
        f"Reviewer notes to address:\n{notes}"
    )
    try:
        result = _llm.invoke([("system", _CODER_SYSTEM), ("human", prompt)])
    except Exception as exc:
        raise RuntimeError(f"coder failed on task {task['id']}: {exc}") from exc
    diff = result.content if isinstance(result.content, str) else str(result.content)
    return {
        "diffs": [diff],
        "messages": [AIMessage(content=f"coder: drafted task {task['id']}")],
    }
Note the coder never sees the full message history — only the current task, the accumulated diffs, and the latest reviewer notes. That’s a deliberate context diet. It keeps every coder invocation cheap and on-topic.
The Reviewer Node
The reviewer is the quality gate. It must return a verdict and, on rejection, an actionable critique. Structured output again — a free-text “looks good to me” is not a routable signal.
# nodes/reviewer.py
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field
from typing import Literal
from state import SquadState

_llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)

class Review(BaseModel):
    verdict: Literal["approve", "reject"]
    notes: str = Field(description="Concrete, actionable feedback. Empty if approved.")

_REVIEWER_SYSTEM = """You are a meticulous code reviewer. Inspect the latest
diff against the stated task. Reject only for correctness, security, or clear
contract violations — not style nitpicks. When you reject, give specific,
actionable notes the author can act on without guessing."""

def reviewer_node(state: SquadState) -> dict:
    latest_diff = state["diffs"][-1] if state["diffs"] else ""
    task_desc = next(
        t["description"] for t in state["plan"]
        if t["id"] == state["current_task_id"]
    )
    structured = _llm.with_structured_output(Review)
    try:
        review: Review = structured.invoke(
            [
                ("system", _REVIEWER_SYSTEM),
                ("human", f"Task: {task_desc}\n\nDiff:\n{latest_diff}"),
            ]
        )
    except Exception as exc:
        raise RuntimeError(f"reviewer failed: {exc}") from exc
    return {
        "review_verdict": review.verdict,
        "review_notes": review.notes,
        "revision_count": state["revision_count"] + 1,
    }
Wiring the Graph
Now the topology. Nodes are registered, edges connect them, and a conditional edge after the reviewer holds the routing logic. This router is where the squad’s behavior actually lives.
# graph.py
from langgraph.graph import StateGraph, START, END
from state import SquadState
from nodes.planner import planner_node
from nodes.coder import coder_node
from nodes.reviewer import reviewer_node

MAX_REVISIONS = 4

def route_after_review(state: SquadState) -> str:
    """Decide what happens after a review: retry, advance, or stop."""
    if state["review_verdict"] == "reject":
        if state["revision_count"] >= MAX_REVISIONS:
            # Give up on this task rather than burn budget forever.
            return "advance"
        return "retry"
    return "advance"

def advance_or_finish(state: SquadState) -> dict:
    """Mark the current task done and select the next pending one."""
    plan = [
        {**t, "status": "done"} if t["id"] == state["current_task_id"] else t
        for t in state["plan"]
    ]
    next_task = next((t for t in plan if t["status"] == "pending"), None)
    return {
        "plan": plan,
        "current_task_id": next_task["id"] if next_task else None,
        "review_notes": "",
        "review_verdict": "",
        "revision_count": 0,
    }

def has_more_work(state: SquadState) -> str:
    return "coder" if state["current_task_id"] is not None else END

def build_squad() -> StateGraph:
    g = StateGraph(SquadState)
    g.add_node("planner", planner_node)
    g.add_node("coder", coder_node)
    g.add_node("reviewer", reviewer_node)
    g.add_node("advance", advance_or_finish)
    g.add_edge(START, "planner")
    g.add_edge("planner", "coder")
    g.add_edge("coder", "reviewer")
    g.add_conditional_edges(
        "reviewer",
        route_after_review,
        {"retry": "coder", "advance": "advance"},
    )
    g.add_conditional_edges("advance", has_more_work, {"coder": "coder", END: END})
    return g
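Two conditional edges do all the steering. route_after_review handles the per-task retry loop with a hard cap. has_more_work drives the outer loop across tasks. Be aware that when the cap is hit, the task still flows through advance and is marked done even though its last verdict was reject; if you need to tell accepted tasks from abandoned ones, record that outcome in state. Notice advance is its own node: bundling state mutation into a node rather than a router keeps routers pure functions, which makes them trivially unit-testable.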
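Because the routers only read state and return a label, they can be exercised with plain dictionaries and no model calls. The sketch below restates `route_after_review` (condensed, but with the same outcomes) and `has_more_work` so that it runs on its own. The literal "__end__" is an assumption standing in for LangGraph's END constant, and `review_state` is a hypothetical fixture.

```python
MAX_REVISIONS = 4
END = "__end__"  # assumption: stands in for langgraph.graph.END


def route_after_review(state: dict) -> str:
    # Retry only while the verdict is a rejection and budget remains.
    if state["review_verdict"] == "reject" and state["revision_count"] < MAX_REVISIONS:
        return "retry"
    return "advance"


def has_more_work(state: dict) -> str:
    return "coder" if state["current_task_id"] is not None else END


def review_state(verdict: str, revisions: int) -> dict:
    """Hypothetical fixture: only the keys the router reads."""
    return {"review_verdict": verdict, "revision_count": revisions}


def test_routers() -> None:
    assert route_after_review(review_state("approve", 1)) == "advance"
    assert route_after_review(review_state("reject", 1)) == "retry"
    # At the cap, a rejection stops retrying and moves on.
    assert route_after_review(review_state("reject", MAX_REVISIONS)) == "advance"
    assert has_more_work({"current_task_id": 2}) == "coder"
    # Task id 0 is valid, so the check must be `is not None`, not truthiness.
    assert has_more_work({"current_task_id": 0}) == "coder"
    assert has_more_work({"current_task_id": None}) == END


test_routers()
```

The task-id-0 case is the one most likely to regress: the planner numbers tasks from zero, so a router that tested truthiness instead of `is not None` would end the run before the first task.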
Running the Squad
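Compile with a checkpointer so the state after each step is saved under a thread_id and can be inspected or resumed. InMemorySaver keeps those checkpoints in process memory only, which is fine for local experiments but will not survive a crash. The recursion_limit is your last line of defense against a graph that won’t terminate.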
# run.py
from langgraph.checkpoint.memory import InMemorySaver
from graph import build_squad

def main() -> None:
    app = build_squad().compile(checkpointer=InMemorySaver())
    config = {"configurable": {"thread_id": "feat-001"}, "recursion_limit": 50}
    initial = {
        "feature_request": "Add rate limiting to the public REST API "
        "with per-key quotas and a 429 response including Retry-After.",
        "plan": [],
        "current_task_id": None,
        "diffs": [],
        "review_verdict": "",
        "review_notes": "",
        "revision_count": 0,
        "messages": [],
    }
    final = app.invoke(initial, config=config)
    print(f"completed {len(final['plan'])} tasks")
    for i, diff in enumerate(final["diffs"]):
        print(f"--- change {i} ---\n{diff[:400]}\n")

if __name__ == "__main__":
    main()
For a production deployment you’d swap InMemorySaver for a durable backend, such as a SQLite or Postgres checkpointer, so runs survive process restarts and can be inspected after the fact. The LangGraph persistence docs cover the full checkpointer API.
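A quick back-of-the-envelope check helps when choosing recursion_limit. The sketch below assumes that each node execution counts as roughly one step against the limit, and it follows the graph as wired in graph.py: one planner step, then for each task up to MAX_REVISIONS coder/reviewer pairs plus one advance step. `worst_case_steps` is a hypothetical helper, not part of LangGraph.

```python
def worst_case_steps(num_tasks: int, max_revisions: int) -> int:
    """Upper bound on node executions for one run of the squad graph."""
    per_task = max_revisions * 2 + 1  # (coder + reviewer) per attempt, then advance
    return 1 + num_tasks * per_task   # plus the single planner step


for tasks in (3, 6, 8):
    print(tasks, worst_case_steps(tasks, max_revisions=4))
```

Under these assumptions, a six-task plan in which every task uses all four attempts needs about 55 steps, and an eight-task plan (the schema maximum) about 73. Both exceed the limit of 50 used in run.py, so a legitimate but unlucky run could raise GraphRecursionError. If that worst case should complete, give the limit some headroom above it.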
Common Pitfalls
Sharing the message history with every node. It’s tempting to give each agent the whole messages list “for context.” Don’t. Each node should receive the minimum it needs. A coder that sees the planner’s chain of thought will second-guess the plan.
No revision ceiling. A reviewer that keeps finding new objections and a coder that keeps half-fixing them will loop until your bill spikes. You need both limits: MAX_REVISIONS caps retries per task, and recursion_limit caps the run as a whole.
Routers that mutate state. A conditional-edge function should only read state and return a routing key (a string in this graph), never a state update. If you need to mutate state at a branch point, do it in a dedicated node like advance. Mixing the two makes the graph hard to reason about and hard to test.
Free-text verdicts. If the reviewer returns “I think this is fine,” your router has to parse English. Structured output with a Literal field turns the verdict into a typed value you can branch on with confidence.
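Temperature drift. Run planner and reviewer at temperature=0. That does not guarantee identical outputs, but it removes most sampling variation from the control-flow agents, which makes runs much easier to reproduce and debug; save any creativity for nodes where it actually helps.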
Troubleshooting
Symptom: GraphRecursionError raised mid-run. Cause: the graph never reached END within recursion_limit steps, usually because the task list never empties. Fix: confirm advance_or_finish flips task status to done, and verify has_more_work returns END when current_task_id is None.
Symptom: with_structured_output raises a validation error. Cause: the model returned JSON that doesn’t satisfy the Pydantic schema, often because the prompt asked for something the schema forbids. Fix: align the system prompt with the schema constraints — if Plan caps tasks at 8, say so in the prompt — and consider catching the error to retry once.
Symptom: the coder ignores reviewer notes and resubmits the same diff. Cause: review_notes is being cleared before the coder reads it. Fix: only reset review_notes inside advance_or_finish, after a task is accepted — never on the retry path.
Symptom: every task gets approved instantly, even broken ones. Cause: the reviewer prompt is too permissive or the diff isn’t reaching the node. Fix: log state["diffs"][-1] at the top of reviewer_node; if it’s empty, the coder node isn’t writing to the diffs reducer correctly.
Symptom: state from a previous run bleeds into a new one. Cause: reusing the same thread_id, so the checkpointer picks up the earlier thread’s saved state. Fix: generate a fresh thread_id for each feature request, for example from uuid.uuid4().
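The "retry once" suggestion for structured-output validation failures can be sketched as a small wrapper. `invoke_with_retry` is a hypothetical helper, and it catches a bare Exception only because the exact exception type raised on a schema mismatch depends on the library versions in use; narrow it in real code.

```python
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def invoke_with_retry(call: Callable[[], T], retries: int = 1) -> T:
    """Run `call`, retrying up to `retries` extra times before giving up."""
    last_error: Exception | None = None
    for _attempt in range(retries + 1):
        try:
            return call()
        except Exception as exc:  # assumption: narrow to validation errors in practice
            last_error = exc
    raise RuntimeError(f"gave up after {retries + 1} attempts") from last_error


# Stand-in for a structured-output call: fails once, then succeeds.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ValueError("schema mismatch")
    return "plan"


result = invoke_with_retry(flaky_call)
print(result, attempts["count"])
```

In the planner this would wrap the existing call, along the lines of `invoke_with_retry(lambda: structured.invoke(messages))`, where `messages` is the same system/human pair the node already builds. Keep the retry count small, since every retry is another billed model call.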
What’s Next
You now have a squad that plans, codes, reviews, and loops with a hard budget. The natural next steps are durable checkpointing so long runs survive restarts, and a human approval gate before changes land. From there, give the coder real file-system tools and the reviewer access to a test runner, and the squad starts to resemble an actual engineering team — one you can audit edge by edge.