Building an Autonomous Engineering Squad with LangGraph


February 3, 2026 · 10 min read · by Muhammad Amal · programming

TL;DR — Model your dev team as a typed state graph / give each agent one job and a clear handoff / route on structured output, not vibes.

A single chat-loop agent can write a function. It cannot ship a feature. The moment a task spans planning, multi-file edits, and a review gate, the one-prompt-to-rule-them-all approach collapses into a context window full of half-finished reasoning. The model forgets the plan it wrote four turns ago and starts solving a different problem.

The fix is not a bigger model. It’s structure. An autonomous engineering squad splits the work across specialized agents — a planner, a coder, a reviewer — each with a narrow role, a focused prompt, and explicit handoffs. LangGraph gives you the substrate for that: a directed graph where nodes are agents, edges are control flow, and a shared typed state object is the only thing that crosses node boundaries.

I’ve run this pattern on real internal tooling work, and the difference is stark. When the reviewer rejects a change, control loops back to the coder with a concrete critique instead of the whole conversation history. State stays small, roles stay sharp, and you can actually debug what happened. This article builds that squad from scratch on LangGraph 0.3 and Python 3.12.

The Squad Topology

Three agents, connected by a small set of edges. The planner decomposes a feature request into an ordered task list. The coder executes one task at a time. The reviewer inspects the diff and either approves or sends it back. Conditional edges after the reviewer decide whether to retry the task, move on to the next one, or finish.

        ┌──────────┐
        │ planner  │
        └────┬─────┘
             │
        ┌────▼─────┐      reject
        │  coder   │◄──────────────┐
        └────┬─────┘               │
             │                     │
        ┌────▼─────┐                │
        │ reviewer │────────────────┘
        └────┬─────┘
             │ approve + tasks remain → coder
             │ approve + done → END

The contract that makes this work is the state schema. Every node reads and writes the same TypedDict. Nothing else is shared.

Project Setup

Pin everything. Agent frameworks move fast and an unpinned langgraph will break your graph between a Tuesday and a Wednesday.

# pyproject.toml
[project]
name = "engineering-squad"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "langgraph==0.3.5",
    "langchain-anthropic==0.3.9",
    "langchain-core==0.3.40",
    "pydantic==2.10.6",
]

python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .
export ANTHROPIC_API_KEY="sk-ant-..."

Defining the Shared State

The state object is the spine of the whole graph. Keep it flat and explicit. The Annotated reducers on diffs and messages tell LangGraph to append each node’s update to the existing list rather than overwrite it; every other key is simply replaced by the latest write.

# state.py
from __future__ import annotations
from typing import Annotated, Literal, TypedDict
from operator import add
from langchain_core.messages import BaseMessage


class Task(TypedDict):
    id: int
    description: str
    status: Literal["pending", "done"]


class SquadState(TypedDict):
    feature_request: str
    plan: list[Task]
    current_task_id: int | None
    diffs: Annotated[list[str], add]
    review_verdict: Literal["approve", "reject", ""]
    review_notes: str
    revision_count: int
    messages: Annotated[list[BaseMessage], add]

revision_count is not cosmetic. Without a hard ceiling, a stubborn reviewer and an equally stubborn coder will ping-pong until you run out of API budget. We enforce the cap in the router later.
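
To make the append-versus-overwrite behavior concrete, here is a minimal stdlib-only sketch of how a partial update might be merged into the state. The apply_update helper and the REDUCERS table are hypothetical stand-ins for illustration, not LangGraph’s actual implementation.

```python
from operator import add
from typing import Any, Callable

# Hypothetical emulation of the merge step; LangGraph's internals differ.
Reducer = Callable[[Any, Any], Any]

# Keys declared with Annotated[..., add] in SquadState get a reducer.
REDUCERS: dict[str, Reducer] = {"diffs": add, "messages": add}


def apply_update(state: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Merge a node's partial update into the state, key by key."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            # Reducer keys combine old and new values (list concatenation here).
            merged[key] = reducer(merged[key], value)
        else:
            # Plain keys are replaced by the latest write.
            merged[key] = value
    return merged


state = {"diffs": ["diff-1"], "review_notes": "fix the off-by-one"}
state = apply_update(state, {"diffs": ["diff-2"], "review_notes": ""})
print(state["diffs"])  # ['diff-1', 'diff-2']
print(repr(state["review_notes"]))  # ''
```

This is why the coder returns "diffs": [diff] as a one-element list: the reducer concatenates lists, so a bare string would not merge cleanly.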

The Planner Node

The planner runs once. It turns a vague request into a concrete, ordered task list. The critical move is forcing structured output — never parse a task list out of free-form prose.

# nodes/planner.py
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field
from state import SquadState, Task

_llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class PlanItem(BaseModel):
    description: str = Field(description="A single, self-contained engineering task")


class Plan(BaseModel):
    tasks: list[PlanItem] = Field(min_length=1, max_length=8)


_PLANNER_SYSTEM = """You are a staff engineer breaking a feature request into
an ordered list of small, independently shippable tasks. Each task should be
completable in a single focused edit. Do not include testing or deployment
tasks; assume a separate pipeline handles those. Return 3 to 6 tasks."""


def planner_node(state: SquadState) -> dict:
    structured = _llm.with_structured_output(Plan)
    try:
        plan: Plan = structured.invoke(
            [
                ("system", _PLANNER_SYSTEM),
                ("human", state["feature_request"]),
            ]
        )
    except Exception as exc:  # network, schema-validation, rate limit
        raise RuntimeError(f"planner failed to produce a plan: {exc}") from exc

    tasks: list[Task] = [
        {"id": i, "description": item.description, "status": "pending"}
        for i, item in enumerate(plan.tasks)
    ]
    return {
        "plan": tasks,
        "current_task_id": tasks[0]["id"],
        "revision_count": 0,
    }

with_structured_output is the workhorse. It hands the model a JSON schema derived from the Pydantic class and validates the response. If the model returns malformed JSON, the call raises — and we let it, because a squad with no plan has nothing to do.

The Coder Node

The coder picks up current_task_id, looks at any prior diffs, and produces a unified diff. In production you’d wire this to a real file-edit tool; here we keep the output as a diff string so the example stays self-contained and testable.

# nodes/coder.py
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AIMessage
from state import SquadState

_llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)

_CODER_SYSTEM = """You are a senior engineer. Implement exactly one task as a
unified diff. Output only the diff, no prose. If the reviewer left notes,
address every point. Keep changes minimal and focused on the current task."""


def _current_task(state: SquadState):
    for task in state["plan"]:
        if task["id"] == state["current_task_id"]:
            return task
    raise RuntimeError(f"no task with id {state['current_task_id']}")


def coder_node(state: SquadState) -> dict:
    task = _current_task(state)
    context = "\n\n".join(state["diffs"]) or "(no prior changes)"
    notes = state["review_notes"] or "(first attempt)"

    prompt = (
        f"Task: {task['description']}\n\n"
        f"Existing changes so far:\n{context}\n\n"
        f"Reviewer notes to address:\n{notes}"
    )
    try:
        result = _llm.invoke([("system", _CODER_SYSTEM), ("human", prompt)])
    except Exception as exc:
        raise RuntimeError(f"coder failed on task {task['id']}: {exc}") from exc

    diff = result.content if isinstance(result.content, str) else str(result.content)
    return {
        "diffs": [diff],
        "messages": [AIMessage(content=f"coder: drafted task {task['id']}")],
    }

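Note that the coder never sees the full message history, only the current task, the accumulated diffs, and the latest reviewer notes. That’s a deliberate context diet. It keeps every coder invocation cheap and on-topic. One caveat: because diffs is append-only, the accumulated diffs also include earlier rejected attempts at the same task.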

The Reviewer Node

The reviewer is the quality gate. It must return a verdict and, on rejection, an actionable critique. Structured output again — a free-text “looks good to me” is not a routable signal.

# nodes/reviewer.py
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field
from typing import Literal
from state import SquadState

_llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class Review(BaseModel):
    verdict: Literal["approve", "reject"]
    notes: str = Field(description="Concrete, actionable feedback. Empty if approved.")


_REVIEWER_SYSTEM = """You are a meticulous code reviewer. Inspect the latest
diff against the stated task. Reject only for correctness, security, or clear
contract violations — not style nitpicks. When you reject, give specific,
actionable notes the author can act on without guessing."""


def reviewer_node(state: SquadState) -> dict:
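    latest_diff = state["diffs"][-1] if state["diffs"] else ""
    # Use a default so a missing task raises a clear error, not StopIteration.
    task_desc = next(
        (t["description"] for t in state["plan"]
         if t["id"] == state["current_task_id"]),
        None,
    )
    if task_desc is None:
        raise RuntimeError(f"no task with id {state['current_task_id']}")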
    structured = _llm.with_structured_output(Review)
    try:
        review: Review = structured.invoke(
            [
                ("system", _REVIEWER_SYSTEM),
                ("human", f"Task: {task_desc}\n\nDiff:\n{latest_diff}"),
            ]
        )
    except Exception as exc:
        raise RuntimeError(f"reviewer failed: {exc}") from exc

    return {
        "review_verdict": review.verdict,
        "review_notes": review.notes,
        "revision_count": state["revision_count"] + 1,
    }

Wiring the Graph

Now the topology. Nodes are registered, edges connect them, and a conditional edge after the reviewer holds the routing logic. This router is where the squad’s behavior actually lives.

# graph.py
from langgraph.graph import StateGraph, START, END
from state import SquadState
from nodes.planner import planner_node
from nodes.coder import coder_node
from nodes.reviewer import reviewer_node

MAX_REVISIONS = 4


def route_after_review(state: SquadState) -> str:
    """Decide what happens after a review: retry, advance, or stop."""
    if state["review_verdict"] == "reject":
        if state["revision_count"] >= MAX_REVISIONS:
            # Give up on this task rather than burn budget forever.
            return "advance"
        return "retry"
    return "advance"


def advance_or_finish(state: SquadState) -> dict:
    """Mark the current task done and select the next pending one."""
    plan = [
        {**t, "status": "done"} if t["id"] == state["current_task_id"] else t
        for t in state["plan"]
    ]
    next_task = next((t for t in plan if t["status"] == "pending"), None)
    return {
        "plan": plan,
        "current_task_id": next_task["id"] if next_task else None,
        "review_notes": "",
        "review_verdict": "",
        "revision_count": 0,
    }


def has_more_work(state: SquadState) -> str:
    return "coder" if state["current_task_id"] is not None else END


def build_squad() -> StateGraph:
    g = StateGraph(SquadState)
    g.add_node("planner", planner_node)
    g.add_node("coder", coder_node)
    g.add_node("reviewer", reviewer_node)
    g.add_node("advance", advance_or_finish)

    g.add_edge(START, "planner")
    g.add_edge("planner", "coder")
    g.add_edge("coder", "reviewer")
    g.add_conditional_edges(
        "reviewer",
        route_after_review,
        {"retry": "coder", "advance": "advance"},
    )
    g.add_conditional_edges("advance", has_more_work, {"coder": "coder", END: END})
    return g

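Two conditional edges do all the steering. route_after_review handles the per-task retry loop with a hard cap, and has_more_work drives the outer loop across tasks. Notice that advance is its own node: putting the state mutation in a node rather than in a router keeps the routers pure functions, which makes them trivially unit-testable. One caveat: when the cap is hit, route_after_review still returns “advance”, so the task is marked done even though its last diff was rejected, and advance_or_finish then clears the verdict and notes.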
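
As a quick illustration of that testability, the router can be exercised with plain dicts and no model calls. The sketch below repeats route_after_review so it runs on its own; the test function names are hypothetical, and in a real project you would import the router from graph.py instead.

```python
MAX_REVISIONS = 4


def route_after_review(state: dict) -> str:
    """Copy of the router from graph.py, repeated so this block runs alone."""
    if state["review_verdict"] == "reject":
        if state["revision_count"] >= MAX_REVISIONS:
            return "advance"
        return "retry"
    return "advance"


def test_reject_below_cap_retries() -> None:
    state = {"review_verdict": "reject", "revision_count": 1}
    assert route_after_review(state) == "retry"


def test_reject_at_cap_advances() -> None:
    state = {"review_verdict": "reject", "revision_count": MAX_REVISIONS}
    assert route_after_review(state) == "advance"


def test_approve_advances() -> None:
    state = {"review_verdict": "approve", "revision_count": 1}
    assert route_after_review(state) == "advance"


# Run directly, or let pytest collect the test_* functions.
test_reject_below_cap_retries()
test_reject_at_cap_advances()
test_approve_advances()
print("router tests passed")
```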

Running the Squad

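Compile with a checkpointer so the state is saved after every step and a run can be inspected or resumed by thread_id. InMemorySaver keeps those checkpoints only for the life of the process, so it will not survive a crash of the process itself. The recursion_limit is your last line of defense against a graph that won’t terminate.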

# run.py
from langgraph.checkpoint.memory import InMemorySaver
from graph import build_squad


def main() -> None:
    app = build_squad().compile(checkpointer=InMemorySaver())
    config = {"configurable": {"thread_id": "feat-001"}, "recursion_limit": 50}

    initial = {
        "feature_request": "Add rate limiting to the public REST API "
        "with per-key quotas and a 429 response including Retry-After.",
        "plan": [],
        "current_task_id": None,
        "diffs": [],
        "review_verdict": "",
        "review_notes": "",
        "revision_count": 0,
        "messages": [],
    }

    final = app.invoke(initial, config=config)
    print(f"completed {len(final['plan'])} tasks")
    for i, diff in enumerate(final["diffs"]):
        print(f"--- change {i} ---\n{diff[:400]}\n")


if __name__ == "__main__":
    main()

For a production deployment you’d swap InMemorySaver for a durable backend — a SQLite or Postgres checkpointer — so runs survive process restarts and can be inspected after the fact. The LangGraph persistence docs cover the full checkpointer API.
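
The recursion_limit of 50 used in run.py can be sanity-checked with a little arithmetic. The sketch below assumes each node execution counts as one step against the limit, which is roughly how a graph with no parallel branches behaves; worst_case_steps and happy_path_steps are hypothetical helpers, not part of LangGraph.

```python
def worst_case_steps(num_tasks: int, max_revisions: int) -> int:
    """Steps if every task is rejected until the revision cap is hit."""
    planner = 1
    # Each revision is one coder run plus one reviewer run, then one advance.
    per_task = max_revisions * 2 + 1
    return planner + num_tasks * per_task


def happy_path_steps(num_tasks: int) -> int:
    """Steps if every task is approved on the first review."""
    # planner, then coder + reviewer + advance for each task.
    return 1 + num_tasks * 3


print(happy_path_steps(6))     # 19
print(worst_case_steps(6, 4))  # 55
print(worst_case_steps(8, 4))  # 73
```

Under that assumption, a six-task plan in which every task exhausts its retries needs about 55 steps, which is more than the limit of 50 set above. If long or contentious plans are expected, it may be worth raising recursion_limit or lowering MAX_REVISIONS.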

Common Pitfalls

Sharing the message history with every node. It’s tempting to give each agent the whole messages list “for context.” Don’t. Each node should receive the minimum it needs. A coder that sees the planner’s chain of thought will second-guess the plan.

No revision ceiling. A reviewer that keeps finding new objections and a coder that keeps half-fixing them will loop until your bill spikes. Set both MAX_REVISIONS and recursion_limit: the first caps retries per task, the second caps the total number of steps in a run.

Routers that mutate state. Conditional-edge functions in LangGraph should return a routing key (here, a string naming the next branch), not a state update. If you need to mutate state at a branch point, do it in a dedicated node like advance. Mixing the two makes the graph hard to reason about.

Free-text verdicts. If the reviewer returns “I think this is fine,” your router has to parse English. Structured output with a Literal field turns the verdict into a typed value you can branch on with confidence.

Temperature drift. Run planner and reviewer at temperature=0. Determinism in the control-flow agents makes runs reproducible; save any creativity for nodes where it actually helps.
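
The free-text verdict pitfall can be illustrated without any model call. This is a stdlib-only sketch of the kind of check a Literal field provides; parse_verdict and route are hypothetical helpers, and in the code above Pydantic performs this validation through the Review schema.

```python
from typing import Literal, get_args

Verdict = Literal["approve", "reject"]


def parse_verdict(raw: str) -> str:
    """Accept only the values the router knows how to branch on."""
    allowed = get_args(Verdict)  # ('approve', 'reject')
    if raw not in allowed:
        raise ValueError(f"verdict must be one of {allowed}, got {raw!r}")
    return raw


def route(verdict: str) -> str:
    """Branch on a validated verdict instead of parsing English."""
    return "retry" if parse_verdict(verdict) == "reject" else "advance"


print(route("reject"))   # retry
print(route("approve"))  # advance

try:
    route("I think this is fine")
except ValueError as exc:
    # An unroutable verdict fails loudly instead of silently taking a branch.
    print(f"unroutable verdict: {exc}")
```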

Troubleshooting

Symptom: GraphRecursionError raised mid-run. Cause: the graph never reached END within recursion_limit steps, usually because the task list never empties. Fix: confirm advance_or_finish flips task status to done, and verify has_more_work returns END when current_task_id is None.

Symptom: with_structured_output raises a validation error. Cause: the model returned JSON that doesn’t satisfy the Pydantic schema, often because the prompt asked for something the schema forbids. Fix: align the system prompt with the schema constraints — if Plan caps tasks at 8, say so in the prompt — and consider catching the error to retry once.

Symptom: the coder ignores reviewer notes and resubmits the same diff. Cause: review_notes is being cleared before the coder reads it. Fix: only reset review_notes inside advance_or_finish, after a task is accepted — never on the retry path.

Symptom: every task gets approved instantly, even broken ones. Cause: the reviewer prompt is too permissive or the diff isn’t reaching the node. Fix: log state["diffs"][-1] at the top of reviewer_node; if it’s empty, the coder node isn’t writing to the diffs reducer correctly.

Symptom: state from a previous run bleeds into a new one. Cause: reusing the same thread_id, so the checkpointer resumes from the earlier run’s saved state. Fix: generate a fresh thread_id for each feature request and pass it in the config.
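
For the structured-output validation symptom above, a retry-once wrapper might look like the following sketch. invoke_with_retry is a hypothetical helper, and catching ValueError is an assumption used as a stand-in for whatever validation error your stack actually raises; the flaky function only simulates a model that fails schema validation on its first attempt.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def invoke_with_retry(call: Callable[[], T], retries: int = 1) -> T:
    """Run call(), retrying up to `retries` extra times on ValueError."""
    last_exc: Exception | None = None
    for _ in range(retries + 1):
        try:
            return call()
        except ValueError as exc:  # assumed stand-in for a validation error
            last_exc = exc
    raise RuntimeError(f"gave up after {retries + 1} attempts") from last_exc


attempts = {"n": 0}


def flaky() -> str:
    """Simulated structured call: fails once, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ValueError("response did not match schema")
    return "plan"


result = invoke_with_retry(flaky)
print(result, attempts["n"])  # plan 2
```

In the planner, the wrapped call would be something like lambda: structured.invoke(messages). Keep the retry count small, since every retry is another paid model call.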

What’s Next

You now have a squad that plans, codes, reviews, and loops with a hard budget. The natural next steps are durable checkpointing so long runs survive restarts, and a human approval gate before changes land. From there, give the coder real file-system tools and the reviewer access to a test runner, and the squad starts to resemble an actual engineering team — one you can audit edge by edge.
