Multi Agent Systems in 2025, Architecture Patterns That Work
March 3, 2025 · 10 min read · by Muhammad Amal

TL;DR — Four patterns cover ninety percent of real multi-agent work in 2025: supervisor, hierarchical, swarm, and event-driven. Pick by failure mode and latency budget, not by what’s trending on Twitter.

I’ve shipped four multi-agent systems into production in the last twelve months, and the one consistent lesson is that the architecture you pick on day one decides what kind of fires you’ll be putting out on day ninety. The frameworks have matured fast. LangGraph hit 0.2 stable, CrewAI crossed 0.86, AutoGen got rewritten from scratch as 0.4 in late 2024, and the Model Context Protocol gave us a real standard for tool exposure. None of that saves you from picking the wrong topology.

This is the field guide I’d hand a senior engineer who’s been asked to scope a multi-agent feature for next quarter. I’m assuming you already know how to call an LLM and you’ve built at least one ReAct-style agent. What you probably haven’t done is reason about agent topologies the way you reason about service topologies. That’s the gap I want to close.

I’ll walk through four patterns that hold up under real load, show Python for each (LangGraph 0.2.x for the first two, langgraph-swarm for the third, Redis Streams for the fourth), and call out the failure modes I’ve personally hit. The code targets Python 3.12 and assumes you have an OpenAI key in your environment. The examples call gpt-4o, so swap in an Anthropic chat model if that’s the key you have.

What changed in 2025 that matters

Three things shifted the design space this year. First, reasoning models like OpenAI’s o1 and Claude 3.7 Sonnet’s extended thinking mode mean a single agent can chew through problems that used to require an orchestrator-worker split. Second, MCP gave us a wire format for tools that doesn’t lock you into one vendor’s SDK. Third, structured outputs and function calling are now reliable enough that you can treat agent-to-agent messages as typed RPC instead of free-form prose.

The practical effect is that you should default to fewer agents than you would have in 2024. A two-agent supervisor-worker setup with a reasoning model often beats a five-agent hierarchical mess. Complexity is a tax, and you pay it in latency, token cost, and debuggability.

                +-----------+
       user --> | router    |
                +-----+-----+
                      |
            +---------+---------+
            |         |         |
        +---v---+ +---v---+ +---v---+
        | agent | | agent | | agent |
        +-------+ +-------+ +-------+

That’s the supervisor pattern in one diagram. It’s where most of you should start. The other three patterns earn their complexity only when supervisor stops fitting.

Pattern 1, supervisor with typed handoffs

The supervisor pattern has one orchestrator that decides which specialist agent runs next. It’s the multi-agent equivalent of a switch statement. It works when your problem decomposes into a small set of well-known subtasks, like “research, write, edit” or “classify, route, respond”.

The trick that took me too long to learn is that the supervisor should not see the raw output of the workers. It should see a typed summary the worker writes for it. Otherwise you blow your context window and the supervisor starts hallucinating about what each worker actually did.

Here’s a working LangGraph 0.2 implementation. Install with pip install "langgraph==0.2.74" "langchain-openai>=0.2" "openai>=1.59".

from typing import Annotated, Literal, TypedDict
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    next_agent: str
    research_summary: str
    draft: str

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def supervisor(state: AgentState) -> dict:
    system = SystemMessage(content=(
        "You route between three workers: researcher, writer, editor. "
        "Respond with ONLY one word: researcher, writer, editor, or done."
    ))
    decision = llm.invoke([system] + state["messages"]).content.strip().lower()
    if decision not in {"researcher", "writer", "editor", "done"}:
        decision = "done"
    return {"next_agent": decision}

def researcher(state: AgentState) -> dict:
    prompt = f"Research this query and return 5 bullet facts:\n{state['messages'][0].content}"
    summary = llm.invoke([HumanMessage(content=prompt)]).content
    return {
        "research_summary": summary,
        "messages": [AIMessage(content=f"[researcher] done, {len(summary)} chars")]
    }

def writer(state: AgentState) -> dict:
    prompt = f"Write a 300-word draft from these facts:\n{state['research_summary']}"
    draft = llm.invoke([HumanMessage(content=prompt)]).content
    return {
        "draft": draft,
        "messages": [AIMessage(content=f"[writer] draft ready, {len(draft.split())} words")]
    }

def editor(state: AgentState) -> dict:
    prompt = f"Tighten this draft, return final copy:\n{state['draft']}"
    final = llm.invoke([HumanMessage(content=prompt)]).content
    return {
        "draft": final,
        "messages": [AIMessage(content="[editor] polished")]
    }

def route(state: AgentState) -> Literal["researcher", "writer", "editor", "__end__"]:
    # END is LangGraph's "__end__" sentinel; returning it stops the graph.
    nxt = state["next_agent"]
    return END if nxt == "done" else nxt

graph = StateGraph(AgentState)
graph.add_node("supervisor", supervisor)
graph.add_node("researcher", researcher)
graph.add_node("writer", writer)
graph.add_node("editor", editor)
graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", route)
for worker in ("researcher", "writer", "editor"):
    graph.add_edge(worker, "supervisor")

app = graph.compile()
result = app.invoke({
    "messages": [HumanMessage(content="Write a blurb on Postgres 17 streaming I/O.")],
    "next_agent": "", "research_summary": "", "draft": ""
})
print(result["draft"])

Notice that each worker returns a short status message to messages and stashes its real output in its own state slot. The supervisor sees the status, not the full draft. That’s the typed handoff in practice.
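One way to make that handoff explicit is to give every worker a small report type and render only that into the supervisor’s message list. This is a minimal sketch using the standard library; WorkerReport, make_report, and to_status_line are hypothetical names, not part of LangGraph.

```python
from dataclasses import dataclass
from typing import Literal

WorkerName = Literal["researcher", "writer", "editor"]


@dataclass(frozen=True)
class WorkerReport:
    """What the supervisor is allowed to see about a worker run."""
    worker: WorkerName
    status: Literal["ok", "failed"]
    size: int        # size of the real output, in characters
    note: str = ""   # optional one-line routing hint


def make_report(worker: WorkerName, output: str, note: str = "") -> WorkerReport:
    """Build the report from the full output without copying the output itself."""
    status = "ok" if output.strip() else "failed"
    return WorkerReport(worker=worker, status=status, size=len(output), note=note)


def to_status_line(report: WorkerReport, max_note: int = 80) -> str:
    """Render a report as the short status string that goes into `messages`."""
    note = report.note[:max_note]
    suffix = f", {note}" if note else ""
    return f"[{report.worker}] {report.status}, {report.size} chars{suffix}"


draft = "lorem ipsum " * 50   # stand-in for a worker's real output
line = to_status_line(make_report("writer", draft, note="ready for edit"))
print(line)
```

The supervisor’s context then grows by one short line per worker run, however long the draft gets.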

When supervisor is wrong

Supervisor breaks when work needs to happen in parallel, when workers need to talk to each other directly, or when the routing logic itself needs to be learned rather than prompted. Move to one of the next three patterns.

Pattern 2, hierarchical teams

When you have more than five or six specialists, a flat supervisor becomes a routing nightmare. The supervisor’s prompt grows, its decisions get worse, and you blow context. The fix is to group specialists into teams, each with its own sub-supervisor, and have a top-level supervisor route between teams.

I used this for a customer support agent system with twelve specialists across three domains: billing, technical, and account management. Top supervisor picks the team, team lead picks the specialist. Each level only ever picks among three or four options, so accuracy stays high.

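# pseudo-LangGraph, abbreviated for clarity: TeamState, GlobalState, and the
# node functions (billing_lead, refunds, invoicing, top_router, route_to_team)
# are assumed to be defined elsewhere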
from langgraph.graph import StateGraph

billing_team = StateGraph(TeamState)
billing_team.add_node("billing_lead", billing_lead)
billing_team.add_node("refunds_specialist", refunds)
billing_team.add_node("invoicing_specialist", invoicing)
# ... edges

tech_team = StateGraph(TeamState)
# ... similar

top = StateGraph(GlobalState)
top.add_node("top_supervisor", top_router)
top.add_node("billing", billing_team.compile())
top.add_node("tech", tech_team.compile())
top.add_conditional_edges("top_supervisor", route_to_team)

LangGraph 0.2 supports nesting compiled graphs as nodes directly, which is the cleanest way to express hierarchy. Adding a compiled graph with add_node, as above, works when the team state shares keys with the global state. If the schemas differ, wrap the team graph in a node function that translates the global state into team state on entry and back on exit.

The pattern that worked for me: sub-supervisors hold short-term memory for their team and never leak it upward. Only the team’s final answer crosses the team boundary. Otherwise the global state balloons.
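Here is a rough sketch of that boundary in plain Python, assuming the team is invoked through a wrapper function. The state fields, make_team_node, and run_billing_team are illustrative names; run_billing_team stands in for invoking a compiled team graph.

```python
from typing import Callable, TypedDict


class GlobalState(TypedDict):
    user_query: str
    team: str
    final_answer: str


class TeamState(TypedDict):
    query: str
    scratchpad: list[str]   # team-private working memory
    answer: str


def make_team_node(run_team: Callable[[TeamState], TeamState]) -> Callable[[GlobalState], dict]:
    """Wrap a team runner so it can sit in the top-level graph as an ordinary node."""
    def node(state: GlobalState) -> dict:
        # Entry: translate global state into a fresh team state.
        team_in: TeamState = {"query": state["user_query"], "scratchpad": [], "answer": ""}
        team_out = run_team(team_in)
        # Exit: only the final answer crosses the boundary; the scratchpad is dropped.
        return {"final_answer": team_out["answer"]}
    return node


def run_billing_team(state: TeamState) -> TeamState:
    """Stand-in for the compiled billing graph; a real team would call LLMs here."""
    notes = state["scratchpad"] + ["looked up invoice", "found duplicate charge"]
    return {"query": state["query"], "scratchpad": notes,
            "answer": "Refund issued for the duplicate charge."}


billing_node = make_team_node(run_billing_team)
update = billing_node({"user_query": "Why was I charged twice?", "team": "billing", "final_answer": ""})
print(update)
```

Because the wrapper returns a partial update with a single key, the team’s scratchpad can grow as large as it likes without touching the global state.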

Pattern 3, swarm with peer handoff

Swarms drop the supervisor entirely. Each agent decides whether to handle the request or hand off to a peer. OpenAI’s Swarm reference implementation popularized the pattern in late 2024, and it’s now a first-class topology in LangGraph via langgraph-swarm.

Use swarm when the routing logic is itself the hard part and you want it co-located with the specialist knowledge. Customer support is a classic fit: the billing agent often knows better than a router whether a question is really a tech issue.

# pip install "langgraph-swarm>=0.0.3"
from langgraph_swarm import create_handoff_tool, create_swarm
from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

to_billing = create_handoff_tool(agent_name="billing", description="Hand off billing questions")
to_tech = create_handoff_tool(agent_name="tech", description="Hand off technical questions")

billing = create_react_agent(
    llm, tools=[to_tech], name="billing",
    prompt="You handle billing. Hand off technical issues."
)
tech = create_react_agent(
    llm, tools=[to_billing], name="tech",
    prompt="You handle technical issues. Hand off billing questions."
)

swarm = create_swarm(agents=[billing, tech], default_active_agent="billing").compile()
out = swarm.invoke({"messages": [{"role": "user", "content": "Why was I charged twice?"}]})

The runtime tracks which agent is currently active. A handoff is implemented as a tool call that swaps the active agent and feeds the same conversation forward. That last part is critical: the receiving agent sees the full history, not a summary. It’s a trade-off: you pay tokens for context fidelity.

Swarm gotcha I hit hard

Two agents can hand off to each other in a loop if their prompts disagree on who owns what. I add a hop counter to state and abort at five hops. It’s crude but it works.
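A framework-agnostic sketch of that hop counter follows. It assumes handoffs pass through one function you control; record_handoff, HandoffLoopError, and the state keys are hypothetical, and wiring this into a specific runtime is left out. This version allows five handoffs and aborts on the sixth attempt.

```python
MAX_HOPS = 5


class HandoffLoopError(RuntimeError):
    """Raised when agents keep passing a request back and forth."""


def record_handoff(state: dict, target: str, max_hops: int = MAX_HOPS) -> dict:
    """Return a new state with the handoff applied, or raise once the hop budget is spent."""
    hops = state.get("hops", 0) + 1
    trail = state.get("trail", []) + [target]
    if hops > max_hops:
        raise HandoffLoopError(f"aborting after {max_hops} hops: {' -> '.join(trail)}")
    return {**state, "active_agent": target, "hops": hops, "trail": trail}


# Simulate billing and tech disagreeing about who owns the request.
state = {"active_agent": "billing", "hops": 0, "trail": ["billing"]}
aborted = False
try:
    for _ in range(10):
        target = "tech" if state["active_agent"] == "billing" else "billing"
        state = record_handoff(state, target)
except HandoffLoopError as exc:
    aborted = True
    print(exc)
```

Keeping the trail in state makes the eventual error message show exactly which agents were bouncing the request.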

Pattern 4, event-driven with a message broker

For long-running workflows, batch processing, or anything where agents shouldn’t be in a single Python process, you want event-driven. Each agent is a worker that consumes from a queue, processes a message, and emits results to another queue. Redis Streams, RabbitMQ, or Kafka all work. I default to Redis Streams because the operational overhead is low.

# pip install "redis>=5.0" "openai>=1.59"
import json
import redis
from openai import OpenAI

r = redis.Redis(host="localhost", decode_responses=True)
client = OpenAI()

def research_worker():
    group = "researchers"
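    try:
        r.xgroup_create("tasks.research", group, id="0", mkstream=True)
    except redis.ResponseError as exc:
        # BUSYGROUP means the group already exists; anything else is a real error.
        if "BUSYGROUP" not in str(exc):
            raise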
    while True:
        msgs = r.xreadgroup(group, "worker-1", {"tasks.research": ">"}, count=1, block=5000)
        if not msgs:
            continue
        for _, entries in msgs:
            for msg_id, fields in entries:
                payload = json.loads(fields["data"])
                resp = client.chat.completions.create(
                    model="gpt-4o",
                    messages=[{"role": "user", "content": f"Research: {payload['query']}"}]
                )
                r.xadd("tasks.write", {"data": json.dumps({
                    "trace_id": payload["trace_id"],
                    "facts": resp.choices[0].message.content
                })})
                r.xack("tasks.research", group, msg_id)

Event-driven scales horizontally and survives process restarts, but you lose the synchronous request-response shape your API probably wants. The fix is a coordinator that fans out, collects results via a per-request stream, and resolves the original HTTP request. I’ll cover that pattern in detail in the LangGraph production tutorial.
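Until then, here is an in-process sketch of the coordinator idea, with asyncio queues standing in for the broker. It is a simplification and not the production design: it correlates results by trace_id on one shared result queue instead of a per-request stream, and Coordinator and fake_worker are made-up names.

```python
import asyncio
import uuid


class Coordinator:
    """Bridges a request-response caller to queue-based workers."""

    def __init__(self) -> None:
        self.tasks: asyncio.Queue = asyncio.Queue()     # stands in for the task stream
        self.results: asyncio.Queue = asyncio.Queue()   # stands in for the result stream
        self.pending: dict[str, asyncio.Future] = {}

    async def request(self, query: str, timeout: float = 5.0) -> str:
        """Publish a task and wait for the result carrying the same trace_id."""
        trace_id = uuid.uuid4().hex
        future = asyncio.get_running_loop().create_future()
        self.pending[trace_id] = future
        await self.tasks.put({"trace_id": trace_id, "query": query})
        try:
            return await asyncio.wait_for(future, timeout)
        finally:
            self.pending.pop(trace_id, None)

    async def collect(self) -> None:
        """Route each result back to the request that is waiting for it."""
        while True:
            msg = await self.results.get()
            future = self.pending.get(msg["trace_id"])
            if future is not None and not future.done():
                future.set_result(msg["facts"])


async def fake_worker(coord: Coordinator) -> None:
    """Stand-in for research_worker; echoes instead of calling a model."""
    while True:
        task = await coord.tasks.get()
        await coord.results.put({"trace_id": task["trace_id"],
                                 "facts": f"facts for {task['query']}"})


async def main() -> list[str]:
    coord = Coordinator()
    background = [asyncio.create_task(coord.collect()),
                  asyncio.create_task(fake_worker(coord))]
    try:
        # Two concurrent requests; each gets its own answer back.
        return await asyncio.gather(coord.request("redis streams"), coord.request("kafka"))
    finally:
        for task in background:
            task.cancel()


answers = asyncio.run(main())
print(answers)
```

The timeout in request is what keeps an HTTP handler from hanging forever when a worker dies mid-task.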

Common Pitfalls

A short list of the mistakes I’ve made or seen on review.

  1. Letting workers share a single mutable state dict. Two parallel workers race on writes and one wins silently. Use immutable updates, return new dicts from each node, and let the framework merge with reducers like LangGraph’s add_messages.
  2. Forgetting the loop-breaker. Any topology with cycles needs a hop counter or step limit. I default to twenty steps and log a hard error past that. Cheaper than a runaway agent burning through your token budget at 3am.
  3. Putting the full conversation in every prompt. Each worker gets a tailored view. Strip messages, summarize aggressively, and treat your context window like a memory budget. A 40k-token prompt is a code smell.
  4. Ignoring tool call latency. A supervisor that fires three sequential tool calls before deciding what runs next can be seconds slower than one that decides first. Order matters: batch where you can, and use the OpenAI API’s parallel tool calling where it’s safe.
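To make the first pitfall concrete, here is a plain-Python illustration of what a reducer buys you. It mimics the idea behind reducers such as add_messages but is not LangGraph’s implementation; REDUCERS and merge are invented for this sketch.

```python
import operator
from typing import Any, Callable

# One reducer per state key; keys without a reducer are overwritten (last write wins).
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {
    "messages": operator.add,   # lists are concatenated, so no update is lost
}


def merge(state: dict, updates: list[dict]) -> dict:
    """Apply partial updates from parallel workers and return a new state dict."""
    merged = dict(state)   # never mutate the caller's state
    for update in updates:
        for key, value in update.items():
            reducer = REDUCERS.get(key)
            if reducer is not None and key in merged:
                merged[key] = reducer(merged[key], value)
            else:
                merged[key] = value
    return merged


state = {"messages": ["user: compare two queues"], "draft": ""}
# Two workers ran in parallel; each returned only the keys it changed.
from_a = {"messages": ["[researcher-a] done"]}
from_b = {"messages": ["[researcher-b] done"], "draft": "v1"}

new_state = merge(state, [from_a, from_b])
print(new_state["messages"])
```

Both status messages survive the merge, and the original state dict is left untouched, which is the property you lose when workers write into a shared dict directly.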

Troubleshooting

Three failure modes you’ll hit at least once.

Workers return different schemas, supervisor crashes. This is the structured outputs problem. Pin a Pydantic model per worker and validate before returning. If the model can’t conform, retry once with the validation error in the prompt, then fail closed. OpenAI 1.59+ supports strict JSON schema mode through structured outputs, so use it. Anthropic SDK 0.42 has no equivalent strict mode; there, force a tool call with an input_schema and validate the arguments yourself.
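The retry-once-then-fail-closed flow might look like the sketch below. To stay dependency-free it uses a hand-written validator in place of a Pydantic model, and call_model is a hypothetical callable that takes a prompt and returns the model’s raw text.

```python
import json
from typing import Callable


class SchemaError(ValueError):
    """Raised when a worker's output does not match its pinned schema."""


def validate_research(raw: str) -> dict:
    """Hypothetical schema check: a JSON object with a non-empty list of string facts."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc.msg}") from exc
    facts = data.get("facts") if isinstance(data, dict) else None
    if not isinstance(facts, list) or not facts or not all(isinstance(f, str) for f in facts):
        raise SchemaError("expected {'facts': [str, ...]} with at least one fact")
    return {"facts": facts}


def call_validated(call_model: Callable[[str], str], prompt: str,
                   validate: Callable[[str], dict]) -> dict:
    """Call the model, retry once with the validation error appended, then fail closed."""
    try:
        return validate(call_model(prompt))
    except SchemaError as first_error:
        retry_prompt = (f"{prompt}\n\nYour previous reply was rejected: {first_error}. "
                        "Return only valid JSON.")
        # A second SchemaError propagates: the caller never gets unvalidated output.
        return validate(call_model(retry_prompt))


# Fake model: wrong key on the first call, correct on the second.
replies = iter(['{"fact": "typo in key"}', '{"facts": ["streams are append-only"]}'])
prompts_seen: list[str] = []


def fake_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


result = call_validated(fake_model, "List facts about Redis Streams as JSON.", validate_research)
print(result)
```

With Pydantic v2, the validate argument could wrap a model’s model_validate_json and catch pydantic.ValidationError in place of SchemaError; the control flow stays the same.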

Token usage spikes 4x after adding a new agent. Almost always means the new agent is seeing too much history. Inspect the prompt size at each node; you want to be under 8k tokens per call unless you have a reason. If you can’t trim, switch to a cheaper model for that node; gpt-4o-mini or claude-3-5-haiku handle most worker roles fine.
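A cheap way to inspect prompt size per node is to wrap each prompt builder in an audit function. The sketch below estimates tokens at roughly four characters each, which is only a heuristic for English text; a real tokenizer would give exact counts. with_prompt_audit and writer_prompt are hypothetical names.

```python
from typing import Callable

TOKEN_BUDGET = 8_000   # per-call budget from the paragraph above


def approx_tokens(text: str) -> int:
    """Rough estimate: about four characters per token for English text."""
    return max(1, len(text) // 4)


def with_prompt_audit(node_name: str, build_prompt: Callable[[dict], str],
                      report: Callable[[str], None] = print) -> Callable[[dict], str]:
    """Wrap a prompt builder so every call reports its approximate size."""
    def audited(state: dict) -> str:
        prompt = build_prompt(state)
        tokens = approx_tokens(prompt)
        flag = "OVER BUDGET" if tokens > TOKEN_BUDGET else "ok"
        report(f"{node_name}: ~{tokens} tokens ({flag})")
        return prompt
    return audited


def writer_prompt(state: dict) -> str:
    # Deliberately wasteful: dumps the whole history into the prompt.
    return "Write a draft from:\n" + "\n".join(state["messages"])


log: list[str] = []
audited_writer = with_prompt_audit("writer", writer_prompt, report=log.append)
audited_writer({"messages": ["short question"]})
audited_writer({"messages": ["x" * 1000] * 50})
print(log)
```

Pointing report at your logger gives a per-node size trail, so the node that caused a spike is visible without stepping through the graph.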

Latency p95 is twice the sum of node latencies. Network and serialization aren’t free, but neither is checkpointing. If you’re using LangGraph’s SqliteSaver, switch to PostgresSaver with connection pooling, or move to in-memory for non-durable runs. I’ve seen 800ms shaved off a single invocation by tuning the checkpointer.

Wrapping Up

Pick supervisor first, hierarchical when the supervisor’s prompt gets unwieldy, swarm when routing knowledge belongs with the specialists, and event-driven when you need durability and horizontal scale. Most production systems I’ve built end up as supervisor with one or two long-running event-driven workers behind it. That hybrid covers the latency-sensitive path and the heavy-lifting path without either side compromising.

The frameworks are good enough now that the architectural choice matters more than the framework choice. LangGraph, CrewAI, and AutoGen 0.4 can all implement any of these patterns. Pick the one your team can debug at 2am. For deeper reading, the LangGraph multi-agent docs lay out the framework primitives that map directly onto these patterns.

Next up in this series, I’ll build a full production-grade supervisor system in LangGraph 0.2 with checkpointing, retries, and observability wired in. Same code patterns, but the boring parts done right.