Multi Agent Conversations with AutoGen, Patterns and Pitfalls

May 13, 2024 · 7 min read · by Muhammad Amal programming

TL;DR — AutoGen earns its place when agents genuinely need to disagree. For everything else, you’re paying token tax for an abstraction you don’t need.

I’ll admit I was skeptical of multi-agent frameworks for a long time. Most “multi-agent” demos turn out to be two prompts in a trench coat, and you could replace the whole thing with a single well-structured call and a structured output schema. AutoGen, though, has a use case I keep coming back to. It’s when you want one agent to critique another, and you want that critique to actually shift behavior rather than be appended as politeness.

This post is the AutoGen 0.2.x build I’d ship today, with the architectural rewrite for 0.3 looming. We’ll cover the coder-critic pattern that justifies the framework’s existence, a group chat with a manager, and the termination and cost guards that keep things from spiraling. If you’re evaluating whether you need AutoGen at all, that’s a fair question and I’ll answer it at the end.

The framework is in a transitional phase. The 0.2.x line gets maintenance, the new architecture lands in 0.3, and Microsoft has signaled the API will shift. Build accordingly. Treat what you ship today as something you’ll port within six months, and don’t over-invest in the conversation abstractions.

The coder-critic pattern

This is the simplest setup that genuinely benefits from two agents. A coder writes code; a critic reviews it and pushes back. The critic isn’t just a second LLM pass with a different prompt. It’s a peer in the conversation that can ask the coder to revise.

import autogen

config_list = [{"model": "gpt-4-turbo", "api_key": "sk-..."}]
llm_config = {"config_list": config_list, "temperature": 0, "cache_seed": None}

coder = autogen.AssistantAgent(
    name="coder",
    llm_config=llm_config,
    system_message=(
        "You write Python functions. Always include type hints and docstrings. "
        "When the critic objects, revise. Reply with 'TERMINATE' only when the "
        "critic has signed off."
    ),
)

critic = autogen.AssistantAgent(
    name="critic",
    llm_config=llm_config,
    system_message=(
        "You review Python code for correctness, edge cases, and clarity. "
        "Be specific. Quote lines. If the code passes review, reply 'APPROVED'. "
        "Never write code yourself."
    ),
)

user_proxy = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=6,
    # content can be None (e.g. tool calls), so fall back to "" before matching
    is_termination_msg=lambda m: "TERMINATE" in (m.get("content") or "").upper(),
    code_execution_config=False,
)

groupchat = autogen.GroupChat(
    agents=[user_proxy, coder, critic],
    messages=[],
    max_round=10,
    speaker_selection_method="round_robin",
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(
    manager,
    message="Write a function that parses an ISO 8601 duration string and returns a timedelta.",
)

A few choices to notice. speaker_selection_method="round_robin" is deterministic. The alternative, "auto", has the manager make an extra LLM call before every turn to pick the next speaker. That is more flexible, but it adds a selection call on top of each agent turn and introduces non-determinism that’s hard to debug. Start with round-robin and only move to auto when you have a concrete reason.

max_round=10 is your hard backstop. AutoGen will respect it even if no agent says TERMINATE. Set it lower than you think you need; in my experience anything past 6 rounds in a coder-critic loop is the agents arguing about style.

The “real” termination, not just a magic string

The termination check above matches a substring in the message content. That works until an agent quotes “TERMINATE” while discussing termination conditions, at which point your conversation ends mid-thought. Use a stricter check that only fires when the word stands alone on a line.

import re

def is_terminated(message: dict) -> bool:
    content = message.get("content", "")
    if not content:
        return False
    return bool(re.search(r"^\s*TERMINATE\s*$", content, re.MULTILINE))

Pair this with a sentinel from the critic too. If the critic says “APPROVED” the coder should know to terminate next turn. Encode that in the coder’s system prompt explicitly.
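To see the difference concretely, here is a small sketch that runs both checks over the same messages. The names naive_check and strict_check and the sample messages are illustrative and not part of AutoGen.

```python
import re

def naive_check(message: dict) -> bool:
    # Substring match: fires whenever the word appears anywhere.
    return "TERMINATE" in (message.get("content") or "").upper()

def strict_check(message: dict) -> bool:
    # Anchored match: fires only when TERMINATE stands alone on a line.
    content = message.get("content") or ""
    return bool(re.search(r"^\s*TERMINATE\s*$", content, re.MULTILINE))

quoted = {"content": "We should say TERMINATE only after the critic approves."}
final = {"content": "All review comments addressed.\nTERMINATE"}
tool_call = {"content": None}  # some messages carry no text content

for name, msg in [("quoted", quoted), ("final", final), ("tool_call", tool_call)]:
    print(name, naive_check(msg), strict_check(msg))
```

The quoted message trips the substring check but not the anchored one, which is exactly the mid-thought shutdown you want to avoid.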

Group chat with a designated manager

When you have more than two agents, the question of who speaks next becomes a design problem. With "auto" speaker selection, AutoGen’s GroupChatManager runs an LLM call per turn to decide. It works, it’s expensive, and it occasionally picks the wrong agent.

The pattern I prefer for fixed workflows is to skip the LLM-driven selection and use a state machine over agent identities. AutoGen lets you do this by passing a function as speaker_selection_method. You still hand the group chat to a GroupChatManager to run it; the manager just stops spending tokens on choosing speakers.

def select_speaker(last_speaker, groupchat):
    if last_speaker.name == "user":
        return next(a for a in groupchat.agents if a.name == "planner")
    if last_speaker.name == "planner":
        return next(a for a in groupchat.agents if a.name == "coder")
    if last_speaker.name == "coder":
        return next(a for a in groupchat.agents if a.name == "critic")
    if last_speaker.name == "critic":
        # content may be None, so fall back to an empty string
        last_msg = groupchat.messages[-1].get("content") or ""
        if "APPROVED" in last_msg:
            return None  # no next speaker: the chat ends
        return next(a for a in groupchat.agents if a.name == "coder")
    return None

groupchat = autogen.GroupChat(
    agents=[user_proxy, planner, coder, critic],
    messages=[],
    max_round=12,
    speaker_selection_method=select_speaker,
)

Now the speaker selection is deterministic and free. You only pay for the agent turns themselves. For workflows where the order is mostly fixed but you occasionally need branching, this is far cheaper than the LLM-based manager and far easier to debug.
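Because the selector is a plain function, you can test the routing without an API key. The sketch below rewrites the same logic as a transition table and exercises it with stand-in objects. NEXT_SPEAKER, is_approved, and the SimpleNamespace stand-ins are illustrative names, and the sketch assumes, as the version above does, that returning None ends the chat. It also tightens the approval check so that “NOT APPROVED” does not count as approval.

```python
from types import SimpleNamespace
from typing import Optional

# Fixed hand-offs: who speaks after whom.
NEXT_SPEAKER = {"user": "planner", "planner": "coder", "coder": "critic"}

def is_approved(content: Optional[str]) -> bool:
    # Require APPROVED alone on a line so "NOT APPROVED" does not match.
    return any(line.strip() == "APPROVED" for line in (content or "").splitlines())

def select_speaker(last_speaker, groupchat):
    by_name = {agent.name: agent for agent in groupchat.agents}
    if last_speaker.name == "critic":
        last_content = groupchat.messages[-1].get("content")
        # Approved: no next speaker. Otherwise send the coder back to revise.
        return None if is_approved(last_content) else by_name["coder"]
    next_name = NEXT_SPEAKER.get(last_speaker.name)
    return by_name.get(next_name) if next_name else None

# Stand-ins exposing only the attributes the selector touches.
agents = [SimpleNamespace(name=n) for n in ("user", "planner", "coder", "critic")]
chat = SimpleNamespace(agents=agents, messages=[{"content": "Line 3 is NOT APPROVED yet."}])
critic = agents[3]

print(select_speaker(critic, chat).name)  # coder
chat.messages.append({"content": "Looks good.\nAPPROVED"})
print(select_speaker(critic, chat))       # None
```

Keeping the hand-offs in a dict means adding a step is a one-line change, and the branching stays confined to the critic.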

Cost control is structural, not optional

Multi-agent conversations are token furnaces. Three agents each speaking once per round for six rounds is eighteen LLM calls, and each call re-sends the full prior history. A trivial task can spend 40k tokens before producing an answer.
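A back-of-envelope estimate shows where those tokens go. The function below is a rough model, not a measurement: the name conversation_tokens and the message sizes (250-token replies, 100-token system prompts, a 50-token task) are assumptions picked for illustration.

```python
def conversation_tokens(calls: int, reply_tokens: int,
                        system_tokens: int = 0, task_tokens: int = 0) -> dict:
    """Estimate tokens when every call re-sends the full prior history."""
    prompt = 0
    history = task_tokens                  # the opening task message
    for _ in range(calls):
        prompt += system_tokens + history  # what this call has to read
        history += reply_tokens            # its reply joins the history
    completion = calls * reply_tokens
    return {"prompt": prompt, "completion": completion, "total": prompt + completion}

# Three agents, six rounds: 18 calls with the assumed sizes above.
estimate = conversation_tokens(calls=18, reply_tokens=250, system_tokens=100, task_tokens=50)
print(estimate)  # {'prompt': 40950, 'completion': 4500, 'total': 45450}
```

Under these assumptions the prompt side alone is about 41k tokens, roughly nine times the output. Prompt tokens grow with the square of the number of calls, so cutting rounds or trimming history saves far more than shortening any single reply.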

Three patterns actually contain the spend.

First, compact history aggressively. AutoGen agents see the full message history by default. The summary_method argument on initiate_chat condenses a finished chat into a summary you can hand to the next chat, but it does not shrink what agents see mid-conversation; for that, trim or summarize older messages yourself before they are re-sent. Second, cap output tokens per agent via llm_config. A critic doesn’t need 4000 tokens. Third, log token usage per turn and alarm on outliers; AutoGen exposes this via autogen.runtime_logging.

import autogen

# Log every LLM call to a local SQLite file for later analysis.
autogen.runtime_logging.start(logger_type="sqlite", config={"dbname": "autogen_runs.db"})
# ... run agents ...
autogen.runtime_logging.stop()

You can query the resulting SQLite for token counts per agent per run, which is the data you need to decide where to cut.
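A minimal sketch of such a query follows. It assumes the log contains a chat_completions table with a session_id column and a response column holding the provider’s JSON reply, including a usage object. Check that against your own database, since the schema may differ between releases. To stay runnable on its own, the sketch builds a tiny in-memory table with that assumed shape; point sqlite3.connect at autogen_runs.db to use real data. The function name tokens_per_session is illustrative.

```python
import json
import sqlite3
from collections import defaultdict

def tokens_per_session(con: sqlite3.Connection) -> dict:
    """Sum usage.total_tokens per session_id (assumed schema, see above)."""
    totals = defaultdict(int)
    for session_id, response in con.execute(
        "SELECT session_id, response FROM chat_completions"
    ):
        usage = json.loads(response).get("usage") or {}
        totals[session_id] += usage.get("total_tokens", 0)
    return dict(totals)

# Synthetic stand-in for autogen_runs.db so the sketch runs on its own.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE chat_completions (session_id TEXT, response TEXT)")
rows = [
    ("run-a", json.dumps({"usage": {"total_tokens": 1200}})),
    ("run-a", json.dumps({"usage": {"total_tokens": 2100}})),
    ("run-b", json.dumps({"usage": {"total_tokens": 800}})),
]
con.executemany("INSERT INTO chat_completions VALUES (?, ?)", rows)
print(tokens_per_session(con))  # {'run-a': 3300, 'run-b': 800}
```

Breaking the totals down per agent works the same way once you have identified which column or request field names the calling agent in your version of the log.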

For a comparison with simpler single-agent flows, see /blog/agentic-ai-landscape-may-2024/. The Microsoft team has also published a list of patterns for speaker selection that’s worth a careful read before you commit to a topology.

When to skip AutoGen entirely

I want to be direct about this. Most flows people build in AutoGen could be built simpler.

A “researcher” agent that gathers facts followed by a “writer” agent that turns them into prose is two LLM calls in a function. You don’t need AutoGen for that. You need a Python function.

A pipeline of “extract data, transform data, validate data” is three calls and some validation. Again, a function.
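For scale, this is roughly what the researcher-writer flow looks like as a function. The llm parameter is a hypothetical callable that takes a prompt and returns text; swap in whatever client you already use. fake_llm exists only so the sketch runs without an API key.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns text

def research_then_write(topic: str, llm: LLM) -> str:
    """Two sequential calls; no agents, no conversation state."""
    facts = llm(f"List the key facts about {topic} as short bullet points.")
    return llm(f"Write a short article about {topic} using only these facts:\n{facts}")

# Stand-in model so the sketch runs without an API key.
def fake_llm(prompt: str) -> str:
    return f"[reply to {len(prompt)} chars]"

print(research_then_write("ISO 8601 durations", fake_llm))
```

The second call sees only the facts, not a growing transcript, so the cost stays flat and each step can be tested in isolation.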

AutoGen earns its complexity when one or more of these are true. The agents need to react to each other’s outputs in a way that isn’t a fixed pipeline. The number of turns is unbounded and depends on the conversation. You want a critic that can genuinely send the coder back to revise, not just append commentary. If none of those apply, write a function.

Common Pitfalls

These are the recurring failure modes I see in AutoGen codebases.

No max_round, no termination check. Two agents will happily talk forever. Always set both.
LLM-based manager when round-robin would do. You’re paying an extra LLM call per turn for flexibility you don’t use.
Sharing one llm_config across agents with different temperature needs. A critic at temperature 0 and a coder at temperature 0.3 need separate llm_config dicts; they can still share the same config_list.
Forgetting cache_seed=None. AutoGen’s default caching can mask bugs. Disable during development.
Treating tool execution as a UserProxyAgent job in production. A UserProxyAgent with code execution enabled runs whatever code the LLM produces. Set use_docker=True in code_execution_config or use a remote sandbox.
Ignoring the 0.3 migration. If you’re starting today, isolate your AutoGen-specific code behind an interface so the port is contained.
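The separate-config point, combined with the per-agent output cap from the cost section, can be sketched like this. The helper make_llm_config, the OPENAI_API_KEY variable name, and the specific max_tokens values are assumptions for illustration; the sketch also assumes extra llm_config keys such as max_tokens are forwarded to the model call.

```python
import os

# One shared model list; the key is read from the environment (assumed variable name).
config_list = [{"model": "gpt-4-turbo", "api_key": os.environ.get("OPENAI_API_KEY", "")}]

def make_llm_config(temperature: float, max_tokens: int) -> dict:
    # A fresh dict per agent, so settings never leak between agents.
    return {
        "config_list": config_list,
        "temperature": temperature,
        "max_tokens": max_tokens,  # caps reply length for this agent
        "cache_seed": None,        # no response caching during development
    }

coder_llm_config = make_llm_config(temperature=0.3, max_tokens=1500)
critic_llm_config = make_llm_config(temperature=0, max_tokens=500)
```

Pass coder_llm_config and critic_llm_config to the respective AssistantAgent constructors in place of a single shared llm_config.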

Wrapping Up

AutoGen in May 2024 is a useful tool in a narrow set of situations. Coder-critic pairs work well. Group chats with deterministic speaker selection are reasonable. Anything where you genuinely want emergent multi-agent dialogue is hard to replicate without a framework like this.

For everything else, the boring answer wins. A function, a single LLM call, a structured output. The multi-agent abstraction is seductive because it feels like progress, but progress in production usually looks like fewer LLM calls, not more. Use AutoGen when the problem demands a conversation. Skip it when the problem is really a pipeline.

The transitional state of the framework is the awkward part. Build with the assumption that you’ll port some of this in six months. Keep your business logic outside the agent classes. Keep your tool implementations framework-agnostic. The conversation orchestration is what AutoGen does; everything else should be portable.
