ReAct, Reflexion, and Planner-Executor: Agent Loop Patterns That Work
TL;DR: ReAct is the default. Reflexion adds a self-critique pass that pays off on long tasks with verifiable outputs. Planner-executor wins when the task decomposes cleanly and the steps are expensive.
The agent loop is the heart of the whole stack. Get it right and the model’s strengths shine through. Get it wrong and you’ll spend weeks tuning prompts to paper over a structural mistake. Three patterns dominate the literature and the codebases I’ve audited this year, and each one has a clear home turf.
This post is the side-by-side I’d hand a new engineer joining an agent team. We’ll implement ReAct, Reflexion, and planner-executor as small, real Python programs, then talk about when each one is the right call. The implementations are intentionally minimal so the structure is visible; in production you’d wrap them in LangGraph or equivalent, but the underlying loops would be the same.
Before we start, the framing I find useful. Every agent is a loop that does three things: think, act, observe. ReAct interleaves those three at every step. Reflexion adds a fourth, “critique,” that runs less frequently. Planner-executor splits the high-level “think” off and runs it once at the top. The pattern names are about how you arrange those four primitives.
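To make the framing concrete, here is a minimal sketch of the think, act, observe loop with no model involved. The names `agent_loop`, `toy_think`, and `toy_act` are hypothetical stand-ins, not part of any library; in a real agent, `think` would be an LLM call and `act` a tool dispatch.

```python
from typing import Callable, Optional, Tuple

# Hypothetical types for illustration: an action is a (tool_name, argument)
# pair, and None means "no more actions, finish".
Action = Optional[Tuple[str, str]]


def agent_loop(
    goal: str,
    think: Callable[[str, list], Tuple[str, Action]],
    act: Callable[[str, str], str],
    max_steps: int = 5,
) -> list:
    """Run think -> act -> observe until think() returns no action."""
    trace = []  # list of (thought, action, observation) tuples
    for _ in range(max_steps):
        thought, action = think(goal, trace)          # think
        if action is None:
            trace.append((thought, None, None))
            break
        observation = act(*action)                    # act
        trace.append((thought, action, observation))  # observe
    return trace


# Toy stand-ins so the sketch runs without a model.
def toy_think(goal: str, trace: list) -> Tuple[str, Action]:
    if not trace:
        return ("I should look this up.", ("lookup", goal))
    return ("I have what I need.", None)


def toy_act(tool: str, arg: str) -> str:
    return f"{tool} result for {arg!r}"


trace = agent_loop("order 42", toy_think, toy_act)
```

The three patterns below differ only in where the `think` call sits and whether a separate critique step reads the trace afterwards.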
ReAct, the loop you start with
ReAct stands for Reason and Act. The model alternates between a “thought” (free text about what to do) and an “action” (a tool call), and each observation is fed back as the next input. This loop maps directly onto OpenAI’s function calling, and it’s the one most production agents in 2024 use.
from openai import OpenAI
import json

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Find orders matching a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
            "additionalProperties": False,
        },
    },
}]

def dispatch_tool(name: str, args: dict) -> dict:
    # Placeholder so the example runs end to end; swap in your real tool implementations.
    if name == "search_orders":
        return {"query": args["query"], "orders": []}
    return {"error": f"unknown tool: {name}"}

def run_react(user_msg: str, max_steps: int = 6) -> str:
    messages = [
        {"role": "system", "content": "You are a careful agent. Think step by step. Use tools when needed."},
        {"role": "user", "content": user_msg},
    ]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4-turbo", messages=messages, tools=TOOLS, tool_choice="auto",
        )
        msg = resp.choices[0].message
        # Keep the assistant turn (including any tool calls) in the transcript.
        messages.append(msg.model_dump(exclude_none=True))
        if not msg.tool_calls:
            # No tool call means the model has produced its final answer.
            return msg.content or ""
        for call in msg.tool_calls:
            result = dispatch_tool(call.function.name, json.loads(call.function.arguments))
            # Each tool result must reference the call it answers.
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return "Step limit reached."
That’s a complete ReAct loop in about 25 lines. It works for the majority of agent workloads. The strengths are simplicity, low latency (no extra planning pass), and easy debugging because the trace is linear. The weakness is that the model can wander. On long tasks it forgets the goal, repeats actions, or stops short.
The fix for wandering isn’t a new pattern; it’s a better prompt and an iteration cap. But there’s a class of problems where even good prompting doesn’t help, and that’s where Reflexion comes in.
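One cheap guard that can sit alongside the iteration cap is a check for repeated actions. The sketch below is an assumption about how you might wire it in, not something the loop above already does: `RepeatGuard` is a hypothetical helper that counts identical tool calls within a single run.

```python
import json
from collections import Counter


class RepeatGuard:
    """Counts identical (tool name, arguments) pairs within one agent run."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def should_stop(self, name: str, args: dict) -> bool:
        # Canonicalize the arguments so key order does not hide a repeat.
        key = (name, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats


guard = RepeatGuard(max_repeats=2)
first = guard.should_stop("search_orders", {"query": "late orders"})
second = guard.should_stop("search_orders", {"query": "late orders"})
third = guard.should_stop("search_orders", {"query": "late orders"})
```

Inside a ReAct loop you would call `should_stop` just before dispatching each tool call, and return early (or inject a reminder of the goal) when it comes back true.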
Reflexion, the self-critique loop
Reflexion adds a critique step after each attempt. The agent tries, evaluates its own work, and either accepts or revises. The original paper from Shinn et al. is a worthwhile read, but the practical implementation is simpler than the academic version.
def run_reflexion(task: str, max_attempts: int = 3) -> str:
    history = []  # critiques from failed attempts
    result = ""
    for _ in range(max_attempts):
        # Feed earlier critiques back in so the next attempt can correct course.
        prompt = task
        if history:
            prompt += "\n\nPrior critiques:\n" + "\n".join(history)
        result = run_react(prompt)
        critique = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content":
                    "Evaluate the answer against the task. Reply 'PASS' if correct, "
                    "or 'FAIL: <one-line reason>' if not."},
                {"role": "user", "content": f"Task: {task}\n\nAnswer: {result}"},
            ],
            temperature=0,
        ).choices[0].message.content or ""
        if critique.strip().startswith("PASS"):
            return result
        history.append(critique)
    # Out of attempts: return the last result even though it did not pass.
    return result
The reason this works is that the critique is cheap relative to the attempt. One LLM call evaluates the output of N tool calls. When the attempt costs 10k tokens and the critique costs 500, the critic adds only about 5% per attempt. Retries are where the cost goes, and you pay for them only when the critic flags a failure, so two or three revisions can still come out ahead of a single, much longer ReAct loop.
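The arithmetic is worth writing down once. This is a back-of-the-envelope sketch using the illustrative 10k and 500 token figures from the paragraph above; the function names are hypothetical and real costs depend on your prompts and tools.

```python
def reflexion_tokens(
    attempts: int,
    attempt_tokens: int = 10_000,
    critique_tokens: int = 500,
) -> int:
    """Total tokens when every attempt is followed by one critique call."""
    return attempts * (attempt_tokens + critique_tokens)


def critique_overhead(attempt_tokens: int = 10_000, critique_tokens: int = 500) -> float:
    """Fraction of extra tokens the critic adds on top of each attempt."""
    return critique_tokens / attempt_tokens


one_try = reflexion_tokens(1)       # 10_500 tokens
three_tries = reflexion_tokens(3)   # 31_500 tokens
overhead = critique_overhead()      # 0.05, i.e. about 5% per attempt
```

The critic itself is nearly free; the budget question is how often it sends you back for another full attempt.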
Reflexion is most useful when failure is recoverable and detectable. Code generation with a test suite. Data extraction where you can spot-check the schema. Search-and-summarize where you can verify citations. It’s less useful when failure is invisible to a second LLM, which is to say, when the critic can’t actually tell good from bad.
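When the failure signal is mechanical, the critic doesn’t have to be an LLM at all. Here is a minimal sketch of a deterministic critic for the data-extraction case, returning the same PASS / FAIL format as the prompt-based critic. The schema in `REQUIRED_FIELDS` and the function name `schema_critic` are made up for illustration.

```python
import json

# Hypothetical schema: field name -> expected Python type after JSON parsing.
REQUIRED_FIELDS = {"order_id": str, "total": float}


def schema_critic(answer: str) -> str:
    """Return 'PASS' or 'FAIL: <reason>', mirroring the LLM critic's format."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return "FAIL: answer is not valid JSON"
    if not isinstance(data, dict):
        return "FAIL: answer is not a JSON object"
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            return f"FAIL: missing field '{field}'"
        if not isinstance(data[field], expected):
            return f"FAIL: field '{field}' should be {expected.__name__}"
    return "PASS"


ok = schema_critic('{"order_id": "A1", "total": 19.5}')
missing = schema_critic('{"order_id": "A1"}')
garbage = schema_critic("not json")
```

A check like this can run before, or instead of, the LLM critique; a test suite plays the same role for code generation.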
Planner-executor, when the task decomposes
The planner-executor split is for tasks where you can confidently lay out the steps before you start. Migrate a database. Refactor a module. Process a batch of documents. The planner runs once, produces a list of subtasks, and the executor handles them in sequence (or in parallel if they’re independent).
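from pydantic import BaseModel
from typing import List

class Plan(BaseModel):
    steps: List[str]

def make_plan(task: str) -> Plan:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content":
                "Break the task into 3-7 concrete steps. Each step should be self-contained "
                "and executable by an agent with tool access. "
                'Respond with a JSON object of the form {"steps": ["...", "..."]}.'},
            {"role": "user", "content": task},
        ],
        # JSON mode requires the word "JSON" to appear in the messages. It guarantees
        # valid JSON, not this schema, so the prompt spells out the expected shape.
        response_format={"type": "json_object"},
    )
    # Raises pydantic.ValidationError if the reply does not match Plan.
    return Plan.model_validate_json(resp.choices[0].message.content)

def run_planner_executor(task: str) -> List[str]:
    plan = make_plan(task)
    results = []
    # Each step runs as its own short ReAct loop, in plan order.
    for step in plan.steps:
        results.append(run_react(step, max_steps=4))
    return results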
The wins are predictability and parallelism. You know up front how many steps you’re committing to. Independent steps can run concurrently. The planner can be a cheaper model (gpt-3.5-turbo or claude-3-haiku) since planning is mostly a structuring task, not a reasoning one.
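For the parallel case, a thread pool is usually enough, since each step spends most of its time waiting on network calls. This is a sketch under the assumption that the steps really are independent; `run_steps_parallel` and `fake_step` are hypothetical names, and in practice you would pass a real step runner in place of `fake_step`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_steps_parallel(
    steps: List[str],
    run_step: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Run independent plan steps concurrently, returning results in plan order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs, not completion order.
        return list(pool.map(run_step, steps))


# Stand-in for a real step runner so the sketch executes without API calls.
def fake_step(step: str) -> str:
    return f"done: {step}"


results = run_steps_parallel(["export table", "export index", "export views"], fake_step)
```

If one step raises, `pool.map` re-raises that exception when its result is consumed, so wrap the step runner if you want partial results.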
The losses are inflexibility and a higher latency floor, because the planning call has to finish before any step starts. If a step’s result changes what the next step should be, the original plan is wrong and you either replan (expensive) or muddle through (bad outputs). For tasks where you genuinely can’t predict the next step until you see the previous one, ReAct beats this every time.
How I actually choose
The decision rule that’s served me well.
- Is the task five steps or fewer with predictable shape? ReAct.
- Is the task long, with verifiable outputs and a clear failure signal? Reflexion.
- Is the task naturally a checklist where the steps don’t depend on each other much? Planner-executor.
You can compose them. A planner-executor where each step is run with Reflexion. A ReAct loop that hands off to a planner-executor when it hits a “compile a report” subtask. The patterns aren’t mutually exclusive; they’re vocabulary. Pick the right word for each part of the system.
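Composition is easiest to see when the pieces are passed in as plain functions. The sketch below runs each planned step through a Reflexion-style retry; `planner`, `attempt`, and `critic` are hypothetical callables, and the toy versions only exist so the example executes without a model.

```python
from typing import Callable, List


def reflexive_step(
    step: str,
    attempt: Callable[[str, List[str]], str],
    critic: Callable[[str, str], str],
    max_attempts: int = 3,
) -> str:
    """Retry one step until the critic replies PASS or attempts run out."""
    critiques: List[str] = []
    result = ""
    for _ in range(max_attempts):
        result = attempt(step, critiques)
        verdict = critic(step, result)
        if verdict.startswith("PASS"):
            break
        critiques.append(verdict)
    return result


def plan_then_reflect(
    task: str,
    planner: Callable[[str], List[str]],
    attempt: Callable[[str, List[str]], str],
    critic: Callable[[str, str], str],
) -> List[str]:
    """Planner-executor on the outside, Reflexion around each step."""
    return [reflexive_step(step, attempt, critic) for step in planner(task)]


# Toy stand-ins: the first attempt "fails", the retry after a critique "passes".
def toy_planner(task: str) -> List[str]:
    return task.split(" and ")


def toy_attempt(step: str, critiques: List[str]) -> str:
    return step.upper() if critiques else step


def toy_critic(step: str, result: str) -> str:
    return "PASS" if result.isupper() else "FAIL: expected upper case"


outputs = plan_then_reflect("fetch orders and write summary", toy_planner, toy_attempt, toy_critic)
```

Keeping the planner, actor, and critic as separate callables also makes it straightforward to give each one a different model or prompt.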
For the framework-level discussion that frames these patterns, see /blog/production-agents-langgraph-state-machines/. The ReAct paper from Yao et al. is also worth reading once if you’ve only seen it described secondhand.
Common Pitfalls
The mistakes that cost real money in production.
- Running Reflexion’s critic with the same prompt as the actor. You’ll get sycophantic critiques. Use a fundamentally different framing for the critic; sometimes a different model entirely.
- Letting the planner produce 30 steps. Cap it. A 30-step plan is a lie the model told you. 3-7 steps is realistic.
- No early termination in ReAct. The model can produce a final answer at step 2 of 6. Detect that and stop instead of letting it ramble.
- Planning with a more expensive model than executing. Counterintuitive. Planning is structuring; execution is where reasoning happens. Match models to task type.
- Hoping Reflexion will catch hallucinations. It mostly won’t. Hallucinated facts pass critique because the critic has no ground truth. Catch hallucinations with retrieval, not reflection.
- Mixing critique and revision in one prompt. Separate them. Critic identifies the problem. A new actor call fixes it. Conflating the two confuses the model.
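The plan-length cap from the list above is easy to enforce in code instead of trusting the prompt. A minimal sketch, assuming the plan arrives as a list of strings; `check_plan` is a hypothetical helper, and what to do on failure (replan, or fail loudly) is left to the caller.

```python
from typing import List


def check_plan(steps: List[str], max_steps: int = 7) -> List[str]:
    """Drop blank steps and reject plans longer than max_steps."""
    cleaned = [s.strip() for s in steps if s.strip()]
    if not cleaned:
        raise ValueError("plan has no steps")
    if len(cleaned) > max_steps:
        raise ValueError(f"plan has {len(cleaned)} steps, cap is {max_steps}")
    return cleaned


short_plan = check_plan(["dump schema", " copy rows ", "", "verify counts"])
# short_plan == ["dump schema", "copy rows", "verify counts"]
```

Raising here, before any step has executed, costs one planner call; discovering the problem at step 19 of 30 costs a great deal more.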
Wrapping Up
These three loops cover something like 90% of the agent shapes I see in production. The names matter less than the structural choices. ReAct interleaves thinking and acting. Reflexion adds a critic. Planner-executor splits planning out. Each gives up something to gain something else.
The mistake I want to flag one more time is treating these as alternatives rather than building blocks. A real production agent often has all three patterns in different places. A top-level planner produces a list of subtasks, each subtask runs as a ReAct loop, and the entire output is verified by a Reflexion-style critique. That kind of composition is where the patterns earn their keep, not in any single loop’s purity.
If you’re starting fresh, start with ReAct. Add Reflexion when you measure error rates and they’re unacceptable. Add planning when you find yourself simulating it manually outside the agent. Build up, don’t start at the top.