Role-Based Agent Teams with CrewAI: A Production Walkthrough
TL;DR: CrewAI 0.86 is opinionated, and that’s a feature. Use it when your problem decomposes into stable role descriptions and the work mostly flows linearly. Skip it when you need fine-grained graph control; that’s LangGraph’s job.
CrewAI is the framework I reach for when the team I’m working with has more product engineers than ML engineers. The API reads like a description of an actual team: you write a Researcher agent with a backstory, a Writer agent with a goal, and a couple of Tasks that flow between them. A junior dev can read a CrewAI script and understand what it does on the first pass. That’s worth a lot.
But I’ve also rebuilt two CrewAI prototypes in LangGraph because the team outgrew the abstractions. CrewAI 0.86 closed a lot of that gap. The hierarchical process is stable, the memory backends are pluggable, custom tools no longer require a decorator dance, and the new flow API gives you graph-like control for the cases that need it. This post is the walkthrough I’d give a team picking it up in 2025.
I’ll build a content research crew: three agents working on a real task, with custom tools, persistent memory, and observability wired in. Code targets Python 3.12 and crewai==0.86.0.
1. Install and project layout
CrewAI ships a CLI that scaffolds a project. I use it for new work because it standardizes the layout across teams.
python3.12 -m venv .venv && source .venv/bin/activate
pip install "crewai==0.86.0" "crewai-tools==0.17.0" "openai>=1.59" "tavily-python==0.5.0"
crewai create crew research_crew
cd research_crew
The scaffold gives you src/research_crew/crew.py plus YAML config files for agents and tasks. The YAML-first approach is one of the things CrewAI gets right: your prompts live as configuration, not as inline strings.
src/research_crew/
    config/
        agents.yaml
        tasks.yaml
    crew.py
    main.py
    tools/
        custom_tool.py
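Set your env vars in .env. The crew in section 3 uses SerperDevTool, which reads SERPER_API_KEY, so set that alongside the others.
echo "OPENAI_API_KEY=sk-..." >> .env
echo "SERPER_API_KEY=..." >> .env
echo "TAVILY_API_KEY=tvly-..." >> .env
echo "OPENAI_MODEL_NAME=gpt-4o" >> .env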
2. Defining the crew in YAML
This is where CrewAI’s opinions shine. Each agent has a role, goal, and backstory. Each task has a description, expected_output, and an agent assignment. The framework injects these into the system prompt at runtime.
# config/agents.yaml
researcher:
  role: >
    Senior Research Analyst for {topic}
  goal: >
    Find current, accurate information on {topic} and produce a structured
    fact sheet with sources and dates.
  backstory: >
    You're a senior analyst who's been burned by hallucinated stats. You
    cross-check every claim and you cite sources. You refuse to make up
    numbers you can't verify.

writer:
  role: >
    Technical writer
  goal: >
    Turn the research fact sheet into a {word_count}-word article that
    reads like it was written by a senior engineer.
  backstory: >
    You write like a senior engineer, contractions, no fluff, no hype
    language. You assume the reader is technical.

editor:
  role: >
    Editor and fact-checker
  goal: >
    Verify every claim in the draft against the fact sheet, fix style
    issues, and return the polished article.
  backstory: >
    You're a meticulous editor. You catch claims that drift from sources
    and you cut filler. You don't add new claims.
# config/tasks.yaml
research_task:
  description: >
    Research the topic {topic}. Produce a fact sheet with at least 8 facts,
    each cited with source URL and publication date. Do not include facts
    older than 18 months unless they're foundational.
  expected_output: >
    A markdown fact sheet with 8+ items in the format:
    - [fact]. Source, [URL] (YYYY-MM-DD).
  agent: researcher

writing_task:
  description: >
    Using the research fact sheet, write a {word_count}-word article on
    {topic}. The audience is senior engineers.
  expected_output: >
    A complete markdown article, no front matter, no title heading.
  agent: writer
  context: [research_task]

editing_task:
  description: >
    Verify each claim in the draft against the fact sheet. Fix style.
    Return the final polished article.
  expected_output: >
    Final article in markdown.
  agent: editor
  context: [research_task, writing_task]
The context field is critical. It tells CrewAI which prior task outputs to include as context for this task. Without it, a task in a sequential crew generally sees only the output of the task immediately before it, so the editor would get the draft but not the researcher’s fact sheet. The framework handles the wiring.
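To make the wiring concrete, here is a minimal sketch in plain Python of what placeholder interpolation and context resolution amount to. It is an illustration under assumptions, not CrewAI’s internal code: interpolate and build_task_prompt are hypothetical helpers, and the real prompt layout differs.

```python
def interpolate(template: str, inputs: dict[str, str]) -> str:
    """Fill {placeholders} such as {topic} with the kickoff inputs."""
    return template.format(**inputs)


def build_task_prompt(task: dict, outputs: dict[str, str], inputs: dict[str, str]) -> str:
    """Combine a task description with the outputs of the tasks named in `context`."""
    parts = [interpolate(task["description"], inputs)]
    for name in task.get("context", []):
        # Only tasks listed in `context` contribute; other outputs stay invisible.
        parts.append(f"--- output of {name} ---\n{outputs[name]}")
    return "\n\n".join(parts)


# Hypothetical stand-ins for a parsed tasks.yaml entry and an earlier task's output.
writing_task = {
    "description": "Write a {word_count}-word article on {topic}.",
    "context": ["research_task"],
}
outputs = {"research_task": "- Fact one. Source, https://example.com (2025-01-01)."}
inputs = {"topic": "Postgres 17", "word_count": "800"}

prompt = build_task_prompt(writing_task, outputs, inputs)
print(prompt)
```

The point of the sketch is the lookup by task name: a task only ever sees the outputs it lists, which is why a missing entry in context shows up as a writer or editor working blind.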
3. Wiring the crew with custom tools
Now the Python side. The @CrewBase decorator binds the YAML to a class.
# src/research_crew/crew.py
from crewai import Agent, Crew, Process, Task
from crewai.project import CrewBase, agent, crew, task
from crewai_tools import SerperDevTool, WebsiteSearchTool

from .tools.fact_checker import FactCheckerTool


@CrewBase
class ResearchCrew:
    agents_config = "config/agents.yaml"
    tasks_config = "config/tasks.yaml"

    @agent
    def researcher(self) -> Agent:
        return Agent(
            config=self.agents_config["researcher"],
            tools=[SerperDevTool(), WebsiteSearchTool()],
            verbose=True,
            max_iter=10,
            allow_delegation=False,
        )

    @agent
    def writer(self) -> Agent:
        return Agent(
            config=self.agents_config["writer"],
            verbose=True,
            max_iter=3,
            allow_delegation=False,
        )

    @agent
    def editor(self) -> Agent:
        return Agent(
            config=self.agents_config["editor"],
            tools=[FactCheckerTool()],
            verbose=True,
            max_iter=5,
            allow_delegation=False,
        )

    @task
    def research_task(self) -> Task:
        return Task(config=self.tasks_config["research_task"])

    @task
    def writing_task(self) -> Task:
        return Task(config=self.tasks_config["writing_task"])

    @task
    def editing_task(self) -> Task:
        return Task(config=self.tasks_config["editing_task"])

    @crew
    def crew(self) -> Crew:
        return Crew(
            agents=self.agents,
            tasks=self.tasks,
            process=Process.sequential,
            memory=True,
            verbose=True,
        )
Setting allow_delegation=False explicitly is a habit you should pick up. Delegation is a tool call the LLM decides to make, and it’s chaotic in practice. Recent releases default it to off, but older ones defaulted to on, so spell it out and turn it on only when you specifically need it.
Writing a custom tool
CrewAI 0.86 ships a clean BaseTool interface.
# src/research_crew/tools/fact_checker.py
from typing import Type

from crewai.tools import BaseTool
from pydantic import BaseModel, Field


class FactCheckInput(BaseModel):
    claim: str = Field(..., description="The claim to fact-check.")
    sources: list[str] = Field(..., description="URLs cited for the claim.")


class FactCheckerTool(BaseTool):
    name: str = "fact_checker"
    description: str = (
        "Check whether a factual claim is supported by the given source URLs. "
        "Returns 'supported', 'contradicted', or 'unverifiable' with a reason."
    )
    args_schema: Type[BaseModel] = FactCheckInput

    def _run(self, claim: str, sources: list[str]) -> str:
        # In a real tool you'd fetch the URLs and do retrieval.
        # For demo, we return a stub.
        if not sources:
            return "unverifiable, no sources provided"
        return f"supported, found in {sources[0]}"
Two things matter here. The args_schema is what the LLM sees, so write the field descriptions like documentation, not like code comments. And the _run method should be fast and deterministic, because slow tools tank your agent’s latency.
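If you want something one step past the stub, a deterministic check over already-fetched source text is a reasonable starting point. The sketch below is an assumption on my part, not part of CrewAI: check_claim is a hypothetical helper, the 0.6 threshold is arbitrary, and token overlap is a crude stand-in for real retrieval. It cannot detect contradictions, so it only returns supported or unverifiable.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase word tokens, ignoring words of three characters or fewer."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def check_claim(claim: str, source_texts: dict[str, str], threshold: float = 0.6) -> str:
    """Return 'supported' or 'unverifiable' based on token overlap with each source.

    `source_texts` maps a URL to text that was fetched beforehand (hypothetical input).
    """
    if not source_texts:
        return "unverifiable, no sources provided"
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return "unverifiable, claim has no checkable terms"
    for url, text in source_texts.items():
        overlap = len(claim_tokens & _tokens(text)) / len(claim_tokens)
        if overlap >= threshold:
            return f"supported, found in {url}"
    return "unverifiable, no source covers the claim"


sources = {
    "https://example.com/a": "Postgres 17 adds a streaming read interface for sequential scans."
}
print(check_claim("Postgres 17 adds a streaming read interface", sources))
print(check_claim("MySQL removed the query cache", sources))
```

Because the function is pure and does no network I/O, it stays fast and repeatable; the fetching can live in a separate, cached step.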
4. Running the crew
The main entry point.
# src/research_crew/main.py
from research_crew.crew import ResearchCrew


def run():
    inputs = {"topic": "Postgres 17 streaming I/O", "word_count": "800"}
    crew = ResearchCrew().crew()
    result = crew.kickoff(inputs=inputs)
    print("=" * 60)
    print(result.raw)
    print("=" * 60)
    print("token usage:", result.token_usage)


if __name__ == "__main__":
    run()
Run it.
python -m research_crew.main
You’ll see verbose output of every agent step, including tool calls and their results. The result.raw is the final task’s output; result.tasks_output has each individual task result if you need them.
5. Hierarchical process and the manager agent
Sequential is what you want most of the time. Hierarchical has a manager agent that decides which task runs next, and which agent gets it. It’s CrewAI’s version of the supervisor pattern.
# LLM is CrewAI's own model wrapper, so this needs no extra package
# beyond the install line in section 1.
from crewai import Crew, LLM, Process


# inside ResearchCrew
@crew
def crew(self) -> Crew:
    return Crew(
        agents=[self.researcher(), self.writer(), self.editor()],
        tasks=[self.research_task(), self.writing_task(), self.editing_task()],
        process=Process.hierarchical,
        manager_llm=LLM(model="gpt-4o", temperature=0),
        verbose=True,
        memory=True,
    )
The manager_llm is what the framework uses for the manager agent’s reasoning. I use gpt-4o at temperature 0 because the manager’s job is routing, not generation, and you want it as deterministic as possible. The worker agents can use cheaper models.
In practice I use hierarchical when I have more than four tasks and the dependencies between them aren’t strictly linear. For three-task pipelines, sequential is fine and saves you the manager’s token cost on every run.
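That rule of thumb is easy to encode. The sketch below is only my heuristic written down, with hypothetical helper names; it treats a pipeline as strictly linear when every task depends on nothing but the task directly before it.

```python
def is_linear(context: dict[str, list[str]]) -> bool:
    """True when each task depends on at most the task directly before it."""
    names = list(context)  # dicts preserve insertion order, i.e. task order
    for i, name in enumerate(names):
        allowed = {names[i - 1]} if i > 0 else set()
        if not set(context[name]) <= allowed:
            return False
    return True


def suggest_process(context: dict[str, list[str]]) -> str:
    """Hierarchical only for more than four tasks with non-linear dependencies."""
    if len(context) > 4 and not is_linear(context):
        return "hierarchical"
    return "sequential"


# The three-task pipeline from this post: the editor fans in from two tasks,
# but three tasks is still too small to justify a manager.
pipeline = {
    "research_task": [],
    "writing_task": ["research_task"],
    "editing_task": ["research_task", "writing_task"],
}
print(suggest_process(pipeline))
```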
6. Memory and the persistence story
CrewAI memory has three layers: short-term (conversation), long-term (cross-run), and entity (named things the crew has discussed). With memory=True, all three are on with local defaults: long-term memory goes to a SQLite file, while short-term and entity memory use a local RAG store.
For production, point them at a real store.
from crewai.memory import LongTermMemory, ShortTermMemory, EntityMemory
from crewai.memory.storage.rag_storage import RAGStorage
from crewai.memory.storage.ltm_sqlite_storage import LTMSQLiteStorage


# inside ResearchCrew
@crew
def crew(self) -> Crew:
    return Crew(
        agents=self.agents,
        tasks=self.tasks,
        process=Process.sequential,
        memory=True,
        long_term_memory=LongTermMemory(
            storage=LTMSQLiteStorage(db_path="/var/lib/crewai/ltm.db")
        ),
        short_term_memory=ShortTermMemory(
            storage=RAGStorage(
                embedder_config={
                    "provider": "openai",
                    "config": {"model": "text-embedding-3-small"},
                },
                type="short_term",
                path="/var/lib/crewai/stm",
            )
        ),
    )
The RAGStorage defaults to ChromaDB locally. For real deployments, override it with a Postgres pgvector backend; the interface is small enough that you can write one in an afternoon.
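To show how small that surface is, here is a toy in-memory backend. Treat it as a sketch under assumptions: the method names save, search, and reset are what I’d expect a storage class to need, not a verified copy of CrewAI’s base class, so check them against your pinned version. Word overlap stands in for vector similarity; a pgvector version would replace the list with SQL.

```python
from typing import Any


class InMemoryStorage:
    """Toy backend with the three methods a memory store is assumed to need."""

    def __init__(self) -> None:
        self._rows: list[dict[str, Any]] = []

    def save(self, value: str, metadata: dict[str, Any]) -> None:
        """Store one memory item with its metadata."""
        self._rows.append({"context": value, "metadata": metadata})

    def search(self, query: str, limit: int = 3, score_threshold: float = 0.0) -> list[dict[str, Any]]:
        """Return up to `limit` rows scoring above the threshold, best first."""
        query_words = set(query.lower().split())
        scored = []
        for row in self._rows:
            words = set(row["context"].lower().split())
            union = query_words | words
            # Jaccard overlap is a placeholder for cosine similarity on embeddings.
            score = len(query_words & words) / len(union) if union else 0.0
            if score > score_threshold:
                scored.append({**row, "score": score})
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:limit]

    def reset(self) -> None:
        """Drop everything, e.g. between test runs."""
        self._rows.clear()


store = InMemoryStorage()
store.save("postgres streaming io", {"task": "research"})
store.save("writing style guide", {"task": "writing"})
hits = store.search("postgres io")
print(hits)
```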
Observability hook
CrewAI exposes event callbacks through an event bus. I wire mine into structured logging for Datadog or Phoenix. The import path for the event bus has moved between releases, so confirm it against the version you’ve pinned.
from crewai.events import crewai_event_bus, TaskCompletedEvent, AgentExecutionCompletedEvent
import logging

log = logging.getLogger("crewai")


@crewai_event_bus.on(TaskCompletedEvent)
def on_task_complete(source, event: TaskCompletedEvent):
    log.info("task.complete", extra={
        "task": event.task.description[:80],
        "agent": event.task.agent.role,
        "duration_s": getattr(event, "duration", None),
    })


@crewai_event_bus.on(AgentExecutionCompletedEvent)
def on_agent_step(source, event):
    log.info("agent.step", extra={"agent": event.agent.role})
I cover wiring Phoenix and LangSmith in detail in observability for multi-agent systems.
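One detail that trips people up: fields passed through extra only reach your log pipeline if the formatter emits them. A minimal sketch using only the standard library follows; JsonFormatter is my own hypothetical class, not something CrewAI or Datadog ships.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including fields passed via `extra`."""

    # Attributes every LogRecord carries; anything else on a record came from `extra`.
    _STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in self._STANDARD and key not in ("message", "asctime"):
                payload[key] = value
        # default=str keeps non-serializable values from crashing the handler.
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("crewai")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("task.complete", extra={"task": "Research the topic", "agent": "researcher"})
```

With this in place, the task and agent fields from the callbacks above arrive as top-level JSON keys that a log backend can index.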
7. Flows for graph-style control
CrewAI 0.86 ships Flow, a declarative state machine that sits alongside Crews. Use it when you want graph-style branching without giving up the role-based agent abstractions.
from crewai.flow.flow import Flow, listen, start, router
from pydantic import BaseModel

from research_crew.crew import ResearchCrew


class ContentState(BaseModel):
    topic: str = ""
    word_count: str = "800"
    draft: str = ""
    approved: bool = False
    revisions: int = 0


class ContentFlow(Flow[ContentState]):
    # Don't name the start method "kickoff": that would shadow Flow.kickoff(),
    # the method that actually runs the flow.
    @start()
    def run_research(self):
        # The YAML templates need both {topic} and {word_count}.
        inputs = {"topic": self.state.topic, "word_count": self.state.word_count}
        return ResearchCrew().crew().kickoff(inputs=inputs)

    @listen(run_research)
    def review(self, draft_output):
        self.state.draft = str(draft_output)
        # call a reviewer crew or a single LLM here and set self.state.approved

    @router(review)
    def decide(self):
        return "publish" if self.state.approved else "revise"

    @listen("revise")
    def revise(self):
        self.state.revisions += 1
        if self.state.revisions >= 3:
            return  # circuit breaker
        # rework the draft here

    @listen("publish")
    def publish(self):
        # ship it
        pass


flow = ContentFlow()
flow.state.topic = "Postgres 17"
flow.kickoff()
Flows handle persistence too. With @persist() on methods, the state is checkpointed between steps, so a crash mid-flow can resume. It doesn’t have feature parity with LangGraph’s checkpointer, but for CrewAI workloads it covers the durability story without leaving the framework.
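If checkpoint-and-resume is new to you, the idea fits in a few lines of plain Python. This is a sketch of the general technique, not of how CrewAI implements persistence: run_with_checkpoints, the JSON file, and the _done key are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable

Step = Callable[[dict], None]


def run_with_checkpoints(steps: list[tuple[str, Step]], state: dict, path: Path) -> dict:
    """Run steps in order, saving state after each one; skip steps already done."""
    if path.exists():
        # A checkpoint exists, so resume from it instead of the fresh state.
        state = json.loads(path.read_text())
    done = state.setdefault("_done", [])
    for name, step in steps:
        if name in done:
            continue  # completed before the crash
        step(state)
        done.append(name)
        path.write_text(json.dumps(state))
    return state
```

A step that raises leaves the last good checkpoint on disk, so the next call with the same path picks up at the failed step rather than repeating the expensive research run.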
Common Pitfalls
The ones I’ve actually paid for.
- Leaving allow_delegation=True on every agent. You’ll get agents handing tasks to each other in surprising patterns and your token bill triples. Set it to False unless you’ve designed the system around delegation.
- Using verbose=True in production. It prints full prompts and outputs, which is great for dev and terrible for a Kubernetes log volume. Switch to verbose=False and rely on event callbacks for structured telemetry.
- Forgetting expected_output on tasks. Without it, the LLM doesn’t know what shape the result should take, and downstream tasks get junk context. Be explicit even if it feels redundant.
- Mixing async and sync tools. CrewAI 0.86’s tool runtime handles both, but if you have an async tool that wraps a sync API badly, you’ll see deadlocks under load. Keep tools sync unless you genuinely need async I/O.
Troubleshooting
Three failures with concrete fixes.
Crew runs forever. Almost always max_iter is too high on a researcher agent and it keeps searching. Drop max_iter to 5 to 8, and add a tool-use guard in the agent’s backstory like “After 3 searches, write your conclusions even if you have more to investigate.”
Token usage climbs each iteration. Memory is pulling in too much context. Inspect what’s being injected by setting CREWAI_LOG_LEVEL=DEBUG and check the prompts. The usual fix is to scope memory tighter: use a per-topic memory key rather than a global one.
Tool calls fail with validation errors. Your Pydantic args_schema is too strict or the field descriptions are unclear. LLMs hallucinate arguments when descriptions are vague. Rewrite the descriptions as if you’re writing API docs, and use Optional only when truly optional.
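For the first failure, a prompt-level guard works most of the time, but I like a hard stop in code as well. The sketch below is a swapped-in technique of mine, not a CrewAI feature: with_call_budget is a hypothetical wrapper you would apply to the function a tool calls internally.

```python
from typing import Callable


def with_call_budget(tool: Callable[[str], str], max_calls: int) -> Callable[[str], str]:
    """Wrap a tool so calls beyond the budget return a stop message instead of running."""
    calls = 0

    def guarded(query: str) -> str:
        nonlocal calls
        calls += 1
        if calls > max_calls:
            # Returned as a normal tool result so the agent reads it and wraps up.
            return "Search budget exhausted. Write your conclusions from what you have."
        return tool(query)

    return guarded


search = with_call_budget(lambda q: f"hits for {q}", max_calls=3)
outputs = [search("postgres 17") for _ in range(4)]
print(outputs[-1])
```

The counter lives in the wrapper, so build a fresh wrapper per crew run; otherwise the budget carries over between runs.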
Wrapping Up
CrewAI 0.86 hits a sweet spot for teams that want multi-agent without the cognitive overhead of a graph DSL. The YAML-first config keeps prompts versionable, the sequential process is the right default, and the new event bus gives you a hook for observability without monkey-patching.
I’d use CrewAI for content generation pipelines, research crews, and structured report writing. I’d reach for LangGraph or AutoGen 0.4 when I need fine-grained control over the message flow or when human-in-the-loop is more than a single approval step.
The CrewAI changelog is worth following: the 0.80 to 0.86 line tightened a lot of rough edges, and there’s more coming. Pin your version in production; the API still moves fast enough that minor releases occasionally break tool signatures.