Observability for Multi-Agent Systems: LangSmith and Phoenix in 2025
TL;DR — Multi-agent observability needs traces, evals, and cost tracking. LangSmith is the easy hosted answer, Phoenix is the open-source one. Wire both via OpenTelemetry so you can swap or run them side-by-side.
The first multi-agent system I shipped without proper tracing was a research crew that started producing garbage answers on day twelve. Took me four hours to figure out one of the agents was hitting a stale tool, because I was reading log lines one at a time. The second system had LangSmith from day one, and a similar bug took eight minutes to find. Observability isn’t optional once you have more than one agent.
The vendor landscape settled in late 2024. LangSmith is the LangChain ecosystem’s hosted tracing product, Phoenix from Arize is the open-source alternative, and both work with OpenTelemetry and its semantic conventions for LLM spans. You don’t have to pick exclusively. I usually wire both: LangSmith for the team’s day-to-day and Phoenix for self-hosted compliance environments.
This post walks through wiring LangSmith and Phoenix into a LangGraph multi-agent system, what to trace, what to evaluate, and what to alert on. Code targets Python 3.12, langsmith==0.3.1, arize-phoenix==7.9.0, openinference-instrumentation-langchain==0.1.30, and langgraph==0.2.74.
What you actually need to observe
Three signals, in order of how often they save you.
Traces. A trace is the full call tree of a single invocation, every LLM call, tool call, and node transition. This is what tells you where time and tokens went, and where logic took a wrong turn.
Evals. Periodic or per-run evaluations of output quality. Did the answer actually answer the question? Did the agent stay on task? Without evals, you don’t know if a code change made things worse until users complain.
Cost and latency. Per-run token counts, dollar estimates, and p50/p95/p99 latency. The kind of metric you want on a dashboard, not a trace explorer.
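For the latency numbers, here is a minimal sketch of how p50/p95/p99 could be computed from raw per-run durations with only the standard library. The helper name `latency_percentiles` and the sample data are made up for illustration; neither tool exposes this function.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 for a list of per-run latencies in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns the 99 cut points p1..p99, so index i-1 is p_i.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Made-up sample: one run at each latency from 1 ms to 100 ms.
sample = [float(ms) for ms in range(1, 101)]
stats = latency_percentiles(sample)
```

In practice you would let the metrics backend compute these from a histogram. The sketch only shows what the numbers on the dashboard mean.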
+-----------------------------+
|       Multi-agent app       |
+--+--------+--------+--------+
   |        |        |
   v        v        v
+-----+  +-----+  +-----+
| LLM |  | LLM |  | tool|
+-----+  +-----+  +-----+
     \      |      /
      \     |     /
       +---trace--+
            |
       +----v-----+  +----------+
       | LangSmith|  | Phoenix  |
       +----------+  +----------+
That’s the shape. One trace per request, fanning out to model and tool spans, exported to one or more backends.
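To make the "call tree" idea concrete, here is a toy model of a trace as nested spans, with two of the questions you typically ask of it: where did the tokens go, and which leaf was slowest. The `Span` class and its field names are my own simplification for illustration, not the schema LangSmith or Phoenix actually store.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Span:
    name: str
    kind: str  # e.g. "chain", "llm", or "tool"
    duration_ms: float
    tokens: int = 0
    children: list["Span"] = field(default_factory=list)


def total_tokens(span: Span) -> int:
    """Sum token usage over a span and all of its descendants."""
    return span.tokens + sum(total_tokens(child) for child in span.children)


def leaves(span: Span) -> Iterator[Span]:
    """Yield the spans that have no children (the actual LLM and tool calls)."""
    if not span.children:
        yield span
    for child in span.children:
        yield from leaves(child)


# Made-up trace: one request fanning out to two LLM calls and one tool call.
trace = Span("research-run", "chain", 4200.0, children=[
    Span("planner", "llm", 900.0, tokens=1200),
    Span("researcher", "chain", 3100.0, children=[
        Span("web_search", "tool", 1800.0),
        Span("summarize", "llm", 1250.0, tokens=2600),
    ]),
])

slowest = max(leaves(trace), key=lambda s: s.duration_ms)
```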
1. LangSmith, the LangChain-native path
LangSmith is the path of least resistance if you’re on LangGraph or any LangChain stack. Three env vars and you have tracing.
export LANGSMITH_API_KEY="lsv2_..."
export LANGSMITH_PROJECT="research-crew-prod"
export LANGSMITH_TRACING=true
That’s it. Every LangChain and LangGraph invocation gets traced automatically. No code changes. Each node in a LangGraph appears as a span, each LLM call inside the node as a child span with prompt, response, token counts, and latency.
For custom code outside of LangChain, decorate with @traceable.
from langsmith import traceable

@traceable(run_type="tool", name="github_lookup")
async def github_lookup(username: str) -> dict:
    # your tool implementation
    return {"login": username, "followers": 0}
The run_type controls how the span shows up in the LangSmith UI. tool, llm, chain, retriever, and parser are the main types. Use them, the filtering in the UI is type-aware.
Adding metadata for filtering
LangSmith’s value compounds when you tag runs. Add a user ID, environment, and version to every run, and you can filter to “errors from prod for user X on version 1.4.2” in a few clicks.
from langgraph.graph import StateGraph

# builder is your StateGraph instance and saver your checkpointer, both defined elsewhere
graph = builder.compile(checkpointer=saver)

config = {
    "configurable": {"thread_id": "t1"},
    "metadata": {
        "user_id": "u_8472",
        "env": "prod",
        "version": "1.4.2",
        "trace_id": "trace_abc123",
    },
    "tags": ["research-crew", "prod"],
    "run_name": "research-run-for-postgres-17",
}

result = graph.invoke({"topic": "postgres 17"}, config=config)
Treat metadata as structured fields you’ll query. I add feature_flag keys for any A/B experiments, so I can compare trace samples between variants without re-running anything.
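One way to keep the metadata consistent is to build the config in a single place. This is a minimal sketch; `build_run_config` and the `feature_flag.<name>` key convention are my own assumptions, so pick whatever naming your team will actually query on.

```python
from typing import Optional


def build_run_config(
    thread_id: str,
    user_id: str,
    env: str,
    version: str,
    feature_flags: Optional[dict[str, str]] = None,
) -> dict:
    """Build a run config whose metadata always carries the same keys."""
    metadata = {"user_id": user_id, "env": env, "version": version}
    # Hypothetical convention: one metadata key per experiment, prefixed so
    # that all A/B variants are easy to find when filtering.
    for flag, variant in (feature_flags or {}).items():
        metadata[f"feature_flag.{flag}"] = variant
    return {
        "configurable": {"thread_id": thread_id},
        "metadata": metadata,
        "tags": [env, f"v{version}"],
    }


config = build_run_config(
    "t1", "u_8472", "prod", "1.4.2", feature_flags={"new_planner": "variant_b"}
)
```

The returned dict has the same shape as the config passed to `graph.invoke` above, so it can be dropped in directly.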
2. Phoenix, the self-hosted alternative
Phoenix is open-source and easy to self-host. Same trace concept, OpenTelemetry semantic conventions, and a UI that’s better than you’d expect from an OSS project.
pip install "arize-phoenix==7.9.0" "openinference-instrumentation-langchain==0.1.30"
You can run Phoenix as a hosted notebook backend, but for real use I run it as a container.
docker run -d --name phoenix \
-p 6006:6006 -p 4317:4317 \
-e PHOENIX_WORKING_DIR=/data \
-v phoenix-data:/data \
arizephoenix/phoenix:version-7.9.0
UI at http://localhost:6006, OTel gRPC ingest at localhost:4317. Point your app at it via OpenTelemetry.
# tracing.py
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register(
    project_name="research-crew-prod",
    endpoint="http://localhost:4317",
    auto_instrument=True,
)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
The LangChainInstrumentor hooks into LangChain’s callback system and emits OTel spans on every node and LLM call. Combined with auto-instrumentation, you get a tree per invocation in the Phoenix UI without further code changes.
Running both backends
You can send traces to LangSmith and Phoenix simultaneously. LangChain’s tracing is callback-based and supports multiple handlers, while OTel will export to whatever endpoint you configure. The configs don’t conflict.
# Both backends wired: LangSmith env vars set, LangChainInstrumentor registered
# LangSmith picks up via LANGSMITH_TRACING=true
# Phoenix picks up via the OTel instrumentor
I do this in production for two reasons. LangSmith is faster to navigate for incident debugging. Phoenix keeps a self-hosted record I control for compliance retention.
3. Evals, the part everyone skips
Tracing tells you what happened. Evals tell you whether it was any good. Both LangSmith and Phoenix support running evaluators against trace data.
The pattern I use: build a dataset of test inputs drawn from your prod system, run them through your agent, and have an LLM-as-judge score the outputs against rubrics.
# evals.py
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate
from langchain_openai import ChatOpenAI

# REGRESSION_TOPICS, REFERENCE_ANSWERS, and graph are defined elsewhere
ls = Client()
dataset = ls.create_dataset(
    "research-crew-regression",
    description="Regression set of 50 research queries",
)
ls.create_examples(
    inputs=[{"topic": t} for t in REGRESSION_TOPICS],
    outputs=[{"reference": r} for r in REFERENCE_ANSWERS],
    dataset_id=dataset.id,
)

def predict(inputs: dict) -> dict:
    result = graph.invoke({"topic": inputs["topic"], ...}, config={...})
    return {"output": result["draft"]}

# LLM-as-judge evaluator that scores each prediction against the reference
correctness = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": "Are the factual claims in 'prediction' supported by 'reference'?",
            "completeness": "Does 'prediction' cover the main points of 'reference'?",
        },
        "llm": ChatOpenAI(model="gpt-4o", temperature=0),
    },
)

results = evaluate(
    predict,
    data="research-crew-regression",
    evaluators=[correctness],
    experiment_prefix="v1.4.2",
)
Run this on every release. Compare to the previous baseline. Flag regressions in CI.
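The "flag regressions in CI" step can be as small as a comparison between two lists of scores. The sketch below assumes you have already exported per-example scores (0 to 1) for the baseline and the candidate release; it does not call the LangSmith API, and `regression_gate` is a hypothetical helper. The 0.05 default mirrors the 5-percentage-point threshold I use for alerting.

```python
from statistics import mean


def regression_gate(
    baseline_scores: list[float],
    candidate_scores: list[float],
    max_drop: float = 0.05,
) -> dict:
    """Compare mean eval scores and report whether the candidate may ship."""
    baseline = mean(baseline_scores)
    candidate = mean(candidate_scores)
    drop = baseline - candidate  # positive means the candidate got worse
    return {
        "baseline": baseline,
        "candidate": candidate,
        "drop": drop,
        "passed": drop <= max_drop,
    }


# Made-up scores for four regression examples.
report = regression_gate([1.0, 1.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0])
```

In a CI job you would exit non-zero when `report["passed"]` is false, for example with `sys.exit(0 if report["passed"] else 1)`.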
Phoenix has its own evaluation framework with similar shape.
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

model = OpenAIModel(model="gpt-4o")
qa_eval = QAEvaluator(model)
hallucination_eval = HallucinationEvaluator(model)

# spans_dataframe pulled from Phoenix; these evaluators expect
# "input", "output", and "reference" columns
qa_results, hallucination_results = run_evals(
    dataframe=spans_dataframe,
    evaluators=[qa_eval, hallucination_eval],
    provide_explanation=True,
)
Both work. LangSmith’s evaluation UX is more polished, Phoenix’s is more flexible.
4. Cost and latency dashboards
The metrics worth dashboarding.
Per-run cost. Token counts multiplied by your model pricing. Both LangSmith and Phoenix compute this if you’ve configured pricing. I dashboard p50, p95, p99 cost per run, and total daily cost.
Per-node latency. The slowest node in a graph is usually the bottleneck. Phoenix’s flame graph view makes this obvious.
Error rate per node. A node that fails 5% of the time is probably the next bug.
# Custom metric emission via OpenTelemetry
from opentelemetry import metrics

meter = metrics.get_meter("research-crew")
run_cost = meter.create_histogram(name="agent.run.cost_usd", unit="USD")
run_tokens = meter.create_histogram(name="agent.run.tokens", unit="tokens")

def record_run(state, result):
    # estimate_cost and count_tokens are your own helpers, defined elsewhere
    run_cost.record(estimate_cost(result), {"workflow": "research", "model": "gpt-4o"})
    run_tokens.record(count_tokens(result), {"workflow": "research"})
Pipe OTel metrics to whatever you already use, Datadog, Prometheus, Cloud Monitoring.
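The `estimate_cost` call in the snippet above is left undefined. A rough sketch of what such a helper might look like follows, written against explicit token counts rather than a result object. The price table is a placeholder I made up; the numbers are not real vendor pricing, so fill in your provider's current rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_1m: float   # USD per 1M input tokens
    output_per_1m: float  # USD per 1M output tokens


# Placeholder prices for illustration only, not real vendor pricing.
PRICES = {"example-model": ModelPrice(input_per_1m=2.50, output_per_1m=10.00)}


def estimate_cost_usd(
    model: str,
    input_tokens: int,
    output_tokens: int,
    prices: dict[str, ModelPrice] = PRICES,
) -> float:
    """Token counts multiplied by per-token prices, in USD."""
    price = prices[model]  # raises KeyError for models missing from the table
    total = input_tokens * price.input_per_1m + output_tokens * price.output_per_1m
    return total / 1_000_000


cost = estimate_cost_usd("example-model", input_tokens=2_000, output_tokens=500)
```

Failing loudly on an unknown model is deliberate. A silent zero is how cost dashboards end up under-reporting after a model upgrade.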
5. Alerting, the short list
Things you want to know about within minutes.
Error rate spike. Per-node 5xx rate over 2% for 10 minutes. Usually means a tool is down or a model is rate-limiting.
Cost spike. Daily cost more than 2x the 7-day average. Catches retry storms, runaway loops, prompt regressions.
Latency regression. p95 latency more than 1.5x the 24-hour baseline. Usually a slower model variant or a stuck tool.
Eval score drop. Regression set accuracy drops more than 5 percentage points compared to the previous release. Pre-release blocker.
Don’t alert on individual run failures. The signal is in rates, not events. Alerting on every failed run will train you to ignore alerts.
I covered the retry and circuit breaker patterns these signals interact with in long-running agent workflows.
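The first three rules above reduce to simple comparisons once the inputs are aggregated. This is a sketch only: it assumes your metrics backend has already rolled up counts and percentiles over the relevant window, and the function names are hypothetical. In a real setup these would be alert rules in Datadog or Prometheus rather than Python.

```python
from statistics import mean


def error_rate_spike(errors: int, total: int, threshold: float = 0.02) -> bool:
    """True when the error rate over the window exceeds the threshold."""
    return total > 0 and errors / total > threshold


def cost_spike(today_usd: float, last_7_days_usd: list[float], factor: float = 2.0) -> bool:
    """True when today's cost exceeds `factor` times the 7-day average."""
    return today_usd > factor * mean(last_7_days_usd)


def latency_regression(p95_now_ms: float, p95_baseline_ms: float, factor: float = 1.5) -> bool:
    """True when current p95 exceeds `factor` times the baseline p95."""
    return p95_now_ms > factor * p95_baseline_ms


# Made-up numbers: 30 failures out of 1000 node runs in the window,
# $95 spent today against a $40/day weekly average, p95 of 5.2s vs 4.0s.
alerts = {
    "error_rate": error_rate_spike(errors=30, total=1000),
    "cost": cost_spike(95.0, [40.0] * 7),
    "latency": latency_regression(5200.0, 4000.0),
}
```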
6. Correlating traces across services
Multi-agent systems often span services. A LangGraph orchestrator calling out to an MCP server, hitting a vector DB, talking to a tool service. Each service traces locally, but you need them stitched into one view for any real debugging.
OpenTelemetry’s W3C Trace Context propagation handles this if you wire it. The trace ID and parent span ID ride in HTTP headers (traceparent, tracestate), and downstream services pick them up.
from opentelemetry.propagate import inject
import httpx

async def call_downstream(url: str, payload: dict):
    headers = {}
    inject(headers)  # injects traceparent from current span
    async with httpx.AsyncClient() as client:
        return await client.post(url, json=payload, headers=headers)
On the receiving service, install the FastAPI instrumentation and it’ll auto-extract.
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# api is your FastAPI() application instance
FastAPIInstrumentor.instrument_app(api)
In Phoenix or any OTel viewer, you’ll see one trace spanning both services with the right parent-child relationships. Without this, every service shows its own truncated trace and you’re stitching them by hand.
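When the stitching doesn't work, it helps to know what the traceparent header actually contains so you can eyeball it in request logs. In version 00 of W3C Trace Context it is four hyphen-separated hex fields: version, a 32-character trace ID, a 16-character parent span ID, and 2-character flags. The parser below is a debugging sketch for that version only. The OTel SDK does this for you, and `parse_traceparent` is my own helper name.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
_TRACEPARENT_V00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header, or return None if it is invalid."""
    match = _TRACEPARENT_V00.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


# Example header value taken from the W3C Trace Context specification.
parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

If two services log different `trace_id` values for the same request, the header is being dropped somewhere in between, usually by a proxy or a hand-rolled HTTP client.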
7. Annotations from human reviewers
Both LangSmith and Phoenix let humans label trace data, “this response was good”, “this tool call was wrong”. These labels become training signal for future evals and a way to track quality drift over time.
# LangSmith
from langsmith import Client

ls = Client()
ls.create_feedback(
    run_id="run_abc",
    key="user_thumbs",
    score=1,  # 1 for thumbs up, 0 for down
    comment="answer was concise and accurate",
)
Wire your product’s feedback widgets to this API. Over a month you’ll have a dataset that highlights where your agent is consistently weak, which is the actual roadmap for what to improve next.
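Once feedback accumulates, finding the weak spots is a group-by. The sketch below assumes you have exported feedback as plain dicts with a `category` field (say, a query type you stored in run metadata) and a 0/1 `score`. Both field names and the `weakest_categories` helper are assumptions for illustration, not a LangSmith export format.

```python
from collections import defaultdict


def weakest_categories(feedback: list[dict], min_votes: int = 5) -> list[tuple[str, float]]:
    """Return (category, thumbs-up rate) pairs, worst first.

    Categories with fewer than `min_votes` votes are skipped, since a rate
    computed from a handful of votes is mostly noise.
    """
    buckets: dict[str, list[int]] = defaultdict(list)
    for item in feedback:
        buckets[item["category"]].append(item["score"])
    rates = {
        category: sum(scores) / len(scores)
        for category, scores in buckets.items()
        if len(scores) >= min_votes
    }
    return sorted(rates.items(), key=lambda pair: pair[1])


# Made-up feedback: pricing questions do badly, how-to questions do well.
sample_feedback = (
    [{"category": "pricing", "score": s} for s in (0, 0, 1, 0, 0)]
    + [{"category": "how-to", "score": s} for s in (1, 1, 1, 0, 1)]
    + [{"category": "rare", "score": 0}]
)
ranking = weakest_categories(sample_feedback)
```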
Common Pitfalls
The ones I’ve hit on real deploys.
- Logging full prompts at INFO level. PII, secrets, and credentials end up in your trace store. Mask sensitive fields before they hit the wire. LangSmith has a LANGSMITH_HIDE_INPUTS env var that hides run inputs; use it when there are fields you can’t sanitize.
- Sampling all traces. Traces are cheap but not free. Sample at the entrypoint if you have high volume, keep 100% of errored runs and 5 to 10% of successful ones. Both LangSmith and Phoenix support head-based sampling via the OTel SDK. Note that a head-based sampler decides before the run starts, so keeping every errored run means making the keep-or-drop decision after the run finishes.
- No retention policy. Trace storage grows fast. Set retention at 30 to 90 days for normal runs, longer only for traces you’ve explicitly bookmarked. Phoenix’s default is forever, which is fine until your disk fills up at 3am.
- Treating evals as one-time work. A regression set built in Q1 is stale by Q3. Refresh the dataset quarterly with real prod queries (sanitized), and version your evals so you can compare across changes.
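For the sampling pitfall, the keep-or-drop rule can be sketched as a small deterministic function. It runs after the run finishes, because that is the only point where you know whether it errored. `should_keep` is a hypothetical helper, not an API from either tool. Hashing the run ID instead of calling `random` means every service makes the same decision for the same run.

```python
import hashlib


def should_keep(run_id: str, errored: bool, success_rate: float = 0.10) -> bool:
    """Keep every errored run and a fixed fraction of successful ones."""
    if errored:
        return True
    # Map the run ID to a stable 64-bit integer and compare it against the
    # sampling threshold, so the same ID always gets the same decision.
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < success_rate * 2**64


# Made-up run IDs: the kept fraction should land near success_rate.
kept = sum(should_keep(f"run-{i}", errored=False) for i in range(10_000))
kept_fraction = kept / 10_000
```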
Troubleshooting
Three failure modes you’ll see.
Traces stop appearing in LangSmith mid-run. Network blip, rate limit, or auth issue. Check LANGSMITH_API_URL connectivity from your pod, and look for 429s in the client’s debug logs. Tracing is async by default, so it doesn’t break your app, but it does silently drop spans.
Phoenix UI shows no spans even though the app runs. The OTel exporter is misconfigured. Verify that the endpoint points at port 4317 (gRPC) and that the container is reachable. A common gotcha: the http:// prefix is required for the gRPC endpoint in phoenix.otel.register.
Cost numbers in LangSmith look wildly off. LangSmith uses a built-in model price table that lags. For new models like claude-3-7-sonnet-20250219, you may need to set prices manually via the model pricing config. Phoenix has the same issue and the same fix.
Wrapping Up
Multi-agent observability is solvable in 2025 in a way it wasn’t even a year ago. LangSmith for the team that wants minimal setup, Phoenix for the team that wants control, OpenTelemetry as the wire format so you can swap. Wire all three at the start of a project, not after the first incident.
The patterns that hold up, trace everything, tag with structured metadata, evaluate against a regression set on every release, dashboard cost and latency, alert on rates not events. None of it is novel relative to general observability. The difference with multi-agent systems is that the failure modes are weirder and the costs are real per call.
The OpenInference spec defines the LLM semantic conventions both LangSmith and Phoenix follow. Worth a skim, it’s the contract that makes the ecosystem interoperable.
What’s next, building out the eval datasets is the work that pays off most over time. Start with 20 representative queries, expand quarterly, and you’ll have a regression suite that catches real problems before users do.