Instrumenting LLM Calls with OpenTelemetry Traces
TL;DR — Wrap every LLM call in a span with consistent attributes / export over OTLP so latency, retries, and failures are queryable / propagate context across async boundaries or your traces fragment.
The first time an LLM feature of mine went sideways in production, I had nothing. A user reported “the assistant is slow,” and all I could see was an aggregate p99 on the HTTP handler. Was it the model? A retry storm? A slow embedding lookup before the call? The handler span told me nothing because the interesting work happened three await hops deep inside a vendor SDK that emitted zero telemetry.
LLM calls are uniquely hostile to debugging. They’re slow enough that tail latency matters, they fail in ways that look like success (a 200 with a truncated body), and they’re expensive enough that you want per-request accounting. A single logical request often fans out into an embedding call, a retrieval step, one or more completions, and a re-rank. Without distributed tracing you’re reconstructing that fan-out from log timestamps, which is miserable.
This post is the instrumentation layer I wish I’d built on day one. We’ll use OpenTelemetry LLM tracing with the Python SDK 1.30, export over OTLP, and produce spans that actually answer questions during an incident. No vendor lock-in, no magic auto-instrumentation you can’t reason about.
Why a Hand-Rolled Span Beats Auto-Instrumentation
There are auto-instrumentation packages for the popular LLM SDKs. They’re fine for a demo. In production I want to control the span boundary, the attribute names, and what gets recorded as an error. Auto-instrumentation tends to either capture too much (full prompt bodies, blowing up your span size and leaking PII) or too little (no token counts, no model version).
The unit of work I care about is “one model invocation.” That span should start before the network call and end after the response is fully consumed — including the time spent draining a streaming response, because that’s real latency the user feels.
Bootstrapping the Tracer
Install the SDK and the OTLP exporter. Pin versions — the OpenTelemetry API and SDK move fast and a minor mismatch produces confusing import errors.
# pyproject.toml — dependencies section
[project]
dependencies = [
"opentelemetry-api==1.30.0",
"opentelemetry-sdk==1.30.0",
"opentelemetry-exporter-otlp-proto-grpc==1.30.0",
"openai==1.66.0",
]
The provider setup belongs in a single module imported once at startup. Configure the resource so every span carries service identity — without service.name your backend buckets everything under “unknown_service”.
# telemetry.py
import os
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
OTLPSpanExporter,
)
_INITIALIZED = False
def init_tracing() -> trace.Tracer:
global _INITIALIZED
if not _INITIALIZED:
resource = Resource.create(
{
"service.name": os.getenv("OTEL_SERVICE_NAME", "llm-gateway"),
"service.version": os.getenv("GIT_SHA", "dev"),
"deployment.environment": os.getenv("ENV", "local"),
}
)
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(
endpoint=os.getenv(
"OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"
),
timeout=5,
)
# Batch processor: never block the request path on export.
provider.add_span_processor(
BatchSpanProcessor(
exporter,
max_queue_size=2048,
max_export_batch_size=512,
schedule_delay_millis=2000,
)
)
trace.set_tracer_provider(provider)
_INITIALIZED = True
return trace.get_tracer("llm.client", "1.0.0")
BatchSpanProcessor is non-negotiable in production. The SimpleSpanProcessor exports synchronously on span end, which adds the exporter’s network round trip to every request. The batch processor decouples export from your hot path; the trade-off is that spans from a process that crashes hard can be lost. That’s an acceptable trade for a request path.
Wrapping the LLM Call
Here’s the core. The span name is stable (chat.completion) so it aggregates cleanly; the variable detail goes into attributes.
# llm_client.py
import time
from openai import OpenAI, APIError, APITimeoutError, RateLimitError
from opentelemetry.trace import Status, StatusCode, SpanKind
from telemetry import init_tracing
tracer = init_tracing()
client = OpenAI(timeout=30.0)
MODEL = "gpt-4o-2024-11-20"
def chat_completion(messages: list[dict], *, max_tokens: int = 512) -> str:
with tracer.start_as_current_span(
"chat.completion",
kind=SpanKind.CLIENT,
) as span:
# Stable, low-cardinality attributes describing the request.
span.set_attribute("gen_ai.system", "openai")
span.set_attribute("gen_ai.request.model", MODEL)
span.set_attribute("gen_ai.request.max_tokens", max_tokens)
span.set_attribute("gen_ai.operation.name", "chat")
span.set_attribute("llm.message.count", len(messages))
started = time.monotonic()
try:
resp = client.chat.completions.create(
model=MODEL,
messages=messages,
max_tokens=max_tokens,
)
except RateLimitError as exc:
span.set_attribute("error.type", "rate_limit")
span.record_exception(exc)
span.set_status(Status(StatusCode.ERROR, "rate limited"))
raise
except APITimeoutError as exc:
span.set_attribute("error.type", "timeout")
span.record_exception(exc)
span.set_status(Status(StatusCode.ERROR, "upstream timeout"))
raise
except APIError as exc:
span.set_attribute("error.type", "api_error")
span.set_attribute("http.response.status_code", exc.status_code or 0)
span.record_exception(exc)
span.set_status(Status(StatusCode.ERROR, str(exc)))
raise
# Response attributes — these are what you query during an incident.
usage = resp.usage
span.set_attribute("gen_ai.response.model", resp.model)
span.set_attribute("gen_ai.response.id", resp.id)
if usage is not None:
span.set_attribute(
"gen_ai.usage.input_tokens", usage.prompt_tokens
)
span.set_attribute(
"gen_ai.usage.output_tokens", usage.completion_tokens
)
choice = resp.choices[0]
span.set_attribute(
"gen_ai.response.finish_reason", choice.finish_reason or "unknown"
)
span.set_attribute(
"llm.duration_ms", round((time.monotonic() - started) * 1000, 1)
)
# finish_reason == "length" means the model was cut off. That's a
# silent quality failure — flag it without marking the span as error.
if choice.finish_reason == "length":
span.add_event("response_truncated")
return choice.message.content or ""
Two design choices worth calling out. First, I distinguish error types with an error.type attribute rather than a single boolean. A rate limit and a timeout demand different responses — one is a backoff problem, the other might be the model genuinely struggling. Second, a truncated response is an add_event, not an error status. The call technically succeeded; the quality is degraded. Mixing those two signals makes your error rate dashboards lie.
I deliberately don’t record full prompt or completion bodies as span attributes. They’re high-cardinality, often large, and frequently contain PII. If you need prompt capture for evals, route it to a separate sink with its own retention and access controls. Span attributes for LLM payloads are covered in detail by the OpenTelemetry GenAI semantic conventions — worth reading before you invent your own attribute names.
Streaming Responses
Streaming breaks the naive instrumentation because the call returns immediately and latency hides in iteration. You want two numbers: time-to-first-token and total stream duration.
# streaming.py
import time
from openai import OpenAI
from opentelemetry.trace import SpanKind
from telemetry import init_tracing
tracer = init_tracing()
client = OpenAI(timeout=30.0)
def stream_completion(messages: list[dict]) -> str:
with tracer.start_as_current_span(
"chat.completion.stream", kind=SpanKind.CLIENT
) as span:
span.set_attribute("gen_ai.system", "openai")
span.set_attribute("gen_ai.request.model", "gpt-4o-2024-11-20")
started = time.monotonic()
first_token_at: float | None = None
chunks: list[str] = []
stream = client.chat.completions.create(
model="gpt-4o-2024-11-20",
messages=messages,
stream=True,
stream_options={"include_usage": True},
)
for event in stream:
if not event.choices:
# Final usage-only chunk arrives with empty choices.
if event.usage:
span.set_attribute(
"gen_ai.usage.output_tokens",
event.usage.completion_tokens,
)
continue
delta = event.choices[0].delta.content
if delta:
if first_token_at is None:
first_token_at = time.monotonic()
span.set_attribute(
"gen_ai.server.time_to_first_token_ms",
round((first_token_at - started) * 1000, 1),
)
chunks.append(delta)
span.set_attribute(
"llm.stream.total_ms",
round((time.monotonic() - started) * 1000, 1),
)
span.set_attribute("llm.stream.chunk_count", len(chunks))
return "".join(chunks)
stream_options={"include_usage": True} is the only reliable way to get token counts from a streamed response — without it, usage is None and you lose cost accounting. Time-to-first-token is the metric your users actually perceive; total duration matters for capacity planning.
Context Propagation Across Async Boundaries
The most common way LLM traces fragment: an async worker picks up a job and starts a fresh trace because the context didn’t travel with the message. OpenTelemetry context is thread-local and task-local — it does not cross a queue boundary on its own. You inject it on the producer side and extract it on the consumer side.
# propagation.py
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
tracer = trace.get_tracer("llm.worker", "1.0.0")
def enqueue_job(payload: dict) -> dict:
"""Producer: serialize the active context into the message headers."""
carrier: dict[str, str] = {}
inject(carrier) # writes traceparent + tracestate
return {"body": payload, "trace_headers": carrier}
def process_job(message: dict) -> None:
"""Consumer: rebuild the parent context, then start a child span."""
parent_ctx = extract(message.get("trace_headers", {}))
with tracer.start_as_current_span(
"llm.job.process", context=parent_ctx
) as span:
span.set_attribute("job.size", len(message["body"]))
# ... LLM work here lands in the same trace as the producer.
If you skip extract, the worker span has no parent and shows up as an orphan trace. You’ll see two unrelated traces for one logical request and spend an hour wondering why latency doesn’t add up.
Common Pitfalls
- Putting prompt text in span names. Span names must be low-cardinality.
chat.completionaggregates;chat: summarize the following 4000-word document...creates a unique span per request and destroys every aggregation. - Using
SimpleSpanProcessorin production. It exports on the request thread. Under load the exporter’s latency becomes your p99. UseBatchSpanProcessor. - Forgetting
service.nameon the resource. Spans land underunknown_serviceand you can’t filter by service. - Treating truncation as success. A
finish_reasonoflengthis a silent failure. Emit an event so it’s queryable. - Recording exceptions without setting status.
record_exceptionadds an event but does not mark the span failed. You needset_status(Status(StatusCode.ERROR))too, or your error rate stays at zero. - High-cardinality attributes.
gen_ai.response.idis fine as an attribute for lookup but never as a metric dimension.
Troubleshooting
Symptom: no spans appear in the backend. Cause: the exporter can’t reach the collector, and BatchSpanProcessor swallows export errors silently. Fix: run with OTEL_LOG_LEVEL=debug, confirm OTEL_EXPORTER_OTLP_ENDPOINT is the gRPC port (4317, not the HTTP 4318), and verify the collector is listening.
Symptom: spans show up but durations are near zero. Cause: the span closed before the streaming response was drained — the with block exited while iteration was still pending. Fix: keep the iteration loop inside the with block, as in the streaming example.
Symptom: worker spans appear as separate root traces. Cause: context wasn’t propagated across the queue. Fix: inject on enqueue, extract on dequeue, and pass the resulting context to start_as_current_span.
Symptom: process exits and the last few traces are missing. Cause: the batch processor’s queue wasn’t flushed on shutdown. Fix: call trace.get_tracer_provider().shutdown() in your shutdown hook so it drains and exports.
Symptom: token attributes are missing on streamed calls. Cause: stream_options wasn’t set. Fix: pass stream_options={"include_usage": True} and read usage from the final empty-choices chunk.
Wrapping Up
You now have spans that capture model, tokens, latency, error type, and truncation — and they stitch together across async boundaries. The next step is standardizing those attribute names so dashboards built by one team work for another. I’ll cover that in the GenAI semantic conventions post. After that, turning these spans into per-request cost accounting is a short hop.