Tracking Token Usage and Cost per Request with OpenTelemetry
TL;DR — Emit token counters with low-cardinality dimensions / compute cost in a pricing table at query time, not at ingest / alert on cost burn rate so the bill never surprises you.
A finance person once forwarded me a model-provider invoice with a single highlighted line and the subject “what is this.” It was a 40% month-over-month jump. I had traces. I had latency dashboards. I had exactly zero ability to say which feature or which customer drove the increase, because I’d never instrumented token usage as a first-class signal. I spent two days correlating log lines to reconstruct what one good metric would have shown instantly.
Token usage cost telemetry is the metrics layer that makes spend a queryable, alertable quantity instead of a monthly surprise. Traces tell you about a single request. Metrics tell you about aggregate behavior over time — exactly what you need for cost. This post builds that layer with OpenTelemetry metrics, exports to Prometheus 3.x, and ends with burn-rate alerting.
This builds directly on the attribute names from the GenAI semantic conventions. If you’ve adopted those, the metric dimensions here will look familiar. That’s deliberate.
Tokens Are a Metric, Cost Is a Query
The single most important decision: do not bake cost into the metric. Emit token counts. Compute cost at query time from a pricing table.
Why? Prices change. Providers cut prices, add tiers, introduce cached-input discounts. If you multiply tokens by a price at emission time and store dollars, every price change silently corrupts your historical data and you can’t recompute. If you store raw token counts, a price change is a one-line edit to a recording rule and your whole history stays accurate.
So the pipeline is: counters for input and output tokens, dimensioned by model and a few low-cardinality labels, exported to Prometheus, with cost expressed as recording rules.
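To make the split concrete, here is a minimal Python sketch of query-time pricing. The `PRICES_PER_MTOK` table, the `cost_usd` helper, and the sample `history` row are hypothetical names and data for illustration; the rates are the gpt-4o figures used later in this post. The point is only that stored token counts can be priced with any table you hand in, while stored dollars cannot be unpicked.

```python
# Hypothetical pricing table: USD per 1M tokens, keyed by model.
PRICES_PER_MTOK = {
    "gpt-4o-2024-11-20": {"input": 2.50, "output": 10.00},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int,
             prices: dict = PRICES_PER_MTOK) -> float:
    """Derive cost from raw token counts using whichever price table applies."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6


# Stored history holds token counts only, never dollars (made-up sample row).
history = [("gpt-4o-2024-11-20", 1_200_000, 300_000)]

for model, tokens_in, tokens_out in history:
    print(f"{model}: ${cost_usd(model, tokens_in, tokens_out):.2f}")
```

Passing a different `prices` dict re-prices the same history without touching the stored counts, which is the property the recording rules below rely on.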
Setting Up the Meter
# metrics.py
import os

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.prometheus import PrometheusMetricReader

_METER = None


def init_metrics():
    global _METER
    if _METER is None:
        resource = Resource.create(
            {
                "service.name": os.getenv("OTEL_SERVICE_NAME", "llm-gateway"),
                "deployment.environment": os.getenv("ENV", "local"),
            }
        )
        # PrometheusMetricReader registers the SDK's metrics with the
        # prometheus_client registry; the HTTP endpoint is served separately.
        reader = PrometheusMetricReader()
        provider = MeterProvider(resource=resource, metric_readers=[reader])
        metrics.set_meter_provider(provider)
        _METER = metrics.get_meter("llm.cost", "1.0.0")
    return _METER
The PrometheusMetricReader turns the SDK into a pull-based scrape target. You expose its registry on an HTTP endpoint:
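# metrics_server.py
from prometheus_client import start_http_server

from metrics import init_metrics

init_metrics()
# Serves the OTel SDK's metrics at http://0.0.0.0:9464/metrics.
# start_http_server runs in a daemon thread, so call this from the same
# long-running process that records the metrics.
start_http_server(9464)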
Defining the Instruments
Three counters: two for tokens, one for requests. Counters because token usage is monotonic and additive, and Prometheus’ rate() and increase() are built for exactly this.
# instruments.py
from metrics import init_metrics

meter = init_metrics()

input_tokens = meter.create_counter(
    name="gen_ai.client.token.usage.input",
    unit="{token}",
    description="Input (prompt) tokens consumed by LLM calls",
)

output_tokens = meter.create_counter(
    name="gen_ai.client.token.usage.output",
    unit="{token}",
    description="Output (completion) tokens produced by LLM calls",
)

request_count = meter.create_counter(
    name="gen_ai.client.requests",
    unit="{request}",
    description="Count of LLM requests by outcome",
)
The dimension design is where cost telemetry succeeds or fails. Every label multiplies your time series count. Keep dimensions low-cardinality and bounded:
- gen_ai.request.model: bounded (you use a handful of models).
- gen_ai.operation.name: bounded (chat, embeddings).
- feature: bounded if you control it. A fixed enum of feature names is fine.
Never use user_id, request_id, or raw prompt text as a label. That’s unbounded cardinality and it will take down your Prometheus.
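A rough way to see why this matters is to multiply out the label cardinalities. The sketch below is illustrative only: the counts in `BOUNDED`, the `ALLOWED_FEATURES` enum, and both helper names are assumptions, not values from a real deployment. It also shows one possible guard that collapses unknown feature names into a single bucket so the `feature` label cannot grow by accident.

```python
from math import prod

# Assumed label cardinalities for illustration; substitute your own counts.
BOUNDED = {"gen_ai.request.model": 4, "gen_ai.operation.name": 2, "feature": 12}

# Hypothetical fixed enum of feature names.
ALLOWED_FEATURES = frozenset({"search", "summarize", "chat_assistant"})


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series per metric: the product of label cardinalities."""
    return prod(label_cardinalities.values())


def safe_feature(feature: str) -> str:
    """Collapse unknown feature names into one bucket so the label stays bounded."""
    return feature if feature in ALLOWED_FEATURES else "other"


print(worst_case_series(BOUNDED))                          # 96
print(worst_case_series({**BOUNDED, "user_id": 50_000}))   # 4800000
print(safe_feature("search"), safe_feature("typo-feature"))
```

With these assumed numbers, one extra `user_id` label turns 96 possible series per counter into 4.8 million.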
Recording on Every Call
# record.py
from openai import OpenAI, APIError

from instruments import input_tokens, output_tokens, request_count

client = OpenAI()


def tracked_chat(messages: list[dict], *, feature: str) -> str:
    model = "gpt-4o-2024-11-20"
    # Bounded label set, reused for every counter to keep series aligned.
    dims = {
        "gen_ai.request.model": model,
        "gen_ai.operation.name": "chat",
        "feature": feature,
    }
    try:
        resp = client.chat.completions.create(
            model=model, messages=messages, max_tokens=512
        )
    except APIError as exc:
        request_count.add(1, {**dims, "outcome": "error",
                              "error.type": type(exc).__name__})
        raise
    usage = resp.usage
    if usage is not None:
        input_tokens.add(usage.prompt_tokens, dims)
        output_tokens.add(usage.completion_tokens, dims)
    request_count.add(1, {**dims, "outcome": "success"})
    return resp.choices[0].message.content or ""
A subtle but important point: the cached-input case. Many providers now bill cached prompt tokens at a discount. If your provider reports cached tokens separately (OpenAI exposes usage.prompt_tokens_details.cached_tokens), record them on a separate counter so the pricing rule can apply the discounted rate:
# cached.py
from instruments import meter

cached_input_tokens = meter.create_counter(
    name="gen_ai.client.token.usage.cached_input",
    unit="{token}",
    description="Cached (discounted) input tokens",
)


# Helper to call inside tracked_chat, after the API call returns:
def record_cached(usage, dims) -> None:
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    if cached:
        cached_input_tokens.add(cached, dims)
        # The plain input counter already counted these; the pricing rule
        # subtracts cached from total before applying the full rate.
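The subtraction mentioned in that comment is easiest to check with plain arithmetic. The sketch below assumes, as the comment does, that the provider's prompt token count already includes the cached tokens. The `request_cost_usd` helper and the sample token counts are hypothetical; the rates are the per-1M-token figures quoted in the pricing comment later in this post.

```python
# Rates in USD per 1M tokens, taken from the pricing comment in this post.
FULL_INPUT, CACHED_INPUT, OUTPUT = 2.50, 1.25, 10.00


def request_cost_usd(prompt_tokens: int, cached_tokens: int,
                     completion_tokens: int) -> float:
    """Price one request when prompt_tokens already includes cached_tokens."""
    uncached = prompt_tokens - cached_tokens  # avoid billing cached tokens twice
    return (uncached * FULL_INPUT
            + cached_tokens * CACHED_INPUT
            + completion_tokens * OUTPUT) / 1e6


# Made-up request: 10,000 prompt tokens (8,000 of them cached), 500 output tokens.
naive = (10_000 * FULL_INPUT + 500 * OUTPUT) / 1e6
discounted = request_cost_usd(10_000, 8_000, 500)
print(f"naive={naive:.4f} discounted={discounted:.4f}")
```

For this made-up request, ignoring the cached discount over-reports cost by half again ($0.03 versus $0.02), which is why the cached count needs its own counter.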
Computing Cost in Prometheus
Now the part that turns tokens into dollars without ever storing dollars in a way you can’t recompute. Prices live in recording rules. When a price changes, you edit one number and reload.
# prometheus-rules.yaml
groups:
  - name: llm_cost
    interval: 30s
    rules:
      # --- Pricing constants, USD per 1M tokens, early-2026 rates. ---
      # gpt-4o-2024-11-20: $2.50 input / $10.00 output / $1.25 cached input.
      # These rules price all input tokens at the full rate; the
      # cached-input discount is not applied here.
      - record: llm:cost_usd:rate5m
        expr: |
          (
            sum by (feature, gen_ai_request_model) (
              rate(gen_ai_client_token_usage_input_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[5m])
            ) * 2.50 / 1e6
          )
          +
          (
            sum by (feature, gen_ai_request_model) (
              rate(gen_ai_client_token_usage_output_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[5m])
            ) * 10.00 / 1e6
          )
      # Hourly spend per feature, useful for the FinOps dashboard.
      # The model selector keeps other models from being priced at these rates.
      - record: llm:cost_usd:increase1h
        expr: |
          (
            sum by (feature) (
              increase(gen_ai_client_token_usage_input_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[1h])
            ) * 2.50 / 1e6
          )
          +
          (
            sum by (feature) (
              increase(gen_ai_client_token_usage_output_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[1h])
            ) * 10.00 / 1e6
          )
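Note the metric name suffix _total: the Prometheus exporter appends it to counters automatically, and the OTel gen_ai.client.token.usage.input name has its dots converted to underscores on export. The same dot-to-underscore conversion applies to attribute keys, so gen_ai.request.model becomes the label gen_ai_request_model. Get those transformations wrong and your queries silently match nothing.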
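As a quick sanity check, the renaming can be approximated in a few lines of Python. This is a simplified illustration, not the exporter's actual code: both function names are hypothetical, and real exporters also handle units and other edge cases, so treat your live /metrics output as the source of truth.

```python
import re


def prometheus_counter_name(otel_name: str) -> str:
    """Approximate the exported name of an OTel counter.

    Simplified illustration, not the exporter's code: replace characters
    Prometheus does not allow with underscores, then add the _total suffix.
    """
    sanitized = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    return sanitized if sanitized.endswith("_total") else f"{sanitized}_total"


def prometheus_label_name(otel_attr: str) -> str:
    """Attribute keys become label names the same way, without the suffix."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", otel_attr)


print(prometheus_counter_name("gen_ai.client.token.usage.input"))
# gen_ai_client_token_usage_input_total
print(prometheus_label_name("gen_ai.request.model"))
# gen_ai_request_model
```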
Once llm:cost_usd:rate5m exists, your Grafana panels query a clean, fast pre-aggregated series instead of recomputing the arithmetic on every dashboard load. Prometheus recording rules are documented at prometheus.io/docs.
Alerting on Cost Burn
A monthly budget is useless if you discover the overrun on the 1st. Alert on burn rate — projected spend versus budget.
# prometheus-alerts.yaml
groups:
  - name: llm_cost_alerts
    rules:
      - alert: LLMCostBurnRateHigh
        # Projected 24h spend exceeds the daily budget of $400.
        expr: (sum(llm:cost_usd:rate5m) * 86400) > 400
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LLM spend projecting over daily budget"
          # No backslash escapes here: a folded block scalar keeps them literally.
          description: >
            Current burn rate projects ${{ $value | printf "%.0f" }}/day,
            budget is $400. Check the feature breakdown panel.
      - alert: LLMCostSpikeSudden
        # 5m cost rate is 3x its own 1h average: a sudden jump, likely a loop.
        # Both sides use the same input+output cost series.
        expr: |
          sum(llm:cost_usd:rate5m)
            > 3 * sum(avg_over_time(llm:cost_usd:rate5m[1h]))
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM cost spiked 3x over baseline"
The for: 15m on the burn-rate alert matters — a brief traffic spike shouldn’t page anyone. The sudden-spike alert has a shorter window because a 3x jump is usually a retry loop or a runaway agent, and that should page fast.
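The arithmetic behind both alerts is small enough to check by hand. The sketch below restates it in Python under the same assumptions as the rules above: a per-second cost rate and a $400 daily budget. The function names and the sample rates are hypothetical.

```python
SECONDS_PER_DAY = 86_400
DAILY_BUDGET_USD = 400.0  # the budget used in the alert above


def projected_daily_spend(cost_rate_usd_per_s: float) -> float:
    """Extrapolate a per-second cost rate to a full day."""
    return cost_rate_usd_per_s * SECONDS_PER_DAY


def is_spike(rate_5m: float, baseline_1h: float, factor: float = 3.0) -> bool:
    """True when the short-window rate exceeds factor times the baseline."""
    return rate_5m > factor * baseline_1h


# Per-second cost rate at which the projection crosses the budget.
threshold = DAILY_BUDGET_USD / SECONDS_PER_DAY
print(f"budget is crossed above ${threshold:.5f}/s")

# Made-up sample rates in USD per second.
print(projected_daily_spend(0.005) > DAILY_BUDGET_USD)
print(is_spike(rate_5m=0.012, baseline_1h=0.003))
```

A sustained rate of roughly half a cent per second is already over budget, which is why the projection is more useful than waiting for a daily total.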
Common Pitfalls
- Storing cost instead of tokens. A price change corrupts history irreversibly. Store tokens, derive cost.
- user_id as a label. Unbounded cardinality. Each unique user creates a new time series until Prometheus falls over.
- Forgetting _total. Exported counter names get a _total suffix. Queries without it match nothing.
- Double-counting cached tokens. If cached tokens are included in the plain input count, your pricing rule must subtract them before applying the full rate, or you over-report cost.
- Counting tokens only on success. Failed and timed-out calls can still consume input tokens (and you still pay). Decide deliberately whether to count them.
- No feature dimension. Without it you can see total spend but never which feature moved the bill, which is exactly the question finance asks.
Troubleshooting
Symptom: /metrics endpoint is empty. Cause: init_metrics() ran but no instrument has recorded a value yet, or the meter provider wasn’t set before instruments were created. Fix: ensure init_metrics() runs before any create_counter call, and send one request to populate the counters.
Symptom: cost dashboards read zero. Cause: the query uses the OTel dotted name or omits _total. Fix: query gen_ai_client_token_usage_input_total (underscores, _total suffix).
Symptom: Prometheus memory climbs steadily then OOMs. Cause: a high-cardinality label such as request_id or user_id. Fix: drop the label at the source; use metric_relabel_configs to strip it on scrape as a stopgap.
Symptom: cost looks too high after a provider price cut. Cause: the recording rule still has the old price constant. Fix: update the constant in prometheus-rules.yaml and reload Prometheus. Samples the rule already recorded keep the old price, but the raw token counters are untouched, so any past window can be re-priced with an ad-hoc query because you stored tokens, not dollars.
Symptom: the burn-rate alert flaps. Cause: a for window too short relative to normal traffic variance. Fix: lengthen for (the example here uses 15m) or compare against a longer-window baseline.
Wrapping Up
You can now answer “what is this” in seconds: spend is a metric, broken down by feature and model, with burn-rate alerts that fire before the invoice does. Because you stored tokens and not dollars, every past price change is just a recording-rule edit away from accurate. Next up in the series, we shift from LLM cost to the other half of an AI pipeline’s bill: monitoring vector database performance under load.