March 10, 2026 · 8 min read · by Muhammad Amal programming

TL;DR: Emit token counters with low-cardinality dimensions, compute cost from a pricing table at query time rather than at ingest, and alert on cost burn rate so the bill never surprises you.

A finance person once forwarded me a model-provider invoice with a single highlighted line and the subject “what is this.” It was a 40% month-over-month jump. I had traces. I had latency dashboards. I had exactly zero ability to say which feature or which customer drove the increase, because I’d never instrumented token usage as a first-class signal. I spent two days correlating log lines to reconstruct what one good metric would have shown instantly.

Token usage cost telemetry is the metrics layer that makes spend a queryable, alertable quantity instead of a monthly surprise. Traces tell you about a single request. Metrics tell you about aggregate behavior over time — exactly what you need for cost. This post builds that layer with OpenTelemetry metrics, exports to Prometheus 3.x, and ends with burn-rate alerting.

This builds directly on the attribute names from the GenAI semantic conventions. If you’ve adopted those, the metric dimensions here will look familiar, and that’s deliberate.

Tokens Are a Metric, Cost Is a Query

The single most important decision: do not bake cost into the metric. Emit token counts. Compute cost at query time from a pricing table.

Why? Prices change. Providers cut prices, add tiers, introduce cached-input discounts. If you multiply tokens by a price at emission time and store dollars, every price change silently corrupts your historical data and you can’t recompute. If you store raw token counts, a price change is a one-line edit to a recording rule and your whole history stays accurate.

So the pipeline is: counters for input and output tokens, dimensioned by model and a few low-cardinality labels, exported to Prometheus, with cost expressed as recording rules.
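
To make the price-change argument concrete, here is a minimal sketch in plain Python. The second price table (PRICES_AFTER_CUT) and the stored token counts are invented for illustration; only the $2.50 / $10.00 rates match the rules used later in this post.

```python
# USD per 1M tokens, keyed by model. PRICES_AFTER_CUT is a made-up future
# price cut, used only to show that recomputation is possible.
PRICES_NOW = {"gpt-4o-2024-11-20": {"input": 2.50, "output": 10.00}}
PRICES_AFTER_CUT = {"gpt-4o-2024-11-20": {"input": 2.00, "output": 8.00}}


def cost_usd(model: str, input_tokens: int, output_tokens: int, prices: dict) -> float:
    """Derive cost from raw token counts and a pricing table."""
    rates = prices[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


# What the metrics pipeline stores: tokens, never dollars (illustrative numbers).
stored = {"model": "gpt-4o-2024-11-20", "input_tokens": 4_000_000, "output_tokens": 1_000_000}

cost_before = cost_usd(stored["model"], stored["input_tokens"], stored["output_tokens"], PRICES_NOW)
cost_after = cost_usd(stored["model"], stored["input_tokens"], stored["output_tokens"], PRICES_AFTER_CUT)
```

The stored counts are identical in both calculations; only the table changes. Had the pipeline stored cost_before as a dollar figure, the input and output tokens would already be blended into one number and cost_after could not be derived from it.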

Setting Up the Meter

# metrics.py
import os
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.prometheus import PrometheusMetricReader

_METER = None


def init_metrics():
    global _METER
    if _METER is None:
        resource = Resource.create(
            {
                "service.name": os.getenv("OTEL_SERVICE_NAME", "llm-gateway"),
                "deployment.environment": os.getenv("ENV", "local"),
            }
        )
        # PrometheusMetricReader exposes a /metrics scrape endpoint.
        reader = PrometheusMetricReader()
        provider = MeterProvider(resource=resource, metric_readers=[reader])
        metrics.set_meter_provider(provider)
        _METER = metrics.get_meter("llm.cost", "1.0.0")
    return _METER

The PrometheusMetricReader turns the SDK into a pull-based scrape target. You expose its registry on an HTTP endpoint:

# metrics_server.py
from prometheus_client import start_http_server
from metrics import init_metrics

init_metrics()
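# Serves the OTel SDK's metrics at http://0.0.0.0:9464/metrics.
# start_http_server runs in a daemon thread and returns immediately, so call
# it from the process that records the metrics and keep that process alive.
start_http_server(9464)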

Defining the Instruments

Three counters: two for tokens and one for requests. Counters fit because token usage is monotonic and additive, which is exactly what Prometheus’ rate() and increase() are built for.

# instruments.py
from metrics import init_metrics

meter = init_metrics()

input_tokens = meter.create_counter(
    name="gen_ai.client.token.usage.input",
    unit="{token}",
    description="Input (prompt) tokens consumed by LLM calls",
)
output_tokens = meter.create_counter(
    name="gen_ai.client.token.usage.output",
    unit="{token}",
    description="Output (completion) tokens produced by LLM calls",
)
request_count = meter.create_counter(
    name="gen_ai.client.requests",
    unit="{request}",
    description="Count of LLM requests by outcome",
)
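
For readers who have not looked at how a counter is consumed, the following is a simplified sketch of the reset handling behind increase(). It is not Prometheus code: the real function also extrapolates to the edges of the query window, which is ignored here, and the scrape values are invented.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across consecutive scrapes, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # The process restarted, so the counter began again from zero:
            # everything up to the current value is new usage.
            total += cur
    return total


# Illustrative scrapes of an input-token counter; the service restarts
# between the third and fourth scrape.
scrapes = [100.0, 250.0, 400.0, 30.0, 90.0]
tokens_used = counter_increase(scrapes)
```

This is why a restart of the gateway does not lose or double-count tokens: the drop from 400 to 30 is read as a reset, not as negative usage.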

The dimension design is where cost telemetry succeeds or fails. Every label multiplies your time series count. Keep dimensions low-cardinality and bounded:

  • gen_ai.request.model — bounded (you use a handful of models).
  • gen_ai.operation.name — bounded (chat, embeddings).
  • feature — bounded if you control it. A fixed enum of feature names is fine.

Never use user_id, request_id, or raw prompt text as a label. That’s unbounded cardinality and it will take down your Prometheus.
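
The multiplication is easy to underestimate, so here is a rough sketch of the arithmetic along with one way to keep the feature label bounded. The feature names, the "other" fallback, and the cardinality numbers are all hypothetical.

```python
from math import prod

# Hypothetical fixed set of feature names the gateway accepts as label values.
ALLOWED_FEATURES = frozenset({"search", "summarize", "support_chat"})


def bounded_feature(name: str) -> str:
    """Collapse anything outside the fixed set into a single 'other' bucket."""
    return name if name in ALLOWED_FEATURES else "other"


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# 4 models x 2 operations x 4 feature values (3 allowed + "other").
bounded = max_series(
    {"gen_ai.request.model": 4, "gen_ai.operation.name": 2, "feature": 4}
)
# The same metric with a user_id label and 50,000 users.
with_user_id = max_series(
    {"gen_ai.request.model": 4, "gen_ai.operation.name": 2, "feature": 4, "user_id": 50_000}
)
```

Passing caller-supplied strings through something like bounded_feature before they reach the dims dictionary keeps a typo or a free-text value from quietly creating new series.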

Recording on Every Call

# record.py
from openai import OpenAI, APIError
from instruments import input_tokens, output_tokens, request_count

client = OpenAI()


def tracked_chat(messages: list[dict], *, feature: str) -> str:
    model = "gpt-4o-2024-11-20"
    # Bounded label set — reused for every counter to keep series aligned.
    dims = {
        "gen_ai.request.model": model,
        "gen_ai.operation.name": "chat",
        "feature": feature,
    }
    try:
        resp = client.chat.completions.create(
            model=model, messages=messages, max_tokens=512
        )
    except APIError as exc:
        request_count.add(1, {**dims, "outcome": "error",
                               "error.type": type(exc).__name__})
        raise

    usage = resp.usage
    if usage is not None:
        input_tokens.add(usage.prompt_tokens, dims)
        output_tokens.add(usage.completion_tokens, dims)
    request_count.add(1, {**dims, "outcome": "success"})
    return resp.choices[0].message.content or ""

A subtle but important point: the cached-input case. Many providers now bill cached prompt tokens at a discount. If your provider reports cached tokens separately (OpenAI exposes usage.prompt_tokens_details.cached_tokens), record them on a separate counter so the pricing rule can apply the discounted rate:

# cached.py
from instruments import meter

cached_input_tokens = meter.create_counter(
    name="gen_ai.client.token.usage.cached_input",
    unit="{token}",
    description="Cached (discounted) input tokens",
)

# Call this from tracked_chat, right after the token counters are recorded.
def record_cached(usage, dims) -> None:
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    if cached:
        cached_input_tokens.add(cached, dims)
        # The plain input counter already counted these tokens, so the pricing
        # rule has to discount them rather than bill them a second time.

Computing Cost in Prometheus

Now the part that turns tokens into dollars without ever storing dollars in a way you can’t recompute. Prices live in recording rules. When a price changes, you edit one number and reload.

# prometheus-rules.yaml
groups:
  - name: llm_cost
    interval: 30s
    rules:
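      # --- Pricing constants, USD per 1M tokens, early-2026 rates. ---
      # gpt-4o-2024-11-20: $2.50 input / $10.00 output / $1.25 cached input.
      # Cached tokens are also in the plain input counter, so input is priced
      # at the full rate and the $1.25 cached discount is subtracted afterwards.
      # The "or ... * 0" fallback keeps series that have no cached counter yet.

      - record: llm:cost_usd:rate5m
        expr: |
          (
            sum by (feature, gen_ai_request_model) (
              rate(gen_ai_client_token_usage_input_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[5m])
            ) * 2.50 / 1e6
          )
          +
          (
            sum by (feature, gen_ai_request_model) (
              rate(gen_ai_client_token_usage_output_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[5m])
            ) * 10.00 / 1e6
          )
          -
          (
            sum by (feature, gen_ai_request_model) (
              rate(gen_ai_client_token_usage_cached_input_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[5m])
            ) * 1.25 / 1e6
            or
            sum by (feature, gen_ai_request_model) (
              rate(gen_ai_client_token_usage_input_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[5m])
            ) * 0
          )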

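      # Hourly spend per feature, useful for the FinOps dashboard.
      # The model selector keeps other models from being priced at gpt-4o rates.
      # Cached tokens are already in the input counter, so the final term
      # subtracts the $1.25 cached discount (zero where no cached series exists).
      - record: llm:cost_usd:increase1h
        expr: |
          (
            sum by (feature) (
              increase(gen_ai_client_token_usage_input_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[1h])
            ) * 2.50 / 1e6
          )
          +
          (
            sum by (feature) (
              increase(gen_ai_client_token_usage_output_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[1h])
            ) * 10.00 / 1e6
          )
          -
          (
            sum by (feature) (
              increase(gen_ai_client_token_usage_cached_input_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[1h])
            ) * 1.25 / 1e6
            or
            sum by (feature) (
              increase(gen_ai_client_token_usage_input_total{
                gen_ai_request_model="gpt-4o-2024-11-20"}[1h])
            ) * 0
          )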

Note the _total suffix on the metric names: the Prometheus export path appends it to counters automatically, and the OTel name gen_ai.client.token.usage.input has its dots converted to underscores on export. Label keys get the same treatment, so gen_ai.request.model becomes gen_ai_request_model. Get those transformations wrong and your queries silently match nothing.
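
A small helper can make the renaming explicit when you write queries by hand. This is an approximation of the export behavior described above, not the exporter's actual code, and both function names are made up. Exporters may also append a unit suffix for some units, so the /metrics output remains the final authority.

```python
import re


def prometheus_counter_name(otel_name: str) -> str:
    """Approximate the exported name of an OTel counter."""
    # Characters that are not valid in a classic Prometheus metric name
    # (dots in particular) become underscores.
    sanitized = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Counters are exposed with a _total suffix.
    return sanitized if sanitized.endswith("_total") else f"{sanitized}_total"


def prometheus_label_name(otel_key: str) -> str:
    """Approximate the exported name of an OTel attribute key."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", otel_key)


metric = prometheus_counter_name("gen_ai.client.token.usage.input")
label = prometheus_label_name("gen_ai.request.model")
```

Comparing the helper's output against one real scrape of the endpoint is a quick way to confirm the names before building dashboards on them.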

Once llm:cost_usd:rate5m exists, your Grafana panels query a clean, fast pre-aggregated series instead of recomputing the arithmetic on every dashboard load. Prometheus recording rules are documented at prometheus.io/docs.

Alerting on Cost Burn

A monthly budget is useless if you discover the overrun on the 1st. Alert on burn rate — projected spend versus budget.

# prometheus-alerts.yaml
groups:
  - name: llm_cost_alerts
    rules:
      - alert: LLMCostBurnRateHigh
        # Projected 24h spend exceeds the daily budget of $400.
        expr: (sum(llm:cost_usd:rate5m) * 86400) > 400
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LLM spend projecting over daily budget"
          description: >
            Current burn rate projects ${{ $value | printf "%.0f" }}/day,
            budget is $400. Check the feature breakdown panel.

      - alert: LLMCostSpikeSudden
        # 5m cost rate is more than 3x its own 1h average: a sudden jump,
        # likely a loop. Both sides use the same recorded cost series, so the
        # baseline covers input and output tokens alike.
        expr: |
          sum(llm:cost_usd:rate5m)
            > 3 * sum(avg_over_time(llm:cost_usd:rate5m[1h]))
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM cost spiked 3x over baseline"

The for: 15m on the burn-rate alert matters: a brief traffic spike shouldn’t page anyone. The sudden-spike alert uses a shorter for: 5m because a 3x jump is usually a retry loop or a runaway agent, and that should page fast.
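
The arithmetic behind both alert expressions is short enough to sketch in Python. The function names and sample rates below are illustrative; the 86400 multiplier, the $400 daily budget, and the 3x factor come from the rules above.

```python
SECONDS_PER_DAY = 86_400


def projected_daily_spend(usd_per_second: float) -> float:
    """Extrapolate the current per-second cost rate to a full day."""
    return usd_per_second * SECONDS_PER_DAY


def burn_rate_exceeded(usd_per_second: float, daily_budget_usd: float) -> bool:
    """Mirror of the burn-rate alert condition."""
    return projected_daily_spend(usd_per_second) > daily_budget_usd


def is_sudden_spike(rate_5m: float, rate_1h: float, factor: float = 3.0) -> bool:
    """Mirror of the spike alert: short-window rate vs. a multiple of the baseline."""
    return rate_5m > factor * rate_1h


# Illustrative rates in USD per second.
over_budget = burn_rate_exceeded(0.005, daily_budget_usd=400)   # about $432/day
under_budget = burn_rate_exceeded(0.004, daily_budget_usd=400)  # about $346/day
spike = is_sudden_spike(rate_5m=0.012, rate_1h=0.003)
```

Working backwards is useful when picking a threshold: a $400 daily budget corresponds to roughly $0.0046 per second, which is the number the 5m rate has to stay under.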

Common Pitfalls

  • Storing cost instead of tokens. A price change corrupts history irreversibly. Store tokens, derive cost.
  • user_id as a label. Unbounded cardinality. Each unique user creates a new time series until Prometheus falls over.
  • Forgetting _total. Exported counter names get a _total suffix. Queries without it match nothing.
  • Double-counting cached tokens. If cached tokens are included in the plain input count, your pricing rule must subtract them before applying the full rate, or you over-report cost.
  • Counting tokens only on success. Failed and timed-out calls can still consume input tokens (and you still pay). Decide deliberately whether to count them.
  • No feature dimension. Without it you can see total spend but never which feature moved the bill — exactly the question finance asks.
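
The double-counting pitfall is easiest to see with numbers. A minimal sketch, assuming the $2.50 full and $1.25 cached input rates quoted earlier and an invented request mix in which 40% of prompt tokens were cache hits:

```python
def input_cost_usd(
    prompt_tokens: int,
    cached_tokens: int,
    full_rate: float = 2.50,    # USD per 1M uncached input tokens
    cached_rate: float = 1.25,  # USD per 1M cached input tokens
) -> float:
    """Price input tokens when prompt_tokens already includes the cached ones."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * full_rate + cached_tokens * cached_rate) / 1_000_000


prompt_tokens = 1_000_000
cached_tokens = 400_000

# Ignoring the cache: every prompt token billed at the full rate.
naive = prompt_tokens * 2.50 / 1_000_000
# Subtracting cached tokens before applying the full rate.
correct = input_cost_usd(prompt_tokens, cached_tokens)
```

With this mix the naive figure overstates input cost by 25%, which is large enough to trigger a burn-rate alert that the real invoice would not justify.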

Troubleshooting

Symptom: /metrics endpoint is empty. Cause: init_metrics() ran but no instrument has recorded a value yet, or the meter provider wasn’t set before instruments were created. Fix: ensure init_metrics() runs before any create_counter call, and send one request to populate the counters.

Symptom: cost dashboards read zero. Cause: the query uses the OTel dotted name or omits _total. Fix: query gen_ai_client_token_usage_input_total (underscores, _total suffix).

Symptom: Prometheus memory climbs steadily then OOMs. Cause: a high-cardinality label such as request_id or user_id. Fix: drop the label at the source; use metric_relabel_configs to strip it on scrape as a stopgap.

Symptom: cost looks too high after a provider price cut. Cause: the recording rule still has the old price constant. Fix: update the constant in prometheus-rules.yaml and reload Prometheus. New evaluations use the new price, and because you stored tokens, not dollars, any period that was priced wrongly can be recomputed from the raw counters.

Symptom: the burn-rate alert flaps. Cause: a for window too short relative to normal traffic variance. Fix: lengthen for beyond the 15m used here, or compare against a longer-window baseline.

Wrapping Up

You can now answer “what is this” in seconds — spend is a metric, broken down by feature and model, with burn-rate alerts that fire before the invoice does. Because you stored tokens and not dollars, every past price change is just a recording-rule edit away from accurate. Next up in the series, we shift from LLM cost to the other half of an AI pipeline’s bill, monitoring vector database performance under load.
