Prometheus Cardinality and Cost Control
TL;DR: Prometheus cost scales with series count, not data volume. High-cardinality labels (request_id, user_id) blow up cost. Use the /tsdb-status page to find offenders; drop them in scrape config or instrumentation. Shorter retention + recording rules + storage tier choice control the rest.
After error budgets, the next operational concern is cost. Prometheus is cheap until it isn’t. Most “Prometheus is expensive” stories trace back to cardinality.
The cardinality math
Each unique combination of labels = one time series. Each series:
- Stores samples (timestamp + value): 16 bytes each uncompressed, typically 1-2 bytes each on disk after TSDB compression
- Has overhead in the index (~10-50 bytes per series for label lookup)
- Costs CPU during scrape, query, compaction
For a metric with 4 labels of cardinality [10, 5, 100, 200]:
Series = 10 × 5 × 100 × 200 = 1,000,000 series
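At a 15s scrape interval × 30 days, each series holds 172,800 samples. At 16 raw bytes/sample × 1M series that is ~2.8 TB; even at the 1-2 bytes/sample Prometheus typically achieves on disk, it is still roughly 170-350 GB.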
One bad metric can dwarf your whole monitoring footprint.
Finding the offenders
The web UI exposes a stats page:
http://prometheus:9090/tsdb-status
Shows top series by metric name, top series by label, total series count.
For more detail:
# Top metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))
# Series count per label value for a specific metric
count by (path) (http_requests_total)
Or via the API:
curl -s http://prometheus:9090/api/v1/status/tsdb | jq .data
Identifies which metrics are expensive.
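To track this over time instead of eyeballing it, the API response can be parsed programmatically. The sketch below decodes only the seriesCountByMetricName list from a /api/v1/status/tsdb response body and returns the largest entries. The type and function names are illustrative, and the payload in main is hand-written sample data, not output from a real server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Stat is one name/value pair as found in the TSDB status lists.
type Stat struct {
	Name  string `json:"name"`
	Value uint64 `json:"value"`
}

// TSDBStatus models only the part of the response used here.
type TSDBStatus struct {
	Data struct {
		SeriesCountByMetricName []Stat `json:"seriesCountByMetricName"`
	} `json:"data"`
}

// TopMetrics parses a TSDB status response body and returns up to n
// metrics, largest series count first.
func TopMetrics(body []byte, n int) ([]Stat, error) {
	var status TSDBStatus
	if err := json.Unmarshal(body, &status); err != nil {
		return nil, fmt.Errorf("decode tsdb status: %w", err)
	}
	stats := status.Data.SeriesCountByMetricName
	sort.Slice(stats, func(i, j int) bool { return stats[i].Value > stats[j].Value })
	if len(stats) > n {
		stats = stats[:n]
	}
	return stats, nil
}

func main() {
	// Hand-written sample payload; real responses carry more fields.
	body := []byte(`{"status":"success","data":{"seriesCountByMetricName":[
		{"name":"up","value":12},
		{"name":"http_requests_total","value":48000},
		{"name":"http_user_login_attempts_total","value":910000}]}}`)

	top, err := TopMetrics(body, 2)
	if err != nil {
		panic(err)
	}
	for _, s := range top {
		fmt.Printf("%-35s %d\n", s.Name, s.Value)
	}
}
```

In practice the body would come from an HTTP GET against the Prometheus server; feeding the result into a nightly report is one way to notice a new offender before it becomes a problem.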
Common cardinality bombs
Looking at field experience, the worst offenders:
User ID / customer ID in labels. Bounded by user count = millions. Don’t.
Request ID in labels. Unique per request = infinite.
Full URL path. /users/42, /users/43 — separate series each.
Free-text labels. Error messages, user input echoes — unbounded.
Pod names from auto-scaling deployments. Pod names change on each rollout; old series persist until expiry.
Status code as label on an endpoint that returns many. If your API returns 100 distinct error codes, that’s 100× multiplier.
Fix each: either drop the label or normalize it.
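Normalizing usually means mapping an open-ended value onto a small, fixed set before it becomes a label. A minimal sketch, assuming two hypothetical helpers (StatusClass and Bounded are illustrative names, not part of any Prometheus client library):

```go
package main

import "fmt"

// StatusClass collapses an HTTP status code into a class such as "4xx",
// so the label can take at most six values (1xx-5xx plus "other").
func StatusClass(code int) string {
	if code < 100 || code > 599 {
		return "other"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// Bounded returns value if it is in the allow-list and "other" otherwise,
// which caps the label's cardinality at len(allowed)+1.
func Bounded(value string, allowed map[string]bool) string {
	if allowed[value] {
		return value
	}
	return "other"
}

func main() {
	fmt.Println(StatusClass(404)) // 4xx
	fmt.Println(StatusClass(503)) // 5xx

	errorKinds := map[string]bool{"timeout": true, "validation": true}
	fmt.Println(Bounded("timeout", errorKinds))                // timeout
	fmt.Println(Bounded("user typed: DROP TABLE", errorKinds)) // other
}
```

The trade-off is lost detail: a 404 and a 429 both become 4xx. If a specific code matters for alerting, add it to an allow-list rather than keeping the raw value.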
Dropping at scrape
In prometheus.yml:
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:8080']
    metric_relabel_configs:
      # Drop specific metrics entirely
      - source_labels: [__name__]
        regex: 'http_request_size_bytes_bucket'
        action: drop
      # Drop high-card label from kept metrics
      # (labeldrop matches the regex against label names, so no source_labels)
      - regex: 'user_id'
        action: labeldrop
      # Normalize path labels (collapse /users/<id> back to a generic pattern)
      - source_labels: [path]
        regex: '/users/\d+'
        target_label: path
        replacement: '/users/:id'
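Prometheus applies these after the scrape, before storage, so the bad labels never reach the TSDB. One caveat: after a labeldrop, series that differed only in that label collapse into one, so check that what remains is still uniquely labeled.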
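It helps to check what a label drop does to the series set before rolling it out. The following is a toy model, not Prometheus's relabeling code: it counts distinct label sets before and after removing one label, which shows how many series would collide. Labels, DistinctAfterDrop, and the sample data are all illustrative.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Labels is one series' label set.
type Labels map[string]string

// key builds a canonical string for a label set, skipping the dropped label.
func key(ls Labels, drop string) string {
	parts := make([]string, 0, len(ls))
	for name, value := range ls {
		if name == drop {
			continue
		}
		parts = append(parts, name+"="+value)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// DistinctAfterDrop counts how many distinct series remain once the label
// named drop is removed from every label set. An empty drop removes nothing.
func DistinctAfterDrop(series []Labels, drop string) int {
	seen := make(map[string]struct{})
	for _, ls := range series {
		seen[key(ls, drop)] = struct{}{}
	}
	return len(seen)
}

func main() {
	series := []Labels{
		{"user_id": "42", "result": "success"},
		{"user_id": "43", "result": "success"},
		{"user_id": "43", "result": "failure"},
	}
	fmt.Println("before:", DistinctAfterDrop(series, ""))
	fmt.Println("after dropping user_id:", DistinctAfterDrop(series, "user_id"))
}
```

Three series become two here. Relabeling does not sum the collapsed values, so for counters that collide like this, removing the label in the instrumentation is the safer fix.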
Fixing at instrumentation
Better: don’t emit the high-card metric in the first place.
For Go, where the label values come from:
// Bad
httpRequests.WithLabelValues(r.Method, r.URL.Path, fmt.Sprint(status)).Inc()
// Good — chi route pattern
httpRequests.WithLabelValues(r.Method, chi.RouteContext(r.Context()).RoutePattern(), fmt.Sprint(status)).Inc()
This eliminates the problem at the source. Catch it in code review.
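When the router does not expose a route pattern, a fallback is to normalize the raw path before using it as a label. The sketch below is a heuristic under stated assumptions: it treats all-digit segments and UUID-shaped segments as IDs and replaces them with ":id". NormalizePath is an illustrative name, and the heuristic will miss other unbounded segments such as slugs, so a real route pattern is still preferable.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// All-digit segment, e.g. "42".
	numericSegment = regexp.MustCompile(`^\d+$`)
	// UUID-shaped segment, e.g. "123e4567-e89b-12d3-a456-426614174000".
	uuidSegment = regexp.MustCompile(`^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$`)
)

// NormalizePath replaces path segments that look like IDs with ":id"
// so that /users/42 and /users/43 map to the same label value.
func NormalizePath(path string) string {
	segments := strings.Split(path, "/")
	for i, seg := range segments {
		if numericSegment.MatchString(seg) || uuidSegment.MatchString(seg) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	fmt.Println(NormalizePath("/users/42"))
	fmt.Println(NormalizePath("/users/42/orders/7"))
	fmt.Println(NormalizePath("/files/123e4567-e89b-12d3-a456-426614174000"))
	fmt.Println(NormalizePath("/health"))
}
```

The first two calls yield /users/:id and /users/:id/orders/:id; /health passes through unchanged.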
Retention math
Default retention is 15 days. Each day adds storage. Shorter retention = lower cost.
For SLO-critical metrics, you might want 60+ days for trend analysis. For verbose service metrics, 7 days is enough — beyond that, recording rules summarize.
prometheus --storage.tsdb.retention.time=30d --storage.tsdb.retention.size=100GB
The size limit truncates old data even if within retention time. Disk fills more predictably.
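For long-term: Thanos / Mimir / VictoriaMetrics. All expose a Prometheus-compatible query API backed by separate long-term storage (object storage such as S3 for Thanos and Mimir). Configure Prometheus to ship data there (block upload via the Thanos sidecar, remote_write for Mimir and VictoriaMetrics); query via the long-term layer.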
Recording rules for aggregations
Recording rules compute series and store them. The math:
A query like sum by (path) (rate(http_requests_total[5m])) evaluated against 1M series is expensive every dashboard load. A recording rule computes it every 30s, storing the result as a tiny series (api:request_rate_5m{path=...}).
Trade-offs:
- Storage: recording rule series stored long-term
- CPU: computed every interval instead of per-query
- Cardinality: typically lower (aggregated)
For commonly-queried aggregations, recording rules pay off. For one-off queries, raw data is fine.
Series limits (defense in depth)
Prometheus supports per-scrape limits on sample count and on label count and length (the label limits need a reasonably recent 2.x release):
scrape_configs:
  - job_name: 'api'
    sample_limit: 10000
    label_limit: 30
    label_name_length_limit: 128
    label_value_length_limit: 256
If a single scrape returns more than 10000 samples (counted after metric relabeling), the whole scrape is rejected and treated as failed. Catches sudden cardinality explosions in code.
A real example
Service emitting a metric per user_id “for debugging.” Hidden in a code path used rarely.
http_user_login_attempts_total{user_id="42",result="success"}
http_user_login_attempts_total{user_id="43",result="success"}
...
10M users → 10M series → Prometheus runs out of memory and crashes.
The diagnosis: topk(10, count by (__name__)(...)) shows this metric at the top.
Fix:
// Before
loginAttempts.WithLabelValues(userID, result).Inc()
// After — drop user_id from metric; log separately
loginAttempts.WithLabelValues(result).Inc()
log.Info("login attempt", "user_id", userID, "result", result)
Series count drops from 10M to 5 (one per result value). Prometheus happy.
Common Pitfalls
No cardinality monitoring. Discover the problem after Prometheus OOMs.
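Trying to fix high cardinality with relabeling alone. Relabeling drops the label before storage, but the target still generates and exposes every series and Prometheus still pays to scrape and parse them. Better: drop at source.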
Recording rule for everything. Recording rules cost storage too. Use for common dashboard queries; not for every aggregation.
Long retention for verbose metrics. Old debug metrics nobody queries. Shorter retention OR drop them.
Single Prometheus past 100K active series. As a rough rule of thumb, memory pressure starts here; past 1M, plan for scaling (sharding, Mimir). Exact thresholds depend on hardware and series churn.
No alert on series count. Alert if total series grows > 20% in 24h, for example prometheus_tsdb_head_series / (prometheus_tsdb_head_series offset 1d) > 1.2. Catches new cardinality bombs early.
Wrapping Up
Cardinality is the lever. Find offenders, fix at source, use recording rules for aggregations, shorter retention for verbose metrics. Friday: September retro.