Anomaly Detection on Prometheus Metrics, A Hands-On Guide
TL;DR — Start with z-score recording rules in Prometheus 3.0, add MAD when your data has outliers, layer EWMA for slow drift, and only reach for isolation forest when you have multivariate problems and a real budget.
Most teams overcomplicate anomaly detection. They reach for Prophet or an LSTM before they’ve tried subtracting yesterday’s average. The boring statistical methods catch the vast majority of real anomalies and cost almost nothing to operate. The fancy methods catch the long tail, but only after you’ve tuned the boring ones for your data.
This guide walks through four methods in increasing order of complexity: z-score, MAD (median absolute deviation), EWMA (exponentially weighted moving average), and isolation forest. The first three live as Prometheus 3.0 recording and alerting rules. The fourth runs as a Python sidecar that reads Prometheus, scores, and publishes the result through a Pushgateway that Prometheus scrapes.
By the end, you’ll have a tiered detection stack: cheap rules that fire fast, medium-cost rules for noisier signals, and an ML sidecar for the handful of metrics where you can’t afford to miss a subtle anomaly. The trick is knowing which tier each metric belongs in, and the answer is almost always “the cheapest one that works”.
1. The Tiered Strategy
Tier 1 (z-score): golden signals, request rate, error rate
Tier 2 (MAD): latency percentiles, queue depths
Tier 3 (EWMA): slow-moving SLI burn, capacity metrics
Tier 4 (isolation): multivariate (CPU+latency+saturation)
Every metric starts in Tier 1. Move it to the next tier only when the current one misses something real or produces noise you can’t tune away.
2. Tier 1, Z-Score Rules in Prometheus
The simplest workable detector: how many standard deviations is the current value from the recent past? Use a 1-day offset to compare to the same time of day yesterday.
# rules/zscore.yaml, Prometheus 3.0
groups:
  - name: zscore-detectors
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_requests:zscore
        expr: |
          (
            job:http_requests:rate5m
            - avg_over_time(job:http_requests:rate5m[1h] offset 1d)
          )
          /
          stddev_over_time(job:http_requests:rate5m[1h] offset 1d)
      - alert: RequestRateAnomaly
        expr: abs(job:http_requests:zscore) > 3
        for: 5m
        labels:
          severity: warning
          detector: zscore
        annotations:
          summary: "Request rate anomaly on {{ $labels.job }} (z={{ $value }})"
The for: 5m is critical. Without it you’ll fire on every transient spike. The 1-hour window with a 1-day offset gives you “this hour yesterday” as a baseline, which handles daily seasonality for free.
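To make the arithmetic behind the recording rule concrete, here is a minimal Python sketch of the same calculation. The sample values and the zscore helper are made up for illustration. It uses the population standard deviation (pstdev), which is what stddev_over_time computes.

```python
from statistics import fmean, pstdev


def zscore(current: float, baseline: list[float]) -> float:
    """Standard deviations between `current` and the mean of the baseline window."""
    mean = fmean(baseline)
    std = pstdev(baseline)  # population standard deviation
    if std == 0:
        # Flat baseline. Returning 0.0 is a choice made for this sketch;
        # the PromQL rule would produce NaN or +/-Inf instead.
        return 0.0
    return (current - mean) / std


# Hypothetical requests/sec samples from "this hour yesterday".
baseline = [100.0, 104.0, 96.0, 102.0, 98.0]
normal = zscore(103.0, baseline)
spike = zscore(130.0, baseline)
print(round(normal, 2), round(spike, 2))
```

With these numbers the first value stays well inside the 3-sigma band and the second lands far outside it, which is the condition the alert rule checks with abs(...) > 3.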
Z-score’s weakness is that it assumes the underlying distribution is normal-ish. For request counts at high volume that’s fine. For p99 latency, it’s terrible. That’s where MAD comes in.
3. Tier 2, MAD for Robust Detection
MAD is the median of absolute deviations from the median. It’s resistant to outliers in a way that mean and standard deviation aren’t.
Prometheus has no dedicated median function, so we use quantile_over_time with a quantile of 0.5, plus a subquery to take the median of the absolute deviations.
  - name: mad-detectors
    interval: 30s
    rules:
      - record: job:http_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_p99:median_1h_offset_1d
        expr: quantile_over_time(0.5, job:http_p99:5m[1h] offset 1d)
      - record: job:http_p99:mad_1h_offset_1d
        expr: |
          quantile_over_time(0.5,
            abs(job:http_p99:5m - job:http_p99:median_1h_offset_1d)[1h:30s] offset 1d)
      - record: job:http_p99:modified_zscore
        expr: |
          0.6745 *
          (job:http_p99:5m - job:http_p99:median_1h_offset_1d)
          / clamp_min(job:http_p99:mad_1h_offset_1d, 0.001)
      - alert: P99LatencyAnomalyRobust
        expr: job:http_p99:modified_zscore > 3.5
        for: 5m
        labels:
          severity: warning
          detector: mad
The constant 0.6745 is the 75th percentile of the standard normal distribution (equivalently, the reciprocal of the 1.4826 factor that scales MAD to a standard deviation for normal data). It makes the modified z-score comparable to a regular z-score. The 3.5 threshold is the Iglewicz and Hoaglin recommendation.
clamp_min protects against division by zero when the metric is rock-stable.
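The modified z-score is easy to check by hand. The sketch below mirrors the recording rules in plain Python, using only the standard library. The latency samples, the modified_zscore helper, and the floor parameter name are illustrative assumptions, not part of the rules file.

```python
from statistics import NormalDist, median

# 0.75 quantile of the standard normal distribution, about 0.6745.
K = NormalDist().inv_cdf(0.75)


def modified_zscore(current: float, baseline: list[float], floor: float = 0.001) -> float:
    """Modified z-score of `current` against the median and MAD of a baseline window."""
    med = median(baseline)
    mad = median(abs(x - med) for x in baseline)
    # max() plays the role of clamp_min: it avoids dividing by a zero MAD.
    return K * (current - med) / max(mad, floor)


# Hypothetical p99 latencies in seconds; the last sample is a wild outlier.
baseline = [0.20, 0.21, 0.19, 0.22, 0.20, 0.21, 5.00]
score_ok = modified_zscore(0.22, baseline)
score_bad = modified_zscore(0.60, baseline)
print(round(K, 4), round(score_ok, 2), round(score_bad, 2))
```

The 5-second outlier in the baseline barely moves the median or the MAD, so a 0.22 s reading still scores as normal while a 0.60 s reading clears the 3.5 threshold. A mean and standard deviation computed over the same window would be dominated by that single sample.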
4. Tier 3, EWMA for Slow Drift
EWMA shines when you care about gradual changes that z-score misses. Memory leaks, slow capacity drift, error rates creeping up over hours.
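Prometheus 3.0 renamed holt_winters to double_exponential_smoothing and moved it behind the promql-experimental-functions feature flag. I find a manual fast-versus-slow comparison easier to reason about. Note that the slow baseline below is a plain 1-hour moving average (avg_over_time), a simple stand-in for a true exponentially weighted average.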
  - name: ewma-detectors
    interval: 30s
    rules:
      - record: job:error_rate:fast
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      - record: job:error_rate:slow
        expr: |
          avg_over_time(job:error_rate:fast[1h])
      - record: job:error_rate:drift
        expr: |
          (job:error_rate:fast - job:error_rate:slow)
          /
          clamp_min(job:error_rate:slow, 0.0001)
      - alert: ErrorRateSlowDrift
        expr: job:error_rate:drift > 0.5
        for: 30m
        labels:
          severity: warning
          detector: ewma_drift
The for: 30m is on purpose. EWMA detects slow drift, so a slow-firing alert is fine. The 0.5 threshold means “fast rate is 50% above the 1-hour average”, which is a meaningful drift you don’t want to miss.
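The rules above build the slow signal with avg_over_time, which weights every sample in the hour equally. For comparison, here is a minimal sketch of a textbook EWMA, where each new sample is blended with the previous smoothed value. The alpha values, the sample series, and both function names are assumptions chosen for illustration.

```python
def ewma(values: list[float], alpha: float) -> list[float]:
    """Exponentially weighted moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: list[float] = []
    for x in values:
        prev = smoothed[-1] if smoothed else x  # seed with the first sample
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed


def drift(fast: float, slow: float, floor: float = 0.0001) -> float:
    """Relative gap between the fast and slow signals, mirroring job:error_rate:drift."""
    return (fast - slow) / max(slow, floor)


# Hypothetical error ratios: flat at 1%, then creeping up to 2%.
errors = [0.010] * 20 + [0.010 + 0.001 * i for i in range(1, 11)]
slow = ewma(errors, alpha=0.05)[-1]  # small alpha: long memory
fast = ewma(errors, alpha=0.5)[-1]   # large alpha: tracks recent samples
print(round(drift(fast, slow), 2))
```

A smaller alpha behaves like a longer averaging window. The gap between the two smoothed values is the same kind of signal the drift rule alerts on.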
5. Tier 4, Isolation Forest Sidecar
Some anomalies are only visible when you combine signals. Latency up + CPU up + GC up + RPS down is a different story than any one of those alone. That’s where an isolation forest earns its keep.
The architecture is a Python sidecar that queries Prometheus, scores the latest sample, and pushes the anomaly score to a Pushgateway, which Prometheus then scrapes.
# sidecar/iforest.py
import asyncio
import time

import httpx
import numpy as np
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from sklearn.ensemble import IsolationForest

PROM = "http://prometheus:9090"
PUSHGW = "http://pushgateway:9091"

# PromQL templates; {svc} is filled in with str.format, so literal braces are doubled.
METRICS = {
    "p99": 'histogram_quantile(0.99, sum by (le)(rate(http_request_duration_seconds_bucket{{job="{svc}"}}[5m])))',
    "cpu": 'sum(rate(container_cpu_usage_seconds_total{{pod=~"{svc}-.*"}}[5m]))',
    "rps": 'sum(rate(http_requests_total{{job="{svc}"}}[5m]))',
    "err": 'sum(rate(http_requests_total{{job="{svc}",status=~"5.."}}[5m]))',
    "gc": 'sum(rate(go_gc_duration_seconds_sum{{job="{svc}"}}[5m]))',
}


class Detector:
    def __init__(self, service: str, train_hours: int = 24):
        self.service = service
        self.model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
        self.train_hours = train_hours
        self.feature_names = list(METRICS.keys())

    async def fetch(self, query: str, start: int, end: int) -> np.ndarray:
        async with httpx.AsyncClient(timeout=30) as c:
            r = await c.get(f"{PROM}/api/v1/query_range",
                            params={"query": query, "start": start, "end": end, "step": 30})
            r.raise_for_status()
            data = r.json()["data"]["result"]
        if not data:
            # No series returned: fall back to zeros so the feature column still exists.
            return np.zeros(((end - start) // 30) + 1)
        return np.array([float(v[1]) for v in data[0]["values"]])

    async def train(self):
        end = int(time.time())
        start = end - self.train_hours * 3600
        cols = []
        for q in METRICS.values():
            cols.append(await self.fetch(q.format(svc=self.service), start, end))
        # Series can differ in length; truncate to the shortest before stacking.
        n = min(len(c) for c in cols)
        X = np.nan_to_num(np.column_stack([c[:n] for c in cols]))
        self.model.fit(X)

    async def score_current(self) -> dict:
        end = int(time.time())
        start = end - 600
        latest = []
        for q in METRICS.values():
            arr = await self.fetch(q.format(svc=self.service), start, end)
            latest.append(float(arr[-1]) if len(arr) else 0.0)
        x = np.nan_to_num(np.array([latest]))
        # decision_function is shifted by the contamination offset, so negative
        # values mean "more abnormal than the training cut-off". The raw
        # score_samples output is always negative, so a fixed threshold like
        # -0.1 would flag every sample.
        score = float(self.model.decision_function(x)[0])
        is_anom = score < -0.1
        return {"score": score, "anomaly": is_anom,
                "features": dict(zip(self.feature_names, latest))}

    def publish(self, result: dict):
        reg = CollectorRegistry()
        g = Gauge("anomaly_iforest_score", "Isolation forest anomaly score",
                  ["service"], registry=reg)
        g.labels(service=self.service).set(result["score"])
        push_to_gateway(PUSHGW, job="iforest", registry=reg)


async def loop():
    d = Detector("checkout-api")
    await d.train()
    last_train = time.time()
    while True:
        if time.time() - last_train > 12 * 3600:
            await d.train()
            last_train = time.time()
        result = await d.score_current()
        d.publish(result)
        await asyncio.sleep(30)


if __name__ == "__main__":
    asyncio.run(loop())
Retrain every 12 hours. Don’t retrain on every iteration — you’ll learn the anomaly into the baseline and never alert again. contamination=0.01 says “expect about 1% of training data to be anomalous”, which is realistic for production.
The score is exported to Prometheus via Pushgateway. Now you can alert on it.
      - alert: MultivariateAnomaly
        expr: anomaly_iforest_score < -0.15
        for: 10m
        labels:
          severity: warning
          detector: iforest
6. Grafana Panels
Make the detector outputs visible. Three panels per service: raw signal, baseline, and z-score (or anomaly score). A panel showing only “anomaly: yes/no” is useless. Show the math.
{
  "title": "P99 anomaly score",
  "targets": [
    {"expr": "job:http_p99:modified_zscore", "legendFormat": "{{job}}"}
  ],
  "thresholds": [
    {"value": 3.5, "color": "red"},
    {"value": -3.5, "color": "red"}
  ]
}
Set the threshold lines in Grafana 11.4 to match your alert rules. The on-call should see the same threshold on the dashboard that fired the page.
7. Common Pitfalls
Four mistakes I’ve watched teams make.
- Alerting on raw z-score without for:. You’ll page on every blip. Always require sustained anomaly. 5 minutes is a sane default.
- Training the isolation forest on incident-tainted data. If you train during an outage, the model learns that the outage is normal. Either exclude known incident windows or hold off training until you have a clean 24 hours.
- Using z-score on latency. Latency distributions are heavy-tailed. The mean and stdev are pulled around by tail events and you’ll either miss real anomalies or fire constantly. Use MAD for latency.
- Detecting on too many metrics at once. A multivariate detector with 50 features will find anomalies in noise. Pick 4 to 6 features per service, all of which are causally related.
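For the incident-tainted training pitfall, one option is to drop samples that fall inside known incident windows before fitting. The sketch below shows the filtering step on its own. The clean_indices helper and the (start, end) window format are hypothetical, and where the incident list comes from is left open.

```python
from typing import Sequence

Window = tuple[int, int]  # (start, end) as Unix timestamps, inclusive


def clean_indices(timestamps: Sequence[int], incidents: Sequence[Window]) -> list[int]:
    """Indices of samples that fall outside every known incident window."""
    return [
        i for i, ts in enumerate(timestamps)
        if not any(start <= ts <= end for start, end in incidents)
    ]


# Hypothetical 30-second samples and one incident from t=60 to t=120.
timestamps = list(range(0, 181, 30))  # 0, 30, ..., 180
incidents = [(60, 120)]
keep = clean_indices(timestamps, incidents)
print(keep)
```

With a NumPy feature matrix X whose rows line up with timestamps, X[keep] would select the clean rows before calling fit. The sidecar as written keeps only the sample values from query_range, so it would also need to retain the timestamps for this to work.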
8. Troubleshooting
Three failure modes you’ll hit.
8.1 Alert fires every Monday at 9 AM
Your 1-day offset baseline is comparing Monday morning to Sunday morning. For traffic with weekly seasonality, use a 1-week offset instead, that is, avg_over_time(...[1h] offset 7d) in place of offset 1d.
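A quick way to sanity-check which day a given offset lands on is to do the date arithmetic directly. This is a small illustrative sketch; the baseline_day helper and the chosen date are arbitrary.

```python
from datetime import datetime, timedelta

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def baseline_day(now: datetime, offset_days: int) -> str:
    """Weekday that a baseline with the given day offset is drawn from."""
    return DAYS[(now - timedelta(days=offset_days)).weekday()]


monday_9am = datetime(2025, 1, 6, 9, 0)  # an arbitrary Monday
print(baseline_day(monday_9am, 1), baseline_day(monday_9am, 7))
```

The 1-day offset pulls its baseline from a weekend morning, while the 7-day offset compares Monday with Monday.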
8.2 Isolation forest scores are all near zero
You haven’t trained on enough data, or the data is too uniform. Bump train_hours to 72 and n_estimators to 400. If the scores are still flat, your features are too similar; drop one and replace it with something orthogonal.
8.3 Recording rules eating Prometheus CPU
Each recording rule is a query. If you have 50 services and 4 rules each, that’s 200 queries every 30 seconds. Split rules across multiple Prometheus instances or stagger their interval. Don’t fight Prometheus, just give it less work.
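The load arithmetic is worth writing down before you decide what to stagger. This sketch just restates the numbers above as evaluations per second; the function name and the 60-second alternative are assumptions for illustration.

```python
def evals_per_second(services: int, rules_per_service: int, interval_s: float) -> float:
    """Average rule evaluations per second for a given evaluation interval."""
    return services * rules_per_service / interval_s


base = evals_per_second(50, 4, 30)     # 200 rules every 30 seconds
relaxed = evals_per_second(50, 4, 60)  # same rules on a hypothetical 60-second interval
print(round(base, 2), round(relaxed, 2))
```

Doubling the interval halves the steady-state query rate. Slow-drift rules are the natural candidates, since their alerts already wait 30 minutes before firing.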
9. Wrapping Up
Anomaly detection is a tiered problem, not a single-model problem. Z-score handles 70% of cases for free. MAD adds robustness for another 20%. EWMA catches drift. Isolation forest catches the multivariate weird stuff. Every step costs more, so don’t skip ahead.
The detectors are only as useful as the response. If you’re new to the response side, the companion piece on auto remediation pipelines with LLM agents and Argo Events shows how to turn these alerts into actions. For the foundational PromQL, the Prometheus query documentation is still the best reference.