
May 14, 2025 · 8 min read · by Muhammad Amal programming

TL;DR — Start with z-score recording rules in Prometheus 3.0, add MAD when your data has outliers, layer EWMA for slow drift, and only reach for isolation forest when you have multivariate problems and a real budget.

Most teams overcomplicate anomaly detection. They reach for Prophet or an LSTM before they’ve tried subtracting yesterday’s average. The boring statistical methods catch the vast majority of real anomalies and cost almost nothing to operate. The fancy methods catch the long tail, but only after you’ve tuned the boring ones for your data.

This guide walks through four methods in increasing order of complexity: z-score, MAD (median absolute deviation), EWMA (exponentially weighted moving average), and isolation forest. The first three live as Prometheus 3.0 recording and alerting rules. The fourth runs as a Python sidecar that queries Prometheus, scores the result, and publishes the score through Pushgateway for Prometheus to scrape.

By the end, you’ll have a tiered detection stack: cheap rules that fire fast, medium-cost rules for noisier signals, and an ML sidecar for the handful of metrics where you can’t afford to miss a subtle anomaly. The trick is knowing which tier each metric belongs in, and the answer is almost always “the cheapest one that works”.

1. The Tiered Strategy

Tier 1 (z-score):       golden signals, request rate, error rate
Tier 2 (MAD):           latency percentiles, queue depths
Tier 3 (EWMA):          slow-moving SLI burn, capacity metrics
Tier 4 (isolation):     multivariate (CPU+latency+saturation)

Every metric starts in Tier 1. Move it down a tier only when Tier 1 misses something real or produces noise you can’t tune away.

2. Tier 1, Z-Score Rules in Prometheus

The simplest workable detector: how many standard deviations is the current value from the recent past? Use a 1-day offset to compare to the same time of day yesterday.

# rules/zscore.yaml, Prometheus 3.0
groups:
- name: zscore-detectors
  interval: 30s
  rules:
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))

  - record: job:http_requests:zscore
    expr: |
      (
        job:http_requests:rate5m
        - avg_over_time(job:http_requests:rate5m[1h] offset 1d)
      )
      /
      stddev_over_time(job:http_requests:rate5m[1h] offset 1d)

  - alert: RequestRateAnomaly
    expr: abs(job:http_requests:zscore) > 3
    for: 5m
    labels:
      severity: warning
      detector: zscore
    annotations:
      summary: "Request rate anomaly on {{ $labels.job }} (z={{ $value }})"

The for: 5m is critical. Without it you’ll fire on every transient spike. The 1-hour window with a 1-day offset gives you “this hour yesterday” as a baseline, which handles daily seasonality for free.
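To make the arithmetic behind the recording rule concrete, here is a minimal Python sketch of the same calculation. The sample values and the function name are hypothetical; the only assumption carried over from the rule is that the baseline is "this hour yesterday" and that the standard deviation is the population form, which is what stddev_over_time computes.

```python
import statistics

def zscore(current: float, baseline: list[float]) -> float:
    """Standard deviations between `current` and the mean of the baseline window."""
    mean = statistics.fmean(baseline)
    # Population standard deviation, matching PromQL's stddev_over_time.
    std = statistics.pstdev(baseline)
    if std == 0:
        # A perfectly flat baseline: any deviation at all is infinitely unusual.
        return 0.0 if current == mean else float("inf")
    return (current - mean) / std

# Hypothetical "this hour yesterday" request-rate samples (req/s).
yesterday = [100.0, 102.0, 98.0, 101.0, 99.0]

print(zscore(100.5, yesterday))  # close to the baseline
print(zscore(130.0, yesterday))  # far above the baseline
```

The flat-baseline branch is worth noticing: the PromQL rule has no equivalent guard, so a metric with zero variance yesterday divides by zero.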

Z-score’s weakness is that it assumes the underlying distribution is normal-ish. For request counts at high volume that’s fine. For p99 latency, it’s terrible. That’s where MAD comes in.

3. Tier 2, MAD for Robust Detection

MAD is the median of absolute deviations from the median. It’s resistant to outliers in a way that mean and standard deviation aren’t.

Prometheus doesn’t have a native median, so we approximate with quantile_over_time.

- name: mad-detectors
  interval: 30s
  rules:
  - record: job:http_p99:5m
    expr: |
      histogram_quantile(0.99,
        sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

  - record: job:http_p99:median_1h_offset_1d
    expr: quantile_over_time(0.5, job:http_p99:5m[1h] offset 1d)

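  # Only the raw series is offset here. The median series already describes
  # yesterday's window, so offsetting it again would compare against two days ago.
  - record: job:http_p99:mad_1h_offset_1d
    expr: |
      quantile_over_time(0.5,
        abs(job:http_p99:5m offset 1d - job:http_p99:median_1h_offset_1d)[1h:30s])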

  - record: job:http_p99:modified_zscore
    expr: |
      0.6745 *
      (job:http_p99:5m - job:http_p99:median_1h_offset_1d)
      / clamp_min(job:http_p99:mad_1h_offset_1d, 0.001)

  - alert: P99LatencyAnomalyRobust
    expr: job:http_p99:modified_zscore > 3.5
    for: 5m
    labels:
      severity: warning
      detector: mad

The constant 0.6745 is the 75th percentile of the standard normal distribution. For normally distributed data the MAD is about 0.6745 standard deviations, so multiplying by this constant puts the modified z-score on the same scale as a regular z-score. The 3.5 threshold is the Iglewicz and Hoaglin recommendation.

clamp_min protects against division by zero when the metric is rock-stable.
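The difference between the two detectors is easiest to see on a baseline that already contains an outlier. The following is a small sketch with made-up p99 latency samples, not the author's implementation; the function names and the data are illustrative assumptions, while the 0.6745 constant and the 0.001 floor mirror the rule above.

```python
import statistics

def plain_zscore(current: float, baseline: list[float]) -> float:
    """Mean and standard deviation based score, sensitive to outliers."""
    return (current - statistics.fmean(baseline)) / statistics.pstdev(baseline)

def modified_zscore(current: float, baseline: list[float], floor: float = 0.001) -> float:
    """Median and MAD based score; `floor` plays the role of clamp_min."""
    med = statistics.median(baseline)
    mad = statistics.median([abs(x - med) for x in baseline])
    return 0.6745 * (current - med) / max(mad, floor)

# Hypothetical p99 latencies (seconds) from yesterday, including one 2.5 s spike.
baseline = [0.20, 0.21, 0.19, 0.22, 0.20, 2.50, 0.21]
current = 0.30  # roughly 50% slower than the typical value

print(plain_zscore(current, baseline))     # the spike inflates mean and stddev
print(modified_zscore(current, baseline))  # the median and MAD ignore the spike
```

With this data the single spike drags the mean above the current value, so the plain z-score stays well inside the 3-sigma band, while the modified z-score clears the 3.5 threshold.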

4. Tier 3, EWMA for Slow Drift

EWMA shines when you care about gradual changes that z-score misses. Memory leaks, slow capacity drift, error rates creeping up over hours.

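Prometheus used to offer holt_winters for this. In 3.0 it was renamed to double_exponential_smoothing and moved behind the promql-experimental-functions feature flag. I find a manual fast-versus-slow comparison easier to reason about. Note that the slow baseline below uses avg_over_time, a simple moving average standing in for a true EWMA.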

- name: ewma-detectors
  interval: 30s
  rules:
  - record: job:error_rate:fast
    expr: |
      sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum by (job) (rate(http_requests_total[5m]))

  - record: job:error_rate:slow
    expr: |
      avg_over_time(job:error_rate:fast[1h])

  - record: job:error_rate:drift
    expr: |
      (job:error_rate:fast - job:error_rate:slow)
      /
      clamp_min(job:error_rate:slow, 0.0001)

  - alert: ErrorRateSlowDrift
    expr: job:error_rate:drift > 0.5
    for: 30m
    labels:
      severity: warning
      detector: ewma_drift

The for: 30m is on purpose. EWMA detects slow drift, so a slow-firing alert is fine. The 0.5 threshold means “fast rate is 50% above the 1-hour average”, which is a meaningful drift you don’t want to miss.
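For comparison with the avg_over_time baseline used in the rules, here is a sketch of an actual exponentially weighted moving average and the same relative-drift ratio. The smoothing factor of 0.1, the function names, and the sample data are assumptions chosen for illustration; the 0.0001 floor mirrors the clamp_min in the rule.

```python
def ewma(samples: list[float], alpha: float) -> list[float]:
    """Exponentially weighted moving average; a higher alpha reacts faster."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    level = None
    for x in samples:
        # The first sample seeds the average; later samples are blended in.
        level = x if level is None else alpha * x + (1.0 - alpha) * level
        smoothed.append(level)
    return smoothed

def relative_drift(fast: float, slow: float, floor: float = 0.0001) -> float:
    """How far the fast signal sits above the slow baseline, as a fraction."""
    return (fast - slow) / max(slow, floor)

# Hypothetical error-rate samples: steady at 1%, then stepping up to 2%.
error_rate = [0.01] * 50 + [0.02] * 10
slow = ewma(error_rate, alpha=0.1)

drift_at_step = relative_drift(error_rate[50], slow[50])
drift_later = relative_drift(error_rate[-1], slow[-1])
print(drift_at_step, drift_later)
```

The drift ratio is largest right after the change and shrinks as the baseline catches up, so the smoothing factor (or the averaging window) and the for: duration need to be chosen together.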

5. Tier 4, Isolation Forest Sidecar

Some anomalies are only visible when you combine signals. Latency up + CPU up + GC up + RPS down is a different story than any one of those alone. That’s where an isolation forest earns its keep.

The architecture is a Python sidecar that queries Prometheus, scores the result, and pushes the anomaly score to Pushgateway, where Prometheus scrapes it.

# sidecar/iforest.py
import asyncio
import time
import numpy as np
import httpx
from sklearn.ensemble import IsolationForest
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PROM = "http://prometheus:9090"
PUSHGW = "http://pushgateway:9091"

METRICS = {
    "p99": 'histogram_quantile(0.99, sum by (le)(rate(http_request_duration_seconds_bucket{{job="{svc}"}}[5m])))',
    "cpu": 'sum(rate(container_cpu_usage_seconds_total{{pod=~"{svc}-.*"}}[5m]))',
    "rps": 'sum(rate(http_requests_total{{job="{svc}"}}[5m]))',
    "err": 'sum(rate(http_requests_total{{job="{svc}",status=~"5.."}}[5m]))',
    "gc":  'sum(rate(go_gc_duration_seconds_sum{{job="{svc}"}}[5m]))',
}

class Detector:
    def __init__(self, service: str, train_hours: int = 24):
        self.service = service
        self.model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
        self.train_hours = train_hours
        self.feature_names = list(METRICS.keys())

    async def fetch(self, query: str, start: int, end: int) -> np.ndarray:
        async with httpx.AsyncClient(timeout=30) as c:
            r = await c.get(f"{PROM}/api/v1/query_range",
                params={"query": query, "start": start, "end": end, "step": 30})
            r.raise_for_status()
            data = r.json()["data"]["result"]
        if not data:
            return np.zeros(((end - start) // 30) + 1)
        return np.array([float(v[1]) for v in data[0]["values"]])

    async def train(self):
        end = int(time.time())
        start = end - self.train_hours * 3600
        cols = []
        for k, q in METRICS.items():
            arr = await self.fetch(q.format(svc=self.service), start, end)
            cols.append(arr)
        L = min(len(c) for c in cols)
        X = np.column_stack([c[:L] for c in cols])
        X = np.nan_to_num(X)
        self.model.fit(X)

    async def score_current(self) -> dict:
        end = int(time.time())
        start = end - 600
        cols = []
        for k, q in METRICS.items():
            arr = await self.fetch(q.format(svc=self.service), start, end)
            cols.append(arr[-1] if len(arr) else 0.0)
        x = np.nan_to_num(np.array([cols]))
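        # decision_function is centered on zero (negative = outlier), which is
        # what the -0.1 and -0.15 thresholds assume. score_samples is always
        # negative, so comparing it to -0.1 would flag nearly every sample.
        score = float(self.model.decision_function(x)[0])
        is_anom = score < -0.1
        return {"score": score, "anomaly": is_anom,
                "features": dict(zip(self.feature_names, cols))}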

    def publish(self, result: dict):
        reg = CollectorRegistry()
        g = Gauge("anomaly_iforest_score", "Isolation forest anomaly score",
                  ["service"], registry=reg)
        g.labels(service=self.service).set(result["score"])
        push_to_gateway(PUSHGW, job="iforest", registry=reg)

async def loop():
    d = Detector("checkout-api")
    await d.train()
    last_train = time.time()
    while True:
        if time.time() - last_train > 12 * 3600:
            await d.train()
            last_train = time.time()
        result = await d.score_current()
        d.publish(result)
        await asyncio.sleep(30)

if __name__ == "__main__":
    asyncio.run(loop())

Retrain every 12 hours. Don’t retrain on every iteration — you’ll learn the anomaly into the baseline and never alert again. contamination=0.01 says “expect about 1% of training data to be anomalous”, which is realistic for production.

The score is exported to Prometheus via Pushgateway. Now you can alert on it.

- alert: MultivariateAnomaly
  expr: anomaly_iforest_score < -0.15
  for: 10m
  labels:
    severity: warning
    detector: iforest

6. Grafana Panels

Make the detector outputs visible. Three panels per service: raw signal, baseline, and z-score (or anomaly score). A panel showing only “anomaly: yes/no” is useless. Show the math.

{
  "title": "P99 anomaly score",
  "targets": [
    {"expr": "job:http_p99:modified_zscore", "legendFormat": "{{job}}"}
  ],
  "thresholds": [
    {"value": 3.5, "color": "red"},
    {"value": -3.5, "color": "red"}
  ]
}

Set the threshold lines in Grafana 11.4 to match your alert rules. The on-call should see the same threshold on the dashboard that fired the page.

7. Common Pitfalls

Four mistakes I’ve watched teams make.

Alerting on raw z-score without for:. You’ll page on every blip. Always require sustained anomaly. 5 minutes is a sane default.
Training the isolation forest on incident-tainted data. If you train during an outage, the model learns that the outage is normal. Either exclude known incident windows or hold off training until you have a clean 24 hours.
Using z-score on latency. Latency distributions are heavy-tailed. The mean and stdev are pulled around by tail events and you’ll either miss real anomalies or fire constantly. Use MAD for latency.
Detecting on too many metrics at once. A multivariate detector with 50 features will find anomalies in noise. Pick 4 to 6 features per service, all of which are causally related.
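The second pitfall, incident-tainted training data, can be handled with a small filter before fitting. This is a sketch under assumptions: the sample layout (timestamp, feature row) and the list of incident windows are hypothetical, and where the incident windows come from (a paging tool, a spreadsheet) is left open.

```python
def drop_incident_samples(samples, incidents):
    """Keep only samples whose timestamp falls outside every incident window.

    samples:   list of (unix_timestamp, feature_row) tuples
    incidents: list of (start, end) unix timestamps, inclusive on both ends
    """
    return [
        (ts, row)
        for ts, row in samples
        if not any(start <= ts <= end for start, end in incidents)
    ]

# Hypothetical training samples at 60-second spacing and one known incident.
samples = [(0, [0.2, 1.1]), (60, [0.9, 3.0]), (120, [1.4, 3.5]), (180, [0.2, 1.0])]
incidents = [(50, 130)]

clean = drop_incident_samples(samples, incidents)
print([ts for ts, _ in clean])
```

The filtered rows can then be stacked into the training matrix in place of the raw query result.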

8. Troubleshooting

Three failure modes you’ll hit.

8.1 Alert fires every Monday at 9 AM

Your 1-day offset baseline is comparing Monday morning to Sunday morning. For traffic with weekly seasonality, switch to a 1-week offset so Monday is compared to the previous Monday: avg_over_time(...[1h] offset 7d), with the same change in stddev_over_time.
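If you maintain both daily and weekly variants, it can help to generate the expression rather than hand-edit it. The helper below is a hypothetical convenience, not part of Prometheus; it only renders the same z-score expression used in Tier 1 with a configurable window and offset.

```python
def zscore_expr(series: str, window: str = "1h", offset: str = "1d") -> str:
    """Render the Tier 1 z-score PromQL expression for a given baseline offset."""
    baseline = f"{series}[{window}] offset {offset}"
    return (
        f"({series} - avg_over_time({baseline}))"
        f" / stddev_over_time({baseline})"
    )

# Weekly baseline for a series with weekly seasonality.
weekly = zscore_expr("job:http_requests:rate5m", offset="7d")
print(weekly)
```

Both the mean and the standard deviation use the same offset, so the baseline stays internally consistent.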

8.2 Isolation forest scores are all near zero

You haven’t trained on enough data, or the data is too uniform. Bump train_hours to 72 and n_estimators to 400. If the scores are still flat, your features are too similar. Drop one and replace it with something orthogonal.

8.3 Recording rules eating Prometheus CPU

Each recording rule is a query. If you have 50 services and 4 rules each, that’s 200 queries every 30 seconds. Split rules across multiple Prometheus instances or stagger their interval. Don’t fight Prometheus, just give it less work.
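The load estimate above is simple arithmetic, sketched here so you can plug in your own numbers. It assumes one rule evaluation per service per rule per interval, as in the example, and ignores how expensive each individual query is.

```python
def rule_evals_per_second(services: int, rules_per_service: int, interval_seconds: float) -> float:
    """Average recording-rule evaluations per second for one rule group interval."""
    return services * rules_per_service / interval_seconds

# The figures from the text: 50 services, 4 rules each, evaluated every 30 s.
print(rule_evals_per_second(50, 4, 30))
# Doubling the interval halves the steady-state evaluation rate.
print(rule_evals_per_second(50, 4, 60))
```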

9. Wrapping Up

Anomaly detection is a tiered problem, not a single-model problem. Z-score handles 70% of cases for free. MAD adds robustness for another 20%. EWMA catches drift. Isolation forest catches the multivariate weird stuff. Every step costs more, so don’t skip ahead.

The detectors are only as useful as the response. If you’re new to the response side, the companion piece on auto remediation pipelines with LLM agents and Argo Events shows how to turn these alerts into actions. For the foundational PromQL, the Prometheus query documentation is still the best reference.
