AIOps in May 2025: What Actually Works in Production
TL;DR: Classical statistical anomaly detection still wins for metrics, LLMs earn their keep on log triage and incident narration, and the hardest part isn’t the model; it’s the event bus and the runbook contract.
The AIOps pitch hasn’t really changed in five years. Vendors still promise self-healing systems, root cause analysis at the push of a button, and an oracle that knows why latency spiked. What’s changed in 2025 is that some of this finally works, but only when you draw the lines carefully. The 2024 wave of “let an LLM read your logs” demos collapsed under cost and hallucination pressure. The 2025 wave is more humble and more useful.
I’ve spent the last year shipping AIOps glue across two platform teams. One runs a high-traffic e-commerce stack on Kubernetes 1.32, the other a multi-region fintech with strict change controls. The patterns that worked are remarkably similar, and so are the failures. This piece is a field report. No vendor pitches, no benchmarks, just the architectures and trade-offs I’d defend in a design review.
The short version is that AIOps in 2025 is three loosely coupled layers: detection (mostly statistical), triage (mostly LLM-assisted), and remediation (mostly deterministic with LLM-supervised escalation). Treat them as one big agent and you’ll be debugging prompts at 3 AM. Treat them as a pipeline with explicit contracts and the whole thing becomes maintainable.
1. The Three Layers That Actually Ship
Every working AIOps stack I’ve seen in 2025 separates these concerns. Skip this and you get a monolithic agent that’s impossible to test.
               +-------------------+
metrics -----> |  Detection layer  | --> events (typed)
logs ------->  |   (stats + ML)    |
traces ----->  +-------------------+
                         |
                         v
               +-------------------+
               |   Triage layer    | --> hypotheses + context
               |    (LLM + RAG)    |
               +-------------------+
                         |
                         v
               +-------------------+
               |    Remediation    | --> actions (deterministic)
               |  (Argo + scripts) |
               +-------------------+
1.1 Detection stays statistical
The biggest mistake teams make is asking an LLM to look at Prometheus metrics. It’s slow, expensive, and worse than a 30-line median absolute deviation (MAD) detector. For metrics in 2025, the boring stack still wins.
# prometheus rule, Prometheus 3.0
groups:
  - name: latency-anomalies
    interval: 30s
    rules:
      - record: job:http_p99:5m
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      - alert: P99LatencyAnomaly
        expr: |
          (job:http_p99:5m - avg_over_time(job:http_p99:5m[1h] offset 1d))
            / stddev_over_time(job:http_p99:5m[1h] offset 1d) > 3
        for: 5m
        labels:
          severity: warning
          detector: zscore
        annotations:
          summary: "P99 anomaly on {{ $labels.job }}"
A z-score with a 1-day offset gets you 80% of the value. For the remaining 20% (seasonal traffic, deploy-correlated drift), add Prophet or a small isolation forest, but only after you’ve run the cheap detector for a month and seen what it misses.
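For reference, the MAD detector mentioned above fits in far fewer than 30 lines. This is a minimal sketch, not the detector either team runs: the function names are hypothetical, and the 3.5 cutoff is a commonly used default that I am assuming here rather than something tuned on real traffic.

```python
from statistics import median


def mad_score(history: list[float], current: float) -> float:
    """Robust z-score of `current` against `history` using median and MAD."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # Flat history: no deviation scores 0, any deviation scores infinity.
        return 0.0 if current == med else float("inf")
    # 0.6745 rescales MAD so the score is comparable to a standard z-score
    # when the data is roughly normal.
    return 0.6745 * (current - med) / mad


def is_anomaly(history: list[float], current: float, threshold: float = 3.5) -> bool:
    # threshold=3.5 is an assumed default, not a value from this article.
    return abs(mad_score(history, current)) > threshold
```

The appeal over a plain z-score is that one earlier outlier in the history barely moves the median or the MAD, so yesterday’s incident doesn’t mask today’s.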
1.2 Triage is where LLMs earn money
This is where claude-3.7-sonnet and gpt-4o actually pull weight. Given a detection event plus structured context (recent deploys, related alerts, top error logs), they write the kind of summary an on-call engineer would write at minute three. Not minute one, minute three. The model isn’t faster than a human at the first minute, it’s faster at compressing context that the human would otherwise scroll through.
1.3 Remediation stays deterministic
Let the LLM propose. Let a state machine execute. We’ll come back to this in the auto-remediation section, but the rule is simple: every remediation action is a named function with a known blast radius, and the LLM picks from a menu. It does not write shell.
2. Building the Event Bus
Everything hangs off a typed event bus. Use NATS, Kafka, or Redis Streams: pick whichever you already run. The shape matters more than the transport.
# events/schema.py
from datetime import datetime
from typing import Literal

from pydantic import BaseModel, Field


class DetectionEvent(BaseModel):
    event_id: str
    kind: Literal["metric_anomaly", "log_burst", "trace_error_spike"]
    service: str
    severity: Literal["info", "warning", "critical"]
    detected_at: datetime
    source: str  # which detector
    signal: dict  # detector-specific payload
    labels: dict[str, str] = Field(default_factory=dict)
    fingerprint: str  # for dedup
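The fingerprint is critical. Without it you’ll page on the same anomaly every evaluation cycle. I use a SHA1 of (kind, service, labels.alertname, time_bucket_5min), truncated to 16 hex characters. Because the time bucket is part of the hash, an anomaly that persists gets a fresh fingerprint every five minutes, so dedup caps it at one event per bucket rather than silencing it for good.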
# events/dedup.py
import hashlib
import time


def fingerprint(kind: str, service: str, alertname: str) -> str:
    bucket = int(time.time() // 300)  # 5 min buckets
    raw = f"{kind}|{service}|{alertname}|{bucket}"
    return hashlib.sha1(raw.encode()).hexdigest()[:16]
The triage layer consumes these events, hydrates them with context, and emits a TriageReport. The remediation layer consumes reports and either acts or escalates. Each hop is a separate service. Each hop is testable.
3. Standing Up the Observability Substrate
You can’t do AIOps without three signals: metrics, logs, and traces. In May 2025 the canonical open-source stack is Prometheus 3.0 for metrics, Loki 3.3 for logs, and Tempo 2.6 for traces, all glued together by the OpenTelemetry Collector 0.120 and visualized in Grafana 11.4. There’s no good reason to deviate unless you’re already committed to a vendor platform.
Here’s a minimal collector config that fans out to all three.
# otelcol-config.yaml, OpenTelemetry collector 0.120
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
  resource:
    attributes:
      - key: cluster
        value: prod-eu-1
        action: upsert
exporters:
  # Prometheus must run with --web.enable-remote-write-receiver to accept this.
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  # The contrib loki exporter is deprecated; Loki 3.x also accepts OTLP natively,
  # so an otlphttp exporter is the forward-looking choice.
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo]
If you only do one thing after reading this, make sure your service name and version are stamped on every signal. The triage LLM is useless without consistent service.name and service.version labels.
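One cheap way to enforce this is to check resource attributes before an event is handed to triage. The helper below is a sketch under assumptions: the function name is hypothetical, and it expects attributes as a flat string dictionary, which may not match how your pipeline carries them.

```python
REQUIRED_ATTRS = ("service.name", "service.version")


def missing_attrs(resource_attrs: dict[str, str]) -> list[str]:
    """Return the required attributes that are absent or blank."""
    return [k for k in REQUIRED_ATTRS if not resource_attrs.get(k, "").strip()]


# A signal without a version gets flagged before it reaches the model.
print(missing_attrs({"service.name": "checkout"}))  # ['service.version']
```

Counting these misses per service gives you a to-do list of teams whose instrumentation needs fixing.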
4. Adding the Detection Layer
Three detectors cover most of what matters: metric anomaly, log burst, and trace error spike. I’ll show the log burst detector because it’s the one teams skip and then regret.
# detectors/log_burst.py
import asyncio
from collections import defaultdict, deque
from statistics import mean, stdev

import httpx


class LogBurstDetector:
    def __init__(self, loki_url: str, window: int = 300):
        self.loki = loki_url
        self.window = window
        self.history: dict[str, deque] = defaultdict(lambda: deque(maxlen=288))  # 24h of 5min

    async def evaluate(self, service: str) -> dict | None:
        query = f'sum(count_over_time({{service="{service}",level="error"}}[5m]))'
        async with httpx.AsyncClient(timeout=10) as c:
            r = await c.get(f"{self.loki}/loki/api/v1/query", params={"query": query})
            r.raise_for_status()
            data = r.json()["data"]["result"]
        # An empty result means zero matching lines; record it so quiet
        # windows count toward the baseline instead of being skipped.
        current = float(data[0]["value"][1]) if data else 0.0
        hist = self.history[service]
        hist.append(current)
        if len(hist) < 30:
            return None
        baseline = mean(list(hist)[:-1])
        sd = stdev(list(hist)[:-1]) or 1.0
        z = (current - baseline) / sd
        if z > 4 and current > 10:
            return {"service": service, "z": z, "rate": current, "baseline": baseline}
        return None
Z-score of 4 plus an absolute floor of 10 errors per 5 minutes catches the real bursts and skips the ones that are just sparse logs going from 0 to 2. The absolute floor matters more than the statistical threshold.
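To see why the floor carries so much weight, compare a quiet service with a busy one. The numbers below are made up for illustration, and `burst_z` and `fires` are hypothetical helpers that mirror the detector’s arithmetic rather than call it.

```python
from statistics import mean, stdev


def burst_z(history: list[float], current: float) -> float:
    """Z-score of the current window against earlier windows."""
    sd = stdev(history) or 1.0  # avoid dividing by zero on a flat history
    return (current - mean(history)) / sd


def fires(history: list[float], current: float,
          z_min: float = 4.0, floor: float = 10.0) -> bool:
    return burst_z(history, current) > z_min and current > floor


quiet = [0.0] * 28 + [1.0]                  # almost no errors, ever
busy = [20.0, 22.0, 18.0, 21.0, 19.0] * 6   # steady background error rate

# Two errors on the quiet service clear the z threshold, but the floor blocks the page.
print(burst_z(quiet, 2.0) > 4.0, fires(quiet, 2.0))  # True False
print(fires(busy, 60.0))                             # True
```

Without the floor, every near-silent service would page the first time it logged a couple of errors.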
5. Wiring the LLM Triage Service
The triage service is a thin wrapper. It receives a detection event, gathers context, and asks the model for a structured response. The structured response is non-negotiable.
# triage/service.py
import json

from anthropic import AsyncAnthropic
from pydantic import BaseModel


class TriageReport(BaseModel):
    summary: str
    likely_causes: list[str]
    recommended_actions: list[str]
    confidence: float
    needs_human: bool


# The schema is appended so the model knows the exact shape to return.
SYSTEM = """You are an SRE assistant. Given a detection event and context, produce a
structured triage report. Be specific. If you don't have enough context, say so and
set needs_human=true. Never invent log lines or metrics.
Respond with a single JSON object and nothing else, matching this JSON schema:
""" + json.dumps(TriageReport.model_json_schema())


class TriageService:
    def __init__(self, api_key: str):
        self.client = AsyncAnthropic(api_key=api_key)

    async def triage(self, event: dict, context: dict) -> TriageReport:
        msg = await self.client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=1024,
            system=SYSTEM,
            messages=[{
                "role": "user",
                "content": json.dumps({"event": event, "context": context}, default=str),
            }],
        )
        raw = msg.content[0].text
        # Raises pydantic.ValidationError if the model strays from the schema.
        return TriageReport.model_validate_json(raw)
The context blob includes: last three deploys to this service, top 10 error log fingerprints in the last 15 minutes, related alerts firing on dependencies, and the service’s SLO burn rate. That’s it. More context degrades the report because the model starts pattern-matching on irrelevant noise.
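The caps are easiest to keep when they live in code rather than in a prompt. This sketch trims raw inputs down to the blob described above. The input shapes (a `deployed_at` key on deploys, a `count` key on fingerprints) are assumptions, and the 15-minute window is assumed to be applied by whatever query produces the fingerprints.

```python
from dataclasses import asdict, dataclass


@dataclass
class TriageContext:
    recent_deploys: list[dict]
    top_error_fingerprints: list[dict]
    related_alerts: list[str]
    slo_burn_rate: float


def build_context(deploys: list[dict], error_fingerprints: list[dict],
                  related_alerts: list[str], slo_burn_rate: float) -> dict:
    """Trim inputs to the caps from the text: 3 deploys, 10 error fingerprints."""
    # Hypothetical keys: "deployed_at" on deploys, "count" on fingerprints.
    newest = sorted(deploys, key=lambda d: d["deployed_at"], reverse=True)[:3]
    top = sorted(error_fingerprints, key=lambda f: f["count"], reverse=True)[:10]
    return asdict(TriageContext(newest, top, list(related_alerts), float(slo_burn_rate)))
```

The returned dictionary can be passed straight to the triage call as its context argument.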
If you want the deeper version of this pattern with state machines and retries, see my piece on building an SRE copilot for on-call engineers.
6. Closing the Loop with Remediation
The remediation layer is where teams overreach. The rule I live by: every action has a name, an owner, and a blast radius. The LLM picks a name. A human-approved policy decides if it executes.
# remediation/actions.yaml
actions:
  - name: restart_pod
    blast_radius: pod
    requires_approval: false
    runbook: https://runbooks.internal/restart-pod
  - name: scale_deployment
    blast_radius: deployment
    requires_approval: false
    max_replicas: 20
  - name: rollback_deployment
    blast_radius: deployment
    requires_approval: true
  - name: drain_node
    blast_radius: node
    requires_approval: true
  - name: failover_region
    blast_radius: region
    requires_approval: true
The triage report’s recommended_actions is just a list of these names. If the model invents a new action, the dispatcher rejects the report. This is the single most important guardrail.
# remediation/dispatcher.py
# load_actions(), escalate() and execute() are not shown here.
from triage.service import TriageReport

ALLOWED = {a["name"] for a in load_actions()}


async def dispatch(report: TriageReport):
    # Reject the whole report if the model named anything outside the catalog.
    for action in report.recommended_actions:
        if action not in ALLOWED:
            await escalate(report, reason=f"unknown action: {action}")
            return
    if report.needs_human or report.confidence < 0.7:
        await escalate(report, reason="low confidence or human-required")
        return
    for action in report.recommended_actions:
        await execute(action, report)
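The catalog also carries a requires_approval flag, and the decision logic is easier to test when it is a pure function with no I/O. The following is one possible shape for it, not the dispatcher from either team. `plan` and its decision strings are hypothetical, and the catalog is inlined here instead of loaded from YAML so the sketch runs on its own.

```python
CATALOG = {
    "restart_pod": {"blast_radius": "pod", "requires_approval": False},
    "scale_deployment": {"blast_radius": "deployment", "requires_approval": False},
    "rollback_deployment": {"blast_radius": "deployment", "requires_approval": True},
    "drain_node": {"blast_radius": "node", "requires_approval": True},
    "failover_region": {"blast_radius": "region", "requires_approval": True},
}


def plan(actions: list[str], confidence: float, needs_human: bool,
         min_confidence: float = 0.7) -> tuple[str, list[str]]:
    """Return (decision, actions); decision is 'reject', 'escalate',
    'await_approval', or 'execute'."""
    unknown = [a for a in actions if a not in CATALOG]
    if unknown:
        return "reject", unknown          # the model invented an action
    if needs_human or confidence < min_confidence:
        return "escalate", actions
    gated = [a for a in actions if CATALOG[a]["requires_approval"]]
    if gated:
        return "await_approval", gated    # hold the report until a human signs off
    return "execute", actions
```

Holding the entire report when any single action needs approval is a deliberately conservative choice; running the ungated actions first is also defensible, but it makes the audit trail harder to read.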
7. Common Pitfalls
Four mistakes I’ve seen kill AIOps projects in 2025.
- Letting the LLM see raw Prometheus. Don’t. It can’t reason about scrape intervals, recording rules, or PromQL semantics. Feed it pre-computed summaries.
- Skipping fingerprinting. Without dedup you’ll page on the same anomaly every 30 seconds. The triage cost alone will eat your model budget by Tuesday.
- Building one big agent. A single LangGraph that does detection, triage, and remediation is fun to demo and impossible to operate. Three services, three deploys, three SLOs.
- Trusting confidence scores blindly. Models are confidently wrong about half the time. Use confidence as a routing signal, not a truth signal. Below 0.7 escalates. Above 0.7 still gets logged and reviewed weekly.
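On the first pitfall, a pre-computed summary can be as small as the sketch below. It collapses a window of samples into a handful of fields a model can read without knowing anything about scrape intervals. The function and field names are illustrative assumptions, not a standard format.

```python
from statistics import mean, median


def summarize_series(name: str, samples: list[tuple[float, float]]) -> dict:
    """Collapse (unix_ts, value) samples into a compact, model-readable summary."""
    if not samples:
        return {"metric": name, "points": 0}
    values = [v for _, v in samples]
    first, last = values[0], values[-1]
    peak_ts, peak = max(samples, key=lambda s: s[1])
    return {
        "metric": name,
        "points": len(values),
        "window_s": samples[-1][0] - samples[0][0],
        "min": min(values),
        "median": median(values),
        "mean": round(mean(values), 4),
        "max": peak,
        "max_at": peak_ts,
        "last": last,
        # Percent change from first to last sample; undefined when first is 0.
        "change_pct": round(100 * (last - first) / first, 1) if first else None,
    }
```

A summary like this for each metric tied to the alert is usually only a few hundred bytes, which fits comfortably inside the triage context.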
8. Troubleshooting
Three failure modes you’ll hit in the first month.
8.1 Triage reports are vague
Usually a context problem. Check that you’re passing service version, recent deploys, and a small set of error log examples. If the report says “investigate the database” with no specifics, your context has no database signal.
8.2 Detection storms during deploys
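Every deploy looks like an anomaly. Suppress detections for 10 minutes after a Kubernetes Deployment rollout completes. The simplest fix is an Alertmanager inhibit rule whose source is an alert derived from a deploy_in_progress metric you publish from your CI. Inhibit rules live in Alertmanager, not Prometheus, and they match on alerts rather than raw metrics, so the metric needs a small alerting rule of its own.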
8.3 Remediation actions executing twice
This is almost always an idempotency problem. Every action handler must check current state before acting. A restart_pod that runs twice is harmless. A rollback_deployment that runs twice will undo your manual fix. Use a Redis lock keyed on (action, target, fingerprint) with a 10-minute TTL.
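The lock itself is a single Redis SET with NX and an expiry. The sketch below keeps the call shape of redis-py’s `set(name, value, nx=True, ex=...)` but runs against a small in-memory stand-in, so it works without a server. `FakeRedis`, `acquire_action_lock`, and the `remediation:` key prefix are all assumptions for illustration.

```python
import time


class FakeRedis:
    """Tiny in-memory stand-in that mimics SET key value NX EX ttl."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data: dict[str, tuple[str, float | None]] = {}

    def set(self, name, value, nx=False, ex=None):
        now = self._clock()
        entry = self._data.get(name)
        live = entry is not None and (entry[1] is None or entry[1] > now)
        if nx and live:
            return None  # like redis-py, None means the key already existed
        self._data[name] = (value, None if ex is None else now + ex)
        return True


def acquire_action_lock(client, action: str, target: str,
                        fingerprint: str, ttl_s: int = 600) -> bool:
    """Return True only for the first caller within the TTL window."""
    key = f"remediation:{action}:{target}:{fingerprint}"  # hypothetical key layout
    return bool(client.set(key, "1", nx=True, ex=ttl_s))
```

A handler would call `acquire_action_lock` first and return early on False, then still check current state before acting, because the lock only guards against duplicates inside the TTL.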
9. Wrapping Up
AIOps in May 2025 is finally boring in the good way. Statistical detection, structured triage, deterministic remediation. The LLM is a context compressor, not an oracle. The hardest engineering isn’t in the model layer, it’s in the event schema, the action catalog, and the dedup logic.
If you’re starting fresh, build detection first, ship it for a month, then add triage, then add remediation. Don’t invert the order. And whatever you do, don’t put a model in the critical path of paging until you’ve watched your detectors lie for a few weeks and learned their personalities.
For the next step, see the companion piece on auto-remediation pipelines with LLM agents and Argo Events, or the broader OpenTelemetry collector docs for the substrate. The plumbing is the product.