AIOps in May 2025, What Actually Works in Production
May 5, 2025 · 9 min read · by Muhammad Amal programming

TL;DR — Classical statistical anomaly detection still wins for metrics, LLMs earn their keep on log triage and incident narration, and the hardest part isn’t the model, it’s the event bus and the runbook contract.

The AIOps pitch hasn’t really changed in five years. Vendors still promise self-healing systems, root cause analysis at the push of a button, and an oracle that knows why latency spiked. What’s changed in 2025 is that some of this finally works, but only when you draw the lines carefully. The 2024 wave of “let an LLM read your logs” demos collapsed under cost and hallucination pressure. The 2025 wave is more humble and more useful.

I’ve spent the last year shipping AIOps glue across two platform teams. One runs a high-traffic e-commerce stack on Kubernetes 1.32, the other a multi-region fintech with strict change controls. The patterns that worked are remarkably similar, and so are the failures. This piece is a field report. No vendor pitches, no benchmarks, just the architectures and trade-offs I’d defend in a design review.

The short version is that AIOps in 2025 is three loosely coupled layers: detection (mostly statistical), triage (mostly LLM-assisted), and remediation (mostly deterministic with LLM-supervised escalation). Treat them as one big agent and you’ll be debugging prompts at 3 AM. Treat them as a pipeline with explicit contracts and the whole thing becomes maintainable.

1. The Three Layers That Actually Ship

Every working AIOps stack I’ve seen in 2025 separates these concerns. Skip this and you get a monolithic agent that’s impossible to test.

                +-------------------+
metrics ----->  |  Detection layer  |  --> events (typed)
logs ------->   |  (stats + ML)     |
traces ----->   +-------------------+
                          |
                          v
                +-------------------+
                |   Triage layer    |  --> hypotheses + context
                |   (LLM + RAG)     |
                +-------------------+
                          |
                          v
                +-------------------+
                |  Remediation      |  --> actions (deterministic)
                |  (Argo + scripts) |
                +-------------------+

1.1 Detection stays statistical

The biggest mistake teams make is asking an LLM to look at Prometheus metrics. It’s slow, expensive, and worse than a 30-line MAD detector. For metrics in 2025, the boring stack still wins.

# prometheus rule, Prometheus 3.0
groups:
- name: latency-anomalies
  interval: 30s
  rules:
  - record: job:http_p99:5m
    expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
  - alert: P99LatencyAnomaly
    expr: |
      (job:http_p99:5m - avg_over_time(job:http_p99:5m[1h] offset 1d))
        / stddev_over_time(job:http_p99:5m[1h] offset 1d) > 3
    for: 5m
    labels:
      severity: warning
      detector: zscore
    annotations:
      summary: "P99 anomaly on {{ $labels.job }}"

Z-score with a 1-day offset gets you 80% of the value. For the remaining 20% (seasonal traffic, deploy-correlated drift), add Prophet or a small isolation forest, but only after you’ve run the cheap detector in production for a month and seen what it misses.
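
For comparison, the MAD detector mentioned earlier fits in a few lines of Python. This is a minimal sketch, not the rule deployed above: `mad_zscore` and `is_anomalous` are illustrative names, and the 0.6745 scaling with a 3.5 cut-off follows the common "modified z-score" convention rather than anything specific to this stack.

```python
from statistics import median


def mad_zscore(history: list[float], current: float) -> float:
    """Modified z-score of `current` against `history`, using median and MAD."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # Flat history: any deviation at all is treated as unbounded.
        if current == med:
            return 0.0
        return float("inf") if current > med else float("-inf")
    # 0.6745 rescales MAD so the score is comparable to a standard z-score
    # when the data is roughly normal.
    return 0.6745 * (current - med) / mad


def is_anomalous(history: list[float], current: float, threshold: float = 3.5) -> bool:
    # Only upward deviations are flagged: faster-than-usual latency is not a page.
    return mad_zscore(history, current) > threshold


# Hypothetical p99 latencies in seconds.
p99_history = [0.100, 0.102, 0.098, 0.101, 0.099, 0.100, 0.103, 0.097]
print(is_anomalous(p99_history, 0.101))  # ordinary sample
print(is_anomalous(p99_history, 0.130))  # clear spike
```

Median and MAD are less sensitive to a single past outlier than mean and standard deviation, which is the usual argument for MAD when yesterday's baseline itself contains an incident.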

1.2 Triage is where LLMs earn money

This is where claude-3.7-sonnet and gpt-4o actually pull weight. Given a detection event plus structured context (recent deploys, related alerts, top error logs), they write the kind of summary an on-call engineer would write at minute three. Not minute one, minute three. The model isn’t faster than a human at the first minute, it’s faster at compressing context that the human would otherwise scroll through.

1.3 Remediation stays deterministic

Let the LLM propose. Let a state machine execute. We’ll come back to this in section 6, but the rule is simple: every remediation action is a named function with a known blast radius, and the LLM picks from a menu. It does not write shell.

2. Building the Event Bus

Everything hangs off a typed event bus. Use NATS, Kafka, or Redis streams — pick what you already run. The shape matters more than the transport.

# events/schema.py
from pydantic import BaseModel, Field
from datetime import datetime
from typing import Literal

class DetectionEvent(BaseModel):
    event_id: str
    kind: Literal["metric_anomaly", "log_burst", "trace_error_spike"]
    service: str
    severity: Literal["info", "warning", "critical"]
    detected_at: datetime
    source: str  # which detector
    signal: dict  # detector-specific payload
    labels: dict[str, str] = Field(default_factory=dict)
    fingerprint: str  # for dedup

The fingerprint is critical. Without it you’ll page on the same anomaly every evaluation cycle. I use a SHA1 of (kind, service, labels.alertname, time_bucket_5min).

# events/dedup.py
import hashlib
import time

def fingerprint(kind: str, service: str, alertname: str) -> str:
    bucket = int(time.time() // 300)  # 5 min buckets
    raw = f"{kind}|{service}|{alertname}|{bucket}"
    return hashlib.sha1(raw.encode()).hexdigest()[:16]
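
A fingerprint only suppresses pages if something remembers which ones it has already seen. The sketch below shows one in-process way to do that. `Deduper`, its `ttl_seconds` default, and the injectable `clock` are assumptions for illustration; a real deployment would more likely keep this state in whatever store backs the event bus.

```python
import time
from typing import Callable


class Deduper:
    """Remembers fingerprints for a TTL and reports whether one is new."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen: dict[str, float] = {}  # fingerprint -> first-seen time

    def is_new(self, fp: str) -> bool:
        now = self.clock()
        # Forget fingerprints older than the TTL so memory stays bounded.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl}
        if fp in self._seen:
            return False
        self._seen[fp] = now
        return True


# Demo with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
dedup = Deduper(ttl_seconds=600, clock=lambda: fake_now[0])
first = dedup.is_new("a1b2c3")      # first sighting
repeat = dedup.is_new("a1b2c3")     # same fingerprint, no time elapsed
fake_now[0] = 601.0
after_ttl = dedup.is_new("a1b2c3")  # TTL has expired
print(first, repeat, after_ttl)
```

Because the fingerprint already includes a 5-minute bucket, an anomaly that persists across a bucket boundary produces a new fingerprint and is treated as new here. Whether that re-notification is wanted is a policy choice, not something the deduper decides.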

The triage layer consumes these events, hydrates them with context, and emits a TriageReport. The remediation layer consumes reports and either acts or escalates. Each hop is a separate service. Each hop is testable.

3. Standing Up the Observability Substrate

You can’t do AIOps without three signals: metrics, logs, traces. In May 2025 the canonical stack is Prometheus 3.0 for metrics, Loki 3.3 for logs, Tempo 2.6 for traces, all glued by OpenTelemetry collector 0.120 and visualized in Grafana 11.4. There’s no good reason to deviate unless you’re already on a vendor.

Here’s a minimal collector config that fans out to all three.

# otelcol-config.yaml, OpenTelemetry collector 0.120
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
  resource:
    attributes:
      - key: cluster
        value: prod-eu-1
        action: upsert

exporters:
  prometheusremotewrite:
    # Prometheus only accepts this if started with --web.enable-remote-write-receiver
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo]

If you only do one thing after reading this, make sure your service name and version are stamped on every signal. The triage LLM is useless without consistent service.name and service.version labels.

4. Adding the Detection Layer

Three detectors cover most of what matters: metric anomaly, log burst, and trace error spike. I’ll show the log burst detector because it’s the one teams skip and then regret.

# detectors/log_burst.py
import asyncio
from collections import defaultdict, deque
from statistics import mean, stdev
import httpx

class LogBurstDetector:
    def __init__(self, loki_url: str, window: int = 300):
        self.loki = loki_url
        self.window = window
        self.history: dict[str, deque] = defaultdict(lambda: deque(maxlen=288))  # 24h of 5min

    async def evaluate(self, service: str) -> dict | None:
        query = f'sum(count_over_time({{service="{service}",level="error"}}[5m]))'
        async with httpx.AsyncClient(timeout=10) as c:
            r = await c.get(f"{self.loki}/loki/api/v1/query", params={"query": query})
            r.raise_for_status()
            data = r.json()["data"]["result"]
        if not data:
            return None
        current = float(data[0]["value"][1])
        hist = self.history[service]
        hist.append(current)
        if len(hist) < 30:
            return None
        baseline = mean(list(hist)[:-1])
        sd = stdev(list(hist)[:-1]) or 1.0
        z = (current - baseline) / sd
        if z > 4 and current > 10:
            return {"service": service, "z": z, "rate": current, "baseline": baseline}
        return None

Z-score of 4 plus an absolute floor of 10 errors per 5 minutes catches the real bursts and skips the ones that are just sparse logs going from 0 to 2. The absolute floor matters more than the statistical threshold.
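
To see why the floor does the heavy lifting, the decision can be pulled out into a pure function and fed a sparse history. This is a sketch using the same thresholds as the detector above; `is_burst` and the sample histories are made up for illustration.

```python
from statistics import mean, stdev


def is_burst(history: list[float], current: float,
             z_threshold: float = 4.0, floor: float = 10.0) -> tuple[bool, float]:
    """Return (fires, z) for `current` against `history` (needs >= 2 points)."""
    baseline = mean(history)
    sd = stdev(history) or 1.0  # same fallback as the detector for flat history
    z = (current - baseline) / sd
    return (z > z_threshold and current > floor), z


# A quiet service: almost no errors, then 2 errors in one window.
sparse_history = [0.0] * 28 + [1.0]
sparse_fired, sparse_z = is_burst(sparse_history, 2.0)

# A busier service: a handful of errors per window, then 40 in one window.
busy_history = [5.0, 6.0, 4.0, 5.0, 7.0, 5.0, 6.0, 4.0, 5.0, 6.0] * 3
busy_fired, busy_z = is_burst(busy_history, 40.0)

print(f"sparse: z={sparse_z:.1f} fired={sparse_fired}")
print(f"busy:   z={busy_z:.1f} fired={busy_fired}")
```

The sparse case clears the z-score threshold comfortably because its baseline variance is tiny, and only the absolute floor stops it from paging.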

5. Wiring the LLM Triage Service

The triage service is a thin wrapper. It receives a detection event, gathers context, and asks the model for a structured response. The structured response is non-negotiable.

# triage/service.py
from anthropic import AsyncAnthropic
from pydantic import BaseModel
import json

class TriageReport(BaseModel):
    summary: str
    likely_causes: list[str]
    recommended_actions: list[str]
    confidence: float
    needs_human: bool

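SYSTEM = """You are an SRE assistant. Given a detection event and context, produce a
structured triage report. Be specific. If you don't have enough context, say so and
set needs_human=true. Never invent log lines or metrics.
Reply with a single JSON object and nothing else, using exactly these keys:
summary (string), likely_causes (list of strings), recommended_actions (list of strings),
confidence (number from 0 to 1), needs_human (boolean)."""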

class TriageService:
    def __init__(self, api_key: str):
        self.client = AsyncAnthropic(api_key=api_key)

    async def triage(self, event: dict, context: dict) -> TriageReport:
        msg = await self.client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=1024,
            system=SYSTEM,
            messages=[{
                "role": "user",
                "content": json.dumps({"event": event, "context": context}, default=str),
            }],
        )
        raw = msg.content[0].text
        # Raises pydantic.ValidationError if the reply is not valid JSON for TriageReport.
        return TriageReport.model_validate_json(raw)

The context blob includes: last three deploys to this service, top 10 error log fingerprints in the last 15 minutes, related alerts firing on dependencies, and the service’s SLO burn rate. That’s it. More context degrades the report because the model starts pattern-matching on irrelevant noise.
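
The caps in that list are easy to enforce in code, so the blob cannot quietly grow. Below is a minimal sketch of a context builder. The function name, the input shapes, and the output keys are all assumptions, and it expects the caller to have already limited `error_fingerprints` to the last 15 minutes.

```python
from collections import Counter
from datetime import datetime


def build_context(deploys: list[dict], error_fingerprints: list[str],
                  dependency_alerts: list[str], slo_burn_rate: float) -> dict:
    """Assemble a bounded context blob for the triage model."""
    # Newest first, capped at three deploys.
    recent = sorted(deploys, key=lambda d: d["deployed_at"], reverse=True)[:3]
    # One entry per distinct fingerprint, capped at the ten most frequent.
    top_errors = Counter(error_fingerprints).most_common(10)
    return {
        "recent_deploys": recent,
        "top_error_fingerprints": [
            {"fingerprint": fp, "count": n} for fp, n in top_errors
        ],
        "dependency_alerts": sorted(set(dependency_alerts)),
        "slo_burn_rate": slo_burn_rate,
    }


# Hypothetical inputs: five deploys and twelve distinct error fingerprints.
deploys = [
    {"version": f"1.4.{i}", "deployed_at": datetime(2025, 5, 1 + i, 12, 0)}
    for i in range(5)
]
errors = [f"fp{i}" for i in range(12) for _ in range(i + 1)]  # fp11 is most frequent
ctx = build_context(deploys, errors, ["db-latency", "db-latency", "cache-miss"], 2.5)
print(len(ctx["recent_deploys"]), len(ctx["top_error_fingerprints"]))
```

The `datetime` values survive serialization because the triage call above already passes `default=str` to `json.dumps`.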

If you want the deeper version of this pattern with state machines and retries, see my piece on building an SRE copilot for on-call engineers.

6. Closing the Loop with Remediation

The remediation layer is where teams overreach. The rule I live by: every action has a name, an owner, and a blast radius. The LLM picks a name. A human-approved policy decides if it executes.

# remediation/actions.yaml
actions:
  - name: restart_pod
    blast_radius: pod
    requires_approval: false
    runbook: https://runbooks.internal/restart-pod
  - name: scale_deployment
    blast_radius: deployment
    requires_approval: false
    max_replicas: 20
  - name: rollback_deployment
    blast_radius: deployment
    requires_approval: true
  - name: drain_node
    blast_radius: node
    requires_approval: true
  - name: failover_region
    blast_radius: region
    requires_approval: true

The triage report’s recommended_actions is just a list of these names. If the model invents a new action, the dispatcher rejects the report. This is the single most important guardrail.

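# remediation/dispatcher.py
# load_actions(), escalate() and execute() live elsewhere in the remediation service.
ACTIONS = {a["name"]: a for a in load_actions()}

async def dispatch(report: TriageReport):
    # Reject the whole report if the model names anything outside the catalog.
    for action in report.recommended_actions:
        if action not in ACTIONS:
            await escalate(report, reason=f"unknown action: {action}")
            return
    if report.needs_human or report.confidence < 0.7:
        await escalate(report, reason="low confidence or human-required")
        return
    # Actions flagged requires_approval never run without a human sign-off.
    gated = [a for a in report.recommended_actions
             if ACTIONS[a].get("requires_approval", True)]
    if gated:
        await escalate(report, reason=f"approval required: {', '.join(gated)}")
        return
    for action in report.recommended_actions:
        await execute(action, report)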

7. Common Pitfalls

Four mistakes I’ve seen kill AIOps projects in 2025.

  1. Letting the LLM see raw Prometheus. Don’t. It can’t reason about scrape intervals, recording rules, or PromQL semantics. Feed it pre-computed summaries.
  2. Skipping fingerprinting. Without dedup you’ll page on the same anomaly every 30 seconds. The triage cost alone will eat your model budget by Tuesday.
  3. Building one big agent. A single LangGraph that does detection, triage, and remediation is fun to demo and impossible to operate. Three services, three deploys, three SLOs.
  4. Trusting confidence scores blindly. Models are confidently wrong about half the time. Use confidence as a routing signal, not a truth signal. Below 0.7 escalates. Above 0.7 still gets logged and reviewed weekly.

8. Troubleshooting

Three failure modes you’ll hit in the first month.

8.1 Triage reports are vague

Usually a context problem. Check that you’re passing service version, recent deploys, and a small set of error log examples. If the report says “investigate the database” with no specifics, your context has no database signal.

8.2 Detection storms during deploys

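Every deploy looks like an anomaly. Suppress detections for 10 minutes after a Kubernetes Deployment rollout completes. The simplest fix is an Alertmanager inhibit rule (inhibition lives in Alertmanager, not in Prometheus itself): publish a deploy_in_progress metric from your CI, raise an alert while it is set, and use that alert as the inhibit source for the anomaly alerts on the same service.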

8.3 Remediation actions executing twice

This is always idempotency. Every action handler must check current state before acting. A restart_pod that runs twice is fine. A rollback_deployment that runs twice will undo your manual fix. Use a Redis lock keyed on (action, target, fingerprint) with a 10-minute TTL.
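
The lock itself is a single atomic call. The sketch below assumes a client exposing redis-py's `set(name, value, nx=..., ex=...)`. `try_acquire` and the key prefix are illustrative names, and `FakeRedis` is an in-memory stand-in that exists only so the example runs without a server (it ignores expiry).

```python
import uuid


class FakeRedis:
    """In-memory stand-in for the one Redis call used below; ignores `ex`."""

    def __init__(self):
        self._data: dict[str, str] = {}

    def set(self, name, value, nx=False, ex=None):
        if nx and name in self._data:
            return None  # mirrors redis-py: None when NX prevents the write
        self._data[name] = value
        return True


def try_acquire(client, action: str, target: str, fingerprint: str,
                ttl_seconds: int = 600) -> bool:
    """Take the remediation lock; False means another worker already holds it."""
    key = f"remediation-lock:{action}:{target}:{fingerprint}"
    # SET with NX and EX is atomic: only the first caller creates the key,
    # and the key expires on its own after the TTL.
    return bool(client.set(key, uuid.uuid4().hex, nx=True, ex=ttl_seconds))


client = FakeRedis()
first = try_acquire(client, "rollback_deployment", "checkout", "a1b2c3")
second = try_acquire(client, "rollback_deployment", "checkout", "a1b2c3")
other_target = try_acquire(client, "rollback_deployment", "payments", "a1b2c3")
print(first, second, other_target)
```

The lock only prevents duplicate dispatch. The handler should still compare current state with desired state before acting, since a lock that expired or a manual change in between would otherwise go unnoticed.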

9. Wrapping Up

AIOps in May 2025 is finally boring in the good way. Statistical detection, structured triage, deterministic remediation. The LLM is a context compressor, not an oracle. The hardest engineering isn’t in the model layer, it’s in the event schema, the action catalog, and the dedup logic.

If you’re starting fresh, build detection first, ship it for a month, then add triage, then add remediation. Don’t invert the order. And whatever you do, don’t put a model in the critical path of paging until you’ve watched your detectors lie for a few weeks and learned their personalities.

For the next step, see the companion piece on auto-remediation pipelines with LLM agents and Argo Events, or the broader OpenTelemetry collector docs for the substrate. The plumbing is the product.