AI-Driven Log Analysis at Scale: A Production Tutorial
May 9, 2025 · 8 min read · by Muhammad Amal

TL;DR: Mine templates with Drain3, retrieve nearest neighbors by template embedding, then let the LLM summarize a bounded sample. Never feed raw logs to a model in the critical path, and never page on summaries.

Feeding raw logs to a language model is a great demo and a terrible production strategy. At any meaningful scale you’ll burn through your budget by Tuesday and your model will hallucinate field names by Wednesday. The pattern that actually works in 2025 is a two-stage pipeline: a cheap deterministic clusterer turns millions of log lines into a few hundred templates, and a model summarizes a small bounded sample on demand.

This tutorial walks through that pipeline end-to-end. The substrate is Loki 3.3 with OpenTelemetry collector 0.120 shipping logs from Kubernetes 1.32. The template miner is Drain3, the same algorithm IBM has been using for log parsing for years, dressed up with a state store. The model is claude-3.7-sonnet, called only when a human or an alert needs a narrative.

The result is a system that processes 50k log lines per second per node, holds about 2000 active templates per service, and returns a summary within seconds when an SRE asks “what’s going on with the checkout service” via a Slack slash command. Model cost runs about ten dollars per day at that volume, not ten thousand.

1. The Pipeline Shape

Three stages, each with a clear contract.

+------------+      +------------+      +------------+      +-----------+
|  Loki 3.3  | -->  | Drain3     | -->  | Template   | -->  | LLM       |
|  raw logs  |      | miner      |      | index +    |      | summarize |
|            |      |            |      | vectors    |      | (bounded) |
+------------+      +------------+      +------------+      +-----------+
   millions/min        ~100/min            ~2k total          on demand

The numbers matter. The miner converts millions of log lines per minute into about a hundred new template hits per minute. The template index holds maybe two thousand templates per service. The model only sees what the SRE asks about, and only ever a bounded sample.

2. Standing Up the Loki Side

This section assumes you already have Loki 3.3 running. The OpenTelemetry collector pushes logs in with a clean label set. Keep labels low-cardinality.

# otelcol-logs.yaml, OpenTelemetry collector 0.120
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]
    operators:
      - type: container

processors:
  batch:
    timeout: 2s
    send_batch_size: 4096
  attributes:
    actions:
      - key: log.original
        action: delete  # keep payload size down

exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [attributes, batch]
      exporters: [loki]

Labels in Loki are expensive. Service, namespace, severity. Nothing else. Everything else goes in the log body and gets parsed downstream.

3. Mining Templates with Drain3

Drain3 is the workhorse. It clusters log lines by structural similarity using a fixed-depth tree. We run one Drain3 instance per service in a sidecar that tails Loki via the streaming API.

# miner/drain_worker.py
import json
import asyncio
from urllib.parse import urlencode

import websockets  # Loki's tail endpoint is WebSocket-only
from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig
from drain3.redis_persistence import RedisPersistence

from notify import publish_new_template  # miner/notify.py, section 7

def make_miner(service: str) -> TemplateMiner:
    cfg = TemplateMinerConfig()
    cfg.profiling_enabled = False
    cfg.drain_sim_th = 0.4
    cfg.drain_depth = 6
    cfg.drain_max_children = 100
    cfg.drain_max_clusters = 2000
    # Persist the tree in Redis so a sidecar restart does not forget known templates.
    persistence = RedisPersistence(
        redis_host="redis", redis_port=6379, redis_db=0,
        redis_pass="", is_ssl=False,
        redis_key=f"drain:{service}",
    )
    return TemplateMiner(persistence_handler=persistence, config=cfg)

async def stream(service: str, miner: TemplateMiner):
    # /loki/api/v1/tail speaks WebSocket, so a plain HTTP GET will not stream.
    query = urlencode({"query": f'{{service="{service}"}}', "limit": 1000})
    url = f"ws://loki:3100/loki/api/v1/tail?{query}"
    async with websockets.connect(url) as ws:
        async for message in ws:
            event = json.loads(message)
            for stream_data in event.get("streams", []):
                for ts, body in stream_data["values"]:
                    await process(service, miner, ts, body, stream_data["stream"])

async def process(service, miner, ts, body, labels):
    result = miner.add_log_message(body.strip())
    if result["change_type"] == "cluster_created":
        # publish_new_template is a coroutine; without await it would never run.
        await publish_new_template(service, result["cluster_id"], result["template_mined"])
    record_hit(service, result["cluster_id"], ts, labels)  # storage helper, not shown

if __name__ == "__main__":
    import sys
    svc = sys.argv[1]
    asyncio.run(stream(svc, make_miner(svc)))

drain_sim_th of 0.4 is the threshold I’ve landed on for most service logs. Tighter (0.6+) and you’ll get template explosion. Looser (0.2) and you’ll merge unrelated lines. Tune it per service after the first week.
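To see why the threshold behaves that way, here is a minimal sketch of the token-by-token similarity check that Drain-style miners apply when deciding whether a line joins an existing template. It is a simplified illustration, not Drain3's implementation: there is no prefix tree, no masking, and the function names and sample lines are made up for the example.

```python
WILDCARD = "<*>"

def similarity(template: list[str], tokens: list[str]) -> float:
    """Share of positions where template and line carry the same literal token."""
    if not template or len(template) != len(tokens):
        return 0.0  # only lines with the same token count are compared
    same = sum(1 for t, tok in zip(template, tokens) if t != WILDCARD and t == tok)
    return same / len(template)

def merge(template: list[str], tokens: list[str]) -> list[str]:
    """Keep tokens that agree and replace the rest with the wildcard."""
    return [t if t == tok else WILDCARD for t, tok in zip(template, tokens)]

def assign(templates: list[list[str]], line: str, sim_th: float) -> None:
    """Merge the line into its best-matching template, or start a new one."""
    tokens = line.split()
    best = max(templates, key=lambda t: similarity(t, tokens), default=None)
    if best is not None and similarity(best, tokens) >= sim_th:
        templates[templates.index(best)] = merge(best, tokens)
    else:
        templates.append(tokens)

# Hypothetical sample lines, all seven tokens long.
LINES = [
    "connection to db-1 failed after 3 retries",
    "connection to db-2 failed after 5 retries",
    "request to billing failed with status 503",
]

def mine(sim_th: float) -> list[str]:
    templates: list[list[str]] = []
    for line in LINES:
        assign(templates, line, sim_th)
    return [" ".join(t) for t in templates]

print(mine(0.4))  # two templates: the connection failures merge, the billing line stays separate
print(mine(0.2))  # one template: everything collapses into wildcards
```

The third line shares only two of seven tokens with the connection template, so it stays separate at 0.4 and gets swallowed at 0.2. That is the "merge unrelated lines" failure in miniature.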

The cluster_created event is the interesting one. When Drain3 produces a new template, we want to know. That’s often the first signal of a new error path.

4. The Template Index

We store templates in Postgres with a pgvector embedding for semantic search.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE log_templates (
  id BIGSERIAL PRIMARY KEY,
  service TEXT NOT NULL,
  cluster_id INTEGER NOT NULL,
  template TEXT NOT NULL,
  first_seen TIMESTAMPTZ NOT NULL,
  last_seen TIMESTAMPTZ NOT NULL,
  hit_count BIGINT NOT NULL DEFAULT 0,
  severity_max TEXT,
  embedding vector(1536),
  UNIQUE (service, cluster_id)
);

CREATE INDEX ON log_templates USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX ON log_templates (service, last_seen DESC);

Embeddings come from OpenAI’s text-embedding-3-small. The template text is short, so embedding cost is negligible — fractions of a cent per thousand templates.

# index/embed.py
from openai import AsyncOpenAI

oai = AsyncOpenAI()

async def embed(template: str) -> list[float]:
    r = await oai.embeddings.create(
        model="text-embedding-3-small",
        input=template,
        dimensions=1536,
    )
    return r.data[0].embedding

Now find_similar_templates is one SQL query.

SELECT id, template, hit_count, last_seen
FROM log_templates
WHERE service = $1
ORDER BY embedding <=> $2
LIMIT 10;
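One practical wrinkle when running that query from asyncpg: the driver has no built-in codec for the vector type. A minimal sketch of one workaround is to pass the embedding as pgvector's text form and cast it in SQL. The helper name and the cast-through-text variant of the query are my assumptions for illustration; the pgvector Python package also ships a register_vector helper for asyncpg if you prefer a real codec.

```python
import math

def to_pgvector_literal(embedding: list[float]) -> str:
    """Render an embedding in pgvector's text form, e.g. "[0.1,0.2,0.3]"."""
    if not embedding:
        raise ValueError("embedding must not be empty")
    if not all(math.isfinite(x) for x in embedding):
        raise ValueError("embedding contains NaN or infinity")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Same query as above, with the parameter sent as text and cast server-side.
FIND_SIMILAR_SQL = """
SELECT id, template, hit_count, last_seen
FROM log_templates
WHERE service = $1
ORDER BY embedding <=> $2::text::vector
LIMIT 10;
"""

# Usage with an open asyncpg connection `conn` (not created here):
#   rows = await conn.fetch(FIND_SIMILAR_SQL, service, to_pgvector_literal(vec))
```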

5. The Summarization Endpoint

The summarization endpoint is the only place the LLM enters the pipeline. It accepts a service name and a time window. It returns a structured report.

# api/summarize.py
from fastapi import FastAPI
from pydantic import BaseModel
from anthropic import AsyncAnthropic
import asyncpg, os, json

app = FastAPI()
claude = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

class SummaryRequest(BaseModel):
    service: str
    minutes: int = 15

class TemplateHit(BaseModel):
    template: str
    hit_count: int
    severity_max: str | None
    examples: list[str]

class Summary(BaseModel):
    headline: str
    top_issues: list[str]
    likely_actions: list[str]
    confidence: float

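SYSTEM = """You analyze log templates. Each input has a template (with <*> placeholders),
a hit count, max severity, and up to 3 raw example lines. Reply with a single JSON object
and nothing else, with keys headline (string), top_issues (list of strings),
likely_actions (list of strings), and confidence (number from 0 to 1).
Quote template fragments verbatim. Never invent log lines. Never produce a numeric
threshold you can't see in the input."""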

@app.post("/summarize", response_model=Summary)
async def summarize(req: SummaryRequest):
    hits = await fetch_top_templates(req.service, req.minutes, limit=25)
    examples = await fetch_examples(hits)  # 3 lines per template
    payload = [
        TemplateHit(
            template=h["template"],
            hit_count=h["hit_count"],
            severity_max=h["severity_max"],
            examples=examples[h["id"]],
        ).model_dump()
        for h in hits
    ]
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=800,
        system=SYSTEM,
        messages=[{"role": "user", "content": json.dumps(payload)}],
    )
    return Summary.model_validate_json(msg.content[0].text)

The trick is bounding the input: 25 templates, 3 examples each, and at most 800 output tokens. That keeps each call to roughly a cent or two. If you let the SRE ask for “last 24 hours” without bounds, you’ll regret it.
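How cheap a call really is depends on token counts and current pricing, so it is worth doing the arithmetic for your own payloads. The sketch below is a back-of-the-envelope estimator. The per-million-token prices and the tokens-per-template and tokens-per-example figures are assumptions I plugged in for illustration; check them against your provider's price sheet and your real logs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    # Assumed list prices in USD per million tokens; verify before relying on them.
    input_per_mtok: float = 3.00
    output_per_mtok: float = 15.00

def estimate_call_cost(
    templates: int = 25,
    examples_per_template: int = 3,
    tokens_per_template: int = 30,   # assumed average
    tokens_per_example: int = 60,    # assumed average
    system_tokens: int = 100,        # assumed size of the system prompt
    max_output_tokens: int = 800,
    pricing: Pricing = Pricing(),
) -> float:
    """Worst-case USD cost of one summarize call (output fully used)."""
    per_template = tokens_per_template + examples_per_template * tokens_per_example
    input_tokens = system_tokens + templates * per_template
    return (
        input_tokens * pricing.input_per_mtok
        + max_output_tokens * pricing.output_per_mtok
    ) / 1_000_000

bounded = estimate_call_cost()
unbounded = estimate_call_cost(templates=2000)  # every template for a service
print(f"bounded: ${bounded:.4f}, unbounded: ${unbounded:.2f}")
```

Input cost grows linearly with the number of templates, which is why the limit of 25 matters more than the output cap.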

6. The Slack Slash Command

Hook it to Slack so SREs can ask in-channel.

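# slack/handler.py
from fastapi import BackgroundTasks, FastAPI, Form
import httpx

app = FastAPI()

async def post_summary(service: str, minutes: int, response_url: str):
    # Runs after the 200 has gone out, so the model call cannot exceed Slack's 3-second ack window.
    async with httpx.AsyncClient(timeout=20) as c:
        r = await c.post("http://api/summarize",
                         json={"service": service, "minutes": minutes})
        r.raise_for_status()
        summary = r.json()
        blocks = [
            {"type": "section", "text": {"type": "mrkdwn",
                "text": f"*{summary['headline']}* (confidence {summary['confidence']:.0%})"}},
            {"type": "section", "text": {"type": "mrkdwn",
                "text": "\n".join(f"- {x}" for x in summary["top_issues"])}},
        ]
        await c.post(response_url, json={"response_type": "in_channel", "blocks": blocks})

@app.post("/slack/logs")
async def slack_logs(background_tasks: BackgroundTasks,
                     text: str = Form(...), response_url: str = Form(...)):
    parts = text.split()
    service = parts[0]
    minutes = int(parts[1]) if len(parts) > 1 else 15
    background_tasks.add_task(post_summary, service, minutes, response_url)
    # Acknowledge immediately; the summary follows via response_url.
    return {"response_type": "ephemeral", "text": f"Summarizing {service}..."}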

Usage: /logs checkout-api 30. The first response is the standard Slack 200 within 3 seconds; the real reply goes to response_url.
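Slash-command text is free-form user input, so it pays to validate it and clamp the window before anything reaches the summarizer. Here is a minimal sketch of a parser for the `/logs <service> [minutes]` form. The function name and the 120-minute ceiling are assumptions for illustration; pick a ceiling that matches your own budget.

```python
DEFAULT_MINUTES = 15
MAX_MINUTES = 120  # assumed ceiling, not a value from the pipeline above

def parse_command(text: str) -> tuple[str, int]:
    """Parse '<service> [minutes]' and clamp minutes to 1..MAX_MINUTES."""
    parts = text.split()
    if not parts:
        raise ValueError("usage: /logs <service> [minutes]")
    service = parts[0]
    minutes = DEFAULT_MINUTES
    if len(parts) > 1:
        try:
            minutes = int(parts[1])
        except ValueError:
            raise ValueError(f"minutes must be an integer, got {parts[1]!r}") from None
    return service, max(1, min(minutes, MAX_MINUTES))

print(parse_command("checkout-api 30"))
print(parse_command("checkout-api 1440"))  # a 24-hour request is clamped
```

In the handler, a ValueError can be turned into an ephemeral usage message instead of a 500.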

7. Detecting New-Template Events

The most underrated AIOps signal is “a template that has never been seen before just appeared”. You don’t need a model for this. It falls straight out of Drain3’s cluster_created event.

# miner/notify.py
async def publish_new_template(service: str, cid: int, template: str):
    severity = guess_severity(template)
    if severity not in ("error", "critical"):
        return
    await emit_event({
        "kind": "new_log_template",
        "service": service,
        "cluster_id": cid,
        "template": template,
        "severity": severity,
    })

def guess_severity(template: str) -> str:
    low = template.lower()
    if "panic" in low or "fatal" in low: return "critical"
    if "error" in low or "exception" in low: return "error"
    if "warn" in low: return "warning"
    return "info"

This event goes to the same Argo Events bus your remediation pipeline listens on. A brand-new error template post-deploy is one of the strongest “rollback candidate” signals you have.
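On the consuming side, the "post-deploy" part is just a time comparison. A minimal sketch, assuming the consumer knows the latest deploy timestamp for the service; the function name and the 30-minute window are hypothetical choices, not part of the pipeline above.

```python
from datetime import datetime, timedelta, timezone

def is_rollback_candidate(
    first_seen: datetime,
    deployed_at: datetime,
    severity: str,
    window: timedelta = timedelta(minutes=30),  # assumed window
) -> bool:
    """True when a new error-level template first appeared shortly after a deploy."""
    if severity not in ("error", "critical"):
        return False
    return deployed_at <= first_seen <= deployed_at + window

deployed = datetime(2025, 5, 9, 12, 0, tzinfo=timezone.utc)
print(is_rollback_candidate(deployed + timedelta(minutes=5), deployed, "error"))
print(is_rollback_candidate(deployed + timedelta(hours=2), deployed, "error"))
```

Treat the result as a hint for a human or a gated workflow, not as an automatic rollback trigger.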

8. Common Pitfalls

Four mistakes worth dodging.

  1. Embedding raw log lines instead of templates. You’ll burn money and your similarity scores will be dominated by timestamps and request IDs. Embed templates.
  2. Letting Drain3’s cluster count grow unbounded. Set drain_max_clusters and watch it. If you’re hitting the ceiling, your similarity threshold is too tight.
  3. Including all severities in the summary. Filter to warning-and-above before sending to the model. Info logs drown the signal and cost tokens.
  4. Caching summary results too aggressively. A 5-minute summary cache is fine. A 1-hour cache means SREs are looking at stale data during incidents, which is worse than no cache at all.

9. Troubleshooting

Three failure modes you’ll hit.

9.1 Drain3 producing one giant cluster

You set drain_sim_th too low. The miner is gluing unrelated lines together. Raise to 0.5 and reset the persistence store. Don’t try to migrate, just start fresh — template clustering is not historically meaningful.

9.2 Loki tail dropping lines

The tail API is best-effort. For anything you need durability on, use the query API on a 30-second poll instead. Tail is fine for the live dashboard, not fine for “no log left behind” pipelines.

9.3 Summary headline keeps blaming the database

The model picks up on whatever’s loudest. If your DB connection retries are the noisiest template, every summary will mention the DB. Add a low-cardinality category label to your logs and let the summarizer group by it. Or downweight by hit count in the SQL query.

10. Wrapping Up

Cheap deterministic clustering plus expensive on-demand summarization is the pattern that scales. Drain3 turns the firehose into a manageable taxonomy. The LLM only ever sees a tiny structured slice. The Slack command becomes the SRE’s first move when something feels off, and it’s fast enough to use casually.

For the upstream pieces, the Loki documentation covers the storage tuning I skipped here. If you want to wire the new-template events into action, see auto remediation pipelines with LLM agents and Argo Events for the executor side, or anomaly detection on Prometheus metrics for the metric counterpart.