Chaos Engineering with AI Augmented Hypotheses


May 21, 2025 · 9 min read · by Muhammad Amal

TL;DR — Let the LLM propose hypotheses from your service graph and postmortem history, run them as LitmusChaos experiments with strict blast radius caps, and abort automatically on SLO burn rate breach.

The classic chaos engineering pitch is “break things on purpose to find weaknesses”. The pitch works on slides. In practice, teams either run the same five experiments forever (pod kills, network latency, region drain) or they don’t run any. The hard part isn’t the breaking, it’s deciding what to break next.

This is where an LLM is genuinely useful. Given a service topology, an incident history, and a recent set of code changes, claude-3.7-sonnet can propose hypotheses that a human SRE wouldn’t get to for a few weeks. “Your checkout service depends on a recently-added inventory cache, and the cache has no circuit breaker — hypothesis: if the cache returns 500s for 10% of requests, checkout p99 latency stays under 2s.” That’s a useful experiment proposal.

But the model doesn’t run anything. LitmusChaos 3.10 does, with a strict blast radius config and an abort hook tied to your SLO burn rate. The model is upstream of the experiment design; everything downstream is deterministic Kubernetes resources you can review like any other manifest.

1. The Pipeline

incident history + topology + recent diffs
        |
        v
+----------------+
| Hypothesis     |  claude-3.7-sonnet
| generator      |  output: list of typed experiment proposals
+----------------+
        |
        v
+----------------+
| Human review   |  PR with proposed ChaosExperiment manifests
+----------------+
        |
        v
+----------------+
| LitmusChaos    |  scheduled run, abort on SLO breach
| 3.10           |
+----------------+
        |
        v
+----------------+
| Results +      |  back into incident history
| Postmortem     |
+----------------+

The model never touches the cluster. Its output is a pull request.

2. Modeling the Service Topology

The hypothesis generator needs a graph of services and their dependencies. The cheapest source is your service mesh or your Tempo 2.6 traces.

# topology/build.py
import time
from collections import defaultdict

import httpx

async def build_topology(window_hours: int = 24) -> dict:
    """Build a service graph from Tempo traces."""
    now = int(time.time())
    async with httpx.AsyncClient(timeout=30) as c:
        r = await c.get("http://tempo:3200/api/search", params={
            "tags": "",
            "minDuration": "0ms",
            "limit": 1000,
            # Tempo's search API takes start/end as Unix epoch seconds
            "start": now - window_hours * 3600,
            "end": now,
        })
        r.raise_for_status()
        traces = r.json()["traces"]

    # (caller service, callee service) -> number of calls seen
    edges = defaultdict(int)
    for trace in traces:
        # fetch_spans is a helper defined elsewhere; it is expected to return
        # spans annotated with "service" and "parent_service"
        spans = await fetch_spans(trace["traceID"])
        for span in spans:
            parent = span.get("parent_service")
            child = span.get("service")
            if parent and child and parent != child:
                edges[(parent, child)] += 1

    return {
        "edges": [{"from": a, "to": b, "weight": w}
                  for (a, b), w in edges.items()],
        # sorted so the model sees the same ordering on every run
        "services": sorted({a for a, _ in edges} | {b for _, b in edges}),
    }

The output is a list of weighted edges. Weight is call count over the window. Combined with a short description of each service (pulled from your service catalog), this is enough context.
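The listing above reads parent_service straight off each span, which raw trace data usually does not carry. A minimal sketch of how that resolution might work, assuming each span is a dict with the hypothetical keys span_id, parent_span_id, and service (illustrative names, not Tempo's response format):

```python
from collections import defaultdict

def service_edges(spans: list[dict]) -> dict[tuple[str, str], int]:
    """Count cross-service calls in one trace.

    Assumes each span dict has the hypothetical keys "span_id",
    "parent_span_id" (None for the root span) and "service".
    """
    service_by_span = {s["span_id"]: s["service"] for s in spans}
    edges: dict[tuple[str, str], int] = defaultdict(int)
    for span in spans:
        parent = service_by_span.get(span.get("parent_span_id"))
        child = span["service"]
        # Skip root spans, orphaned spans, and in-process child spans.
        if parent and parent != child:
            edges[(parent, child)] += 1
    return dict(edges)

trace = [
    {"span_id": "a", "parent_span_id": None, "service": "gateway"},
    {"span_id": "b", "parent_span_id": "a", "service": "checkout"},
    {"span_id": "c", "parent_span_id": "b", "service": "checkout"},
    {"span_id": "d", "parent_span_id": "c", "service": "inventory-cache"},
]
print(service_edges(trace))
```

Only the two hops that cross a service boundary become edges; the checkout-to-checkout span is internal work and is ignored.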

3. Pulling Incident History

The other input is past pain. Read the last 90 days of postmortems.

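# history/postmortems.py
from datetime import date, timedelta
from glob import glob

async def recent_postmortems(days: int = 90) -> list[dict]:
    # assumes postmortems are markdown in a git repo with frontmatter;
    # parse_frontmatter is a helper that returns (meta dict, body str)
    out = []
    # YAML parses `date: 2025-05-22` as a date, so compare against a date
    cutoff = date.today() - timedelta(days=days)
    for path in glob("postmortems/*.md"):
        with open(path) as f:
            meta, body = parse_frontmatter(f.read())
        if meta["date"] < cutoff:
            continue
        out.append({
            "title": meta["title"],
            "date": meta["date"].isoformat(),
            "services": meta.get("services", []),
            "root_cause": meta.get("root_cause"),
            "contributing": meta.get("contributing_factors", []),
            "summary": body[:600],  # cap
        })
    # glob order is arbitrary; ISO dates sort chronologically as strings
    out.sort(key=lambda p: p["date"])
    return out[-15:]  # most recent 15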

Cap at 15. The model doesn’t need all 90 days of incidents, just the recent texture.
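The listing relies on a parse_frontmatter helper that is not shown. As a rough sketch of its first half, the function below only locates the frontmatter boundaries; the name split_frontmatter is hypothetical, and turning the header text into a dict is left to whichever YAML library you already use.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a markdown document into (frontmatter, body).

    Returns an empty frontmatter string when the file does not start
    with a '---' line. Parsing the YAML itself is out of scope here;
    this sketch only finds the boundaries.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            header = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return header, body
    # Opening marker without a closing one: treat everything as body.
    return "", text

doc = "---\ntitle: Checkout cache no circuit breaker\ndate: 2025-05-22\n---\nWhat happened...\n"
header, body = split_frontmatter(doc)
print(header)
print(body)
```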

4. The Hypothesis Generator

The prompt is specific about format. Structured output, no prose.

# hypothesis/generator.py
from anthropic import AsyncAnthropic
from pydantic import BaseModel, Field
import json

class ExperimentProposal(BaseModel):
    name: str  # short kebab-case
    hypothesis: str  # "if X then Y"
    target_service: str
    fault_type: str  # one of an enum
    fault_params: dict
    blast_radius: str  # pod | deployment | namespace
    expected_outcome: str
    abort_conditions: list[str]
    estimated_risk: str  # low | medium | high
    rationale: str  # why this experiment, refs to incidents

class Proposals(BaseModel):
    experiments: list[ExperimentProposal]

ALLOWED_FAULTS = {
    "pod-delete", "pod-cpu-hog", "pod-memory-hog",
    "network-latency", "network-loss", "network-corruption",
    "dns-error", "http-status-code", "disk-fill",
}

SYSTEM = f"""You are a chaos engineering planner. Given a service topology and
recent postmortems, propose 3-5 chaos experiments that would surface real
weaknesses. Use only these fault types: {sorted(ALLOWED_FAULTS)}. Set blast_radius
to the smallest unit that can produce the effect. Reference specific postmortems
in rationale by title. Be specific about fault_params. Never propose an experiment
that would affect a region or cluster. Respond with only a JSON object, no prose,
matching this JSON schema: {json.dumps(Proposals.model_json_schema())}"""

claude = AsyncAnthropic()

async def generate(topology: dict, postmortems: list[dict]) -> Proposals:
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=2048,
        system=SYSTEM,
        messages=[{"role": "user", "content": json.dumps({
            "topology": topology,
            "postmortems": postmortems,
        })}],
    )
    proposals = Proposals.model_validate_json(msg.content[0].text)
    # validate fault types
    for exp in proposals.experiments:
        if exp.fault_type not in ALLOWED_FAULTS:
            raise ValueError(f"disallowed fault: {exp.fault_type}")
        if exp.blast_radius not in {"pod", "deployment", "namespace"}:
            raise ValueError(f"disallowed blast radius: {exp.blast_radius}")
    return proposals

The validation step matters. Models invent fault types when they think they have a good idea. Reject anything outside the allowed set.
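Model replies sometimes wrap the JSON in a sentence or two, which would make model_validate_json fail before the allowlist check ever runs. One way to tolerate that is a small pre-parse step; this is a sketch using only the standard library, and the function name extract_json_object is hypothetical.

```python
import json

def extract_json_object(text: str) -> dict:
    """Return the first JSON object that can be decoded from a model reply.

    Tolerates surrounding prose; raises ValueError if no object is found.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    raise ValueError("no JSON object in model reply")

reply = 'Sure, here is the plan: {"experiments": [{"name": "cache-500s"}]} Let me know.'
print(extract_json_object(reply))
```

The extracted dict can then go through Proposals.model_validate and the same allowlist checks, so a chatty reply is still rejected if its content is wrong.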

5. Generating LitmusChaos Manifests

Each proposal becomes a ChaosEngine manifest. The LitmusChaos 3.10 ChaosEngine CRD maps cleanly onto our proposal schema.

# manifests/render.py
import yaml

def to_chaos_engine(exp: ExperimentProposal) -> dict:
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {
            "name": exp.name,
            "namespace": "chaos",
            "annotations": {
                "hypothesis": exp.hypothesis,
                "expected": exp.expected_outcome,
                "estimated_risk": exp.estimated_risk,
            },
        },
        "spec": {
            "engineState": "stop",  # default off, human flips to active
            "appinfo": {
                "appns": guess_namespace(exp.target_service),
                "applabel": f"app={exp.target_service}",
                "appkind": "deployment",
            },
            "chaosServiceAccount": "litmus-runner",
            "experiments": [{
                "name": exp.fault_type,
                "spec": {
                    "components": {
                        "env": [
                            {"name": k.upper(), "value": str(v)}
                            for k, v in exp.fault_params.items()
                        ],
                    },
                    "probe": render_slo_probe(exp),
                },
            }],
        },
    }

def render_slo_probe(exp: ExperimentProposal) -> list[dict]:
    return [{
        "name": "slo-burn-rate-check",
        "type": "promProbe",
        "mode": "Continuous",
        "runProperties": {
            "probeTimeout": "10s",
            "interval": "30s",
            "stopOnFailure": True,
        },
        "promProbe/inputs": {
            "endpoint": "http://prometheus:9090",
            "query": f'slo:sli_error:ratio_rate5m{{service="{exp.target_service}"}}',
            "comparator": {
                "type": "float",
                "criteria": "<=",
                "value": "0.072",  # 14.4x burn on a 99.5% SLO
            },
        },
    }]

The stopOnFailure: True on the prom probe is the abort hook. If the SLO burn rate crosses the critical threshold during the experiment, Litmus aborts and reverts.
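The 0.072 threshold in render_slo_probe is the error ratio at which a 99.5% SLO burns its budget 14.4 times faster than sustainable. A small sketch of that arithmetic, in case your SLO target differs (the function name is illustrative):

```python
def burn_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Error ratio at which the error budget burns `burn_rate` times too fast."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target  # e.g. 0.005 for a 99.5% SLO
    return error_budget * burn_rate

print(round(burn_rate_threshold(0.995, 14.4), 4))  # 0.072
print(round(burn_rate_threshold(0.999, 14.4), 4))  # 0.0144
```

Deriving the value from the SLO target, rather than hardcoding it, keeps the probe correct for services with stricter or looser objectives.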

6. The Pull Request Flow

The whole thing runs as a GitHub Action on a weekly schedule. The output is a PR that an SRE reviews.

# .github/workflows/chaos-proposals.yaml
name: chaos-proposals
on:
  schedule:
    - cron: '0 9 * * 1'  # Monday 9 AM UTC
  workflow_dispatch:

jobs:
  propose:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: python -m hypothesis.cli generate
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PROMETHEUS_URL: ${{ secrets.PROM_URL }}
          TEMPO_URL: ${{ secrets.TEMPO_URL }}
      - uses: peter-evans/create-pull-request@v6
        with:
          title: "Weekly chaos hypotheses"
          body-path: proposals.md
          branch: chaos-proposals-${{ github.run_id }}
          labels: chaos, needs-review

The CLI writes both proposals.md (human-readable summary) and a directory of ChaosEngine manifests. The SRE reviews the PR, deletes any they don’t like, and merges. A merged PR triggers a second workflow that applies the manifests and flips engineState: active after a 24-hour delay.

The 24-hour delay is a cooling-off period. It catches the case where the reviewer approved late in the day and changed their mind by the next morning.
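The second workflow is not shown, but the delay check at its core could be as small as the sketch below. The function name and the idea of passing the merge timestamp in explicitly are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

COOLING_OFF = timedelta(hours=24)

def ready_to_activate(merged_at: datetime, now: datetime) -> bool:
    """True once the full cooling-off period has elapsed since the merge."""
    return now - merged_at >= COOLING_OFF

merged = datetime(2025, 5, 19, 15, 0, tzinfo=timezone.utc)
print(ready_to_activate(merged, merged + timedelta(hours=23)))  # False
print(ready_to_activate(merged, merged + timedelta(hours=24)))  # True
```

Reverting the merge commit during the window removes the manifest before this check ever passes, which is what makes the delay a real veto.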

7. Running the Experiment

Once active, LitmusChaos schedules the experiment in a chaos window — typically business hours, never on Fridays or before holidays.

# scheduler/window.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: pod-delete-checkout
spec:
  schedule:
    repeat:
      timeRange:
        startTime: '2025-05-22T10:00:00Z'
        endTime: '2025-05-22T16:00:00Z'
      properties:
        minChaosInterval: '30m'
        random: true
  engineTemplateSpec:
    # ... references the ChaosEngine

random: true plus minChaosInterval: 30m means runs start at random times inside the window, at least 30 minutes apart. The randomness matters: predictable chaos is rehearsed chaos, which doesn’t surface real weaknesses.
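The "business hours, never on Fridays or before holidays" rule has to live somewhere, for example in whatever job generates the ChaosSchedule time ranges. A hedged sketch of such a guard follows; the holiday set and function name are hypothetical, and weekends are excluded as an extra assumption.

```python
from datetime import date, datetime, time, timedelta, timezone

# Hypothetical holiday calendar; replace with your organisation's own.
HOLIDAYS = {date(2025, 5, 29)}

def in_chaos_window(now: datetime,
                    start: time = time(10, 0),
                    end: time = time(16, 0)) -> bool:
    """Allow chaos only Monday to Thursday, inside business hours,
    and never on a holiday or the day before one."""
    today = now.date()
    if now.weekday() >= 4:  # 4 = Friday, 5 and 6 = weekend
        return False
    if today in HOLIDAYS or today + timedelta(days=1) in HOLIDAYS:
        return False
    return start <= now.time() < end

thursday = datetime(2025, 5, 22, 11, 0, tzinfo=timezone.utc)
friday = datetime(2025, 5, 23, 11, 0, tzinfo=timezone.utc)
print(in_chaos_window(thursday))  # True
print(in_chaos_window(friday))    # False
```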

8. Feedback Loop into Postmortems

When an experiment surfaces a real issue (probe abort, unexpected page, performance regression), the runner writes a “chaos finding” entry to the same repo as postmortems.

# postmortems/2025-05-22-chaos-finding-checkout-cache.md
---
title: Checkout cache no circuit breaker
date: 2025-05-22
services: [checkout, inventory-cache]
root_cause: missing circuit breaker on inventory-cache client
contributing_factors:
  - chaos experiment "inventory-cache-error" induced 10% 500s
  - checkout retried on every error with no backoff
type: chaos_finding
hypothesis_confirmed: false
hypothesis: "checkout would degrade gracefully under 10% upstream errors"
actual: "checkout p99 latency rose to 8s within 90 seconds"
---

The next hypothesis generator run reads this file. The model now knows checkout has retry storms, and proposes follow-up experiments — maybe a network latency variant, maybe a partial DNS failure. The system gets sharper week over week.

9. Common Pitfalls

Four mistakes to skip.

Letting the model name fault types freely. Models invent plausible-sounding fault names like “pod-network-partition” that don’t exist in Litmus. Validate against the allowed set and reject the whole proposal if any are wrong.
No probe abort. Running a chaos experiment without an SLO-based abort is a real outage waiting to happen. Every experiment must have a probe.
Running chaos in prod without a staging trial. Even with the abort hook, novel experiments belong in staging first. Add a target_env: staging field to your proposal schema and only auto-merge proposals for staging.
Treating proposals as a backlog. Old proposals go stale fast — the topology changes, the postmortem history changes. Delete proposals not run within two weeks and regenerate.
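The last two pitfalls translate into simple rules. The sketch below assumes each proposal is tracked as a dict with the hypothetical fields created, ran, and target_env; none of these names come from Litmus.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=14)

def prune_stale(proposals: list[dict], today: date) -> list[dict]:
    """Drop proposals that were never run and are older than two weeks."""
    return [
        p for p in proposals
        if p["ran"] or today - p["created"] <= MAX_AGE
    ]

def may_auto_merge(proposal: dict) -> bool:
    """Only staging proposals skip human review; a missing field means no."""
    return proposal.get("target_env") == "staging"

today = date(2025, 6, 9)
proposals = [
    {"name": "fresh", "created": date(2025, 6, 2), "ran": False, "target_env": "staging"},
    {"name": "stale", "created": date(2025, 5, 19), "ran": False, "target_env": "staging"},
    {"name": "done", "created": date(2025, 5, 19), "ran": True, "target_env": "prod"},
]
kept = prune_stale(proposals, today)
print([p["name"] for p in kept])                         # ['fresh', 'done']
print([p["name"] for p in kept if may_auto_merge(p)])    # ['fresh']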

10. Troubleshooting

Three failure modes you’ll hit.

10.1 LitmusChaos doesn’t abort on probe failure

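Check that stopOnFailure is true and that the probe’s comparator.value is realistic. The probe passes while the query result satisfies the comparator, so a common bug is a threshold so loose that the probe never fails (e.g. <= 0.5 on an error ratio that normally runs at 0.001). The opposite mistake, a threshold below the normal error ratio (e.g. <= 0.0001 against a baseline of 0.001), fails the probe on the first check and aborts every run.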

10.2 Hypothesis generator suggests the same five experiments every week

You’re feeding it the same context every week. Make sure the postmortem list excludes already-confirmed hypotheses, and rotate which services are in the input topology.
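Both remedies can be applied just before the generator call. This is a sketch only: it reuses the type and hypothesis_confirmed fields from the chaos finding frontmatter, while the function names and the week-based rotation scheme are assumptions.

```python
def exclude_confirmed(entries: list[dict]) -> list[dict]:
    """Drop chaos findings whose hypothesis held; keep everything else."""
    return [
        e for e in entries
        if not (e.get("type") == "chaos_finding"
                and e.get("hypothesis_confirmed") is True)
    ]

def rotate(services: list[str], week: int, size: int) -> list[str]:
    """Pick a window of `size` services that shifts every week."""
    ordered = sorted(services)
    if not ordered:
        return []
    start = (week * size) % len(ordered)
    doubled = ordered + ordered  # lets the window wrap around the end
    return doubled[start:start + min(size, len(ordered))]

entries = [
    {"title": "Checkout cache no circuit breaker",
     "type": "chaos_finding", "hypothesis_confirmed": False},
    {"title": "Gateway survives pod delete",
     "type": "chaos_finding", "hypothesis_confirmed": True},
    {"title": "DNS outage"},
]
print([e["title"] for e in exclude_confirmed(entries)])
print(rotate(["a", "b", "c", "d", "e"], week=2, size=2))  # ['e', 'a']
```

Refuted hypotheses stay in the context on purpose: they are the ones worth following up.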

10.3 PR contains malformed YAML

Pydantic validates the model output, but the YAML rendering is your code. If the rendering breaks on edge cases (unicode in service names, multi-line annotations), the PR will fail to apply. Run kubectl apply --dry-run=client -f in the CI step before opening the PR.
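If hand-rolled YAML quoting keeps biting, one alternative (a different technique from the YAML rendering described above) is to emit the manifests as JSON, which kubectl also accepts. A minimal sketch, with render_manifest as a hypothetical name:

```python
import json

def render_manifest(manifest: dict) -> str:
    """Serialise a manifest as JSON so quoting and escaping are handled
    by the standard library rather than by template code."""
    return json.dumps(manifest, indent=2, ensure_ascii=False) + "\n"

engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {
        "name": "inventory-cache-error",
        "annotations": {
            # Multi-line text and non-ASCII characters survive unchanged.
            "hypothesis": "if the cache returns 500s\nthen checkout p99 stays under 2s",
            "owner": "équipe-paiement",
        },
    },
}
text = render_manifest(engine)
assert json.loads(text) == engine  # round-trips without loss
print(text)
```

The trade-off is readability in the PR diff: JSON is noisier than YAML, so this fits best when reviewers read proposals.md and treat the manifests as generated output.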

11. Wrapping Up

Chaos engineering’s bottleneck has always been imagination. The LLM doesn’t run the experiments, but it surfaces variants a tired SRE wouldn’t have proposed. With a tight allowlist of fault types, a probe-driven abort, and a PR-based review, the whole loop becomes safe enough to run weekly.

For the SLO probe details that gate the experiments, see SLOs and burn rate alerting in 2025, a practical guide. The LitmusChaos documentation covers the experiment catalog and probe types in depth, and it’s worth a full read before wiring this up.

