Chaos Engineering with AI Augmented Hypotheses
TL;DR — Let the LLM propose hypotheses from your service graph and postmortem history, run them as LitmusChaos experiments with strict blast radius caps, and abort automatically on SLO burn rate breach.
The classic chaos engineering pitch is “break things on purpose to find weaknesses”. The pitch works on slides. In practice, teams either run the same five experiments forever (pod kills, network latency, region drain) or they don’t run any. The hard part isn’t the breaking, it’s deciding what to break next.
This is where an LLM is genuinely useful. Given a service topology, an incident history, and a recent set of code changes, claude-3.7-sonnet can propose hypotheses that a human SRE wouldn’t get to for a few weeks. “Your checkout service depends on a recently-added inventory cache, and the cache has no circuit breaker — hypothesis: if the cache returns 500s for 10% of requests, checkout p99 latency stays under 2s.” That’s a useful experiment proposal.
But the model doesn’t run anything. LitmusChaos 3.10 does, with a strict blast radius config and an abort hook tied to your SLO burn rate. The model is upstream of the experiment design; everything downstream is deterministic Kubernetes resources you can review like any other manifest.
1. The Pipeline
incident history + topology + recent diffs
        |
        v
+----------------+
| Hypothesis     |  claude-3.7-sonnet
| generator      |  output: list of typed experiment proposals
+----------------+
        |
        v
+----------------+
| Human review   |  PR with proposed ChaosEngine manifests
+----------------+
        |
        v
+----------------+
| LitmusChaos    |  scheduled run, abort on SLO breach
| 3.10           |
+----------------+
        |
        v
+----------------+
| Results +      |  back into incident history
| Postmortem     |
+----------------+
The model never touches the cluster. Its output is a pull request.
2. Modeling the Service Topology
The hypothesis generator needs a graph of services and their dependencies. The cheapest source is your service mesh or your Tempo 2.6 traces.
# topology/build.py
import time
from collections import defaultdict

import httpx

async def build_topology(window_hours: int = 24) -> dict:
    """Build a service graph from Tempo traces."""
    now = int(time.time())
    async with httpx.AsyncClient(timeout=30) as c:
        r = await c.get("http://tempo:3200/api/search", params={
            "tags": "",
            "minDuration": "0ms",
            "limit": 1000,
            # Tempo's search API takes start/end as Unix epoch seconds
            "start": now - window_hours * 3600,
            "end": now,
        })
        r.raise_for_status()
    traces = r.json()["traces"]
    edges = defaultdict(int)
    for trace in traces:
        # fetch_spans is a helper not shown here; it is expected to return
        # spans annotated with "service" and "parent_service"
        spans = await fetch_spans(trace["traceID"])
        for span in spans:
            parent = span.get("parent_service")
            child = span.get("service")
            if parent and child and parent != child:
                edges[(parent, child)] += 1
    return {
        "edges": [{"from": a, "to": b, "weight": w}
                  for (a, b), w in edges.items()],
        "services": list(set([a for a, _ in edges] + [b for _, b in edges])),
    }
The output is a list of weighted edges. Weight is call count over the window. Combined with a short description of each service (pulled from your service catalog), this is enough context.
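If the full graph is too large to paste into a prompt, one option is to keep only the heaviest edges. The sketch below is an illustration rather than part of the pipeline above: top_edges and its n parameter are hypothetical names, and it assumes the dict shape that build_topology returns.

```python
def top_edges(topology: dict, n: int = 50) -> dict:
    """Keep the n highest-weight edges and the services they touch."""
    edges = sorted(topology["edges"], key=lambda e: e["weight"], reverse=True)[:n]
    services = sorted({e["from"] for e in edges} | {e["to"] for e in edges})
    return {"edges": edges, "services": services}


# Made-up sample data in the same shape as build_topology's output
sample = {
    "edges": [
        {"from": "web", "to": "checkout", "weight": 900},
        {"from": "checkout", "to": "inventory-cache", "weight": 450},
        {"from": "web", "to": "search", "weight": 12},
    ],
    "services": ["web", "checkout", "inventory-cache", "search"],
}
trimmed = top_edges(sample, n=2)
```

Trimming by weight biases the model toward hot paths, so rarely-called but critical dependencies may need to be added back by hand.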
3. Pulling Incident History
The other input is past pain. Read the last 90 days of postmortems.
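# history/postmortems.py
from datetime import datetime, timedelta, timezone
from glob import glob

async def recent_postmortems(days: int = 90) -> list[dict]:
    # assumes postmortems are markdown in a git repo with frontmatter;
    # parse_frontmatter is a helper not shown here, and meta["date"] is
    # assumed to be a datetime.date
    out = []
    cutoff = datetime.now(timezone.utc).date() - timedelta(days=days)
    for path in glob("postmortems/*.md"):
        with open(path) as f:
            meta, body = parse_frontmatter(f.read())
        if meta["date"] < cutoff:
            continue
        out.append({
            "title": meta["title"],
            "date": meta["date"].isoformat(),
            "services": meta.get("services", []),
            "root_cause": meta.get("root_cause"),
            "contributing": meta.get("contributing_factors", []),
            "summary": body[:600],  # cap
        })
    # glob order is arbitrary, so sort by date before keeping the newest 15
    out.sort(key=lambda p: p["date"])
    return out[-15:]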
Cap at 15. The model doesn’t need all 90 days of incidents, just the recent texture.
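The postmortem reader leans on a frontmatter parser that the article does not show. A minimal sketch of the splitting half follows, using only the standard library; split_frontmatter is a hypothetical name, and the assumption is that frontmatter is delimited by lines containing only ---.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a markdown document into (frontmatter, body).

    Returns an empty frontmatter string if the document has no leading
    '---' block.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            front = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return front, body
    return "", text  # opening delimiter without a closing one: treat as body


doc = "---\ntitle: Checkout cache no circuit breaker\ndate: 2025-05-22\n---\nBody text."
front, body = split_frontmatter(doc)
```

The frontmatter string can then go through a YAML loader such as PyYAML’s yaml.safe_load, which parses an unquoted 2025-05-22 into a datetime.date.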
4. The Hypothesis Generator
The prompt is specific about format. Structured output, no prose.
# hypothesis/generator.py
import json

from anthropic import AsyncAnthropic
from pydantic import BaseModel, Field

class ExperimentProposal(BaseModel):
    name: str                    # short kebab-case
    hypothesis: str              # "if X then Y"
    target_service: str
    fault_type: str              # one of an enum
    fault_params: dict
    blast_radius: str            # pod | deployment | namespace
    expected_outcome: str
    abort_conditions: list[str]
    estimated_risk: str          # low | medium | high
    rationale: str               # why this experiment, refs to incidents

class Proposals(BaseModel):
    experiments: list[ExperimentProposal]

ALLOWED_FAULTS = {
    "pod-delete", "pod-cpu-hog", "pod-memory-hog",
    "network-latency", "network-loss", "network-corruption",
    "dns-error", "http-status-code", "disk-fill",
}

# The last sentence spells out the output format so the reply parses as Proposals
SYSTEM = f"""You are a chaos engineering planner. Given a service topology and
recent postmortems, propose 3-5 chaos experiments that would surface real
weaknesses. Use only these fault types: {sorted(ALLOWED_FAULTS)}. Set blast_radius
to the smallest unit that can produce the effect. Reference specific postmortems
in rationale by title. Be specific about fault_params. Never propose an experiment
that would affect a region or cluster. Respond with a single JSON object of the
form {{"experiments": [...]}} and no prose; each experiment has exactly these
fields: {list(ExperimentProposal.model_fields)}."""

claude = AsyncAnthropic()

async def generate(topology: dict, postmortems: list[dict]) -> Proposals:
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=2048,
        system=SYSTEM,
        messages=[{"role": "user", "content": json.dumps({
            "topology": topology,
            "postmortems": postmortems,
        })}],
    )
    proposals = Proposals.model_validate_json(msg.content[0].text)
    # validate fault types and blast radius against the allowlists
    for exp in proposals.experiments:
        if exp.fault_type not in ALLOWED_FAULTS:
            raise ValueError(f"disallowed fault: {exp.fault_type}")
        if exp.blast_radius not in {"pod", "deployment", "namespace"}:
            raise ValueError(f"disallowed blast radius: {exp.blast_radius}")
    return proposals
The validation step matters. Models invent fault types when they think they have a good idea. Reject anything outside the allowed set.
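Validation only helps if the reply parses in the first place. Models sometimes wrap JSON in a markdown code fence even when told not to, so a small pre-parse step can be worth having. The sketch below is one way to do that; extract_json is a hypothetical helper, not part of the Anthropic SDK or Pydantic.

```python
import json

FENCE = "`" * 3  # the markdown code-fence marker


def extract_json(text: str) -> str:
    """Return the JSON payload from a model reply, tolerating a code fence."""
    stripped = text.strip()
    if stripped.startswith(FENCE):
        lines = stripped.splitlines()
        # drop the closing fence if present, then the opening fence line
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]
        stripped = "\n".join(lines[1:]).strip()
    json.loads(stripped)  # raises json.JSONDecodeError if this is still not JSON
    return stripped


reply = FENCE + 'json\n{"experiments": []}\n' + FENCE
payload = extract_json(reply)
```

The returned string can be handed to Proposals.model_validate_json in place of the raw reply text.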
5. Generating LitmusChaos Manifests
Each proposal becomes a ChaosEngine manifest. The LitmusChaos 3.10 ChaosEngine CRD maps closely onto our proposal schema.
# manifests/render.py
import yaml

from hypothesis.generator import ExperimentProposal

def to_chaos_engine(exp: ExperimentProposal) -> dict:
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {
            "name": exp.name,
            "namespace": "chaos",
            "annotations": {
                "hypothesis": exp.hypothesis,
                "expected": exp.expected_outcome,
                "estimated_risk": exp.estimated_risk,
            },
        },
        "spec": {
            "engineState": "stop",  # default off, human flips to active
            "appinfo": {
                # guess_namespace is a helper not shown here
                "appns": guess_namespace(exp.target_service),
                "applabel": f"app={exp.target_service}",
                "appkind": "deployment",
            },
            "chaosServiceAccount": "litmus-runner",
            "experiments": [{
                "name": exp.fault_type,
                "spec": {
                    "components": {
                        "env": [
                            {"name": k.upper(), "value": str(v)}
                            for k, v in exp.fault_params.items()
                        ],
                    },
                    "probe": render_slo_probe(exp),
                },
            }],
        },
    }

def render_slo_probe(exp: ExperimentProposal) -> list[dict]:
    return [{
        "name": "slo-burn-rate-check",
        "type": "promProbe",
        "mode": "Continuous",
        "runProperties": {
            "probeTimeout": "10s",
            "interval": "30s",
            "stopOnFailure": True,
        },
        "promProbe/inputs": {
            "endpoint": "http://prometheus:9090",
            "query": f'slo:sli_error:ratio_rate5m{{service="{exp.target_service}"}}',
            "comparator": {
                "type": "float",
                "criteria": "<=",
                "value": "0.072",  # 14.4x burn on a 99.5% SLO
            },
        },
    }]
The stopOnFailure: True on the prom probe is the abort hook. If the SLO burn rate crosses the critical threshold during the experiment, Litmus aborts and reverts.
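The 0.072 in the comparator comes from simple arithmetic: a 99.5% SLO leaves an error budget of 0.005, and 14.4 times that is 0.072. A small helper makes the number reproducible for other targets. This is a sketch; burn_rate_threshold is a hypothetical name, and it returns a string only because the probe above stores its comparator value as one.

```python
def burn_rate_threshold(slo_target: float, burn_rate: float) -> str:
    """Error-ratio threshold for a given SLO target and burn-rate multiple.

    error budget = 1 - slo_target; threshold = burn_rate * error budget.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction, e.g. 0.995")
    threshold = burn_rate * (1 - slo_target)
    # fixed-point formatting avoids float noise and exponent notation
    return f"{threshold:.6f}".rstrip("0").rstrip(".")


critical = burn_rate_threshold(0.995, 14.4)
```

For a stricter 99.9% SLO the same multiple gives a much tighter threshold, which is why a hardcoded value should not be shared across services with different targets.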
6. The Pull Request Flow
The whole thing runs as a GitHub Action on a weekly schedule. The output is a PR that an SRE reviews.
# .github/workflows/chaos-proposals.yaml
name: chaos-proposals
on:
  schedule:
    - cron: '0 9 * * 1' # Monday 9 AM UTC
  workflow_dispatch:
jobs:
  propose:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: python -m hypothesis.cli generate
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PROMETHEUS_URL: ${{ secrets.PROM_URL }}
          TEMPO_URL: ${{ secrets.TEMPO_URL }}
      - uses: peter-evans/create-pull-request@v6
        with:
          title: "Weekly chaos hypotheses"
          body-path: proposals.md
          branch: chaos-proposals-${{ github.run_id }}
          labels: chaos, needs-review
The CLI writes both proposals.md (human-readable summary) and a directory of ChaosEngine manifests. The SRE reviews the PR, deletes any they don’t like, and merges. A merged PR triggers a second workflow that applies the manifests and flips engineState: active after a 24-hour delay.
The 24-hour delay is a cooling-off period. It catches the case where the reviewer approved on Friday afternoon and changed their mind by Monday.
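The cooling-off check itself is a one-line comparison. A minimal sketch of what the second workflow might run before flipping engineState is shown below; ready_to_activate and COOLING_OFF are hypothetical names, and the merge timestamp is assumed to come from the merged PR.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COOLING_OFF = timedelta(hours=24)


def ready_to_activate(merged_at: datetime, now: Optional[datetime] = None) -> bool:
    """True once the cooling-off period since the merge has fully elapsed."""
    now = now or datetime.now(timezone.utc)
    return now - merged_at >= COOLING_OFF


merged = datetime(2025, 5, 19, 9, 0, tzinfo=timezone.utc)
too_early = ready_to_activate(merged, merged + timedelta(hours=23))
on_time = ready_to_activate(merged, merged + timedelta(hours=24))
```

Using timezone-aware datetimes on both sides avoids the naive-versus-aware comparison error that Python raises when the two are mixed.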
7. Running the Experiment
Once active, LitmusChaos schedules the experiment in a chaos window — typically business hours, never on Fridays or before holidays.
# scheduler/window.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: pod-delete-checkout
spec:
  schedule:
    repeat:
      timeRange:
        startTime: '2025-05-22T10:00:00Z'
        endTime: '2025-05-22T16:00:00Z'
      properties:
        minChaosInterval: '30m'
        random: true
  engineTemplateSpec:
    # ... references the ChaosEngine
random: true plus minChaosInterval: 30m means runs start at random times inside the window, at least 30 minutes apart. The randomness matters: predictable chaos is rehearsed chaos, which doesn’t surface real weaknesses.
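The calendar rules (business hours, no Fridays, nothing right before a holiday) are easy to encode as a guard that runs before a schedule is created. The sketch below assumes the 10:00 to 16:00 window from the manifest above and a caller-supplied set of holiday dates; in_chaos_window is a hypothetical helper, not a Litmus feature.

```python
from datetime import date, datetime, timedelta


def in_chaos_window(ts: datetime, holidays: set, start_hour: int = 10,
                    end_hour: int = 16) -> bool:
    """True if ts falls Monday to Thursday, inside working hours, and
    neither that day nor the following day is a holiday."""
    day = ts.date()
    if ts.weekday() >= 4:  # 4 is Friday, 5 and 6 are the weekend
        return False
    if day in holidays or day + timedelta(days=1) in holidays:
        return False
    return start_hour <= ts.hour < end_hour


thursday_ok = in_chaos_window(datetime(2025, 5, 22, 11, 0), set())
friday = in_chaos_window(datetime(2025, 5, 23, 11, 0), set())
```

Passing the holiday set in, rather than hardcoding it, keeps the function testable and lets each team supply its own calendar.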
8. Feedback Loop into Postmortems
When an experiment surfaces a real issue (probe abort, unexpected page, performance regression), the runner writes a “chaos finding” entry to the same repo as postmortems.
# postmortems/2025-05-22-chaos-finding-checkout-cache.md
---
title: Checkout cache no circuit breaker
date: 2025-05-22
services: [checkout, inventory-cache]
root_cause: missing circuit breaker on inventory-cache client
contributing_factors:
- chaos experiment "inventory-cache-error" induced 10% 500s
- checkout retried on every error with no backoff
type: chaos_finding
hypothesis_confirmed: false
hypothesis: "checkout would degrade gracefully under 10% upstream errors"
actual: "checkout p99 latency rose to 8s within 90 seconds"
---
The next hypothesis generator run reads this file. The model now knows checkout has retry storms, and proposes follow-up experiments — maybe a network latency variant, maybe a partial DNS failure. The system gets sharper week over week.
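Writing the finding can be a plain string render. The sketch below produces frontmatter with a subset of the fields shown above; render_finding is a hypothetical name, and quoting strings with json.dumps is a shortcut that relies on JSON strings being acceptable as YAML double-quoted scalars.

```python
import json


def render_finding(title: str, day: str, services: list, root_cause: str,
                   hypothesis: str, actual: str, confirmed: bool) -> str:
    """Render a chaos finding as markdown frontmatter text."""
    q = json.dumps  # quote free-text fields so colons and quotes stay safe
    lines = [
        "---",
        f"title: {q(title)}",
        f"date: {day}",
        f"services: [{', '.join(services)}]",
        f"root_cause: {q(root_cause)}",
        "type: chaos_finding",
        f"hypothesis_confirmed: {'true' if confirmed else 'false'}",
        f"hypothesis: {q(hypothesis)}",
        f"actual: {q(actual)}",
        "---",
    ]
    return "\n".join(lines) + "\n"


text = render_finding(
    title="Checkout cache no circuit breaker",
    day="2025-05-22",
    services=["checkout", "inventory-cache"],
    root_cause="missing circuit breaker on inventory-cache client",
    hypothesis="checkout would degrade gracefully under 10% upstream errors",
    actual="checkout p99 latency rose to 8s within 90 seconds",
    confirmed=False,
)
```

Service names are written unquoted here on the assumption that they are simple identifiers; anything more exotic should be quoted as well.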
9. Common Pitfalls
Four mistakes to skip.
- Letting the model name fault types freely. Models invent plausible-sounding fault names like “pod-network-partition” that don’t exist in Litmus. Validate against the allowed set and reject the whole proposal if any are wrong.
- No probe abort. Running a chaos experiment without an SLO-based abort is a real outage waiting to happen. Every experiment must have a probe.
- Running chaos in prod without a staging trial. Even with the abort hook, novel experiments belong in staging first. Add a target_env: staging field to your proposal schema and only auto-merge proposals for staging.
- Treating proposals as a backlog. Old proposals go stale fast: the topology changes, the postmortem history changes. Delete proposals not run within two weeks and regenerate.
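The two-week rule in the last pitfall is simple to automate. A minimal sketch, assuming each proposal record carries a proposed_on date (a hypothetical field that is not in the schema shown earlier):

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=14)


def split_stale(proposals: list, today: date) -> tuple:
    """Partition proposals into (fresh, stale) by age."""
    fresh, stale = [], []
    for p in proposals:
        bucket = stale if today - p["proposed_on"] > MAX_AGE else fresh
        bucket.append(p)
    return fresh, stale


backlog = [
    {"name": "inventory-cache-error", "proposed_on": date(2025, 5, 19)},
    {"name": "checkout-dns-error", "proposed_on": date(2025, 5, 1)},
]
fresh, stale = split_stale(backlog, today=date(2025, 5, 22))
```

The stale list is what gets deleted before the next generator run.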
10. Troubleshooting
Three failure modes you’ll hit.
10.1 LitmusChaos doesn’t abort on probe failure
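Check that stopOnFailure is true and that the probe’s comparator.value is realistic. A common bug is setting the threshold so loose that the probe never fails (e.g. <= 0.5 when the critical burn threshold is 0.072). The opposite mistake, a threshold below the normal error rate (e.g. <= 0.0001 on an SLO that normally runs at 0.001), fails the probe on the first check and aborts every run.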
10.2 Hypothesis generator suggests the same five experiments every week
You’re feeding it the same context every week. Make sure the postmortem list excludes already-confirmed hypotheses, and rotate which services are in the input topology.
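Rotation can be deterministic, so a given week always sees the same slice of services. A minimal sketch follows; rotate_services is a hypothetical helper, and the week number is assumed to come from something like the ISO calendar week.

```python
def rotate_services(services: list, week: int, k: int) -> list:
    """Pick k services for the given week, cycling through the sorted list."""
    ordered = sorted(services)
    if not ordered:
        return []
    k = min(k, len(ordered))
    start = (week * k) % len(ordered)
    return [ordered[(start + i) % len(ordered)] for i in range(k)]


all_services = ["web", "checkout", "search", "inventory-cache", "auth"]
week0 = rotate_services(all_services, week=0, k=2)
week1 = rotate_services(all_services, week=1, k=2)
```

The topology edges would then be filtered down to those touching the selected services before the prompt is built.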
10.3 PR contains malformed YAML
Pydantic validates the model output, but the YAML rendering is your code. If the rendering breaks on edge cases (unicode in service names, multi-line annotations), the PR will fail to apply. Run kubectl apply --dry-run=client -f against the rendered manifests in the CI step before opening the PR.
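Some of these edge cases can be caught before kubectl runs. The model supplies the experiment name, and Kubernetes rejects names that are not valid DNS-style identifiers. The sketch below checks the DNS label rule (lowercase alphanumerics and hyphens, alphanumeric at both ends, at most 63 characters), which is stricter than what most resource kinds require and so is a safe subset; is_valid_k8s_name is a hypothetical helper.

```python
import re

# DNS label: lowercase alphanumerics or '-', alphanumeric at both ends
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_k8s_name(name: str) -> bool:
    """True if name is a valid DNS label of at most 63 characters."""
    return len(name) <= 63 and _DNS_LABEL.fullmatch(name) is not None


ok = is_valid_k8s_name("inventory-cache-error")
bad = is_valid_k8s_name("Inventory_Cache")
```

Rejecting a bad name at proposal time gives a clearer error than a failed apply later in the pipeline.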
11. Wrapping Up
Chaos engineering’s bottleneck has always been imagination. The LLM doesn’t run the experiments, but it surfaces variants a tired SRE wouldn’t have proposed. With a tight allowlist of fault types, a probe-driven abort, and a PR-based review, the whole loop becomes safe enough to run weekly.
For the SLO probe details that gate the experiments, see SLOs and burn rate alerting in 2025, a practical guide. The LitmusChaos documentation covers the experiment catalog and probe types in depth — it’s worth a full read before wiring this up.