Chaos Engineering on Kubernetes, Litmus and Chaos Mesh in 2024


June 10, 2024 · 7 min read · by Muhammad Amal · programming

TL;DR — Litmus 3.x and Chaos Mesh 2.6 are both production-ready in 2024. Pick Litmus if you want a curated experiment hub and ChaosCenter UX. Pick Chaos Mesh if you want fine-grained network and IO faults and CLI-first workflows. Either way, the rigor is in the hypothesis, not the tool.

The hardest sell for chaos engineering used to be “you want to break production on purpose?” In 2024 that sell is easier because everyone has been burned by an outage that retrospectively looks obvious. The harder sell now is “you want to run chaos experiments without an SLO and an error budget?” The answer to that one should be no, and I’ll explain why.

This post compares Litmus 3.x and Chaos Mesh 2.6, the two CNCF projects most teams will choose between. Both are mature. Both have rough edges. The decision is mostly about your team’s workflow, not capability. I’ve used both in anger on clusters running Kubernetes 1.30.

Prerequisite reading. If you don’t have SLOs and error budgets yet, chaos engineering will just stress your team. The SLO is the hypothesis. Without it, you’re just breaking things.

What chaos engineering actually is

A chaos experiment has four parts:

A steady-state hypothesis. “Checkout latency p95 stays under 300ms” or “5xx rate stays under 0.5%.”
A perturbation. Kill a pod, drop 30% of packets, add 100ms latency, fill a disk.
A blast radius. Which pods, which namespace, for how long.
A rollback plan. What happens if the experiment goes worse than expected.

If you can’t articulate all four, you’re not doing chaos engineering. You’re doing the kubectl-delete-pod-and-pray school of incident generation. I’ve watched senior engineers do this. It’s not useful.
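
One way to keep yourself honest is to write the four parts down as data before touching a cluster. The sketch below is only an illustration: the class and field names are invented for this post and are not part of Litmus or Chaos Mesh.

```python
from dataclasses import dataclass


@dataclass
class ExperimentPlan:
    """Hypothetical record of the four parts of a chaos experiment."""
    hypothesis: str    # steady state, e.g. "5xx rate stays under 0.5%"
    perturbation: str  # the fault to inject
    blast_radius: str  # which pods, which namespace, for how long
    rollback: str      # what to do if it goes worse than expected

    def missing_parts(self) -> list[str]:
        """Return the names of any parts that were left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]


plan = ExperimentPlan(
    hypothesis="Checkout latency p95 stays under 300ms",
    perturbation="Delete 33% of checkout pods",
    blast_radius="namespace=checkout, 60 seconds",
    rollback="",
)
print(plan.missing_parts())  # ['rollback']
```

A plan with anything in that list is not ready to run.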

Both Litmus and Chaos Mesh encode this structure into their CRDs. That’s why I’d use them over ad-hoc scripts even if you only run one experiment a month.

Litmus 3.x at a glance

Litmus, now at 3.x, leans on a hub of pre-built experiments and a web UI called ChaosCenter. Strengths:

Large catalog of pre-built faults (pod, network, node, app-specific).
Good for distributed orgs because the UI gates who can run what.
Workflow chaining via Argo under the hood.
Strong tenancy story for shared platform teams.

A minimal pod-delete experiment in Litmus 3.x looks like this:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete
  namespace: checkout
spec:
  appinfo:
    appns: checkout
    applabel: "app=checkout"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "15"
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "33"
        probe:
          - name: checkout-availability-slo
            type: promProbe
            mode: Continuous
            runProperties:
              # Litmus 3.x expects duration strings with units here.
              probeTimeout: "5s"
              interval: "10s"
              retry: 0
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring:9090"
              # Return the raw 5xx ratio; the comparator below applies the threshold.
              query: |
                sum(rate(http_requests_total{service="checkout",code=~"5.."}[2m]))
                / sum(rate(http_requests_total{service="checkout"}[2m]))
              comparator:
                type: float
                criteria: "<"
                value: "0.005"

The interesting part is the promProbe. It’s a continuous check tied to your SLO. If 5xx rate creeps above 0.5% during the experiment, Litmus marks the experiment as failed, which is the right semantics: your hypothesis didn’t hold.
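
To make the pass/fail semantics concrete, here is a small sketch of what a continuous check against a comparator amounts to. It is not Litmus code; the function name and the sample values are assumptions for illustration.

```python
import operator

# Comparison operators keyed by the criteria string used in the probe.
CRITERIA = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


def hypothesis_holds(samples: list[float], criteria: str, value: float) -> bool:
    """True only if every sample taken during the run satisfies the comparison."""
    compare = CRITERIA[criteria]
    return all(compare(sample, value) for sample in samples)


# Hypothetical 5xx ratios sampled every 10 seconds during a 60-second run.
quiet_run = [0.001, 0.002, 0.001, 0.003, 0.002, 0.001]
noisy_run = [0.001, 0.002, 0.009, 0.004, 0.002, 0.001]

print(hypothesis_holds(quiet_run, "<", 0.005))  # True
print(hypothesis_holds(noisy_run, "<", 0.005))  # False
```

A single sample over the threshold is enough to fail the run, which is what you want from a steady-state check.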

Chaos Mesh 2.6 at a glance

Chaos Mesh, also CNCF, takes a different posture. CRD-first, dashboard-second. Strengths:

Network chaos is best-in-class. Delay, loss, partition, corruption with tc-level precision.
IO chaos (delay, errno injection) is robust.
Kernel and time chaos for hard-mode experiments.
Simpler mental model if you live in kubectl apply.

The same idea as a network delay experiment:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-downstream-delay
  namespace: checkout
spec:
  action: delay
  mode: fixed-percent
  value: "50"
  selector:
    namespaces:
      - checkout
    labelSelectors:
      "app": "checkout"
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - payments
      labelSelectors:
        "app": "payments"
  delay:
    latency: "100ms"
    jitter: "20ms"
    correlation: "25"
  duration: "5m"

This adds 100ms±20ms of latency to traffic from half of the checkout pods to payments for five minutes. Note that mode: fixed-percent selects pods, not individual requests. It’s the kind of experiment that reveals whether your timeouts and retries are tuned. The first time I ran the equivalent in a real environment we discovered a 30-second default gRPC timeout that turned a 100ms downstream delay into a queue depth excursion. That was free reliability work.
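
A rough way to see why a small delay can become a queue problem is Little's law: average in-flight requests equal arrival rate times time in the system. The request rate and baseline latency below are invented for illustration; they are not measurements from the incident described above.

```python
def in_flight(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    return requests_per_second * latency_seconds


rate = 200.0  # hypothetical request rate from checkout to payments

baseline = in_flight(rate, 0.020)         # 20ms downstream latency
delayed = in_flight(rate, 0.020 + 0.100)  # same call with 100ms injected
stuck = in_flight(rate, 30.0)             # calls that hang until a 30s timeout

print(round(baseline), round(delayed), round(stuck))  # 4 24 6000
```

The injected delay alone is a 6x increase in concurrency. Calls that wait out a 30-second timeout are what blow up the queue.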

Picking between them

A short decision tree:

Network chaos primary use case → Chaos Mesh.
Multi-team shared platform with RBAC needs → Litmus.
Want a chaos-as-code workflow with Git → either; both have good GitOps stories.
Need time chaos (skew the clock) → Chaos Mesh.
Need a UI that PMs and SREs can both use → Litmus ChaosCenter.

In practice many teams run both. Litmus for app-level scenarios, Chaos Mesh for network and IO faults. There’s no penalty.

For broader context, the Principles of Chaos Engineering is the canonical reading. It predates both tools and has aged well.

The safety scaffolding

Tools don’t keep you safe. Process does. The scaffolding I require before any experiment touches a non-dev cluster:

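# Pre-flight checks before running any chaos experiment.
# Stop at the first failed check instead of falling through to kubectl apply.
set -euo pipefail

# 1. Confirm the SLO budget for the target service is healthy.
#    jq -r strips the JSON quotes so awk sees a number, not the string "0.31".
curl -s "https://prom.internal/api/v1/query" \
  --data-urlencode 'query=slo:checkout:budget_remaining_30d' \
  | jq -r '.data.result[0].value[1]' | awk '{ if ($1+0 < 0.25) exit 1 }'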

# 2. Confirm a known-good rollback path exists (e.g. delete the ChaosEngine).
kubectl auth can-i delete chaosengine.litmuschaos.io --namespace checkout

# 3. Notify the on-call channel via the chatops bot.
./scripts/announce.sh "Starting pod-delete on checkout, hypothesis: SLO holds, ttl=60s, owner=$(git config user.email)"

# 4. Apply the experiment.
kubectl apply -f experiments/checkout-pod-delete.yaml

Specifically:

Budget gate. If less than 25% of the error budget is left, no experiments. The whole point of the budget is to spend it on learning, not on emergencies.
Rollback proven. Can you actually kill the experiment? Have you tried? Once, in a test cluster, I couldn’t because the controller’s leader election was on a node I’d just isolated. Funny in retrospect.
Announce in chat. So that when the on-call’s pager goes off, they don’t think it’s real for 90 seconds.
Time-box. Every experiment has a TTL. No experiment runs indefinitely. Both Litmus and Chaos Mesh support this; use it.
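
The budget gate is the check most worth getting right, so here is a sketch of the same logic in Python. It assumes the standard Prometheus instant-query response shape, where each sample is a [timestamp, "value"] pair. The function name and the 0.25 default mirror the rule above and are otherwise my own invention.

```python
import json


def budget_allows_chaos(response_body: str, minimum_remaining: float = 0.25) -> bool:
    """Return True only if the query result shows enough error budget left.

    Fails closed: an empty or malformed result counts as "do not run".
    """
    try:
        result = json.loads(response_body)["data"]["result"]
        # Instant-query samples are [timestamp, "value-as-string"].
        remaining = float(result[0]["value"][1])
    except (KeyError, IndexError, TypeError, ValueError):
        return False
    return remaining >= minimum_remaining


healthy = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1718000000,"0.62"]}]}}'
spent = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1718000000,"0.10"]}]}}'
empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'

print(budget_allows_chaos(healthy))  # True
print(budget_allows_chaos(spent))    # False
print(budget_allows_chaos(empty))    # False
```

Treating a missing series as a refusal matters: a renamed recording rule should block experiments, not silently wave them through.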

Maturity stages

A useful framing borrowed from a few of the early Netflix talks. Teams progress through four stages:

Game days. Manual experiments, scheduled, with the whole team in the room. This is the right starting point. You learn how your team reacts to failure before you learn how the system does.
Scheduled experiments. A handful of experiments run on a cron schedule against staging. Findings get triaged weekly.
Continuous experiments. Small experiments run continuously in production, gated on SLO health. Findings auto-file tickets.
Chaos as part of CI. Pre-production includes chaos as a gate. A PR that breaks the network-delay experiment can’t merge.

Most teams I work with should be at stage 1 or 2. Stage 3 needs SLO maturity. Stage 4 needs a level of platform investment that maybe 10% of orgs can justify. Don’t aim for stage 4 from day one; that’s how chaos programs fail before they produce learning.

What to test first

A short list of experiments worth running before anything else. They surface the most common production-relevant failure modes:

Kill a pod from your most-critical deployment. Confirms PDBs and replica counts are right.
Drop 30% of traffic to your top downstream. Confirms timeouts and circuit breakers are tuned.
Add 200ms of latency to a backend. Confirms your retry budget doesn’t amplify latency into outage.
Fill a node’s disk. Confirms eviction thresholds are sane.
Kill DNS for 30 seconds. Confirms your services tolerate transient resolution failures.

That’s about a quarter’s worth of work for a small team. If any of those experiments shows a problem (and one of them will), that’s a real finding worth fixing. The remaining experiments come from your incident history.
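
The retry item deserves a back-of-the-envelope check before you run it. If every layer in a call chain retries independently, the worst-case number of attempts reaching the last service multiplies per layer. A sketch, with made-up retry counts and a hypothetical three-layer chain:

```python
def worst_case_attempts(retries_per_layer: list[int]) -> int:
    """Upper bound on calls reaching the last service for one user request.

    Each layer makes 1 original attempt plus its retries, and the layers multiply.
    """
    total = 1
    for retries in retries_per_layer:
        total *= 1 + retries
    return total


# Hypothetical chain: gateway -> checkout -> payments client, 2 retries each.
print(worst_case_attempts([2, 2, 2]))  # 27
# Same chain with retries only at the edge.
print(worst_case_attempts([2, 0, 0]))  # 3
```

If the latency experiment pushes calls past their timeouts, the first configuration can send up to 27 times the normal load to a backend that is already slow.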

Gotchas

Running in production first. Don’t. Run in staging until you trust your hypothesis testing. Then run in a canary slice of production. Then run broadly.
No steady-state probe. Without a probe, an experiment can’t fail. If it can’t fail, you’re not testing anything.
Selecting too broadly. Both tools default to permissive selectors. A typo in a label selector once landed me a multi-namespace pod-delete. In Chaos Mesh, use mode: fixed-percent or mode: fixed rather than mode: all for anything beyond dev. In Litmus, set PODS_AFFECTED_PERC explicitly.
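Forgetting the cleanup. Failed experiments can leave finalizers, tc rules, or iptables state behind. Periodically list the chaos resources across all namespaces, for example kubectl get networkchaos,podchaos,iochaos -A for Chaos Mesh or kubectl get chaosengines -A for Litmus, and confirm nothing is stuck.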
Confusing chaos with load testing. Chaos is fault injection. Load testing is volume. You usually want both, not one of them.
Skipping the writeup. An experiment without a post-experiment report is wasted. Even “hypothesis held, no findings” is worth recording.
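
For the cleanup check, "stuck" usually means an object that has a deletion timestamp but still carries finalizers. A minimal sketch of that filter follows, assuming the input is the JSON list you would get from something like kubectl get networkchaos -A -o json. The function name, the sample objects, and the finalizer string are invented for illustration.

```python
import json


def stuck_objects(kubectl_json: str) -> list[str]:
    """Names of objects that are marked for deletion but still hold finalizers."""
    stuck = []
    for item in json.loads(kubectl_json).get("items", []):
        meta = item.get("metadata", {})
        if meta.get("deletionTimestamp") and meta.get("finalizers"):
            stuck.append(f'{meta.get("namespace", "")}/{meta.get("name", "")}')
    return stuck


sample = json.dumps({
    "kind": "List",
    "items": [
        # Still running normally: has a finalizer but no deletion timestamp.
        {"metadata": {"name": "checkout-downstream-delay", "namespace": "checkout",
                      "finalizers": ["example.com/cleanup"]}},
        # Deleted an hour ago but never cleaned up.
        {"metadata": {"name": "old-partition-test", "namespace": "payments",
                      "deletionTimestamp": "2024-06-10T09:00:00Z",
                      "finalizers": ["example.com/cleanup"]}},
    ],
})

print(stuck_objects(sample))  # ['payments/old-partition-test']
```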

Wrapping Up

Chaos engineering on Kubernetes in 2024 is mature enough that the tool choice is the easy part. The hard parts are running experiments that test real hypotheses, gating them on SLO health, and turning findings into actual reliability work. If you adopt Litmus or Chaos Mesh and stop there, you’ll have a CRD library and not much else.

Next up I’ll cover auto-remediation with Argo Events and Kyverno, which is the natural next step once chaos engineering has surfaced the failure modes that should be auto-mitigated. Detecting and fixing in code, not in pages at 4am.
