Digital Immune Systems for Engineers: What Gartner Got Right
June 3, 2024 · 7 min read · by Muhammad Amal

TL;DR — Digital Immune System (DIS) is Gartner’s umbrella for six practices that together make software self-protecting. Strip the marketing and what’s left is observability, chaos engineering, auto-remediation, SRE, AI-augmented testing, and supply chain security. Treat it as a checklist, not a product.

When Gartner first pushed “Digital Immune System” into its strategic technology list, I rolled my eyes. Another analyst coinage. But after a year of arguing with platform teams about why their incident rate isn’t going down despite three observability vendors and a “DevSecOps transformation,” I’ve changed my tune. Not because the term is brilliant. Because the framing is right.

The thesis is simple. Resilience isn’t a feature you bolt on. It’s a property that emerges from a half-dozen practices working together. Skipping any one of them creates an attack surface, a blind spot, or a recovery gap. Most orgs do two or three of the six and wonder why outages still hurt.

This post is a working engineer’s read of the DIS pillars. What they are, what’s actually new, and which ones deserve real budget in 2024.

The six pillars in plain language

Gartner lists observability, AI-augmented testing, chaos engineering, auto-remediation, site reliability engineering, and software supply chain security. None are new. The claim is that doing all six, with feedback between them, is qualitatively different from doing any subset.

Concretely, an “immune” system needs to:

  1. Sense the body, continuously and with enough granularity (observability).
  2. Probe itself for weaknesses before the adversary does (chaos engineering, AI testing).
  3. Verify its inputs and dependencies (supply chain security).
  4. Respond automatically when patterns of harm appear (auto-remediation).
  5. Adapt by learning from incidents (SRE, postmortems, error budgets).

If that mapping reads suspiciously like the human immune system, that’s the metaphor on the box. It’s a useful one because it forces you to think about feedback loops, not individual tools.

What’s genuinely new in 2024

Two pillars have changed enough since 2022 that they deserve fresh attention.

Observability has shifted from logs-and-metrics to traces-as-first-class. The OpenTelemetry Collector (0.100 at the time of writing) is the de facto pipeline. Vendor-neutral instrumentation is now realistic for most stacks. The interesting work is in correlating traces with eBPF kernel signals (a later post covers the OTel and eBPF pairing), and in keeping cardinality from blowing up your bill.

Auto-remediation went from “scary scripts” to declarative policy. Kyverno 1.12 and Argo Events let you express “if this signal, do this action” as Kubernetes resources, reviewable in PRs, with audit trails. That’s a different beast from the cron-driven self-healing of 2019.

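A small but representative example. The policy below reacts to a pod that has crash-looped at least three times by tainting the node it landed on, so the scheduler places new pods elsewhere while a human investigates. Two caveats: the policy counts repeats of the BackOff event and does not enforce a five-minute window, and NoSchedule only affects new scheduling decisions, so pods already on the node keep running.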

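apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: quarantine-crashloop-node
spec:
  # React to live admission requests only, not background scans.
  background: false
  rules:
    - name: taint-on-repeated-crashloop
      match:
        any:
          - resources:
              # Kyverno's default resourceFilters typically exclude Events;
              # that filter has to be relaxed for this rule to fire.
              kinds: ["Event"]
      preconditions:
        all:
          # The kubelet emits reason "BackOff" while a container is in CrashLoopBackOff.
          - key: "{{ request.object.reason }}"
            operator: Equals
            value: "BackOff"
          # count is how many times this event has repeated.
          - key: "{{ request.object.count }}"
            operator: GreaterThanOrEquals
            value: 3
      mutate:
        targets:
          - apiVersion: v1
            kind: Node
            # involvedObject is only a reference and carries no spec; for
            # kubelet-emitted events the node name is normally in source.host.
            name: "{{ request.object.source.host }}"
        patchStrategicMerge:
          spec:
            # Caution: taints is treated as an atomic list, so this patch
            # likely overwrites taints already on the node.
            taints:
              - key: "quarantine"
                value: "crashloop"
                effect: "NoSchedule"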

This is the kind of thing that used to require a custom controller. Now it’s a policy your security team can review.

What’s a rebrand

Two pillars are basically older practices with new logos.

“AI-augmented testing” is mostly mutation testing and fuzzing with a language model on top to generate seeds. Useful, but if you don’t have property-based tests yet, start there before you license an LLM-based test generator.
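
For readers who haven't written one, here is a minimal sketch of what a property-based test looks like. It is hand-rolled on the standard library's random module so it runs anywhere; the run-length codec and every function name are made up for illustration. A dedicated library such as Hypothesis adds input shrinking and smarter generation on top of the same idea.

```python
import random
from itertools import groupby


def rle_encode(text):
    """Run-length encode a string into (char, count) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Invert rle_encode."""
    return "".join(ch * count for ch, count in pairs)


def check_round_trip(trials=500, seed=0):
    """Assert two properties over randomly generated inputs."""
    rng = random.Random(seed)  # seeded so a failure is reproducible
    for _ in range(trials):
        length = rng.randint(0, 20)
        # A tiny alphabet makes repeated runs likely.
        text = "".join(rng.choice("ab ") for _ in range(length))
        encoded = rle_encode(text)
        # Property 1: decoding an encoding returns the original input.
        assert rle_decode(encoded) == text, repr(text)
        # Property 2: adjacent runs never share a character.
        assert all(a[0] != b[0] for a, b in zip(encoded, encoded[1:]))
    return trials


print(check_round_trip())
```

The point is that the test states what must hold for any input rather than listing hand-picked cases, which is the same job an LLM-based seed generator is trying to do from the other direction.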

“SRE” has been around since 2003. It’s good. It’s not new. If your org is calling its DIS rollout “transformative” and the SRE chapter is “hire some SREs,” manage expectations.

Where the leverage is

If I had to rank the six by ROI for a mid-size engineering org in 2024, it’d be:

  1. Observability with OTel: table stakes; without it nothing else works.
  2. SRE practices, specifically SLOs and error budgets: they give you the language to decide what to remediate.
  3. Chaos engineering: the cheapest way to find your real failure modes.
  4. Auto-remediation: high ROI, but only after the above three are healthy.
  5. Supply chain: high stakes; SLSA and Sigstore are mature enough to adopt.
  6. AI testing: useful but not foundational.

The order matters because each step assumes the previous one exists. Auto-remediation without observability is automation gone feral. Chaos engineering without SLOs is just breaking production for sport.

For deeper context on the analyst position, Gartner’s writeup is the canonical reference. Read it once, then ignore the consultancy decks that quote it.

Putting it together

A useful exercise: walk a recent Sev-2 incident through the six pillars and ask, for each one, “did this help, was it absent, or did it fire wrong?” You’ll usually find:

  • Observability caught the symptom but not the cause.
  • Chaos engineering hadn’t covered this failure mode.
  • Auto-remediation didn’t fire because the signal didn’t match.
  • The SLO breach took 20 minutes to register.
  • Supply chain was uninvolved (or, occasionally, the root cause).
  • AI testing was nowhere.

That mapping tells you where to invest next. It’s a more honest planning artifact than a vendor demo.

A practical workflow I’ve used with two platform teams this year:

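# Pull the last quarter of Sev-2+ incidents (capped at 50 here).
# The search qualifier label:a,b matches either label; --label "a,b" would
# require both labels on the same issue.
gh issue list --search "label:sev-1,sev-2" --state closed --limit 50 --json number,title,body \
  > incidents.json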

# Score each on a 0-2 scale per pillar (manually, with the on-call team).
# 0: pillar was absent. 1: pillar fired but didn't help. 2: pillar prevented or shortened the incident.

# Aggregate. The pillar with the most zeros across incidents is your next quarter's investment.

It’s crude and it works. The ranking will surprise you. In my experience the lowest-scoring pillar is rarely the one leadership thinks it is.
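
The aggregation step in the workflow above can be sketched in a few lines of Python. This is one possible shape, not a prescribed tool: the pillar keys, the `rank_pillars` function, and the example scores are all hypothetical, and the tie-break (lower total score first) is an assumption of mine rather than part of the method.

```python
from collections import Counter

PILLARS = [
    "observability",
    "ai_testing",
    "chaos_engineering",
    "auto_remediation",
    "sre",
    "supply_chain",
]


def rank_pillars(scores):
    """Return pillars ordered from most zeros to fewest.

    `scores` maps an incident number to a {pillar: 0 | 1 | 2} dict.
    Ties are broken by the lower total score, then by pillar name.
    """
    zeros = Counter()
    totals = Counter()
    for incident, per_pillar in scores.items():
        for pillar in PILLARS:
            value = per_pillar[pillar]  # a KeyError flags an unscored pillar
            if value not in (0, 1, 2):
                raise ValueError(
                    f"incident {incident}: bad score {value!r} for {pillar}"
                )
            totals[pillar] += value
            if value == 0:
                zeros[pillar] += 1
    return sorted(PILLARS, key=lambda p: (-zeros[p], totals[p], p))


# Hypothetical scores for three incidents, for illustration only.
example_scores = {
    101: {"observability": 1, "ai_testing": 0, "chaos_engineering": 0,
          "auto_remediation": 1, "sre": 2, "supply_chain": 2},
    102: {"observability": 2, "ai_testing": 0, "chaos_engineering": 0,
          "auto_remediation": 0, "sre": 1, "supply_chain": 2},
    103: {"observability": 1, "ai_testing": 1, "chaos_engineering": 0,
          "auto_remediation": 1, "sre": 1, "supply_chain": 0},
}

ranking = rank_pillars(example_scores)
print(ranking[0])  # the pillar with the most zeros
```

With these made-up scores, chaos engineering scored zero in all three incidents, so it comes out first. The scoring itself still has to happen in a room with the on-call team; the script only does the counting.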

Common pitfalls

A few patterns I see repeatedly when teams adopt DIS thinking:

  • Buying a DIS product. No vendor sells all six. Anyone claiming to is selling observability with extra slides.
  • Skipping chaos because “we have HA.” HA proves the happy path. Chaos proves the unhappy ones. They’re not interchangeable.
  • Auto-remediation without rate limits. I’ve seen a remediation policy take down a cluster faster than the original incident would have. Always bound the blast radius.
  • Treating supply chain as security’s problem. SBOMs and signature verification belong in CI, owned by platform.
  • No feedback loop into SLOs. If an incident doesn’t update an SLO or alert threshold, you haven’t learned.
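
To make the rate-limit pitfall concrete, here is a minimal sketch of a sliding-window budget that a remediation controller could consult before acting. The `RemediationBudget` class, its parameters, and the numbers are hypothetical; real policy engines and controllers have their own mechanisms, and this only illustrates the idea of bounding how often automation may fire.

```python
import time
from collections import deque


class RemediationBudget:
    """Allow at most `max_actions` remediations per sliding `window_seconds`."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._clock = clock
        self._fired = deque()  # timestamps of actions that were allowed

    def try_acquire(self):
        """Return True and record the action if the budget allows it."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._fired and now - self._fired[0] >= self.window_seconds:
            self._fired.popleft()
        if len(self._fired) >= self.max_actions:
            return False  # over budget: escalate to a human instead
        self._fired.append(now)
        return True


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
budget = RemediationBudget(
    max_actions=2, window_seconds=300, clock=lambda: fake_now[0]
)

results = []
for t in (0, 10, 20, 310):
    fake_now[0] = float(t)
    results.append(budget.try_acquire())
print(results)  # [True, True, False, True]
```

The third request is refused because two actions already fired inside the five-minute window; by t=310 both have aged out. A refusal should page someone rather than fail silently, since hitting the limit is itself a signal.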

A 90-day rollout plan

If I were starting from scratch today, this is the sequence I’d run. It’s deliberately slow because the cultural changes are the bottleneck, not the tooling.

Days 1-30: instrumentation and language. Get OpenTelemetry into your top three services. Define one SLI per service. Don’t write SLOs yet; just measure. The goal of month one is that everyone on the team can name three numbers about the service’s behavior.

Days 31-60: SLO commitment. With a month of measurements in hand, write SLOs. The targets should be embarrassingly easy at first; you want the budget to teach the team what an incident-month looks like. Wire burn-rate alerts. Establish on-call escalation.
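
For anyone who hasn't wired a burn-rate alert before, the arithmetic is small enough to sketch. Burn rate is the observed error rate divided by the error budget (one minus the SLO target); a value of 1.0 spends exactly the whole budget over the SLO window. The function names, the two-window check, and the threshold of 10 below are illustrative assumptions, not values from any particular alerting tool.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How many times faster than sustainable the error budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target  # allowed failure fraction
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget


def should_page(short_window_rate, long_window_rate, threshold):
    """Multi-window check: page only if both windows exceed the threshold."""
    return short_window_rate >= threshold and long_window_rate >= threshold


# Hypothetical numbers: a 99% SLO, measured over two windows.
short = burn_rate(bad_events=80, total_events=500, slo_target=0.99)    # last 5 minutes
long_ = burn_rate(bad_events=600, total_events=5000, slo_target=0.99)  # last hour
print(round(short, 1), round(long_, 1), should_page(short, long_, threshold=10.0))
```

Here the two windows burn at roughly 16x and 12x, so the alert pages. Requiring both windows to agree is a common way to avoid paging on a brief spike that has already ended.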

Days 61-90: first chaos experiment. Pick one failure mode you’re confident the system handles. Prove it does, in a non-prod cluster. Then in a canary slice. The point isn’t the experiment; it’s the muscle memory of running one.

After 90 days you’ll know whether your org has the appetite for the full program. About half don’t. That’s useful information too.

What “self-protecting” really means

The marketing implies a system that fixes itself. The reality is more modest. A self-protecting system:

  • Notices its own degradation faster than humans can.
  • Limits the blast radius of failures it can’t fix.
  • Surfaces the right information for the human who does fix it.
  • Learns from each incident so the next one is faster.

Notice that “fixes itself” appears nowhere in that list. Auto-remediation is one tool, useful for narrow scenarios, but the load-bearing word in DIS is “immune,” not “automatic.” An immune system mostly contains threats while signaling for help.

That framing helps with budget conversations. The question isn’t “how much automation can we build?” It’s “where are we slowest to notice, contain, and learn?” Money goes to the slowest of the three.

Wrapping up

Digital Immune System isn’t a product you install, and Gartner being right about the framing doesn’t make every adjacent vendor right about the implementation. What it is, usefully, is a checklist for whether your reliability program has gaps that money won’t fix. Run the pillar audit. Pick the lowest score. Invest there. Repeat next quarter.

The rest of this month I’ll go through several of the pillars in depth, starting with SLOs and error budgets, since they’re the prerequisite for almost everything else. If your team has been doing observability for a while but reliability still hurts, that’s the post to read first.