Auto Remediation on Kubernetes, Argo Events and Policy as Code

June 12, 2024 · 7 min read · by Muhammad Amal programming

TL;DR — Auto-remediation should be a Kubernetes resource you review in a PR, not a script someone wrote at 2am. Argo Events provides the trigger graph, Kyverno provides the policy enforcement, and Argo CD ships them with your other config. Build the guardrails before the actions.

I’m wary of auto-remediation. Done badly, it amplifies incidents instead of damping them. I’ve watched a memory-leak remediation policy cascade into a full cluster restart because nobody put a rate limit on the restart action. That outage was longer than the leak ever would have been.

Done well, auto-remediation is the difference between a 30-second blip and a page at 3am. The trick is the same trick as every other power tool: respect the blast radius, version the policy, and prove the rollback. This post is the pattern I now reach for in 2024.

If you’re not using SLOs to decide what to remediate yet, that’s the prerequisite. Auto-remediation without an SLO is a script that runs forever. With an SLO, it’s a budget-conscious response.

The architecture

Three pieces, all declarative:

Argo Events — listens for signals (Kubernetes events, Prometheus alerts, webhook posts) and turns them into triggers.
Kyverno — applies cluster-state changes (mutations, validations) in response.
Argo CD — ships both, so a remediation policy goes through your normal PR review.

The flow is signal → EventSource → Sensor → Trigger → Kyverno policy (or a direct Kubernetes action). Each step is declared in a custom resource (triggers live inside the Sensor spec). Each is reviewable. Each is auditable.

Why not just write a custom controller? Because you’d have to write it, test it, secure it, and explain it to your team. Argo Events plus Kyverno is the boring choice, and boring is right when the blast radius is “the whole cluster.”

A worked example, OOMKilled pods

A common remediation: when a deployment has a pod OOMKilled twice in five minutes, bump its memory request by 25% (with a cap) and notify the owner.
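The rule has two parts that are easy to get subtly wrong: the sliding window ("twice in five minutes") and the capped bump. Here is a minimal Python sketch of both. `should_remediate` and `bumped_memory_mib` are hypothetical helper names, memory is handled in MiB for simplicity, and none of this claims to describe how the remediation image used later in this post works internally.

```python
from collections import deque
from math import ceil

WINDOW_SECONDS = 5 * 60   # "five minutes" from the rule
THRESHOLD = 2             # "twice"


def should_remediate(kill_times: deque, now: float) -> bool:
    """Record an OOM kill at `now` and report whether the rule has tripped.

    `kill_times` holds earlier kill timestamps (in seconds) for one deployment.
    """
    kill_times.append(now)
    # Drop kills that have aged out of the window.
    while kill_times and now - kill_times[0] > WINDOW_SECONDS:
        kill_times.popleft()
    return len(kill_times) >= THRESHOLD


def bumped_memory_mib(current_mib: int, increment: float = 0.25,
                      cap_mib: int = 8 * 1024) -> int:
    """Return the new memory request in MiB: +25% by default, never above the cap."""
    if current_mib >= cap_mib:
        return current_mib  # already at or over the cap: do not bump, do not shrink
    return min(cap_mib, ceil(current_mib * (1 + increment)))
```

With these defaults a 512 MiB request becomes 640 MiB, and a 7000 MiB request is clamped to the 8 GiB cap instead of growing to 8750 MiB.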

Step one: capture the signal. Argo Events watches the Kubernetes event stream for OOMKilled events.

apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: oomkilled-events
  namespace: argo-events
spec:
  resource:
    oom-events:
      namespace: ""
      group: ""
      version: "v1"
      resource: "events"
      eventTypes:
        - ADD
      filter:
        fields:
          - key: "reason"
            value: "OOMKilled"
            operation: "=="

Step two: route to a sensor with rate limits. This is the guardrail.

apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: oom-remediation
  namespace: argo-events
spec:
  dependencies:
    - name: oom-event
      eventSourceName: oomkilled-events
      eventName: oom-events
      filters:
        dataLogicalOperator: "and"
        data:
          - path: "body.involvedObject.kind"
            type: string
            value: ["Pod"]
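  triggers:
    - template:
        name: bump-memory-request
        k8s:
          # create, not patch: each event submits a new Workflow via generateName
          operation: create
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: bump-memory-
              spec:
                entrypoint: bump
                arguments:
                  parameters:
                    # placeholders, overwritten by the parameters block below
                    - name: pod
                      value: ""
                    - name: namespace
                      value: ""
                templates:
                  - name: bump
                    container:
                      image: ghcr.io/internal/k8s-remediate:1.4.0
                      command: ["/remediate"]
                      args:
                        - "bump-memory"
                        - "--pod={{workflow.parameters.pod}}"
                        - "--namespace={{workflow.parameters.namespace}}"
                        - "--increment=25%"
                        - "--cap=8Gi"
                        - "--dry-run=false"
          # copy fields from the event payload into the Workflow arguments
          parameters:
            - src:
                dependencyName: oom-event
                dataKey: body.involvedObject.name
              dest: spec.arguments.parameters.0.value
            - src:
                dependencyName: oom-event
                dataKey: body.involvedObject.namespace
              dest: spec.arguments.parameters.1.value
      # rateLimit belongs to the trigger, as a sibling of template
      rateLimit:
        unit: Minute
        requestsPerUnit: 2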

The rateLimit block caps this trigger at two remediations per minute. If twenty pods OOM at once, you’ll handle two, page on the rest. That’s intentional. Twenty simultaneous OOMs is a signal that something else is wrong, and a flurry of auto-bumps will mask it.
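To make the intended effect concrete, here is a small Python model of a two-per-minute limit applied to a burst of twenty events. It illustrates the policy described in this section, not Argo Events' own implementation. `FixedWindowLimiter` is a hypothetical name, and what happens to the events that are not allowed through (page, queue, drop) is left to the caller.

```python
class FixedWindowLimiter:
    """Allow at most `limit` actions per `window_seconds` window."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = None  # start time of the current window
        self.used = 0

    def allow(self, now: float) -> bool:
        # Open a new window on first use or once the current one has elapsed.
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        if self.used < self.limit:
            self.used += 1
            return True
        return False


limiter = FixedWindowLimiter(limit=2, window_seconds=60)
burst = [f"pod-{i}" for i in range(20)]          # twenty pods OOM at t=0
remediated = [p for p in burst if limiter.allow(now=0.0)]
escalated = [p for p in burst if p not in remediated]
print(len(remediated), len(escalated))           # 2 18
```

The useful part is the `escalated` list: the limit only protects you if whatever falls outside it reaches a human.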

Step three: a Kyverno policy that prevents the bump from running on pods that haven’t tagged themselves as opt-in.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-remediation-optin
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-bump-without-optin
      match:
        any:
          - resources:
              kinds:
                - Workflow
              namespaces:
                - argo-events
              names:
                - "bump-memory-*"
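      context:
        # Workflow arguments, extracted with JMESPath filters (not JSONPath)
        - name: targetPod
          variable:
            jmesPath: "request.object.spec.arguments.parameters[?name=='pod'].value | [0]"
            default: ""
        - name: targetNamespace
          variable:
            jmesPath: "request.object.spec.arguments.parameters[?name=='namespace'].value | [0]"
            default: ""
        # Fetch the target Pod and read its opt-in label.
        # Kyverno's service account needs RBAC permission to get Pods.
        - name: optin
          apiCall:
            urlPath: "/api/v1/namespaces/{{ targetNamespace }}/pods/{{ targetPod }}"
            jmesPath: "metadata.labels.\"remediation.platform/optin\" || ''"
      preconditions:
        all:
          - key: "{{ targetPod }}"
            operator: NotEquals
            value: ""
      validate:
        message: "Pod must have label remediation.platform/optin=true"
        deny:
          conditions:
            any:
              - key: "{{ optin }}"
                operator: NotEquals
                value: "true"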

Opt-in matters. The default is “no auto-remediation.” Teams who want it ask for it by adding a label. This forces a human decision at deploy time and prevents surprises in shared clusters.
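The same default-deny check is worth repeating inside the remediation tool itself, so it still holds if the admission policy is ever disabled. A minimal sketch, assuming the tool already has the pod's labels as a dict; `is_opted_in` is a hypothetical helper, and only the label key comes from the policy.

```python
from typing import Mapping, Optional

OPTIN_LABEL = "remediation.platform/optin"  # label key from the policy


def is_opted_in(labels: Optional[Mapping[str, str]]) -> bool:
    """Default-deny: only the exact string "true" opts a workload in."""
    return (labels or {}).get(OPTIN_LABEL) == "true"
```

Missing labels, a missing key, and near-misses such as "True" or "yes" all count as "not opted in", which matches the policy's strict comparison against "true".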

What to remediate, what not to

A non-exhaustive opinion:

Safe to auto-remediate:

OOMKilled pods on opt-in workloads, with memory cap.
Stuck Argo CD applications (sync retry).
TLS certs approaching expiry, renewed before they lapse (cert-manager already does this).
Disk-full conditions on node-local logs (rotate, don’t delete).

Don’t auto-remediate:

Database failovers. Get a human.
Anything that triggers more than 5% of cluster capacity to move.
Security events. Quarantine, page, but don’t “fix.”
Network partitions. The remediation is usually wrong.

The general rule: if the remediation is “delete and recreate,” it’s safe. If it’s “change state in a stateful system,” it isn’t.
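The 5% rule from the "don't" list can be enforced as a pre-flight check rather than left as a convention. A rough sketch, assuming "capacity" is approximated by pod count (CPU or memory would work the same way); the function name and the pod-count proxy are assumptions made for illustration.

```python
def within_blast_radius(pods_to_move: int, total_pods: int,
                        max_fraction: float = 0.05) -> bool:
    """True if the action would move at most `max_fraction` of the cluster's pods."""
    if total_pods <= 0:
        return False  # unknown cluster size: refuse rather than guess
    return pods_to_move / total_pods <= max_fraction
```

A remediation that fails this check should page instead of acting, the same way the rate limit hands the overflow to a human.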

Tying it to your error budget

The most underrated guardrail. Don’t auto-remediate when the SLO budget is unhealthy. The reasoning is that a struggling service produces a lot of signals, and auto-remediation turns each signal into an action. When the budget is thin, you want fewer actions, not more.

A simple Argo Events data filter keeps this honest, assuming the alert payload carries the remaining budget computed by a Prometheus query:

filters:
  data:
    - path: "body.metric.budget_remaining"
      type: number
      comparator: ">"
      value: ["0.25"]

If less than 25% of the budget is left, no automation. Page instead.
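For reference, the arithmetic behind a "budget remaining" number looks like this. It is a sketch under the usual request-based SLO definition (good events over total events in the window); the function names are hypothetical and only the 0.25 floor comes from the filter.

```python
def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent in the window (can go negative)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * total   # failures the SLO tolerates
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def automation_allowed(remaining: float, floor: float = 0.25) -> bool:
    """Act only while more than `floor` of the budget is left."""
    return remaining > floor
```

With a 99.9% target over one million requests, 600 failures leave roughly 40% of the budget and automation stays on; 900 failures leave roughly 10% and the gate closes.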

Argo CD as the delivery vehicle

Both Argo Events and Kyverno resources belong in your GitOps repo. That gives you:

Diffs on policy changes.
Rollback by git revert.
A canonical place to discover what auto-remediations exist (kubectl get sensor -A plus the repo).

The Argo CD docs on ApplicationSets are the right reference for templating these across clusters. A pattern I use: each remediation lives in its own Application, so disabling a single remediation is a sync state change, not a code change.

Observability of the automation

Auto-remediation that runs silently is worse than no automation. You need to know how often it fires, what it did, and whether it helped.

Three metrics per sensor, two exported by Argo Events and one you add yourself:

argo_events_action_triggered_total: how often a trigger fired.
argo_events_action_failed_total: how often it failed.
A custom counter from your remediation container counting “remediation succeeded, problem recurred within N minutes.” This is the only metric that tells you whether the remediation is actually working.

The third metric is the one that prevents the worst failure mode: remediations that fire continuously without resolving anything. If recurrence is >50%, the remediation is wrong.
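The recurrence number is simple to compute once the remediation container records when it acted and when the problem was next seen. A minimal sketch; the `(remediated_at, recurred_at)` record shape and the function name are assumptions, not something Argo Events provides.

```python
from typing import Optional, Sequence, Tuple


def recurrence_rate(outcomes: Sequence[Tuple[float, Optional[float]]],
                    within_seconds: float) -> float:
    """Share of successful remediations whose problem came back within the window.

    Each outcome is (remediated_at, recurred_at) in seconds; recurred_at is
    None if the problem has not been seen again.
    """
    if not outcomes:
        return 0.0
    recurred = sum(
        1 for fixed, again in outcomes
        if again is not None and 0 <= again - fixed <= within_seconds
    )
    return recurred / len(outcomes)
```

Alert when this value exceeds 0.5 for a given sensor; at that point the remediation is treating a symptom.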

Phasing it in

Don’t go from zero to fully-automated remediation. The phases I use:

1. Detect and notify. The pipeline fires, but the action is just a Slack message to the on-call. Two weeks at this stage.
2. Detect, dry-run, notify. The pipeline computes what it would do and reports it. The human runs the action if it looks right. Two more weeks.
3. Detect and act with manual confirm. The pipeline files a PR (yes, a PR) that, when merged, applies the remediation. Useful for non-urgent changes like memory bumps.
4. Detect and act autonomously. Only for narrow, well-tested cases with proven rollback.

Stage 1 is where you discover that 30% of your signals are noisy. Stage 2 is where you discover that 20% of your remediations are wrong. Both are cheap to find without the automation actually firing.

By the time you reach stage 4, the team trusts the pipeline because they’ve seen it not do dumb things for a month.
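One way to keep the stages honest is to make the stage an explicit setting of the remediation rather than a habit. The sketch below models the four stages as an enum and returns what the pipeline would do for one signal. The step names are placeholders, and notifying the on-call at every stage is an assumption carried over from the point about silent automation.

```python
from enum import IntEnum
from typing import List


class Phase(IntEnum):
    NOTIFY = 1        # detect and notify
    DRY_RUN = 2       # detect, dry-run, notify
    CONFIRM = 3       # detect and act with manual confirm (open a PR)
    AUTONOMOUS = 4    # detect and act autonomously


def plan_actions(phase: Phase) -> List[str]:
    """Return the steps the pipeline takes for one signal at the given phase."""
    steps = ["notify_on_call"]                 # every phase tells a human
    if phase >= Phase.DRY_RUN:
        steps.append("compute_dry_run")        # work out what would change
    if phase == Phase.CONFIRM:
        steps.append("open_pull_request")      # a human merges to apply
    if phase == Phase.AUTONOMOUS:
        steps.append("apply_remediation")      # only here does anything change
    return steps
```

Promoting a remediation is then a one-line change in Git, which goes through the same PR review as everything else.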

When humans beat automation

Auto-remediation works for failure modes that are characterized and bounded. It loses to humans on novel failure modes. Two corollaries:

If your incident postmortem keeps surfacing “the auto-remediation made it worse,” you’ve over-automated. Pull back.
If your incident postmortem keeps surfacing “humans took too long to do the obvious thing,” you’ve under-automated. Push forward.

Both signals matter. Reviewing the ratio quarterly tells you whether to add automation or remove it.

Gotchas

No rate limit. Always set one. The default is unlimited. The default is wrong.
No opt-in. Without it, every workload owner is surprised.
No budget gate. Auto-remediation during an incident makes the incident worse.
Stale metrics. A remediation triggered by a 5-minute-old metric fires after the situation has changed. Pin signals to live data.
No dry-run mode. Every remediation should have --dry-run and you should run it that way for a week before flipping it to enforce.
Remediation loops. The remediation triggers the same signal it was triggered by. Always include a “do not retrigger within N minutes” deduplication.
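Two of these gotchas, stale signals and remediation loops, come down to a timestamp comparison. A minimal sketch of both guards follows. `Cooldown` and `is_fresh` are hypothetical names, and the 60-second freshness default is an assumption to tune per signal.

```python
class Cooldown:
    """Suppress repeat remediations of the same target within `seconds`."""

    def __init__(self, seconds: float) -> None:
        self.seconds = seconds
        self._last_fired = {}  # target key -> time of last remediation

    def try_fire(self, target: str, now: float) -> bool:
        last = self._last_fired.get(target)
        if last is not None and now - last < self.seconds:
            return False  # still cooling down: possibly our own echo
        self._last_fired[target] = now
        return True


def is_fresh(signal_time: float, now: float, max_age_seconds: float = 60.0) -> bool:
    """Reject signals older than `max_age_seconds` (and ones from the future)."""
    return 0 <= now - signal_time <= max_age_seconds
```

Key the cooldown by workload (for example "namespace/deployment") rather than by pod name, since a restarted pod gets a new name and would slip past a per-pod key.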

Wrapping Up

Auto-remediation is one of the highest-leverage practices in the Digital Immune System pillars, and one of the easiest to ship as a foot-gun. Argo Events plus Kyverno gives you the declarative scaffolding to make it boring. Use opt-in, rate limits, budget gates, and dry-run. Treat every remediation policy as a code review, not a script.

Next I’ll move into the observability layer that all of this depends on, specifically the eBPF and OpenTelemetry pairing. Without good signals, auto-remediation is automation in the dark.

