Auto-Remediation on Kubernetes: Argo Events and Policy as Code
TL;DR — Auto-remediation should be a Kubernetes resource you review in a PR, not a script someone wrote at 2am. Argo Events provides the trigger graph, Kyverno provides the policy enforcement, and Argo CD ships them with your other config. Build the guardrails before the actions.
I’m wary of auto-remediation. Done badly, it amplifies incidents instead of damping them. I’ve watched a memory-leak remediation policy cascade into a full cluster restart because nobody put a rate limit on the restart action. That outage was longer than the leak ever would have been.
Done well, auto-remediation is the difference between a 30-second blip and a page at 3am. The trick is the same trick as every other power tool: respect the blast radius, version the policy, and prove the rollback. This post is the pattern I now reach for in 2024.
If you’re not using SLOs to decide what to remediate yet, that’s the prerequisite. Auto-remediation without an SLO is a script that runs forever. With an SLO, it’s a budget-conscious response.
The architecture
Three pieces, all declarative:
- Argo Events — listens for signals (Kubernetes events, Prometheus alerts, webhook posts) and turns them into triggers.
- Kyverno — applies cluster-state changes (mutations, validations) in response.
- Argo CD — ships both, so a remediation policy goes through your normal PR review.
The flow is signal → EventSource → Sensor → Trigger → Kyverno policy (or a direct Kubernetes action). Each step is a CRD. Each is reviewable. Each is auditable.
Why not just write a custom controller? Because you’d have to write it, test it, secure it, and explain it to your team. Argo Events plus Kyverno is the boring choice, and boring is right when the blast radius is “the whole cluster.”
A worked example: OOMKilled pods
A common remediation: when a deployment has a pod OOMKilled twice in five minutes, bump its memory request by 25% (with a cap) and notify the owner.
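The "25% with a cap" arithmetic is worth pinning down before wiring anything up. The following is a minimal sketch of what a remediation container might compute; the function names are hypothetical, and the parser only understands the binary suffixes Ki, Mi and Gi, which is a simplification of real Kubernetes quantities.

```python
import math

# Hypothetical helper: only binary suffixes are handled here.
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as plain bytes


def bumped_request(current: str, increment: float = 0.25, cap: str = "8Gi") -> str:
    """Return the new request in whole Mi, rounded up and clamped to the cap."""
    current_bytes = parse_quantity(current)
    cap_bytes = parse_quantity(cap)
    if current_bytes >= cap_bytes:
        return current  # already at or above the cap: never shrink, never grow
    target = min(current_bytes * (1 + increment), cap_bytes)
    return f"{math.ceil(target / _UNITS['Mi'])}Mi"
```

With these assumptions, a 512Mi request becomes 640Mi, and a 7Gi request stops at the 8Gi cap instead of growing to 8.75Gi.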
Step one: capture the signal. Argo Events watches the Kubernetes event stream for OOMKilled events.
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: oomkilled-events
  namespace: argo-events
spec:
  resource:
    oom-events:
      namespace: ""
      group: ""
      version: "v1"
      resource: "events"
      eventTypes:
        - ADD
      filter:
        fields:
          - key: "reason"
            value: "OOMKilled"
            operation: "=="
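The EventSource above forwards every matching event, so the "twice in five minutes" condition has to be evaluated somewhere downstream. A minimal sketch of that check as a sliding window, assuming it lives in the remediation container and that timestamps arrive in non-decreasing order; the class name and its defaults are hypothetical:

```python
from collections import defaultdict, deque


class OomWindow:
    """Counts OOM events per deployment inside a sliding time window."""

    def __init__(self, threshold: int = 2, window_seconds: float = 300.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def record(self, deployment: str, timestamp: float) -> bool:
        """Record one event; return True once the threshold is reached."""
        events = self._events[deployment]
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

A single OOM never triggers anything; a second one within five minutes does, and the count is kept per deployment so one noisy workload does not trip another.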
Step two: route to a sensor with rate limits. This is the guardrail.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: oom-remediation
  namespace: argo-events
spec:
  dependencies:
    - name: oom-event
      eventSourceName: oomkilled-events
      eventName: oom-events
      filters:
        dataLogicalOperator: "and"
        data:
          - path: "body.involvedObject.kind"
            type: string
            value: ["Pod"]
  triggers:
    - template:
        name: bump-memory-request
        k8s:
          # Each event creates a fresh Workflow, so the operation is create.
          operation: create
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: bump-memory-
              spec:
                entrypoint: bump
                arguments:
                  parameters:
                    # Placeholders, overwritten by the parameters block below.
                    - name: pod
                      value: ""
                    - name: namespace
                      value: ""
                templates:
                  - name: bump
                    container:
                      image: ghcr.io/internal/k8s-remediate:1.4.0
                      command: ["/remediate"]
                      args:
                        - "bump-memory"
                        - "--pod={{workflow.parameters.pod}}"
                        - "--namespace={{workflow.parameters.namespace}}"
                        - "--increment=25%"
                        - "--cap=8Gi"
                        - "--dry-run=false"
          # Copy fields from the event payload into the Workflow arguments.
          parameters:
            - src:
                dependencyName: oom-event
                dataKey: body.involvedObject.name
              dest: spec.arguments.parameters.0.value
            - src:
                dependencyName: oom-event
                dataKey: body.involvedObject.namespace
              dest: spec.arguments.parameters.1.value
      # The rate limit is set per trigger, next to the template.
      rateLimit:
        unit: Minute
        requestsPerUnit: 2
The rateLimit block caps this trigger at two remediations per minute, however many pods are affected. If twenty pods OOM at once, you’ll handle two, page on the rest. That’s intentional. Twenty simultaneous OOMs is a signal that something else is wrong, and a flurry of auto-bumps will mask it.
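To make the effect of a two-per-minute limit concrete, here is a small sliding-window limiter. It is an illustration of the guardrail, not Argo Events' implementation, and the class name is hypothetical.

```python
from collections import deque


class SlidingRateLimiter:
    """Allows at most `requests_per_unit` actions in any `unit_seconds` span."""

    def __init__(self, requests_per_unit: int = 2, unit_seconds: float = 60.0):
        self.requests_per_unit = requests_per_unit
        self.unit_seconds = unit_seconds
        self._accepted = deque()  # timestamps of actions that were allowed

    def allow(self, now: float) -> bool:
        # Forget accepted actions that have left the window.
        while self._accepted and now - self._accepted[0] >= self.unit_seconds:
            self._accepted.popleft()
        if len(self._accepted) < self.requests_per_unit:
            self._accepted.append(now)
            return True
        return False
```

Feeding it twenty simultaneous events lets two through and rejects eighteen; those eighteen are what should reach a human.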
Step three: a Kyverno policy that prevents the bump from running on pods that haven’t tagged themselves as opt-in.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-remediation-optin
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-bump-without-optin
      match:
        any:
          - resources:
              kinds:
                - Workflow
              namespaces:
                - argo-events
              names:
                - "bump-memory-*"
      context:
        # Fetch the target pod from the API server and extract its opt-in label.
        - name: optin
          apiCall:
            urlPath: "/api/v1/namespaces/{{ request.object.spec.arguments.parameters[?name=='namespace'].value | [0] }}/pods/{{ request.object.spec.arguments.parameters[?name=='pod'].value | [0] }}"
            jmesPath: "metadata.labels.\"remediation.platform/optin\" || ''"
      preconditions:
        all:
          # Only evaluate Workflows that actually name a pod.
          - key: "{{ request.object.spec.arguments.parameters[?name=='pod'].value | [0] || '' }}"
            operator: NotEquals
            value: ""
      validate:
        message: "Pod must have label remediation.platform/optin=true"
        deny:
          conditions:
            any:
              - key: "{{ optin }}"
                operator: NotEquals
                value: "true"
Opt-in matters. The default is “no auto-remediation.” Teams who want it ask for it by adding a label. This forces a human decision at deploy time and prevents surprises in shared clusters.
What to remediate, what not to
A non-exhaustive opinion:
Safe to auto-remediate:
- OOMKilled pods on opt-in workloads, with memory cap.
- Stuck Argo CD applications (sync retry).
- Expiring TLS certs, caught before expiry (cert-manager already does this).
- Disk-full conditions on node-local logs (rotate, don’t delete).
Don’t auto-remediate:
- Database failovers. Get a human.
- Anything that would move more than 5% of cluster capacity.
- Security events. Quarantine, page, but don’t “fix.”
- Network partitions. The remediation is usually wrong.
The general rule: if the remediation is “delete and recreate,” it’s safe. If it’s “change state in a stateful system,” it isn’t.
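The two hard limits in that list, stateful systems and the 5% ceiling, can be expressed as a pre-flight check. This is a sketch under assumptions: pod count stands in for "cluster capacity", and the `Remediation` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remediation:
    name: str
    touches_stateful_system: bool
    affected_pods: int


def is_auto_eligible(
    remediation: Remediation, total_pods: int, max_fraction: float = 0.05
) -> bool:
    """Return True only for stateless actions with a small blast radius."""
    if remediation.touches_stateful_system:
        return False  # changing state in a stateful system needs a human
    if total_pods <= 0:
        return False  # unknown cluster size: refuse rather than guess
    return remediation.affected_pods / total_pods <= max_fraction
```

Restarting three pods in a 100-pod cluster passes; a database failover or a drain that moves ten of those pods does not.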
Tying it to your error budget
The most underrated guardrail. Don’t auto-remediate when the SLO budget is unhealthy. A struggling service produces a lot of signals, and auto-remediation turns every one of them into an action. When the budget is thin, you want fewer actions, not more.
A simple Argo Events filter on a Prometheus query keeps this honest:
filters:
  data:
    - path: "body.metric.budget_remaining"
      type: number
      comparator: ">"
      value:
        - "0.25"
If less than 25% of the budget is left, no automation. Page instead.
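For reference, the budget_remaining value that the filter compares against can be derived from request counts. A minimal sketch, assuming a request-based SLO over a fixed window; the function names are hypothetical and the 0.25 threshold mirrors the filter.

```python
def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent, between 0.0 and 1.0."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return max(0.0, 1 - actual_bad / allowed_bad)


def automation_allowed(remaining: float, threshold: float = 0.25) -> bool:
    """Gate automation on a healthy budget; below the threshold, page instead."""
    return remaining > threshold
```

With a 99.9% target and one million requests, 500 failures leave about half the budget and automation stays on; 900 failures leave about a tenth and the gate closes.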
Argo CD as the delivery vehicle
Both Argo Events and Kyverno resources belong in your GitOps repo. That gives you:
- Diffs on policy changes.
- Rollback by git revert.
- A canonical place to discover what auto-remediations exist (kubectl get sensor -A plus the repo).
The Argo CD docs on ApplicationSets are the right reference for templating these across clusters. A pattern I use: each remediation lives in its own Application, so disabling a single remediation is a sync state change, not a code change.
Observability of the automation
Auto-remediation that runs silently is worse than no automation. You need to know how often it fires, what it did, and whether it helped.
Three metrics per sensor, exported by Argo Events:
- argo_events_action_triggered_total: how often a trigger fired.
- argo_events_action_failed_total: how often it failed.
- A custom counter from your remediation container counting “remediation succeeded, problem recurred within N minutes.” This is the only metric that tells you whether the remediation is actually working.
The third metric is the one that prevents the worst failure mode: remediations that fire continuously without resolving anything. If recurrence is >50%, the remediation is wrong.
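The recurrence figure is simple to compute once each remediation is recorded with the time the problem came back, if it did. A minimal sketch; the tuple layout and the ten-minute default for N are assumptions.

```python
from typing import Optional, Sequence, Tuple


def recurrence_rate(
    outcomes: Sequence[Tuple[float, Optional[float]]],
    window_seconds: float = 600.0,
) -> float:
    """Share of remediations whose problem recurred within the window.

    Each outcome is (remediated_at, recurred_at), with recurred_at set to
    None when the problem did not come back.
    """
    if not outcomes:
        return 0.0
    recurred = sum(
        1
        for remediated_at, recurred_at in outcomes
        if recurred_at is not None and recurred_at - remediated_at <= window_seconds
    )
    return recurred / len(outcomes)
```

A remediation whose rate sits above 0.5 is firing more often than it is fixing anything.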
Phasing it in
Don’t go from zero to fully-automated remediation. The phases I use:
1. Detect and notify. The pipeline fires, but the action is just a Slack message to the on-call. Two weeks at this stage.
2. Detect, dry-run, notify. The pipeline computes what it would do and reports it. The human runs the action if it looks right. Two more weeks.
3. Detect and act with manual confirm. The pipeline files a PR (yes, a PR) that, when merged, applies the remediation. Useful for non-urgent changes like memory bumps.
4. Detect and act autonomously. Only for narrow, well-tested cases with proven rollback.
Stage 1 is where you discover that 30% of your signals are noisy. Stage 2 is where you discover that 20% of your remediations are wrong. Both are cheap to find without the automation actually firing.
By the time you reach stage 4, the team trusts the pipeline because they’ve seen it not do dumb things for a month.
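One way to keep the four stages honest is to make the stage an explicit setting that the remediation code branches on, so promotion is a reviewed one-line change. A sketch with hypothetical names; it assumes every stage still notifies, since silent automation is the failure mode to avoid.

```python
from enum import Enum
from typing import List


class Phase(Enum):
    NOTIFY = 1      # stage 1: message the on-call only
    DRY_RUN = 2     # stage 2: compute and report the change
    CONFIRM = 3     # stage 3: open a PR for a human to merge
    AUTONOMOUS = 4  # stage 4: apply directly


def plan_actions(phase: Phase) -> List[str]:
    """Return the ordered actions the pipeline takes in a given phase."""
    actions = ["notify"]
    if phase is Phase.DRY_RUN:
        actions.append("report-dry-run")
    elif phase is Phase.CONFIRM:
        actions.append("open-pr")
    elif phase is Phase.AUTONOMOUS:
        actions.append("apply")
    return actions
```

Only the last phase ever returns "apply", which makes it easy to assert in a test that a new remediation cannot ship fully autonomous by accident.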
When humans beat automation
Auto-remediation works for failure modes that are characterized and bounded. It loses to humans on novel failure modes. Two corollaries:
- If your incident postmortem keeps surfacing “the auto-remediation made it worse,” you’ve over-automated. Pull back.
- If your incident postmortem keeps surfacing “humans took too long to do the obvious thing,” you’ve under-automated. Push forward.
Both signals matter. Reviewing the ratio quarterly tells you whether to add automation or remove it.
Gotchas
- No rate limit. Always set one. The default is unlimited. The default is wrong.
- No opt-in. Without it, every workload owner is surprised.
- No budget gate. Auto-remediation during an incident makes the incident worse.
- Stale metrics. A remediation triggered by a 5-minute-old metric fires after the situation has changed. Pin signals to live data.
- No dry-run mode. Every remediation should have --dry-run and you should run it that way for a week before flipping it to enforce.
- Remediation loops. The remediation triggers the same signal it was triggered by. Always include a “do not retrigger within N minutes” deduplication.
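The stale-metric and remediation-loop gotchas can share one guard. A minimal sketch, assuming the remediation container keeps per-target state in memory; the class name, the ten-minute cooldown and the 60-second freshness limit are all hypothetical.

```python
from typing import Dict


class RemediationGuard:
    """Rejects stale signals and repeat runs against the same target."""

    def __init__(
        self, cooldown_seconds: float = 600.0, max_signal_age_seconds: float = 60.0
    ):
        self.cooldown_seconds = cooldown_seconds
        self.max_signal_age_seconds = max_signal_age_seconds
        self._last_run: Dict[str, float] = {}

    def should_run(self, target: str, signal_time: float, now: float) -> bool:
        if now - signal_time > self.max_signal_age_seconds:
            return False  # signal is stale; the situation may have changed
        last = self._last_run.get(target)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down; avoid a remediation loop
        self._last_run[target] = now
        return True
```

State held in memory is lost on restart, so a production version would need to persist the last-run timestamps somewhere durable.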
Wrapping Up
Auto-remediation is one of the highest-leverage practices in the Digital Immune System pillars, and one of the easiest to ship as a foot-gun. Argo Events plus Kyverno gives you the declarative scaffolding to make it boring. Use opt-in, rate limits, budget gates, and dry-run. Treat every remediation policy as a code review, not a script.
Next I’ll move into the observability layer that all of this depends on, specifically the eBPF and OpenTelemetry pairing. Without good signals, auto-remediation is automation in the dark.