OPA 0.55 and Gatekeeper 3.13, Writing Admission Policies People Will Actually Maintain

September 21, 2023 · 7 min read · by Muhammad Amal

TL;DR — Start every new policy in audit mode for two weeks; only flip to enforce once the violation count is zero / Use ConstraintTemplate + Constraint separation so policy authors and policy consumers are different people / Rego is a query language pretending to be a programming language; structure your rules around violation[{...}] and stop trying to write imperative code.

Most clusters have a pile of admission policies that started as a Friday afternoon experiment and ended up blocking deploys at 2am six months later. The Gatekeeper part is the easy half. The hard half is writing policies that survive ownership changes, are debuggable by people who did not write them, and stop being a deployment hazard.

OPA 0.55 and Gatekeeper 3.13 are the current lines as of September 2023. The Gatekeeper team has done good work on mutation, external data, and audit performance; the docs have not entirely caught up with how to use these well. Here is the working set of patterns I keep coming back to.

The Two-Resource Model

Gatekeeper splits policy into ConstraintTemplate (the parametric rule, written in Rego) and Constraint (an instance of the template, with parameters and scope). This separation matters because the authors are usually different people. A platform engineer writes the template once. Application teams or namespace owners create constraints from it.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: object
                properties:
                  key: {type: string}
                  allowedRegex: {type: string}
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_].key}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }

        violation[{"msg": msg}] {
          some i
          label := input.parameters.labels[i]
          value := input.review.object.metadata.labels[label.key]
          not regex.match(label.allowedRegex, value)
          msg := sprintf("label %v=%v does not match %v",
                        [label.key, value, label.allowedRegex])
        }

And then a constraint:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: namespaces-must-have-owner
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels:
      - key: owner
        allowedRegex: "^team-[a-z]+$"
      - key: cost-center
        allowedRegex: "^[0-9]{4}$"

The constraint scopes the template to namespaces, sets the regex parameters, and crucially starts in dryrun mode.

The Three Enforcement Modes

enforcementAction has three values that matter operationally.

dryrun. The policy runs in audit mode. Violations show up on the constraint’s status.violations field and in metrics, but admission requests succeed. This is where every new policy starts.

warn. New in Gatekeeper 3.11+. Admission requests succeed but kubectl prints a warning. Useful as a soft rollout — developers see the message and start fixing things before the hammer drops.

deny. Admission is rejected. This is where you end up, but only after the violation count in dryrun has been zero for at least a week.

The mistake I see most often is flipping straight to deny and then frantically allowlisting the production namespaces that were already violating the policy. Run audit first.

Reading the Audit

The audit run is a periodic job — by default every minute, configurable — that evaluates all existing cluster resources against active constraints. Violations show up on the constraint:

kubectl get k8srequiredlabels namespaces-must-have-owner -o yaml | yq '.status.violations'

For a cluster of any size, the right answer is to push these into a dashboard. Gatekeeper exposes gatekeeper_violations as a Prometheus metric, labeled by enforcement_action rather than by constraint. A panel that shows that count over time tells you whether the cluster is converging toward zero violations or diverging. For a per-constraint breakdown, read status.totalViolations on each constraint or parse the audit logs.

The audit also surfaces existing problems that admission would never catch. Admission only sees new and updated objects. A Deployment that was created before the policy existed will not be caught at admission time, but the audit run will flag it on the next pass. Audit metrics are how you find these.

Mutation, Carefully

Gatekeeper 3.13 has stable mutation support. You can use it to add labels, set security contexts, inject sidecars. I have mixed feelings about this.

Mutation makes policies less visible. A developer applies a manifest, it gets mutated on the way in, the running object is different from the file on their disk. The drift between “what I wrote” and “what is running” is a debugging hazard.

The cases where I do use mutation:

Adding labels for cost allocation that are required by other policies. Better than failing admission for missing labels.
Setting default securityContext values on pods that did not specify them.
Pinning image digests by resolving tags at admission time. Powerful, controversial, easy to get wrong.

An example, setting runAsNonRoot: true as a default:

apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: pod-default-non-root
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  match:
    scope: Namespaced
    excludedNamespaces: ["kube-system", "gatekeeper-system"]
  location: "spec.securityContext.runAsNonRoot"
  parameters:
    assign:
      value: true
    pathTests:
      - subPath: "spec.securityContext.runAsNonRoot"
        condition: MustNotExist

The MustNotExist condition means we only set the default; if the pod author specified runAsNonRoot: false explicitly, we leave it. A second validation policy can then deny anything that ends up with runAsNonRoot: false from namespaces that should not have it.
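The second validation policy described above could be expressed with Rego along the following lines, placed inside its own ConstraintTemplate. This is a minimal sketch, not a finished policy: the package name k8spodnonroot is hypothetical, and the rule only inspects the pod-level securityContext, so container-level overrides would need an additional rule.

```rego
package k8spodnonroot

# Hypothetical template body: report pods whose pod-level securityContext
# explicitly sets runAsNonRoot to false. Which namespaces this applies to
# is decided by the Constraint's match block, not by the Rego.
violation[{"msg": msg}] {
  input.review.object.spec.securityContext.runAsNonRoot == false

  # Pods created by a controller may only carry generateName, so fall
  # back to a placeholder instead of leaving msg undefined.
  name := object.get(input.review.object.metadata, "name", "<unnamed>")
  msg := sprintf("pod %v sets runAsNonRoot: false", [name])
}
```

Because the Assign mutation only fills in a missing value, any pod that reaches this rule with runAsNonRoot: false set it deliberately, which is the case worth surfacing.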

Rego That Reads Well

Rego is a declarative query language. The maintainability issue is that every engineer who writes Rego for the first time tries to write JavaScript in it. The result is unreadable.

Three rules of thumb.

One violation rule per failure type. Multiple violation rules combine via OR — any one matching produces a violation. This is the natural way to express “the resource must satisfy all of these conditions”.

Named helpers for set operations. required - provided reads naturally. Iterating with some i; ...; not provided[required[i]] does not.

Test with opa test. Every constraint template should have a tests directory with input fixtures and expected output. The OPA CLI runs these in milliseconds. If your Rego is not testable, it is not maintainable.

opa test -v ./policies/

The OPA policy testing docs cover the syntax. The Gatekeeper policy library at github.com/open-policy-agent/gatekeeper-library is a good reference for how to structure tests.
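As a rough illustration of what such a test file could contain, the sketch below exercises the k8srequiredlabels rules shown earlier. It assumes the Rego has been extracted from the template into its own .rego file in the same directory as the test; the helper name review_input and the fixture values are hypothetical.

```rego
package k8srequiredlabels

# Builds a Gatekeeper-style input document for a Namespace carrying the
# given labels, with the same parameters as the example constraint.
review_input(labels) := {
  "review": {"object": {"kind": "Namespace", "metadata": {"name": "demo", "labels": labels}}},
  "parameters": {"labels": [
    {"key": "owner", "allowedRegex": "^team-[a-z]+$"},
    {"key": "cost-center", "allowedRegex": "^[0-9]{4}$"}
  ]}
}

# A namespace without cost-center should produce exactly one violation,
# and that violation should name the missing label.
test_missing_label_is_reported {
  results := violation with input as review_input({"owner": "team-payments"})
  count(results) == 1
  results[_].details.missing_labels == {"cost-center"}
}

# Both labels are present, but owner does not match its regex.
test_bad_owner_value_is_reported {
  results := violation with input as review_input({"owner": "Platform", "cost-center": "1234"})
  count(results) == 1
}

# A fully compliant namespace should produce no violations.
test_compliant_namespace_passes {
  results := violation with input as review_input({"owner": "team-payments", "cost-center": "1234"})
  count(results) == 0
}
```

Keeping one test per failure type mirrors the one-rule-per-failure-type structure of the policy, so a failing test points directly at the rule that changed.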

Constraint Library or Roll Your Own

The Gatekeeper policy library covers most of what you would want — pod security, required labels, image registry restrictions, host network bans. For 80% of teams, it is the right starting point. Fork it, version it, prune what you do not use.

Where I have ended up writing custom policies is around organisational rules that no library will know about: “deployments in the payments namespace must reference an image from ghcr.io/myorg/payments-*”, or “ingresses with TLS must use a cert from our internal issuer”, or “the owner label must match a real team in our team registry” (this one uses Gatekeeper’s external data feature to call the registry API).

External data is powerful and slow. A policy that calls an external service at admission time adds latency to every admission request. Cache aggressively, have a fallback for when the service is down, and never put a critical dependency in the admission path that you would not put in the dataplane.
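For the team-registry example, the Rego side of an external data lookup might look roughly like the sketch below. It relies on Gatekeeper's external_data builtin and on the errors and system_error fields of its response as described in the Gatekeeper external data documentation. The provider name team-registry, the package name, and the assumption that the provider reports an unknown team as a per-key error are all hypothetical.

```rego
package k8sownerregistry

# Ask the (hypothetical) "team-registry" provider whether the owner label
# names a real team. A missing owner label is left to the required-labels
# policy: without the label, this rule simply does not fire.
violation[{"msg": msg}] {
  owner := input.review.object.metadata.labels.owner
  response := external_data({"provider": "team-registry", "keys": [owner]})
  lookup_failed(response)
  msg := sprintf("owner %v could not be verified against the team registry: %v", [owner, response])
}

# The provider returned a per-key error, assumed here to mean "unknown team".
lookup_failed(response) {
  count(response.errors) > 0
}

# The provider itself was unreachable or failed. This sketch fails closed;
# dropping this helper would make the policy fail open instead.
lookup_failed(response) {
  count(response.system_error) > 0
}
```

The second helper is where the fallback decision lives. Since external_data exists only inside Gatekeeper, running this under plain opa test would require mocking the builtin.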

Common Pitfalls

Cluster-scope constraints in dev clusters. A constraint with no match.namespaces field applies cluster-wide, including kube-system. The CoreDNS deployment will fail your label policy and the cluster will degrade. Always start with explicit excludedNamespaces for the system namespaces.

Webhook timeout. Gatekeeper’s admission webhook has a default timeout of 3 seconds. A slow Rego policy or a hanging external-data call will time out and, depending on the webhook’s failurePolicy, either fail open (security risk) or fail closed (cluster outage). Set failurePolicy: Fail for production, but only after audit shows the policies are fast enough.

Rego library imports. Gatekeeper bundles a specific OPA version. Rego language features added upstream do not all reach Gatekeeper immediately. If you copy Rego from a recent OPA tutorial and it breaks, check the Gatekeeper version’s OPA dependency.

Mutation order with multiple webhooks. If you run another mutating webhook (Istio sidecar injection, the Vault Agent Injector), the order matters. Gatekeeper mutations run in a specific phase; sidecar injectors run in another. Test the interaction in a cluster, not in your head.

Forgetting to scope by namespace. Constraints can match by namespace label, namespace name, or label selectors on the object. Use the most specific match you can. A constraint that matches “all pods” and then has an internal allowlist is harder to reason about than a constraint scoped to specific namespaces.

Wrapping Up

Admission policy is the cheapest place to enforce security invariants because the resource does not exist yet, so there is no rollback and no incident. The cost is operational: someone has to own the policy, audit its impact, and update it as the cluster evolves. The working recipe is a two-week dryrun, Prometheus on violation counts, and tests in CI. Once admission is solid, the next layer is making sure the supply chain feeding admission is itself trustworthy, which is what SLSA provenance addresses.
