ArgoCD ApplicationSets at Scale, A Multi-Tenant Pattern That Survives 200 Services

June 6, 2023 · 7 min read · by Muhammad Amal · programming

TL;DR — Hand-authoring one ArgoCD Application per service stops working around 30 services. / ApplicationSet with Git and matrix generators turns the deployment manifest into convention-over-configuration. / Pair it with project-scoped RBAC and sync windows, or you’ll wake up to one team breaking everyone’s deploys.

The first time I rolled out ArgoCD, we had 12 services and a single cluster. The Application manifests lived in a folder, were copy-pasted from each other, and life was good. Two years later, that same shape had become a 400-file directory where nobody could answer “why is the staging URL still pointing at the old chart version?” without grepping.

ArgoCD 2.7’s ApplicationSet controller is the answer, but only if you commit to convention. This post is the structure I now use by default for any org that has more than ~20 services and at least two clusters. It works, it’s been through audits, and it doesn’t require a dedicated GitOps engineer to keep the lights on.

If you missed the previous post, the org shape behind this — platform team, stream teams, golden paths — is covered in why platform engineering is not DevOps rebranded.

The repo layout

There’s a perennial argument about monorepo versus polyrepo for GitOps. For deployments specifically, the monorepo wins almost every time. One repo, one history, one set of CODEOWNERS rules. Here is the layout I use:

deploy/
  apps/
    checkout/
      base/
        kustomization.yaml
        deployment.yaml
      overlays/
        staging/
          kustomization.yaml
          values.yaml
        prod-us/
          kustomization.yaml
          values.yaml
        prod-eu/
          kustomization.yaml
          values.yaml
    inventory/
      base/...
      overlays/...
  platform/
    appprojects/
      checkout.yaml
      inventory.yaml
    applicationsets/
      stream-services.yaml
    clusters/
      staging.yaml
      prod-us.yaml
      prod-eu.yaml

The contract for stream teams is: own the apps/<service>/ directory. Don’t touch platform/. CODEOWNERS enforces it.

One ApplicationSet to rule them

The trick is to stop writing Application manifests entirely. The platform team owns one ApplicationSet (per service shape, more on that below) that generates them.

# platform/applicationsets/stream-services.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: stream-services
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - matrix:
        generators:
          - git:
              repoURL: https://github.com/acme/deploy.git
              revision: main
              directories:
                - path: apps/*/overlays/*
          - clusters:
              selector:
                matchLabels:
                  argocd.argoproj.io/secret-type: cluster
  template:
    metadata:
      name: '{{ index .path.segments 1 }}-{{ index .path.segments 3 }}'
      labels:
        service: '{{ index .path.segments 1 }}'
        env: '{{ index .path.segments 3 }}'
    spec:
      project: '{{ index .path.segments 1 }}'
      source:
        repoURL: https://github.com/acme/deploy.git
        targetRevision: main
        path: '{{ .path.path }}'
      destination:
        server: '{{ .server }}'
        namespace: '{{ index .path.segments 1 }}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
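      # ignoreDifferences belongs directly in the template spec. templatePatch
      # only exists from Argo CD 2.10 onward and is not needed for a static list.
      ignoreDifferences:
        - group: apps
          kind: Deployment
          jsonPointers:
            - /spec/replicas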

Three things are doing work here:

The Git generator scans apps/*/overlays/* so the moment a stream team adds apps/payments/overlays/staging/, an Application appears. No platform-team PR required.
The matrix generator crosses overlays with registered clusters. You don’t manually map “this service goes to prod-eu”; the overlay folder name encodes it, and the clusters generator filter picks the matching cluster (see below).
ServerSideApply=true is the only sane choice in 2023. It makes field ownership explicit, which helps stop fights with the HPA over spec.replicas, provided the rendered manifest does not set replicas itself.

If you have multiple service shapes — say, batch jobs that should not auto-sync — write a second ApplicationSet with a different selector and template. Don’t try to make one set serve every case.
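As a rough sketch of what such a second set might look like for batch jobs: the batch/ directory, the batch-jobs name, and the assumption that each cluster is registered under the same name as its overlay folder are all hypothetical and not part of the layout above. The point is only that syncPolicy.automated is left out, so every sync is a deliberate action.

```yaml
# platform/applicationsets/batch-jobs.yaml (hypothetical file name)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: batch-jobs
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - git:
        repoURL: https://github.com/acme/deploy.git
        revision: main
        directories:
          # Assumed layout: batch/<job>/overlays/<cluster-name>
          - path: batch/*/overlays/*
  template:
    metadata:
      name: 'batch-{{ index .path.segments 1 }}-{{ index .path.segments 3 }}'
      labels:
        shape: batch
    spec:
      project: '{{ index .path.segments 1 }}'
      source:
        repoURL: https://github.com/acme/deploy.git
        targetRevision: main
        path: '{{ .path.path }}'
      destination:
        # Assumes the cluster is registered in ArgoCD under the overlay's name
        name: '{{ index .path.segments 3 }}'
        namespace: '{{ index .path.segments 1 }}'
      syncPolicy:
        # No "automated" block: batch jobs sync only when someone triggers it
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
```

Keeping the two shapes in separate files also means a change to batch behaviour never shows up in the diff for the stream-service set.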

Filtering overlays to clusters

The naive matrix generator above will try to deploy every overlay to every cluster. You don’t want that. The fix is to gate destinations inside the overlay itself.

Name the overlay after the cluster’s environment (staging, prod-us, prod-eu), put a matching env label on each cluster Secret, and let the clusters generator interpolate the overlay name into its selector. The matrix generator passes the first child’s parameters to the second, so the selector can reference the path:

- clusters:
    selector:
      matchLabels:
        env: '{{ index .path.segments 3 }}'

With goTemplate: true, this produces a pairing only when the cluster’s env label equals the overlay folder name. An {{ if }} inside the template fields cannot do this job: template fields are rendered as strings, so they can change an Application’s values but cannot suppress the Application. The ArgoCD docs cover the generator combinators in detail at argo-cd.readthedocs.io.
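For the clusters generator to filter on env, the label has to exist on the cluster Secret that ArgoCD stores for each registered cluster. A minimal sketch of one such Secret follows, assuming it lives at platform/clusters/prod-eu.yaml; the Secret name and server URL are placeholders, and the credentials section is deliberately reduced to a stub.

```yaml
# platform/clusters/prod-eu.yaml (sketch; credentials omitted)
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-eu            # hypothetical Secret name
  namespace: argocd
  labels:
    # Marks this Secret as a cluster registration for ArgoCD
    argocd.argoproj.io/secret-type: cluster
    # Label the clusters generator selects on
    env: prod-eu
type: Opaque
stringData:
  name: prod-eu                                    # exposed to templates as .name
  server: https://prod-eu.example.internal:6443    # placeholder URL, exposed as .server
  config: |
    {
      "tlsClientConfig": { "insecure": false }
    }
```

In practice the config block would carry real authentication settings, and most teams would source the Secret from an external secret store rather than commit it as plain YAML.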

Project-scoped RBAC: the part most people skip

AppProject is what turns ArgoCD from a single-tenant tool into something you can hand to multiple teams. One project per service (or per team) is the right granularity. Anything coarser and you’ll end up with the payments team able to sync the inventory service’s manifests.

# platform/appprojects/checkout.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: checkout
  namespace: argocd
spec:
  description: Checkout service
  sourceRepos:
    - https://github.com/acme/deploy.git
  destinations:
    - namespace: checkout
      server: '*'
  clusterResourceWhitelist:
    - group: ''
      kind: Namespace
  namespaceResourceBlacklist:
    - group: ''
      kind: ResourceQuota
    - group: ''
      kind: LimitRange
  roles:
    - name: developer
      policies:
        - p, proj:checkout:developer, applications, sync, checkout/*, allow
        - p, proj:checkout:developer, applications, get, checkout/*, allow
      groups:
        - acme:checkout-team
  syncWindows:
    - kind: deny
      schedule: '0 17 * * 5'
      duration: 60h
      applications:
        - '*'
      manualSync: true

The sync window is non-negotiable in any org that has ever had a Friday-evening incident. Stream teams can still manually sync if they really need to ship — and that conscious override creates a paper trail.

Notice that ResourceQuota and LimitRange are blacklisted from stream-team management. The platform team owns those. If a team needs more memory budget, they ask the platform team to raise it; they don’t quietly edit it in a PR.

Helm vs Kustomize: pick one per service, not per org

I’ve stopped fighting this religious war. Some services have complex templating needs and Helm is correct. Some have minimal differences across envs and Kustomize is simpler. The ApplicationSet doesn’t care — source.path works for both. What matters is one rendering tool per service. Mixing Helm and Kustomize on the same service via the chart-rendered-then-kustomized pattern is a debugging nightmare. The diff in the ArgoCD UI becomes useless.

For Helm in 2023, the helm.valueFiles approach with one values file per env is fine. For Kustomize, the overlays-with-strategic-merge approach above is fine. Don’t combine them.
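As an illustration of the Helm variant, here is roughly what the source block of a generated Application could look like. This is a sketch under assumptions: the apps/checkout/chart path and the values-prod-eu.yaml file name are hypothetical and do not appear in the layout above.

```yaml
# Fragment of an Application spec for a Helm-rendered service (sketch)
source:
  repoURL: https://github.com/acme/deploy.git
  targetRevision: main
  path: apps/checkout/chart        # hypothetical: directory containing Chart.yaml
  helm:
    valueFiles:
      # Paths are resolved relative to the source path above
      - values.yaml                # shared defaults
      - values-prod-eu.yaml        # per-environment overrides, applied last
```

Later entries in valueFiles override earlier ones, which is what makes the one-file-per-env convention work.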

Drift, self-heal, and the audit trail

selfHeal: true is correct for stream services. If someone kubectl edits a deployment in prod, ArgoCD should revert it within seconds. The exception is anything controlled by an in-cluster controller: HPA replicas, VPA recommendations, Karpenter labels. These go in ignoreDifferences, either at the ApplicationSet template level (as above) or system-wide through the resource.customizations.ignoreDifferences settings in argocd-cm. If a sync should also leave the ignored fields untouched rather than just hide them from the diff, add RespectIgnoreDifferences=true to syncOptions.
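Besides the template-level ignoreDifferences, ArgoCD also accepts a system-wide rule in the argocd-cm ConfigMap, which applies to every Application on the instance. A minimal sketch, assuming you want every Deployment's replica count ignored and are happy for that to be global:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Key format: resource.customizations.ignoreDifferences.<group>_<kind>
  resource.customizations.ignoreDifferences.apps_Deployment: |
    jsonPointers:
      - /spec/replicas
```

The trade-off is scope: a rule here cannot be narrowed to one service, so anything service-specific is better kept in the Application spec.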

For the audit trail, ArgoCD’s built-in events are not enough. Ship them to your SIEM. The events you care about: OperationCompleted, ResourceUpdated, Sync, and anything with reason: SyncFailed. Many teams pipe these through argocd-notifications to Slack and to a longer-term store.
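One way to wire up the Slack half of this is a trigger and template in argocd-notifications-cm. The sketch below assumes a Slack token is stored under the slack-token key of argocd-notifications-secret; the template name and message wording are illustrative choices.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # Slack service; $slack-token is read from argocd-notifications-secret
  service.slack: |
    token: $slack-token
  # Fire when a sync operation ends in Error or Failed
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]
  # Message body sent by the trigger above
  template.app-sync-failed: |
    message: |
      Sync of {{.app.metadata.name}} failed: {{.app.status.operationState.message}}
```

Applications opt in with an annotation such as notifications.argoproj.io/subscribe.on-sync-failed.slack: deploy-alerts, where deploy-alerts is a placeholder channel name. Setting it in the ApplicationSet template's metadata subscribes every generated Application at once.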

Common Pitfalls

Letting stream teams write their own Applications. As soon as one team does it, every team does. The ApplicationSet becomes a polite fiction. Use admission policies (Kyverno or OPA Gatekeeper) to deny manually created Application resources outside the argocd namespace, or tighten Kubernetes RBAC so that stream teams cannot create Application resources at all.
One AppProject for everyone. Sounds simpler. Isn’t. The first incident where Team A’s sync brings down Team B’s namespace because they shared an over-broad destinations block is enough to convince you.
Forgetting the chart version pin. A floating targetRevision on a chart from a public repo (HEAD for a Git-hosted chart, or * or a loose semver range for a Helm repository) is a supply-chain incident waiting to happen. Pin every version, and use Renovate or Dependabot to bump them with PRs.
No drift dashboards. ArgoCD will eventually sync everything, but you want to know when self-heal is firing constantly. That signals a controller fight or a human messing with prod. Scrape argocd_app_info with Prometheus and alert when its sync_status label is anything other than "Synced" for longer than 10 minutes.
Trying to use ApplicationSet generators for cluster bootstrap. Bootstrap (cert-manager, ingress, the platform itself) belongs in a separate ApplicationSet with a different generator — usually the list generator with explicit clusters. Mixing it with the stream-service set causes circular dependencies during DR.
No sync windows. I’ll say it twice because every team learns this the hard way.
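For the drift-dashboard pitfall, a Prometheus alerting rule along these lines is one possible starting point. It assumes the ArgoCD application controller metrics are already being scraped and that argocd_app_info exposes the sync_status label; the group name, alert name, and severity are arbitrary choices.

```yaml
groups:
  - name: argocd-drift              # hypothetical rule group name
    rules:
      - alert: ArgoCDAppOutOfSync
        # argocd_app_info is a gauge with value 1 per Application;
        # sync_status carries the current sync state as a label
        expr: argocd_app_info{sync_status!="Synced"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} has been {{ $labels.sync_status }} for more than 10 minutes"
```

This catches Applications that stay out of sync. Spotting self-heal that fires constantly but briefly needs a separate rule on sync frequency.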

Wrapping Up

The shape above scales to a few hundred services per ArgoCD instance, which is roughly the ceiling before you want to shard ArgoCD itself (multiple instances, one per business unit, with a shared catalog in Backstage). The principles transfer cleanly.

Next up: the multi-tenancy story below ArgoCD — namespaces, quotas, and what Kubernetes 1.27 actually gives you for hard isolation. Spoiler: it’s better than 1.24 was, but it’s still not magic.
