ArgoCD ApplicationSets at Scale: A Multi-Tenant Pattern That Survives 200 Services
TL;DR: Hand-authoring one ArgoCD Application per service stops working around 30 services. An ApplicationSet with Git and matrix generators turns the deployment manifest into convention-over-configuration. Pair it with project-scoped RBAC and sync windows, or you’ll wake up to one team breaking everyone’s deploys.
The first time I rolled out ArgoCD, we had 12 services and a single cluster. The Application manifests lived in a folder, were copy-pasted from each other, and life was good. Two years later, that same shape becomes a 400-file directory where nobody can answer “why is the staging URL still pointing at the old chart version?” without grepping.
ArgoCD 2.7’s ApplicationSet controller is the answer, but only if you commit to convention. This post is the structure I now use by default for any org that has more than ~20 services and at least two clusters. It works, it’s been through audits, and it doesn’t require a dedicated GitOps engineer to keep the lights on.
If you missed the previous post, the org shape behind this — platform team, stream teams, golden paths — is covered in why platform engineering is not DevOps rebranded.
The repo layout
There’s a perennial argument about monorepo versus polyrepo for GitOps. For deployments specifically, the monorepo wins almost every time. One repo, one history, one set of CODEOWNERS rules. Here is the layout I use:
deploy/
  apps/
    checkout/
      base/
        kustomization.yaml
        deployment.yaml
      overlays/
        staging/
          kustomization.yaml
          values.yaml
        prod-us/
          kustomization.yaml
          values.yaml
        prod-eu/
          kustomization.yaml
          values.yaml
    inventory/
      base/...
      overlays/...
  platform/
    appprojects/
      checkout.yaml
      inventory.yaml
    applicationsets/
      stream-services.yaml
    clusters/
      staging.yaml
      prod-us.yaml
      prod-eu.yaml
The contract for stream teams is: own the apps/<service>/ directory. Don’t touch platform/. CODEOWNERS enforces it.
One ApplicationSet to rule them
The trick is to stop writing Application manifests entirely. The platform team owns one ApplicationSet (per service shape, more on that below) that generates them.
# platform/applicationsets/stream-services.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: stream-services
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - matrix:
        generators:
          # One parameter set per overlay directory, e.g. apps/checkout/overlays/staging
          - git:
              repoURL: https://github.com/acme/deploy.git
              revision: main
              directories:
                - path: apps/*/overlays/*
          # One parameter set per cluster registered with ArgoCD
          - clusters:
              selector:
                matchLabels:
                  argocd.argoproj.io/secret-type: cluster
  template:
    metadata:
      # path.segments is [apps, <service>, overlays, <env>]
      name: '{{ index .path.segments 1 }}-{{ index .path.segments 3 }}'
      labels:
        service: '{{ index .path.segments 1 }}'
        env: '{{ index .path.segments 3 }}'
    spec:
      project: '{{ index .path.segments 1 }}'
      source:
        repoURL: https://github.com/acme/deploy.git
        targetRevision: main
        path: '{{ .path.path }}'
      destination:
        server: '{{ .server }}'
        namespace: '{{ index .path.segments 1 }}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
  # Note: templatePatch needs a newer ArgoCD (2.10+, as far as I can tell).
  # On older versions, put ignoreDifferences under template.spec instead.
  templatePatch: |
    spec:
      ignoreDifferences:
        - group: apps
          kind: Deployment
          jsonPointers:
            - /spec/replicas
Three things are doing work here:
- The Git generator scans apps/*/overlays/*, so the moment a stream team adds apps/payments/overlays/staging/, an Application appears. No platform-team PR is required beyond the service’s one-time AppProject.
- The matrix generator crosses overlays with registered clusters. You don’t manually map “this service goes to prod-eu”: the overlay path encodes it, and the clusters filter (see below) does the matching.
- ServerSideApply=true is the only sane choice in 2023. It makes field ownership explicit and stops fights with the HPA over spec.replicas.
If you have multiple service shapes — say, batch jobs that should not auto-sync — write a second ApplicationSet with a different selector and template. Don’t try to make one set serve every case.
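As a rough sketch of what a second set could look like, here is a batch-job variant that omits automated sync. The file name, the batch/ directory, and the batch- name prefix are hypothetical; nothing in the layout above defines them, and the same cluster filtering discussed below would still apply.

```yaml
# platform/applicationsets/batch-jobs.yaml (hypothetical file name)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: batch-jobs
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - matrix:
        generators:
          - git:
              repoURL: https://github.com/acme/deploy.git
              revision: main
              directories:
                # Assumption: batch jobs live under their own top-level directory
                - path: batch/*/overlays/*
          - clusters:
              selector:
                matchLabels:
                  argocd.argoproj.io/secret-type: cluster
  template:
    metadata:
      name: 'batch-{{ index .path.segments 1 }}-{{ index .path.segments 3 }}'
    spec:
      project: '{{ index .path.segments 1 }}'
      source:
        repoURL: https://github.com/acme/deploy.git
        targetRevision: main
        path: '{{ .path.path }}'
      destination:
        server: '{{ .server }}'
        namespace: '{{ index .path.segments 1 }}'
      # No syncPolicy.automated: these Applications only change when someone syncs them
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
```

Keeping the two sets separate means a change to the batch template can never alter how stream services sync.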
Filtering overlays to clusters
The naive matrix generator above will try to deploy every overlay to every cluster. You don’t want that. The fix is to gate destinations inside the overlay itself.
I encode the target cluster as a label on the overlay’s kustomization.yaml and use a selector in the generator. Even simpler: name the overlay after the cluster (staging, prod-us, prod-eu) and add a clusters generator filter:
- clusters:
    selector:
      matchExpressions:
        - key: env
          operator: In
          values: ['staging', 'prod-us', 'prod-eu']
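Then tie the two generators together. With goTemplate: true, the overlay folder name is available as {{ index .path.segments 3 }}, and a matrix generator lets its second child reference parameters produced by the first, so the clusters selector can require that the cluster’s env label equal the overlay folder name. Do the matching in the generator rather than with a guard such as {{- if eq .name (index .path.segments 3) }} in the template: the template only renders fields and cannot decide whether an Application is emitted. The ArgoCD docs cover the generator combinators in detail at argo-cd.readthedocs.io.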
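A minimal sketch of that matching, assuming every cluster Secret carries an env label whose value equals the overlay folder name. The label key follows the selector above; the Secret name and server URL are placeholders, not values from a real setup.

```yaml
# Generator fragment: the clusters child reads a parameter produced by the git child
generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/acme/deploy.git
            revision: main
            directories:
              - path: apps/*/overlays/*
        - clusters:
            selector:
              matchLabels:
                argocd.argoproj.io/secret-type: cluster
                # Only clusters labelled with the overlay's folder name are paired with it
                env: '{{ index .path.segments 3 }}'
---
# Cluster Secret carrying the label the selector matches on (placeholder values)
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-eu
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    env: prod-eu
type: Opaque
stringData:
  name: prod-eu
  server: https://prod-eu.example.internal:6443
```

With this in place, apps/checkout/overlays/prod-eu pairs only with the cluster labelled env: prod-eu, so each overlay yields exactly one Application.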
Project-scoped RBAC: the part most people skip
AppProject is what turns ArgoCD from a single-tenant tool into something you can hand to multiple teams. One project per service (or per team) is the right granularity. Anything coarser and you’ll end up with the payments team able to sync the inventory service’s manifests.
# platform/appprojects/checkout.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: checkout
  namespace: argocd
spec:
  description: Checkout service
  sourceRepos:
    - https://github.com/acme/deploy.git
  destinations:
    - namespace: checkout
      server: '*'
  clusterResourceWhitelist:
    - group: ''
      kind: Namespace
  namespaceResourceBlacklist:
    - group: ''
      kind: ResourceQuota
    - group: ''
      kind: LimitRange
  roles:
    - name: developer
      policies:
        - p, proj:checkout:developer, applications, sync, checkout/*, allow
        - p, proj:checkout:developer, applications, get, checkout/*, allow
      groups:
        - acme:checkout-team
  syncWindows:
    # Deny automated syncs from Friday 17:00 for 60 hours (until Monday 05:00)
    - kind: deny
      schedule: '0 17 * * 5'
      duration: 60h
      applications:
        - '*'
      # Manual syncs remain possible during the window
      manualSync: true
The sync window is non-negotiable in any org that has ever had a Friday-evening incident. Stream teams can still manually sync if they really need to ship — and that conscious override creates a paper trail.
Notice that ResourceQuota and LimitRange are blacklisted from stream-team management. The platform team owns those. If a team needs more memory budget, they ask the platform team for it; they don’t quietly edit it in a PR.
Helm vs Kustomize: pick one per service, not per org
I’ve stopped fighting this religious war. Some services have complex templating needs and Helm is correct. Some have minimal differences across envs and Kustomize is simpler. The ApplicationSet doesn’t care — source.path works for both. What matters is one rendering tool per service. Mixing Helm and Kustomize on the same service via the chart-rendered-then-kustomized pattern is a debugging nightmare. The diff in the ArgoCD UI becomes useless.
For Helm in 2023, the helm.valueFiles approach with one values file per env is fine. For Kustomize, the overlays-with-strategic-merge approach above is fine. Don’t combine them.
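For a Helm-rendered service, the source block might look roughly like the sketch below. The chart directory and values file name are hypothetical, since the layout above only shows Kustomize overlays; entries in valueFiles are resolved relative to the source path.

```yaml
# Application source fragment for a Helm-rendered service (illustrative paths)
source:
  repoURL: https://github.com/acme/deploy.git
  targetRevision: main
  # Hypothetical: the directory that contains Chart.yaml for this service
  path: apps/checkout/chart
  helm:
    valueFiles:
      # Hypothetical per-environment values file, one per env
      - values-staging.yaml
```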
Drift, self-heal, and the audit trail
selfHeal: true is correct for stream services. If someone kubectl edits a deployment in prod, ArgoCD should revert it within seconds. The exception is anything controlled by an in-cluster controller: HPA replicas, VPA recommendations, Karpenter labels. These go in ignoreDifferences, either at the ApplicationSet template level (as above) or instance-wide via the resource.customizations.ignoreDifferences settings in argocd-cm.
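One detail that is easy to miss: by default, ignoreDifferences only affects how ArgoCD computes the diff, so a sync can still write the desired value over an ignored field. The sketch below pairs it with the RespectIgnoreDifferences sync option so the field is also left alone during sync. It is a fragment of an Application spec, not a complete manifest.

```yaml
# Application spec fragment: ignore HPA-managed replicas in the diff and during sync
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      # Without this, ignored fields are still applied when a sync runs
      - RespectIgnoreDifferences=true
```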
For the audit trail, ArgoCD’s built-in events are not enough. Ship them to your SIEM. The events you care about: OperationCompleted, ResourceUpdated, Sync, and anything with reason: SyncFailed. Many teams pipe these through argocd-notifications to Slack and to a longer-term store.
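As a hedged sketch of the Slack half of that pipeline, the notifications ConfigMap below defines a sync-failure trigger and a message template. The trigger and template names and the channel are illustrative, and the Slack token is assumed to live under the slack-token key of argocd-notifications-secret.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # Token is read from the notifications Secret (assumed key: slack-token)
  service.slack: |
    token: $slack-token
  # Fire when a sync operation ends in Error or Failed
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]
  template.app-sync-failed: |
    message: |
      Sync of {{.app.metadata.name}} failed: {{.app.status.operationState.message}}
---
# Annotation to add under template.metadata in the ApplicationSet (hypothetical channel name)
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack: deploy-alerts
```

Putting the subscription annotation in the ApplicationSet template means every generated Application reports failures without any per-service setup.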
Common Pitfalls
- Letting stream teams write their own Applications. As soon as one team does it, every team does. The ApplicationSet becomes a polite fiction. Use admission policies (Kyverno or OPA Gatekeeper) to deny manually created Application resources outside the argocd namespace, or set the cluster-scoped controller permissions accordingly.
- One AppProject for everyone. Sounds simpler. Isn’t. The first incident where Team A’s sync brings down Team B’s namespace because they shared an over-broad destinations block is enough to convince you.
- Forgetting the chart version pin. targetRevision: HEAD on a Helm chart from a public repo is a supply-chain incident waiting to happen. Pin every version, and use Renovate or Dependabot to bump them with PRs.
- No drift dashboards. ArgoCD will eventually sync everything, but you want to know when self-heal is firing constantly. That signals a controller fight or a human messing with prod. Export argocd_app_info to Prometheus and alert on sync_status != "Synced" for longer than 10 minutes.
- Trying to use ApplicationSet generators for cluster bootstrap. Bootstrap (cert-manager, ingress, the platform itself) belongs in a separate ApplicationSet with a different generator, usually the list generator with explicit clusters. Mixing it with the stream-service set causes circular dependencies during DR.
- No sync windows. I’ll say it twice because every team learns this the hard way.
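For the drift-dashboard pitfall, a Prometheus rule along these lines is a reasonable starting point. It assumes the application controller's metrics are already being scraped and relies on the sync_status label of argocd_app_info; the group name, alert name, and severity are made up for illustration.

```yaml
groups:
  - name: argocd-drift
    rules:
      - alert: ArgoAppNotSynced
        # argocd_app_info is a gauge with value 1 per application
        expr: argocd_app_info{sync_status!="Synced"} == 1
        # Only fire once the app has stayed out of sync for 10 minutes
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} has been {{ $labels.sync_status }} for more than 10 minutes"
```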
Wrapping Up
The shape above scales to a few hundred services per ArgoCD instance, which is roughly the ceiling before you want to shard ArgoCD itself (multiple instances, one per business unit, with a shared catalog in Backstage). The principles transfer cleanly.
Next up: the multi-tenancy story below ArgoCD — namespaces, quotas, and what Kubernetes 1.27 actually gives you for hard isolation. Spoiler: it’s better than 1.24 was, but it’s still not magic.