Service Catalog Design That Scales, Lessons From Production

October 27, 2025 · 11 min read · by Muhammad Amal

TL;DR: A service catalog past about a hundred services starts to crack along three seams: ownership lies, kind explosion, and refresh cost. Get the entity model right early. Use the seven default kinds, define ownership by Group only, and make every entity owned by exactly one YAML file in exactly one repo. Everything else flows from those rules.

The Backstage catalog is deceptively simple at small scale. A Component, a Group, an owner field, and you’ve got a working ownership model. The trouble starts when the catalog grows past the point where one platform engineer can keep all entities in their head. That’s somewhere between fifty and a hundred services. After that, you’re maintaining a catalog the same way you maintain a database, with schemas, conventions, and migration plans.

This post is a tour of the design decisions I’d defend at the next company I join, based on running catalogs of three thousand entities across two production deployments. The Backstage 1.34 catalog API and the entity model are the same as they’ve been for years; the bugs in how teams use them are also the same.

The earlier posts in this series cover the portal setup and golden-path templates that feed entities into the catalog. Here the question is how to model the catalog itself so it survives growth.

1. The Seven Kinds and What They Mean

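Backstage ships with seven first-class kinds for modeling software and people: Component, API, Resource, System, Domain, Group, and User. (Location and Template also ship by default, but they drive ingestion and scaffolding rather than describing what you run.) These seven are the only modeling kinds I use. Custom kinds are technically supported, but I’ve never seen them pay back the divergence cost. The right move is to use the standard kinds with strict conventions on spec.type.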

Domain (e.g., "commerce")
  +- System (e.g., "checkout")
       +- Component (service, library, website)
       |     +- providesApis: [API]
       |     +- dependsOn: [Resource, Component]
       +- API (openapi, grpc, graphql)
       +- Resource (database, queue, bucket)

Domain is the top of the hierarchy. Domains map to lines of business or product areas. A single domain typically owns three to ten systems. Don’t make domains too fine-grained or you’ll need a domain-of-domains, which Backstage doesn’t model.

System groups related components. The rule is “a system is the unit a single team would conceivably take on call for”. If two engineers from different teams disagree about whether something is one system or two, it’s two.

Component is what most people think of when they hear “service catalog entry”. A component has a spec.type. The conventional values are service, library, website, documentation. Pick a list and stick with it. Don’t let teams invent new types in their catalog-info.yaml. The catalog will gladly accept spec.type: hummus and never tell you.
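Since the catalog will not reject an unknown spec.type on its own, the check has to live in your lint step. A minimal sketch, assuming a hypothetical allow-list and a hypothetical checkComponentType helper (neither is part of Backstage):

```typescript
// Hypothetical allow-list; replace with whatever your organization standardizes on.
const ALLOWED_COMPONENT_TYPES: readonly string[] = [
  "service",
  "library",
  "website",
  "documentation",
];

// Only the fields this check needs; real entities carry more.
interface MinimalEntity {
  kind: string;
  metadata: { name: string };
  spec?: { type?: string };
}

// Returns an error message for a Component with a missing or unknown spec.type,
// or null when the entity passes (non-Component kinds are not checked here).
function checkComponentType(entity: MinimalEntity): string | null {
  if (entity.kind !== "Component") return null;
  const type = entity.spec?.type;
  if (type === undefined) {
    return `${entity.metadata.name}: spec.type is missing`;
  }
  if (ALLOWED_COMPONENT_TYPES.indexOf(type) === -1) {
    return `${entity.metadata.name}: spec.type "${type}" is not one of ${ALLOWED_COMPONENT_TYPES.join(", ")}`;
  }
  return null;
}
```

Run it over every parsed catalog-info.yaml in CI and fail the build on the first non-null result.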

API is the explicit interface contract. Components declare the APIs they expose via providesApis: [...] and the ones they consume via consumesApis: [...]. The API entity has a definition field that holds the actual OpenAPI or proto spec, either inline or by reference.

Resource covers everything that isn’t a Component or API. Databases, queues, S3 buckets, CDN distributions. The resource is owned by the team that operates the underlying infrastructure, which may not be the team consuming it.

Group and User are the people. Group is a team or a department. User is an individual. Both come from your identity provider via the catalog ingestion.

2. The Ownership Model

The single most important rule in catalog design: every entity is owned by exactly one Group. Not a User. Not “team-payments” as a string. A real Group entity reference. Like this:

spec:
  owner: group:default/team-payments

The full ref (group:default/...) is verbose but unambiguous. Half-strings like team-payments work in some places and break in others. Adopt the full ref everywhere via lint.
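The "full ref everywhere" rule can be sketched as a small lint function. parseFullRef and checkOwnerRef are hypothetical names for illustration; in a real Backstage codebase the parseEntityRef helper from @backstage/catalog-model is the more likely starting point, and this standalone version only assumes the kind:namespace/name shape shown above.

```typescript
interface ParsedRef {
  kind: string;
  namespace: string;
  name: string;
}

// Parses a full ref of the form kind:namespace/name.
// Returns null for shorthand such as "team-payments" or "group:team-payments".
function parseFullRef(ref: string): ParsedRef | null {
  const match = /^([^:/]+):([^:/]+)\/([^:/]+)$/.exec(ref);
  if (match === null) return null;
  return { kind: match[1].toLowerCase(), namespace: match[2], name: match[3] };
}

// Lint rule: spec.owner must be a full ref that points at a Group.
// Returns an error message, or null when the owner is acceptable.
function checkOwnerRef(owner: string): string | null {
  const parsed = parseFullRef(owner);
  if (parsed === null) {
    return `owner "${owner}" is not a full ref (expected group:<namespace>/<name>)`;
  }
  if (parsed.kind !== "group") {
    return `owner "${owner}" must reference a Group, not a ${parsed.kind}`;
  }
  return null;
}
```

This checks only the shape of the ref. Whether the Group actually exists is a separate question that needs the live catalog.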

Why Group and not User? Three reasons. Individuals leave. The catalog grows orphaned entries. Permissions get harder to reason about when they’re per-user.

Some teams want “I, personally, own this prototype service”. Fine. Make a Group called lab-firstname and own the prototype with that. Now the prototype is part of the catalog’s normal ownership graph, and when the engineer leaves, the platform team’s offboarding script handles lab-* groups uniformly.

Validate ownership refs in CI. The catalog will accept dangling refs silently:

# .github/workflows/catalog-lint.yml
name: Catalog Lint
on:
  pull_request:
    paths: ['**/catalog-info.yaml']
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 22 }
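      # @backstage/cli is versioned independently of the Backstage release line,
      # so pin the CLI version that ships with your deployed release here.
      - run: npm install -g @backstage/cli
      # Assumption: this step validates the entity file. If `repo lint` only
      # covers JS/TS sources in your setup, swap in a dedicated entity validator.
      - run: backstage-cli repo lint catalog-info.yaml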
      - name: Verify owner exists in catalog
        run: |
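          # Read the owner ref, e.g. group:default/team-payments
          OWNER=$(yq '.spec.owner' catalog-info.yaml)
          # by-name expects kind/namespace/name, so turn the ":" into "/"
          curl -sf -H "Authorization: Bearer $BACKSTAGE_TOKEN" \
            "https://backstage.acme.internal/api/catalog/entities/by-name/${OWNER//://}" \
            > /dev/null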
        env:
          BACKSTAGE_TOKEN: ${{ secrets.BACKSTAGE_CATALOG_TOKEN }}

If the owner Group doesn’t exist in the live catalog, the PR fails. Catches every typo, every renamed team, every stale reference.
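If you would rather do the lookup from a script than from shell parameter expansion, the only subtle part is turning a ref into the by-name path, which has the shape kind/namespace/name. A minimal sketch; byNameUrl is a hypothetical helper and the base URL is the same placeholder host used in the workflow above:

```typescript
// Builds the catalog by-name lookup URL for a full entity ref (kind:namespace/name).
// Throws on shorthand refs so that a malformed owner fails loudly instead of
// producing a URL that returns 404 for the wrong reason.
function byNameUrl(baseUrl: string, ref: string): string {
  const match = /^([^:/]+):([^:/]+)\/([^:/]+)$/.exec(ref);
  if (match === null) {
    throw new Error(`not a full entity ref: ${ref}`);
  }
  const [, kind, namespace, name] = match;
  const path = [kind.toLowerCase(), namespace, name]
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  // Strip trailing slashes from the base so the result never contains "//api".
  return `${baseUrl.replace(/\/+$/, "")}/api/catalog/entities/by-name/${path}`;
}
```

A GET against the resulting URL with the catalog token returns 200 when the Group exists and 404 when it does not, which is all the CI check needs.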

3. Files in Repos, Not Buttons in UI

Disable the “Register Existing Component” button in the UI for everyone except the platform team. The reason: every entity should be owned by a catalog-info.yaml file in the repo of the thing it describes, and discovered automatically. Registrations made through the UI are stored only in the catalog database, so nothing in Git records who registered what, when, or why.

# app-config.yaml
catalog:
  rules:
    - allow: [Component, System, API, Resource, Location, Domain, Group, User]
  locations: []   # Empty. Use providers, not static locations.
  providers:
    github:
      providerId:
        organization: acme-engineering
        catalogPath: /catalog-info.yaml
        filters:
          branch: main
          repository: '.*'
        schedule:
          frequency: { minutes: 30 }
          timeout: { minutes: 10 }

The empty locations list combined with the GitHub provider means every entity in the catalog came from a repo. No exceptions. The discovery provider scans every repo for catalog-info.yaml and ingests them on a thirty-minute cycle.

If you have a strong reason to register a one-off entity via the UI (say, a third-party SaaS that doesn’t have a repo), put the YAML for it in a dedicated meta-repo:

acme-engineering/catalog-meta/
+- third-party/
|  +- datadog.yaml          # Resource: external monitoring
|  +- snowflake.yaml        # Resource: warehouse
+- groups/
|  +- team-payments.yaml    # Group: managed manually
+- systems/
   +- legacy-billing.yaml   # System: pre-migration legacy

Now even third-party entries have file-based source of truth, version history, and PR-based review.

4. Relations and the Graph

The catalog’s killer feature isn’t the list view. It’s the graph. Relations let you ask “what’s downstream of this database?” or “which APIs does my team own?”. Get the relation declarations right and queries become trivial.

The relations that pay back are:

providesApis / consumesApis on Component
dependsOn / dependencyOf between any two entities
partOf from Component to System, and System to Domain
subComponentOf for cases where a component is part of a larger component

Example of a component with dense relations:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-frontend
  namespace: default
spec:
  type: website
  lifecycle: production
  owner: group:default/team-checkout
  system: system:default/checkout
  consumesApis:
    - api:default/payments-api-v2
    - api:default/cart-api-v1
  dependsOn:
    - resource:default/cdn-cloudfront

The graph then lets the entity page render an interactive dependency view. More importantly, it powers programmatic queries via the catalog API. Want to find every component owned by a team that consumes a deprecated API? One catalog query.
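To make the "deprecated API" question concrete, here is a sketch that answers it over a list of entities already fetched from the catalog. It assumes the entities carry the relations array the catalog API returns, with the relation types ownedBy and consumesApi; the interfaces and the componentsOnDeprecatedApis function are hypothetical and trimmed to the fields used.

```typescript
interface Relation {
  type: string;
  targetRef: string;
}

interface CatalogEntity {
  kind: string;
  metadata: { name: string; namespace?: string };
  spec?: { lifecycle?: string };
  relations?: Relation[];
}

// Full ref for an entity, e.g. api:default/payments-api-v1.
function refOf(entity: CatalogEntity): string {
  const namespace = entity.metadata.namespace ?? "default";
  return `${entity.kind.toLowerCase()}:${namespace}/${entity.metadata.name}`;
}

function hasRelation(entity: CatalogEntity, type: string, targets: string[]): boolean {
  return (entity.relations ?? []).some(
    (r) => r.type === type && targets.indexOf(r.targetRef) !== -1,
  );
}

// Components owned by `ownerRef` that consume at least one deprecated API.
function componentsOnDeprecatedApis(entities: CatalogEntity[], ownerRef: string): string[] {
  const deprecatedApis = entities
    .filter((e) => e.kind === "API" && e.spec?.lifecycle === "deprecated")
    .map(refOf);
  return entities
    .filter((e) => e.kind === "Component")
    .filter((e) => hasRelation(e, "ownedBy", [ownerRef]))
    .filter((e) => hasRelation(e, "consumesApi", deprecatedApis))
    .map(refOf);
}
```

Against a large catalog you would push the kind and owner conditions into the API's filter parameter rather than filtering in memory; the in-memory version just shows which fields the question depends on.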

Be honest with dependsOn. The temptation is to declare every dependency, including transitive ones, which produces a graph so dense it’s useless. Declare only direct dependencies. If the Postgres database is reached through a connection pool service, depend on the pool service, not the database. The database is reachable through the pool’s own dependsOn.
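The reason direct-only declarations are enough is that transitive reach can always be computed from them. A sketch of that traversal, assuming the dependsOn edges have been collected into a plain map keyed by entity ref (DependencyGraph and reachableDependencies are hypothetical names, and the refs in the usage note are made up):

```typescript
// Direct dependsOn edges only, keyed by entity ref.
type DependencyGraph = Record<string, string[]>;

// Everything reachable from `start` by following dependsOn edges, direct and
// transitive, excluding `start` itself. Cycles are handled via the `seen` map.
function reachableDependencies(graph: DependencyGraph, start: string): string[] {
  const seen: Record<string, boolean> = {};
  const found: string[] = [];
  const stack: string[] = (graph[start] ?? []).slice();
  while (stack.length > 0) {
    const ref = stack.pop() as string;
    if (ref === start || seen[ref]) continue;
    seen[ref] = true;
    found.push(ref);
    for (const next of graph[ref] ?? []) {
      stack.push(next);
    }
  }
  return found.sort();
}
```

With a graph where component:default/checkout-api depends on component:default/pgbouncer, and the pool depends on resource:default/orders-postgres, asking for the dependencies of checkout-api returns both the pool and the database even though only one edge was declared on the component.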

5. The Refresh Story

At a thousand entities, refresh cost matters. Each refresh re-validates every entity, recomputes relations, and re-evaluates filters. The default processing interval, which revisits each entity roughly every 100 to 150 seconds, is fine for small catalogs and a disaster for large ones.

Tune the engine in app-config.yaml:

catalog:
  processingInterval: { minutes: 30 }
  orphanStrategy: delete
  stitchingStrategy:
    mode: deferred

The stitchingStrategy: deferred mode batches relation re-stitching rather than doing it per-entity. For three thousand entities this drops the processor’s Postgres load by roughly 60 percent.

For the provider side, stagger schedules. Don’t have all providers run at minute zero:

catalog:
  providers:
    githubOrg:
      production:
        schedule:
          frequency: { hours: 1 }
          timeout: { minutes: 15 }
          initialDelay: { minutes: 0 }
    github:
      production:
        schedule:
          frequency: { minutes: 30 }
          timeout: { minutes: 10 }
          initialDelay: { minutes: 5 }
    aws:
      production:
        schedule:
          frequency: { hours: 6 }
          timeout: { minutes: 30 }
          initialDelay: { minutes: 10 }

The initialDelay spreads the cold-start load across the first ten minutes after the pod boots. Without it, every provider hammers the network and the DB simultaneously and the catalog UI is unresponsive for the first minute.

+---------------+  scan @ :00  +----+
| Org provider  +------------->+ DB |
+---------------+              |    |
+---------------+  scan @ :05  |    |
| Repo provider +------------->+    |
+---------------+              |    |
+---------------+  scan @ :10  |    |
| AWS provider  +------------->+    |
+---------------+              +----+

6. Lifecycle and Deprecation

Every Component and API has a spec.lifecycle field. The conventional values are experimental, production, deprecated. Use them honestly. A lifecycle: production service that hasn’t been deployed in a year is a lie, and the lie compounds.

Build a recurring report that flags entities where the declared lifecycle disagrees with reality. The signal sources:

production Components with no deploys in 90 days -> probably deprecated
production Components with deploys but no Kubernetes pods -> deleted, but entity remains
experimental Components older than 180 days -> either promote or remove

A custom backend module runs this nightly and posts to a Slack channel for the platform team. It’s the cheapest catalog hygiene investment that exists.
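The classification at the heart of that report is small. A sketch, assuming the deploy, pod, and age signals have already been gathered into one record per Component by whatever sources you have; LifecycleSignals and lifecycleFinding are hypothetical names, and the thresholds are the ones listed above:

```typescript
interface LifecycleSignals {
  ref: string;
  lifecycle: "experimental" | "production" | "deprecated";
  daysSinceLastDeploy: number | null; // null when no deploy has ever been recorded
  runningPods: number;
  ageInDays: number;
}

// Returns a human-readable finding when the declared lifecycle disagrees with
// the observed signals, or null when nothing looks off.
function lifecycleFinding(s: LifecycleSignals): string | null {
  if (s.lifecycle === "production") {
    if (s.daysSinceLastDeploy === null || s.daysSinceLastDeploy > 90) {
      return `${s.ref}: production but no deploys in 90 days, probably deprecated`;
    }
    if (s.runningPods === 0) {
      return `${s.ref}: production with recent deploys but no running pods, entity may outlive the service`;
    }
  }
  if (s.lifecycle === "experimental" && s.ageInDays > 180) {
    return `${s.ref}: experimental for more than 180 days, promote or remove`;
  }
  return null;
}
```

The nightly job then reduces to mapping this over all Components, dropping the nulls, and posting what remains.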

For actual deprecation, set spec.lifecycle: deprecated and add a backstage.io/deprecated-since annotation (a custom annotation, not one Backstage defines). The entity page should render a banner. A small component on the entity page does that:

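// packages/app/src/components/catalog/EntityPage.tsx
// Assumes the app still uses Material UI v4 for its components.
import React from 'react';
import Alert from '@material-ui/lab/Alert';
import { useEntity } from '@backstage/plugin-catalog-react';

// Renders a warning banner for entities whose lifecycle is "deprecated".
const DeprecationBanner = () => {
  const { entity } = useEntity();
  if (entity.spec?.lifecycle !== 'deprecated') return null;
  // Custom annotation; omit the date when it is missing.
  const since = entity.metadata.annotations?.['backstage.io/deprecated-since'];
  return (
    <Alert severity="warning">
      This entity is deprecated{since ? ` as of ${since}` : ''}.
    </Alert>
  );
};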

Drop it into the entity layout above the tabs. Now every user sees the warning before they invest time.

7. The Anti-Patterns

A short list of catalog design choices I’ve seen repeatedly and would not repeat.

Custom kinds for every domain. Some teams invent kind: MicroService, kind: Database, kind: Frontend. This produces a catalog where queries that worked for one kind don’t generalize. Use Component with spec.type.

The catalog as a CMDB. The catalog is for things engineering owns and operates. Don’t ingest your laptops, your conference rooms, or your AWS accounts as entities. The model breaks down. Run a real CMDB for those.

Repos that hold several entities. A repo with three services in it might want to declare three catalog-info.yaml files. Backstage supports this; each entity can point at its own subdirectory through the backstage.io/source-location annotation:

metadata:
  annotations:
    backstage.io/source-location: url:https://github.com/acme/monorepo/blob/main/services/payments/

But the operational story is worse. Refresh fanout multiplies. Ownership gets muddled. Just split the repo, or live with the monorepo and accept that the catalog will model the monorepo as one entity. Don’t try to have it both ways.

Implicit ownership via path. “Everything under services/payments/* is owned by team-payments.” Tempting, painful. Make the ownership explicit in YAML. Code searches and grep go from “look in this magic config” to “search the catalog file directly”.

Common Pitfalls

Mixing namespaces randomly. The default namespace: default is fine. If you start using namespaces for environments (namespace: prod and namespace: staging), you’ve doubled the entity count and broken everyone’s mental model. Use namespaces for true tenancy (separate orgs sharing the portal), not environments.
Letting system be free-form text. A typo like system: chekout creates a dangling reference and the entity floats as an orphan in the graph. Lint that system resolves to an existing System entity, and fail the PR build when it doesn’t.
No deletion strategy. Set catalog.orphanStrategy: delete so entities whose source disappears get removed. Otherwise deleting a repo leaves the entity in the catalog forever, with a broken link to the source file.
Refresh cycles too aggressive. Past a few hundred entities, scanning every five minutes is wasteful. Half-hour or hour cycles are plenty for code-located YAML that only changes via PR.
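The dangling system reference pitfall is the easiest of these to catch mechanically. A sketch, assuming all entities have been parsed into memory first; EntityLike, toSystemRef, and danglingSystemRefs are hypothetical names, and bare values are resolved against the entity's own namespace:

```typescript
interface EntityLike {
  kind: string;
  metadata: { name: string; namespace?: string };
  spec?: { system?: string };
}

// Expands a spec.system value to a full ref: "checkout" -> "system:default/checkout".
function toSystemRef(value: string, namespace: string): string {
  if (value.indexOf(":") !== -1) return value.toLowerCase();
  if (value.indexOf("/") !== -1) return `system:${value}`.toLowerCase();
  return `system:${namespace}/${value}`.toLowerCase();
}

// Returns one message per entity whose spec.system does not match any System entity.
function danglingSystemRefs(entities: EntityLike[]): string[] {
  const knownSystems = entities
    .filter((e) => e.kind === "System")
    .map((e) => toSystemRef(e.metadata.name, e.metadata.namespace ?? "default"));
  const problems: string[] = [];
  for (const entity of entities) {
    const system = entity.spec?.system;
    if (system === undefined) continue;
    const ref = toSystemRef(system, entity.metadata.namespace ?? "default");
    if (knownSystems.indexOf(ref) === -1) {
      problems.push(`${entity.metadata.name}: system "${system}" does not resolve to a System entity`);
    }
  }
  return problems;
}
```

Failing the PR when the returned list is non-empty turns a silent orphan into a review comment.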

Troubleshooting

Entity has invalid spec on PR lint but works in the UI. Local lint uses the schema bundled with the CLI version you installed. The running portal may have schema modifications loaded via plugins. Pin the CLI version to match the deployed portal version.
Graph view shows orphan entities. Usually means a dependsOn or partOf ref doesn’t resolve. Hit the catalog API at /api/catalog/entities/by-name/<kind>/<namespace>/<name> for each declared relation. The 404 will point at the missing entity.
Catalog UI slow above a thousand entities. The default UI does client-side filtering. Switch to server-side filtering and pagination; in recent Backstage releases that is the pagination prop on CatalogIndexPage, so check the prop names against the version you run. Cuts the initial payload by 10x.

Wrapping Up

A catalog is a long-lived dataset. The decisions you make in the first month outlive any one platform engineer’s tenure. Standardize on the seven default kinds, force Group ownership, ingest only from files in repos, and tune the refresh cycles for your scale. Everything else is variation on these themes.

The catalog reference docs cover the entity schema and every annotation in detail. The final post in this series gets into measuring developer experience with DORA and SPACE metrics.
