Designing a Service Catalog Developers Actually Use

January 17, 2024 · 7 min read · by Muhammad Amal

TL;DR: A catalog rots when it’s optional. Make every service undeployable from CI if it’s missing from the catalog. Then prune ruthlessly and automate ownership inference. A 200-entry catalog with 20% stale is worse than a 50-entry catalog that’s correct.

Anyone who’s stood up a service catalog has seen the lifecycle. Month 1: shiny new portal, everyone loves it. Month 6: half the services have owner: unknown. Month 12: nobody trusts it; people grep GitHub instead. We covered the Backstage build-out earlier this month. The catalog is the part of that build-out most people get wrong.

This isn’t a Backstage-specific problem. Cortex, Port, OpsLevel, and the in-house Confluence catalogs all rot in the same way. The cause is the same in every case: the catalog is treated as a place where humans put data, instead of a place that data flows into automatically.

The two failure modes

I’ve audited maybe a dozen internal catalogs over the years. They fail in two distinct ways.

Failure mode 1: too thin. The catalog has every service as a row, but the only metadata is name, owner, and a description copy-pasted from the README. No dependencies, no API specs, no SLOs, no on-call info. Developers look at it, get nothing useful, stop looking.

Failure mode 2: too thick. The catalog has 27 fields per entity. Half are empty for most services. Engineers asked to fill them in resent it. Compliance bolted on a “data classification” field nobody updates. The catalog drowns in its own metadata model.

The fix for both is the same: design the catalog around the question “what does someone actually need to know about this service in the next 60 seconds?” If a field doesn’t answer that, defer it.

The minimum viable entity

For a 90-day rollout, my answer is:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: orders-api
  description: "Order placement and status API for the storefront"
  annotations:
    github.com/project-slug: acme/orders-api
    backstage.io/kubernetes-id: orders-api
    pagerduty.com/service-id: PXXXXX
    grafana/dashboard-selector: "orders-api"  # matches dashboards tagged orders-api
spec:
  type: service
  lifecycle: production
  owner: group:default/orders-team
  system: orders
  providesApis:
    - orders-api-v1
  consumesApis:
    - users-api-v1
    - inventory-api-v1
  dependsOn:
    - resource:default/orders-db

That’s enough to answer the 60-second questions:

Who owns it? owner plus the linked PagerDuty service.
What does it talk to? consumesApis and dependsOn.
Where’s the code? GitHub annotation.
Where’s it deployed? Kubernetes annotation.
Is it live or experimental? lifecycle.

Skip the data classification field. Skip the cost center. Skip the SLA tier. Add them later when there’s a real user with a question those fields would answer.
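
One way to keep that minimum honest is to check each entity against the five questions directly. The following is a minimal sketch, not Backstage’s own validation: the MinimalEntity shape and the unansweredQuestions helper are hypothetical names invented for illustration, and only the annotation keys come from the example above.

```typescript
// Hypothetical, trimmed-down entity shape; not the full Backstage Entity type.
interface MinimalSpec {
  owner?: string;
  lifecycle?: string;
  consumesApis?: string[];
  dependsOn?: string[];
}

interface MinimalEntity {
  metadata: { name: string; annotations?: Record<string, string> };
  spec?: MinimalSpec;
}

// Returns the 60-second questions this entity cannot answer yet.
function unansweredQuestions(entity: MinimalEntity): string[] {
  const annotations: Record<string, string> = entity.metadata.annotations ?? {};
  const spec: MinimalSpec = entity.spec ?? {};
  const gaps: string[] = [];

  if (!spec.owner) gaps.push("Who owns it?");
  if (!annotations["github.com/project-slug"]) gaps.push("Where's the code?");
  if (!annotations["backstage.io/kubernetes-id"]) gaps.push("Where's it deployed?");
  if (!spec.lifecycle) gaps.push("Is it live or experimental?");
  // An empty list is a valid answer ("talks to nothing"); a missing field is not.
  if (spec.consumesApis === undefined && spec.dependsOn === undefined) {
    gaps.push("What does it talk to?");
  }
  return gaps;
}

const ordersApi: MinimalEntity = {
  metadata: {
    name: "orders-api",
    annotations: {
      "github.com/project-slug": "acme/orders-api",
      "backstage.io/kubernetes-id": "orders-api",
    },
  },
  spec: {
    owner: "group:default/orders-team",
    lifecycle: "production",
    consumesApis: ["users-api-v1", "inventory-api-v1"],
    dependsOn: ["resource:default/orders-db"],
  },
};

console.log(unansweredQuestions(ordersApi));
console.log(unansweredQuestions({ metadata: { name: "mystery-service" } }));
```

Reporting the unanswered question instead of the missing field name keeps the error message tied to why the field exists in the first place.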

The ownership problem

Ownership is the single most important field. It’s also the one that rots fastest.

People leave teams. Teams reorg. A service hits production and the team that built it is dissolved six months later. The catalog still says owner: foo-team. Foo-team doesn’t exist. Now what?

A few patterns that work:

Owners are groups, not individuals. A Group entity that maps to a Slack channel or rotation. The group can be rosterless and the rotation source-of-truth lives in PagerDuty or similar.
Group membership is sourced from your HRIS or identity provider. Backstage’s GitHub org integration pulls this for free. Workday integrations exist for the enterprise crowd.
An orphaned-ownership alert. A nightly job that finds Components whose owner group has no members and pings the platform team’s Slack. A dead owner is a tracked incident.

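// A sketch of a catalog validator; runs in CI on PRs touching catalog-info.yaml.
// Assumes CATALOG_URL holds the base URL of the Backstage catalog API.
import { CatalogClient } from '@backstage/catalog-client';

const catalog = new CatalogClient({
  discoveryApi: { getBaseUrl: async () => process.env.CATALOG_URL ?? '' },
});

// Throws if the owner ref (e.g. group:default/orders-team) is missing or empty.
const validateOwner = async (owner: string) => {
  const group = await catalog.getEntityByRef(owner);
  if (!group) {
    throw new Error(`Owner ${owner} does not exist`);
  }
  // Membership can be declared on the Group or on each User (memberOf),
  // so count the resolved hasMember relations rather than spec.members.
  const members = (group.relations ?? []).filter(r => r.type === 'hasMember');
  if (members.length === 0) {
    throw new Error(`Owner ${owner} has no members`);
  }
};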

That validator runs on PRs and prevents new orphaned entries. Combine it with the nightly scan that catches drift on existing entries and ownership stays healthy.
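
The nightly scan is the same check run over the whole catalog instead of one PR. Here is a minimal sketch under stated assumptions: the ComponentRecord and GroupRecord shapes and the findOrphans function are hypothetical, and the records are passed in as plain arrays where a real job would page through the catalog API.

```typescript
// Hypothetical in-memory shapes for illustration only.
interface GroupRecord {
  ref: string;         // e.g. "group:default/orders-team"
  memberCount: number; // resolved from your identity provider sync
}

interface ComponentRecord {
  ref: string;   // e.g. "component:default/orders-api"
  owner: string; // owner group ref
}

interface OrphanReport {
  component: string;
  owner: string;
  reason: "missing-group" | "empty-group";
}

// Finds Components whose owner group is gone or has no members left.
function findOrphans(components: ComponentRecord[], groups: GroupRecord[]): OrphanReport[] {
  const groupsByRef = new Map<string, GroupRecord>();
  for (const g of groups) groupsByRef.set(g.ref, g);

  const orphans: OrphanReport[] = [];
  for (const c of components) {
    const group = groupsByRef.get(c.owner);
    if (!group) {
      orphans.push({ component: c.ref, owner: c.owner, reason: "missing-group" });
    } else if (group.memberCount === 0) {
      orphans.push({ component: c.ref, owner: c.owner, reason: "empty-group" });
    }
  }
  return orphans;
}

const orphanReports = findOrphans(
  [
    { ref: "component:default/orders-api", owner: "group:default/orders-team" },
    { ref: "component:default/legacy-billing", owner: "group:default/foo-team" },
    { ref: "component:default/promo-engine", owner: "group:default/growth" },
  ],
  [
    { ref: "group:default/orders-team", memberCount: 6 },
    { ref: "group:default/growth", memberCount: 0 },
  ],
);

// Each report becomes one line in the platform team's Slack message.
for (const o of orphanReports) {
  console.log(`${o.component} is orphaned (${o.reason}: ${o.owner})`);
}
```

Separating "missing-group" from "empty-group" is a design choice: the first usually means a reorg deleted the team, the second that people left without a handover, and the follow-up differs.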

Making the catalog mandatory

A catalog that’s optional is a catalog that rots. The non-negotiable principle: a service cannot reach production without a valid catalog entry.

Concrete enforcement points:

CI checks catalog-info.yaml exists in every repo and validates against the schema.
The deployment pipeline refuses to deploy any image whose source repo isn’t registered in the catalog.
Argo CD or your delivery tool refuses to sync Applications that don’t reference a registered Component.

The third one is the strongest. We added a custom validating admission webhook in front of Argo CD that looked up the target Component in the catalog and rejected the sync if the Component was missing or had an invalid owner. Six weeks later, catalog completeness was at 100% for production services. Engineers stopped treating it as optional because the system stopped letting them.
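
The decision such a webhook makes is small enough to sketch. This is an illustration of the rule only, not the webhook we ran: admitSync, CatalogComponent, and the way the Component ref reaches the function are all hypothetical, and the Kubernetes admission plumbing is left out.

```typescript
// Hypothetical, simplified view of what the check needs from the catalog.
interface CatalogComponent {
  ref: string;       // e.g. "component:default/orders-api"
  ownerRef?: string; // e.g. "group:default/orders-team"
}

type Decision = { allowed: true } | { allowed: false; reason: string };

// componentRef is whatever label or annotation your Applications use to name
// their Component; liveOwners is the set of owner groups that still have members.
function admitSync(
  componentRef: string | undefined,
  components: Map<string, CatalogComponent>,
  liveOwners: Set<string>,
): Decision {
  if (!componentRef) {
    return { allowed: false, reason: "Application does not reference a Component" };
  }
  const component = components.get(componentRef);
  if (!component) {
    return { allowed: false, reason: `${componentRef} is not registered in the catalog` };
  }
  if (!component.ownerRef || !liveOwners.has(component.ownerRef)) {
    return { allowed: false, reason: `${componentRef} has no valid owner` };
  }
  return { allowed: true };
}

const registeredComponents = new Map<string, CatalogComponent>([
  [
    "component:default/orders-api",
    { ref: "component:default/orders-api", ownerRef: "group:default/orders-team" },
  ],
]);
const liveOwners = new Set<string>(["group:default/orders-team"]);

console.log(admitSync("component:default/orders-api", registeredComponents, liveOwners));
console.log(admitSync("component:default/ghost-api", registeredComponents, liveOwners));
```

Returning a reason with every rejection matters more than it looks: the engineer who hits the wall needs to know which catalog fix unblocks the deploy.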

Some teams will resist this with “but it’s just metadata.” It’s not just metadata. It’s the directory you call when something is on fire at 2am. Make it mandatory.

Automating what you can

The flip side of mandatory is: don’t make humans type things a machine could figure out.

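Auto-discover from GitHub. Backstage’s GitHub discovery provider can scan an organization’s repos and register every catalog-info.yaml it finds (the githubOrg provider is the one that ingests teams and users, not repos). Pair discovery with a required catalog-info.yaml and a custom provider that emits stub Components for repos without one, and you get the best of both: stubs catch missing entries, but real ownership data comes from the repo.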
Auto-fill the Kubernetes annotation. If your CI tags images with the component name, you can have a controller annotate workloads on apply.
Auto-discover APIs. Your CI publishes OpenAPI specs to a known location. A Backstage processor ingests them and creates API entities.
Auto-detect dependencies. Tools like Cortex Scorecards or custom processors that read package.json, go.mod, or trace data from OpenTelemetry can populate dependsOn automatically.

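# app-config.yaml (Backstage 1.21): GitHub discovery picking up API descriptors.
# Assumes each repo keeps a kind: API descriptor at /.backstage/api.yaml whose
# spec.definition points at the OpenAPI file CI publishes (for example via $text).
catalog:
  providers:
    github:
      acme:
        organization: acme
        catalogPath: '/.backstage/api.yaml'
        schedule:
          frequency: { hours: 6 }
          timeout: { minutes: 3 }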

Every minute an engineer saves typing a field that doesn’t add information is a minute the catalog stays current.
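
Dependency detection from a manifest can start very simply. The sketch below reads package.json and maps known client packages to catalog refs; inferDependsOn and the package-to-entity mapping are hypothetical, and a real custom processor would also need to handle go.mod or trace data.

```typescript
// Infers catalog refs from the dependencies listed in a package.json.
// packageToEntity is a hypothetical lookup you would maintain or generate,
// mapping internal client packages to the Components they talk to.
function inferDependsOn(
  packageJson: string,
  packageToEntity: Record<string, string>,
): string[] {
  const manifest = JSON.parse(packageJson) as {
    dependencies?: Record<string, string>;
  };
  const refs = Object.keys(manifest.dependencies ?? {})
    .filter(pkg => Object.prototype.hasOwnProperty.call(packageToEntity, pkg))
    .map(pkg => packageToEntity[pkg]);
  // De-duplicate and sort so repeated runs produce a stable dependsOn list.
  return Array.from(new Set(refs)).sort();
}

const ordersManifest = JSON.stringify({
  name: "orders-api",
  dependencies: {
    "@acme/users-client": "^2.1.0",
    "@acme/inventory-client": "^1.4.2",
    lodash: "^4.17.21",
  },
});

const inferredRefs = inferDependsOn(ordersManifest, {
  "@acme/users-client": "component:default/users-api",
  "@acme/inventory-client": "component:default/inventory-api",
});

console.log(inferredRefs);
```

Third-party packages like lodash fall through the mapping untouched, which is the point: only dependencies that correspond to something in the catalog become relations.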

The relations graph

Entity relations — providesApis, consumesApis, dependsOn, partOf — are the catalog’s most underrated feature. They’re what turn a list into a graph.

The graph view in Backstage answers questions a flat catalog can’t:

“If I deprecate the users-api-v1, which services break?”
“What does orders-api need to be healthy?”
“Which systems consume our payments domain?”

Most teams populate owner and stop. Spend the marginal effort to populate API relations on day one. It pays for itself the first time someone asks the deprecation question.
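
The deprecation question is a graph walk over those relations. A minimal sketch, assuming the relations have already been exported into plain objects: ServiceNode and impactedBy are hypothetical names, and following consumers transitively is my assumption about what "break" means here. If you only care about direct consumers, stop after the first hop.

```typescript
// Hypothetical flattened view of Components and their API relations.
interface ServiceNode {
  name: string;
  providesApis: string[];
  consumesApis: string[];
}

// Services affected if `api` goes away: its direct consumers, plus anything
// that consumes an API those consumers provide, and so on.
function impactedBy(api: string, services: ServiceNode[]): string[] {
  const impacted = new Set<string>();
  const seenApis = new Set<string>([api]);
  const queue: string[] = [api];

  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const service of services) {
      if (!service.consumesApis.includes(current) || impacted.has(service.name)) {
        continue;
      }
      impacted.add(service.name);
      for (const provided of service.providesApis) {
        if (!seenApis.has(provided)) {
          seenApis.add(provided);
          queue.push(provided);
        }
      }
    }
  }
  return Array.from(impacted).sort();
}

const serviceGraph: ServiceNode[] = [
  { name: "users-api", providesApis: ["users-api-v1"], consumesApis: [] },
  { name: "orders-api", providesApis: ["orders-api-v1"], consumesApis: ["users-api-v1", "inventory-api-v1"] },
  { name: "storefront", providesApis: [], consumesApis: ["orders-api-v1"] },
  { name: "inventory-api", providesApis: ["inventory-api-v1"], consumesApis: [] },
];

console.log(impactedBy("users-api-v1", serviceGraph));
```

None of this works if consumesApis is empty across the catalog, which is the argument for populating API relations on day one.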

Pruning matters

Catalogs accumulate cruft. A service is deprecated and the entry sticks around for a year. A pilot project lingers in lifecycle: experimental long after it was shut down.

Prune quarterly. The rules I use:

Anything with lifecycle: experimental older than 90 days gets a soft warning to the owner. Older than 180, archive.
Anything with lifecycle: deprecated and no traffic in 30 days gets archived.
Anything whose source repo is archived in GitHub gets auto-archived in the catalog.
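
Those three rules fit in one small function, which makes them easy to run as a scheduled job. This is a sketch: PruneInput and pruneAction are hypothetical, and where ageDays and daysSinceTraffic come from (catalog timestamps, your metrics backend) is left as an assumption.

```typescript
type Lifecycle = "experimental" | "production" | "deprecated";

// Hypothetical inputs; how you source them is up to your tooling.
interface PruneInput {
  lifecycle: Lifecycle;
  ageDays: number;          // days since the entry entered its current lifecycle
  daysSinceTraffic: number; // days since the service last saw traffic
  repoArchived: boolean;    // source repo archived in GitHub
}

type PruneAction = "keep" | "warn-owner" | "archive";

function pruneAction(entry: PruneInput): PruneAction {
  // An archived repo wins over every other rule.
  if (entry.repoArchived) return "archive";

  if (entry.lifecycle === "experimental") {
    if (entry.ageDays > 180) return "archive";
    if (entry.ageDays > 90) return "warn-owner";
  }
  if (entry.lifecycle === "deprecated" && entry.daysSinceTraffic >= 30) {
    return "archive";
  }
  return "keep";
}

console.log(pruneAction({ lifecycle: "experimental", ageDays: 120, daysSinceTraffic: 2, repoArchived: false }));
console.log(pruneAction({ lifecycle: "deprecated", ageDays: 400, daysSinceTraffic: 45, repoArchived: false }));
```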

A clean catalog is more trustworthy than a complete one.

Common Pitfalls

Manually-edited locations file. Don’t maintain a list of every service repo in one YAML. Use GitHub discovery to pick up a per-repo catalog-info.yaml instead.
A custom entity kind for every special case. Stick to Component, API, System, Group, Resource. Custom kinds are a maintenance tax that rarely earns out.
Treating Backstage as the source of truth for everything. PagerDuty owns rotations. Your HRIS owns people. Backstage links to those — it doesn’t replace them.
Skipping the catalog UI for engineers. If engineers don’t use the catalog UI to answer real questions, they won’t notice when it rots. Surface the catalog in places they already live — pull request templates, Slack /whoowns commands, the dashboard sidebar.
No history. When ownership transitions, store the previous owner with a date. The audit trail matters when investigating “who owned this six months ago?”
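
For the ownership history, an append-only list of dated changes is enough to answer the "six months ago" question. A minimal sketch: OwnershipChange and ownerAt are hypothetical, and where the history is stored (an annotation, a side table, git log) is an open choice.

```typescript
// One entry per ownership transition; `since` is an ISO date (YYYY-MM-DD).
interface OwnershipChange {
  owner: string;
  since: string;
}

// Returns who owned the service on `date`, or undefined if it predates the history.
function ownerAt(changes: OwnershipChange[], date: string): string | undefined {
  // ISO dates sort correctly as plain strings.
  const ordered = changes
    .slice()
    .sort((a, b) => (a.since < b.since ? -1 : a.since > b.since ? 1 : 0));

  let owner: string | undefined;
  for (const change of ordered) {
    if (change.since > date) break;
    owner = change.owner;
  }
  return owner;
}

const ordersOwnership: OwnershipChange[] = [
  { owner: "group:default/foo-team", since: "2023-01-10" },
  { owner: "group:default/orders-team", since: "2023-09-01" },
];

console.log(ownerAt(ordersOwnership, "2023-07-17"));
console.log(ownerAt(ordersOwnership, "2024-01-17"));
```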

The one I got wrong: I tried to model team-to-product hierarchies with both Domain and System entities. That gave us two parallel groupings. Engineers found it confusing and started ignoring both. I cut Domain entirely and stuck with System plus Group. Cleaner.

Wrapping Up

A service catalog that works is a catalog the rest of your tooling cannot bypass. Make it mandatory, automate what you can, prune what you must, and design for the 60-second questions. Skip the urge to model every dimension of metadata on day one. Add fields when you can name the user and the use case.

For deeper reading, the Backstage catalog model docs cover the full entity schema. Next post in the series tackles developer experience metrics — the dark art of measuring whether the platform is actually working.
