Federated GraphQL with Apollo Router, Patterns for 2024
July 3, 2024 · 7 min read · by Muhammad Amal

TL;DR — Federation 2 plus Apollo Router 1.48 is genuinely production-ready in 2024. The hard parts are entity ownership, query plan visibility, and getting auth right. The Rust router is fast enough that the bottleneck moves entirely to your subgraphs.

I have run federated GraphQL in three different companies now. The first one was on the JS gateway in 2020 and it was painful for reasons that don’t matter anymore. The current generation, built on Apollo Router with Federation 2 directives, is a different product. It is fast, observable, and the schema composition errors are finally actionable.

This post is what I would tell a team adopting federation today. Not the marketing version. The version with the operational scars.

I will not rehash the “what is federation” intro. If you need that, the Apollo docs cover it well. What I want to talk about is what changes when you actually run this in production with five or more subgraphs.

The subgraph contract

The first thing teams get wrong is treating subgraphs like microservices that happen to speak GraphQL. They’re not. A subgraph is a contributor to a single composed schema, and the schema is the product. Your subgraph team does not own Customer. They own the fields on Customer that their service is authoritative for.

The mental model that works:

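  • Each entity has exactly one owner subgraph: the one that is authoritative for its core fields.
  • Other subgraphs extend that entity with their own fields. They also declare @key, because the router needs a way to look the entity up in every subgraph that contributes to it.
  • The router stitches it together via the query plan.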
# customers subgraph - owns Customer
type Customer @key(fields: "id") {
  id: ID!
  email: String!
  createdAt: DateTime!
}

# billing subgraph - extends Customer
extend type Customer @key(fields: "id") {
  id: ID! @external
  invoices(first: Int = 10): InvoiceConnection!
  outstandingBalance: Money!
}

# support subgraph - extends Customer
extend type Customer @key(fields: "id") {
  id: ID! @external
  openTickets: [Ticket!]!
}

This composition gives you one logical Customer type at the gateway with fields contributed by three teams. The router calls each subgraph only for its fields. No subgraph needs to know about the others.

The discipline is in keeping each subgraph’s slice small. If your billing subgraph ends up resolving email because “it’s convenient”, you’ve broken the model. You now have two sources of truth and you will get inconsistent results during deploys.
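A check like this is cheap to run in review. What follows is a minimal sketch, not an Apollo tool: the per-subgraph field lists are hand-written, hypothetical data, and duplicateOwners is a name I made up for illustration. It flags any non-key field that more than one subgraph resolves.

```go
package main

import (
	"fmt"
	"sort"
)

// duplicateOwners returns, for every field resolved by more than one
// subgraph, the sorted list of subgraphs that resolve it. Key fields
// (such as "id") legitimately appear in every subgraph and are skipped.
func duplicateOwners(fieldsBySubgraph map[string][]string, keyFields map[string]bool) map[string][]string {
	owners := map[string][]string{}
	for subgraph, fields := range fieldsBySubgraph {
		for _, f := range fields {
			if keyFields[f] {
				continue
			}
			owners[f] = append(owners[f], subgraph)
		}
	}
	dupes := map[string][]string{}
	for field, subs := range owners {
		if len(subs) > 1 {
			sort.Strings(subs)
			dupes[field] = subs
		}
	}
	return dupes
}

func main() {
	// Hypothetical field lists for Customer, one entry per subgraph.
	customer := map[string][]string{
		"customers": {"id", "email", "createdAt"},
		"billing":   {"id", "invoices", "outstandingBalance", "email"}, // the "convenient" mistake
		"support":   {"id", "openTickets"},
	}
	fmt.Println(duplicateOwners(customer, map[string]bool{"id": true}))
	// prints map[email:[billing customers]]
}
```

In a real pipeline the field lists would come from parsing each subgraph schema; that part is left out here on purpose.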

Apollo Router configuration that actually matters

Router 1.48 has a lot of knobs. The ones that change production behavior:

# router.yaml
supergraph:
  listen: 0.0.0.0:4000
  introspection: false # disable in prod

traffic_shaping:
  all:
    timeout: 5s
    deduplicate_query: true
  subgraphs:
    billing:
      timeout: 3s
      compression: gzip

apq:
  router:
    cache:
      in_memory:
        limit: 512
      redis:
        urls: ["redis://redis:6379"]
        ttl: 24h

headers:
  all:
    request:
      - propagate:
          named: "authorization"
      - propagate:
          named: "x-request-id"

telemetry:
  instrumentation:
    spans:
      mode: spec_compliant
  exporters:
    tracing:
      otlp:
        enabled: true # exporters are off unless explicitly enabled
        endpoint: http://otel-collector:4317

A few things to call out. deduplicate_query: true means that if two concurrent requests send the same subgraph the same query with the same variables, the router collapses them into one subgraph call and shares the response. That is a free latency win and free load shedding. Turn it on.
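If the mechanism is unfamiliar, it is the classic single-flight pattern. The sketch below is not the router's implementation (that lives in Rust); it is a small Go illustration of the idea, and Deduper, Waiters and demo are names invented for this example. The key is assumed to be some combination of subgraph name, query and variables.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// call is one in-flight subgraph fetch that concurrent callers can share.
type call struct {
	done    chan struct{}
	waiters int
	val     string
	err     error
}

// Deduper collapses overlapping fetches that share a key.
type Deduper struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func NewDeduper() *Deduper { return &Deduper{inflight: map[string]*call{}} }

// Do runs fetch at most once per key among overlapping callers.
// shared is true when this caller reused another caller's fetch.
func (d *Deduper) Do(key string, fetch func() (string, error)) (val string, shared bool, err error) {
	d.mu.Lock()
	if c, ok := d.inflight[key]; ok {
		c.waiters++
		d.mu.Unlock()
		<-c.done // wait for the leader's result
		return c.val, true, c.err
	}
	c := &call{done: make(chan struct{})}
	d.inflight[key] = c
	d.mu.Unlock()

	c.val, c.err = fetch()

	d.mu.Lock()
	delete(d.inflight, key)
	d.mu.Unlock()
	close(c.done)
	return c.val, false, c.err
}

// Waiters reports how many callers are sharing the in-flight fetch for key.
func (d *Deduper) Waiters(key string) int {
	d.mu.Lock()
	defer d.mu.Unlock()
	if c, ok := d.inflight[key]; ok {
		return c.waiters
	}
	return 0
}

// demo fires n overlapping identical requests and returns how many
// times the fake subgraph was actually called.
func demo(n int) int32 {
	const key = "billing|query|vars"
	d := NewDeduper()
	var fetches int32
	release := make(chan struct{})
	fetch := func() (string, error) {
		atomic.AddInt32(&fetches, 1)
		<-release // hold the request open so the others overlap with it
		return "ok", nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Do(key, fetch)
		}()
	}
	for d.Waiters(key) < n-1 {
		runtime.Gosched() // wait until every other caller has joined
	}
	close(release)
	wg.Wait()
	return atomic.LoadInt32(&fetches)
}

func main() {
	fmt.Println("subgraph calls for 5 identical concurrent requests:", demo(5))
}
```

Five overlapping requests produce one subgraph call. Requests that do not overlap in time are not collapsed, which is why this is deduplication and not caching.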

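The APQ config with Redis is close to non-optional for any multi-instance router deployment. Without shared APQ storage, a hash registered on instance A is unknown to instance B, which answers with a PersistedQueryNotFound error until the client re-sends the full query. Clients that follow the APQ protocol retry automatically, so what you see is extra round trips and cache misses, not hard failures, but clients that do not retry will simply fail. You will not catch this in staging. You will catch it the first time you autoscale.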
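To make the failure mode concrete, here is a toy model of two router replicas. It is a sketch under assumptions: Instance, Store and Handle are invented names, and a plain Go map stands in for both the per-instance cache and Redis. The one protocol detail it relies on is that an APQ hash is the hex SHA-256 of the query text.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

var ErrPersistedQueryNotFound = errors.New("PersistedQueryNotFound")

// Store maps an APQ hash to the full query text.
type Store map[string]string

// apqHash is the hex SHA-256 of the query text, which is what APQ
// clients send as sha256Hash.
func apqHash(query string) string {
	sum := sha256.Sum256([]byte(query))
	return hex.EncodeToString(sum[:])
}

// Instance models one router replica and the cache it reads from.
type Instance struct{ cache Store }

// Handle resolves a request that carries a hash and, optionally, the
// full query text. An empty query means "hash only".
func (i Instance) Handle(hash, query string) (string, error) {
	if query != "" {
		if apqHash(query) != hash {
			return "", errors.New("hash does not match query")
		}
		i.cache[hash] = query // register for later hash-only requests
		return query, nil
	}
	q, ok := i.cache[hash]
	if !ok {
		return "", ErrPersistedQueryNotFound
	}
	return q, nil
}

func main() {
	query := `{ customer(id:"1") { email } }`
	hash := apqHash(query)

	// Separate in-memory caches: B has never seen the hash.
	a, b := Instance{Store{}}, Instance{Store{}}
	a.Handle(hash, query)
	_, err := b.Handle(hash, "")
	fmt.Println("separate caches:", err)

	// One shared store (the role Redis plays): B finds it.
	shared := Store{}
	a, b = Instance{shared}, Instance{shared}
	a.Handle(hash, query)
	_, err = b.Handle(hash, "")
	fmt.Println("shared cache:", err)
}
```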

Header propagation defaults to “nothing”, which means your subgraphs receive no auth context. You have to explicitly propagate authorization. I have debugged this enough times that I now put it in a template.

Auth, the part everyone gets wrong

There are two viable patterns for auth in federated GraphQL. Pick one and stick to it.

Pattern A, JWT at the router, claims at the subgraph. The router validates the JWT once, extracts claims, and forwards them as headers (x-user-id, x-user-roles) to subgraphs. Subgraphs trust those headers because they’re only reachable from the router on an internal network.

Pattern B, JWT passthrough. The router validates the JWT, then forwards it as-is. Each subgraph re-validates. Higher CPU cost, but no header-spoofing risk if a subgraph ever gets exposed.

I prefer Pattern A in 2024 because the operational cost of Pattern B is real and the threat model rarely justifies it. But you have to lock down your subgraph ingress. If billing.svc.cluster.local:4000 is reachable from anywhere except the router, you have a hole. Use a NetworkPolicy or a mesh.
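On the subgraph side, Pattern A comes down to a small piece of HTTP middleware. The following is a minimal sketch using only the standard library. The header names come from the pattern description above; the comma-separated roles format, the Identity type and the helper names are assumptions made for this example.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type ctxKey struct{}

// Identity is what resolvers read instead of parsing a JWT themselves.
type Identity struct {
	UserID string
	Roles  []string
}

// WithIdentity trusts x-user-id and x-user-roles as set by the router.
// This is only safe when nothing but the router can reach the subgraph.
func WithIdentity(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		userID := r.Header.Get("x-user-id")
		if userID == "" {
			http.Error(w, "missing identity headers", http.StatusUnauthorized)
			return
		}
		var roles []string
		// Assumption: roles arrive as a comma-separated list.
		for _, role := range strings.Split(r.Header.Get("x-user-roles"), ",") {
			if role = strings.TrimSpace(role); role != "" {
				roles = append(roles, role)
			}
		}
		ctx := context.WithValue(r.Context(), ctxKey{}, Identity{UserID: userID, Roles: roles})
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// IdentityFrom returns the identity stored by WithIdentity, if any.
func IdentityFrom(ctx context.Context) (Identity, bool) {
	id, ok := ctx.Value(ctxKey{}).(Identity)
	return id, ok
}

// probe sends one request with the given headers through the
// middleware and returns the status code and body.
func probe(headers map[string]string) (int, string) {
	h := WithIdentity(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id, _ := IdentityFrom(r.Context())
		fmt.Fprintf(w, "%s %v", id.UserID, id.Roles)
	}))
	req := httptest.NewRequest(http.MethodPost, "/query", nil)
	for k, v := range headers {
		req.Header.Set(k, v)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	fmt.Println(probe(map[string]string{"x-user-id": "u-1", "x-user-roles": "admin, billing"}))
	fmt.Println(probe(nil))
}
```

Rejecting requests that arrive without the identity headers is a deliberate choice here. It does nothing against a caller who can reach the subgraph and forge them, which is why the network lockdown is the real control.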

For the JWT validation itself in Go subgraphs, I covered the patterns in securing Go microservices with JWT. Same primitives apply here.

Query plans, the thing you must monitor

The single most underrated feature of Apollo Router is query plan export. Every query gets a plan that shows which subgraphs the router calls, in what order, with what fetches. If you don’t look at these in production, you are flying blind.

# assumes the router config enables the plugin: plugins: { experimental.expose_query_plan: true }
curl -X POST http://router:4000/ \
  -H 'apollo-expose-query-plan: true' \
  -H 'content-type: application/json' \
  -d '{"query":"{ customer(id:\"1\") { email outstandingBalance { amount } openTickets { id } } }"}'

The response includes an extensions.apolloQueryPlan block showing the fetch tree. Watch for:

  • Sequential fetches that should be parallel. Federation 2 parallelizes where it can, but if a field on billing requires data from customers, it serializes. Sometimes you can restructure the schema to break that dependency.
  • Repeated fetches to the same subgraph. Usually means your @key is wrong or you have entities scattered across subgraphs that should be in one.
  • Plans with more than 5-6 hops. The composition has gotten too deep. Time to refactor.
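The three checks above are easy to automate once you have the plan as JSON. This is a rough sketch, and the node shape it parses (kind, serviceName, nodes, node, with Sequence, Parallel, Flatten and Fetch kinds) is my assumption about the exported plan format; verify it against what your router version actually returns. Analyze and PlanStats are names made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PlanNode mirrors the small subset of the plan JSON this sketch reads.
type PlanNode struct {
	Kind        string      `json:"kind"`
	ServiceName string      `json:"serviceName,omitempty"`
	Nodes       []*PlanNode `json:"nodes,omitempty"`
	Node        *PlanNode   `json:"node,omitempty"`
}

// PlanStats holds the numbers worth alerting on.
type PlanStats struct {
	Fetches     int            // total subgraph fetches
	Depth       int            // longest chain of fetches that run one after another
	PerSubgraph map[string]int // fetch count per subgraph
}

func walk(n *PlanNode, perSubgraph map[string]int) (fetches, depth int) {
	if n == nil {
		return 0, 0
	}
	switch n.Kind {
	case "Fetch":
		perSubgraph[n.ServiceName]++
		return 1, 1
	case "Sequence": // children run one after another: depths add up
		for _, c := range n.Nodes {
			f, d := walk(c, perSubgraph)
			fetches += f
			depth += d
		}
		return fetches, depth
	case "Parallel": // children run together: the deepest branch wins
		for _, c := range n.Nodes {
			f, d := walk(c, perSubgraph)
			fetches += f
			if d > depth {
				depth = d
			}
		}
		return fetches, depth
	default: // QueryPlan, Flatten and similar wrappers hold a single child
		return walk(n.Node, perSubgraph)
	}
}

// Analyze parses a plan and computes its stats.
func Analyze(raw []byte) (PlanStats, error) {
	var root PlanNode
	if err := json.Unmarshal(raw, &root); err != nil {
		return PlanStats{}, err
	}
	stats := PlanStats{PerSubgraph: map[string]int{}}
	stats.Fetches, stats.Depth = walk(&root, stats.PerSubgraph)
	return stats, nil
}

func main() {
	// Hand-written plan: customers first, then billing and support in parallel.
	plan := []byte(`{"kind":"QueryPlan","node":{"kind":"Sequence","nodes":[
	  {"kind":"Fetch","serviceName":"customers"},
	  {"kind":"Parallel","nodes":[
	    {"kind":"Flatten","node":{"kind":"Fetch","serviceName":"billing"}},
	    {"kind":"Flatten","node":{"kind":"Fetch","serviceName":"support"}}
	  ]}
	]}}`)
	stats, err := Analyze(plan)
	fmt.Println(stats.Fetches, stats.Depth, stats.PerSubgraph, err)
}
```

A PerSubgraph count above 1 is the "repeated fetches" smell, and Depth is the number to compare against the 5-6 hop limit.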

I export query plan metrics to a Prometheus histogram tagged by operation name. The p99 on plan depth and total fetches will tell you when an innocuous schema change has caused a regression weeks before users notice.

Subgraph implementation in Go

For Go subgraphs, gqlgen with the federation plugin works well. The generator handles _entities resolver wiring, which is the boilerplate that used to be the worst part.

// in resolver.go
func (r *entityResolver) FindCustomerByID(ctx context.Context, id string) (*model.Customer, error) {
    return r.customerService.Get(ctx, id)
}

That single method is the entity lookup that the router calls when it needs to resolve fields on Customer from another subgraph. Keep it cheap. Keep it batched if you can: gqlgen’s federation plugin supports batched _entities resolution through its own @entityResolver(multi: true) directive, placed next to @key(fields: "id") on the type. It turns N lookups into one call, which is a meaningful win.

Common Pitfalls

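  • One entity, multiple owners. Declaring @key in several subgraphs is normal in Federation 2; the problem is two subgraphs both resolving the same non-key fields. Composition accepts this once the fields are marked @shareable, and from then on the router may ask either subgraph, so any drift between them shows up as inconsistent data. Pick one owner.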
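  • No persisted queries. Same warning as standalone GraphQL: without a safelist of persisted operations, your public endpoint accepts arbitrary queries. APQ on its own does not fix this, because it is a caching optimization and still accepts any query sent in full. Federation makes this worse because a malicious query can fan out to every subgraph.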
  • Subgraphs talking to each other. If your billing subgraph calls your customers subgraph directly, you’ve defeated federation. All cross-subgraph data flows through the router via @key lookups.
  • Forgetting @shareable on common value types. If two subgraphs both define a Money object type and you forget @shareable, composition fails. Federation 2 is stricter than 1.
  • Cold starts on subgraph deploys. Do not count on the router to paper over this. As far as I know it only retries subgraph requests if you configure it to, so if a subgraph takes 30s to warm up, you’ll see error spikes. Use readiness probes and keep one instance hot.
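For the cold-start pitfall, the subgraph side of a readiness probe is a few lines of Go. This is a minimal sketch with invented names (Readiness, MarkReady, a /readyz path) and it assumes Go 1.19 or later for atomic.Bool. The endpoint answers 503 until warm-up finishes, so the orchestrator keeps traffic away from the instance until then.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Readiness reports 503 until MarkReady is called.
type Readiness struct{ ready atomic.Bool }

// MarkReady is called once caches are warm and connections are open.
func (r *Readiness) MarkReady() { r.ready.Store(true) }

func (r *Readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if !r.ready.Load() {
		http.Error(w, "warming up", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprint(w, "ready")
}

// status probes a handler once and returns the HTTP status code.
func status(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	r := &Readiness{}
	fmt.Println("before warm-up:", status(r))
	r.MarkReady() // in a real subgraph: after pools and caches are primed
	fmt.Println("after warm-up:", status(r))
}
```

In a real service you would mount this with http.Handle("/readyz", r) next to the GraphQL handler and point the readiness probe at it.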

Wrapping Up

Federation in 2024 is a solved engineering problem. It is not a solved organizational problem, and the difference matters. The router will do its job. Your schema review process, your entity ownership model, your subgraph deployment cadence — those are what determine whether federation scales for your team.

My current rule of thumb: if you have three or more teams that need to contribute to a single API surface, federation pays for itself within six months. Below that, run a single GraphQL service or push the aggregation into a BFF. The operational overhead of running a router, managing supergraph composition in CI, and coordinating subgraph deploys is real, and it doesn’t make sense for two teams.

What’s next for me is the schema registry side — getting composition checks into CI properly with rover subgraph check so we catch breaking changes before merge. That’s where the next round of operational wins lives.