background-shape
Native Postgres Partitioning, Patterns That Hold at Scale
November 18, 2024 · 8 min read · by Muhammad Amal programming

TL;DR — Native declarative partitioning is good enough for 99% of workloads. Range by time for events, hash for write distribution, list for tenant isolation. Get the partition key right or pay forever. pg_partman handles the maintenance.

I resisted native Postgres partitioning until version 13. Inheritance-based partitioning was a nightmare, and the early native implementation was missing too many features to trust in production. Postgres 17 has settled that question. Declarative partitioning works, the planner is smart enough about pruning, and the operational patterns are well understood.

This post is the partitioning playbook I use today. It covers the three strategies (range, hash, list), partition pruning, foreign keys, the migration path from a non-partitioned table, and the operational issues that surface only at scale.

I’m going to be opinionated about partition sizing because that’s the single decision most teams get wrong. Too many partitions hurts the planner; too few defeats the point.

Pick the strategy first

Three strategies, three use cases.

Range partitioning is for time-series and any column with a natural ordering. created_at is the canonical example. Each partition holds a contiguous range.

CREATE TABLE events (
  id bigserial,
  occurred_at timestamptz NOT NULL,
  payload jsonb NOT NULL,
  PRIMARY KEY (id, occurred_at)
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2024_11 PARTITION OF events
  FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');

Note the primary key includes the partition column. That’s a Postgres requirement and the first place schemas fight you.

Hash partitioning is for distributing writes evenly when you don’t have a natural range column, or when you do but want even parallelism.

CREATE TABLE users (
  id bigint NOT NULL,
  email text NOT NULL,
  PRIMARY KEY (id)
) PARTITION BY HASH (id);

CREATE TABLE users_p0 PARTITION OF users FOR VALUES WITH (modulus 8, remainder 0);
-- ... users_p1 through users_p7

Hash gives you predictable spread but kills the ability to drop old data by dropping a partition. Don’t use it for retention-driven workloads.

List partitioning is for tenant isolation or any small fixed set of values.

CREATE TABLE orders (
  id bigserial,
  region text NOT NULL,
  PRIMARY KEY (id, region)
) PARTITION BY LIST (region);

CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('us-east', 'us-west');
CREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('eu-west', 'eu-central');
CREATE TABLE orders_other PARTITION OF orders DEFAULT;

List is the right choice for multi-region or multi-tenant when you want operational isolation per partition (different backup schedules, different retention, different storage tiers).

Partition sizing

The math people skip. A partition that’s too small turns every query into a planner-side iteration over hundreds of partitions. A partition that’s too large makes maintenance (vacuum, reindex) painful and defeats the operational case for partitioning.

My rule: each partition between 10 GB and 100 GB. Below 10 GB you’re paying overhead for nothing. Above 100 GB you can’t VACUUM it in a maintenance window.

For a 5 TB events table that’s 50 partitions of 100 GB. With monthly range partitioning that’s about 4 years of retention. Fine.

For a 50 TB events table, monthly is way too coarse — partitions hit 1 TB each. Switch to weekly or daily, accept that you’ll have 1000+ partitions over the retention window, and prune aggressively.

The planner handles up to a few thousand partitions in 17 without falling over. Past that, performance degrades — query planning time grows. The hard ceiling for me is 4000 partitions per parent.

Partition pruning, the whole point

Partitioning earns its keep through pruning: the planner skipping partitions it knows can’t match a query. This is automatic in 17 when the WHERE clause references the partition column with constants or stable expressions.

EXPLAIN SELECT count(*) FROM events
WHERE occurred_at >= '2024-11-01' AND occurred_at < '2024-11-15';

You want to see Subplans Removed: in the output, or only the matching partitions in the plan. If every partition shows up, pruning failed.

Common pruning killers:

  • WHERE occurred_at::date = '2024-11-10': the cast hides the partition value. Use >= '2024-11-10' AND < '2024-11-11'.
  • WHERE occurred_at = now() - interval '1 day': now() is stable within a statement so this works in 17. In 12 and earlier it didn’t.
  • WHERE occurred_at = ANY(some_subquery): subqueries usually defeat pruning. Materialize first.

The planner can also do “execution-time pruning” inside nested loops in 17, but it’s slower than plan-time. Aim for plan-time pruning by writing direct comparisons against the partition column.

The partitioning docs cover the pruning conditions in detail.

Foreign keys, finally usable

Until Postgres 12, foreign keys to partitioned tables didn’t work. They do now, and 17 handles them efficiently. Foreign keys from a partitioned table always worked.

CREATE TABLE order_items (
  order_id bigint NOT NULL,
  occurred_at timestamptz NOT NULL,
  product_id bigint NOT NULL,
  FOREIGN KEY (order_id, occurred_at) REFERENCES orders (id, occurred_at)
);

Note the composite FK matches the composite PK on the partitioned table. This works in 17 with reasonable performance.

The constraint enforcement is per-partition, so dropping an old partition with DROP TABLE skips constraint checking. That’s a feature for retention drops but bites if you DROP a partition with live references in another table. Use DETACH PARTITION first if you’re not sure.

Operational patterns

The repeatable patterns that hold up in production.

Pre-create future partitions

Never let a write fail because the partition doesn’t exist. For range-partitioned tables, create partitions a few months ahead:

CREATE TABLE events_2025_01 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

Or use pg_partman, which automates this. It runs in cron or pg_cron, creates upcoming partitions, drops old ones based on retention.

Drop, don’t delete

The killer feature of partitioning is DROP TABLE events_2023_11. Instant, no vacuum needed, reclaims disk. Compare to DELETE FROM events WHERE occurred_at < '2024-01-01' which writes WAL for every row and leaves dead tuples for vacuum to chase.

Index per partition

Indexes on a partitioned table create matching indexes on every existing partition and on every new partition. In 17 this is fully automatic, including UNIQUE indexes (as long as they include the partition key).

CREATE INDEX events_user_id_idx ON events (user_id);
-- propagates to every partition, future ones too

For partition-specific indexes (e.g., a partial index that only makes sense for recent data), index the partition directly, not the parent.

Default partition for safety

Always have a default partition that catches unrouted values. It saves you from WHERE violations:

CREATE TABLE events_default PARTITION OF events DEFAULT;

Monitor its size. Anything in there is a bug in your write path.

Migrating an existing table

The painful step. There’s no ALTER TABLE ... PARTITION BY in 17 — you have to create a new table, copy data, swap names.

-- 1. create new partitioned table
CREATE TABLE events_new (LIKE events INCLUDING ALL) PARTITION BY RANGE (occurred_at);

-- 2. create partitions
-- ...

-- 3. backfill in batches, off-peak
INSERT INTO events_new SELECT * FROM events
  WHERE occurred_at >= '2024-01-01' AND occurred_at < '2024-02-01';
-- repeat per month

-- 4. swap (during a short maintenance window)
BEGIN;
ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_new RENAME TO events;
COMMIT;

For multi-TB tables, the backfill is the slow part. Use INSERT ... SELECT in batches with ON CONFLICT DO NOTHING to be restartable, or set up logical replication from old to new and switch.

The blue-green pattern with logical replication described in logical replication for blue green Postgres deploys works well here.

Common Pitfalls

Things that bite people running this in production.

  • Forgetting the partition column in the primary key. Postgres requires it for partitioned tables. If your PK was id, it must become (id, occurred_at).
  • SELECT max(id) is slow. Without a single global index on id, the planner must scan every partition’s index. Use the natural primary key (id, occurred_at) and query bounded by time.
  • Bulk loads via the parent. Postgres routes each row to its partition, with overhead. For initial loads, target the leaf partition directly (COPY events_2024_11 FROM ...).
  • Cross-partition updates. Updating the partition column moves the row to a different partition. In 17 this works but is twice as expensive as a normal update. Avoid by design.
  • pg_dump of partitioned tables. Dumps the parent plus children. Restoring with --data-only can fail if partitions don’t exist yet. Use --section to control ordering.
  • Default partition catches all the misses. It can grow huge silently. Alert on its row count.

The mistake that costs the most is choosing the wrong partition key. Changing it later means a full table rewrite. Spend a day thinking about access patterns before you commit.

Wrapping Up

Native partitioning in Postgres 17 is mature enough that I default to it for any table that will exceed a few hundred GB or that needs time-based retention. The ergonomics are good, the planner is smart, and the operational story (with pg_partman) is automatable.

The single biggest payoff is the ability to drop old data instantly. Once you’ve done that on a 5 TB events table at 3 AM during a disk-full incident, you’ll never go back to monolithic tables.

Pick the strategy that matches your access pattern, size partitions in the 10-100 GB range, and let pg_partman handle the calendar.