background-shape
Vacuum and Bloat Management for Busy Postgres Tables
November 20, 2024 · 7 min read · by Muhammad Amal programming

TL;DR — Tune autovacuum per table for hot tables, not globally. Postgres 17’s new vacuum memory engine eats wide indexes for breakfast. Monitor bloat with pg_stat_user_tables and pgstattuple. Use pg_repack when REINDEX isn’t enough.

Every Postgres operator I respect has spent at least one weekend chasing autovacuum. The default settings work fine for a 10 GB database with moderate write traffic. They fail predictably and silently on busy production workloads, and the symptom is usually “everything got slow and we don’t know why”.

This post is the bloat-and-vacuum playbook I follow on tables doing more than a few hundred writes per second. I’ll cover what bloat actually is, how to measure it, the autovacuum knobs that matter, Postgres 17’s specific improvements, and the escalation path to pg_repack when nothing else works.

There’s no clever trick at the end. The discipline is “measure, tune per table, monitor, repeat”. But knowing which numbers to look at saves you weeks.

What bloat actually is

Postgres uses MVCC. Every UPDATE writes a new row version and marks the old one dead. Every DELETE marks rows dead without removing them. Dead rows stay in the table until vacuum cleans them up.

Bloat is the accumulation of dead rows plus the empty space within pages where dead rows used to be. A “10 GB” table that’s 30% bloated is really 7 GB of live data in 10 GB of allocated storage. Queries read all 10 GB. The buffer cache holds all 10 GB. Backups, replicas, and WAL all pay for all 10 GB.

There are two related issues:

  • Heap bloat: dead rows in the table itself.
  • Index bloat: index entries pointing to dead row versions, even after the heap is cleaned up.

Autovacuum handles heap bloat continuously. Index bloat requires REINDEX. Both need attention.

Measure bloat first

Don’t tune autovacuum without knowing what you’re tuning against.

SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100 * n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

dead_pct above 20% on a hot table means autovacuum can’t keep up. Above 50% you’re in trouble.

For deeper inspection, install the pgstattuple extension:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders');
-- table_len, tuple_count, tuple_percent, dead_tuple_count, dead_tuple_percent, free_percent

pgstattuple reads the whole table to give exact numbers. It’s slow. Use it on suspected problem tables, not as a routine query.

For indexes:

SELECT * FROM pgstatindex('orders_pkey');
-- avg_leaf_density: 90 is healthy, below 50 means bloat

A B-tree index normally sits at 75-90% density. Below 60% you should REINDEX CONCURRENTLY.

Autovacuum, the right way to tune

The default autovacuum thresholds are designed not to break small tables, which means they’re way too lazy for busy ones.

# postgresql.conf defaults
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1

scale_factor = 0.2 means autovacuum runs when 20% of the table is dead. On a 100M row table that’s 20M dead rows before anything happens. By then queries are slow.

Per-table overrides are the right answer:

ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold = 1000,
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_vacuum_cost_limit = 2000
);

This says: vacuum when 2% of the table is dead, or 1000 rows whichever is bigger; analyze at 1%; let vacuum work at higher cost (more aggressive). For a 100M row table, this triggers at 2M dead rows instead of 20M.

The cost knobs are the second half of the story:

  • autovacuum_vacuum_cost_limit: total work units per round. Default 200 is timid. Production busy tables want 1000-3000.
  • autovacuum_vacuum_cost_delay: pause between batches. Default 2 ms is fine; 1 ms or 0 for very busy tables.
  • autovacuum_max_workers: how many tables can be vacuumed in parallel. Default 3. Bump to 6-10 on multi-table-heavy clusters.

The skill is matching these to your I/O budget. Autovacuum consuming 30% of available IOPS during business hours is fine if the alternative is bloat-induced query slowness.

Postgres 17’s vacuum memory engine

The biggest internal change in 17 is how vacuum stores the list of dead tuple identifiers (TIDs) it’s collected. Old versions kept them in a flat array sized by maintenance_work_mem, capped at 1 GB. Wide indexes meant multiple vacuum passes over the table, each one expensive.

17 uses a radix-tree storage that’s an order of magnitude more memory-efficient. The 1 GB cap is also gone — you can give vacuum 16 GB if you want.

SET maintenance_work_mem = '8GB';
VACUUM (VERBOSE) large_table;

The practical effect is one-pass vacuums even on tables with many wide indexes. A vacuum that used to take 4 hours and 3 passes on a 1 TB table now takes 90 minutes and one pass on the same hardware. This is the single biggest “free” upgrade win for write-heavy workloads in 17.

Set maintenance_work_mem high in postgresql.conf (4-8 GB on a modern server with 64+ GB RAM), and autovacuum benefits automatically.

The routine vacuuming docs cover the full set of knobs.

When to REINDEX

VACUUM doesn’t shrink indexes. It only marks entries as removable when the corresponding heap tuple is dead. The index pages stay allocated.

Run REINDEX INDEX CONCURRENTLY when:

  • pgstatindex shows avg_leaf_density < 60.
  • Index size is more than 50% larger than its “fresh” size.
  • A long-running transaction recently held back vacuum (called “xmin horizon” pinning) and you suspect bloat accumulated.
REINDEX INDEX CONCURRENTLY orders_user_idx;

CONCURRENTLY in 17 is solid. It builds a new index alongside the old, swaps them, drops the old. Brief locks at swap time, otherwise fully online.

REINDEX TABLE CONCURRENTLY rebuilds all indexes on a table. Useful but slow. Schedule for off-peak.

When REINDEX isn’t enough, pg_repack

If the heap itself is bloated — not just indexes — REINDEX doesn’t help. VACUUM FULL does, but takes an AccessExclusive lock for the duration. On a live system that’s unacceptable.

pg_repack rebuilds the table online. It creates a shadow table, replays writes via triggers, swaps at the end. Brief lock at swap, otherwise online.

pg_repack -h db.internal -U postgres -d app -t orders --no-superuser-check

I use pg_repack on bloated heaps once or twice a year on the busiest tables. With proper per-table autovacuum tuning it should be rare. If you’re running pg_repack monthly, autovacuum is mistuned.

What blocks vacuum

Three things stop vacuum from cleaning up dead tuples even when it runs:

  1. Long-running transactions. Vacuum can’t reclaim rows that a still-open transaction might be able to see. Monitor with:

    SELECT pid, now() - xact_start AS age, state, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start LIMIT 10;
    

    Any transaction older than a few minutes during business hours is a problem. Set statement_timeout and idle_in_transaction_session_timeout in production.

  2. Replication slots with stale catalog_xmin or xmin. A logical replication slot whose subscriber is offline holds back vacuum on the primary. Drop unused slots or alert on slot lag. See the operational notes in logical replication for blue green Postgres deploys.

  3. Prepared transactions. PREPARE TRANSACTION outside an XA coordinator left running. Almost always a bug. Find with:

    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
    

If n_dead_tup keeps climbing despite autovacuum running, the cause is almost certainly one of those three.

Common Pitfalls

The mistakes that cost the most sleep.

  • Global autovacuum tuning. Setting cluster-wide aggressive thresholds hurts small/cold tables. Always tune per table.
  • VACUUM FULL in production. Takes AccessExclusive lock. Almost never the right answer. Use pg_repack.
  • Index bloat after a long transaction. Even after the transaction ends, indexes don’t shrink. REINDEX CONCURRENTLY periodically on hot tables.
  • maintenance_work_mem too low. With Postgres 17, bump to several GB. Autovacuum and manual vacuum both benefit.
  • HOT updates not happening. If you can fit updates in the same page (small free-space-map fillfactor matters), Postgres does HOT updates that don’t require index updates. fillfactor = 90 on update-heavy tables helps significantly.
  • Ignoring TOAST tables. Bloated TOAST tables for wide JSONB or text columns are common and invisible to most monitoring. Check pg_stat_user_tables for the pg_toast.pg_toast_<oid> entries.

The mistake I see most often is treating autovacuum as a background concern. On any database doing real write traffic, it deserves the same monitoring attention as query latency.

What’s Next

Postgres 17 made vacuum substantially faster, which gives you more headroom before you have to tune. But headroom isn’t a strategy. The discipline of measuring bloat, tuning per table, alerting on long transactions, and reindexing periodically is what keeps write-heavy clusters healthy over years.

If you do nothing else after reading this, set idle_in_transaction_session_timeout = '5min' cluster-wide. It’s the single highest-leverage change for vacuum health, and it prevents 90% of the “vacuum can’t keep up” incidents I get called into.