Reading pg_stat_io and Modern Postgres Internals
TL;DR —
pg_stat_iois the first time Postgres has exposed real IO mechanics from inside the engine. Learn what its columns mean, what the contexts are, and you’ll diagnose storage and memory problems in minutes that used to take hours of OS-level forensics.
Before Postgres 16 introduced pg_stat_io, diagnosing IO problems in Postgres meant stitching together iostat output, vmstat, application logs, and a lot of guessing. The engine itself didn’t tell you much about what it was actually doing to your storage. Postgres 17 improved the view considerably and added more contexts, more backend types, and better integration with the streaming IO infrastructure.
I want to walk through pg_stat_io properly because most documentation explains the columns but not how to read them as a system. The view is information-dense. Each row tells you something specific about how Postgres is interacting with its storage layer, and once you can read it fluently, capacity planning becomes a quantitative exercise rather than guesswork.
This is going to be more internals than tutorial. If you want a hands-on tuning walkthrough first, check the Postgres 17 tuning guide. Here we’re going to look at how the engine actually moves data.
1. The Buffer Cache, Briefly
Postgres reads and writes data in 8 KB blocks (the page size). Every block goes through shared_buffers, an in-memory cache shared across all backends.
Backend Process
|
v
+-----------+-----------+
| shared_buffers |
| (cache, NBuffers) |
+-----------+-----------+
|
miss / dirty flush
v
+-----------+-----------+
| OS Page Cache |
| (kernel-managed) |
+-----------+-----------+
|
v
+-----------+-----------+
| Block Device |
+-----------------------+
When a backend needs a block:
- Look it up in
shared_buffers. If it’s there, that’s ahit. - If not, evict a buffer (
eviction) and read the block from the OS (read). - The OS might serve from page cache (fast) or hit disk (slow).
- If the block needs to be written, mark it dirty. Eventually the checkpointer, background writer, or backend flushes it (
write).
pg_stat_io exposes the counters for each of these operations, segmented by who did them and why.
2. The Shape Of The View
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'pg_stat_io'
ORDER BY ordinal_position;
You get a row per (backend_type, object, context) combination. The columns include reads, writes, extends, hits, evictions, reuses, fsyncs, plus timing columns in Postgres 17.
backend_type
Who is doing the IO:
client backend— regular query backends.autovacuum worker— vacuum.background writer— flushing dirty buffers.checkpointer— checkpoints.walwriter— WAL flushing.walsender— replication.standalone backend— single-user mode (rare).
object
What kind of storage:
relation— table or index data.temp relation— temp tables and sort spill files.
context
Why this IO happened:
normal— regular reads through shared_buffers.bulkread— sequential scans that bypass shared_buffers via the ring buffer.bulkwrite— COPY, CREATE TABLE AS, and similar bulk operations.vacuum— vacuum-driven IO.
The context column is the most important one for diagnosis. It tells you the intent behind the IO.
3. Reading Cache Behavior
Start with the simple question: how much of your workload is hitting shared_buffers?
SELECT backend_type,
sum(hits) AS hits,
sum(reads) AS reads,
round(100.0 * sum(hits)::numeric / nullif(sum(hits) + sum(reads), 0), 2) AS hit_ratio_pct
FROM pg_stat_io
WHERE object = 'relation' AND context = 'normal'
GROUP BY backend_type
ORDER BY hit_ratio_pct;
What you want to see: hit ratio above 95% for client backends on busy OLTP systems. Below 90% means your shared_buffers is too small or your working set is much larger than memory.
But hit ratio alone is misleading. A workload doing 1000 reads per second with a 99% hit ratio still does 10 misses per second. If your storage is slow, that’s a problem.
Eviction signals memory pressure
SELECT backend_type, object, context,
evictions,
reuses,
round(100.0 * evictions::numeric / nullif(evictions + reuses, 0), 2) AS evict_ratio
FROM pg_stat_io
WHERE evictions > 0 OR reuses > 0
ORDER BY evictions DESC;
evictions is when a buffer needed by a new request displaces an existing buffer. reuses is when the same backend gets a buffer back that it owned (ring buffer behavior for bulk operations).
A high eviction rate on client backend + normal context means your buffer pool is too small. The active working set doesn’t fit, so you’re constantly evicting and reading the same blocks.
Background writer vs backends
The two most important rows for write performance:
SELECT backend_type, writes, write_time
FROM pg_stat_io
WHERE backend_type IN ('background writer', 'client backend', 'checkpointer')
AND context = 'normal' AND writes > 0;
You want most writes to come from the background writer and checkpointer. If client backend writes are high, that means individual query backends are flushing dirty pages because the background writer can’t keep up. Every such write adds latency to that backend’s query.
The fix is either tuning bgwriter_lru_maxpages higher or accepting that your workload writes more dirty data than your bgwriter can flush.
4. Sequential Scans And bulkread
bulkread is special. When Postgres does a sequential scan on a table larger than 25% of shared_buffers, it doesn’t pull the whole table into the cache (which would evict everything useful). Instead it uses a small “ring buffer” of 256 KB and recycles those pages.
SELECT backend_type, reads, hits, evictions, reuses
FROM pg_stat_io
WHERE context = 'bulkread' AND reads > 0
ORDER BY reads DESC;
You’ll see:
reads= how many blocks were read from storage during sequential scans.reuses= how many times the ring buffer cycled.hits= how many of those blocks happened to be in shared_buffers anyway.
If bulkread.reads is high and you’re not running analytics, something is doing unintended seq scans. Find what:
SELECT query,
seq_scan, seq_tup_read,
calls
FROM pg_stat_statements
JOIN pg_stat_user_tables ON true -- gross, but illustrative
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC LIMIT 10;
Usually it’s a missing index, a query the planner can’t optimize, or ANALYZE being out of date.
5. Vacuum IO Behavior
Vacuum has its own IO context, and it’s worth watching closely.
SELECT backend_type, reads, writes, extends, fsyncs,
read_time, write_time
FROM pg_stat_io
WHERE context = 'vacuum'
ORDER BY reads + writes DESC;
What this tells you:
reads= how much data vacuum is reading.writes= how much dead-tuple cleanup is happening.extends= vacuum extending tables (rare, usually means freezing batch growth).fsyncs= how many forced syncs (should be low; vacuum batches its IO).
Postgres 17’s improved vacuum uses the streaming IO infrastructure, so you’ll see vacuum IO is now meaningfully faster than 16. If vacuum.read_time is dominating your IO budget, autovacuum is fighting for resources. Either lower autovacuum_vacuum_cost_delay so vacuum makes progress faster, or raise autovacuum_vacuum_cost_limit.
Vacuum read-ahead
In 17, vacuum uses async IO read-ahead. You’ll see this as reads being substantially more than what a synchronous scan would produce in the same wall-clock time. This is good. It means vacuum is parallelizing IO with CPU work.
6. Putting It Together For Capacity Planning
Here’s the diagnostic query I run when a Postgres database “feels slow” and I don’t know why yet.
WITH io_summary AS (
SELECT
backend_type, context,
sum(reads) AS reads,
sum(writes) AS writes,
sum(extends) AS extends,
sum(hits) AS hits,
sum(evictions) AS evictions,
sum(coalesce(read_time, 0)) AS read_ms,
sum(coalesce(write_time, 0)) AS write_ms
FROM pg_stat_io
WHERE object = 'relation'
GROUP BY backend_type, context
)
SELECT backend_type, context,
reads, writes, extends,
round(100.0 * hits::numeric / nullif(hits + reads, 0), 1) AS hit_pct,
evictions,
round(read_ms::numeric, 1) AS read_ms,
round(write_ms::numeric, 1) AS write_ms,
round(read_ms::numeric / nullif(reads, 0), 3) AS ms_per_read
FROM io_summary
WHERE reads + writes + extends > 0
ORDER BY reads + writes DESC;
The ms_per_read column is the storage latency Postgres is observing. If it’s above 2-3 ms on NVMe, your storage is overloaded or there’s queue depth contention. If it’s above 10 ms on any modern storage, fix that before tuning anything else.
Reading the signal
A few patterns to recognize:
- High
client backendwrites with lowbackground writerwrites: bgwriter underprovisioned, raisebgwriter_lru_maxpages. - High
autovacuum workerreads: vacuum scanning large tables, normal if you have big tables. Worry if read_time per read is high. - High
bulkreadreads fromclient backend: unintended seq scans. - High
vacuum.fsyncs: vacuum is being interrupted (cancellation timeouts), checkpg_stat_progress_vacuum. - Evictions on
client backendnormal context: shared_buffers too small.
Comparing to OS-level metrics
pg_stat_io shows what Postgres asked for. iostat shows what the kernel did. The difference between them tells you about the OS page cache. If Postgres says 100k reads/sec but iostat shows 5k reads/sec, the page cache is absorbing 95% of those reads. That’s good. If they match, the page cache isn’t helping you.
7. The Statistics Snapshot
pg_stat_io is a snapshot since stats reset. To make it useful, you need delta measurements:
-- Save snapshot
CREATE TEMP TABLE io_t1 AS SELECT *, now() AS sampled FROM pg_stat_io;
-- Wait some time, do work
SELECT pg_sleep(60);
-- Compute deltas
WITH t2 AS (SELECT *, now() AS sampled FROM pg_stat_io)
SELECT t2.backend_type, t2.context,
t2.reads - t1.reads AS delta_reads,
t2.writes - t1.writes AS delta_writes,
t2.evictions - t1.evictions AS delta_evictions
FROM t2 JOIN io_t1 t1 USING (backend_type, object, context)
WHERE t2.reads + t2.writes > t1.reads + t1.writes
ORDER BY delta_reads DESC;
In production, scrape pg_stat_io every minute into your monitoring system and compute rates from there. The Postgres exporter for Prometheus does this.
Resetting stats
You can reset the cumulative counters:
SELECT pg_stat_reset_shared('io');
Useful for “give me a clean 5 minute sample of this workload” but don’t do it casually because long-term trend data is lost.
Common Pitfalls
1. Treating hit ratio as a north star
A high hit ratio doesn’t mean fast queries. You can have 99% hit ratio and still be IO-bound if the 1% misses are on a hot path. Look at absolute miss counts and latency, not just ratio.
2. Confusing bulkread and normal reads
bulkread reads are intentional (sequential scans, vacuum) and use a ring buffer. They don’t pollute shared_buffers. Don’t treat them the same as normal reads. A high bulkread.reads is fine if it’s expected workload.
3. Ignoring extends
extends is when Postgres adds a new block to a table file. High extends without high inserts means you have a hot table with lots of small writes causing many file extensions. This can be a contention point. Postgres 17 made extension cheaper, but it’s still worth watching for high-write tables.
4. Reading stats from a replica
pg_stat_io on a replica shows replay IO, not the application’s IO. If you’re tuning a primary, look at the primary’s stats.
Troubleshooting
read_time numbers are zero
track_io_timing is off. Enable it:
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();
There’s a small overhead (~3% on busy systems) but for production it’s worth it. Most managed Postgres providers have it on by default.
pg_stat_io has rows with all zeros
Postgres tracks every combination it sees, even if counts are zero in the current sample. Filter WHERE reads + writes + hits + evictions > 0.
Numbers don’t match my OS metrics
OS metrics include page cache reads (which don’t hit disk) as well as actual disk IO. Postgres counts everything it asked for. The difference is what your OS cache absorbed. If they’re very different, your OS cache is doing useful work.
The Postgres monitoring documentation has the canonical reference for every column in pg_stat_io.
Wrapping Up
pg_stat_io is the closest Postgres has come to making its IO behavior legible. With practice, you can spot memory pressure, vacuum issues, sequential scan problems, and storage latency from this one view. It pairs naturally with pg_stat_statements for the workload side and OS-level tools for the storage side.
Spend an hour reading your production pg_stat_io output. It will teach you more about your database than a week of tuning blog posts.
Next, if you’re running TimescaleDB, the time-series hypertable guide shows how IO patterns differ for hypertables specifically.