Reading EXPLAIN ANALYZE Like a Senior DBA
TL;DR — Read EXPLAIN ANALYZE bottom up, compare
rowstoactual rowsfor estimation skew, watchBuffersfor hidden I/O, and ignore total runtime until you’ve looked at the suspect node. Postgres 17 added a couple of new lines worth knowing.
I’ve watched a lot of engineers stare at an EXPLAIN ANALYZE plan and zoom straight to total runtime at the bottom. That number tells you the query is slow, which you already knew. It doesn’t tell you why.
A senior DBA reads plans in a specific order, with specific numbers in mind. There’s no magic to it — the executor leaves a trail, and you follow it. This post is the reading order I use, with annotated examples on Postgres 17. By the end you should be able to look at a 40-line plan and find the broken node in under a minute.
I’m assuming you know what an index scan is and can run EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS) <query>. If not, that’s the right starting incantation. BUFFERS is on by default in Postgres 17, but type it anyway so the habit transfers to 15 and 16 clusters.
The shape of a plan
A plan is a tree. The executor walks it depth-first, left-to-right, and each node pulls rows from its children. EXPLAIN prints it pre-order with indentation, so the root is at top:
Nested Loop (cost=0.42..187.34 rows=12 width=48) (actual time=0.234..14.892 rows=11 loops=1)
Buffers: shared hit=842
-> Index Scan using orders_customer_idx on orders o ...
-> Index Scan using line_items_order_idx on line_items li ...
Read it bottom up. The lowest, most indented nodes are the leaves that touch storage. Their times and row counts feed the parents. If a leaf is wrong, every parent above it is wrong, but the parent’s time includes the children, which makes the parent look guilty when it isn’t.
The first number to look at is at the bottom of the tree, not the top.
Estimated vs actual
Every node prints both an estimate and an actual:
Index Scan using orders_customer_idx on orders o
(cost=0.42..8.44 rows=1 width=20)
(actual time=0.018..0.020 rows=137 loops=1)
rows=1 is the planner’s estimate. actual rows=137 is reality. That’s a 137x miss, and it’s almost always the root cause of a bad plan. The planner picked a nested loop because it thought one row was coming back. With 137 rows, you wanted a hash join.
The senior move: scan every node and find the one with the largest estimate-vs-actual gap. That’s your bug. Common causes:
- Stale
pg_statistic. RunANALYZE. - Correlated columns the planner thinks are independent. Add
CREATE STATISTICS ... ON (col_a, col_b) FROM tbland ANALYZE. - A
WHEREclause with an opaque function the planner can’t see through. Inline it or add a functional index with statistics. - A type mismatch silently casting and disabling stats. Watch for
varcharvstext,intvsbigint.
If estimates and actuals are within 2x at every node, the plan is shaped right. Look elsewhere for the slowness.
Loops, the silent multiplier
loops=1 at the end of an actual line means the node ran once. loops=137 means the executor called this subtree 137 times — once per outer row in the loop above it:
Nested Loop (rows=137 loops=1)
-> Index Scan on orders (actual time=0.018..0.020 rows=137 loops=1)
-> Index Scan on line_items (actual time=0.045..0.812 rows=4 loops=137)
The line_items time of 0.812 ms is per loop. Real cost is 137 * 0.812 = 111 ms. Postgres 17 prints averages, so multiply mentally. If you see loops in the thousands, nested loop is probably the wrong join strategy and a hash or merge would crush this query.
Buffers, where the real I/O hides
Buffers: shared hit=842 read=0 means 842 8-KB pages came from the buffer cache, none from disk. Multiply by 8 to get KB. This query touched 6.5 MB of memory.
read=N means N pages were not in cache. The kernel may have had them in its page cache (free, almost), or they came from disk (slow). Postgres can’t tell the difference, but a high read after a warm cache is suspicious. Postgres 17 also surfaces temp read= and temp written= when a sort or hash spills to disk. Those are the actually expensive ones.
Sort (actual time=2300..2410 rows=2.1M loops=1)
Sort Method: external merge Disk: 184MB
Buffers: shared hit=4231, temp read=23552 written=23552
184 MB of temp files is a sort that should have been work_mem. Bump work_mem for this session or rewrite the query to avoid the sort.
Reading the new Postgres 17 lines
Postgres 17 added a few output details worth recognizing.
I/O Timings is now broken out per node when track_io_timing = on. Combined with the streaming I/O subsystem, you can see prefetched read time separated from synchronous read time:
Buffers: shared hit=842 read=128
I/O Timings: shared read=12.4 read=2.1
Parallel Workers Launched: 4 is no longer at the end of the plan — it’s per Gather node. Confirm you actually got the workers you wanted. If Launched < Planned, you’re hitting max_parallel_workers and the plan ran underpowered.
Memoize nodes appear when the planner figured out it can cache inner-side results from a nested loop. When they work, you’ll see Hits: 9821 Misses: 12. When they don’t, you’ll see Evictions: 9000 and the query is slower than it would have been without memoization.
A worked example
Here’s a real plan I tuned last week. The query is “find the last order per customer for a 90-day window”.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (customer_id) customer_id, order_id, created_at
FROM orders
WHERE created_at > now() - interval '90 days'
ORDER BY customer_id, created_at DESC;
Original plan:
Unique (actual time=4321..4502 rows=82412 loops=1)
-> Sort (actual time=4321..4421 rows=2.1M loops=1)
Sort Key: customer_id, created_at DESC
Sort Method: external merge Disk: 184MB
-> Seq Scan on orders (actual time=0.012..820 rows=2.1M)
Filter: (created_at > now() - '90 days')
Rows Removed by Filter: 18000000
Reading bottom up: seq scan over 20M rows, filter discards 18M, sort spills 184 MB to disk. The Rows Removed by Filter is the smoking gun — we’re scanning the entire table for a 10% slice. A (created_at, customer_id, created_at DESC) index turned this into:
Index Scan using orders_created_customer_idx on orders
(actual time=0.018..420 rows=2.1M loops=1)
Buffers: shared hit=42312 read=128
418 ms total, no sort. The trick was matching the DISTINCT ON sort order in the index. The senior move is recognizing that DISTINCT ON wants a sorted input and giving it one.
For the related question of which index to reach for in which situation, see choosing the right Postgres index, BRIN, GIN, HNSW, IVFFlat.
Common Pitfalls
A few traps that cost junior engineers hours.
- Running EXPLAIN ANALYZE on writes. It actually runs the query. Wrap in
BEGIN; EXPLAIN ANALYZE ...; ROLLBACK;forUPDATE/DELETE/INSERT. - Caching effects. Run the query twice and look at the second. The first warms the cache. Or use
pg_buffercacheto inspect what’s hot. - JIT distortion. With
jit = on, small queries can spend 200 ms on JIT compilation. TheJIT:block at the bottom of the plan tells you. Disable for OLTP, enable for OLAP. loopsmath. Always multiply per-loop time by loop count. The plan total at the top does this for you, but per-node you have to.- Misreading
cost=. Cost is an estimate in arbitrary units. Two plans with the same cost can have wildly different runtimes. Trust actuals.
The single most common mistake I see is anchoring on the top-of-plan total runtime and trying to “optimize the slow node” without checking estimates first. Fix the estimate, the plan changes shape, the slow node disappears.
The official EXPLAIN documentation is short and worth a re-read every couple of years.
Wrapping Up
EXPLAIN ANALYZE rewards a reading discipline. Bottom up, estimates vs actuals, loops, buffers. After a few hundred plans the pattern recognition becomes automatic and you stop reaching for tooling to interpret them.
If you’re starting out, pick five slow queries from pg_stat_statements, capture plans, and find the worst estimate skew in each. That single exercise will teach you more than any blog post, including this one. Postgres 17 made the output slightly richer; the reading order didn’t change.