background-shape
Postgres Bulk Loading, COPY and Prepared Statements
October 24, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Postgres COPY is 100× faster than per-row INSERT. UNLOGGED tables for staging skip WAL. Drop indexes during bulk load, recreate after. Batch sizes 1K-10K rows. The single most impactful Postgres tuning for ETL.

After Debezium CDC, the destination side. When loading data into Postgres at volume, COPY is the only reasonable choice.

Why COPY

Concrete numbers loading 1M rows on a typical Postgres setup:

Method Time
Per-row INSERT, no transaction 1500s
Per-row INSERT in transaction 180s
INSERT with VALUES (multi-row) 75s
Prepared INSERT in transaction 60s
COPY 8s
COPY into UNLOGGED table 4s

200× difference end-to-end. The trick is in network round-trips and WAL writes. COPY streams; INSERT round-trips.

COPY from text

Standard form:

COPY orders (id, customer_id, status, total_cents)
FROM STDIN
WITH (FORMAT csv, HEADER true);
1,42,active,1000
2,42,active,2500
3,99,active,500
\.

From the command line:

psql -c "\copy orders FROM 'data.csv' WITH (FORMAT csv, HEADER true)"

From application code:

Python (psycopg2):

import psycopg2

conn = psycopg2.connect(...)
cur = conn.cursor()
with open('data.csv', 'r') as f:
    cur.copy_expert("COPY orders FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.commit()

Go (pgx):

rows := [][]any{
    {1, 42, "active", 1000},
    {2, 42, "active", 2500},
}
_, err := pool.CopyFrom(ctx,
    []string{"orders"},
    []string{"id", "customer_id", "status", "total_cents"},
    pgx.CopyFromRows(rows),
)

pgx’s CopyFrom accepts iterators too — for streaming from a remote source without buffering everything.

COPY with conflict handling

COPY doesn’t support ON CONFLICT directly. Workaround: stage then merge:

-- Create staging table
CREATE UNLOGGED TABLE orders_staging (LIKE orders);

-- COPY into staging
COPY orders_staging FROM ...;

-- Merge
INSERT INTO orders SELECT * FROM orders_staging
ON CONFLICT (id) DO UPDATE SET ...;

-- Cleanup
DROP TABLE orders_staging;

Staging table is UNLOGGED (no WAL, much faster). Final INSERT goes through WAL but reads from local memory.

For very large merges, drop unique indexes on the target before INSERT, recreate after. Trade-off: brief window of unique-constraint violation.

UNLOGGED tables

CREATE UNLOGGED TABLE staging (
  id bigint,
  data jsonb
);

Pros: 2-3× faster writes. No WAL.

Cons:

  • Not replicated (no streaming replication, no logical replication)
  • Truncated on crash recovery
  • Data lost on Postgres restart in some failure modes

Perfect for: ephemeral staging tables you populate then discard.

Not for: permanent data, replication sources, anything you care about persisting.

To convert:

ALTER TABLE staging SET LOGGED;     -- promote to permanent
ALTER TABLE staging SET UNLOGGED;   -- demote (data preserved, WAL stops)

Index management during load

Indexes slow down writes (each insert updates every index). For bulk loads of millions of rows:

-- Drop indexes (except primary key constraint usually)
DROP INDEX orders_customer_id_idx;
DROP INDEX orders_created_at_idx;

-- Bulk COPY
COPY orders FROM ...;

-- Recreate concurrently (no lock)
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);
CREATE INDEX CONCURRENTLY orders_created_at_idx ON orders (created_at);

Total time often shorter than load-with-indexes. Drop saves ~30%, rebuild adds ~10%, net 20% faster.

For ongoing-load tables (not one-time backfill), this is too disruptive. Use it for migrations only.

Batch sizes for streaming COPY

For pipelines that COPY in batches from a source:

BATCH_SIZE = 5000

while True:
    rows = source_fetch(limit=BATCH_SIZE)
    if not rows: break
    copy_into_destination(rows)

Sweet spot: 1K-10K rows. Smaller = network overhead dominates; larger = memory pressure and worse parallelism.

For very high throughput, parallel COPY into different partitions of a partitioned table. Multiple connections, each writing to a different partition. Linear scaling up to disk I/O.

prepared INSERT — when COPY isn’t available

Some setups (cross-DB, Restricted permissions) don’t allow COPY. Prepared statements are the next-best:

stmt = conn.prepare("INSERT INTO orders (id, customer_id, status) VALUES ($1, $2, $3)")
for row in rows:
    conn.execute(stmt, row)

5-10× faster than non-prepared. Still way slower than COPY.

For pgx in Go, multi-row INSERT via VALUES:

values := []any{}
placeholders := []string{}
for i, row := range rows {
    n := i * 3
    placeholders = append(placeholders, fmt.Sprintf("($%d,$%d,$%d)", n+1, n+2, n+3))
    values = append(values, row.ID, row.CustomerID, row.Status)
}
sql := "INSERT INTO orders (id, customer_id, status) VALUES " + strings.Join(placeholders, ",")
pool.Exec(ctx, sql, values...)

5-10× faster than per-row INSERT.

Vacuum after bulk load

Big bulk loads create many dead tuples (if upsert) or skew statistics. After a major load:

ANALYZE orders;
VACUUM orders;

Optional but worth it before queries start hitting the new data.

Common Pitfalls

Per-row INSERT loop. Even in a transaction, 100× slower than COPY.

Forgetting to commit. Long-running COPY in autocommit-off mode without commit = invisible to other sessions.

COPY with ON CONFLICT. Not supported. Stage + merge.

Indexes during huge bulk load. Drop + recreate often faster.

UNLOGGED table for production data. Data loss on crash. Use only for staging.

Mid-COPY error handling. A bad row halts entire COPY. Pre-validate rows or use \copy with ON_ERROR_STOP=0.

VACUUM FULL after bulk load. Locks the table. Use plain VACUUM instead.

Wrapping Up

COPY + UNLOGGED staging + index management = 100× faster bulk loads. Wednesday: backfilling historical data.