Backfilling Historical Data Safely
TL;DR — Backfill historical data in chunks (by date or ID range), rate-limited so source DB doesn’t suffer. Run in parallel where idempotent. Merge into the same destination table as live ingest if upserts are idempotent; otherwise to a parallel
_backfilltable then swap.
After bulk loading with COPY, the related challenge. New pipeline goes live; the warehouse is empty. Need to backfill 2 years of history. How without melting the source DB or breaking live ingest.
The shape
Three phases:
- Decide what to backfill: range and granularity
- Backfill in chunks: parallel where possible, rate-limited
- Reconcile with live data: same table or swap
Each has gotchas.
Choose chunk granularity
For a table with created_at column:
# Daily chunks for 2 years = 730 chunks
chunks = [
(date(2020, 10, 26) + timedelta(days=i),
date(2020, 10, 26) + timedelta(days=i+1))
for i in range(0, 730)
]
Daily is usually a good size. Each chunk:
- ~50K-500K rows for typical tables (manageable memory)
- Independent (idempotent — re-run is safe)
- Finite (won’t hang on one chunk forever)
For very high-volume tables, hourly chunks. For low-volume, weekly is fine.
The backfill loop
def backfill():
for start_day, end_day in chunks:
if already_backfilled(start_day):
continue
try:
backfill_chunk(start_day, end_day)
mark_backfilled(start_day)
except Exception as e:
log.error(f"chunk {start_day} failed: {e}")
# Don't mark; will retry on next run
def backfill_chunk(start: date, end: date):
rows = source.query("""
SELECT * FROM orders
WHERE created_at >= %(start)s AND created_at < %(end)s
""", {"start": start, "end": end})
destination.copy_into(rows)
Progress table:
CREATE TABLE backfill_progress (
table_name text,
chunk_id text PRIMARY KEY,
completed_at timestamptz,
rows_loaded bigint
);
Resumable. Crash mid-backfill = next run picks up.
Rate limiting
The source DB is serving production. Backfill queries hammering it = production outage.
import time
QUERY_INTERVAL = 5 # seconds between backfill queries
last_query = 0
def backfill_chunk_rate_limited(start, end):
global last_query
elapsed = time.time() - last_query
if elapsed < QUERY_INTERVAL:
time.sleep(QUERY_INTERVAL - elapsed)
rows = source.query(...)
destination.copy_into(rows)
last_query = time.time()
Or use the source’s pg_stat_activity to back off when concurrent connections climb:
def safe_to_query():
count = source.execute("""
SELECT count(*) FROM pg_stat_activity
WHERE state = 'active' AND datname = 'app'
""").fetchone()[0]
return count < 20 # threshold tuned to load
Adaptive: only query when source has spare capacity.
Parallel backfill
For 2 years of daily chunks (730 chunks), serial backfill at 5s per chunk = 1 hour. Parallel where idempotent:
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=4) as ex:
futures = [ex.submit(backfill_chunk, s, e) for s, e in chunks]
for f in futures:
try:
f.result()
except Exception as e:
log.error(f"chunk failed: {e}")
4 workers × 5s = ~15 minutes total. Watch source load; tune workers.
For source DB protection: 2-4 workers. For destination throughput: depends on the destination.
Live ingest during backfill
Two patterns:
Pattern A — same table, idempotent upsert.
Live ingest writes to orders. Backfill also writes to orders. Both use upsert (ON CONFLICT DO UPDATE). Order doesn’t matter:
INSERT INTO orders ...
ON CONFLICT (id) DO UPDATE SET ...
WHERE EXCLUDED.updated_at >= orders.updated_at;
The WHERE clause ensures newer data wins. Backfill won’t overwrite a newer live update.
Pattern B — parallel table, swap at end.
Backfill into orders_backfill (separate table). Live ingest into orders. After backfill completes:
BEGIN;
INSERT INTO orders SELECT * FROM orders_backfill
ON CONFLICT (id) DO NOTHING; -- live data wins
DROP TABLE orders_backfill;
COMMIT;
Simpler reasoning; brief big transaction.
For most teams: Pattern A. Less complexity once you commit to idempotent destination.
Watermark coordination
If live ingest uses a watermark:
- Backfill operates BELOW the current live watermark — different ranges, no overlap
- Don’t update live watermark from backfill — it’s a separate concern
Conceptually: backfill = “fill in what’s missing.” Live ingest = “stream what’s new.” They don’t conflict if both write idempotently to the same destination.
Schema considerations
Source schema may have changed during the 2 years of history:
- Columns added/removed
- Types changed
- New table dependencies
Two strategies:
Use current schema for backfill. Map old data to current columns; missing → NULL. Loses some data but simpler.
Honor historical schema. Query different columns per date range. Painful but accurate.
Most teams: current schema, NULL the missing values. Document the limitation.
Validation after backfill
Spot-check counts and aggregates:
-- Source
SELECT date_trunc('day', created_at) AS day, count(*)
FROM orders
WHERE created_at < '2020-10-26'
GROUP BY 1;
-- Destination
SELECT day, count(*)
FROM analytics.orders
WHERE day < '2020-10-26'
GROUP BY 1;
Compare. If off by < 0.1%, probably timezone or boundary issue. If off by > 1%, real bug.
Common Pitfalls
Single backfill query for the whole range. Source DB chokes. Always chunk.
Forgetting rate limiting. Backfill takes down production source.
Parallel without idempotency. Race conditions; duplicates.
No progress tracking. Crash = restart from scratch. Save progress per chunk.
Source schema assumption. 2 years ago the table didn’t have column X. Backfill SELECT crashes.
Backfill schedule conflicts with maintenance. Source vacuum during backfill = slow. Schedule.
Watermark accidentally advanced by backfill. Live ingest skips a day. Keep watermark advance to live ingest only.
Wrapping Up
Chunked + rate-limited + parallel + idempotent destination = safe backfill. Friday: monitoring ETL pipelines.