background-shape
Backfilling Historical Data Safely
October 26, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Backfill historical data in chunks (by date or ID range), rate-limited so source DB doesn’t suffer. Run in parallel where idempotent. Merge into the same destination table as live ingest if upserts are idempotent; otherwise to a parallel _backfill table then swap.

After bulk loading with COPY, the related challenge. New pipeline goes live; the warehouse is empty. Need to backfill 2 years of history. How without melting the source DB or breaking live ingest.

The shape

Three phases:

  1. Decide what to backfill: range and granularity
  2. Backfill in chunks: parallel where possible, rate-limited
  3. Reconcile with live data: same table or swap

Each has gotchas.

Choose chunk granularity

For a table with created_at column:

# Daily chunks for 2 years = 730 chunks
chunks = [
    (date(2020, 10, 26) + timedelta(days=i),
     date(2020, 10, 26) + timedelta(days=i+1))
    for i in range(0, 730)
]

Daily is usually a good size. Each chunk:

  • ~50K-500K rows for typical tables (manageable memory)
  • Independent (idempotent — re-run is safe)
  • Finite (won’t hang on one chunk forever)

For very high-volume tables, hourly chunks. For low-volume, weekly is fine.

The backfill loop

def backfill():
    for start_day, end_day in chunks:
        if already_backfilled(start_day):
            continue
        try:
            backfill_chunk(start_day, end_day)
            mark_backfilled(start_day)
        except Exception as e:
            log.error(f"chunk {start_day} failed: {e}")
            # Don't mark; will retry on next run

def backfill_chunk(start: date, end: date):
    rows = source.query("""
        SELECT * FROM orders
        WHERE created_at >= %(start)s AND created_at < %(end)s
    """, {"start": start, "end": end})
    destination.copy_into(rows)

Progress table:

CREATE TABLE backfill_progress (
  table_name text,
  chunk_id text PRIMARY KEY,
  completed_at timestamptz,
  rows_loaded bigint
);

Resumable. Crash mid-backfill = next run picks up.

Rate limiting

The source DB is serving production. Backfill queries hammering it = production outage.

import time

QUERY_INTERVAL = 5  # seconds between backfill queries

last_query = 0
def backfill_chunk_rate_limited(start, end):
    global last_query
    elapsed = time.time() - last_query
    if elapsed < QUERY_INTERVAL:
        time.sleep(QUERY_INTERVAL - elapsed)

    rows = source.query(...)
    destination.copy_into(rows)
    last_query = time.time()

Or use the source’s pg_stat_activity to back off when concurrent connections climb:

def safe_to_query():
    count = source.execute("""
        SELECT count(*) FROM pg_stat_activity
        WHERE state = 'active' AND datname = 'app'
    """).fetchone()[0]
    return count < 20    # threshold tuned to load

Adaptive: only query when source has spare capacity.

Parallel backfill

For 2 years of daily chunks (730 chunks), serial backfill at 5s per chunk = 1 hour. Parallel where idempotent:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(backfill_chunk, s, e) for s, e in chunks]
    for f in futures:
        try:
            f.result()
        except Exception as e:
            log.error(f"chunk failed: {e}")

4 workers × 5s = ~15 minutes total. Watch source load; tune workers.

For source DB protection: 2-4 workers. For destination throughput: depends on the destination.

Live ingest during backfill

Two patterns:

Pattern A — same table, idempotent upsert.

Live ingest writes to orders. Backfill also writes to orders. Both use upsert (ON CONFLICT DO UPDATE). Order doesn’t matter:

INSERT INTO orders ...
ON CONFLICT (id) DO UPDATE SET ...
WHERE EXCLUDED.updated_at >= orders.updated_at;

The WHERE clause ensures newer data wins. Backfill won’t overwrite a newer live update.

Pattern B — parallel table, swap at end.

Backfill into orders_backfill (separate table). Live ingest into orders. After backfill completes:

BEGIN;
INSERT INTO orders SELECT * FROM orders_backfill
ON CONFLICT (id) DO NOTHING;   -- live data wins
DROP TABLE orders_backfill;
COMMIT;

Simpler reasoning; brief big transaction.

For most teams: Pattern A. Less complexity once you commit to idempotent destination.

Watermark coordination

If live ingest uses a watermark:

  • Backfill operates BELOW the current live watermark — different ranges, no overlap
  • Don’t update live watermark from backfill — it’s a separate concern

Conceptually: backfill = “fill in what’s missing.” Live ingest = “stream what’s new.” They don’t conflict if both write idempotently to the same destination.

Schema considerations

Source schema may have changed during the 2 years of history:

  • Columns added/removed
  • Types changed
  • New table dependencies

Two strategies:

Use current schema for backfill. Map old data to current columns; missing → NULL. Loses some data but simpler.

Honor historical schema. Query different columns per date range. Painful but accurate.

Most teams: current schema, NULL the missing values. Document the limitation.

Validation after backfill

Spot-check counts and aggregates:

-- Source
SELECT date_trunc('day', created_at) AS day, count(*)
FROM orders
WHERE created_at < '2020-10-26'
GROUP BY 1;

-- Destination
SELECT day, count(*)
FROM analytics.orders
WHERE day < '2020-10-26'
GROUP BY 1;

Compare. If off by < 0.1%, probably timezone or boundary issue. If off by > 1%, real bug.

Common Pitfalls

Single backfill query for the whole range. Source DB chokes. Always chunk.

Forgetting rate limiting. Backfill takes down production source.

Parallel without idempotency. Race conditions; duplicates.

No progress tracking. Crash = restart from scratch. Save progress per chunk.

Source schema assumption. 2 years ago the table didn’t have column X. Backfill SELECT crashes.

Backfill schedule conflicts with maintenance. Source vacuum during backfill = slow. Schedule.

Watermark accidentally advanced by backfill. Live ingest skips a day. Keep watermark advance to live ingest only.

Wrapping Up

Chunked + rate-limited + parallel + idempotent destination = safe backfill. Friday: monitoring ETL pipelines.