Backfilling Historical Data Safely | Hi, I'm Muhammad Amal

Etl article cover illustration on a gradient background

October 26, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Backfill historical data in chunks (by date or ID range), rate-limited so source DB doesn’t suffer. Run in parallel where idempotent. Merge into the same destination table as live ingest if upserts are idempotent; otherwise to a parallel _backfill table then swap.

After bulk loading with COPY , the related challenge. New pipeline goes live; the warehouse is empty. Need to backfill 2 years of history. How without melting the source DB or breaking live ingest.

The shape

Three phases:

Decide what to backfill: range and granularity
Backfill in chunks: parallel where possible, rate-limited
Reconcile with live data: same table or swap

Each has gotchas.

Choose chunk granularity

For a table with created_at column:

# Daily chunks for 2 years = 730 chunks
chunks = [
    (date(2020, 10, 26) + timedelta(days=i),
     date(2020, 10, 26) + timedelta(days=i+1))
    for i in range(0, 730)
]

Daily is usually a good size. Each chunk:

~50K-500K rows for typical tables (manageable memory)
Independent (idempotent — re-run is safe)
Finite (won’t hang on one chunk forever)

For very high-volume tables, hourly chunks. For low-volume, weekly is fine.

The backfill loop

def backfill():
    for start_day, end_day in chunks:
        if already_backfilled(start_day):
            continue
        try:
            backfill_chunk(start_day, end_day)
            mark_backfilled(start_day)
        except Exception as e:
            log.error(f"chunk {start_day} failed: {e}")
            # Don't mark; will retry on next run

def backfill_chunk(start: date, end: date):
    rows = source.query("""
        SELECT * FROM orders
        WHERE created_at >= %(start)s AND created_at < %(end)s
    """, {"start": start, "end": end})
    destination.copy_into(rows)

Progress table:

CREATE TABLE backfill_progress (
  table_name text,
  chunk_id text PRIMARY KEY,
  completed_at timestamptz,
  rows_loaded bigint
);

Resumable. Crash mid-backfill = next run picks up.

Rate limiting

The source DB is serving production. Backfill queries hammering it = production outage.

import time

QUERY_INTERVAL = 5  # seconds between backfill queries

last_query = 0
def backfill_chunk_rate_limited(start, end):
    global last_query
    elapsed = time.time() - last_query
    if elapsed < QUERY_INTERVAL:
        time.sleep(QUERY_INTERVAL - elapsed)

    rows = source.query(...)
    destination.copy_into(rows)
    last_query = time.time()

Or use the source’s pg_stat_activity to back off when concurrent connections climb:

def safe_to_query():
    count = source.execute("""
        SELECT count(*) FROM pg_stat_activity
        WHERE state = 'active' AND datname = 'app'
    """).fetchone()[0]
    return count < 20    # threshold tuned to load

Adaptive: only query when source has spare capacity.

Parallel backfill

For 2 years of daily chunks (730 chunks), serial backfill at 5s per chunk = 1 hour. Parallel where idempotent:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(backfill_chunk, s, e) for s, e in chunks]
    for f in futures:
        try:
            f.result()
        except Exception as e:
            log.error(f"chunk failed: {e}")

4 workers × 5s = ~15 minutes total. Watch source load; tune workers.

For source DB protection: 2-4 workers. For destination throughput: depends on the destination.

Live ingest during backfill

Two patterns:

Pattern A — same table, idempotent upsert.

Live ingest writes to orders. Backfill also writes to orders. Both use upsert (ON CONFLICT DO UPDATE). Order doesn’t matter:

INSERT INTO orders ...
ON CONFLICT (id) DO UPDATE SET ...
WHERE EXCLUDED.updated_at >= orders.updated_at;

The WHERE clause ensures newer data wins. Backfill won’t overwrite a newer live update.

Pattern B — parallel table, swap at end.

Backfill into orders_backfill (separate table). Live ingest into orders. After backfill completes:

BEGIN;
INSERT INTO orders SELECT * FROM orders_backfill
ON CONFLICT (id) DO NOTHING;   -- live data wins
DROP TABLE orders_backfill;
COMMIT;

Simpler reasoning; brief big transaction.

For most teams: Pattern A. Less complexity once you commit to idempotent destination.

Watermark coordination

If live ingest uses a watermark:

Backfill operates BELOW the current live watermark — different ranges, no overlap
Don’t update live watermark from backfill — it’s a separate concern

Conceptually: backfill = “fill in what’s missing.” Live ingest = “stream what’s new.” They don’t conflict if both write idempotently to the same destination.

Schema considerations

Source schema may have changed during the 2 years of history:

Columns added/removed
Types changed
New table dependencies

Two strategies:

Use current schema for backfill. Map old data to current columns; missing → NULL. Loses some data but simpler.

Honor historical schema. Query different columns per date range. Painful but accurate.

Most teams: current schema, NULL the missing values. Document the limitation.

Validation after backfill

Spot-check counts and aggregates:

-- Source
SELECT date_trunc('day', created_at) AS day, count(*)
FROM orders
WHERE created_at < '2020-10-26'
GROUP BY 1;

-- Destination
SELECT day, count(*)
FROM analytics.orders
WHERE day < '2020-10-26'
GROUP BY 1;

Compare. If off by < 0.1%, probably timezone or boundary issue. If off by > 1%, real bug.

Common Pitfalls

Single backfill query for the whole range. Source DB chokes. Always chunk.

Forgetting rate limiting. Backfill takes down production source.

Parallel without idempotency. Race conditions; duplicates.

No progress tracking. Crash = restart from scratch. Save progress per chunk.

Source schema assumption. 2 years ago the table didn’t have column X. Backfill SELECT crashes.

Backfill schedule conflicts with maintenance. Source vacuum during backfill = slow. Schedule.

Watermark accidentally advanced by backfill. Live ingest skips a day. Keep watermark advance to live ingest only.

Wrapping Up

Chunked + rate-limited + parallel + idempotent destination = safe backfill. Friday: monitoring ETL pipelines .

The shape

Choose chunk granularity

The backfill loop

Rate limiting

Parallel backfill

Live ingest during backfill

Watermark coordination

Schema considerations

Validation after backfill

Common Pitfalls

Wrapping Up

Related posts

October Retro, ETL Pragmatism

Monitoring ETL Pipelines, Lag, Errors, Throughput

Kafka as a Sync Backbone

Schema Drift, Handling Source Changes Without Breaking Pipelines

Idempotent Pipelines and Watermarks

Building an ETL Pipeline in Go

Building an ETL Pipeline in Python

CDC vs Polling, Choosing Your Sync Strategy

Let’s Start a Project