CDC vs Polling, Choosing Your Sync Strategy


October 5, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Polling: simple, eventually consistent, query-driven. CDC: complex, near-real-time, requires source DB support. Polling for most teams. CDC when sub-minute freshness or accurate delete-tracking matters. Hybrid: polling for some tables, CDC for others.

After the lightweight ETL intro, the strategic question: how do you actually know what changed?

Polling — query the source

The simpler approach:

SELECT * FROM orders WHERE updated_at > :last_watermark ORDER BY updated_at;

Run it periodically. It relies on an updated_at column that is set on every write and never moves backwards. The pros and cons follow directly:

Pros:

No source DB changes
Works on any DB that lets you query
Easy to reason about
Backfilling is trivial (just change the watermark)
Latency is predictable: roughly one polling interval at worst

Cons:

Doesn’t catch deletes (you can’t query for “what no longer exists”)
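Misses changes if updated_at lags behind the actual commit (e.g., a long transaction commits a row stamped earlier than the current watermark); rows modified several times between polls show only their latest state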
High poll frequency = high source DB load
1-minute freshness is the practical floor

For most “I need data in BigQuery within an hour” use cases, polling is sufficient.
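A minimal sketch of one polling pass, assuming an in-memory SQLite database as a stand-in for the source. The function name poll_changes and the sample rows are hypothetical; the query is the same watermark SELECT shown above.

```python
import sqlite3

def poll_changes(conn, table, watermark_column, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark."""
    # Table and column names come from trusted config, never from user input.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > ? ORDER BY {watermark_column}"
    )
    rows = conn.execute(query, (last_watermark,)).fetchall()
    # Advance the watermark only as far as the rows actually seen.
    new_watermark = rows[-1][watermark_column] if rows else last_watermark
    return rows, new_watermark

# Stand-in source database with three sample orders (illustrative data).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows row["column"] access
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "paid", "2022-10-05T10:00:00"),
        (2, "new", "2022-10-05T10:05:00"),
        (3, "shipped", "2022-10-05T10:10:00"),
    ],
)

rows, watermark = poll_changes(conn, "orders", "updated_at", "2022-10-05T10:00:00")
changed_ids = [row["id"] for row in rows]
```

Order 1 sits exactly on the watermark, so the strict > skips it. That is correct only if it was already synced in the previous pass, which is why the watermark should come from the last row written, not from the wall clock.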

CDC — read the database log

Change Data Capture reads the database’s transaction log (the write-ahead log in Postgres, the binlog in MySQL). Every change (insert, update, delete) is in the log; CDC tooling streams them.

Pros:

Near-real-time (typically sub-second to a few seconds)
Captures deletes
Lower source DB load than aggressive polling
Exact “what changed” not “what’s newer than X”

Cons:

Requires source DB to expose its change log (Postgres logical replication, MySQL binlog, SQL Server CDC, etc.)
More moving parts (replication slot, parsing, downstream queue)
Setup is non-trivial
Backfilling is harder (initial snapshot then stream)

The classic CDC stack: source DB → Debezium → Kafka → consumer → destination.
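What the consumer at the end of that stack does can be sketched roughly as follows. The events are modelled on Debezium's change envelope (op, before, after) but heavily simplified: real events carry more fields, and the "id" primary key and the dict used as a destination are assumptions for illustration.

```python
def apply_change(dest, event):
    """Apply one change event to dest, a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):       # create, snapshot read, update
        row = event["after"]
        dest[row["id"]] = row       # upsert, so replaying an event is harmless
    elif op == "d":                 # delete: only the old row image is present
        dest.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")

# Illustrative event stream: two inserts, one update, one delete.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

destination = {}
for event in events:
    apply_change(destination, event)
```

The delete of row 2 arrives as an explicit event, which is exactly what a watermark query cannot give you.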

When polling wins

Source DB doesn’t support CDC (or you can’t change it)
Sync intervals of 5+ minutes acceptable
Small data volumes
Deletes don’t matter (or are soft-deletes via flag)
Team is small; ops capacity limited

90% of “I need a daily extract to my warehouse” cases. Polling.

When CDC wins

Sub-minute freshness required
Hard delete tracking (e.g., GDPR compliance: when a row is deleted, the deletion must propagate downstream)
High write volume (polling would hammer the DB)
Building event-driven architecture (every change is an event)

The factory observability project we’ve been running uses polling for the slow stuff (configuration changes) and a CDC-like approach (logical replication to TimescaleDB) for sensor data.

A hybrid pattern that works

Most realistic shops end up here:

Source-of-truth tables (orders, customers, products): polled hourly to data warehouse
High-frequency event data (sensor readings, user actions): CDC or Kafka producer
Sensitive deletion-required data (PII, GDPR): CDC for accurate delete propagation
Aggregate / derived data: re-built periodically from raw

Mixed strategy. Each table gets the sync method that matches its needs.
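One way to keep a mixed setup legible is a single registry that records the strategy per table, along with the reason. A sketch only: the TableSync class, the table names sensor_readings, user_pii, and daily_revenue, and the intervals are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableSync:
    table: str
    strategy: str          # "poll", "cdc", or "rebuild"
    interval_s: int = 0    # only meaningful for "poll" and "rebuild"
    reason: str = ""       # why this table uses this strategy

SYNC_PLAN = [
    TableSync("orders", "poll", 3600, "source of truth, hourly is enough"),
    TableSync("customers", "poll", 3600, "source of truth, hourly is enough"),
    TableSync("sensor_readings", "cdc", reason="high-frequency event data"),
    TableSync("user_pii", "cdc", reason="hard deletes must propagate"),
    TableSync("daily_revenue", "rebuild", 86400, "derived from raw tables"),
]

def tables_for(strategy):
    """List the tables assigned to one sync strategy."""
    return [t.table for t in SYNC_PLAN if t.strategy == strategy]
```

The reason field is the part that pays off six months later, when someone asks why one table runs through CDC and its neighbour does not.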

What “polling at scale” looks like

For Postgres source, ~50 tables, 100 GB:

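import time

# interval is in seconds; keep each one a multiple of 60 to match the cron cadence
TABLES = [
    {"name": "orders",      "watermark_column": "updated_at", "interval": 60},
    {"name": "customers",   "watermark_column": "updated_at", "interval": 300},
    {"name": "products",    "watermark_column": "updated_at", "interval": 600},
    # ...
]

def sync_all():
    # sync_table and log_error are defined elsewhere in the pipeline
    minute = int(time.time() // 60)
    for cfg in TABLES:
        # Run a table only on minutes that line up with its interval
        every = max(1, cfg["interval"] // 60)
        if minute % every != 0:
            continue
        try:
            sync_table(cfg)
        except Exception as e:
            # One failing table must not block the others
            log_error(cfg, e)

# Cron entry: every minute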

Each table polled at its own cadence. A metadata table tracks watermarks per table. Errors logged.
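The watermark metadata table can be very small. A sketch using SQLite as a stand-in; the table name sync_watermarks, its columns, and the two helper functions are assumptions, not part of the pipeline above.

```python
import sqlite3

EPOCH = "1970-01-01T00:00:00"  # default watermark: sync everything

def get_watermark(meta, table, default=EPOCH):
    """Read the stored watermark for a table, or the default if none exists."""
    row = meta.execute(
        "SELECT watermark FROM sync_watermarks WHERE table_name = ?", (table,)
    ).fetchone()
    return row[0] if row else default

def set_watermark(meta, table, watermark):
    """Store the watermark, keeping exactly one row per table."""
    meta.execute(
        "INSERT OR REPLACE INTO sync_watermarks (table_name, watermark) "
        "VALUES (?, ?)",
        (table, watermark),
    )
    meta.commit()

meta = sqlite3.connect(":memory:")
meta.execute(
    "CREATE TABLE sync_watermarks ("
    "table_name TEXT PRIMARY KEY, watermark TEXT NOT NULL)"
)
```

Write the destination rows first and the watermark second. If the job dies in between, the next run re-reads a few rows instead of skipping them, which is safe as long as the destination write is an upsert.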

For 50 tables × hourly polling on a 100 GB DB: 50 SELECTs at the top of each hour, each pulling tens of MB. DB barely notices.

What “CDC at scale” looks like

For the same Postgres source:

Enable logical replication
Run Debezium connector pointing at it
Stream to Kafka topics (one per table)
Consumers read Kafka, write to destination

Ops investment:

Debezium: a JVM service to operate
Kafka cluster (or use managed)
Per-table consumer code

For 50 tables, CDC has more parts than the 50-line polling script. Real ops cost.
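To give a feel for the per-table consumer code, here is a rough routing sketch. It assumes topics follow the prefix.schema.table naming that Debezium uses; the prefix "factory", the handler functions, and the dict-based destination are hypothetical, and the Kafka client itself is left out.

```python
def table_from_topic(topic):
    """Extract 'schema.table' from a '<prefix>.<schema>.<table>' topic name."""
    _prefix, schema, table = topic.rsplit(".", 2)
    return f"{schema}.{table}"

def handle_orders(event, dest):
    dest.setdefault("orders", []).append(event)

def handle_customers(event, dest):
    dest.setdefault("customers", []).append(event)

# One handler per table; with 50 tables this map has 50 entries.
HANDLERS = {
    "public.orders": handle_orders,
    "public.customers": handle_customers,
}

def dispatch(topic, event, dest):
    """Route an event to its table handler; return False if none is registered."""
    handler = HANDLERS.get(table_from_topic(topic))
    if handler is None:
        return False  # unknown table: skip (or alert) rather than crash the consumer
    handler(event, dest)
    return True

dest = {}
routed = dispatch("factory.public.orders", {"op": "c"}, dest)
skipped = dispatch("factory.public.audit_log", {"op": "c"}, dest)
```

Even in this toy form, every new source table means a new topic, a new handler, and a new thing to monitor.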

Common Pitfalls

No watermark column. Tables without an updated_at (or equivalent) can’t be polled incrementally. Either add the column or full-extract each time (small tables only).

updated_at not actually updated. Some application code forgets to bump it on certain change paths. Polling misses these changes. Audit.

Polling too frequently. Hitting the source every second hammers it for no benefit. Match poll interval to actual freshness need.

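CDC without backfill plan. A change stream starts from the point where it was created. Historical data needs an initial snapshot; Debezium can take one for you, but budget for how long it runs and the load it puts on the source.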

Mixing strategies without metadata. Some tables CDC, some polled — and nobody knows which is which six months later. Document.

Forgetting delete tracking. Polling doesn’t capture deletes. If you care, switch to soft deletes (e.g., a deleted_at column) and make sure the delete also bumps the watermark column, or switch that table to CDC.
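A sketch of the soft-delete route, again with SQLite standing in for the source. The deleted_at column, the helper names, and the sample customers are assumptions for illustration.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.row_factory = sqlite3.Row
src.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT, deleted_at TEXT)"
)
src.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, NULL)",
    [(1, "Ana", "2022-10-05T09:00:00"), (2, "Budi", "2022-10-05T09:00:00")],
)

def soft_delete(conn, customer_id, now):
    # Mark the row deleted AND bump the watermark column in one statement,
    # otherwise the next poll never sees the delete.
    conn.execute(
        "UPDATE customers SET deleted_at = ?, updated_at = ? WHERE id = ?",
        (now, now, customer_id),
    )

def sync_customers(conn, dest, last_watermark):
    """Poll changed rows; remove soft-deleted ones from dest, upsert the rest."""
    rows = conn.execute(
        "SELECT * FROM customers WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    for row in rows:
        if row["deleted_at"] is not None:
            dest.pop(row["id"], None)
        else:
            dest[row["id"]] = row["name"]
    return rows[-1]["updated_at"] if rows else last_watermark

dest = {}
wm = sync_customers(src, dest, "1970-01-01T00:00:00")
after_first_sync = dict(dest)
soft_delete(src, 2, "2022-10-05T09:30:00")
wm = sync_customers(src, dest, wm)
```

This only works if every delete path in the application goes through the soft delete. A stray hard DELETE is still invisible to the poller.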

Wrapping Up

Polling for boring; CDC for real-time + delete tracking. Most projects: hybrid. Friday: Postgres logical replication for CDC — implementing the CDC source.

