CDC vs Polling: Choosing Your Sync Strategy
TL;DR — Polling: simple, eventually consistent, query-driven. CDC: complex, near-real-time, requires source DB support. Polling for most teams. CDC when sub-minute freshness or accurate delete-tracking matters. Hybrid: polling for some tables, CDC for others.
After the lightweight ETL intro, the strategic question: how do you actually know what changed?
Polling — query the source
The simpler approach:
SELECT * FROM orders WHERE updated_at > :last_watermark ORDER BY updated_at;
Run it on a schedule. It relies on an updated_at column that every write bumps and that never moves backwards. The pros and cons are fairly obvious:
Pros:
- No source DB changes
- Works on any DB that lets you query
- Easy to reason about
- Backfilling is trivial (just change the watermark)
- Latency is predictable: at most one polling interval
Cons:
- Doesn’t catch deletes (you can’t query for “what no longer exists”)
- Misses fast-modified rows if updated_at lags behind real changes
- High poll frequency = high source DB load
- 1-minute freshness is the practical floor
For most “I need data in BigQuery within an hour” use cases, polling is sufficient.
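To make the watermark loop concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for the source DB. The function name poll_changes, the in-memory orders table, and the ISO-8601 text timestamps are assumptions for illustration, not part of any real pipeline. It also shows the delete blind spot: the deleted row simply never appears in a later poll.

```python
import sqlite3

def poll_changes(conn, table, watermark_column, last_watermark):
    """Return (rows changed since last_watermark, new watermark)."""
    # Table and column names are interpolated, so they must come from
    # trusted config, never from user input.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > ? ORDER BY {watermark_column}"
    )
    cur = conn.execute(query, (last_watermark,))
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    # Rows are ordered by the watermark column, so the last one is the newest.
    new_watermark = rows[-1][watermark_column] if rows else last_watermark
    return rows, new_watermark

# Hypothetical source table with ISO-8601 timestamps (they sort as text).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "new", "2024-01-01T10:00:00"), (2, "new", "2024-01-01T10:05:00")],
)

# First poll: empty watermark, so everything is "new".
first_batch, wm = poll_changes(conn, "orders", "updated_at", "")

# One row is updated, one is hard-deleted.
conn.execute(
    "UPDATE orders SET status = 'shipped', updated_at = '2024-01-01T10:10:00' "
    "WHERE id = 1"
)
conn.execute("DELETE FROM orders WHERE id = 2")

# Second poll sees the update but has no way to see the delete.
second_batch, wm2 = poll_changes(conn, "orders", "updated_at", wm)
```

One caveat with the strict `>` comparison: a row committed later with the same timestamp as the stored watermark would be skipped. Some teams re-read a small overlap window and deduplicate downstream to cover that.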
CDC — read the database log
Change Data Capture reads the database’s write-ahead log. Every change (insert, update, delete) is in the log; CDC tooling streams them.
Pros:
- Near-real-time (sub-second)
- Captures deletes
- Lower source DB load than aggressive polling
- Exact “what changed” not “what’s newer than X”
Cons:
- Requires the source DB to expose its change log (Postgres logical replication, MySQL binlog, SQL Server CDC, etc.)
- More moving parts (replication slot, parsing, downstream queue)
- Setup is non-trivial
- Backfilling is harder (initial snapshot then stream)
The classic CDC stack: source DB → Debezium → Kafka → consumer → destination.
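What the consumer at the end of that chain does can be sketched in a few lines. The event shape below is loosely modeled on the op/before/after envelope that Debezium emits (c = create, u = update, d = delete, r = snapshot read), heavily simplified. The apply_change function and the dict used as a "destination" are assumptions for illustration only.

```python
def apply_change(destination, event, key="id"):
    """Apply one simplified change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Inserts, snapshot reads, and updates all become an upsert.
        row = event["after"]
        destination[row[key]] = row
    elif op == "d":
        # Deletes carry the old row in "before"; remove it downstream.
        destination.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")

# Hypothetical stream of events for an orders table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "shipped"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Unlike the polling query, the delete arrives as an explicit event, so the destination can remove the row instead of keeping a stale copy.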
When polling wins
- Source DB doesn’t support CDC (or you can’t change it)
- Sync intervals of 5+ minutes acceptable
- Small data volumes
- Deletes don’t matter (or are soft-deletes via flag)
- Team is small; ops capacity limited
That covers roughly 90% of “I need a daily extract to my warehouse” cases. Use polling.
When CDC wins
- Sub-minute freshness required
- Hard delete tracking (e.g., GDPR compliance: when a row is deleted, the deletion must propagate)
- High write volume (polling would hammer the DB)
- Building event-driven architecture (every change is an event)
The factory observability project we’ve been running uses polling for the slow stuff (configuration changes) and a CDC-like approach (logical replication to TimescaleDB) for sensor data.
A hybrid pattern that works
Most realistic shops end up here:
- Source-of-truth tables (orders, customers, products): polled hourly to data warehouse
- High-frequency event data (sensor readings, user actions): CDC or Kafka producer
- Sensitive deletion-required data (PII, GDPR): CDC for accurate delete propagation
- Aggregate / derived data: re-built periodically from raw
Mixed strategy. Each table gets the sync method that matches its needs.
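One way to keep a mixed setup legible is to write the per-table choice down as data. This is a sketch, not a prescribed layout: the Strategy enum, the TableSync record, and the table names other than orders are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Strategy(Enum):
    POLL = "poll"
    CDC = "cdc"
    REBUILD = "rebuild"

@dataclass(frozen=True)
class TableSync:
    name: str
    strategy: Strategy
    reason: str  # why this table gets this strategy

# Hypothetical registry: one entry per table, with the reasoning recorded.
REGISTRY = [
    TableSync("orders", Strategy.POLL, "source of truth; hourly is fresh enough"),
    TableSync("sensor_readings", Strategy.CDC, "high-frequency event data"),
    TableSync("user_profiles", Strategy.CDC, "PII; hard deletes must propagate"),
    TableSync("daily_revenue", Strategy.REBUILD, "derived; rebuilt from raw"),
]

def tables_by_strategy(registry):
    """Group table names by sync strategy, preserving registry order."""
    grouped = {strategy: [] for strategy in Strategy}
    for table in registry:
        grouped[table.strategy].append(table.name)
    return grouped

grouped = tables_by_strategy(REGISTRY)
```

The reason field is the part that pays off later: it records why a table was put on CDC or polling, not just that it was.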
What “polling at scale” looks like
For a Postgres source with ~50 tables and 100 GB of data:
TABLES = [
    {"name": "orders", "watermark_column": "updated_at", "interval": 60},
    {"name": "customers", "watermark_column": "updated_at", "interval": 300},
    {"name": "products", "watermark_column": "updated_at", "interval": 600},
    # ...
]

def sync_all():
    for cfg in TABLES:
        try:
            sync_table(cfg)
        except Exception as e:
            log_error(cfg, e)

# Cron entry: every minute
Each table polled at its own cadence. A metadata table tracks watermarks per table. Errors logged.
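The metadata table and the per-table cadence check might look like the sketch below. It assumes interval is in seconds and uses sqlite3 as a stand-in for wherever the sync state lives. The sync_state table and the helpers init_metadata, is_due, and record_run are hypothetical names, not part of the script above.

```python
import sqlite3

def init_metadata(meta):
    """Create the per-table sync state: last watermark and last run time."""
    meta.execute(
        "CREATE TABLE IF NOT EXISTS sync_state ("
        " table_name TEXT PRIMARY KEY,"
        " watermark TEXT NOT NULL,"
        " last_run REAL NOT NULL)"
    )

def is_due(meta, cfg, now):
    """True if the table has never run or its interval has elapsed."""
    row = meta.execute(
        "SELECT last_run FROM sync_state WHERE table_name = ?", (cfg["name"],)
    ).fetchone()
    return row is None or now - row[0] >= cfg["interval"]

def record_run(meta, cfg, watermark, now):
    """Persist the new watermark and run time after a successful sync."""
    meta.execute(
        "INSERT OR REPLACE INTO sync_state VALUES (?, ?, ?)",
        (cfg["name"], watermark, now),
    )
    meta.commit()

# Demo with explicit timestamps; a real job would pass time.time().
meta = sqlite3.connect(":memory:")
init_metadata(meta)
orders = {"name": "orders", "watermark_column": "updated_at", "interval": 60}
customers = {"name": "customers", "watermark_column": "updated_at", "interval": 300}

record_run(meta, orders, "2024-01-01T10:00:00", now=1000.0)
record_run(meta, customers, "2024-01-01T10:00:00", now=1000.0)

# Sixty seconds later only the 60-second table is due again.
due_at_1060 = [c["name"] for c in (orders, customers) if is_due(meta, c, now=1060.0)]
```

With a once-a-minute cron, an interval of exactly 60 can be skipped when the job fires a fraction of a second early, so a small tolerance on the comparison may be worth adding.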
For 50 tables × hourly polling on a 100 GB DB: 50 SELECTs at the top of each hour, each pulling tens of MB. DB barely notices.
What “CDC at scale” looks like
For the same Postgres source:
- Enable logical replication
- Run Debezium connector pointing at it
- Stream to Kafka topics (one per table)
- Consumers read Kafka, write to destination
Ops investment:
- Debezium: a JVM service to operate
- Kafka cluster (or use managed)
- Per-table consumer code
For 50 tables, CDC has more parts than the 50-line polling script. Real ops cost.
Common Pitfalls
No watermark column. Tables without an updated_at (or equivalent) can’t be polled incrementally. Either add the column or full-extract each time (small tables only).
updated_at not actually updated. Some application code forgets to bump it on certain change paths. Polling misses these changes. Audit.
Polling too frequently. Hitting the source every second hammers it for no benefit. Match poll interval to actual freshness need.
CDC without backfill plan. CDC starts from “now.” Historical data needs a separate initial snapshot.
Mixing strategies without metadata. Some tables CDC, some polled — and nobody knows which is which six months later. Document.
Forgetting delete tracking. Polling doesn’t capture deletes. If you care, add a soft-delete column (for example a deleted_at timestamp) and include it in the watermark check, or switch that table to CDC.
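The soft-delete workaround from the last pitfall can be sketched like this. It assumes the application sets a deleted_at timestamp instead of issuing DELETE. The customers schema, the sync_customers function, and the dict destination are hypothetical, and sqlite3 again stands in for the source.

```python
import sqlite3

def sync_customers(conn, destination, last_watermark):
    """Poll for updated or soft-deleted rows and return the new watermark."""
    cur = conn.execute(
        "SELECT id, email, updated_at, deleted_at FROM customers "
        "WHERE updated_at > ? OR deleted_at > ?",
        (last_watermark, last_watermark),
    )
    new_watermark = last_watermark
    for id_, email, updated_at, deleted_at in cur.fetchall():
        if deleted_at is not None:
            # Soft-deleted at the source: remove it downstream.
            destination.pop(id_, None)
            changed_at = max(updated_at, deleted_at)
        else:
            destination[id_] = {"id": id_, "email": email}
            changed_at = updated_at
        new_watermark = max(new_watermark, changed_at)
    return new_watermark

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "2024-01-01T09:00:00", None),
        (2, "b@example.com", "2024-01-01T09:01:00", None),
    ],
)

dest = {}
wm = sync_customers(conn, dest, "")

# The application marks the row as deleted instead of removing it.
conn.execute("UPDATE customers SET deleted_at = '2024-01-02T00:00:00' WHERE id = 2")
wm2 = sync_customers(conn, dest, wm)
```

This only works if every delete path in the application goes through the soft-delete column. A stray hard DELETE is still invisible to the poller.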
Wrapping Up
Polling for the boring cases; CDC for real-time and delete tracking. Most projects: hybrid. Friday: Postgres logical replication for CDC, implementing the CDC source.