Why Lightweight ETL Beats Big Data Tools for Most Teams
TL;DR — Most “ETL” work is < 100 GB/day, fits in a single VM, and doesn’t need a distributed framework. A 200-line Python script with SQLAlchemy + a cron job moves data better than Airflow + Spark + a team to operate them. Reserve big-data tools for big data.
After September’s observability theme, October pivots to data movement. Most teams I see have outsized data tooling for undersized problems. This month is what to use instead.
The “big data” trap
Companies with 50 GB of data and 5 engineers buy Airflow, Spark, dbt, Snowflake. Why?
- The tooling is loud
- “Data engineering” implies these tools
- Vendor marketing
- Resume-driven development
Each tool is excellent at its job. The job isn’t your job, probably.
A 200-line Python script reading from Postgres, transforming, writing to BigQuery, run on a 30-minute cron, handles the same workload at 1/100th the operational cost. Less to learn. Less to break.
When you genuinely need the big tools
Real signals:
- Multi-TB datasets processed daily — Spark / Trino / DuckDB-with-effort
- Hundreds of DAGs with complex dependencies — Airflow / Dagster
- Real-time streaming at high volume — Kafka / Flink
- Data team of 5+ maintaining shared transformations — dbt
- Compliance requirements for lineage and audit — formal tools
For most teams: none of these apply.
What lightweight ETL looks like
The shape:
- Source: Postgres, MySQL, BigQuery, S3, an API
- Transform: pandas / Polars / plain SQL / Go structs
- Destination: another database, data warehouse, S3
- Orchestration: cron + a small Python/Go script + a metadata table
Operational footprint:
- One VM or container
- One database (often Postgres)
- A
cronentry or scheduled Lambda
That’s it. No Kubernetes cluster. No web UI. No DAG. Just a script that runs on a schedule.
A working pattern
For “sync orders from Postgres to BigQuery hourly”:
import sqlalchemy
from google.cloud import bigquery
pg = sqlalchemy.create_engine(os.environ['DATABASE_URL'])
bq = bigquery.Client()
def sync():
# Get watermark
with bq.query("SELECT MAX(updated_at) as wm FROM analytics.orders").result() as row:
watermark = next(row).wm or datetime(2020, 1, 1)
# Fetch incremental rows
with pg.connect() as conn:
rows = conn.execute(sqlalchemy.text("""
SELECT * FROM orders WHERE updated_at > :wm ORDER BY updated_at
"""), {"wm": watermark}).mappings().all()
if not rows:
return
# Load to BigQuery
table = bq.dataset("analytics").table("orders")
errors = bq.insert_rows_json(table, [dict(r) for r in rows])
if errors:
raise RuntimeError(f"BQ errors: {errors}")
print(f"Synced {len(rows)} rows, watermark: {rows[-1]['updated_at']}")
if __name__ == "__main__":
sync()
Cron entry:
*/30 * * * * /usr/bin/python /opt/sync/orders.py
That’s a complete incremental ETL. 30 lines. Reliable. Easy to debug.
What this month covers
12 more posts:
- Oct 5: CDC vs polling
- Oct 7: Postgres logical replication for CDC
- Oct 10: Python ETL pipeline patterns
- Oct 12: Go ETL pipeline patterns
- Oct 14: Idempotent pipelines + watermarks
- Oct 17: Schema drift handling
- Oct 19: Kafka as sync backbone
- Oct 21: Debezium for Postgres → Kafka
- Oct 24: Postgres COPY for bulk loading
- Oct 26: Backfilling historical data safely
- Oct 28: Monitoring ETL pipelines
- Oct 31: Month retro
What I’m NOT going to cover
- Spark, Hadoop, Databricks — out of scope
- dbt in depth — useful when you have it; covered briefly
- Airflow / Dagster / Prefect — only as comparison to “just use cron”
- Real-time streaming at scale — separate problem space
When to graduate
You’ll know it’s time to leave lightweight ETL when:
- One script grows past ~500 lines
- You have > 10 scripts and forget what they do
- Dependencies between scripts get complex
- You need retries with exponential backoff and you’ve reinvented half of a framework
Then Dagster or Airflow earns its weight. Until then, scripts win.
Common Pitfalls (preview)
- Cron without monitoring. Script fails silently. Wire up alerting from day one.
- Re-runs that double-write. Idempotency matters; covered Oct 14.
- No watermark. Re-fetching everything each run; doesn’t scale.
- Building Airflow because “best practice.” Best practice for FAANG; overkill for most.
Wrapping Up
Boring scripts beat fancy frameworks for most teams. October walks through what those scripts actually look like. Wednesday: CDC vs polling — picking sync strategy.