background-shape
Why Lightweight ETL Beats Big Data Tools for Most Teams
October 3, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Most “ETL” work is < 100 GB/day, fits in a single VM, and doesn’t need a distributed framework. A 200-line Python script with SQLAlchemy + a cron job moves data better than Airflow + Spark + a team to operate them. Reserve big-data tools for big data.

After September’s observability theme, October pivots to data movement. Most teams I see have outsized data tooling for undersized problems. This month is what to use instead.

The “big data” trap

Companies with 50 GB of data and 5 engineers buy Airflow, Spark, dbt, Snowflake. Why?

  • The tooling is loud
  • “Data engineering” implies these tools
  • Vendor marketing
  • Resume-driven development

Each tool is excellent at its job. The job isn’t your job, probably.

A 200-line Python script reading from Postgres, transforming, writing to BigQuery, run on a 30-minute cron, handles the same workload at 1/100th the operational cost. Less to learn. Less to break.

When you genuinely need the big tools

Real signals:

  • Multi-TB datasets processed daily — Spark / Trino / DuckDB-with-effort
  • Hundreds of DAGs with complex dependencies — Airflow / Dagster
  • Real-time streaming at high volume — Kafka / Flink
  • Data team of 5+ maintaining shared transformations — dbt
  • Compliance requirements for lineage and audit — formal tools

For most teams: none of these apply.

What lightweight ETL looks like

The shape:

  • Source: Postgres, MySQL, BigQuery, S3, an API
  • Transform: pandas / Polars / plain SQL / Go structs
  • Destination: another database, data warehouse, S3
  • Orchestration: cron + a small Python/Go script + a metadata table

Operational footprint:

  • One VM or container
  • One database (often Postgres)
  • A cron entry or scheduled Lambda

That’s it. No Kubernetes cluster. No web UI. No DAG. Just a script that runs on a schedule.

A working pattern

For “sync orders from Postgres to BigQuery hourly”:

import sqlalchemy
from google.cloud import bigquery

pg = sqlalchemy.create_engine(os.environ['DATABASE_URL'])
bq = bigquery.Client()

def sync():
    # Get watermark
    with bq.query("SELECT MAX(updated_at) as wm FROM analytics.orders").result() as row:
        watermark = next(row).wm or datetime(2020, 1, 1)

    # Fetch incremental rows
    with pg.connect() as conn:
        rows = conn.execute(sqlalchemy.text("""
            SELECT * FROM orders WHERE updated_at > :wm ORDER BY updated_at
        """), {"wm": watermark}).mappings().all()

    if not rows:
        return

    # Load to BigQuery
    table = bq.dataset("analytics").table("orders")
    errors = bq.insert_rows_json(table, [dict(r) for r in rows])
    if errors:
        raise RuntimeError(f"BQ errors: {errors}")

    print(f"Synced {len(rows)} rows, watermark: {rows[-1]['updated_at']}")

if __name__ == "__main__":
    sync()

Cron entry:

*/30 * * * * /usr/bin/python /opt/sync/orders.py

That’s a complete incremental ETL. 30 lines. Reliable. Easy to debug.

What this month covers

12 more posts:

What I’m NOT going to cover

  • Spark, Hadoop, Databricks — out of scope
  • dbt in depth — useful when you have it; covered briefly
  • Airflow / Dagster / Prefect — only as comparison to “just use cron”
  • Real-time streaming at scale — separate problem space

When to graduate

You’ll know it’s time to leave lightweight ETL when:

  • One script grows past ~500 lines
  • You have > 10 scripts and forget what they do
  • Dependencies between scripts get complex
  • You need retries with exponential backoff and you’ve reinvented half of a framework

Then Dagster or Airflow earns its weight. Until then, scripts win.

Common Pitfalls (preview)

  • Cron without monitoring. Script fails silently. Wire up alerting from day one.
  • Re-runs that double-write. Idempotency matters; covered Oct 14.
  • No watermark. Re-fetching everything each run; doesn’t scale.
  • Building Airflow because “best practice.” Best practice for FAANG; overkill for most.

Wrapping Up

Boring scripts beat fancy frameworks for most teams. October walks through what those scripts actually look like. Wednesday: CDC vs polling — picking sync strategy.