Why Lightweight ETL Beats Big Data Tools for Most Teams

Etl article cover illustration on a gradient background

October 3, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Most “ETL” work is < 100 GB/day, fits in a single VM, and doesn’t need a distributed framework. A 200-line Python script with SQLAlchemy + a cron job moves data better than Airflow + Spark + a team to operate them. Reserve big-data tools for big data.

After September’s observability theme, October pivots to data movement. Most teams I see have outsized data tooling for undersized problems. This month is what to use instead.

The “big data” trap

Companies with 50 GB of data and 5 engineers buy Airflow, Spark, dbt, Snowflake. Why?

The tooling is loud
“Data engineering” implies these tools
Vendor marketing
Resume-driven development

Each tool is excellent at its job. The job isn’t your job, probably.

A 200-line Python script reading from Postgres, transforming, writing to BigQuery, run on a 30-minute cron, handles the same workload at 1/100th the operational cost. Less to learn. Less to break.

When you genuinely need the big tools

Real signals:

Multi-TB datasets processed daily — Spark / Trino / DuckDB-with-effort
Hundreds of DAGs with complex dependencies — Airflow / Dagster
Real-time streaming at high volume — Kafka / Flink
Data team of 5+ maintaining shared transformations — dbt
Compliance requirements for lineage and audit — formal tools

For most teams: none of these apply.

What lightweight ETL looks like

The shape:

Source: Postgres, MySQL, BigQuery, S3, an API
Transform: pandas / Polars / plain SQL / Go structs
Destination: another database, data warehouse, S3
Orchestration: cron + a small Python/Go script + a metadata table

Operational footprint:

One VM or container
One database (often Postgres)
A cron entry or scheduled Lambda

That’s it. No Kubernetes cluster. No web UI. No DAG. Just a script that runs on a schedule.

A working pattern

For “sync orders from Postgres to BigQuery hourly”:

import sqlalchemy
from google.cloud import bigquery

pg = sqlalchemy.create_engine(os.environ['DATABASE_URL'])
bq = bigquery.Client()

def sync():
    # Get watermark
    with bq.query("SELECT MAX(updated_at) as wm FROM analytics.orders").result() as row:
        watermark = next(row).wm or datetime(2020, 1, 1)

    # Fetch incremental rows
    with pg.connect() as conn:
        rows = conn.execute(sqlalchemy.text("""
            SELECT * FROM orders WHERE updated_at > :wm ORDER BY updated_at
        """), {"wm": watermark}).mappings().all()

    if not rows:
        return

    # Load to BigQuery
    table = bq.dataset("analytics").table("orders")
    errors = bq.insert_rows_json(table, [dict(r) for r in rows])
    if errors:
        raise RuntimeError(f"BQ errors: {errors}")

    print(f"Synced {len(rows)} rows, watermark: {rows[-1]['updated_at']}")

if __name__ == "__main__":
    sync()

Cron entry:

*/30 * * * * /usr/bin/python /opt/sync/orders.py

That’s a complete incremental ETL. 30 lines. Reliable. Easy to debug.

What this month covers

12 more posts:

Oct 5: CDC vs polling
Oct 7: Postgres logical replication for CDC
Oct 10: Python ETL pipeline patterns
Oct 12: Go ETL pipeline patterns
Oct 14: Idempotent pipelines + watermarks
Oct 17: Schema drift handling
Oct 19: Kafka as sync backbone
Oct 21: Debezium for Postgres → Kafka
Oct 24: Postgres COPY for bulk loading
Oct 26: Backfilling historical data safely
Oct 28: Monitoring ETL pipelines
Oct 31: Month retro

What I’m NOT going to cover

Spark, Hadoop, Databricks — out of scope
dbt in depth — useful when you have it; covered briefly
Airflow / Dagster / Prefect — only as comparison to “just use cron”
Real-time streaming at scale — separate problem space

When to graduate

You’ll know it’s time to leave lightweight ETL when:

One script grows past ~500 lines
You have > 10 scripts and forget what they do
Dependencies between scripts get complex
You need retries with exponential backoff and you’ve reinvented half of a framework

Then Dagster or Airflow earns its weight. Until then, scripts win.

Common Pitfalls (preview)

Cron without monitoring. Script fails silently. Wire up alerting from day one.
Re-runs that double-write. Idempotency matters; covered Oct 14.
No watermark. Re-fetching everything each run; doesn’t scale.
Building Airflow because “best practice.” Best practice for FAANG; overkill for most.

Wrapping Up

Boring scripts beat fancy frameworks for most teams. October walks through what those scripts actually look like. Wednesday: CDC vs polling — picking sync strategy .

The “big data” trap

When you genuinely need the big tools

What lightweight ETL looks like

A working pattern

What this month covers

What I’m NOT going to cover

When to graduate

Common Pitfalls (preview)

Wrapping Up

Related posts

Building an ETL Pipeline in Go

Building an ETL Pipeline in Python

October Retro, ETL Pragmatism

Monitoring ETL Pipelines, Lag, Errors, Throughput

Backfilling Historical Data Safely

Kafka as a Sync Backbone

Schema Drift, Handling Source Changes Without Breaking Pipelines

Idempotent Pipelines and Watermarks

Let’s Start a Project