background-shape
October Retro, ETL Pragmatism
October 31, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Four pipelines running in prod by month-end. Three lightweight Python; one Kafka+Debezium for high-frequency event data. Self-hosted observability picked up where Datadog would have charged. Worth defending: the discipline around idempotency and watermarks.

End of October. Quick retro on what shipped in the ETL theme.

What shipped

Four pipelines now in production:

  1. orders → BigQuery — hourly cron, ~50K rows/run, lightweight Python
  2. users → BigQuery — daily cron, ~5K rows/run, lightweight Python
  3. sensor_readings → ClickHouse — Kafka consumer, ~500K/hour, Debezium + Go consumer
  4. events → S3 archive — daily cron, batch archive, lightweight Python

Total time investment: ~5 days across the month. Once running, ~30 min/week maintenance.

What worked

The cron + script pattern. Most pipelines didn’t need orchestration. A */15 * * * * entry, a 100-line Python script, and basic monitoring covered everything except the high-frequency one.

Idempotent destination + watermark in metadata table. Tested twice during the month: once when a cron fired twice (firewall issue caused first run to hang), once during manual re-runs. Zero duplicate data. The pattern works.

Pushgateway for cron monitoring. Cheap, reliable, integrates with the existing observability stack from September. All four pipelines now report to Prometheus.

Debezium for the high-frequency case. The sensor reading pipeline went from custom polling (high source DB load) to Debezium → Kafka → custom consumer. Source DB CPU dropped 40%. Worth the setup complexity.

What didn’t

Tried dbt for the warehouse transformations. dbt is great. Also: for our 12 models, the operational overhead (separate compile/test/run cycles, schema interpretation rules) was more than the benefit. Reverted to plain SQL views materialized weekly.

Polars for the Python pipelines. Faster than pandas but our datasets are small. Switched back to pandas for team familiarity. Will reach for Polars when data grows.

Initial backfill for the events pipeline. First try: single Python script running for 6 hours. Realized halfway it would crash on day 18 of 365 due to a memory leak in the source DB driver. Restructured to chunked backfill with progress tracking. Completed cleanly the second time. Lesson: chunked from day one.

Kafka KRaft mode. Tried KRaft (no ZooKeeper) on the Kafka deployment. Works but tooling is rough; some operations require workarounds. Reverted to ZooKeeper for stability. Will revisit once Kafka 3.5+.

What I’d do differently

Set up monitoring BEFORE the first pipeline ships. I built the orders pipeline, ran it for 3 days, realized I had no idea if it was working. Spent half a day backfilling monitoring. Should have been first.

Schema-as-code from day one. Pipeline 2 hit schema drift mid-month. Without committed schemas, took an hour to diagnose “is this expected?” An hour avoidable.

Document the watermark semantics. When I rebuilt the events pipeline I had to re-derive whether the watermark advances inclusively or exclusively. Cost an afternoon. Should be in the README.

Skip Kafka for pipelines 1-3. Only one of four genuinely needed Kafka. The other three are fine on cron. Earlier in my career I’d have put everything on Kafka “for consistency.” This time I resisted; right call.

What’s load-bearing now

Patterns I’d defend in code review:

  • Watermark table per pipeline. One source of truth for “where am I.”
  • Upsert at destination. Idempotency. Non-negotiable.
  • Push to Pushgateway after success. Stale-timestamp alerts catch silent failures.
  • Chunked backfill with progress tracking. Resumable; safe; doesn’t melt the source.

Patterns I treat as optional:

  • Polars over pandas (depends on data size)
  • Kafka (only when justified)
  • Schema drift auto-detection (nice-to-have)

What October didn’t cover

Honest gaps:

  • Pure streaming ETL (Flink, Spark Structured Streaming). Different problem space.
  • Data warehouse transformations (dbt deep dive). Other people cover it well.
  • Real-time analytics serving (Materialize, RisingWave). Out of scope.
  • Snowflake-specific features (Snowpipe, Streams, Tasks). I don’t use Snowflake.

If those matter for your shop, October isn’t the comprehensive guide.

What November looks like

November theme: DevSecOps — Securing backend APIs, advanced JWT authentication, rate-limiting. Pivots from data engineering to API security. Same shape: 13 articles, M/W/F.

Why this theme: it’s the topic I get asked about most when consulting. JWT misuse + sloppy rate limiting are the most common bugs I see. Time to write down the right patterns.

See you Wednesday.