background-shape
Kafka as a Sync Backbone
October 19, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Kafka shines as the backbone for many-source, many-destination data flows. Topic per table; producers (Debezium, services) write; consumers (warehouse loaders, search indexers, etc.) read independently. Overkill for 1-source-1-destination pipelines.

After schema drift, Kafka as the central nervous system. Kafka is a Swiss army knife — used for log aggregation, event streaming, queueing, ETL backbone. This post focuses on the ETL use case.

Where Kafka earns it

Two or more of these typically apply when Kafka pays off:

  • Multiple producers for the same data type (services emit events; CDC tools emit DB changes)
  • Multiple consumers for the same stream (warehouse loader + search indexer + analytics + ML feature store)
  • Replay capability needed (consumer crashes; resume from last offset; rewind for backfill)
  • Decoupling time between producer and consumer (consumer can be down for hours; events buffer)
  • High volume that warrants partitioning

For 1-source-1-destination at 1000 events/min: cron + script. For 5 sources × 8 destinations × 100K events/min: Kafka.

The topology

[Postgres (Debezium)]
[App service events]    ──→ [Kafka cluster]  ──→  [BigQuery loader]
[Mobile clickstream]                          ──→  [Elasticsearch indexer]
                                              ──→  [ML feature pipeline]
                                              ──→  [Audit log archive]

Each producer writes to topics. Each consumer reads independently, tracking its own offset. Failures are local — the BigQuery loader being down doesn’t affect search indexing.

Topic naming

Patterns that scale:

<domain>.<entity>.<event-type>.<version>

billing.subscriptions.changes.v1
billing.invoices.events.v1
auth.users.changes.v1
inventory.products.updates.v1

Versioned in the name. Breaking changes = new topic; old consumers continue reading old topic until migrated.

Partitioning: partition key = the entity ID (subscription_id, user_id). Ensures changes for the same entity arrive in order on a single partition.

Producers

CDC tooling (Debezium) is the most common producer. Application code can produce too:

import "github.com/segmentio/kafka-go"

writer := &kafka.Writer{
    Addr:         kafka.TCP("kafka:9092"),
    Topic:        "billing.subscriptions.changes.v1",
    Balancer:     &kafka.Hash{},
    RequiredAcks: kafka.RequireAll,
}

err := writer.WriteMessages(ctx, kafka.Message{
    Key:   []byte(subscriptionID),
    Value: marshalEvent(event),
})

RequireAll waits for all in-sync replicas. Slower; safer. For event-sourcing-critical messages, use this.

RequireOne waits for the leader. Faster; slight risk of loss on failover.

Consumers

Each consumer group reads independently:

reader := kafka.NewReader(kafka.ReaderConfig{
    Brokers:  []string{"kafka:9092"},
    Topic:    "billing.subscriptions.changes.v1",
    GroupID:  "bigquery-loader",
    MinBytes: 10e3,
    MaxBytes: 10e6,
})

for {
    m, err := reader.FetchMessage(ctx)
    if err != nil { return err }

    if err := processMessage(m); err != nil {
        log.Printf("error: %v", err)
        // Don't commit; will be redelivered
        continue
    }

    if err := reader.CommitMessages(ctx, m); err != nil {
        log.Printf("commit error: %v", err)
    }
}

Two patterns:

Auto-commit (default): consumer commits offsets periodically. Easier; risk of reprocessing on crash.

Manual commit (above): explicit commit only after processing succeeds. Stronger guarantees; more code.

For ETL: manual commit. Reprocessing = duplicate writes; idempotent destination handles it.

Replay and backfill

Want to rebuild the warehouse from scratch:

# Reset consumer group to earliest offset
kafka-consumer-groups --bootstrap-server kafka:9092 \
  --group bigquery-loader \
  --topic billing.subscriptions.changes.v1 \
  --reset-offsets --to-earliest --execute

Restart consumer. Reads from beginning. Idempotent destination de-dupes.

This is the killer feature. Backfill becomes a config change, not a one-off script.

Retention determines how far back you can replay:

topic config:
  retention.ms = 604800000   # 7 days
  retention.bytes = -1       # unlimited size

7 days = can rebuild within a week of bad data. Longer retention costs disk but enables longer reach-back.

For unlimited retention with cost control, use tiered storage (Kafka 3.0+; KIP-405) or upgrade to Confluent / Aiven’s tiered storage features.

Compacted topics

For “current state per key” use cases:

topic config:
  cleanup.policy = compact

Kafka retains the LATEST message per key, indefinitely. Old versions of the same key get compacted away. Useful for:

  • Configuration distribution
  • “Current state of subscription X” snapshots
  • Materialized views

A consumer joining late gets the current state of every key by reading from earliest.

Operational cost

Running Kafka:

  • 3-node cluster minimum for production HA
  • ZooKeeper (or KRaft mode, recent versions)
  • Schema registry if you use Avro/Protobuf
  • Connect cluster for Debezium / sink connectors
  • Monitoring (JMX → Prometheus)

Managed (Confluent Cloud, AWS MSK, Aiven): trade money for ops time. Worth it for most teams.

For a 50K events/sec workload: managed Kafka runs $200-500/month. Self-host on dedicated cluster: lower marginal cost but real ops investment.

When to skip Kafka

If you have:

  • < 10 events/sec
  • One source, one destination
  • No replay requirement
  • Tight team / no Kafka knowledge

…cron + Postgres + a script is simpler.

Don’t introduce Kafka because it’s “what real engineering teams do.” Introduce it when its specific capabilities solve a real problem.

Common Pitfalls

Too many topics. 10K topic-partitions degrade Kafka performance. Group related events into one topic.

Partition keys not stable. Same entity goes to different partitions over time; events arrive out of order at consumers.

Auto-commit with side-effectful processing. Crash between processing and commit = lost event.

Schema-less JSON. Producers and consumers drift; data quality degrades.

No tiered storage / retention plan. Brokers fill disks. Set retention.

Consumer groups without monitoring lag. Slow consumer falls behind for days; nobody notices.

Wrapping Up

Kafka = backbone for many-to-many data flows. Topic per entity, partition by key, consumer-group per destination. Friday: Debezium for Postgres → Kafka CDC.