Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Kafka Ecosystem
Kafka is more than just a message broker. It is a complete event streaming platform — think of it as both the highway and the logistics system for your data. The core broker handles message transport, but the ecosystem tools handle the hard parts: connecting to external systems, enforcing data contracts, and processing streams in real time. If Kafka itself is the engine, then Connect, Schema Registry, and ksqlDB are the transmission, dashboard, and GPS — each one independently valuable, but together they form a complete vehicle.1. Kafka Connect
Problem: Writing custom code to move data between Kafka and other systems (databases, S3, Elasticsearch) is tedious and error-prone. Every team ends up building the same “read from Postgres, write to Kafka” plumbing, complete with the same bugs around offset tracking, error handling, and schema changes. Solution: Kafka Connect is a framework for scalably and reliably streaming data between Kafka and other systems using pre-built, battle-tested connectors. Think of it as the “USB port” for Kafka — plug in a connector, provide configuration, and data flows automatically.Architecture
- Source Connectors: Pull data from an external system (e.g., PostgreSQL, MongoDB, S3) into Kafka topics. The connector handles polling, offset tracking, and fault tolerance.
- Sink Connectors: Push data from Kafka topics into an external system (e.g., Elasticsearch, data warehouses, Slack). The connector handles batching, retries, and exactly-once delivery where supported.
Example: PostgreSQL Source Connector (CDC)
Capture every INSERT, UPDATE, and DELETE from a database table and stream them to Kafka in real time. This pattern is called Change Data Capture (CDC), and it is how companies keep their search indexes, caches, and data warehouses in sync with the source database. Debezium is the most popular open-source CDC connector for Kafka. Thetopic.prefix determines the Kafka topic names — for example, db-server1.public.orders for the orders table.
2. Schema Registry
Problem: In a distributed system, producers and consumers are developed by different teams, deployed at different times, and evolve independently. How do you ensure they agree on the data format? What happens when someone adds a field, removes a field, or changes a type? Without guardrails, a schema change in one service silently breaks ten downstream consumers. Solution: A centralized registry for managing schemas (Avro, Protobuf, JSON Schema). Think of it as a “contract server” — before any message is sent, both sides agree on the format.How it works
- Producer registers its schema with the Schema Registry before sending the first message.
- Producer serializes data (e.g., to Avro) and includes only the Schema ID (4 bytes) in the message — not the full schema.
- Consumer reads the Schema ID from the message, fetches the corresponding schema from the Registry, and deserializes the data.
Benefits
- Data Governance: Enforce schema compatibility rules. Backward compatibility means new schema can read old data (safe for consumer upgrades). Forward compatibility means old schema can read new data (safe for producer upgrades). Full compatibility is both.
- Smaller Payloads: The schema is stored once in the registry, not repeated in every message. For schemas with many fields, this reduces message size by 50-80% compared to JSON.
- Breaking Change Detection: If a producer tries to register an incompatible schema (e.g., removing a required field), the registry rejects it before any data is sent. You find out at deploy time, not at 3 AM when consumers start crashing.
3. ksqlDB
Problem: Stream processing with Kafka Streams (Java) or Flink requires significant boilerplate code, JVM expertise, and deployment infrastructure. For simple transformations and aggregations, this is overkill. Solution: ksqlDB allows you to build stream processing applications using SQL — a language most developers and data analysts already know. No Java, no custom consumers, no deployment pipelines for stream processors.4. Kafka vs RabbitMQ vs ActiveMQ
This is one of the most frequently asked questions in system design interviews. The key insight is that Kafka and RabbitMQ solve fundamentally different problems despite both being called “message brokers.”| Feature | Kafka | RabbitMQ | ActiveMQ |
|---|---|---|---|
| Model | Log-based (Pull) — consumers control their own pace | Queue-based (Push) — broker pushes to consumers | Queue-based (Push) |
| Throughput | Extremely High (millions of messages/sec) | High (tens of thousands/sec) | Moderate |
| Persistence | Long-term (days to years) — replayable | Short-term (until consumed and acknowledged) | Short-term |
| Ordering | Per-partition ordering guaranteed | Per-queue ordering | Per-queue ordering |
| Use Case | Event streaming, audit logs, analytics, CDC | Complex routing, task queues, RPC | Enterprise integration (JMS) |
Common Pitfalls
Key Takeaways
- Use Kafka Connect to integrate with external systems without writing custom producer/consumer code. There are hundreds of pre-built connectors for databases, cloud storage, search engines, and more.
- Use Schema Registry to manage data contracts and schema evolution. Start with it from day one — retrofitting is painful.
- Use ksqlDB for simple, SQL-based stream processing. Graduate to Kafka Streams or Flink when SQL is not expressive enough.
- Choose Kafka for high-throughput event streaming (audit logs, CDC, analytics pipelines), and RabbitMQ for complex routing patterns and task queues.
- The Kafka ecosystem is not “just Kafka” — in production, you typically run Kafka + Schema Registry + Connect as a minimum viable setup, often with a stream processing layer on top.
Interview Deep-Dive
Your team is debating whether to use Kafka Connect with Debezium for CDC or write a custom consumer that polls the database for changes. Walk me through the trade-offs.
Your team is debating whether to use Kafka Connect with Debezium for CDC or write a custom consumer that polls the database for changes. Walk me through the trade-offs.
- The custom polling approach (query the database on a schedule, compare against last known state, push changes to Kafka) is conceptually simple but fails in production for several reasons. First, polling interval creates latency — if you poll every 30 seconds, changes are delayed by up to 30 seconds. Reducing the interval to 1 second puts significant load on the database. Second, you cannot reliably detect deletes (the row is gone, so your poll does not see it). Third, if you poll a large table, the query becomes expensive and slow. Fourth, you own all the failure handling: what happens if the poller crashes mid-batch?
- Debezium with Kafka Connect reads the database’s write-ahead log (WAL/binlog), capturing every INSERT, UPDATE, and DELETE in real time with sub-second latency. It does not query the database at all — it tails the log. This means zero additional load on the database, reliable delete capture (deletes appear in the WAL), and exactly-once processing guarantees (offsets track the WAL position, not a polling state).
- The trade-off: Debezium adds operational complexity. You need to run Kafka Connect workers, manage connector configurations, monitor connector health, and understand WAL-level configuration on the database (e.g., setting
wal_level=logicalin PostgreSQL, or enabling binlog in MySQL). If a connector fails, you need to understand how to restart it and handle the catch-up. - My recommendation: use Debezium for any production CDC use case. The custom poller only makes sense for one-off migrations, prototypes, or situations where the database does not support WAL tailing (rare with modern databases).
A team is sending JSON messages without a Schema Registry. Six months in, they have 15 consumers making different assumptions about field names and types. How do you introduce Schema Registry without breaking existing consumers?
A team is sending JSON messages without a Schema Registry. Six months in, they have 15 consumers making different assumptions about field names and types. How do you introduce Schema Registry without breaking existing consumers?
- This is a common migration scenario, and the key is backward compatibility. I would not switch existing topics to Avro overnight — that would break every consumer simultaneously.
- Phase 1: Deploy the Schema Registry and register the current JSON schema as an Avro or JSON Schema definition. Set the compatibility level to BACKWARD (new schemas can read old data). This is a read-only step that does not affect producers or consumers.
- Phase 2: Update producers to use the Schema Registry’s serializer. For Avro, switch from
StringSerializertoKafkaAvroSerializer. The serializer registers the schema automatically and embeds a 4-byte schema ID in each message. Configure the producer to use a new topic (e.g.,user-events-v2) rather than switching the existing topic. This ensures existing consumers on the old topic are unaffected. - Phase 3: Migrate consumers one at a time from the old topic to the new topic with Avro deserialization. Each consumer team migrates on their own schedule. During the transition, the producer publishes to both old and new topics (dual-write) or you use a Kafka Streams bridge that reads from the old topic and writes to the new one with schema enforcement.
- Phase 4: Once all consumers have migrated, decommission the old topic and the dual-write logic.
- The alternative (less disruptive but less clean): use the JSON Schema format in the Schema Registry instead of Avro. This lets consumers continue reading JSON while still getting schema validation and compatibility checks. The messages are still human-readable JSON, but the registry enforces that producers cannot add incompatible changes.
In a system design interview, you are asked to choose between Kafka and RabbitMQ for an order processing system. The system needs to handle order creation, payment processing, inventory updates, and email notifications. What do you choose and why?
In a system design interview, you are asked to choose between Kafka and RabbitMQ for an order processing system. The system needs to handle order creation, payment processing, inventory updates, and email notifications. What do you choose and why?
- This system has two distinct messaging patterns, and the answer is potentially both.
- For the event stream (order created, order updated, order shipped): Kafka. These are events that happened, and multiple downstream systems need to react independently. The payment service, inventory service, email service, and analytics service all consume the same
order-eventstopic. Each runs as a separate consumer group, processing events at their own pace. If the analytics team wants to reprocess a week of orders for a new report, they reset their consumer offset and replay — this is impossible with RabbitMQ because consumed messages are deleted. - For the task queue pattern (send this specific email, process this specific payment): RabbitMQ might be a better fit. If payment processing takes 30 seconds per order and you need exactly-once delivery with complex retry logic (retry 3 times, then dead-letter, then alert), RabbitMQ’s built-in retry, dead-letter exchange, and per-message acknowledgment make this straightforward. Kafka can do this, but you build the retry and dead-letter logic yourself.
- However, in practice, most teams choose one technology. If I had to pick one, I would pick Kafka for this use case because: the event replay capability is too valuable to give up, the throughput headroom is necessary for peak traffic (Black Friday order spikes), and the retry/dead-letter pattern can be implemented in Kafka using a retry topic and a dead-letter topic with a delay consumer.
- The key insight for the interviewer: I am not choosing based on “Kafka is better” — I am matching the messaging pattern to the tool’s strengths. Events go to Kafka, tasks can go to either, and the decision for tasks depends on team expertise and operational complexity tolerance.
orders (main topic), orders-retry (retry topic with a delay), and orders-dlq (dead-letter topic). The consumer reads from orders, processes the message, and on failure publishes to orders-retry with a retry count in the header. A separate consumer reads from orders-retry, waits for a delay (using pause() and resume() on the consumer or a timer), and republishes to orders. If the retry count exceeds the maximum (e.g., 3), the message goes to orders-dlq for manual investigation. This pattern is well-established and libraries like Spring Kafka provide it out of the box.Next: Kafka Architecture →