Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Kafka Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to produce and consume messages, feel free to skip ahead. No judgment.
This chapter reveals the distributed systems magic behind Kafka. We will explore the commit log abstraction, understand how replication guarantees durability, and demystify consumer group coordination. This knowledge is what allows you to tune Kafka for millions of messages per second.

Why Internals Matter

Understanding Kafka internals helps you:
  • Tune for performance when milliseconds matter
  • Debug production issues when messages go missing
  • Design better systems that leverage Kafka’s strengths
  • Ace interviews where Kafka internals are common
  • Make informed decisions about partitioning and replication

The Fundamental Abstraction: The Commit Log

At its core, Kafka is an append-only, immutable commit log. This simple abstraction is the source of Kafka’s power.
Topic: user-events
Partition 0:
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │ 8  │ 9  │ ...
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
  ▲                                            ▲
  │                                            │
Oldest                                      Newest
(may be deleted)                         (append here)
Key properties:
  • Append-only: New messages always go to the end
  • Immutable: Messages never change after writing
  • Ordered: Messages have sequential offsets within a partition
  • Durable: Messages persist to disk (not just memory)
This is fundamentally different from traditional message queues where messages are deleted after consumption.

Log Segments: How Data is Stored

A partition is not one giant file. It is split into segments for manageability.
Partition directory: /kafka-logs/user-events-0/

├── 00000000000000000000.log      # Segment 1: offsets 0-999
├── 00000000000000000000.index    # Offset index
├── 00000000000000000000.timeindex # Timestamp index
├── 00000000000000001000.log      # Segment 2: offsets 1000-1999
├── 00000000000000001000.index
├── 00000000000000001000.timeindex
├── 00000000000000002000.log      # Active segment
└── ...

Segment Structure

Each segment consists of:
FilePurpose
.logActual message data
.indexMaps offset to file position
.timeindexMaps timestamp to offset
.snapshotProducer state for idempotency

Segment Lifecycle

┌─────────────────────────────────────────────────────────────────┐
│                      Segment Lifecycle                           │
│                                                                  │
│  Create ──▶ Active ──▶ Rolled ──▶ Eligible for deletion         │
│                │                                                 │
│                │ When:                                           │
│                │ - Size > segment.bytes (1GB default)           │
│                │ - Age > segment.ms (7 days default)            │
│                │ - Index full                                    │
└─────────────────────────────────────────────────────────────────┘

Log Compaction

For topics where only the latest value per key matters:
Before compaction:
┌─────────────────────────────────────────────────────────────────┐
│ Key:A,V:1 │ Key:B,V:1 │ Key:A,V:2 │ Key:A,V:3 │ Key:B,V:2 │
└─────────────────────────────────────────────────────────────────┘

After compaction:
┌─────────────────────────────────────────────────────────────────┐
│ Key:A,V:3 │ Key:B,V:2 │
└─────────────────────────────────────────────────────────────────┘
Use cases:
  • CDC (Change Data Capture) - keep latest row state
  • User profiles - keep latest version
  • Configuration - keep current settings

The Index Files: Fast Lookups

Reading from offset 1,000,000 should not require scanning 1M messages. Indexes solve this.

Offset Index

Offset Index (.index):
┌─────────────────┬──────────────────┐
│ Relative Offset │ Physical Position│
├─────────────────┼──────────────────┤
│       0         │        0         │
│      100        │     102400       │
│      200        │     204800       │
│      ...        │       ...        │
└─────────────────┴──────────────────┘
Finding offset 150:
  1. Binary search index: 100 < 150 < 200
  2. Start at position 102400
  3. Scan forward until offset 150
Indexes are sparse (not every offset) for space efficiency.

Timestamp Index

Timestamp Index (.timeindex):
┌───────────────┬─────────────────┐
│  Timestamp    │  Relative Offset│
├───────────────┼─────────────────┤
│ 1701590400000 │       0         │
│ 1701590410000 │      100        │
│ 1701590420000 │      200        │
└───────────────┴─────────────────┘
This powers offsetsForTimes() - seek to a specific time.

Replication: The Safety Net

Kafka replicates partitions across brokers for fault tolerance.

Replication Topology

Topic: orders (replication-factor=3)

         ┌──────────────────────────────────────────────────┐
         │                 Partition 0                       │
         ├────────────────────┬──────────────┬──────────────┤
         │    Broker 1        │   Broker 2   │   Broker 3   │
         │    (Leader)        │  (Follower)  │  (Follower)  │
         │    [0,1,2,3,4]     │  [0,1,2,3,4] │  [0,1,2,3]   │
         └────────────────────┴──────────────┴──────────────┘


                                              Slightly behind

Leader and Followers

RoleResponsibilities
LeaderHandles all reads and writes for partition
FollowersReplicate from leader, ready to take over
Producers and consumers only talk to the leader. This simplifies consistency.

In-Sync Replicas (ISR)

The ISR is the set of replicas that are “caught up” with the leader:
ISR = [Broker1, Broker2, Broker3]  # All caught up
ISR = [Broker1, Broker2]           # Broker3 fell behind
ISR = [Broker1]                    # Only leader in sync (dangerous!)
A replica falls out of ISR when:
  • More than replica.lag.time.max.ms behind (default: 30s)
  • Not sending fetch requests

Durability Guarantees: acks

Producers control durability with acks:
acksMeaningDurabilityPerformance
0Fire and forgetNoneFastest
1Leader acknowledgedModerateFast
all (-1)All ISR acknowledgedHighestSlowest
acks=all with min.insync.replicas=2:

Producer ──▶ Leader (Broker1)

              ├──▶ Follower (Broker2) ──▶ ACK

              └──▶ Follower (Broker3) ──▶ ACK

            ◀──────────────────────────────┘
                    Producer receives ACK

Leader Election

When a leader fails:
  1. Controller detects leader failure (via ZooKeeper/KRaft)
  2. Controller selects new leader from ISR
  3. Controller updates metadata
  4. Brokers and clients fetch new metadata
  5. New leader starts serving requests
Before failure:
  ISR = [Broker1 (Leader), Broker2, Broker3]

Broker1 dies:
  ISR = [Broker2, Broker3]
  Controller elects Broker2 as new leader

After election:
  ISR = [Broker2 (Leader), Broker3]
Unclean leader election: If ISR is empty, Kafka can elect a non-ISR replica (data loss risk). Controlled by unclean.leader.election.enable (default: false).

Producer Internals

Understanding producer internals helps you tune for throughput.

Producer Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Kafka Producer                            │
│                                                                  │
│  ┌──────────┐   ┌──────────────┐   ┌────────────────────────┐  │
│  │  Record  │──▶│  Serializer  │──▶│     Partitioner        │  │
│  │  (K,V)   │   │              │   │ (Which partition?)     │  │
│  └──────────┘   └──────────────┘   └───────────┬────────────┘  │
│                                                  │               │
│                                    ┌─────────────▼─────────────┐│
│                                    │    Record Accumulator     ││
│                                    │  ┌─────┐ ┌─────┐ ┌─────┐ ││
│                                    │  │Batch│ │Batch│ │Batch│ ││
│                                    │  │ P0  │ │ P1  │ │ P2  │ ││
│                                    │  └─────┘ └─────┘ └─────┘ ││
│                                    └───────────┬───────────────┘│
│                                                │                │
│                                    ┌───────────▼───────────────┐│
│                                    │      Sender Thread        ││
│                                    │   (Network I/O)           ││
│                                    └───────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Batching: The Performance Trick

Producers accumulate records into per-partition batches before sending. This is the single biggest performance lever in Kafka — sending 1000 records in one network request is dramatically faster than 1000 individual requests. The analogy: mailing letters one at a time versus filling a mailbag and sending it once.
ConfigPurposeDefaultTuning Guidance
batch.sizeMax batch size in bytes16KBIncrease to 64-128KB for high-throughput producers
linger.msWait time for more records0ms5-20ms is the sweet spot for most workloads
buffer.memoryTotal memory for buffering32MBIncrease if you produce to many partitions simultaneously
linger.ms=0:   Record arrives ──▶ Send immediately (lowest latency, worst throughput)
linger.ms=10:  Record arrives ──▶ Wait 10ms ──▶ Send batch (higher throughput, 10ms added latency)
Larger batches = better throughput, higher latency. This is a fundamental trade-off — there is no free lunch. For most applications, 5-10ms of added latency is invisible to users but doubles or triples throughput.

Partitioning Strategy

Default partitioner:
  • If key is null: Round-robin across partitions
  • If key is present: hash(key) % numPartitions
// Same key always goes to same partition
producer.send(new ProducerRecord<>("orders", "user-123", orderData));
// user-123 orders always in same partition = ordered processing

Idempotent Producer

Enable with enable.idempotence=true:
Without idempotence:
  Producer sends ──▶ Network error ──▶ Retry ──▶ Duplicate message!

With idempotence:
  Producer sends (PID=1, Seq=5) ──▶ Network error ──▶ Retry (PID=1, Seq=5)


                                                   Broker: "Already have Seq=5"
                                                   Deduplicate!
Each producer gets a Producer ID (PID) and sequence numbers per partition.

Consumer Internals

Consumer groups are Kafka’s killer feature for parallel processing.

Consumer Group Coordination

Consumer Group: order-processors

┌─────────────────────────────────────────────────────────────────┐
│                    Group Coordinator                             │
│                    (Broker 2)                                    │
│                                                                  │
│   Manages:                                                       │
│   - Group membership                                             │
│   - Partition assignment                                         │
│   - Offset commits                                               │
└─────────────────────────────────────────────────────────────────┘

           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
      Consumer 1      Consumer 2      Consumer 3
      [P0, P1]        [P2, P3]        [P4, P5]

The Rebalance Protocol

When group membership changes (consumer joins/leaves), partitions are reassigned:
1. Consumer sends JoinGroup request
2. Coordinator waits for all consumers (session.timeout.ms)
3. Coordinator selects a leader consumer
4. Leader runs partition assignment algorithm
5. Coordinator sends assignments to all consumers
6. Consumers start fetching from assigned partitions

Partition Assignment Strategies

StrategyBehavior
RangeAssignorAssign contiguous partitions per topic
RoundRobinAssignorDistribute partitions evenly across consumers
StickyAssignorMinimize reassignments during rebalance
CooperativeStickyAssignorIncremental rebalance (no stop-the-world)

Offset Management

Consumers track progress via offsets:
__consumer_offsets topic (internal):
┌─────────────────┬──────────────────┬─────────────────┐
│  Group          │  Topic-Partition │  Committed      │
├─────────────────┼──────────────────┼─────────────────┤
│  order-procs    │  orders-0        │  1523           │
│  order-procs    │  orders-1        │  1891           │
│  order-procs    │  orders-2        │  1456           │
└─────────────────┴──────────────────┴─────────────────┘
Commit strategies:
StrategyCodeRisk
Auto commitenable.auto.commit=trueAt-least-once, possible duplicates
Sync commitconsumer.commitSync()At-least-once, blocks thread
Async commitconsumer.commitAsync()At-least-once, no blocking
Manual offsetCommit after processingExactly-once possible

ZooKeeper vs KRaft

Kafka is transitioning from ZooKeeper to KRaft (Kafka Raft).

ZooKeeper Mode (Legacy)

┌─────────────────────────────────────────────────────────────────┐
│                    ZooKeeper Ensemble                            │
│     (Stores: broker list, topic config, controller election)    │
└────────────────────────────────────┬────────────────────────────┘

     ┌───────────────────────────────┼───────────────────────────┐
     ▼                               ▼                           ▼
 Broker 1                        Broker 2                    Broker 3
 (Controller)
ZooKeeper stored:
  • Broker registration and liveness
  • Topic and partition metadata
  • Controller election
  • ACLs and quotas

KRaft Mode (New Standard)

┌─────────────────────────────────────────────────────────────────┐
│                    Kafka Controllers                             │
│           (Raft consensus, metadata stored in Kafka)            │
│                                                                  │
│     Controller 1 ◀──▶ Controller 2 ◀──▶ Controller 3            │
└────────────────────────────────────┬────────────────────────────┘

     ┌───────────────────────────────┼───────────────────────────┐
     ▼                               ▼                           ▼
 Broker 1                        Broker 2                    Broker 3
KRaft advantages:
  • Simpler operations: One system instead of two
  • Lower latency: No ZooKeeper round-trips
  • Better scalability: Millions of partitions possible
  • Faster recovery: Metadata in Kafka itself

Interview Deep Dive Questions

Answer: 1) Sequential disk I/O (append-only log), 2) Batching (produce/consume in batches), 3) Zero-copy transfer (sendfile syscall), 4) Page cache utilization (OS caches data), 5) Compression (reduce network/disk I/O), 6) Partitioning (parallel processing across brokers).
Answer: ISR (In-Sync Replicas) is the set of replicas caught up with the leader. When acks=all, producer waits for all ISR replicas to acknowledge. If replica falls behind (lag > replica.lag.time.max.ms), it is removed from ISR. min.insync.replicas sets minimum ISR size for writes to succeed. This balances durability and availability.
Answer: 1) Consumer sends heartbeats to group coordinator, 2) On membership change, coordinator triggers rebalance, 3) All consumers stop consuming (stop-the-world), 4) Consumers send JoinGroup request, 5) Coordinator elects leader who runs assignment strategy, 6) Coordinator sends assignments, 7) Consumers start consuming assigned partitions. CooperativeStickyAssignor enables incremental rebalance.
Answer: Requires: 1) Idempotent producer (enable.idempotence=true) - prevents duplicates via PID+sequence, 2) Transactional producer (transactional.id) - atomic writes across partitions, 3) Consumer with isolation.level=read_committed - only sees committed transactions. Used in Kafka Streams for exactly-once stream processing.
Answer: 1) Controller detects failure (missed heartbeats or ZK session loss), 2) For each partition led by failed broker, controller elects new leader from ISR, 3) Controller updates metadata and broadcasts to all brokers, 4) Clients refresh metadata and connect to new leaders, 5) Followers catch up from new leader. If min.insync.replicas not met, partition becomes unavailable for writes.
Answer: 1) Log-based vs queue-based (messages not deleted on consumption), 2) Pull model vs push model (consumers control pace), 3) Replayability (seek to any offset), 4) Consumer groups (partitions enable parallel consumption), 5) Retention-based (time/size) vs delivery-based deletion, 6) Higher throughput (millions vs thousands per second), 7) Ordering per partition only.

Tuning Kafka: Key Configurations

Producer Performance

# Batching -- the primary throughput lever
batch.size=65536          # 64KB batches (4x default). Larger batches = fewer network round trips.
linger.ms=10              # Wait up to 10ms to fill the batch. Adds 10ms latency but dramatically
                          # improves throughput by allowing more records to batch together.

# Compression -- reduces both network and disk I/O at the cost of CPU
compression.type=lz4       # lz4 = fastest compression/decompression. Use zstd for better
                           # compression ratio when CPU is cheap and bandwidth is expensive.

# Throughput -- memory allocation for the producer's internal buffer
buffer.memory=67108864     # 64MB buffer across all partitions. If this fills, send() blocks.
max.block.ms=60000         # How long send() blocks when buffer is full before throwing.

Consumer Performance

# Fetch tuning -- controls how much data the consumer pulls per request
fetch.min.bytes=1048576    # Wait for 1MB of data before responding. Higher = better throughput,
                           # but adds latency on low-volume topics (waits for data to accumulate).
fetch.max.wait.ms=500      # Maximum wait time. Fetch completes when EITHER min bytes OR max wait is hit.
max.poll.records=500       # Max records returned per poll() call. Set this based on how long your
                           # processing takes -- if processing 500 records takes > max.poll.interval.ms,
                           # the consumer gets kicked out of the group.

# Per-partition fetch size
max.partition.fetch.bytes=1048576  # 1MB per partition. Increase for large messages.

Broker Performance

# Networking -- thread pools for handling client connections
num.network.threads=8       # Threads that read requests from the socket. Scale with number of clients.
num.io.threads=16           # Threads that process requests (disk I/O, replication). Scale with disk count.

# Log segment configuration
log.segment.bytes=1073741824   # 1GB segments. Smaller = more files, more frequent cleanup.
log.retention.hours=168        # 7 days retention. Balance between storage cost and replay capability.

# Replication -- controls how followers catch up with leaders
num.replica.fetchers=4          # Threads per broker for fetching from leaders. Increase if follower
                                # lag is high and the bottleneck is fetch throughput, not disk I/O.
replica.fetch.max.bytes=1048576 # Max bytes per fetch from leader. Increase for large messages.

Key Takeaways

  1. Kafka is an append-only commit log - simple abstraction, powerful results
  2. Partitions split into segments - enables efficient retention and compaction
  3. Indexes enable fast seeks - sparse offset and timestamp indexes
  4. ISR determines durability - acks=all waits for ISR acknowledgment
  5. Producers batch for throughput - linger.ms and batch.size control this
  6. Consumer groups enable parallelism - partitions assigned to consumers
  7. Rebalancing is expensive - use CooperativeStickyAssignor to minimize impact
  8. KRaft is the future - simpler, faster, scales to millions of partitions

Interview Deep-Dive

Strong Answer:
  • Traditional data transfer for a network application involves four copies: disk to kernel page cache, page cache to application buffer, application buffer to socket buffer, socket buffer to NIC. Each copy involves a context switch between user space and kernel space.
  • Kafka uses the sendfile() system call (on Linux), which transfers data directly from the page cache to the network socket, bypassing the application entirely. This eliminates two of the four copies and two context switches. The data goes: disk to page cache (or already cached), then page cache directly to NIC via DMA. The Kafka broker process never touches the bytes.
  • This matters enormously for throughput. At LinkedIn’s scale (7+ trillion messages per day), the savings are massive. Without zero-copy, each message would require the broker to allocate application memory, copy bytes in, copy bytes out to the socket, and garbage-collect the memory. With zero-copy, the broker is essentially a routing layer that points the kernel at the right file offset and says “send this to that socket.”
  • This is also why Kafka performs well with the OS page cache. Kafka is designed to let the OS manage caching rather than implementing its own cache in the JVM heap. Messages written by producers go to the page cache, and messages read by consumers are served from the same page cache — often without touching disk at all for recent data. This design avoids GC pauses that would occur with a JVM-managed cache.
Follow-up: If Kafka relies on the OS page cache, what happens when the machine runs low on memory?The OS evicts page cache pages using an LRU policy to make room for application memory. This means older Kafka data (tail of the log) falls out of cache and subsequent reads hit disk. For consumers reading recent data (the common case), the impact is minimal — recent writes are still in cache. For consumers replaying old data (e.g., a new consumer group starting from the beginning), reads become disk-bound. This is why Kafka brokers should have enough RAM to cache the “hot” portion of data (the most recent segment or two per partition). A common rule of thumb is to size broker RAM at 25-50% of the active data set.
Strong Answer:
  • The first thing I check is min.insync.replicas. If it is set to 1 (the default), then acks=all only guarantees that the leader acknowledged the write. If the replication factor is 3 but min.insync.replicas is 1, a single broker (the leader) acknowledging is considered “all ISR” when the ISR has shrunk to just the leader. If that leader then crashes before replicating, the data is lost.
  • The fix is the golden configuration: replication.factor=3, min.insync.replicas=2, acks=all. This ensures at least 2 replicas acknowledge every write. A single broker failure cannot lose data because at least one other replica has the data.
  • Second thing I check: was unclean.leader.election.enable set to true? If the ISR was empty (all ISR replicas failed simultaneously) and an out-of-sync replica was elected leader, that replica may be missing messages. The new leader’s log becomes the source of truth, and any messages it did not have are permanently lost.
  • Third: was the producer correctly handling errors? If the producer sends a message, receives a timeout (not an explicit ack), and does not retry, the message may have been written to the leader but the ack was lost in transit. Without idempotent producer (enable.idempotence=true), retries could cause duplicates, so some teams disable retries — which leads to data loss on transient errors.
  • The complete production configuration I recommend: acks=all, min.insync.replicas=2, replication.factor=3, unclean.leader.election.enable=false, enable.idempotence=true, retries=Integer.MAX_VALUE. This is the only combination that guarantees no data loss under single broker failure.
Follow-up: With min.insync.replicas=2 and only 2 replicas alive out of 3, what happens if one more broker fails?The partition becomes unavailable for writes. The producer receives NotEnoughReplicasException because the ISR has only 1 member, which is below min.insync.replicas=2. Reads may still work (the remaining leader can serve consumers). This is the availability trade-off of strong durability: you sacrifice write availability to prevent data loss. The partition stays unavailable for writes until the second replica recovers and rejoins the ISR. This is by design — the alternative (allowing writes with insufficient replicas) risks data loss, which is worse.
Strong Answer:
  • The Eager rebalancing protocol (default in older Kafka versions) is a “stop-the-world” event. When any consumer joins or leaves the group: (1) all consumers stop processing and revoke all their partitions, (2) every consumer sends a JoinGroup request to the group coordinator, (3) the coordinator elects a consumer leader who runs the partition assignment algorithm, (4) the coordinator distributes the new assignments, (5) consumers start consuming from their newly assigned partitions.
  • This is painful because during steps 1-5, no consumer in the group is processing any messages. For a consumer group processing time-sensitive data (fraud detection, real-time dashboards), even a 30-second processing gap is unacceptable. Worse, rebalances are triggered by common events: a consumer process restarting, a deployment rolling update, or a consumer taking too long between poll() calls (exceeding max.poll.interval.ms).
  • The CooperativeStickyAssignor (introduced in Kafka 2.4) eliminates the stop-the-world pause. It uses an incremental protocol: only the partitions that need to move are revoked from their current owners. All other partitions continue being processed without interruption. A rolling deployment of 10 consumer instances triggers 10 small incremental rebalances instead of one large stop-the-world event, and at each step, only the partitions owned by the restarting instance are temporarily unassigned.
  • The StickyAssignor (without Cooperative) still uses eager revocation but minimizes partition movement — it tries to keep the same assignment as before. The CooperativeStickyAssignor combines both benefits: minimal movement and no stop-the-world.
  • In production, I always configure CooperativeStickyAssignor and tune the timeout settings: session.timeout.ms=45000, heartbeat.interval.ms=15000, and ensure max.poll.interval.ms exceeds the maximum processing time for a batch. I also use static group membership (group.instance.id) for consumers with stable identities (like Kubernetes pods with stable hostnames), which avoids rebalancing entirely on temporary disconnections.
Follow-up: During a rolling deployment, each consumer instance restarts one at a time. With the default Eager protocol, how many rebalances occur and what is the total processing gap?For N consumer instances, a rolling deployment triggers 2N rebalances: N leave events and N join events. Each rebalance is a stop-the-world pause of 10-30 seconds (depending on group size and timeout settings). For 10 instances, that is 20 rebalances with a cumulative processing gap of 200-600 seconds — during which no messages are consumed. With CooperativeStickyAssignor and static group membership, the same rolling deployment triggers N incremental rebalances (only on leave, since the rejoin with the same group.instance.id is recognized), each affecting only the partitions of the restarting instance. Total processing gap: near zero for unaffected partitions, and 5-10 seconds per partition that moves.

Ready to build streaming applications? Next up: Kafka Streams where we will transform and aggregate event streams in real-time.