In this post, we’re going to look at the Kafka Replication Protocol and how it separates control plane and data plane responsibilities. It’s worth noting there are other systems that separate concerns in a similar way, with RabbitMQ Streams being one that I am aware of.
Preamble
There are so many systems out there, too many for me to list without this becoming a huge research project. So I’m going to stick to systems I have directly been involved with, or have a decent level of knowledge about already. Feel free to comment on social media about interesting systems that I haven’t included.
I have classified a few ways of breaking apart a monolithic replication protocol such as Raft. The classifications are:
(A) Disaggregated roles/participants. The participants form clearly delineated roles that can be run as separate processes. The protocol itself may still be converged (control plane and data plane intertwined) but exercised by disaggregated components.
(B) Disaggregated protocol that separates the control plane from the data plane. The protocol itself is broken up into control plane and data plane duties, with clean and limited interaction between them.
(C) Segmented logs. Logs are formed from a chain of logical log segments.
(D) Pointer-based logs. Separates data from ordering.
(E) Separating ordering from IO. Avoid the need for a leader by allowing all nodes to write, but coordinate via a sequencer component.
(F) Leaderless proxies. Abstracts the specifics of the replicated log protocol from clients, via another disaggregated abstraction layer of leaderless proxies.
Kafka Replication Protocol (aka topic partition replication protocol)
This protocol is concerned with topic partition replication.
Disaggregation categories:
(A) Disaggregated roles/participants
(B) Disaggregated protocol that separates the control plane from the data plane.
In my classification, (B) more or less implies (A); it just takes it further, or perhaps is one specific way of doing (A). I have kept them separate because I think it's worth exploring both how roles can be disaggregated in a converged protocol (A) and how roles can be disaggregated such that the protocol itself has a clear split between control and data planes (B).
Data plane–control plane separation
Before we look at the Kafka Replication Protocol, I’d like to review what the data plane and control plane are within the context of a log replication protocol. My post, Steady on! Separating Failure-Free Ordering from Fault-Tolerant Consensus also touches on this subject.
The data plane is typically the logic of performing replication in a stable, failure-free state (which I will call steady-state). Nothing bad is happening; the cluster is making progress in this steady-state. There might be minor failures, such as a transient connectivity issue between nodes, but nothing that requires a change to the configuration of the cluster. The data plane can be simple, as it only needs to implement failure-free ordering (see the Steady On! post for more on that).
The control plane is the logic that changes the steady-state configuration, either in reaction to failure or in response to commands from some kind of operator.
A failure of some kind might be preventing the steady-state configuration from making progress. In a leader-follower system, it might mean the leader has become unreachable, requiring a leader election to elect a new leader to ensure continued availability. It could mean the permanent failure of a server in the cluster, requiring a replacement. In either case, without some kind of change, the system is either unavailable or degraded in some way.
Other reasons can be to migrate workloads to avoid contention on the underlying hardware. In this case, a move may actually be the addition of a new member and the removal of an existing member.
Whatever the case, the control plane takes the system from one steady-state configuration to another.
Fig 1. Reconfiguration moves the system from one steady state configuration to another in response to failures, policy triggers or other factors such as load balancing.
In Raft, reconfiguration specifically refers to membership changes, but I will use the term reconfiguration in a broader sense to include any change to the configuration/topology of the replication cluster (including who the leader is). Who the leader is, is not a minor detail. In Raft, reconfiguration and data replication are mixed into the same underlying log, exercised by the same nodes, and with a tight dependence between them for correctness. The same goes for MultiPaxos, which is why I use the term converged protocol for both.
Kafka Replication Protocol–Splitting the planes
Separating the control plane from the data plane requires loose coupling. We want there to be little to no runtime dependency between the data plane and the control plane. In other words, if the control plane goes down, we want the data plane to continue to function. In such a scenario, the data plane should only cease to function if it exits a failure-free state, requiring intervention from the control plane. This is a general principle applied in many systems with control planes, such as Kubernetes. If the Kubernetes controller goes offline, we still want pods to function. Kafka, with KRaft, achieves this decoupling.
The roles and communication
The protocol is split into the following roles:
Controller (control plane)
Partition replicas (data plane)
Leader and followers
Fig 2. The roles in the Kafka Replication Protocol
The above is the logical model of the roles within the scope of a single partition. In reality, one Controller interacts with multiple partitions.
To avoid confusion, let’s differentiate between a cluster control plane and the partition replication control plane:
Cluster control plane: Changing security settings, configurations, adding/removing topics, etc.
Partition replication (aka Kafka Rep Protocol) control plane: Partition replica leader election, partition replica set membership changes, etc.
In Apache Kafka, both types of control plane are the responsibility of the KRaft Controller, a Raft-based replicated state machine (RSM). The fact that the KRaft Controller uses Raft is irrelevant to this discussion. The Kafka Replication Protocol only cares about the Controller role (not its implementation). We can say that the Controller state-machine is important to the Kafka Replication protocol, not the method of consensus the state-machine uses.
In the next sections we’ll cover the responsibilities of each role and the communications between them.
The data plane
The unit of replication in the Kafka Replication Protocol is the topic partition, which is a log. There is a logical model that consists of a leader replica and multiple follower replicas. All produce requests are sent to the leader replica. Follower replicas fetch (pull) records from the leader for redundancy and to be able to take on leadership should the current leader go offline or be removed.
Physical replication is actually performed at the broker-pair level. Each fetch request between a broker pair covers multiple partitions: the fetching broker includes every partition that has a follower replica on itself and a leader replica on the destination broker.
Fig 3. We usually talk and think about replication using the logical model, but the work is really carried out at the broker level using multiple fetch sessions for each broker pair.
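To make the broker-pair model a little more concrete, here is a minimal sketch (my own types and names, not Kafka's actual fetcher code) of how a broker could group the partitions it follows by the broker hosting each partition's leader, so that a single fetch per broker pair carries many partitions:

```java
import java.util.*;

// A hypothetical, simplified view of a broker grouping its follower
// partitions by the broker that hosts each partition's leader, so that
// one fetch request per broker pair can carry many partitions.
public class FetcherGrouping {

    record TopicPartition(String topic, int partition) {}

    public static void main(String[] args) {
        // partition -> broker id hosting the leader replica (illustrative data)
        Map<TopicPartition, Integer> leaderOf = Map.of(
                new TopicPartition("orders", 0), 1,
                new TopicPartition("orders", 1), 2,
                new TopicPartition("payments", 0), 1);

        // Group all partitions we follow by their leader's broker.
        Map<Integer, List<TopicPartition>> fetchBatches = new HashMap<>();
        leaderOf.forEach((tp, leaderBroker) ->
                fetchBatches.computeIfAbsent(leaderBroker, b -> new ArrayList<>()).add(tp));

        // Each entry corresponds to one fetch session between this broker
        // and the destination broker, covering multiple partitions.
        fetchBatches.forEach((broker, partitions) ->
                System.out.println("Fetch from broker " + broker + " for " + partitions));
    }
}
```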
A partition has a steady-state configuration of:
A replica set (the membership).
A leader replica.
Zero or more follower replicas.
A leader epoch.
A partition epoch.
Any change to the above configuration of a partition causes the partition epoch to be incremented. So, we can consider the period of a given partition epoch to be the fixed-topology failure-free steady state where the partition was making progress by itself. The leader epoch gets incremented on a leader change. All this configuration is determined by the control plane.
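As a minimal sketch of this configuration (the field names are mine, not Kafka's metadata record schema), you can picture the steady-state of a partition like this, with the partition epoch bumped on any change and the leader epoch bumped only on leader changes:

```java
import java.util.List;

// A minimal sketch (field names are mine, not Kafka's wire format) of the
// steady-state configuration that the control plane hands to a partition.
public class PartitionConfigSketch {

    record PartitionState(List<Integer> replicaSet, // the membership (broker ids)
                          int leader,               // current leader replica
                          int leaderEpoch,          // bumped on every leader change
                          int partitionEpoch) {     // bumped on ANY config change

        // Any change produces a new state with a higher partition epoch.
        PartitionState withNewLeader(int newLeader) {
            return new PartitionState(replicaSet, newLeader,
                    leaderEpoch + 1, partitionEpoch + 1);
        }

        PartitionState withReplicaSet(List<Integer> newReplicaSet) {
            return new PartitionState(newReplicaSet, leader,
                    leaderEpoch, partitionEpoch + 1);
        }
    }

    public static void main(String[] args) {
        PartitionState s = new PartitionState(List.of(1, 2, 3), 1, 5, 12);
        System.out.println(s.withNewLeader(2));                  // leaderEpoch=6, partitionEpoch=13
        System.out.println(s.withReplicaSet(List.of(1, 2, 4)));  // leaderEpoch=5, partitionEpoch=13
    }
}
```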
Partition replicas are unable to change the leader themselves or perform any kind of reconfiguration. As far as the leader goes, there is a fixed membership. As far as the followers go, there is a fixed leader.
The only thing the data plane must implement is the sending of and responding to fetch requests, piggybacking some key information in them, such as:
Key offsets such as the high watermark (commit index) and log stable offset (required for transactions).
The leader epoch (required for fencing stale followers and leaders).
Fencing is often a key component of replication protocols, and it is what makes the data plane consensus-correct. Fencing enforces the completion of one steady-state configuration, ensuring it cannot make progress once a new one has been created. Split-brain is essentially two steady-state configurations both making progress simultaneously, and fencing is a mechanism that exists inline with the data plane to prevent that. In Kafka, fencing is based on the partition and leader epochs.
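Here is a minimal sketch of epoch-based fencing on the fetch path, using hypothetical request/response types of my own (the real Kafka fetch protocol distinguishes stale from unknown epochs and carries much more): the leader rejects fetches carrying a different leader epoch and piggybacks the high watermark on successful responses.

```java
// A minimal sketch (my own types, not Kafka's actual request handling) of
// leader-epoch fencing on the fetch path: a leader rejects requests carrying
// a stale epoch, and piggybacks the high watermark on successful responses.
public class FencingSketch {

    record FetchRequest(String topic, int partition, long fetchOffset, int leaderEpoch) {}
    record FetchResponse(boolean fencedLeaderEpoch, long highWatermark) {}

    static final int CURRENT_LEADER_EPOCH = 7;  // assumed current epoch for the example
    static final long HIGH_WATERMARK = 1042L;   // assumed commit index for the example

    static FetchResponse handleFetch(FetchRequest req) {
        if (req.leaderEpoch() != CURRENT_LEADER_EPOCH) {
            // The fetcher (or this leader) is operating on a stale configuration;
            // fencing prevents an old steady state from making further progress.
            return new FetchResponse(true, -1L);
        }
        return new FetchResponse(false, HIGH_WATERMARK);
    }

    public static void main(String[] args) {
        System.out.println(handleFetch(new FetchRequest("orders", 0, 1000L, 7))); // accepted
        System.out.println(handleFetch(new FetchRequest("orders", 0, 1000L, 6))); // fenced
    }
}
```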
The control plane
The controller has a number of duties:
Broker failure detection.
Partition replica leader changes. I won’t use the term election, as it is not based on voting. The controller is the benevolent dictator.
Partition replica membership changes.
The Controller does not know anything about the partition high watermarks, stable offsets, log end offsets, or any other data plane responsibilities.
There is two-way communication between the control plane and data plane:
Controller->Broker (partition replicas). Broker metadata log learners.
The controller performs partition configuration changes (basically, it performs metadata ops).
Each broker acts as a KRaft learner. When a broker learns of a partition metadata op for a local partition, it passes that op to the partition to act on. The op may require the broker to first create a local replica and then pass the metadata on to it, or even to delete a local partition.
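A minimal sketch of that learner behaviour might look like the following, with hypothetical op and field names of my own rather than Kafka's real metadata record types:

```java
import java.util.*;

// A minimal sketch (my own op names, not Kafka's actual metadata record types)
// of a broker consuming partition metadata ops it learns from the metadata log
// and applying them to its local replicas.
public class MetadataLearnerSketch {

    record PartitionChangeOp(String topic, int partition,
                             List<Integer> replicaSet, int leader,
                             int leaderEpoch, int partitionEpoch) {}

    static final int LOCAL_BROKER_ID = 2; // assumed id of this broker
    static final Map<String, PartitionChangeOp> localReplicas = new HashMap<>();

    static void apply(PartitionChangeOp op) {
        String key = op.topic() + "-" + op.partition();
        if (op.replicaSet().contains(LOCAL_BROKER_ID)) {
            // Create the local replica if needed, then hand it the new config.
            localReplicas.put(key, op);
            System.out.println("Applied config to " + key + ": " + op);
        } else if (localReplicas.remove(key) != null) {
            // This broker is no longer in the replica set: delete local state.
            System.out.println("Deleted local replica for " + key);
        }
    }

    public static void main(String[] args) {
        apply(new PartitionChangeOp("orders", 0, List.of(1, 2, 3), 1, 5, 12));
        apply(new PartitionChangeOp("orders", 0, List.of(1, 3, 4), 1, 5, 13));
    }
}
```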
Broker (leader replicas)->Controller. Leader notifications.
Partition replica leaders notify the Controller when a follower replica becomes in-sync or falls out-of-sync. This allows the Controller to maintain the In-Sync-Replicas set (ISR) for each topic partition. The ISR is basically a leader candidate pool for the Controller to pick from (one per partition).
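The leader-side decision could be sketched roughly like this (my own names and a simplified lag rule; in Kafka this corresponds loosely to the leader sending an AlterPartition request, and the real in-sync criteria are more involved):

```java
import java.util.*;

// A minimal sketch of a leader deciding a follower has fallen out of sync
// and proposing a new ISR to the Controller. Names and the lag rule are mine.
public class IsrNotificationSketch {

    static final long MAX_LAG_MS = 30_000; // assumed replica lag threshold

    record FollowerProgress(int brokerId, long lastCaughtUpTimeMs) {}

    // Stand-in for the RPC that tells the Controller about an ISR change.
    static void notifyController(Set<Integer> proposedIsr) {
        System.out.println("Propose new ISR to Controller: " + proposedIsr);
    }

    static void checkIsr(Set<Integer> currentIsr, List<FollowerProgress> followers, long nowMs) {
        Set<Integer> newIsr = new HashSet<>(currentIsr);
        for (FollowerProgress f : followers) {
            if (nowMs - f.lastCaughtUpTimeMs() > MAX_LAG_MS) {
                newIsr.remove(f.brokerId()); // follower has fallen behind
            }
        }
        if (!newIsr.equals(currentIsr)) {
            notifyController(newIsr);
        }
    }

    public static void main(String[] args) {
        checkIsr(new HashSet<>(Set.of(1, 2, 3)),
                List.of(new FollowerProgress(2, 0L), new FollowerProgress(3, 95_000L)),
                100_000L); // proposes ISR [1, 3]
    }
}
```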
Fig 4. The control plane ←→ data plane communication.
Failure detection is handled via a heartbeat mechanism between brokers and the Controller, as well as some extra stuff like unclean failure detection on broker start-up.
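A minimal sketch of the heartbeat side of this might look like the following (the timeout value and names are mine, not Kafka's configuration):

```java
import java.util.*;

// A minimal sketch (my own names and timeout value) of the Controller tracking
// broker liveness via heartbeats and identifying brokers whose session expired.
public class HeartbeatSketch {

    static final long SESSION_TIMEOUT_MS = 9_000; // assumed broker session timeout

    static final Map<Integer, Long> lastHeartbeatMs = new HashMap<>();

    static void onHeartbeat(int brokerId, long nowMs) {
        lastHeartbeatMs.put(brokerId, nowMs);
    }

    static List<Integer> expiredBrokers(long nowMs) {
        List<Integer> expired = new ArrayList<>();
        lastHeartbeatMs.forEach((broker, ts) -> {
            if (nowMs - ts > SESSION_TIMEOUT_MS) expired.add(broker);
        });
        return expired; // these brokers would be fenced, triggering leader changes
    }

    public static void main(String[] args) {
        onHeartbeat(1, 0L);
        onHeartbeat(2, 6_000L);
        System.out.println(expiredBrokers(10_000L)); // [1]
    }
}
```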
Leader changes of a topic partition consist of the Controller picking any replica in the ISR. This is a metadata op, applied to the KRaft log, and therefore the replicas learn of this change via their host broker acting as a KRaft log learner. This way, replicas learn of their role and the partition's current configuration.
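Sketching that leader change (with my own types; the real metadata op also deals with the ISR and other fields), the Controller's decision is essentially: pick any live ISR member and bump the epochs.

```java
import java.util.*;

// A minimal sketch (not the Controller's real code) of a leader change:
// pick any live replica from the ISR, bump the epochs, and emit the result
// as a metadata op for the brokers to learn from the KRaft log.
public class LeaderChangeSketch {

    record PartitionChangeOp(List<Integer> replicaSet, List<Integer> isr,
                             int leader, int leaderEpoch, int partitionEpoch) {}

    static Optional<PartitionChangeOp> electLeader(PartitionChangeOp current,
                                                   Set<Integer> liveBrokers) {
        return current.isr().stream()
                .filter(liveBrokers::contains)
                .findFirst() // any ISR member is safe to pick; no voting involved
                .map(newLeader -> new PartitionChangeOp(
                        current.replicaSet(), current.isr(), newLeader,
                        current.leaderEpoch() + 1, current.partitionEpoch() + 1));
    }

    public static void main(String[] args) {
        PartitionChangeOp current =
                new PartitionChangeOp(List.of(1, 2, 3), List.of(1, 2), 1, 5, 12);
        // Broker 1 has failed, so only brokers 2 and 3 are live.
        System.out.println(electLeader(current, Set.of(2, 3)));
    }
}
```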
Replica set membership changes are relatively simple too: just a carefully controlled series of metadata ops. In general, they follow a grow->catch-up->shrink approach (a sketch follows Fig 5 below). For example, when replacing one replica with another, the Controller does the following:
Grow phase:
The Controller adds a new replica to the partition replica set (a metadata op), and all replicas learn of this change.
The broker of the added replica creates the local replica, which starts fetching from the leader.
Catch-up phase:
The Controller waits to be notified that the new replica has caught up to the leader.
Shrink phase:
The Controller receives the notification that the new replica has caught up. It removes the original replica from the replica set (a metadata op).
All replicas learn of the final partition state via the KRaft log. The broker of the removed replica deletes all local state for that partition.
Membership changes can be arbitrary: add one, remove one, replace one, or even replace the whole set with a new one. The logic in the Controller is the same. It starts with growth, then once enough of the final replica set is caught up, it moves to the shrink phase, and finally marks the change complete. The data plane has no concept of membership changes at all; it just does what it does, given the configuration handed to it by the Controller.
Fig 5. Membership changes are executed as a linear series of metadata ops, ensuring minimum ISR size for safety.
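Here is the sketch mentioned above: a minimal illustration (my own types and phase functions, not the Controller's real code) of replacing replica 3 with replica 4 via grow, catch-up, and shrink, where each step would be a metadata op committed to the KRaft log.

```java
import java.util.*;

// A minimal sketch (phase names and types are mine) of the grow -> catch-up ->
// shrink flow for replacing replica 3 with replica 4.
public class ReassignmentSketch {

    record PartitionState(List<Integer> replicaSet, List<Integer> isr,
                          int leader, int partitionEpoch) {}

    // Grow: add the new replica to the replica set.
    static PartitionState grow(PartitionState s, int newReplica) {
        List<Integer> grown = new ArrayList<>(s.replicaSet());
        grown.add(newReplica);
        return new PartitionState(grown, s.isr(), s.leader(), s.partitionEpoch() + 1);
    }

    // Shrink: once the new replica is in the ISR (caught up), drop the old one.
    static PartitionState shrink(PartitionState s, int oldReplica) {
        List<Integer> shrunk = new ArrayList<>(s.replicaSet());
        shrunk.remove(Integer.valueOf(oldReplica));
        List<Integer> isr = new ArrayList<>(s.isr());
        isr.remove(Integer.valueOf(oldReplica));
        return new PartitionState(shrunk, isr, s.leader(), s.partitionEpoch() + 1);
    }

    public static void main(String[] args) {
        PartitionState s = new PartitionState(List.of(1, 2, 3), List.of(1, 2, 3), 1, 12);
        s = grow(s, 4);
        System.out.println("After grow:   " + s);
        // ... catch-up phase: the leader later notifies the Controller that 4 is
        // in-sync, so the ISR would grow to [1, 2, 3, 4] (not modelled here) ...
        s = shrink(s, 3);
        System.out.println("After shrink: " + s);
    }
}
```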
Reflection on this kind of separation
Separating the control plane from data plane certainly has some benefits:
Each plane is simpler than a single combined one would be. The separation creates clear boundaries that make the system more maintainable and evolvable.
The data plane must only implement failure-free ordering (with some key correctness stuff like epoch-based fencing).
The control plane must implement fault-tolerant consensus (via SMR), but the state machine logic can be a single-threaded event loop, acting on a linearized stream of notifications and commands (see the sketch after this list).
There are fewer brain-breaking interleavings of actions to contend with. With Raft, for example, there are a few tricky areas, such as how a leader may not actually be a member of the cluster as it is being removed, but it has to stick around, as a non-member leader, in order to commit the change that removes it. Stuff like that can just hurt your brain. By separating things out into separate roles, we make the logic simpler and easier to understand.
The two planes can be evolved separately. One example of this is KIP-966, which adds an additional recovery mechanism to the protocol to handle unclean/abrupt shutdowns with relaxed fsync modes. The data plane basically didn’t change at all. Likewise, when optimizations were made to the broker-to-broker fetching, it did not affect any control plane logic whatsoever.
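Here is the sketch referred to above: a single-threaded control-plane event loop consuming a linearized stream of notifications (the event types are mine, not Kafka's Controller code).

```java
import java.util.concurrent.*;

// A minimal sketch (my own event types) of a control-plane state machine as a
// single-threaded event loop consuming a linearized stream of notifications
// and commands, with no locking needed inside the handlers.
public class ControllerEventLoopSketch {

    sealed interface Event permits BrokerHeartbeat, IsrChangeNotification {}
    record BrokerHeartbeat(int brokerId) implements Event {}
    record IsrChangeNotification(String topicPartition, java.util.List<Integer> isr) implements Event {}

    static final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        queue.add(new BrokerHeartbeat(1));
        queue.add(new IsrChangeNotification("orders-0", java.util.List.of(1, 2)));

        // A single thread applies events in order; the state machine never races.
        for (int i = 0; i < 2; i++) {
            Event e = queue.take();
            switch (e) {
                case BrokerHeartbeat hb -> System.out.println("Heartbeat from broker " + hb.brokerId());
                case IsrChangeNotification n -> System.out.println("New ISR for " + n.topicPartition() + ": " + n.isr());
            }
        }
    }
}
```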
Going further
I’ve only scratched the surface of the Kafka Replication Protocol. If you want to understand it in excruciating detail, then check out:
My Kafka Replication Protocol description (aims for a similar level of understandability as the Raft paper).
My TLA+ specification, which is admittedly quite big and complex but is faithful to pretty much the whole protocol, including reconfiguration.
There's still room for further disaggregation. If we were to map Kafka onto the Paxos roles, we would see that the data plane and control plane each use their own set of those roles. The data plane co-locates all three roles in the partition replica. Likewise, KRaft, which underpins the KRaft Controller, does the same. We've disaggregated the protocol into control and data planes, but each plane is converged in terms of the Paxos roles.
We can definitely go a lot further in terms of disaggregation. We can separate the Paxos roles in the data plane, do away with the distinguished proposer altogether, and segment the log, among other things.