This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or just when the log itself is the product (such as Kafka). So far I’ve been looking at Virtual Consensus, but now I’m going to widen the view to look at how log replication protocols can be disaggregated in general (there are many ways). In the next post, I’ll do a survey of log replication systems in terms of the types of disaggregation described in this post.
Prior posts:
As many people have said before, everything is a log. Replicated logs often serve as the foundational mechanism for achieving consensus in distributed data systems by providing an ordered, consistent sequence of operations that all servers can agree upon. This allows each server to maintain identical state despite operating independently across a network. The most well-known form of this approach is State Machine Replication (SMR).
In state machine replication, distributed servers maintain consistency by applying the same sequence of operations from a replicated log, in the same order, operating as deterministic state machines. Each state machine server applies the operations from the log to maintain local state, such as an embedded database (e.g. RocksDB or Pebble), a queue, or any other data representation. The most famous converged SMR design is Raft.
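To make this concrete, here is a minimal sketch of the SMR apply loop, in Go, with hypothetical types rather than any particular implementation: a deterministic state machine consuming committed log entries in order.

```go
// A minimal sketch of the state-machine-replication apply loop.
// The types and names here are illustrative, not from any real system.
package smr

type Entry struct {
	Index   uint64 // position in the replicated log
	Command []byte // opaque command agreed on via consensus
}

// StateMachine is any deterministic component that applies commands.
type StateMachine interface {
	Apply(cmd []byte) // must be deterministic: same input order => same state
}

// ApplyCommitted feeds committed entries to the state machine in log order.
// Every replica runs this loop over the same log and so converges to the
// same state.
func ApplyCommitted(sm StateMachine, committed <-chan Entry) {
	for e := range committed {
		sm.Apply(e.Command)
	}
}
```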
The converged SMR protocol
For a long time, Raft has been the default safe choice for teams wanting to implement SMR. This is largely because Raft includes important things like leader elections, reconfiguration, and snapshotting in the protocol. Paxos came onto the scene first, but it initially left each team to invent some of these additional pieces themselves. The first version of Paxos is known as single-decree Paxos, because it only covers one round of consensus to agree on a single value. MultiPaxos came later to optimize Paxos for replicating a sequence of values. Then Raft was published: a clearly described, all-in-one (and the kitchen sink) protocol that left little room for interpretation (which was helpful). Teams still tweak Raft to fit their needs, but the core design usually remains the same.
We describe Raft as a unified, or converged, protocol because it combines the data and control planes into one tightly coupled protocol:
Each Raft server implements the whole protocol. Each server participates as a peer in consensus, replication, and storage.
Replication cannot be separated from consensus; they are woven together.
The replicated log mixes data plane and control plane entries.
Using the terminology of my last post, it mixes failure-free ordering and fault-tolerant consensus into one protocol. In other words, the data plane is inextricably linked to the control plane. As I explained in my Virtual Consensus in Delos post, this approach has some drawbacks:
High coupling: Because Raft combines control plane concerns (leader elections, reconfiguration) with data plane concerns (ordering and replication), the data plane cannot evolve without also impacting the control plane. The control plane is even integrated into the log itself, with a mix of data and control entries. Changes to one plane necessitate changes to the other.
More complexity: The protocol is large, with many operations that can interleave in complex ways.
Less flexibility: We cannot, for example, independently scale consensus and replication, or use erasure coding for the data plane instead of replication (I’m unconvinced by CRaft).
Raft also has a monolithic log. That is, the log is a single logical sequence of entries. So we can classify Raft as a unified protocol over a monolithic log. Sorry to pick on Raft again, but everyone knows Raft, so it’s the best case study for a converged protocol.
The question is, how can we disaggregate a monolithic log replication protocol such as Raft?
Breaking apart the monolith
There are various ways of breaking apart this monolithic approach. These are the ones I can think of, though there may be more.
(A) Disaggregated participants. The participants form clearly delineated roles that can be run as separate processes. The protocol itself may still be converged (control plane and data plane intertwined) but exercised by disaggregated components.
(B) Disaggregated protocol that separates the control plane from the data plane. The protocol itself is broken up into control plane and data plane duties, with clean and limited interaction between them.
(C) Segmented logs. Logs are formed from a chain of logical log segments.
(D) Pointer-based logs. Separates data from ordering.
(E) Separating ordering from IO. Avoid the need for a leader by allowing all nodes to write, but coordinate via a sequencer component.
(F) Leaderless proxies. Abstracts the specifics of the replicated log protocol from clients, via another disaggregated abstraction layer of leaderless proxies.
I’ve labeled these A-F as I’ll be referring back to them in a survey of log replication protocols. It’s possible to combine several of them in one system.
(A) Disaggregated participants
Paxos made a fundamental contribution to distributed consensus by formalizing the responsibilities of reaching consensus and acting on the agreed values into distinct roles. Paxos separates the consensus protocol into proposers who drive consensus by proposing values to acceptors, acceptors who form the quorums necessary for reaching agreement (consensus), and learners who need to know the decided values. This creates a clear framework that allows system designers to reason about each role's responsibilities independently while ensuring their interaction maintains safety and liveness properties. The formalization of these roles has influenced the design of practical systems and protocols for decades, even when they don't strictly adhere to the original Paxos model. This cannot be overstated.
The original Paxos is known as single-decree Paxos as it is an algorithm for agreeing on a single value. MultiPaxos (or multi-decree Paxos) came later to allow a group of servers to agree on a sequence of values (aka a log). MultiPaxos added optimizations for a replicated log algorithm, such as stable leadership (a distinguished proposer), reconfiguration, and so on.
Fig 1. The Paxos roles, seen here in the context of MultiPaxos. Note that the number of participants in each role is not limited to this depiction.
MultiPaxos can be organized just like Raft, with the Proposers, Acceptors, and Learners co-located on equal peer servers. However, you can also separate the three roles onto different servers.
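Here is a rough sketch (with hypothetical Go interfaces, not taken from any particular Paxos implementation) of how the three roles can be modeled as separate components; whether they run in one process or three is a deployment choice.

```go
// Hypothetical interfaces for the three Paxos roles. Each could be a
// library inside one server (Raft-style co-location) or its own process.
package paxos

type Ballot uint64

// Proposer drives consensus for a log slot by running the two Paxos
// phases against a quorum of acceptors.
type Proposer interface {
	Propose(slot uint64, value []byte) error
}

// Acceptor holds the durable consensus state and answers prepare/accept
// requests; a quorum of acceptors decides each value.
type Acceptor interface {
	Prepare(slot uint64, ballot Ballot) (promised Ballot, accepted []byte, err error)
	Accept(slot uint64, ballot Ballot, value []byte) error
}

// Learner is told about decided values so it can store or apply them.
type Learner interface {
	Learn(slot uint64, value []byte)
}
```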
Because all participants of MultiPaxos exercise both the data and control planes (and mix both into the same log), MultiPaxos is an example of a converged protocol, but with potentially disaggregated participants.
(B) Disaggregated protocol that separates the control plane from the data plane
Where Raft and MultiPaxos mix the control plane and data plane into one unified protocol, others separate the protocol itself so that the work of replicating log data entries is only loosely coupled to the other work, such as leader elections, membership changes, and so on.
The two planes must interact, and how loosely coupled these two planes are will impact the overall complexity of the broader protocol.
Data plane: Log ordering, replication and storage.
Control plane: Failure detection, leader election, and membership changes.
Fig 2. The protocol is split into control plane and data plane.
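As a rough illustration of what a loose coupling could look like, here are two hypothetical Go interfaces (not from any real system): the data plane exposes a narrow append/read/seal surface, while the control plane handles failure detection, elections, and membership, touching the data plane only through that surface.

```go
// A hypothetical split of duties into two narrow interfaces. The control
// plane only touches the data plane via Seal and reconfiguration, keeping
// the coupling between the two planes small.
package planes

// DataPlane: ordering, replication and storage of log entries.
type DataPlane interface {
	Append(entry []byte) (offset uint64, err error)
	Read(offset uint64) ([]byte, error)
	Seal() (lastOffset uint64, err error) // stop accepting writes
}

// ControlPlane: failure detection, leader election and membership changes.
type ControlPlane interface {
	ReportFailure(nodeID string)
	ElectLeader() (leaderID string, err error)
	Reconfigure(newMembers []string) error
}
```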
In the next post we’ll look at some real-world systems that separate concerns into control and data planes.
(C) Segmented logs
A segmented log is a log made up of a chain of log segments, where each log segment is itself a log. Typically, the main log has a virtual sequential address space (of offsets) that maps onto the sequence of log segments (and their sequential address space of offsets). Writes go to the end segment, known as the active segment. All segments prior to the active segment must be sealed (no longer able to accept writes). Reads are directed to a given segment based on the mapping of virtual address to log segment.
Fig 3. The log is formed from a chain of logical log segments.
Each log segment is independent of the others. They do not form a doubly linked list where each segment has pointers to navigate forward and backward. The log segment chain is stored as a sequence in metadata (requiring consensus).
Breaking up the continuous logical log into segments can provide a number of benefits, such as enhanced write availability, less data movement, and a cleaner abstraction for managing background work such as tiering and garbage collection. In some cases, each log segment represents a period of steady state (data plane), and each new segment represents a historical control plane intervention to route around a fault or respond to changing load conditions.
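Here is a minimal sketch (with illustrative Go types) of how a virtual offset can be routed to a segment; the segment chain itself would live in consensus-backed metadata.

```go
// Illustrative sketch of routing a virtual log offset to a segment.
// The segment chain (this metadata) would be stored in a consensus-backed
// metadata store, not computed locally.
package segmented

import "errors"

type Segment struct {
	BaseOffset uint64 // first virtual offset held by this segment
	Sealed     bool   // every segment before the active one is sealed
}

type Log struct {
	Segments []Segment // ordered by BaseOffset; the last one is the active segment
}

// Locate returns the index of the segment holding the given virtual offset.
func (l *Log) Locate(offset uint64) (int, error) {
	for i := len(l.Segments) - 1; i >= 0; i-- {
		if offset >= l.Segments[i].BaseOffset {
			return i, nil
		}
	}
	return 0, errors.New("offset precedes the first segment")
}

// Active returns the segment that accepts new writes (nil if the log is empty).
func (l *Log) Active() *Segment {
	if len(l.Segments) == 0 {
		return nil
	}
	return &l.Segments[len(l.Segments)-1]
}
```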
(D) Pointer-based logs
“Any problem in computer science can be solved by another layer of indirection”
Replicated logs have a sequential virtual address space that maps onto a physical address space where the complete log entries can be physically read from. Clients deal with logical addresses and storage servers map those logical addresses to physical addresses.
However, we can add another layer of indirection by having the physical log entries contain only metadata about each log entry, rather than the data payload itself. The metadata includes a pointer (or data reference) to where the data can be read from (another logical address, typically in a flat address space such as object storage). The pointer-based log is therefore a log of entry metadata, and its role is to provide the strict ordering guarantees and map the sequential addressing scheme of the log onto a flat, key-based address space such as file or object storage.
Fig 4. A log can store complete log entries, or can store only log entry metadata, pointing to another storage service that hosts the data payload.
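As a sketch, a pointer-based log entry and its two-step write path might look something like this (the types and field names are hypothetical, not any specific system's format):

```go
// Illustrative shape of a pointer-based log entry and its write path.
package pointerlog

import "hash/crc32"

// EntryMetadata is the small record stored in the log; the payload itself
// lives in a flat, key-addressed store.
type EntryMetadata struct {
	ObjectKey string // where the payload lives (e.g. an object storage key)
	Length    uint64 // payload size in bytes
	CRC       uint32 // integrity check over the payload
}

type ObjectStore interface {
	Put(payload []byte) (key string, err error)
}

type MetadataLog interface {
	Append(md EntryMetadata) (offset uint64, err error)
}

// Write sketches the two-step write: store the payload in the flat address
// space, then append a pointer to the log, so the log alone decides
// ordering while payload storage scales independently.
func Write(store ObjectStore, log MetadataLog, payload []byte) (uint64, error) {
	key, err := store.Put(payload)
	if err != nil {
		return 0, err
	}
	return log.Append(EntryMetadata{
		ObjectKey: key,
		Length:    uint64(len(payload)),
		CRC:       crc32.ChecksumIEEE(payload),
	})
}
```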
Warpstream calls this “separating data from metadata”, though I find the word metadata too vague, and the practice actually creates extra metadata in order to make the separation work. I prefer to think of it simply in terms of separating storage from ordering.
Fig 5. Data is separated from ordering.
By placing data payloads in the flat address space and the relatively tiny data references in the sequential address space, we can obtain the benefits of sequential addressing for ordering along with the scalability of flat address spaces for data storage. The key insight is that a sequential address space requires coordination to maintain order, so minimizing its size is beneficial (coordination can be costly). At the same time, flat address spaces allow each piece of data to be managed independently, making the storage layer simpler and more flexible. This independence enables better horizontal scaling and parallel operations.
Of course, this type of scalability also comes with some costs:
The flat address space loses the benefits of the large sequential reads/writes that are possible when the log data is laid out sequentially on disk. As I have previously written, even modern SSDs prefer sequential writes.
Higher latencies, as reads and writes require two operations, and flat address space storage may not provide the same level of read/write performance; object storage in particular has higher latency.
This separation of storage from ordering is not a universally better approach. It all depends on the log workload.
(E) Separating ordering from IO
Most log replication protocols are leader-based. The leader proposes the entries to be replicated, determining the order of entries in the log. All clients must direct their writes to this leader, which naturally creates a scalability bottleneck.
However, we can replace the idea of a leader with that of a sequencer. The sequencer only has to hand out 64-bit integers corresponding to positions in the log; it performs no other duty. A set of leaderless servers can perform the majority of the work, handling client requests and performing the IO, with ordering coming from the sequencer-issued sequence numbers.
A mapping of virtual log positions to storage is required, so that:
Servers know where to write the entries that correspond to the sequence numbers they obtain.
Servers know where to read an entry from for a given sequence number.
One final wrinkle with this sequencer approach is that a server can obtain a sequence number but never perform the write, leaving holes in the log. A mechanism for filling in these holes or skipping them is required.
Fig 6. Leaderless servers perform the IO and ordering is achieved by a simple sequencer component.
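A minimal sketch of the write path follows (with a hypothetical in-process sequencer; a real system would make the sequencer a fault-tolerant network service): any server obtains the next position, then performs the IO itself. A crash between the two steps leaves a hole.

```go
// Illustrative sequencer-based write path. The sequencer's only job is to
// hand out monotonically increasing log positions; leaderless servers do
// the IO themselves.
package seq

import "sync/atomic"

// Sequencer hands out the next position in the log. In a real system this
// would be a separate (and replaceable) service, not an in-process counter.
type Sequencer struct {
	next atomic.Uint64
}

func (s *Sequencer) Next() uint64 {
	return s.next.Add(1) - 1 // positions start at 0
}

// Storage maps a log position to wherever that entry should be written,
// per the virtual-position-to-storage mapping described above.
type Storage interface {
	WriteAt(position uint64, entry []byte) error
}

// Append is run by any server: obtain a position, then perform the IO.
// If the server fails between the two steps, the position becomes a hole
// that must later be filled (e.g. with a no-op) or skipped.
func Append(s *Sequencer, store Storage, entry []byte) (uint64, error) {
	pos := s.Next()
	if err := store.WriteAt(pos, entry); err != nil {
		return 0, err
	}
	return pos, nil
}
```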
There are various strategies for dealing with the failure of the sequencer. I’ll cover this subject in the follow-up survey.
(F) Leaderless proxies
Enter another layer of indirection.
Leaderless proxies abstract many aspects of the log protocol from clients, including having to learn which server is the leader. Clients simply connect to a stable endpoint to publish to and consume from the log.
Fig 7. Leaderless proxies can hide leader-based log replication protocols.
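As a sketch, the client-facing surface of such a proxy layer might be as simple as this (a hypothetical interface, not any particular product's API): clients see a stable publish/consume endpoint and never learn about leaders, terms, or segments.

```go
// Hypothetical client-facing proxy surface. Clients talk to any proxy
// instance; the proxy hides which server currently leads the underlying
// replicated log, and handles retries across leader changes internally.
package proxy

type Record struct {
	Key   []byte
	Value []byte
}

// LogProxy exposes no notion of leaders, terms, or segments.
type LogProxy interface {
	Publish(log string, rec Record) (offset uint64, err error)
	Consume(log string, fromOffset uint64, max int) ([]Record, error)
}
```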
This can be combined with different types of logs, from the standard inline continuous log to a pointer-based log, where proxies use object storage for payloads and then perform writes to the log with the address of the payload. Likewise, consumer proxies consume the log and download the payloads before passing them to consumer clients.
Fig 8. Data and ordering separated.
So many replication protocols…
There are no doubt other ways of disaggregating things.
In the next post, we’ll survey the log replication systems out there in terms of the A-F disaggregation classifications. It is theoretically possible for a system to implement all of them!