In this latest post of the disaggregated log replication survey, we’re going to look at the Apache BookKeeper Replication Protocol and how it is used by Apache Pulsar to form topic partitions.
Raft blends the roles and responsibilities into one monolithic protocol, MultiPaxos separates the monolithic protocol into separate roles, and Apache Kafka separates the protocol and roles into control-plane/data-plane. How do Pulsar and BookKeeper divide and conquer the duties of log replication? Let’s find out.
Preamble
There are so many systems out there, too many for me to list without this becoming a huge research project. So I’m going to stick to systems I have directly been involved with, or have a decent level of knowledge about already. Feel free to comment on social media about interesting systems that I haven’t included.
I have classified a few ways of breaking apart a monolithic replication protocol such as Raft. The classifications are:
(A) Disaggregated roles/participants. The participants form clearly delineated roles that can be run as separate processes. The protocol itself may still be converged (control plane and data plane intertwined) but exercised by disaggregated components.
(B) Disaggregated protocol that separates the control plane from the data plane. The protocol itself is broken up into control plane and data plane duties, with clean and limited interaction between them.
(C) Segmented logs. Logs are formed from a chain of logical log segments.
(D) Pointer-based logs. Separates data from ordering.
(E) Separating ordering from IO. Avoid the need for a leader by allowing all nodes to write, but coordinate via a sequencer component.
(F) Leaderless proxies. Abstracts the specifics of the replicated log protocol from clients, via another disaggregated abstraction layer of leaderless proxies.
Pulsar + BookKeeper
Disaggregation categories:
(A) Disaggregated roles/participants
(B) Disaggregated protocol that separates the control plane from the data plane.
(C) Segmented logs.
BookKeeper is a low-level distributed log service, and perhaps its most well-known use is as the log storage sub-system of Apache Pulsar (a higher-level distributed log service akin to Apache Kafka).
First, a reminder of Virtual Consensus
Pulsar and BookKeeper align quite well with the Virtual Consensus abstractions. Virtual Consensus is a set of abstractions for building a segmented log. Let’s review those abstractions, then we’ll look at BookKeeper, and how Pulsar uses BookKeeper with these abstractions in mind.
In Virtual Consensus we have two abstractions, the VirtualLog, and the Loglet. The Loglet is the log segment abstraction that can be chained to form segmented logs.
Fig 1. The shared log, formed by a chain of loglets.
The Loglet must implement fast failure-free ordering alone. That means it doesn’t need to handle every failure: if it encounters a problem, it can throw up its hands and notify the VirtualLog. The Loglet therefore doesn’t need to be tolerant of all types of fault; it only needs to be a simple, fast data plane. No leader elections or reconfigurations are required in the Loglet abstraction. I wrote about failure-free ordering vs fault-tolerant consensus previously, which goes into this in more detail.
If a Loglet does become degraded, the Virtual Log seals the Loglet (to prevent further writes), and appends a new Loglet–known as log reconfiguration. The idea is that each Loglet represents a period of steady state where everything was going well.
Fig 2. Reconfiguration moves the system from one steady state configuration to another in response to failures, policy triggers or other factors such as load balancing.
Through that lens, in Virtual Consensus, the Loglet forms the data plane, and the VirtualLog acts as the control plane.
Fig 3. Depicts a quorum replication Loglet implementation. Loglet 3 experiences a failed server, notifies the VirtualLog which seals Loglet 3 and then appends Loglet 4 with a new set of storage servers (without the failed one).
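To make the Loglet abstraction concrete, here is a hypothetical Java interface, loosely inspired by the Loglet API described in the Virtual Consensus paper. The names and signatures below are my own sketch, not a published API.

```java
// A hypothetical Loglet interface. Names and signatures are my own sketch.
public interface Loglet {

    // Append an entry and return its position within this Loglet.
    // Fails if the Loglet has been sealed.
    long append(byte[] entry) throws LogletSealedException;

    // Read the entry at the given position (if it has been committed).
    byte[] read(long position);

    // Return the current tail position and whether the Loglet is sealed.
    TailStatus checkTail();

    // Seal the Loglet so that no further appends can succeed. Called by the
    // VirtualLog as the first step of a reconfiguration.
    void seal();

    record TailStatus(long tailPosition, boolean sealed) {}

    class LogletSealedException extends Exception {}
}
```

The key point is what the interface does not contain: there is no leader election and no membership change. Reconfiguration lives entirely in the VirtualLog.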
Virtual Consensus separates the planes using abstractions. However, Virtual Consensus also has roles (that run as separate processes):
Client (drives data plane and control plane)
Logic within the client is separated through modularity:
Log reconfiguration is generic, but relies on a Loglet module for sealing and creating new Loglets. It also relies on a metadata store for consensus.
Reads and writes are routed to specific Loglets via a Loglet module.
Loglet storage (data plane storage)
Metastore (control plane storage)
Fig 4. The three disaggregated roles of Virtual Consensus (client, Loglet storage, metastore).
Pulsar and BookKeeper, through the Virtual Consensus lens
Pulsar/BookKeeper roles and abstractions line up with Virtual Consensus as follows:
Abstractions:
Loglet: BookKeeper is a log segment storage service. Log segments are known as ledgers. The Loglet abstraction aligns well with the BookKeeper ledger.
VirtualLog: Pulsar has a module called ManagedLedger that equates to the VirtualLog. Its job is to create, chain, and seal ledgers to form topic partitions. Reads and writes go through the ManagedLedger, which directs them to the appropriate ledger.
Roles:
Client: The Pulsar broker code has a topic partition class that equates to the Virtual Consensus client as a whole.
Loglet module: The BookKeeper client equates to the Loglet module.
Loglet storage: BookKeeper servers, known as bookies, store the ledger data.
Fig 5. Pulsar brokers, using BookKeeper as their log storage sub-system. All components depend on a common metadata store.
Control plane–ManagedLedger
The responsibilities of the ManagedLedger module line up well with the VirtualLog:
Providing a single address space for the topic partition. An address (or position) is a composite id made up of the ledger id and the entry id.
Acting as a router for append and read commands, forwarding each command to the appropriate ledger. Writes always go to the active ledger.
Chaining ledgers to form an ordered log.
Reconfiguring the log under failure scenarios for fault-tolerance and high availability.
The ManagedLedger module only requires a versioned register for consistency, in case multiple brokers try to chain new ledgers onto the same topic partition. The ManagedLedger module depends on an external leader election capability that ensures that each partition is only served by a single broker at a time. If two brokers both try to serve the same topic partition at the same time, they will duel each other by closing each other’s active ledgers and trying to append new ones. The versioned register is the means of maintaining consistency under such a broker battle.
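As a sketch of what that versioned register looks like in practice, here is a minimal conditional-put example, assuming ZooKeeper as the metadata store. The znode path and payload handling are placeholders; Pulsar’s actual metadata code differs.

```java
// A minimal sketch of the versioned-register pattern, assuming ZooKeeper as the
// metadata store. The znode path and payload are placeholders.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VersionedRegisterExample {

    // Try to write an updated ledger list for the partition, but only if no other
    // broker has modified the metadata since we last read it.
    static boolean tryChainLedger(ZooKeeper zk, String path, byte[] newLedgerList)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        zk.getData(path, false, stat); // read the current metadata and its version
        try {
            // Conditional put: fails with BadVersionException if another broker
            // (a duelling owner) has updated the znode in the meantime.
            zk.setData(path, newLedgerList, stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false; // lost the race; re-read the metadata before deciding to retry
        }
    }
}
```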
The data plane logic is more nuanced, and we’ll look at that next.
Data plane–BookKeeper ledgers
A BookKeeper ledger is a completely independent log segment as far as BookKeeper knows. BookKeeper is only concerned with:
Creating ledgers.
Reading/writing entries to ledgers.
Sealing ledgers (also known as closing).
It knows nothing of chaining. It is firmly a Loglet implementation.
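As an illustration of how closely the two abstractions line up, here is a sketch (my own mapping, not real BookKeeper or Delos code) of the ledger operations above expressed in terms of the hypothetical Loglet interface from earlier; error handling and asynchronous variants are omitted.

```java
// An illustrative mapping of BookKeeper ledger operations onto Loglet duties.
// Uses the classic synchronous LedgerHandle API; signatures may vary by version.
import org.apache.bookkeeper.client.LedgerHandle;

public class LedgerLoglet {

    private final LedgerHandle lh;

    LedgerLoglet(LedgerHandle lh) {
        this.lh = lh;
    }

    // Loglet append -> LedgerHandle.addEntry (only the creating client does this).
    long append(byte[] entry) throws Exception {
        return lh.addEntry(entry);
    }

    // Loglet read -> LedgerHandle.readEntries.
    byte[] read(long entryId) throws Exception {
        return lh.readEntries(entryId, entryId).nextElement().getEntry();
    }

    // Loglet checkTail -> the last add confirmed (the committed tail of the ledger).
    long checkTail() {
        return lh.getLastAddConfirmed();
    }

    // Loglet seal -> closing the ledger (by the writer) or recovering it (by another client).
    void seal() throws Exception {
        lh.close();
    }
}
```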
The BookKeeper replication protocol is formed by the following roles, which are typically disaggregated:
BookKeeper client (Loglet module):
Drives the ledger replication protocol: reading/writing entries on remote storage servers (bookies), fencing ledgers, and updating ledger metadata.
Bookies (Loglet storage):
Simple storage servers, with a fencing capability.
Leaderless, with no replication between bookies during normal write operations. Replication is performed by the client doing quorum writes.
Metadata store:
Versioned registers for storing ledger metadata.
Live bookie tracking (via ZK sessions).
Fig 6. The BookKeeper roles of client, bookie and metadata store.
A ledger is a log segment, as already covered. But each ledger is also, itself, a sequence of log segments! BookKeeper uses the term fragment for the ledger segment. A ledger is formed by one or more fragments, where each fragment is a bounded log. Each ledger is therefore a log of logs, which makes a Pulsar topic partition a log of logs of logs!
Fig 7. Depicts the metadata that represents a Pulsar topic partition, and the final ledger, which has three fragments. The range of the final ledger and final fragment is not set in metadata yet, as it is still open and accepting writes.
Generically, to form a chain of log segments, only the last segment should be writable, and all prior ones must be sealed to guarantee that no one can write into the middle of the parent log. A ledger is sealable to facilitate this segment chaining. Likewise, within a ledger, writes only go to the final fragment. So the Pulsar topic partition is segmented, and so is the BookKeeper ledger itself; both qualify for category C.
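To visualize the nesting, here is an illustrative data model (my own sketch, not actual Pulsar or BookKeeper classes) of the three levels of segmentation:

```java
// An illustrative data model of the three levels of segmentation.
import java.util.List;

// A fragment: a bounded sub-log written to a fixed ensemble of bookies.
record Fragment(long firstEntryId, List<String> bookieEnsemble) {}

// A ledger: a log of fragments (only the last fragment accepts writes).
record Ledger(long ledgerId, List<Fragment> fragments) {}

// A topic partition: a log of ledgers (only the last ledger accepts writes),
// i.e. a log of logs of logs.
record TopicPartition(String name, List<Ledger> ledgers) {}
```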
Ledger lifecycle
Let’s get into the details of the life of a ledger, from creation, reading/writing to sealing (known as closing).
Creation
When a ledger is created, there are a number of things that need to be decided:
The set of bookies that will store the ledger.
The number of bookies that must acknowledge a write for it to be considered committed.
Three ledger parameters determine these two things:
Ensemble size (E): The number of bookies that the ledger will be distributed across.
Write Quorum (WQ): The number of bookies each entry will be written to (like replication factor in Kafka).
Ack Quorum (AQ): The number of acknowledgments (from bookies) for an entry to be committed.
Typically, people use E=3, WQ=3, AQ=2. Ensemble size is interesting: if we used E=6, WQ=3, then the client would stripe its quorum writes (of 3 bookies per entry) across 6 bookies. But striping is rarely used in practice, as it results in more random IO with the current bookie design.
When a BK client creates a ledger, based on the above parameters, it chooses the ledger ensemble (the set of bookies that will store the entries) for the first fragment. It creates the metadata for that ledger, writing it to a versioned register (such as a ZooKeeper znode). The client can then begin to perform quorum writes to the ensemble of bookies.
Fig 8. Ledger metadata
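To make this concrete, here is a minimal sketch of creating a ledger and appending an entry with the BookKeeper Java client, using E=3, WQ=3, AQ=2. The ZooKeeper connection string and password are placeholders, and exact method signatures may vary between BookKeeper versions.

```java
// A minimal sketch using the classic BookKeeper LedgerHandle API.
// Connection string and password are placeholder values.
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class CreateLedgerExample {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("zk1:2181");

        // Ensemble size = 3, write quorum = 3, ack quorum = 2.
        LedgerHandle lh = bk.createLedger(
                3, 3, 2,
                BookKeeper.DigestType.CRC32,
                "ledger-password".getBytes());

        // The client that created the ledger is its single writer.
        long entryId = lh.addEntry("hello".getBytes());
        System.out.println("Committed entry " + entryId + " to ledger " + lh.getId());

        // Seals the ledger: writes the closed status and last entry id to the metadata.
        lh.close();
        bk.close();
    }
}
```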
Reading and writing
It’s helpful here to look at the Paxos roles of Proposer and Acceptor.
BookKeeper clients act as Proposers in the Paxos protocol, initiating write operations by proposing entries to be stored in ledgers. The client that creates a ledger becomes the ledger’s Distinguished Proposer (or leader), ensuring a strict sequential ordering of entries. Bookies function as Acceptors, receiving, validating, and persisting the proposed entries from clients. The client’s role as Distinguished Proposer/Leader is not validated by bookies (acceptors); it is simply ensured by clients only ever performing regular writes to the ledgers they have created themselves.
Fig 9. The client that opens a ledger is its distinguished proposer, also known as a leader. The bookie ensemble are the acceptors.
Therefore we can consider the BookKeeper replication protocol a leader-based protocol, where the client is the leader. This is why Pulsar ensures that only one broker at a time can act as the owner of a topic partition.
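On the read side, any client can read committed entries back from a ledger, not just the writer. Here is a minimal sketch using the classic LedgerHandle API (the ledger id, connection string, and password are placeholders); it reads only up to the last add confirmed (LAC), the highest entry id known to be committed.

```java
// A minimal sketch of reading committed entries from a ledger. Readers other than
// the writer typically use openLedgerNoRecovery so as not to fence a live writer.
import java.util.Enumeration;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class ReadLedgerExample {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("zk1:2181");
        LedgerHandle lh = bk.openLedgerNoRecovery(
                42L, BookKeeper.DigestType.CRC32, "ledger-password".getBytes());

        // Only read up to the last add confirmed: entries beyond it may not yet be committed.
        long lac = lh.getLastAddConfirmed();
        if (lac >= 0) {
            Enumeration<LedgerEntry> entries = lh.readEntries(0, lac);
            while (entries.hasMoreElements()) {
                LedgerEntry e = entries.nextElement();
                System.out.println(e.getEntryId() + ": " + new String(e.getEntry()));
            }
        }
        bk.close();
    }
}
```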
While we can nicely line up Paxos roles with BookKeeper, the rest diverges in a few ways. In theory, the Loglet implementation does not need to handle all types of fault; it can do failure-free ordering until something bad happens. Then the VirtualLog can kick in to do a reconfiguration.
Interestingly, a BookKeeper ledger has these Loglet properties externally, but also, internally, has some VirtualLog properties:
If a BookKeeper client encounters a problem that it cannot solve (such as there not being enough functioning bookies to make progress), it closes the ledger and kicks the can to the VirtualLog layer above it. Very Loglet-like.
If a write to a bookie fails, but there are other functioning bookies available, the BK client can reconfigure the ledger by appending a new fragment, with a different ensemble of bookies. Very VirtualLog-like.
Fig 10. A bookie fails, causing an ensemble change.
Externally, this ledger reconfiguration (known as an ensemble change) is completely abstracted.
Closing a ledger (by the leader client)
To close a ledger, the leader client needs to update the ledger metadata, setting the ledger status to closed and setting the last entry id of the ledger. A closed ledger is bounded; it has a start entry id (0) and an end entry id.
If it is the leader client closing the ledger, then it knows the last committed entry (as it is the leader) and the metadata update is simple.
Fig 11. Closing a ledger, by the leader client, is a metadata op, setting the status and last entry id of the ledger.
Closing the ledger of a dead client–aka ledger recovery
BookKeeper itself has no leader election at all. It can get confusing here: Pulsar chooses a leader broker per partition because BookKeeper uses a single-writer protocol. So Pulsar has a form of leader election, but BookKeeper sits a layer below that, leaving leader election to the application layer above it.
The rule is that only the BK client that creates a ledger will write to it. One ledger corresponds to one steady-state with a stable BK client. If a client dies, leaving an open ledger, then another client must seal/close the ledger on behalf of the dead client. This type of ledger closing is known as ledger recovery.
This other client is the recovery client, and to close the ledger it must discover the ledger tail in order to update the ledger metadata. We can’t ask the leader client for this information, as it may be dead. The only way to discover the ledger tail is to interrogate the bookies.
The recovery client performs ledger recovery in three phases (a code sketch follows Fig 12):
Fencing. A quorum of the bookie ensemble of the last fragment is fenced. Fencing is scoped to the ledger, so the recovery client sends a Fence Request with the ledger id to each bookie of the ensemble. Once a ledger is fenced on a bookie, that bookie will not accept any further regular writes for that ledger.
Recovery reads and writes. The recovery client learns of the ledger tail entry by performing quorum reads. The client also performs recovery writes (a form of read repair) to increase the redundancy of the ledger tail. Recovery writes are allowed on fenced ledgers.
Metadata update. The recovery client updates the ledger metadata as above (setting last entry id and status). The update is a conditional put, which provides optimistic concurrency.
Fig 12. Ledger recovery.
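From the recovery client’s perspective, all three phases happen behind a single call: opening the ledger with recovery. A minimal sketch (ledger id, connection string, and password are placeholders):

```java
// A minimal sketch of recovering (fencing and closing) a dead writer's ledger.
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class RecoverLedgerExample {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("zk1:2181");

        // openLedger (as opposed to openLedgerNoRecovery) fences the ensemble,
        // performs recovery reads/writes to discover and repair the tail, then
        // writes the closed status and last entry id to the ledger metadata.
        LedgerHandle lh = bk.openLedger(
                42L, BookKeeper.DigestType.CRC32, "ledger-password".getBytes());

        System.out.println("Ledger closed at entry " + lh.getLastAddConfirmed());
        lh.close();
        bk.close();
    }
}
```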
The fencing prevents a split-brain scenario where:
The leader client is actually still alive and continuing to make progress on the ledger.
A second client is performing recovery, and closes the ledger. Or even multiple clients competing to close the ledger.
For example, if a leader commits entry 100, but another client already closed the ledger at entry 50, then we just effectively lost 50 entries. Recovery is designed to make that impossible.
This is a good time to talk about Flexible Paxos. Ledger recovery uses a quorum size that is derived from the Ack Quorum (AQ), and Flexible Paxos explains why. While traditional Paxos requires majority quorums in both phases, Flexible Paxos allows for adjustable quorum sizes as long as any two quorums intersect. This can be applied to MultiPaxos by ensuring that all phase 1 (election) quorums intersect with all phase 2 (log entry) quorums.
For example, with a cluster size of 5:
With a Phase 1 quorum of two (a minority), the Phase 2 quorum must be at least four, as this guarantees overlap.
With a Phase 1 quorum of four, the Phase 2 quorum only needs to be two, as this guarantees overlap.
In other words: Phase 1 Quorum + Phase 2 Quorum > Cluster Size.
In BookKeeper, the Phase 1 (election) quorum corresponds to the recovery quorum (RQ), the Phase 2 quorum corresponds to the Ack Quorum (AQ), and the Write Quorum (WQ) plays the role of the cluster size.
For example:
With WQ=5, AQ=4, then RQ=2. That is, ledger recovery will only succeed if it can fence and then read from two of the bookies of the last fragment ensemble.
With WQ=5, AQ=2, then RQ=4.
A mistake I’ve seen with operating BookKeeper has been to set AQ=1, without realizing that causes RQ=WQ, which makes a ledger unrecoverable after a single bookie loss.
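To make the arithmetic explicit, here is a small helper (my own sketch, not part of the BookKeeper API) that derives the recovery quorum from the write and ack quorums, based on the intersection requirement AQ + RQ > WQ:

```java
// A small helper that derives the recovery quorum. Not part of the BookKeeper API.
public class QuorumMath {

    static int recoveryQuorum(int writeQuorum, int ackQuorum) {
        // The smallest RQ guaranteed to intersect every possible ack quorum.
        return writeQuorum - ackQuorum + 1;
    }

    public static void main(String[] args) {
        System.out.println(recoveryQuorum(5, 4)); // 2
        System.out.println(recoveryQuorum(5, 2)); // 4
        System.out.println(recoveryQuorum(3, 1)); // 3, i.e. RQ=WQ: losing one bookie makes the ledger unrecoverable
    }
}
```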
BookKeeper summary
BookKeeper itself qualifies for categories A, B and C.
Category A (disaggregated roles). This part has been clear. BookKeeper clients (proposers) are separate processes to bookies (acceptors). Learners are an application layer role, so BookKeeper doesn’t really include learners.
Category B (control plane/data plane). At a macro level, each ledger is a steady-state period of the data plane. But internally, each ledger can reconfigure itself (given enough functional bookies to reconfigure with).
Given this ledger reconfiguration ability, the ledger also has a control-plane/data-plane split. The ledger data plane involves the client performing reads/writes to a fragment’s ensemble of bookies. If errors occur, the client performs an ensemble change (a metadata op to an external consensus service), and starts quorum writing to the ensemble of the new fragment.
Category C (segmented log). Each ledger is a segmented log (a log of fragments).
Regarding fragments and ensemble changes, again, we can look at the following figure. Each ledger fragment is a steady state, an ensemble change (metadata op) is the reconfiguration, and the next fragment is the new steady state. It is the same role driving both (BK client), but backed by different components.
Fig 13. Each fragment is a steady state configuration, an ensemble change is a reconfiguration that leads to a new steady-state fragment.
You can configure a BK client to use this reconfiguration logic, or to simply close the ledger immediately on an error.
Some may ask, if Pulsar acts as the VirtualLog, why also have BookKeeper do reconfigurations? It actually makes the BookKeeper client code more complex!
The main reason that I know of is that it is a performance optimization. When an ensemble change occurs (usually only one bookie is swapped out), the client is managing a bunch of uncommitted entries. Some of these entries have already been successfully written to bookies that also exist in the new fragment, so the client only has to write the uncommitted entries to the swapped-in bookie, and everything continues. If we were to close the ledger instead, we’d have to fail all uncommitted entries, and the ManagedLedger would have to rewrite them to a new ledger. Ensemble changes avoid this fail-and-rewrite procedure that would likely cause a latency blip. So it does complicate the client code, but it’s probably worth it (and the ManagedLedger has no idea about this reconfiguration).
Final thoughts
Pulsar with BookKeeper is more disaggregated than Apache Kafka in many ways, and less in others.
How it’s more disaggregated:
In the data plane, we separate the proposers from the acceptors (whereas Kafka partition replicas play both roles).
The log is segmented, allowing it to be spread across a pool of storage servers (whereas in Kafka it is logically one log, though the prefix can be tiered).
The control plane logic is separated from consensus (as it delegates consensus to an external service). Interestingly Kafka moved away from this as it made broker reboots too slow for high partition counts (as a broker had to load so much metadata from the external store on start-up). The Kafka controller moved to the Raft state-machine model.
How it’s less disaggregated:
In both Pulsar and BookKeeper, the two planes are largely separated, except that they are driven by the same component. Kafka, by contrast, has no single role that drives both the control plane and the data plane; the two are more separated. In Kafka, the partition replicas drive the data plane and the Controller drives the partition replication control plane. Partition replicas just keep going until they are told to change configuration by the Controller (acting as the control plane).
It’s not as simple as saying more disaggregated = better. Having one component drive the data and control planes is not inherently worse than having them more separated, as long as that driving component is more of a delegator than a monolithic implementation. Likewise, delegating more or less of the consensus logic to an external service is also not inherently better or worse. It all depends.
Next up
Next, we’ll look at how we can avoid centralizing all writes through a leader. The majority of the log replication systems are leader-based. In the next post of this series we’ll look at a design that separates IO from ordering.