
Steady on! Separating Failure-Free Ordering from Fault-Tolerant Consensus

"True stability results when presumed order and presumed disorder are balanced. A truly stable system expects the unexpected, is prepared to be disrupted, waits to be transformed."Tom Robbins

This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or where the log itself is the product (as in Kafka). I’m going to cover some of the same ground as the Introduction to Virtual Consensus in Delos post, but focus on one aspect specifically and see how it generalizes.

There is a sentence in the Virtual Consensus in Delos paper that confused me the first time I read it:

In virtual consensus, we divide and conquer: consensus in the VirtualLog is simple and fault-tolerant (but not necessarily fast, since it is invoked only on reconfigurations), while consensus in the Loglet is simple and fast (but not necessarily fault-tolerant).
— Virtual Consensus in Delos

I thought to myself that, of course, the Loglet should have some kind of fault tolerance! The Loglet contributes redundancy, which is a cornerstone of fault tolerance. But I had missed that the sentence was talking about the fault tolerance of the consensus mechanism itself. On a second read, I realized my misunderstanding. This post is really about what the above paragraph means.

Reading the paper again, I found one sentence that blew my mind in terms of information density. It perfectly captures the role of the Loglet, and how its consensus differs from that of the VirtualLog.

The Loglet does not have to support fault-tolerant consensus, instead acting as a pluggable data plane for failure-free ordering.
— Virtual Consensus in Delos

The great value of the Loglet abstraction is that it contributes to fault tolerance through redundancy, but does not require fault-tolerant consensus. That might sound confusing, so let’s pick it apart.

Redundancy plays a fundamental role in distributed data systems and contributes massively to fault tolerance. There are different types of redundancy:

  • Data redundancy ensures data durability and availability in the face of failures. This can take the form of replication, erasure-coding, or even depending on someone else’s durable storage (like object storage, which itself uses erasure coding).

  • Compute redundancy ensures that when one server goes down, there are others available to take on the load.

  • Geographic redundancy applies the first two across multiple locations so that the system can survive regional failures.

The common thread is that multiple servers must work together over a network to provide availability and consistency. The agreement part of this collaboration is known as consensus: distributed data systems require consensus mechanisms to ensure that multiple servers maintain consistent copies of data and agree on the ordering of operations.

So far, so good. Redundancy contributes to fault tolerance.

But fault tolerance also includes a bunch of other concerns:

  • Failure detection.

  • Fail-over mechanisms (to redundant copies/processes).

  • Repair and recovery mechanisms.

A lot of the consensus machinery in a log replication protocol is concerned with those aspects.

Failure-free ordering is the concept of guaranteeing consistent ordering (across redundant copies) in a steady-state condition, where no failures are occurring. It isn’t saying that ordering guarantees are lost in failure scenarios; rather, failure-free ordering is the process of enforcing the ordering of operations while the system is free of failures (in a stable, steady state). This point is important to clarify.

An example of failure-free ordering would be a Raft leader of a given term determining the order of log commands when everything is running normally. If we remove from Raft all the mechanisms that handle the things that can go wrong (network connectivity issues, dead disks, and so on), Raft becomes extremely simple: it only needs to implement failure-free ordering. The leader would simply be making AppendEntries RPCs to its followers, and the roles would never change.

This simpler Raft version doesn’t do anything about failures beyond reporting that a problem exists. It can still make progress through some minor or transient faults: a leader can keep going when a single follower becomes unresponsive, or when multiple followers come and go intermittently. In terms of redundancy, either it reaches the desired level or it doesn’t; all it can do about failures is notify that they occurred.
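To make that concrete, here is a minimal sketch of the idea, not of Raft itself: the names (FixedLeader, Follower.append, the quorum parameter) are invented for illustration. A fixed leader assigns positions and replicates to followers; there are no elections, no role changes and no repair, so all it can do about a failed follower is report the reduced redundancy.

```python
# A minimal sketch of failure-free ordering, not Raft itself. The names
# (FixedLeader, Follower.append) are invented for illustration. There are
# no elections, no role changes and no repair; on follower failure the
# leader can only report the reduced redundancy.

class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []
        self.alive = True

    def append(self, index, entry):
        if not self.alive:
            raise ConnectionError(f"{self.name} is unreachable")
        # Followers accept entries in exactly the order the leader assigns.
        assert index == len(self.log)
        self.log.append(entry)
        return True


class FixedLeader:
    """Orders entries while the system is failure-free; never changes role."""

    def __init__(self, followers, quorum):
        self.followers = followers
        self.quorum = quorum
        self.log = []

    def replicate(self, entry):
        index = len(self.log)
        self.log.append(entry)
        acks = 1  # the leader's own copy
        for follower in self.followers:
            try:
                if follower.append(index, entry):
                    acks += 1
            except ConnectionError as err:
                # No fail-over, no reconfiguration: just surface the problem.
                print(f"warning: {err}")
        # Committed only if the entry reached the desired level of redundancy.
        return acks >= self.quorum


followers = [Follower("f1"), Follower("f2")]
leader = FixedLeader(followers, quorum=2)
print(leader.replicate("op-1"))  # True: ordered and fully replicated
followers[0].alive = False
print(leader.replicate("op-2"))  # still True: one follower down, quorum met
```

Everything that makes real Raft non-trivial (elections, term changes, log repair) is absent here; that missing part is exactly where fault-tolerant consensus comes in.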

We like failure-free ordering! It’s simple to implement.

But the world is full of failures! In Raft, if the leader dies, no failure-free ordering can take place. We need consensus to kick in, and that consensus must itself be tolerant of failures. We have leader elections to handle leader failures; we have membership changes when a server fails permanently, etc. Fault-tolerant consensus is the ability of the servers in the cluster to agree (achieve consensus) despite failures.

The contribution of the Loglet abstraction is that we only assign it the responsibility of failure-free ordering: the Loglet does the steady state. It is the responsibility of the VirtualLog to provide fault-tolerant consensus. When bad things happen, it reconfigures the log by sealing the current active loglet and creating a new one (rerouting around the failure). The VirtualLog abstraction bears no responsibility for redundancy, only for consensus to establish a new steady-state configuration.
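Here is a rough sketch of that division of labour, with invented in-memory Loglet and VirtualLog classes. The real Delos Loglet API and its replication are richer than this, and the VirtualLog’s reconfiguration is itself a fault-tolerant consensus step rather than a local method call; the sketch only illustrates the split in responsibilities: the loglet orders appends until it is sealed, and the VirtualLog’s only job is to seal the active loglet and chain a new one.

```python
# A rough sketch of the Loglet/VirtualLog split. The class and method names
# are invented stand-ins (the Delos Loglet API and its replication are richer
# than this); the point is only that the loglet orders appends until sealed,
# while the VirtualLog handles the transition to a new configuration.

class SealedError(Exception):
    pass


class Loglet:
    """Data plane: failure-free ordering within one configuration."""

    def __init__(self):
        self.entries = []
        self.sealed = False

    def append(self, entry):
        if self.sealed:
            raise SealedError("loglet is sealed; reconfiguration required")
        self.entries.append(entry)
        return len(self.entries) - 1  # position within this loglet

    def seal(self):
        # Once sealed, a loglet can never accept another append.
        self.sealed = True


class VirtualLog:
    """Control plane: reconfigures by sealing the active loglet and chaining
    a new one. In Delos this step is fault-tolerant consensus; here it is
    just a local method call."""

    def __init__(self):
        self.chain = [Loglet()]

    @property
    def active(self):
        return self.chain[-1]

    def append(self, entry):
        try:
            return len(self.chain) - 1, self.active.append(entry)
        except SealedError:
            self.reconfigure()  # route around the failed/retired configuration
            return len(self.chain) - 1, self.active.append(entry)

    def reconfigure(self):
        self.active.seal()
        self.chain.append(Loglet())


vlog = VirtualLog()
print(vlog.append("op-1"))  # (0, 0): loglet 0, position 0
vlog.active.seal()          # simulate a failure that forces reconfiguration
print(vlog.append("op-2"))  # (1, 0): a new loglet chained after the seal
```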

If we take a step back, we can look at pretty much any replication protocol through the lens of two different phases:

  • Steady-state (failure-free ordering).

  • Reacting to failure or adverse conditions (moving the system from a failed or undesirable steady state to a new steady state).

We can see this transition in Raft and Virtual Consensus:

  • In Raft, this is:

    • Transitioning from one term to another (perhaps after a leader becomes unreachable). 

    • Transitioning from one configuration (membership) to another (perhaps after a permanent server failure or redistributing load by moving data).

  • In Virtual Consensus, this is sealing the active loglet and chaining a new one onto the end of the shared log. Each loglet represents one steady-state configuration.

We can call the steady-state logic the data plane, and the transitions between steady states the control plane. Virtual Consensus separates data plane from control plane logic, and so too do some other replication protocols (but by different means). In the next post I’ll explore the different ways we can disaggregate a log replication protocol. This concept of transitioning between steady-state configurations is one key aspect of that, but not the only one.
