Paper: VR Revisited - Log-Based Replica Recovery (part 5)

Analysis parts: Analysis intro, part 1, part 2, part 3, part 4, part 6.

One of the selling points of VR Revisited is that replicas do not need to write anything to stable storage, or can choose to write to storage asynchronously. This can give the protocol a latency advantage over protocols that require fsyncing key operations.

For example, Raft requires nodes to write the entries of an AppendEntries request to disk and fsync before responding. Raft's safety guarantees do not hold under arbitrary loss of an entry. One reason is that the quorum that committed an entry must overlap with all election quorums; if an entry is silently lost on one node, this overlap is no longer guaranteed, which can cause data loss on a leader change.

VR Revisited, however, can cope with arbitrary data loss after a crash via its replica recovery sub-protocol. Note that this only covers data lost in an unclean shutdown (aka a crash). If a replica loses arbitrary data AND doesn't know that something went wrong, carrying on as if everything were fine, then data loss can still occur.

The paper discusses three variants on recovery:

  • Recovery which involves sending the complete log in the recovery response - I’ll call this Basic Recovery.

  • Replicas write operations to their log asynchronously which means that when they crash they lose an arbitrary number of operations from the log suffix. The recovery protocol involves only recovering the log suffix instead of the complete log - I’ll call this Log Suffix Recovery.

  • Replicas periodically take snapshots of their application state, called checkpoints. The recovery protocol first syncs application state via checkpoints, then performs log recovery on all operations after that checkpoint - I’ll call this Checkpoint Recovery.

This part will focus on the first two with checkpoint recovery covered in the next part.

Basic Recovery

The paper states:

“When a node comes back up after a crash it sets its status to recovering and carries out the recovery protocol. While a replica’s status is recovering it does not participate in either the request processing protocol or the view change protocol.”

It describes the procedure as follows:

“1. The recovering replica, i, sends a <RECOVERY x> message to all other replicas, where x is a nonce.

2. A replica j replies to a RECOVERY message only when its status is normal. In this case the replica sends a <RECOVERYRESPONSE v, x, l, n, k> message to the recovering replica, where v is its view-number and x is the nonce in the RECOVERY message. If j is the primary of its view, l is its log, n is its op-number, and k is the commit-number; otherwise these values are nil.

3. The recovering replica waits to receive at least f + 1 RECOVERYRESPONSE messages from different replicas, all containing the nonce it sent in its RECOVERY message, including one from the primary of the latest view it learns of in these messages. Then it updates its state using the information from the primary, changes its status to normal, and the recovery protocol is complete.”

There are some obvious cases to be concerned about here, such as receiving responses from stale primaries that don’t realise there is a higher view number. But the key points are that recovery only completes once:

  • the recovering replica has a response from a majority of normal status replicas, which guarantees we discover the highest valid view number.

  • and one of the responses is from the primary of this highest view.

These two rules guarantee that the log of a stale primary is not chosen.
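To make these two rules concrete, here is a minimal Python sketch of the completion check a recovering replica might apply to the responses it has collected. This is an illustration only, not the paper's pseudocode or my TLA+ spec; the RecoveryResponse shape and the parameter f (the maximum number of tolerated failures) are assumed names.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RecoveryResponse:
    view_number: int
    nonce: str
    log: Optional[list]            # full log if the sender was the primary of its view, else None
    op_number: Optional[int]
    commit_number: Optional[int]

def can_complete_recovery(responses: List[RecoveryResponse], my_nonce: str, f: int):
    """Return the response to adopt, or None if recovery cannot complete yet."""
    # Only count responses that echo the nonce of the current recovery round.
    valid = [r for r in responses if r.nonce == my_nonce]
    if len(valid) < f + 1:
        return None                                   # need a majority (f + 1) of Normal replicas
    highest_view = max(r.view_number for r in valid)
    # The adopted state must come from the primary of the highest view seen,
    # which is the only responder in that view carrying a non-nil log.
    for r in valid:
        if r.view_number == highest_view and r.log is not None:
            return r                                  # adopt its log, op-number and commit-number
    return None                                       # no valid primary response; keep waiting
```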

Another safeguard is that the replica cannot take part in view changes while in the Recovering status. This is important because otherwise it could contribute an empty log, or only a log prefix, to a view change and cause committed operations to be lost.

The last notable property of recovery is that it cannot complete without a replica majority. If f+1 replicas were to crash simultaneously then the cluster would become unavailable without some kind of intervention. I’ll discuss the ramifications of that design choice further down.

The basic recovery TLA+ specification gets four new actions.

Fig 1. Four new actions for basic recovery. Described in more detail below.

Actions:

  • Crash - A replica crashes and restarts without any state. The replica broadcasts a Recovery request including a unique number x.

  • ReceiveRecoveryRequest - A replica in the Normal status receives a recovery request and, if it is the primary of its view, it responds with its view-number, complete log, op-number, commit-number and the field x. If it isn't the primary, it responds with only the view-number, a nil log and the field x (a sketch of this response logic follows the list).

  • ReceiveRecoveryResponse - A replica in status Recovering receives a recovery response with a value of x that matches the unique number it is expecting. It adds the response to a set of received responses.

  • CompleteRecovery - A replica in status Recovering has received a recovery response from a majority of replicas and one response is from a primary in the highest view of the responses received. The replica sets its view-number, log, op-number, commit-number etc to that of the response from the valid primary. Finally it switches to the Normal status. It is now a fully functioning member of the cluster and can participate in all cluster operations.
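As a small illustration of the ReceiveRecoveryRequest action, here is how a replica might construct its response. Again this is only a sketch: the replica object and its fields are assumed names, and the response is a plain dict using the paper's field names (v, x, l, n, k).

```python
def on_recovery_request(replica, nonce):
    """Sketch: how a replica might answer a <Recovery x> message."""
    if replica.status != "Normal":
        return None                      # Recovering or view-changing replicas stay silent
    if replica.is_primary:
        # The primary of its view vouches for its full state.
        return {"v": replica.view_number, "x": nonce,
                "l": list(replica.log), "n": replica.op_number, "k": replica.commit_number}
    # Backups only vouch for the view-number; the other fields are nil.
    return {"v": replica.view_number, "x": nonce, "l": None, "n": None, "k": None}
```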

TLC has been unable to find any correctness issues with the basic recovery sub-protocol. VR Revisited has the largest state space of any protocol I have specified with TLA+ and therefore I rely on simulation mode rather than the brute-force checking mode for any models with more than 3 replicas, 2 view changes and 2 operations. With 24 hours, 28 CPU threads and 100GB RAM dedicated to TLC, it has been unable to find any invariant violations.

However, it does find a couple of liveness issues which are not a problem with the protocol itself, but are noteworthy for the implementer. There are at least two scenarios in which a replica receives all of the recovery responses but none of them is from a valid primary, meaning that recovery cannot complete in that round.

  1. The recovering replica is the primary of the current view and no view change has occurred yet.

  2. A view change occurs during recovery such that the recovery responses do not have a response from a valid primary.

In either case, the implementation would simply start a new recovery round. I decided not to model retries in the spec as it would have added complexity that I didn't think was worth it.
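For what it's worth, a retry loop in an implementation could look something like the sketch below. It is purely illustrative: the transport helpers (broadcast_recovery, next_response), the install method and the f field are assumed names, and can_complete_recovery is the function from the earlier sketch.

```python
import secrets
import time

def run_recovery(replica, round_timeout=2.0):
    """Keep starting recovery rounds until one completes with a valid primary response."""
    while True:
        nonce = secrets.token_hex(16)                  # fresh nonce per round
        replica.broadcast_recovery(nonce)              # assumed: send <Recovery x> to all replicas
        deadline = time.monotonic() + round_timeout
        responses = []
        while time.monotonic() < deadline:
            resp = replica.next_response(deadline)     # assumed: returns None once the deadline passes
            if resp is None:
                break
            responses.append(resp)
            chosen = can_complete_recovery(responses, nonce, replica.f)
            if chosen is not None:
                replica.install(chosen)                # adopt the view-number, log, op- and commit-numbers
                replica.status = "Normal"
                return
        # No valid primary response this round (e.g. a view change was in flight): start a new round.
```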

Log Suffix Recovery

The paper describes how sending the entire log in the recovery response can be too costly. One solution is to write to the log asynchronously and fetch the log suffix only.

A way to reduce expense is to keep a prefix of the log on disk. The log can be pushed to disk in the background; there is no need to do this while running the protocol. When the replica recovers it can read the log from disk and then fetch the suffix from the other replicas.

This is not the paper's main recovery optimization, so it does not go into any more detail than the above. The paper instead proposes a recovery approach based on synchronizing application state via checkpoints, which avoids the potentially costly rebuilding of application state from the log. But if we did want to perform log-based recovery, how could it be done?

One very important fact to keep in mind is that VR Revisited does not include the view-number in each operation it adds to its log. Some log replication protocols use a compound identity for each log entry: the term/epoch and the index which denotes its position in the log. By using this compound identity, the log itself can be used to identify where two logs diverge. In figure 2 below we see six examples of how two logs with compound view-number/op-number entry ids can be out-of-sync.

Fig 2. Examples of how the logs of a recovering replica and the primary can be out-of-sync - with compound view-number/op-number log entry ids.

Figure 2 shows some predictable and perhaps some surprising cases:

  • Case B: Behind - The logs have a common prefix and the recovering replica is missing the log suffix. The recovering replica needs to fetch this missing suffix.

  • Case C: Ahead - The logs have a common prefix but the recovering replica has a log suffix of a higher view that the primary does not have. This can occur if the recovering replica was among a minority of replicas that had written those operations when a view change occurred, and the operations didn't make it into the next view. The recovering replica needs to truncate its operations of view 2.

  • Case D: Divergent - The logs have a common prefix but each replica has a different log suffix. This can occur for the same reason as Case C, but the primary of the subsequent view has written operations to its log. The recovering replica needs to overwrite the operations of view 2 with the operations of view 3.

  • Case E: Divergent #2 - The recovering replica's log is both behind the primary's and ahead of it. It is missing some operations of view 1 and has operations of view 2 that do not exist on the primary. The replica needs to truncate its operations of view 2 and fetch the last operation of view 1.

  • Case F: Divergent #3 - The logs share no common operations at all. The recovering replica is missing the operations of view 1 and view 3 but has an operation of view 2 that the primary does not have. It must truncate its log completely and fetch the log of the primary.

Cases B-D are easy to understand but cases E and F might look strange. How could they occur? TLC was helpful enough to point out case F for me:

Fig 3. A history that leads to case F.

The history in prose can be described as:

  1. Primary r1 receives a client request for value (a) in view 1. It appends [view: 1, op: a] to its log. I will refer to this as operation 1a.

  2. Before receiving the Prepare msg, r2 starts a view change to view 2.

  3. r2 switches to view 2 after receiving DVC messages from itself and r3, neither of which includes operation 1a.

  4. Concurrently, r1 has initiated another view change, to view 3. It never receives the StartView message for view 2.

  5. r2 receives a client request for operation (b) and appends it to its log as [view: 2, op: b], or 2b.

  6. r3 receives the SVC for view 3 from r1 and therefore never receives the StartView from r2.

  7. r3 switches to view 3 after receiving DVC messages from itself and r1; r1's DVC includes the log [1a]. r3 sends a StartView message with log [1a].

  8. r2 crashes.

  9. r3 receives a client request for operation (c) and appends it to its log to make [1a, 3c].

  10. r2 starts up, with log [2b].

Looking at cases A-F, what they all have in common is that the two replicas share a common log prefix, counting an empty log as a prefix.

Protocols that use a compound entry id only need the log itself in order to synchronize. The way I might implement log synchronization in VR Revisited with compound entry ids is for the recovering replica to truncate its own log to the common prefix and then request the suffix from the primary (see the sketch after these steps). This would be done by:

  1. Recovering replica adds to the Recovery message, the id (view-number and op-number) of the last entry in its log.

  2. The primary sends back the id of the highest entry in its log that is equal to or lower than the entry id from the recovering replica. Backups send a Nil.

  3. The recovering replica takes the response from the valid primary and truncates its log to the entry in the response. Then it initiates a state transfer that will fetch the log suffix from that safe point.
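Here is a rough Python sketch of those three steps, assuming each log entry is reduced to its (view-number, op-number) id and ids are compared view-first. None of this is in the paper; the concrete logs at the end are just a hypothetical instance of case E.

```python
def highest_id_at_or_below(primary_log, last_id):
    """Step 2 (primary side): the id of the last entry whose (view, op) id is <= last_id,
    or None if no such entry exists."""
    candidate = None
    for view, op in primary_log:
        if (view, op) <= last_id:                      # tuple comparison: view-number first, then op-number
            candidate = (view, op)
        else:
            break                                      # ids only grow along the log
    return candidate

def truncate_to(log, entry_id):
    """Step 3 (recovering replica side): keep only entries at or below entry_id."""
    if entry_id is None:
        return []
    return [(v, o) for (v, o) in log if (v, o) <= entry_id]

# A hypothetical instance of case E: the recovering replica is both behind and ahead.
recovering = [(1, 1), (2, 2)]                          # has 1a, is missing 1b, has an extra view-2 entry
primary    = [(1, 1), (1, 2)]                          # has 1a and 1b
safe_point = highest_id_at_or_below(primary, recovering[-1])   # -> (1, 2)
print(truncate_to(recovering, safe_point))             # -> [(1, 1)]; state transfer then fetches the rest
```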

However, VR Revisited doesn’t include view-numbers in log entries and therefore we are dependent on the commit-number. For log-based recovery to work, replicas must also write their commit-numbers to disk (asynchronously would be ok).

We know that the commit-number signals the safe part of the log and the log prefix denoted by the commit-number cannot diverge. State transfer also uses the commit-number to request the log suffix for this reason. Therefore, all we need to do to turn Basic Recovery into Log Suffix Recovery is for the recovering replica to include its persisted commit-number in the Recovery message, and for the primary to send its log suffix in the response. Of course, the log suffix could be huge so an implementation would involve a stream of requests and responses.
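A minimal sketch of that change, assuming op-numbers correspond to 1-based log positions, dict-shaped messages, and ignoring the streaming concern just mentioned; the field and method names are assumptions.

```python
def build_recovery_message(replica, nonce):
    """The recovering replica includes the commit-number it read back from disk."""
    return {"type": "Recovery", "x": nonce, "k": replica.commit_number}

def primary_recovery_response(primary, msg):
    """The primary replies with only the entries after the recovering replica's commit-number.
    With 1-based op-numbers, the entry with op-number i sits at list index i - 1."""
    suffix = primary.log[msg["k"]:]
    return {"type": "RecoveryResponse", "v": primary.view_number, "x": msg["x"],
            "log_suffix": suffix, "n": primary.op_number, "k": primary.commit_number}

def apply_suffix(replica, resp):
    """After the usual Basic Recovery checks (f + 1 responses, valid primary of the highest view),
    keep the persisted committed prefix and append the primary's suffix."""
    replica.log = replica.log[:replica.commit_number] + list(resp["log_suffix"])
    replica.view_number = resp["v"]
    replica.op_number = resp["n"]
    replica.commit_number = resp["k"]
    replica.status = "Normal"
```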

Fig 4. Log-suffix recovery actions.

It should be noted that in the Crash action, the persisted log must end at the commit-number. If on restart the persisted log and persisted commit-number do not match, they must be forced to match, either by truncating the log or by lowering the commit-number.
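That check could be as simple as the following sketch (illustrative only):

```python
def reconcile_after_restart(persisted_log, persisted_commit_number):
    """Force the persisted log and commit-number to agree before recovery starts:
    either truncate the log or lower the commit-number."""
    if len(persisted_log) > persisted_commit_number:
        # Uncommitted entries survived the crash: drop them.
        persisted_log = persisted_log[:persisted_commit_number]
    elif len(persisted_log) < persisted_commit_number:
        # The log lost entries that the commit-number claims exist: trust the log.
        persisted_commit_number = len(persisted_log)
    return persisted_log, persisted_commit_number
```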

TLC has been unable to find a defect with this design.

Conclusions

Writing to stable storage asynchronously certainly gives the system a latency advantage over systems that persist synchronously, but it doesn't come for free. Asynchronous persistence combined with the recovery protocol carries a significant risk of losing data and becoming unrecoverable.

A protocol such as Raft that synchronously writes operations to disk can handle crashes of a majority of nodes or even the whole cluster. What it can't handle is the permanent loss of a majority of nodes - in that case the cluster is stuck and cannot make progress without some kind of intervention. However, permanently losing a majority of nodes is far, far less likely than a majority of nodes crashing simultaneously.

In my career running distributed data systems I have seen plenty of cluster crashes. Cluster crashes happen; it's not intended behaviour, but it is always possible for a bug to occur, a corrupted entry to get replicated, or excessive load to cause multiple nodes to crash. If a VR Revisited cluster with asynchronous log flushing experiences a full-cluster crash, or just a majority crash, one of two things can happen.

Firstly, it’s possible for committed data to be lost. That much is clear. But even if no data were lost at all, the cluster would remain unavailable…stuck. When a replica starts and detects it did not go through a clean shutdown sequence it must initiate the recovery sub-protocol as it cannot know how much data was lost (if any). However, recovery cannot complete without a majority in the Normal status. When a majority are in the Recovering status, none of those replicas can complete recovery.

I would be very nervous about using a distributed data system that gets stuck if a majority of replicas crash or simply get terminated before completing a controlled shutdown.

Next up we’ll look at the Checkpoint Recovery variant. You can find the specification for Basic Recovery and Log Suffix Recovery in the GitHub repo.

Analysis parts: Analysis intro, part 1, part 2, part 3, part 4, part 6.