Paper: VR Revisited - View changes - Answers (part 2)

Analysis parts: Analysis intro, part1, part 3, part 4, part 5, part 6.

In part 1 I described the questions and doubts I had about the view changes sub-protocol. In this part of the analysis we’ll use TLA+ to answer those questions.

I wrote two specifications, one for increment-mode and assume-mode which only include the view changes and normal operations sub-protocols - leaving out client and replica recovery, client tables and cluster reconfiguration.

TLA+ is difficult to read if you haven’t learned anything about it, so I’ve condensed the essential logic in a TLA+ pseudocode.

Assume-mode and increment-mode share most of the same logic and variables.

Fig 1. The variables. The variables are either per replica, or global

The per replica variables for status, view number, log, op number and commit number are self-documenting. Others are worth describing:

Last normal view is the last view when the replica was in the normal status.
Peer op number is a map that the replica uses to track which op numbers have been acknowledged by a peer via a PrepareOk message. This lets it execute those ops once f+1 replicas have the op.
The “Sent” variables are used by the specification to avoid sending DVC and SV messages multiple times when enabling conditions evaluate to true.

The variable messages is a bag of messages that are either pending delivery or have already been processed. The replicas variable is a set of integers denoting the replica ids.

Safety properties (bad things that should never happen)

Fig 2. Two safety properties (invariants). The /\ symbol means AND.

I included the following safety properties in all specifications, aka invariants:

All committed operations exist on a majority of replicas. If this is violated, then a view change or loss of a minority of replicas can cause data loss. I also have an invariant that ensures that all committed operations exist on at least one replica, but for this first exploration I didn’t use it as it takes longer to hit in the case that a counterexample exists. That invariant is suitable when also modelling the total loss of a replica which I don’t do as that requires the replica recovery sub-protocol which I am not covering yet.
No log divergence. There are no two replicas, that have an operation of the same op number in both logs and which is committed on both replicas, but whose value is different. Violating this property means that the log has diverged.

Liveness properties (good things that should eventually happen)

Fig 3. Liveness properties that specify good things that should eventually happen. The \/ means OR.

Liveness properties are also critical, for two reasons. Firstly, good liveness properties can find bugs in your specification by showing you a counterexample where the system cannot reach a valid state that you expect it to (happens regularly during development). Liveness properties act as great tests. Secondly, liveness properties can find defects in your design by showing you counterexamples where the system can get stuck.

The downside of liveness is that is can be tricky to write and a lot slower for TLC to run.

I included two liveness checks.

ConvergenceToView which ensures that all replicas eventually reach the same view number in normal status. This will find a scenario where one or more replicas are unable to join a view number due to an issue with the protocol. It won’t flag up latency based liveness issues, only issues where the cluster gets stuck, unable to make progress.

OpEventuallyAllOrNothing checks that an operation will eventually be replicated to all replicas or not exist on any. For example, a view change may cause uncommitted operations to be truncated or overwritten.

Normal operations

ClientRequest - Replica must be a primary in the normal status. Appends an entry to its log containing the client value and broadcasts a Prepare message.
ReceivePrepare - A non-primary replica in the normal status receives a Prepare message with a matching view number and whose op number is exactly one greater than its op number. This ensures Prepare requests are processed in order. Replies with a PrepareOk message including the op number of the received op.
ReceivePrepareOk - A primary replica in the normal status receives a PrepareOk message with a matching view number and whose op number is larger than the registered peer op number. After a view change the primary sets the peer op number of all replicas to 0. It updates the peer op number on receiving each valid PrepareOk message. PrepareOk messages do not need to be processed in order because the peer itself ensure that it processes Prepare messages in order. Any PrepareOk message received covers all op numbers at and below the op number of the message.
ExecuteOp - A primary has a commit number < op number. The op at commit number + 1 has reached f+1 threshold (by counting the number of peers whose peer op number >= this op). The primary increment its commit number. This specification does not include the client table at this point.

The paper does not describe how the primary keeps track of the PrepareOk responses, so I use a Raft like variable called the peer op number.

So far we’ve looked at the parts that are the same for both increment and assume mode. Next we’ll look at assume-mode, then how increment differs from it. Finally we’ll go back to those questions from part 1.

Quick TLA+ guide

To read the pseudocode you still need to understand how TLA+ deals with changes to mutation of state. TLA+ deals with “states” where a state is a mapping of values to variables. Actions take us from one state to another state. An example would be a system with two counters and two actions - Inc1 and Inc2.

Inc1 ==
    /\ counter1' = counter1 + 1
    /\ UNCHANGED counter2

Inc2 ==
    /\ counter2' = counter2 + 1
    /\ UNCHANGED counter1

The /\ symbol means AND. Inc1 there increments the value of counter1 AND leaves counter2 unchanged.

Fig 4. Two counters have 4 states where both counters <= 1. There are two ways from reaching the state where both counters = 1.

In TLA+, when you see variable’ = val it means that in the next state, the value of that variable will be val.

Each action may have one or more enabling conditions that must be true for the action to be executable, then one or mutations to the variables. For example, we could limit the value of counter1 to values <= 10 with:

Inc1 ==
    /\ counter1 < 10
    /\ counter1' = counter1 + 1
    /\ UNCHANGED counter2

That should be enough to understand the TLA+ like pseudo code.

Assume-mode

The enabling conditions are colored in green, and the mutations in black.

TimerSendSVC - The timer of any non-primary can fire at any time. The replica switches to ViewChange status, increments its view number and broadcasts an SVC with the incremented view number. Because the timer can fire at any time this includes situations where a new view change is required due to a stuck view change. We essentially test correctness when view changes can happen at any time, by any non-primary.
ReceiveHigherSVC - A replica receives an SVC with a higher view number than its own. There is no restriction on status at this time as we only include the normal operations and view change sub-protocols. The replica sets its view to that of the message, changes its status to ViewChange and broadcasts an SVC with this new view number.
ReceiveMatchingSVC - A replica receives an SVC with a matching view number while in the ViewChange status. No other actions are taken in this action.
SendDVC - A replica has received f SVC messages that match its current view and it has not yet sent a DVC in this view. The replica sends a DVC to the primary of the current view. The message includes l=log, n=op number, k=commit number and v’=last normal view. If the new primary is itself, it sends itself the DVC but registers it as already processed.
ReceiveHigherDVC - A replica receives a DVC with a higher view number than its own. There is no restriction on status at this time as we only include the normal operations and view change sub-protocols. The replica sets its view to that of the message, changes its status to ViewChange and broadcasts an SVC with this new view number.
ReceiveMatchingDVC - A replica receives a DVC with a matching view number while in the ViewChange status. No other actions are taken in this action.
SendSV - A replica has received f+1 DVC messages that match its current view. It updates its state accordingly:
- sets its status to Normal
- overwrites its log with the log from the DVC with the highest last normal view (v’).
- sets its commit number to the highest commit number in the DVCs received.
- sets its op number to the last entry of the new log.
- sets its last normal view to the current view.
ReceiveSV - A replica receives an SV with a view number >= its own view number. It updates its state accordingly:
- sets its view and last normal view to that of the message
- sets its log, op number and commit number to those of the message
- if its existing commit number is lower than the SV op number, then reply with a PrepareOk message to the replica with the SV op number. The replica would also execute any operations between its current commit number and the commit number in the SV message.

Assume-mode results

The good news is that TLC has been unable to find any safety or liveness violations with assume-mode. Only small models (3 replicas, 2 view changes, 2 operations) have been run in model checking mode which performs a brute force exploration of the state space. Larger models would need more than the 1TB of disk and 32 CPU threads I have on my workstation.

To cope with larger models such as 5 replicas, 10 view changes and 3 operations I used simulation mode with 28 workers. With simulation mode, each worker starts at the initial state and randomly selects a path through the state space while checking safety and liveness properties. Simulation mode is embarrassingly parallel - each worker is independent unlike model checking mode which suffers from queue contention.

Running simulation for 24 hours found no issues on larger models.

Increment-mode

Increment mode differs from assume mode in the following ways:

it can only ever increment its view number during a view change
it cannot allow a replica to move more than 1 view ahead of a majority, therefore stuck view changes must be retried if they are stuck in the SVC phase.
ResendSVC action for when an SVC must be resent due to no corresponding SVCs received in return. Liveness properties will fail without this addition.
it has an extra variable for tracking which valid DVCs it has received. This was added in order to be able to detect the case where a replica receives DVCs with different view numbers. Each time it receives a DVC it updates its tracked DVCs, removing any that are lower than its view.
when sending an SV it may itself be in a lower view than the DVCs, so it chooses the new view to be the highest one of the DVC messages received.

Increment-mode Results

Question 6 and 7 ask the question of how to count DVC messages with increment-mode. In part 1 I described a valid history where a replica could become the primary despite receiving DVCs with view numbers which were higher, even after the incrementing of the replica’s view number. So DVCs with a higher view number than the replica view number could be counted - but could the DVCs themselves have different view numbers?

The bad news is that TLC finds a counterexample where a replica receives DVCs from different view numbers. This leads to an invariant violation of AcknowledgedWritesExistOnMajority. The counterexample is 114 steps and so a diagram cannot include every message exchanged. The below diagram omits message passing that is not important to the counterexample - so replicas move from view to view without the associated arrows.

Fig 6. Counterexample leading to data loss.

This is a state space that is so incredibly large that even simulation mode was not able to find it alone. It was reasonable to expect the existence of this invariant violating state so I had to use some tricks to get TLC there (more on that in a separate post).

The fix is obvious, which is that a replica should only count the DVCs of the highest received view - so in the case above it would have only counted one DVC, not three. Changing the specification to behave this way fixes the issue.

I added an additional invariant for the increment-only specification to verify that no replica could move more than 1 view ahead of a majority.

NoReplicaMoreThanOneViewAheadOfMajority ==
    ~\E r \in replicas :
        \E reps \in SUBSET replicas :
            /\ Cardinality(reps) > ReplicaCount \div 2
            /\ r \notin reps
            /\ \A r1 \in reps : 
                rep_view_number[r1] < rep_view_number[r] - 1

This can be described as:

There exists no replica r such that:
   There exists a subset of replicas such that:
       /\ it is a strict majority
       /\ r is not in this subset
       /\ for all replicas in the subset :
           its view number < view number of r - 1

TLC could find no counterexample that violates this invariant so it looks like we can control replica view numbers to avoid the disruptive non-functioning replica case from part 1.

Once the SendSV bug of counting DVCs of different view numbers was fixed, no further safety or liveness issues could be found. Of course increment-mode does have liveness problems, but no stuck clusters.

The answers to the questions from part 1

Are view numbers monotonic? Yes, monotonic.
How to count SVCs? Count any SVC that has been received that matches the current view.
Do view numbers just increment? I’ve been told that the intent of the wording was to describe assume-mode. However, lack of precision and other mistakes in the wording conflicted with assume-mode, leaving this question wide open. I agree that assume-mode is the better design and will be leaving increment-mode behind going forward.
How to handle stuck view changes? Assume-mode simply starts a new view change. Increment-mode can only resend the same SVCs again when stuck in the SVC phase.
What to do if an SV message gets lost? It’s ok to lose an SV message. It will simply trigger a new view change.
Can a replica receive DVCs with different view numbers? Assume-mode no. Increment-mode yes.
How to count DVCs? Assume-mode: only count DVCs of the current view. Increment-mode: Only count DVCs of the highest view number in the DVCs received.

Conclusions

We’re leaving increment-mode behind now - it was not the original intent of the authors I am told and has all many sub-optimal liveness behaviours described in part 1. Interestingly though, with only the normal operations and view change sub-protocols involved, it looks to be safe (with the above fix).

Assume-mode is how we’re going to model VR Revisited from here on. I will also leave the term assume-mode behind. Find the increment-mode and assume-mode specifications on my GitHub repo. Go to the analysis introduction for the links to the whole analysis.

Next up: Adding state transfer to our specification (part 3).

Analysis parts: Analysis intro, part1, part 3, part 4, part 5, part 6.