Paper: VR Revisited - Application state and commit-number monotonicity (part 4)

Analysis parts: Analysis intro, part1, part 2, part 3, part 5, part 6.

Part 4 was going to be focused on the replica recovery sub-protocol but while writing the replica recovery specification I discovered that I had failed to enforce a critical property - that of commit-number monotonicity.

VR Revisited is a state machine replication (SMR) protocol. Each node is a state machine that executes a sequence of commands to mutate its state/perform actions and the sequence of commands is the replicated log.

Fig 1. State-machine replication (SMR) where each node is comprised two modules: log replication and state machine.

So far I have only modelled the replication module that replicates the log, leaving out the state machine on top. In order to model replica recovery and its checkpointing mechanism I decided to add the application state to the specification. During this process I realised the specification allowed for a commit-number to go backwards which caused a data consistency issue.

Commit numbers and application state

Once a primary receives f PrepareOk messages for a given operation, the primary knows that the operation and all prior operations are stored on f+1 replicas. At this point it can execute all unexecuted operations with op-numbers <= this safe operation.

Executing an operation involves the following:

Perform an upcall to the state machine module that sits atop the replication module. The state machine applies this operation to the application state.
Increment the commit number.

This state machine can be any arbitrary state machine, a classic example is a key/value store that mutates the KV store according to the commands it receives from the replication module. The key rule here is that upcalls can only be made for committed operations. If a given operation has been applied to the application state on any node, then we know that this operation is considered safe (i.e. it will survive the loss of any f replicas or any view changes).

This means that the commit-number is used for both indicating the highest committed operation known to the replica but also the highest applied operation on that replica. Therefore, the commit number needs to be monotonic else an operation could be applied to the application state more than once, causing an inconsistency.

I have been tempted to add a Raft style apply-index which tracks the highest operation applied by the state machine to separate out the tracking of committed and applied operations. For now, I am sticking to the paper’s description which only directly refers to the commit-number. Whenever a replica receives a commit-number in a message that is higher than its own commit-number, it executes the necessary operations immediately until its commit-number matches that of the received message.

Modelling application state

To model application I have added a new variable to the specification that stores the operations that get executed. The application state can be arbitrary, but I’m going to use a log as it allows me to add simple invariants to ensure consistency between replicas.

Fig 2. Application state modelled as a log

When a replica executes an operation, it appends the operation to the app state log and increments the commit-number.

We can now add an invariant for the app state that ensures there is no divergence of app state between replicas, in the same way we do for the log itself.

StartView messages and commit-numbers

The paper describes how a backup replica handles a StartView message and advances its commit-number:

If there are non-committed operations in the log, they send a <PREPAREOK v, n> message to the primary; here n is the op-number. Then they execute all operations known to be committed that they haven’t executed previously, advance their commit-number, and update the information in their client-table.

But what should a replica do if it’s commit number is higher than that of the StartView message? Is that even possible?

Fig 3. A replica receives a StartView with a lower commit number than its own.

The paper does not explicitly describe this scenario and my mistake had been to assume that the commit-number of a StartView message could never be less than the current commit-number of the replica. Therefore I allowed the replica to update its commit-number to match the SV message without restriction which led to the violation of the commit-number monotonicity property. In hindsight it is obvious that this could happen but these little oversights are too easy to make.

I was alerted to this issue when TLC reported the following app state divergence counterexample:

Fig 4. r1 ends up applying operation (a) two times to the state machine module because the commit-number was not monotonic.

The above counterexample was caused by r1 setting its commit-number from k=1 to k=0 on receiving the SV message from r2. That then caused it to reapply operation (a) when it received k=2 in the Prepare message for operation (c).

The fix is to ensure that the commit-number is only updated if the SV message commit-number is higher. The same goes for receiving Prepare and NewState messages which can also update a replica’s commit-number. Once we update those actions we guarantee commit-number monotonicity and allow for no application state divergence.

A dive into the TLA+ specification

I wrote an introduction to TLA+ on the Splunk messaging-as-a-service engineering blog which is worth reading if you haven’t learned about TLA+ previously. It explains how TLA+ models state and changes to state. So far I have avoided including TLA+ in this analysis, instead using a slightly simpler pseudo-TLA+. But I’m going to introduce some now.

The specification models the state stored in the replicas using variables that are maps - which are known as functions in TLA+ (due to TLA+ using mathematical concepts and notation). Each variable stores a specific thing like the replica status or the log or the op-number etc. The replica state variables have a prefix of rep_, such as rep_commit_number or rep_log.

All the rep_* variables are maps whose key is the replica id and value is the value for that replica. So to know the commit-number of replica 2, you would access it with rep_commit_number[2]. To update the commit-number of replica 2 to the value of 3 you would write:

rep_commit_number' = [rep_commit_number EXCEPT ![2] = 3]

TLA+ uses the syntax function_var’ = [function_var EXCEPT ![key] = new_value] to update one entry. This is strange syntax at first, but as Leslie Lamport says, you get used to it. It is basically saying that the function in the next state is the same as the current state except for one value which is changed.

To append an operation to the log of replica 1 you would write:

rep_log' = [rep_log EXCEPT ![1] = Append(@, op)]

The @ symbol is the current value of the key 1. So this is saying that the log variable in the next state is the same as the log variable in the current state, except for the value of replica 1, which has op appended to it.

However, you won’t find the specification accessing functions using integer values. What you will see is the letter r instead. Each action uses an existential quantifier to non-deterministically select a replica. For example:

\E r \in replicas :
    rep_status[r] = Normal

The above simply states that there exists a replica in the set of replicas such that it’s status is Normal.

Coming back to commit-number monotonicity, now any time an action might need to execute one or more operations, it uses the MaybeExecuteOps formula to do so. This uses a recursive sub-formula to append all operations between the current commit-number and the new-commit number to the application state and also update the commit-number to the new one. State is only mutated if the new commit-number received is higher than the replica’s own commit-number.

RECURSIVE AppendOps(_,_,_,_)
AppendOps(log, app_state, op, commit_number) ==
  IF op > commit_number
  THEN app_state
  ELSE LET app_state1 == Append(app_state, log[op])
       IN AppendOps(log, app_state1, op+1, commit_number)
   
MaybeExecuteOps(r, log, old_commit, new_commit) ==
  IF new_commit > old_commit
  THEN /\ rep_app_state' = [rep_app_state EXCEPT ![r] = AppendOps(log, @, 
                                                                  old_commit + 1,
                                                                  new_commit)]
       /\ rep_commit_number' = [rep_commit_number EXCEPT ![r] = new_commit]
  ELSE UNCHANGED << rep_app_state, rep_commit_number >>

The TLA+ above is saying the following:

If the new commit-number (received in a message) is greater than the current commit-number then execute all operations up to and including the new commit-number, else leave the application state and commit-number unchanged.
To execute the operations:
- Update the value of application state variable, for replica r, by appending all log operations between the current commit-number+1 and the new commit-number.
- AND update the value of the commit-number variable, for replica r, with the new commit-number.

Hopefully the above TLA+ makes sense now but if not don’t be put off. I always recommend Leslie Lamport’s video course as a great way to understand TLA+ fundamentals.

Conclusions

TLA+ isn’t perfect and it might not find every problem with your thinking - missing commit-number monotonicity is a great example. It turns out that the commit-number doesn’t have to be monotonic for the safety of the log replication module - only the state machine module. In hindsight, it is clear that the commit-number in a message from a primary might be lower than a replica’s own commit-number and therefore we shouldn’t blindly update the commit-number without checking its value first - but these mistakes are really easy to make which is one reason why formal verification is so useful.

I normally don’t model the state machine in an SMR protocol, it is only because I wanted to verify the checkpointing mechanism that I did at all. With TLA+, we choose a level of abstraction which gives us the best return on investment. Specifications that try to model everything can get too large and unwieldly (and just scare-off any curious engineers). Often we only want to verify some core aspect of a protocol or system which means that there is always room for mistakes to creep in once we start implementation with the parts of the design we didn’t formally verify. This is always going to be a trade-off when using formal methods during system design and implementation.

You can find the specification with the addition of application state and commit-number monotonicity in my GitHub repo.

Next up really is the replica-recovery sub-protocol.

Analysis parts: Analysis intro, part1, part 2, part 3, part 5, part 6.