Paper: VR Revisited - State Transfer (part 3)

Previous parts: Analysis intro, part1, part 2, part 4, part 5, part 6.

The objective of the state transfer sub-protocol is to allow a replica that is behind to catch up. The paper states:

“State transfer is used by a node that has gotten behind (but hasn’t crashed) to bring itself up-to-date. There are two cases, depending on whether the slow node has learned that it is missing requests in its current view, or has heard about a later view.”

The words “has heard about a later view” is pretty vague and immediately the question arises as to what this means concretely. Having explored the view change sub-protocol extensively, we can confidently assume that this does not refer to any view change sub-protocol message. If a replica learns of a later view from an SVC, DVC or SV message then it should follow the view-change sub-protocol - not the state transfer sub-protocol. That leaves us with a replica hearing about a later view via a Prepare message.

The paper then goes on to state:

“In the former case it only needs to learn about requests after its op-number. In the latter it needs to learn about requests after the latest committed request in its log, since requests after that might have been reordered in the view change, so in this case it sets its op-number to its commit-number and removes all entries after this from its log.”

For now we’ll look at the second of the two cases where a replica learns of a later view via a Prepare message AND that it is missing a portion of the log (i.e. there is a gap between the replica’s op number and the op number of the Prepare message).

State transfer sub-protocol message passing

State transfer messages and rules are as follows:

To get the state the replica i sends a {GETSTATE v, n’} message to one of the other replicas, where v is its view-number and n’ is its op-number.
A replica responds to a GETSTATE message only if its status is normal and it is currently in view v. In this case it sends a {NEWSTATE v, l, n, k} message, where v is its view-number, l is its log after n’, n is its op-number, and k is its commit-number.
When replica i receives the NEWSTATE message, it appends the log in the message to its log and updates its state using the other information in the message.

However, the paper is not precise with its language about view-numbers. If the replica sends the GetState message with the replica’s current view-number then the message would be ignored as its view-number would be too low. State transfer would be useless, so I am confident that we need to set the view-number of the GetState message to be the higher view-number the replica has just learned of. That just leaves the question as to when the replica should update its own view-number.

Someone suggested to me that the replica should not change its view-number and should just wait for a view change, but this would obviate the need for state transfer at all as view changes include full log synchronization.

The replica needs to set the view number of the GetState message to match the Prepare message that is clear. The replica itself could either perform an eager view update where the replica updates its view number immediately or set it later on receiving the NewState message (lazy view update). I’ll explore both.

Eager View Update

The TLA+ specification gets three new actions: SendGetState, ReceiveGetState and ReceiveNewState.

Remember, we are only including the case where a replica learns of a later view and sees a gap in the entries.

SendGetState - A replica in the normal status receives a Prepare message with a higher view and whose op number is more than 1 ahead of its own op-number. The replica then:
- sets its view and last-normal-view to that of the message view number (eager view update).
- sets its op-number to its commit-number
- truncates its log to its commit number
- sends a GetState message with its new view number and new op-number to any peer.
ReceiveGetState - A replica in the normal status receives a GetState message with a matching view number and if it has log entries higher than the message op-number, it responds with a NewState message, including its op-number, commit-number and log suffix that is higher than the message op-number.
ReceiveNewState - A replica in the normal status receives a NewState message with a matching view and whose op-number is exactly one less than the first op in the log of the NewState message. This rule simply exists so that only a response that matches the request is processed - a custom variable could also have sufficed.

Eager view update - counterexample 1 - data loss

Using the model of three replicas, one view change and three operations with model checking mode, TLC finds a 16 step counterexample that violates the AcknowledgedWritesExistOnMajority invariant. This invariant is violated if a committed op does not exist on a majority of replicas. I use this invariant instead of a data loss invariant because it triggers earlier and as we know, once an op only exists on a minority, a view change can truncate or overwrite it. VR is all about overlapping election and replication quorums like Raft and Paxos which requires quorums to be majority quorums.

Let’s see what went wrong.

Fig 2. Eager view update counterexample 1

This counterexample requires that messages can be received out-of-order. VR Revisited discusses late and out-of-order messages in its Assumptions section:

“VR is intended to work in an asynchronous network, like the Internet, in which the non-arrival of a message indicates nothing about the state of its sender. Messages might be lost, delivered late or out of order, and delivered more than once; however, we assume that if sent repeatedly a message will eventually be delivered.”

Many protocols are built to be able to handle out-of-order delivery, such as Raft as it can simplify an implementation which doesn’t need to worry about message ordering guarantees between nodes in order to be safe. Apache ZooKeeper’s ZAB protocol is an example of a protocol that does explicitly control the ordering of messages passed between nodes.

The core insight of this counterexample is that State Transfer allows a replica to receive a Prepare message of a view for which it never received the StartView message because we eagerly updated the view-number. A late arriving StartView, processed after a Prepare message can overwrite operations that a replica has acknowledged back to the primary. We need to change the ReceiveSV action to prohibit this possibility.

The current restriction is:

/\ msg.type = StartView
/\ msg.view_number >= view_number

The basic rule is correct, we can receive a StartView when in Normal or View Change status and the SV view number should be >= replica view number to ensure that it is monotonic. But we need to be a little more specific: the SV view number can only match the replica view number if the replica is in ViewChange status, else the SV view number must be higher than the replica view number.

/\ msg.type = StartView
/\ \/ /\ msg.view_number = view_number
      /\ status = ViewChange
   \/ msg.view_number > view_number

The /\ means AND while the \/ means OR.

This prevents a late SV message from being processed after a Prepare message by a replica that joined a view due to state transfer, rather than as part of a view change. The paper does not describe any restrictions on receipt of a StartView message, though the intent was view monotonicity but we see from this counterexample that simple monotonicity is not enough.

Eager view update - counterexample 2 - data loss

Let’s see what went wrong this time.

Fig 3. Eager view update counterexample 2.

This data loss counterexample is caused by state transfer causing r3 to:

truncate its log to its commit number, removing op 1 (which drops op1 from being hosted by a majority to being hosted by a minority).
join view 2 and settings its last normal view to 2 (as it is in the normal status).
Send a DVC with log=[] and v’=2 despite being in a view where log=[1] was already established.

The DVC of r2 would have prevented the data loss but it arrived too late.

The fix to this particular case is for r3 to not truncate its log when it starts a state transfer. Instead it should leave its log as it is and send its commit number in the message. When a NewState message is received, it should overwrite any portion of the log covered by the log suffix in the message and append the rest.

Let’s make the change in the specification, we simply take out the mutation to the log and op-number in SendGetState. In the ReceiveNewState action we ensure that any the replica overwrites any operations whose op-number exists in the NewState message and append the rest.

Fig 4. Amend the SendGetState action to not truncate the log, and the ReceiveNewState action to append/overwrite instead of just append.

But is that enough? It seems like there could still be a data loss counterexample lurking because state transfer with eager view update allows a replica to join a view in normal status without the log synchronization of the StartView message. If a replica can join view V but without the up-to-date log that the view started with, is there an overlapping quorum issue?

It turns out there is indeed.

Eager view update - counterexample 3 - data loss

Turns out that eager view updates are just fundamentally unsafe.

The problem is that by r3 eagerly updating its view number to 2 without any kind of log synchronization, in subsequent view change 3, there exists a view change quorum where it has the highest last normal view (v’) of 2 (compared to r1 with a v’ of 1) and an empty log. This empty log “wins” and the commit-number of r1 is selected because it is highest. The StartView message that r3 sends includes an empty log with a commit-number of 1 which itself is a red-flag. This StartView will overwrite the logs of r1 and r2 where the committed op (a) currently exists.

By performing an eager view update without log synchronization we have violated a key property of VR Revisited, that any replica r of last normal view v’ is guaranteed to have all the committed entries that existed in v’. With VR Revisited, the operations written to the log do not contain the view-number (unlike the original VR). This means we rely on the last normal view to choose a log winner.

Let’s now start from the beginning with lazy view update.

Lazy View Update

The fixes we made for eager view update may not be required for lazy view update so we’ll remove the extra restrictions on receiving a StartView message and use log truncation again. It uses the same original actions with the difference that we set the view-number and last normal view in the ReceiveNewState action instead of in the SendGetState action.

Running TLC we very quickly encounter our first counterexample.

Lazy view update - counterexample 1 - data loss

Fig 6. Lazy view update counterexample 1.

Just like with eager view update, we see that the original restriction of msg.view_number >= view_number is not enough. A late arriving StartView message can overwrite progress made via a state transfer and set of Prepare messages. We need to readd those restrictions.

Will log truncation be next to fall (again)?

Lazy view update - counterexample 2 - data loss

Fig 7. Lazy view update counterexample 2.

TLC finds a counterexample with three replicas, four view changes and three operations that leads to data loss when truncating the log in state transfer. It comes down to simply losing overlapping view-change and operation quorums. By truncating an operation such that it goes from a majority to a minority, we introduce the possibility of a view-change quorum not hosting a previously committed operation.

Truncating the log in state transfer is unsafe no matter if we use eager or lazy view updates.

After removing log truncation like we did with eager view updates, there’s only one modification from the eager mode that we haven’t yet added to lazy mode - the StateTransfer status. Is it required?

Lazy view update - counterexample 3 - data loss

Fig 8. Lazy view update counterexample 3.

The counterexample above shows how a view change and state transfer can interleave in a way that causes the state transfer to overwrite the view change log synchronization, causing a committed operation to drop to a minority (leaving it open for a subsequent view-change to lose the committed op).

To fix this we will re-add the status StateTransfer again. This will prevent interleaving of view changes and state transfer.

I’ve tried small models with model checking mode and multiple large models with simulation mode, leaving TLC running for over 24 hours and no counterexample can be found. I’m reasonably confident that lazy view update with the fixes make state transfer safe - at least within the three sub-protocols modelled so far. Any of the current rules that work currently may prove to be insufficient in the future with the addition of client and replica recovery as well as the big one: reconfiguration.

This leaves us with the extra restrictions on receving a StartView message and the following state transfer actions:

Conclusions

The paper does not describe any restrictions on accepting Start View messages which caused data loss counterexamples in both eager and lazy view update approaches. Once we added the correct restrictions to receiving a StartView message we hit the data loss defect due to truncating a log. Truncating a log to a stale commit number is not safe as it could truncate operations that were required for a committed entry to maintain a majority quorum. We fixed this protocol defect by simply not truncating the log but allowing the NewState message to overwrite part of the log. Next we found that eager view updates in state transfer are not safe, which left only lazy view updates. Finally we saw that we need to ensure that view changes and state transfers cannot interleave in a way such that they can both complete - so we added a new status called StateTransfer.

The data loss defect caused by truncation looks like it is a novel defect discovered with VR Revisited. The other issues are also defects of the paper in my mind, but the main issue is an issue of omission or vagueness as these faulty behaviours are not explicitly described - instead they exist implicitly.

I am told there are at least two more out there, I assume these exist in the recovery and/or reconfiguration sub-protocols.

The state transfer version of the specification is in my GitHub repo.

Next we’ll add the replica recovery sub-protocol.

Previous parts: Analysis intro, part1, part 2, part 4, part 5, part 6.