Building A "Simple" Distributed System - Formal Verification

In the last post, we described a protocol that should satisfy the requirements and invariants established in the first post. Today we will look at formal verification with TLA+.

Formal verification is just another (niche) tool in the toolbox. Some tools require more skill than others to use. Some tools are more expensive than others. It is up to the practioner to decide if/when/how to use them.

The hard part is that you won't necessarily know if it is beneficial to a given problem you face, if you aren't already skilled in it. If a tool is very difficult to learn, then you might never invest in it enough to be able to make that call. Or you might invest a lot of time into it, to find it isn't a great match for your problem. At which point it gets stowed in your toolbox where it may or may not get used again. I expect many software engineers see learning formal methods as a difficult (it is) and high risk venture.

So, given the above, my aim of this post is for software engineers without prior experience of TLA+ to be able to get the gist of the spec and see why it was useful for this project. Please give me feedback if I succeeded or not.

I’ll split it up into three parts:

An intro to the syntax and data structures
An overview of the spec
The model checker and how it discovered a defect in my first version of the protocol.

For learning TLA+ and PlusCal (which is a pseudo-code which compiles to TLA+) I highly recommend Leslie Lamport’s video series and book, and Hillel Wayne has great content and a book and Ron Pressler writes about TLA+ on his blog.

Section 1 - Introduction to TLA+

Warning: I will try to avoid abstract and overly mathematical language and try to keep it closer to the language that programmers are used to. However, if you want to get serious about TLA+ then you’re going to have to embrace the strangeness and mathematical nature of TLA+, I learned it by watching Leslie Lamports video series and reading his book Specifying Systems.

A TLA+ specification models a set of variables and constants which represent the state of the system we want to model. A state in TLA+ terms is the assignment of concrete values to all variables in the specification, within a single step.

A state represents one moment in time of the system, a snapshot of the system. The notion of time is not continuous but a set of discrete steps where variables are assigned values and invariants can be verified. Each new step means a new state. Each state assigns a value to every variable, though not all variables must change, from one state to another. When we don’t modify a variable we must explicitly say that the value remains unchanged.

Let’s take a super simple example: a counter i, that is incremented forever. The specification is formed by formulae that either describe the initial state of the system, or the possible next states. This is what all TLA+ specs are based on: initial state and next state.

The initial state sets the variable i to 0. The next state formula will increment the value of i by one in the next state. The next state formula in this spec is “executed” again and again forever.

EXTENDS Integers
VARIABLES i
Init == i = 0
Next == i' = i + 1
Spec == Init /\ [][Next]_i

The above TLA+ code states that i starts at 0 and in each next step is incremented by 1. The use of the ‘ character denotes the next state. The variable i in the next state will be the value in the current state plus one. The ‘ character is called prime, so we have i primed and unprimed. Next in this case is an action because it assigns values to variables. But some formulae only evaluate whether something is true or false, as we’ll see in the next version.

The last line is strange and for now forget it. It requires you to go down the rabbit hole a bit deeper than is necessary at this point. I will omit that line from now on, we’ll cover it later.

Now let’s say we want to increment i until it reaches 100 then stop.

EXTENDS Integers
CONSTANT Limit
VARIABLES i
Init == i = 0
Next == 
  /\ i < 101
  /\ i' = i + 1

Here we introduce the AND logical operator which uses the /\ symbol. Yes it is strange to a programmer, but not to a mathematician. Just get used to the fact that \/ is OR and /\ is AND. The next state formula now has two parts.

/\ i < 101 does not assign a value to i in the next state, it is a state formula that will resolve to true when i is less than 101 and false when greater or equal to 101. We call this an enabling condition. Next will always be enabled when this formula is true. Actions will only execute when enabled.
/\ i’ = i + 1 increments that counter as before, but only when the enabling condition is true.

Once i reaches 100 in the current state it will assign the value 101 in the next state - oops, our counter goes to 101! Once at 101, the Next formula is false and no more steps can be taken. We wanted the counter to stop at 100, so we should change i < 101 to i < 100.

Now let’s say we want to model a counter that can monotonically increase or decrease in each iteration, and that has a maxium value of 100 and a minimum value of 0.

EXTENDS Integers
CONSTANT UpperBound, LowerBound
VARIABLES i
Init == i = 0
Increment ==
  /\ i < UpperBound  \* enabling condition
  /\ i' = i + 1      \* set i in the next state

Decrement ==
  /\ i > LowerBound  \* enabling condition
  /\ i' = i - 1      \* set i in the next state

Next == 
  \/ Increment
  \/ Decrement

Using a constant for the upper and lower bounds allows us to set these values from the model checker, TLC. We’ll set the lower bound to 0 and upper bound to 100.

In the above spec we say that i = 0 in the initial state. The Next action uses the \/ OR symbol and says that we will either increment or decrement the counter. Which one happens in each state depends on the current value of i. In the initial state, i is zero, which means that the enabling condition of the Decrement action is false, meaning that that action is not enabled, meaning that it will not be executed. So the first occurrence of the Next action can only increment the variable i in the next state. When i=1, then both Increment and Decrement will be enabled and either could be executed.

When we run the model checker, it will execute every possible permutation of enabled actions. Its kind of like parallel universes, if we limited the model checker to three actions then:

in one universe it went like this: Increment, Increment, Increment
in another it went: Increment,Decrement, Increment
in a third it went Increment, Increment, Decrement.

However, there was no parallel universe where it went Increment,Decrement, Decrement because after the first Decrement, its enabling condition was false. The model checker explores the entire state space and you can ask it to verify your invariants in every state it finds. This is hugely powerful.

Without limiting the model checker though, it would run the above spec forever as there are infinite permutations of a counter being incremented and decremented within the bounds of 0 to 100.

Another really simple introduction is https://sookocheff.com/post/tlaplus/getting-started-with-tlaplus/

TLA+ Functions, Sequences, Sets, Records

To do anything useful you normally need some kind of collection and possibly a record structure. The Rebalanser spec makes use of these.

For example, the list of node ids in the registry is stored in a set reg_nodeIds. It is initialized with the value {} denoting an empty set. Later we add elements using the union operator (\cup), remove elements (\) and perform intersections (\cap). The Sets page on Learn TLA+ has a nice overview.

(* The id models a monotonically incrementing id which guarantees that each node
   in the cluster has a unique id and older nodes have lower ids than newer nodes *)
RRIncrementNextId ==
  reg_nextId' = reg_nextId + 1

(* Adds a node to the nodes set and a new node id.
   Additionally the nVersion variable is incremented to signal a change in the cluster *)
RRAddNode == 
  /\ reg_nodeIds' = reg_nodeIds \cup {reg_nextId}
  /\ RRIncrementNextId
  /\ reg_nVersion' = reg_nVersion + 1

In the above formula, we assign a value of reg_nodeIds in the next state to be the one in the current state plus the current next id; essentially we add a new node id to the set of node ids. We also increment the reg_nextId variable in the next state. Note that we can assign reg_nextId a new value, but it only affects the next state, the current state is immutable.

Some data for each node is stored in a dictionary where the keys are the nodes and they each point to a NodeData record. The NodeData record is defined as:

Phase == { "-", "stopped", "started" }
Role == { "offline", "none", "follower", "leader" }
NodeData == [id : Nat, role : Role, aVersion : Nat, rbPhase : Phase]

The first two lines are sets, that contain the rebalancing phases and roles that a node can be in. The third line defines a record with four fields:

id (the id of the node assigned by the registry)
role (the current role of the node)
aVersion (the version of the resource allocations in the registry that the node is aware of)
rbPhase (the current phase of rebalancing)

TLA+ does not use the word dictionary, but the word function, from mathematics. The keys of the function are called the domain. We initialize this function as follows:

NoNode == [id |-> 0, role |-> "offline", aVersion |-> 0, rbPhase |-> "-"]
Init == 
  node_data = [n \in N |-> NoNode]

N in the above case is a constant defined earlier and provided by the model checker. It is the set of possible nodes that could be started up, such as { “n1”, “n2”, “n3” }. The line in the Init formula creates a function whose domain is the set N and where each node points to a record with an id of 0, role offline etc.

Later we’ll want to update the node data from being empty, to having an id, a role etc when we start up a node. When we mutate something, we cannot mutate it in the current state, unless it’s the initial state. In all subsequent states we simply define what it will be in the next state. When we update a single item in a collection, then we need to say not only what the item we are changing will be, but all the items. All the time in TLA+ we are defining the state of the entire system.

This introduces us to the EXCEPT ! notation which is strange, but you get used to it. The below formula says that the node_data function in the next state, will be the same as the node_data function in the current state, except for node n, whose record we assign new values. It takes an id from the registry, the none role, the current allocation version from the registry and the rebalancing phase “-” denoting that it has never been allocated anything.

node_data' = [node_data EXCEPT ![n] = [id |-> reg_nextId, 
                                       role |-> "none", 
                                       aVersion |-> reg_aVersion, 
                                       rbPhase |-> "-" ]]

Filtering

We might have a function of node records and want to perform filters on them. For example, the set of all node records that are running. I have included a C# example that is equivalent.

LiveNodes ==
  { n \in DOMAIN node_data : node_data[n].role # "offline"}

// C# example
public IEnumerable<NodeData> GetLiveNodes()
{
   foreach(var key in node_data.Keys)
   {
      if(node_data[key].role != "offline")
         yield return node_data[key];
   }
}

The # means not equal to. { … } denote a set, so we wrap n \in DOMAIN node_data : filter in braces to make a set of the filter.

The spec stores the list of disconnected nodes in a set called node_disconnected. In the Init formula it is set to {} which denotes an empty set. Some actions require that a node be connected, so we have a simple formula:

IsConnected(n) ==
  n \notin node_disconnected

We can then use that in other formulas:

RoleChange(n) == 
  /\ IsConnected(n)
  /\ BecomeLeader(n) \/ BecomeFollower(n)

In the above formula, IsConnected is the enabling condition. Unless the node n is connected, it cannot become the leader or a follower. The BecomeLeader and BecomeFollower actions change the state of the node record of n. Note the use of \/ (OR) and /\ (AND). It says that the node must be connected AND must EITHER become the leader OR become the follower. Whether or not it becomes the leader or follower or neither depends on the enabling conditions of those formulae. In parallel universes, both will get executed tens of thousands or even millions of times as the model checker explores all possible permutations of reality.

We often need to know if something is true of all elements in a set, or there exists some element in the set for which something is true. TLA+ has the \A meaning for all, and \E meaning there exists.

For example, the registry assigns nodes monotonically increasing integer ids. Whoever has the lowest id is the leader. Pretty simple leader election. So we have a formula for a node to know if it is the leader:

HasMinId(nodeId) == 
  \A n \in reg_nodeIds : nodeId <= n

A node id is passed to this formula and compared to all the node ids in the registry. If it is equal to or lower than all the ids, then it is the lowest and is true. This can then be used in the enabling conditions of our BecomeLeader and BecomeFollower formulae:

(* Having the lowest id and not currently being leader enables the action.
   The role is updated to leader and the term is incremented. The term indicates
   a new leadership period and does not change as long as the new leader continues being the leader*)
BecomeLeader(n) == 
  /\ HasMinId(node_data[n].id)
  /\ node_data[n].role /= "leader"
  /\ node_data' = [node_data EXCEPT ![n].role = "leader"]
  /\ node_ldr' = [nodeId |-> node_data[n].id, term |-> reg_term + 1, nVersion |-> reg_nVersion, rVersion |-> reg_rVersion, rbStats |-> NoStats]
  /\ reg_term' = reg_term + 1
  /\ NewLeaderRebalance
  /\ UNCHANGED << vars_nodeIsolation, vars_regNodes, vars_regResources, vars_regIsolation, reg_barrier, resourceAccess >>

(* Not having the lowest id and not already being a follower enables this action.
   The role is updated to follower. *)
BecomeFollower(n) == 
  /\ ~HasMinId(node_data[n].id)
  /\ node_data[n].role /= "follower"
  /\ node_data' = [node_data EXCEPT ![n].role = "follower"]
  /\ UNCHANGED << node_ldr, vars_nodeIsolation, vars_registry, resourceAccess >>

In both the formulae above, the first two lines are the enabling conditions, the end line is the UNCHANGED statement which we’ll cover soon and the rest defines the next state.

The BecomeFollower action uses the ~ character to denote NOT. So ~HasMinId means: does not have the min id. The UNCHANGED line lists all the variables that were not changed. It means that the model checker might execute only the BecomeFollower action and no others, setting all the unchanged variables in the next state to the same value in the current state.

Going back to \A and \E. The example we used was for an enabling condition. We can also use \E as an enabling condition but we can also use it to execute an action.

\E nd \in LiveNodes : \/ StopConnectedNode(nd)
                      \/ StopIsolatedNode(nd)
                      \/ ExpireNode(nd)

This says that there exists a node in the set of live nodes that:

is connected and can be disconnected OR
is isolated (disconnected) and can be stopped OR
can be expired

Each of the three formulae have enabling conditions and next state actions. The whole formula is true if one of the sub formula are true for one of the nodes. The model checker will execute enabled actions in all possible permutations. It could decide to stop a node, then start the node, then stop then start etc. The number of permutations across all the different actions is infinite, so we need a way to limit the state explosion so our model checker can actually finish.

To that end the spec has three counters which are useful for limiting the infinte state space to one of just a few tens or hundreds of millions.

isolationEventCtr is incremented every time we mess with network connectivity of a node
reg_nVersion is incremented every time we add or remove a node in the registry
reg_rVersion is incremented every time we add or remove a resource in the registry.

If we keep each of those counters to less than 4, then the model will finish in a short period if time, computing millions of unique states.

There’s a lot, lot more to TLA+ but hopefully, that gives you a taste. If you don’t get it at all then that is fine, you are not dumb - TLA+ is hard to learn and hard to explain, its a bit like Morpheus explaining the Matrix to Neo. For more on sets, functions, sequences check out Hillel Wayne’s Learn TLA+ site which is helpful for beginners. I also highly recommend that you simply take the red pill and get your mind blown by Leslie Lamport’s video series.

Other sources of more in depth TLA+ materials are Ron Pressler’s blog and Hillel Wayne’s blog.

Section 2 - The Specification of the protocol

Link to spec files.

Level Of Abstraction

Note that the real system will be composed of N nodes and a registry. The implementation will have requests and responses being sent over the network. We could model the specification using requests and responses if we wanted, though that would increase the size and state space of the specification.

I have opted to model the separation of nodes from registry by carefully naming the variables (and formulae) such that it is obvious which represent nodes and which represent the registry. When a node variable is assigned a value from a registry variable, there is implicitly some form of communication between the two. Likewise, when an enabling condition in a node related formula uses a registry variable, there is an implicit communication. To simulate connectivity and loss thereof, we use sets denote the disconnected nodes.

When you model a system in TLA+, part of the art of it is choosing the right level of abstraction. Too high and you might miss a defect. Too low and you might get lost in complexity.

The Spec

Like all good coders, I decided not to create one monolithic formula. I broke it up into small, nicely named formulae and provided comments also. The specification at a high level is composed of a tree of OR branches until it gets down to the formulae that actually change the state of the system. In each step it will execute just one of the blue formulae.

The blue formulae are the ones that actually change the state of the system. They also call further formulae in some cases. Not all of the blue formulae will be enabled in each state. But some higher level formulae are, for example EnvironmentChange can be executed in every state because either we can add a new node or remove an existing node, be it connected or disconnected. The model checker will explore every single permutation of these actions.

The specification is divided into two files ResourceRegistry.tla and ResourceBarrier.tla in order to separate the formulae regarding the registry and the nodes. The ResourceBarrier.tla is the parent. There are also formulae related to the real resources (not the resource identifiers stored in the registry) that are stored in the ResourceBarrier.tla file.

In the registry we have a bunch of variables:

CONSTANTS R \* the set of all possible resources that could exist. 
VARIABLES reg_resources,      \* the set of resources in the register.
          reg_rVersion,       \* the current version of the resources. Incremented every time a resource is added or removed.
          reg_nodeIds,        \* the set of nodes ids.
          reg_nVersion,       \* the current version of the nodes. Incremented every time a node is added or removed.
          reg_allocations,    \* the resource to node allocations (a function of nodes to resources).
          reg_aVersion,       \* the current version of the allocations. Incremented each time a new allocations is set.
          reg_barrier,        \* the resource barriers. A function of resource to node id.
          reg_term,           \* the leadership term. Incremented each time a leader is elected.
          reg_nextId,         \* a counter used to assign an id to a node. Increments each time a node is added.
          reg_unreachableIds  \* the set of nodes that are isolated from the registry

The resources, node ids and allocations all have versions. This is a simple way to represent change that can enable other actions in the spec. None of the formulae in the ResourceRegistry file know of the nodes and their variables. They only know the node ids, the resources and the allocations.

The ResourceBarrier file has some variables related to the nodes:

CONSTANTS N \* the set of all possible nodes that could exist.
VARIABLES node_data,           \* data in each node
          node_ldr,            \* data stored by the current leader
          node_disconnected,   \* set of all nodes isolated from the registry
          isolationEventCtr,   \* a counter of the number isolation events. Used by the model checker to limit the number of events
          resourceAccess       \* A simulation of the real resources. Not part of the Rebalanser system directly, used for verifying an invariant.

Each of the blue actions assign values to all these variables (because we always define the state of the whole system) when they are executed. Let’s take a look at one of them - StartNode (at the bottom):

(* The id models a monotonically incrementing id which guarantees that each node
   in the cluster has a unique id and older nodes have lower ids than newer nodes *)
RRIncrementNextId ==
  reg_nextId' = reg_nextId + 1

(* Adds a node to the nodes set and a new node id.
   Additionally the nVersion variable is incremented to signal a change in the cluster *)
RRAddNode == 
  /\ reg_nodeIds' = reg_nodeIds \cup {reg_nextId}
  /\ RRIncrementNextId
  /\ reg_nVersion' = reg_nVersion + 1

StartNode(n) ==
  /\ RRAddNode
  /\ node_data' = [node_data EXCEPT ![n] = [id |-> reg_nextId, 
                                            role |-> "none", 
                                            aVersion |-> reg_aVersion, 
                                            rbPhase |-> "-" ]]
  /\ UNCHANGED << node_ldr, vars_nodeIsolation, vars_regResources, vars_regIsolation, vars_regAlloc, vars_regMisc, resourceAccess >>

In the above TLA+ formulae:

assign the monotonic node id counter, reg_nextId, in the next state to be the the incremented value from the current state
assign the reg_nodeIds set in the next state to be the union of the current set and the reg_nextId value. Basically adding the new id to the node ids in the registry.
assign the variable node_data in the next state to be the same as the node_data function in the current state except that the record for node n has its id and role changed.

Notice that the order in which each sub formula is executed is irrelevant. The counter, for example, stays the same in the current state, it is only changed in the next state. It tiresome to repeat that the value in the next state is assigned the value from the current state except for xyz change. From now on, anytime we change the value of a variable, we are assigning it a new value in the next state only.

Another example. We disconnect a node in one step. Read it from the bottom up. In TLA+, you can’t use a formula that is declared lower than the formula it is used in. So start at IsolateNode at the bottom.

RAStopAllAccess(n) ==
  resourceAccess' = [ra \in DOMAIN resourceAccess |-> 
                        IF resourceAccess[ra] \cap {n} # {} 
                          THEN resourceAccess[ra] \ {n}
                          ELSE resourceAccess[ra]]

StopActivity(n) ==
  RAStopAllAccess(n)

RRIsolateNode(nodeId) == 
  /\ nodeId \in reg_nodeIds
  /\ nodeId \notin reg_unreachableIds
  /\ reg_unreachableIds' = reg_unreachableIds \cup {nodeId}

IfLeaderWipeData(n) ==
  \/ /\ node_data[n].id # node_ldr.nodeId
     /\ node_ldr' = node_ldr
  \/ /\ node_data[n].id = node_ldr.nodeId
     /\ node_ldr' = NoLeader

RevertToNoRole(n) ==
  /\ StopActivity(n)
  /\ node_data' = [node_data EXCEPT ![n] = [node_data[n] EXCEPT !.role = "none",
                                                                !.id = 0]]
  /\ IfLeaderWipeData(n)

IsolateNode(n) ==
  /\ RRIsolateNode(node_data[n].id)
  /\ RevertToNoRole(n)
  /\ node_disconnected' = node_disconnected \cup {n}
  /\ isolationEventCtr' = isolationEventCtr + 1
  /\ UNCHANGED << vars_regNodes, vars_regResources, vars_regAlloc, vars_regMisc >>

IsolateNode is only enabled if all sub formulae are enabled as we are using /\ AND everywhere. RRIsolateNode (RR standing for Resource Registry) is only enabled if the node is not in the list of unreachable nodes. So we can only isolate a connected node.

What it and its sub formulae do:

Adds the node id of node n to the set reg_unreachableIds
Stops all resource access by removing the node id from the resource access function which maps resources to nodes. The resourceAccess function maps each resource to a set of nodes. The invariant does not allow multiple nodes to access a single resource so why a set? Because the reality is, is that concurrent access is something we want to verify. We have an invariant formula which we check in every state that checks that for each resource -> node set mapping, the set has a cardinality of 0 or 1 only.
Reverts the node back to the none role. If it was the leader then we wipe the leader data.
Add the node to the node_disconnected set. Why have a node_disconnected and a reg_unreachableIds set? Because they won’t always be the same. Once a node has been unreachable for a while it is expired, meaning it is completely removed from the registry. However, the node may still exist and still be disconnected.
Increment the isolationEventCtr. We use this counter in the model checker to limit the state space.

So how does this relate to the protocol? In the protocol we state that when a node cannot reach the registry, it must revert to no role before the registry expires it. We achieve that by reverting to the none role in the next state, and simply adding the node id to the unreachable ids in the registry of the next state. In the next step, the isolated node will have no role and will only be able to assume a role once connected again. The registry has an RRExpireNode formula which is now enabled for that node and can be executed at any time which wipes the registry of the node and any barriers it had. Also, the protocol says that resource access must stop when a node cannot reach the registry, we also model that in this formula.

Let’s look at the ExpireNode formula, again it is at the bottom:

(* Searches through all of the barriers to find one or more barriers owned by the given node id.
   Any matches it finds, it removes the barrier (by setting it to 0) *)
RRRemoveBarriers(nodeId) ==
  reg_barrier' = [rb \in DOMAIN reg_barrier |-> 
                    IF reg_barrier[rb] = nodeId 
                       THEN 0 
                       ELSE reg_barrier[rb]]

(* Removes a node from the nodes set and removes its node id.
   Additionally the nVersion variable is incremented to signal a change in the cluster *)
RRRemoveNode(nodeId) ==
  /\ reg_nodeIds' = reg_nodeIds \ {nodeId}
  /\ reg_nVersion' = reg_nVersion + 1
  /\ reg_unreachableIds' = reg_unreachableIds \ {nodeId}
  /\ RRRemoveBarriers(nodeId)

RRExpireNode(nodeId) ==
  /\ nodeId \in reg_unreachableIds
  /\ RRRemoveNode(nodeId)

So it is only enabled when the given node id is in the set of unreachable ids. It then removes the node in the next state by:

removing the id from the set reg_nodeIds
removing the id from the ser reg_unreachableIds
incrementing the reg_nVersion counter
removing the id from any barriers

When executed, there is no trace of that node id in the registry in the next state, including any barriers it had being lost.

That’s enough I think of looking at the specifics of the specification. It is reasonably well documented and explains what each formula does. You should have an idea of how we can model all the possible interactions between the registry and the nodes, causing all kinds of mayhem that should test the protocol design to its limit. In the next section we’ll look at the defect it found.

Section 3 - Running the model checker (TLC) and finding a defect

In my first version of the protocol, each resource barrier was a property of the resource itself. When an administrator removed a resource, it removed its barrier also. No problem I thought, once removed, a new rebalancing won’t include it, so there would be no need for the barrier.

The model checker found the following series of states that lead to the violation of my invariant. We’ll look at the invariant formulae in this section also. There is node A and node B with three resources r1 to r3.

(A) starts up
(A) becomes leader
(A) allocates (A) -> r1, r2, r3 in the registry
(A) puts barriers on r1, r2, r3 in the registry
Admin removes r1 (including its barrier) from the registry
Admin adds r1 to the registry
(A) starts accessing resources r1, r2, r3 (has not yet detected the changes to resources)
(B) starts up
(B) becomes a follower
(A) now sees the changes to resources and node. Allocates (A) -> r2, r3 B -> r1 in the registry
(B) sees change to allocation and enters stopped phase of rebalancing.
(B) enters start phase of rebalancing, puts barrier on r1 in registry and starts accessing resource r1 VIOLATION: (A) is still accessing r1
(A) stops all access of r1, r2, r3
(A) removes barriers from r1, r2, r3. Cannot remove barrier on r1 as it is no longer the owner of that barrier.
(A) puts barrier on r2, r3 in registry
(A) starts accessing resources r2, r3

Node A starts up, becomes the leader and allocates itself all the resources. We see that an administrator deletes and then quickly readds a resource with the same identifier. In doing so, its barrier is lost, which leads to the violation.

There are multiple scenarios related to an administrator removing and adding the same resource quickly enough to cause this violation. The solution was simple: make barriers their own entities separate from the resources. When an administrator removes a resource, any barriers it had are not removed. This fixes the problem. Barriers can only be removed by nodes when they either shutdown or a rebalancing occurs, also if they perish their id gets added to the unreachable list and then later on are completely removed from the registry, including their barriers.

Invariants

Defining invariants in TLA+ is exactly the same as any other formula. It is basically an enabling condition that you don’t want to ever happen.

This spec has three invariant formulas:

RRInvariant - for variables in the registry
RAInvariant - for the resourceAccesses variable
RBInvariant - for rebalancing

Invariants in a spec don’t just test your design, they ensure you don’t have a bug in your spec, like assigning the wrong type of value to a variable. TLA+ is dynamically typed so you can assign any value to any variable and it won’t complain, except for some runtime errors. Invariants help you catch these kinds of mistake.

(* The resource allocations are balanced if there is not a difference in the number
   of resources assigned between two nodes that is greater than 1 *)
NoImbalancedAllocations ==
  \/ Cardinality(DOMAIN reg_allocations) <= 1
  \/ /\ Cardinality(DOMAIN reg_allocations) > 1
     /\ \A n,m \in DOMAIN reg_allocations : Cardinality(reg_allocations[n]) - Cardinality(reg_allocations[m]) \in {-1, 0, 1}

RRInvariant ==
  /\ reg_resources \subseteq R
  /\ reg_rVersion \in Nat
  /\ reg_nodeIds \subseteq Nat
  /\ reg_nVersion \in Nat
  /\ reg_nextId \in Nat
  /\ reg_aVersion \in Nat
  /\ reg_barrier \in [R -> Nat]
  /\ reg_term \in Nat
  /\ NoImbalancedAllocations

What is all that you are thinking! The \in Nat means that the value of the variable is a natural number. It says that the current resources, reg_resources, are a subset of R, the constant that we set in the model checker. That reg_barrier is a function of R to a Natural number (corresponding to a node id).

NoImbalancedAllocations ensures that we never have imbalanced allocations in any step.

The RAInvariant simple ensures that there are never two nodes accessing the same resource:

RAInvariant ==
  \A ra \in DOMAIN resourceAccess : Cardinality(resourceAccess[ra]) < 2

The RBInvariant ensures the variable types are correct, like the RRInvariant but also that once rebalancing is complete that:

All in-scope resources are being accessed
All in-scope resources have been allocated

What does in-scope mean? When a rebalancing starts, concurrently, new resources can be added or removed. So any check of the resource accesses needs to be made against the snapshot of resources that existed at the time of allocation.

When I run the spec with the following state constraint:

/\ reg_nVersion < 3
/\ reg_rVersion < 3
/\ isolationEventCtr < 3

Then it completes quickly with only 201826 distinct states, whose longest behaviour is 18 states long. If I change the < 3 to < 4 it takes about 2 hours to complete and finds hundreds of millions of states. This is called the state explosion.

This specification has given me confidence that my protocol is sound. I still have a ways to go playing with more advanced temporal logic, but the safety side of things seems to be well covered and uncovered a defect that I had not considered. I am sure TLA+ experts could tell me ways of improving the specification and you are welcome to do so! But the point is that me, as a software engineer, not a temporal logic mathematician was able to be a create a set of mathematical formulae that describe the system. From the decision to learn TLA+ to having this spec ready took 18 days of evenings and weekends. I started out with sub problems, like solving the leader election alone, then calculating even resource allocation and finally putting it all together. You can get useful results relatively fast.

A Note on Concurrency

You may be wondering about the fact that we are modelling a concurrent system sequentially. The fact is that time, in reality, is continuous and a series of actions that occur in the same time frame concurrently can be modelled as a single sequential history of events. Look at the steps that led to the defect, that was one such linearized history of events that led to a violation of an invariant.

Some of those actions might be really really close in time. In TLA+ we have no concept of the amount of time between steps, just whether one step happened before or after another. There could have been 1 nanosecond between two steps or 1 week. To the model, it is not relevant. What is important is the order of steps that occur.

This is what TLA+ gives us, this notion of before and after. A behaviour in TLA+ is one such single sequential history of events, one among possibly tens, thousands, millions or even an infinite number. A single counter that counts to inifinity has one single behaviour, whereas Rebalanser has infinite such possible behaviours (unless we limit the state space).

Final Thoughts

If you decide to learn TLA+ and you’re just a regular software engineer like me, then brace yourself for a wild ride. Yes it is strange and weird but it is also quite wonderful. Is it the silver bullet? Is it always applicable? The answer is no. But some systems are nicely suited to it and if you are building that type of system then I don’t think you’ve got anything to lose by learning TLA+ or PlusCal. Personally for me, I have really enjoyed the experience and will continue to spend time learning it.

We didn’t cover temporal logic, fairness, liveness or any of the other more advanced topics. We also didn’t cover PlusCal which is a pseudo-code alternative that you might find easier to grok. Head over to the forum, people there respond quickly to questions if you have doubts. Some useful links:

https://groups.google.com/forum/#!forum/tlaplus The forum.
https://www.reddit.com/r/tlaplus/ The sub reddit for TLA+
https://learntla.com/introduction/ Hillel Wayne’s site, with a link to his book.
https://lamport.azurewebsites.net/tla/tla.html A wealth of info and links, including links to his book Specifying Systems which is excellent and thorough. Don’t expect to understand it all on first reading.
https://github.com/tlaplus/Examples Many, many example specs.
https://lamport.azurewebsites.net/video/videos.html The video series on Leslie Lamport’s site.
https://www.youtube.com/channel/UCajiu4Cj_GHOX0if3Up-eRA The video series on YouTube

Links to the rest of this series:

Part 1 - The What

Part 2 - Designing the protocol

Part 3 - Formally verifying the protocol with TLA+ (this post). Link to spec files.
Part 4 - The implementation
Part 5 - Logging
Part 6 - Testing the implementation (coming soon)

Banner Image Credit: G. Brammer/ESO. Link to image.