Photo by Linus Nylund on Unsplash.

In part 1 we introduced the concept of safety and liveness properties, then a stupidly simple gossip protocol called Gossa. Our aim is to find liveness bugs in the design and improve the design until all liveness issues are fixed.

Gossa had some problems. First it had cycles due to nodes contesting whether a peer was dead or alive. We fixed that by making deadness take precedence over aliveness but still the cluster could not converge. The next problem was that a falsely accused dead node was unable to refute its deadness as no-one would pay attention to it - deadness ruled.

The proposed fix I mentioned in part 1 was to allow a falsely accused node to refute its deadness via the introduction of a monotonic counter.

Gossa v3 - Generation to the rescue

In this third version of Gossa, we add a third variable: generation which is a monotonic counter that each node durably maintains.

The initial state sets generation to 1 for all nodes and peer_state values are now records which have two fields: generation and status (Alive/Dead).

	Init ==
	/\ running = [node \in Nodes \|-> TRUE]
	/\ peer_status = [node1 \in Nodes \|->
	[node2 \in Nodes \|->
	[generation \|-> 1,
	status \|-> Alive]]]
	/\ generation = [node \in Nodes \|-> 1]

view raw init hosted with ❤ by GitHub

When a node merges gossip with its own knowledge it applies the following rules:

For gossip about peer nodes, generation has the highest precedence and in cases where generations match, deadness has higher precedence than aliveness as before.
When the gossip pertains to itself, has the correct generation value and identifies the node as dead, it increments its own generation.
When the gossip pertains to itself and the gossip identifies the node as alive or the generation is stale, then it keeps its own knowledge of itself unchanged.

Fig 1. A gossip exchange where a node refutes its deadness.

	Highest(peer1, peer2) ==
	CASE peer1.generation > peer2.generation -> peer1
	[] peer2.generation > peer1.generation -> peer2
	[] peer1.status > peer2.status -> peer1
	[] OTHER -> peer2

	MergePeerState(local_node, local_ps, remote_ps) ==
	[node \in Nodes \|->
	CASE
	\* CASE --- This is node is a peer so we select
	\* the higher precedence of the two ---
	node /= local_node ->
	Highest(local_ps[node], remote_ps[node])
	\* CASE --- This node is the local node and the remote
	\* believes it is dead so it increments its
	\* generation ---
	[] /\ remote_ps[node].generation = generation[node]
	/\ remote_ps[node].status = Dead ->
	[generation \|-> generation[node] + 1,
	status \|-> Alive]
	\* CASE --- This node is the local node and the remote
	\* believes it is alive so it keeps its local
	\* knowledge of itself ---
	[] OTHER ->
	local_ps[node]]

view raw merge hosted with ❤ by GitHub

When a node starts, it also increments its generation as its peers may have identified it as being dead which it will need to refute.

	Start ==
	\E node \in Nodes :
	/\ running[node] = FALSE
	/\ LET new_gen == generation[node] + 1
	IN
	/\ running' = [running EXCEPT ![node] = TRUE]
	/\ generation' = [generation EXCEPT ![node] = new_gen]
	/\ peer_status' = [peer_status EXCEPT ![node][node].generation = new_gen]

view raw start hosted with ❤ by GitHub

You can see the whole spec here.

However, this new version confronts us with a problem. Because generation keeps incrementing, the state-space just became infinite and TLC will run until it runs out of local disk.

Dealing with an infinite state-space

If you want to model check your specification with TLC, then you don’t want an infinite state space. You need to constrain the number of states that can be reached.

We have a few options here:

Get TLC to compute the fingerprint of the global state without the generation value. The fingerprint is used by TLC to know if it has previously reached a state and therefore doesn’t need to explore subsequent states again. If we can remove generation from the fingerprinting then we should go back to a bounded state-space.
Use a state constraint to limit the maximum value of generation on any node.
Limit the number of executions (in any given history) of the actions that increment generation can be executed. In this case Start and Gossip.
Limit the number of executions (in any given history) of the actions that cause a node to react by incrementing its generation. In this case Die and IncorrectlyDetectDeadNode.

Note: that the term history, trace and behavior all refer to the same thing: a chronological record of events, actions, and messages that occurred in the system, in a serial order. I typically use the word history but TLA+ uses the term behavior.

Option 1 - Views

Option 1 is something that I do regularly with auxiliary variables. These are variables which are not part of the algorithm but are needed by the spec to accomplish some end. Examples are counters which count the number of times something happened, or for tracking something needed for an invariant or a counter used to generate guaranteed unique numbers. These variables can be excluded from the fingerprint by using a VIEW. I could exclude the variable generation by using a VIEW of <<running, peer_status>>.

However, in this spec the generation is also nested in records in a nested map of the peer_status variable. Also, generation is a core part of the algorithm so we really don’t want TLC to ignore it when deciding if it has previously explored a state or not.

Option 2 - State constraints

State constraints unsurprisingly constrain which states can be explored. We could use the following constraint:

\A node \in Nodes : generation[node] < 3

This says: for all nodes in the Nodes set, their generation must be below 3. TLC would then not further explore states that reached a generation of 3 on any given node.

The nice thing about state constraints for controlling the state space is that you don’t need to put these constraints in your specification. You can apply them from the Toolbox or another tla file. This prevents state-space controls from polluting your clean, elegant specification.

The downside is that state constraints can prevent liveness checks from executing. This is a pretty major downside.

TLC can only check liveness on infinite behaviors. Obviously we want to constrain the state-space so it is bounded but we also need behaviors to be infinite so liveness properties can be checked. This is where stuttering comes in. Stuttering is the concept that while no actions in your specification execute (because none are enabled), the world beyond your spec continues. In this world where things carry on happening just not in your system, your system “stutters”. Once TLC has no more enabled actions to execute, it stutters infinitely, producing an infinite behavior which can be checked.

State constraints can prevent TLC from stuttering because say, generation on node 1 reached the value 3 and no child states of that state get explored. TLC got cut-off from exploring the set of infinite behaviors where generation reached 3 on node 1. Thus liveness checks are not applied on any behavior because all behaviors eventually reach states where the generation reaches the value 3.

When I set the above state constraint, the TLA+ Toolbox actually gives me a warning:

This leads us to the next two techniques which are a bit more invasive, but allow for infinite behaviors.

Option 3 - Limit actions that directly lead to an infinite state-space

Our problem is that generation can get incremented infinitely. One solution is to put a limit on generation incrementing beyond some value, directly in the specification itself.

In the Start action, we only allow the node to start if its current generation is <= 2.

	Start ==
	\E node \in Nodes :
	/\ running[node] = FALSE
	/\ generation[node] <= 2 \* <----- Limit generation
	/\ LET new_gen == generation[node] + 1
	IN
	/\ running' = [running EXCEPT ![node] = TRUE]
	/\ generation' = [generation EXCEPT ![node] = new_gen]
	/\ peer_status' = [peer_status EXCEPT ![node][node].generation = new_gen]

view raw limit_start hosted with ❤ by GitHub

In the Gossip action, we only allow it to execute if neither the source nor destination increment their generation to 3.

	Gossip ==
	\E source, dest \in Nodes :
	LET gossip_sent == peer_status[source]
	...
	new_gen_dest == merged_on_dest[dest].generation
	new_gen_source == merged_on_source[source].generation
	IN
	/\ running[source] = TRUE
	/\ running[dest] = TRUE
	/\ new_gen_dest <= 2 \* <----- limit generation on dest
	/\ new_gen_source <= 2 \* <----- limit generation on source
	/\ peer_status[source] /= peer_status[dest]
	...

view raw limit_gossip hosted with ❤ by GitHub

Rather than hard-coding 2 in the limit, we can use a constant and set it through the config.

Just like the state constraints, we solve our infinite state-space but liveness remains broken. We have infinite behaviors but now we constrain behaviors so they start to stutter when the cluster is still divergent. The liveness property EventuallyConverges gets violated.

For example:

Node 1 falsely detects Node 2 as dead.
Node 1 gossips with Node 3, sharing the dead state of Node 2.
Node 2 gossips with Node 3 and discovers that Node 3 thinks Node 2 is dead. It increments its generation to 2.
Node 2 gossips with Node 1 and Node 1 sees that Node 2 is alive with a higher generation and updates its state.
Node 1 falsely detects Node 2 as dead, again…

However, because the spec doesn’t allow Node 2 to increment its generation to 3, it cannot refute its deadness and the cluster is unable to converge.

This is where I apply the following principle:

Recovery principle: Be restrictive with perturbations and liberal with recovery.

Another way of putting it is restrict how many times a stone can be cast into the pond but don’t restrict the ripples from reaching the bank. If we don’t want infinite ripples, stop the stones not the ripples themselves.

When we apply this principle to Gossa, we could restrict the actions Die and FalselyDetectDeadNode which are the stones cast into the pond.

Option 4 - Control the stones (perturbations), not the ripples (recovery)

Should we limit the action Die or Start? We could consider Die as the stone and Start as the ripple. We need to limit one of them else the generation can increment forever due to an infinite cycle of death and life.

Convergence doesn’t need nodes to come back to life, it just needs the live nodes to agree on an accurate state of the world. So it is completely valid to not apply Weak Fairness to the Start action which means dead nodes won’t always come back to life. It also means that we can keep the generation limit in the Start action as limiting a node’s ability to start does not impact liveness if we only care about convergence.

However, if we also added the liveness property EventuallyAllRunning then indeed we couldn’t limit the number of times a node could start as we could violate that liveness property (Start is a recovery action). Therefore we would need to limit the number of times a node could die (via its generation).

Falsely accusing a node of being dead is a large stone being cast into the pond. It causes a flurry of gossip that only ceases when all nodes converge on the aliveness and new generation of the accused node. Therefore we apply a limit to the number of times this can occur.

	FalselyDetectDeadNode ==
	\E local, remote \in Nodes :
	/\ local /= remote
	/\ running[local] = TRUE
	/\ running[remote] = TRUE
	/\ generation[remote] <= 2 \* <--- Limit generation
	...

view raw limit_falsely_detect hosted with ❤ by GitHub

With those changes we have Gossa v4. Now when we run TLC we have a finite state space and liveness checks are not impeded.

The slowness of liveness checking in TLC

The major drag with TLC liveness checking is that it is slow. You cannot liveness check the same size of models as you can with just model checking invariants. This is unfortunate and will limit how you use liveness checking.

Invalid safety properties as an additional tool

Another way for the TLA+ writer to gain more confidence in their specification is to use invalid safety properties. This is where you create an invariant that will be violated by reaching a state that you know is valid and should be reachable. In the case of Gossa you can actually make NotConverged an invariant so that as soon as convergence occurs, TLC spits out an invariant violation and the error trace.

The benefit of this approach is that it is faster than checking liveness properties. However, it also has the serious limitation that it will only tell you whether an expected state is reachable or not - it will not detect situations where some infinite behaviors do reach the expected good state and some do not. Let’s look at an example.

Fig 2. The invalid safety property detects that the darker green state at the bottom is reachable, but only the liveness property can detect the behaviors where that state is unreachable.

In the tiny state space above, we have 9 unique states and 6 possible behaviors. The darker green state at the bottom is the expected good state that our liveness property states should be eventually reachable from the initial state. First we set a negative safety property which says that the good state is never reached. TLC quickly spits out a counterexample with one of the 4 behaviors that can reach this state. In this case, the invariant violation is good, that is what we wanted.

Then we run TLC checking our liveness property and TLC spits out a temporal property violation and error trace. But this was not a negative result being good, but a positive result being bad. We just found out our spec or design has a liveness bug.

Look at invalid safety properties as an additional tool in the toolkit, but not a substitute for liveness properties.

In summary…

Liveness checking is an extremely useful and I would say necessary part of writing and checking TLA+ specifications. It can find shortcomings in your thinking, bugs in your specification and flaws in your design that invariants simply cannot.

I’m not sure I would trust any non-trivial spec that didn’t have some kind of liveness properties to verify that:

expected (good) reachable states are indeed reachable.
no cycles exist.
the algorithm cannot get stuck (or partially stuck).

Many of the above issues are often caused by subtle bugs in the specification itself, rather than the actual design. The danger is that a liveness bug in one part of the specification leads to TLC not discovering a safety or liveness property violation of the design itself - all because a large set of valid states were not reachable.

Some points to remember:

Safety properties are typically invariants which are state-scoped. That is, you can inspect any given state and know if the invariant holds or not. Liveness properties are history(or behavior in TLA+ parlance) scoped. You have to inspect a complete history in order to check if a liveness property has held or not. Hillel Wayne has an excellent blog post: Safety and Liveness Properties.
Use Strong and Weak Fairness to tell TLC which actions should eventually be executed. To learn more about fairness read the best blog post I know of on the subject, also by Hillel Wayne.
When placing bounds on infinite state spaces, be aware that state constraints can silently stop liveness checks from working. Instead you can build in state-space limits into the spec itself and use the recovery principle to avoid spurious liveness property violations:
Be restrictive with perturbations and liberal with recovery.
Or alternatively, if you want the ripples to subside, stop casting stones.
Hillel Wayne pointed out that options 2 and 3 can be made less invasive by placing the constraints as conjunctions in the Next state formula. See here for an example of that.
Be aware that liveness checking is slow and you won’t be able to check as large a model as you can with only model checking of invariants. Be prepared to run TLC at least twice: once with liveness checking enabled on a small model and again with only invariants on a larger model.
Invalid safety properties can be very useful for gaining confidence in a spec, but won’t catch all liveness issues.
Don’t forget that simulation mode exists and helps get confidence with models too large to brute-force check. Simulation is embarrassingly parallel, just give TLC as many cores as you can and let it run for long enough for you to be satisfied. See simulation in the TLC options page.

Next…

My next project is to implement Gossa in its multiple versions in the P language. Anecdotally I’ve heard this language is more accessible to engineers than TLA+ and so I’m curious to try it out. It supports liveness checking so I’m curious to see how the experience differs to TLA+.

Thanks for reading :)

	Init ==
	/\ running = [node \in Nodes \|-> TRUE]
	/\ peer_status = [node1 \in Nodes \|->
	[node2 \in Nodes \|->
	[generation \|-> 1,
	status \|-> Alive]]]
	/\ generation = [node \in Nodes \|-> 1]