The aim of this post is to give the reader an understanding of:

Why formal methods exist and why these methods are especially important in the field of distributed systems.
An introduction to TLA+, including the conceptual model of how it represents data and time.

From here, you should be able to go out and read more advanced content on TLA+ with some foundational idea of what it is all about.

Why use TLA+ or any other formal verification technique?

Designing complex systems such as concurrent and distributed systems is hard. Complexity arises from the need to balance numerous competing requirements, such as performance, scalability, fault tolerance, security and resource costs. The designer must maintain a mental model of an intricate web of dependencies, interactions, and potential failure points. This is no easy task.

Formal modeling is a powerful tool in the system designer’s toolkit. It provides a precise and unambiguous way to represent the behavior and structure of a system, giving engineers a precise language to describe intricate details and potential interactions that may be overlooked both by informal descriptions and by the engineer’s own internalized understanding of the system. With a model, you can get a better physical appreciation of the design, something you could almost pick up with your hands.

It is a communication tool as much as anything else. I like to think of it as an executable and verifiable diagramming tool but also a simulator that helps me analyze the system's behavior under various conditions.

Properties and principles

System designers care very much about principles and properties, using them as guides when designing and building complex systems.

A principle might be: “the system should be decentralized and stateless in as many components as possible in order to aid elasticity” or “the system should be able to self-heal and recover rather than depend on human intervention“. Principles are not machine verifiable, but the system designer can know when a principle is being followed or not and to what degree. Principles act as guiding stars for a project and help teams make better architectural decisions.

Properties are more concrete and are actually machine verifiable - the system can be proven to satisfy a property or not. Properties may be things like: “the system should remain available as long as one server remains operational” or “clients must be able to read their own writes” or “distributed deadlocks eventually get detected“. System designers care very much about these properties and define the key properties first so that they can design a system that is proven to satisfy them.

With formal methods we break things down into a formal model of the system and the properties that it should adhere to. Flaws are found by identifying behaviour that violates these properties.

There are two main methods of doing this:

Constructing mathematical proofs.
Using a model checker to do a brute force search of all the possible interactions and flagging any that lead to a property violation. This is called bounded model checking.

Constructing proofs is hard work and takes a considerable amount more effort than using bounded model checking. I personally use model checking as it is simple and practical enough for me to fit specification writing into a busy schedule of other tasks.

A short guide to TLA+

When we model a system in TLA+ we are modelling its design, not its implementation. We model what it should do, not how it should do it. Modelling is partly about learning more about the proposed system and about sharing the design with others, so clarity and simplicity is important. We don’t model every minute detail of a system but rather choose a suitable level of abstraction. Many aspects of a system can be considered optimizations or implementation details and can be left out of our specifications.

In a TLA+ specification, the “what the system should do” is represented by variables, constants and actions.

Constants describe aspects of the system that do not change. This could be the actors in the system, or some threshold
Variables describe the mutable state of the system, such as its data or the current status of the different actors (a server being up or down)
Actions modify the system (mutating the variables), for example, a message is sent or received, data is written or deleted.

States, Steps and Time

In TLA+ a state is a mapping of values to variables.

A 24 hour clock could be modelled by the variables hour, minute and second. One state, s1, could be the mapping of values to variables: hour=10, minute=23, second=38 while state s2 could be: hour=10, minute=23, second=39. One tick of the clock took the system from state s1 to state s2.

Fig 1. Two different states of a clock specification.

This clock specification would have 86400 possible unique states as that is the number of seconds in a day.

A step is the transition from one state to another. If the clock ticks 1 second, then we would transition from s1 to s2, where s2 is one second ahead of s1.

Time is not continuous but is instead represented by discrete units, the state, where each step is the transition from one state to another.

s1 -> s2 -> s3 -> s4 ... sN

We describe the initial state or even multiple possible states that the system can start in. The actions that describe the “what the system should do” will then transition the system into subsequent states.

Properties: Safety and Liveness

We classify our properties into two types: safety and liveness.

A safety property is focused on bad things not happening. The most widely used type of safety property is an invariant. An invariant is a property that must always be true in every reachable state - as soon as it is false it means we have violated safety. An example is data loss. The invariant may declare that every item placed into a bag must forever be found in that bag. But if there were to be a hole in that bag, an item may fall out, which would immediately violate that invariant.

Invariants cover individual states only, but we may also want a safety property that describes an illegal transition between states. For example:

a plastic bag must never go from FULL to RECYCLED, it must first be in the EMPTY state.
The clock in state s+1 must be one second ahead of state s.

A liveness property describes something good that must eventually happen. For example, if Bob sends Alice a postcard, Alice must eventually receive it. Or, if I keep placing items in a bag, the bag must eventually become full. If however, the postcard is dropped when the postman empties the post box, then the liveness property is violated as Alice can never receive it. If there is a hole in the bag, then items may fall out at the same rate they get added, causing the bag to never fill — violating that property.

Model Checking

TLA+ comes with both a bounded model checker called TLC and a proof assistant called TLAPS. In my career I have focused on model checking as it provides me the best ROI; I just can’t spend all my time on modeling, I actually need to go build the thing too.

The downside of model checking is that there are limitations to what it can achieve. TLC performs a brute force breadth first search (BFS) of all the possible reachable states. Some specifications can have huge or even infinite state spaces which can make brute force model checking unviable. Limiting the size of the state space is often necessary.

Let’s look at two examples.

Example 1: Two counters

We have two counters and in each step, one of them increments its value by 1. If we set no upper limit on the counter values the state space is infinite, but if we limit the counters to staying equal to or lower than 2 we get the following possible states and state transitions:

Fig 2: State transitions for two counters, up to the value of 2. There are 9 unique states.

Each state is a unique combination of all the variable values that occur at the same time and in our two counters example there are only 9 possible unique states.

There are also a number of different paths through this tree of possible unique states.

Fig 3: Six different paths through the state space

This particular system has 6 possible paths through the states. In TLA+, these paths are called Behaviours.

Below is the specification. It consists of two constants (set from the TLA+ toolbox or a config file when using VS Code):

C is the set {c1,c2}
Limit is set to 2

It has a single variable, counters, which a programmer can think of as a map or dictionary, where the key is the counter id (c1 or c2) and the value is the counter value. In TLA+, maps are known as a FUNCTIONS and the keys are known as the DOMAIN.

In the initial state both counters are set to 0. In the next state we non-deterministically choose a counter using the existential quantifier (there exists \E) and check if its value is below the limit. If so, we set the counters variable in the next state, to be the same as the current state except for this counter, which we increment by 1.

	---------------------------- MODULE TwoCounters ----------------------------
	EXTENDS Integers

	CONSTANT C, Limit
	VARIABLES counter

	Init ==
	counter = [c \in C \|-> 0]

	Next ==
	\E c \in C :
	/\ counter[c] < Limit
	/\ counter' = [counter EXCEPT ![c] = counter[c] + 1]

	AlwaysNonNegative ==
	\A c \in C : counter[c] > -1

	AllAtLimit ==
	\A c \in C : counter[c] = Limit

	EventuallyAllAtLimit ==
	<>[]AllAtLimit

	=============================================================================

view raw two_counters.tla hosted with ❤ by GitHub

This will look alien to most programmers, there are a number of things that are strange:

the use of the /\symbol which represents AND (&& in many languages)
the use of the ` symbol, known as prime. A variable with the prime symbol is a primed variable and represents the variable in the next state. Each state is immutable, the variable values cannot be changed. We can only assign new values to a variable in the next state and we do that by using the primed form of the variable and the = operator.
EXCEPT is a shorthand for not having to explicitly rebuild the whole function (map) in order to set the primed variable. We can say the function is the same, except for the value of one key.

The alternative to EXCEPT would be to explicitly define a new function with the same domain (set of keys) and values but increment the value only of the chosen counter c:

counters’ = [ctr \in C |-> 
                 IF ctr = c 
                 THEN counters[c] + 1
                 ELSE counters[c]]

The Init formula sets the initial state. The model checker then explores the state space by repeatedly executing the Next formula, until it has explored all 9 states.

Example 2: A Clock

Our clock specification has three variables: hour, minute and second.

Fig 4: A state transition for our clock specification.

The initial state is our starting point, where we set initial values of our variables. We set the clock to start at 00:00:00 as its initial state.

The next state formula describes the possible actions that could occur in any given state. An action is a step that takes us from the current state to some next state.

Our Clock specification could look like this.

	------------------------------- MODULE Clock -------------------------------
	EXTENDS Naturals


	VARIABLES hour, minute, second

	Init ==
	/\ hour = 0
	/\ minute = 0
	/\ second = 0

	Tick ==
	/\ hour' = IF second = 59 /\ minute = 59
	THEN IF hour = 23
	THEN 0
	ELSE hour + 1
	ELSE hour
	/\ minute' = IF second = 59
	THEN IF minute = 59
	THEN 0
	ELSE minute + 1
	ELSE minute
	/\ second' = IF second = 59
	THEN 0
	ELSE second + 1

	Next ==
	Tick


	=============================================================================

view raw Clock.tla hosted with ❤ by GitHub

In most specifications, the Next formula will contain multiple actions, but our clock only does one thing, it ticks. Therefore the next state formula has only a single action called Tick. The Tick action takes the time composed of the hour, minute and second variables and increments the time by one second.

When we give this specification to the model checker, it will assign the values as described in the Init state formula.

Limiting the State Space

The clock naturally has a finite state space, but the TwoCounters specification has an infinite state space without a counter limit. We could make the clock have an infinite state space by adding the date as a variable and setting no limit. The model checker would try to run it forever. In contrast, with only the hh:mm:ss parts, there are only a limited number of seconds in a day. There would be only 86,400 possible unique states (one for each second in a day). If we included the date, all we would need to do is tell TLC to stop exploring when a certain number of days have been explored.

States Form a Graph

We saw that the state space for the Two Counters example formed a graph where the vertices are the states and the edges are the steps.

The simple Clock specification has an initial state of hour=0, minute=0 and second=0. That state has only a single next state of hour=0, minute=0 and second=1, and so do all the other states. This forms a graph that is a simple chain where the number of vertices is 86400 and the number of edges is 86399.

Fig 5: Clock graph is a chain as each state has only one next state

Given a single worker thread, TLC took 4 seconds to find all 86400 unique states (and most of that time is TLC starting up, parsing etc).

When I create a variant of the clock spec which allows for the display of the clock to pause for arbitrary amounts of time, then the clock can appear to jump. In this JumpyClock variant, a valid step might be to go from 10:37:00 to 10:37:02 or even from 00:00:00 to 23:59:59.

Such a graph might look more like:

Fig 6: Jumpy clock graph, same number of states but many more possible transitions.

In fact, this simple change to the clock specification inflates the number of edges (steps) from 86399 to 3,732,436,800. But note that the number of vertices (states) is the same at 86400 (there are still only that number of seconds in a day). JumpyClock takes 40 minutes (with 8 worker threads) to explore the state space. Although the number of states remained the same, the number of steps was several orders of magnitude higher.

The JumpyClock specification below uses non-determinism like the TwoCounters specification in order to choose the next values of hour, minute and second. The if statements have been replaced by existential quantifiers (there exists \E) which non-deterministically select a value from the specified range. TLC will explore every possible value of each existential quantifier in its brute-force BFS.

Most specifications of distributed systems are run with state spaces that reach billions or even trillions of states potentially taking days to execute.

	----------------------------- MODULE JumpyClock -----------------------------
	EXTENDS Naturals

	VARIABLES hour, minute, second

	Init ==
	/\ hour = 0
	/\ minute = 0
	/\ second = 0

	TotalSeconds(h, m, s) ==
	s + (m60) + (h3600)

	IsInFuture(h, m, s) ==
	TotalSeconds(h, m, s) > TotalSeconds(hour, minute, second)

	Tick ==
	\E next_hour \in 0..23 :
	\E next_minute \in 0..59 :
	\E next_second \in 0..59 :
	/\ IsInFuture(next_hour, next_minute, next_second)
	/\ hour' = next_hour
	/\ minute' = next_minute
	/\ second' = next_second

	Next ==
	Tick

	=============================================================================

view raw JumpyClock.tla hosted with ❤ by GitHub

The existential quantifiers translate to: there exists an hour between 0 and 23, and there exists a minute between 0 and 59, and there exists a second between 0 and 59 SUCH THAT they form a time point which is in the future compared to the time of the current state. Then set the hour, minute and second variables in the next state to those values.

Creating Properties

For the Two Counters example we could have the following properties:

All counter values are non-negative — safety (invariant)
All counter values eventually reach the limit — liveness

We can define those as follows:

AlwaysNonNegative == \A c \in C : counter[c] > -1
AllAtLimit == \A c \in C : counter[c] = Limit
EventuallyAllAtLimit == <>[]AllAtLimit

AlwaysNonNegative declares that for all counters, their value is above -1. We instruct TLC that AlwaysNonNegative is an invariant and it checks that it is true in every reachable state.

EventuallyAllAtLimit uses the eventually symbol <> and the always symbol [] to declare that eventually all counters will reach the limit and stay there forever. We instruct TLC that EventuallyAllAtLimit is a temporal operator and it ensures that every behaviour conforms to it.

Liveness relates to complete sequences of states, known as behaviours in TLA+. Outside of TLA+ the term history is also used to describe one concrete sequence of states. Invariants on the other hand relate to individual states. We cannot inspect any given state and say that it does not conform to our liveness constraint, because there could always be some subsequent sequence of states that lead to conformance.

TLC evaluates all the enabled invariants in each step and halts execution when an invariant has been violated and also prints out the steps (known as an error trace) that led to this violation. Likewise, when TLC reaches the end of a behaviour and the behaviour does not conform to a liveness constraint, it halts and prints out an error trace.

How to run the Clock spec in the TLA+ Toolbox

The TLA+ Toolbox is the original IDE for TLA+. You can download and install it from here.

Download the Clock.tla Gist as the file Clock.tla.
Open the Toolbox -> File -> Open Spec -> Add New Spec and select it.
Create a new model by clicking TLC Model Checker -> New Model and entering a name for the model.
Press the green play button. You will see if finds all 86400 unique states.

If you want to see what an invariant violation looks like, add the following to the Clock.tla file.

SillyInvariant == second /= 30

In the Model Overview, expand the Invariants section and add the new invariant by name (SillyInvariant). Click the play button again. This time you will see the error: Invariant SillyInvariant is violated. It will show you an error trace with 31 states which leads to the seconds value reaching 30.

Some people hate the toolbox because it’s pretty clunky. It tends to be my main workhorse still as I prefer the UI for inspecting error traces. However, there is also a VS Code extension which I’ll explain next.

How to run the Clock spec in VSCode

Install the TLA+ extension here.

The extension needs a Clock.cfg file so create one with the following content:

\* CONSTANTS 

INIT Init
NEXT Next

\* PROPERTY
\* Uncomment the previous line and add property names

\* INVARIANT
\* Uncomment the previous line and add invariant names

Now right click in the Clock.tla file and select Check Model with TLC. It will find all 86400 unique states.

To test an invariant, add the same invariant SillyInvariant as last time. In the config file, add it as an invariant:

INVARIANT
SillyInvariant

Now run TLC again and you will see the same invariant violation as with the Toolbox.

The benefits of VSCode are that it is a generally nicer experience than the Toolbox and you also get a debugger.

When to use TLA+?

I use TLA+ when designing distributed systems, designing modifications to existing distributed systems and when trying to understand a distributed system which does not have good documentation of the design.

I could also imagine using it to design and validate the coordination between microservices or when designing a security model. For example, is it possible for hackers or other bad actors on the web to game my e-commerce system in some way by executing a sequence of orders, cancellations and so forth to get free products or spurious refunds. I’m just making this example up but the point is that TLA+ can represent pretty much any abstraction. It does not need to be limited to the logic of an application or protocol. You can model business processes too.

However, it isn’t a tool for all projects. If I were implementing an application that sits on a database I probably wouldn’t reach for TLA+. If a project involves complexity that derives from some kind of concurrency between multiple actors with some high stakes, then I think TLA+ can be a real productivity boost as well as a “down the road pain saver”. Markus Kuppe’s Weeks of debugging can save you hours of TLA+ video is a great video on that.

Changing how you think, write about and discuss systems

I recently reversed engineered the Apache Kafka replication protocol into a TLA+ specification as part of the KIP-966 design process. The hard thing about this was that there was no single prose description of the protocol to work from. I had to work mostly off the code itself with some reading of a few scattered design documents (KIPs) and then build in our proposed changes on top of that. Because the protocol had evolved organically over more than a decade, it was hard to create a neat abstraction of it. The end result was the largest and most complex specification I had ever written (which is not a humble brag but an admission of my inability to simplify it to a level I would like).

The great thing that came out of this exercise was that I was able to write the prose description that I wish I’d had at the beginning when I was trying to understand the workings of the protocol. Because the specification was highly focused on properties I was able to describe the design of the protocol with those same properties. This English prose description regularly refers back to these properties, such as the Leader Candidate Completeness property or the Replication Quorum Superset property. Read the Replication Correctness section which goes into depth about the motivations for each important rule and behavior in terms of these safety properties.

This same properties are expressed in the TLA+ specification.

Each property has a specific reason for existing which when violated causes a violation of the promises made by the system. I ended up defining a set of properties where some are specific to sub-protocols but which themselves support important higher level properties. Being able to explain the motivation for each local rule and behaviour that exists in terms of the properties they support aids clarity immensely in my opinion.

Fig 7. Kafka replication protocol safety properties.

Just as design patterns helped the industry by providing the terms to describe commonly used code structures and interactions, properties give us the terms to justify design decisions and guide an architecture over time.

I believe that having experience writing formal specifications leads to writing better design documents, writing more understandable code and having clearer more coherent discussions regarding the motivation for why each part of the design should work the way it does. Focusing on properties and principles also provide guard rails so that over time, detrimental changes are not introduced either accidentally or carelessly. This is especially important in open source where contributions can come from anywhere, or in projects with high turnover where foundational knowledge can get lost over time.

Writing formal specifications make good habits which are transferable to how we think about and discuss systems as a whole. It’s not just TLA+ that is available to engineers, there are others such as Alloy and P which are simple and practical enough to be used in engineering projects without a formal methods team. Try one out, try thinking in properties and principles and see where it takes you.

A primer on formal verification and TLA+