Obtaining statistical properties through modeling and simulation

Sophisticated, simulations need not be. Valuable insights, even simple scripts reveal. — Formal Methods Yoda

A couple of weeks ago I was a guest on The Geek Narrator to talk about formal verification. I spoke a lot about how modeling and simulation are tremendously powerful tools, whether you use a formal verification language (such as TLA+) or just a Python script.

This post goes through a real-world example of how I used modeling and simulation to understand the statistical properties of a proposed distributed system protocol, using both Python and TLA+. There is a talk version of this post from TLA+ Conf 2022.

Why do modeling and simulation?

First, some definitions. I don’t have particularly precise definitions of these two concepts, except to say that perhaps:

  • Modeling is more about understanding the structure, actors, communications, and how moving parts work together.

  • Simulation is more about asking specific questions and getting answers by running experiments.

In this post I’ll explain my reasons for the practice and walk through a real-world example in both Python and TLA+.

During the process of writing and executing a model of a design, the system designer can uncover a number of things that they may not have considered while writing the design document. Optimizations can arise, as well as unexpected gotchas. But while TLA+ is amazing at finding correctness and liveness bugs in designs, it hasn’t always been useful for understanding other properties. There can be a host of other statistical properties that remain unexplored, which can lead to undesirable surprises later on.

A model can be written in any language, including TLA+. Traditionally, TLA+ has not been very good at answering these questions, but its model checker, TLC, now has the necessary capabilities to run simulations to obtain statistical data.

What kind of statistical properties might we care about?

I like using gossip protocols as an example to explain what kind of additional properties a system designer might care about, and how modeling can help.

Gossip protocols exist to enable scalable, decentralized communication in distributed systems, ensuring that information spreads efficiently and reliably among nodes, even in the presence of network failures or dynamic membership changes. There are many variants, and many optimizations that can be applied to any given protocol.

When designing a gossip protocol, there will be multiple statistical properties that the protocol designer cares about:

  • The time for the group to converge after a perturbation to the cluster (for example, a node leaves, dies, or joins).

  • The aggregate number of message exchanges (or network load) until convergence.

  • The aggregate number of message exchanges (or network load) over time. Does it start out busy then fade away? Is there a peak after a perturbation? What shape is the peak?

  • The probability of convergence stalling.

  • When a node falsely believes another node to be dead, does that false belief propagate? Does it ultimately get refuted? How long does it take for the refutation to spread and for convergence to occur?

Next, the designer might have some questions about how different things can impact those statistical properties:

  • How does changing the number of times a given piece of information can be shared affect those properties?

  • How does changing the number of peers to contact in each round affect those properties?

  • How does limiting the number of gossip items per message affect those properties?

  • How does implementing optimization X or Y affect those properties?

We can create a simplified model of the gossip algorithm, and run multiple simulations to gather data on the number of rounds to convergence, messages exchanged, etc., using different configurations and perturbations. Then we can visualize the results, look for unexpected patterns, and learn the relationships between statistical properties and configurations/optimizations.
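To make this concrete, here is a minimal sketch of what such a simulation could look like: a toy push-based gossip model in Python. This is not SWIM and not any of the specs discussed below; the node count and fanout values are arbitrary, illustrative parameters.

import random

def rounds_to_convergence(num_nodes: int, fanout: int) -> int:
    """One run: node 0 learns a rumor; each round, every informed node
    pushes it to `fanout` random peers. Returns rounds until all nodes know."""
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        for node in list(informed):
            informed.update(random.sample(range(num_nodes), fanout))
    return rounds

# Run many simulations per configuration, then summarize the distributions.
for fanout in (1, 2, 3):
    runs = sorted(rounds_to_convergence(100, fanout) for _ in range(1000))
    print(f"fanout={fanout}: median={runs[500]}, p99={runs[990]}")

A dozen lines like these are already enough to start plotting distributions and asking the “how does changing the fanout affect those properties?” style of question listed above.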

Fig 1. A violin plot of the number of rounds of message exchanges (as a function of the dissemination limit) for a gossip protocol (SWIM) to reach convergence after a single node dies.

The above plot was based on data generated by my TLA+ specification of the SWIM gossip protocol. That spec doesn’t make for a good case study for this post because it is too complex, and I didn’t ever implement it in Python. Instead, I’ll use a simpler case study from back in 2020 which I implemented in both Python and TLA+.

A real-world example - cooperating queue consumers

While on the RabbitMQ core team, I learned of a library author implementing the equivalent of Kafka’s consumer group protocol, using RabbitMQ’s single active consumer feature. Rather than having a central coordinator like Kafka, the author designed a cooperating consumers algorithm where each consumer made deterministic decisions that would lead to a group of consumers having balanced assignment over a group of queues.

A queue with Single Active Consumer (SAC) enabled only pushes messages to one consumer (the active consumer), even if multiple consumers have subscribed to the queue. Each SAC-enabled queue has an associated internal subscriber-queue, which contains the subscribed consumers in the order in which they initially subscribed. The consumer at the head of the subscriber-queue becomes the active consumer. If that consumer dies, its connection dies, or it unsubscribes, then it is removed from the subscriber-queue and the consumer now at the head becomes the new active consumer.

The basic algorithm of each client application was:

  1. Every 30 seconds, poll the RabbitMQ cluster to get stats to learn of the queues and the other consumers (identified by a prefix).

  2. Deterministically calculate the optimum number of queues that each application should be active on.

    1. If there are 10 applications and 20 queues in the group, then each application should have 2 active consumers.

    2. If there are 3 applications and 4 queues, then one app should have 2 active queues and the remaining apps should have 1 each.

  3. If the application has too many active consumers, then “release” a number of queues (by unsubscribing and resubscribing to those queues), thereby reducing its active count to the optimum for balance. Note that after a queue release, the application is at the back of the subscriber-queue for that queue.
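As a sketch of step 2, here is how that deterministic calculation might look in Python. The tie-breaking rule (sorting by app id, with the first few apps getting one extra queue) is my assumption for illustration; the real library may break ties differently.

def ideal_active_count(app_id: str, all_app_ids: list[str], num_queues: int) -> int:
    """Every app runs this same deterministic calculation, so the group
    agrees on how many queues each member should be active on."""
    ranked = sorted(all_app_ids)
    base, remainder = divmod(num_queues, len(ranked))
    # The first `remainder` apps in rank order get one extra active queue.
    return base + (1 if ranked.index(app_id) < remainder else 0)

# 3 applications over 4 queues: one app should be active on 2, the rest on 1.
print([ideal_active_count(a, ["a1", "a2", "a3"], 4) for a in ["a1", "a2", "a3"]])
# -> [2, 1, 1]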

Fig 2. Two applications over three queues reach a balance of the number of active consumers. Each queue has an associated subscriber-queue that RabbitMQ uses to determine which consumer is the active consumer.

I modeled the algorithm as a Python script, using a simple round-based approach to match the 30-second polling of the library. In a loop, each consumer would calculate its ideal number of active consumers, then release a corresponding number of queues if it had too many. Once the consumers were equally balanced over the queues, the script would write some statistics to a CSV file, such as the number of rounds required to reach a balance. The number of rounds loosely corresponded to time. I ran the simulation repeatedly, varying the number of queues, applications, and the order in which applications would subscribe to queues (each combination was run 4000 times).
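My original script is in the simulations repo; the following is a compact reconstruction (not the original code) that captures its shape. The subscriber-queues are modeled as deques whose head is the active consumer, and the deterministic tie-breaking is the same assumption as in the earlier sketch.

import random
from collections import deque

def simulate(num_queues: int, num_apps: int, sequential: bool,
             rng: random.Random, max_rounds: int = 1000) -> int:
    """One run of the start-up scenario. Returns the number of rounds to
    reach balance; one round corresponds loosely to one 30-second poll."""
    apps = [f"app{i}" for i in range(num_apps)]
    subs = {q: deque() for q in range(num_queues)}  # head = active consumer
    pairs = [(a, q) for a in apps for q in subs]    # app-major = sequential
    if not sequential:
        rng.shuffle(pairs)                          # fully random order
    for a, q in pairs:
        subs[q].append(a)

    base, rem = divmod(num_queues, num_apps)
    # Assumed deterministic tie-break: the first `rem` apps (sorted by id)
    # are entitled to one extra active queue.
    ideal = {a: base + (1 if i < rem else 0) for i, a in enumerate(sorted(apps))}

    def active(a):  # queues on which app `a` is currently the active consumer
        return [q for q in subs if subs[q][0] == a]

    for rounds in range(max_rounds + 1):
        counts = {a: active(a) for a in apps}       # each app's polled view
        if all(len(counts[a]) == ideal[a] for a in apps):
            return rounds
        for a in apps:                              # one check per app per round
            excess = len(counts[a]) - ideal[a]
            for q in rng.sample(counts[a], max(excess, 0)):
                subs[q].remove(a)                   # release: unsubscribe...
                subs[q].append(a)                   # ...rejoin at the back
    raise RuntimeError("no convergence within max_rounds")

rng = random.Random(1)
runs = sorted(simulate(10, 10, sequential=False, rng=rng) for _ in range(4000))
print("median:", runs[len(runs) // 2], "p99:", runs[int(len(runs) * 0.99)])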

I used an R notebook with ggplot2 to do the data viz.

Fig 3. X-axis: The number of applications. Y-axis: The number of rounds for the group of applications to reach a balance over 10 queues, after starting up. Left: The order of queue subscriptions is completely random. Right: Each application starts one at a time, and sequentially subscribes to the queues.

In this “start-up” scenario, some interesting relationships between the number of queues and the number of applications became evident immediately. Also, the order in which subscriptions happened was important.

In general, a sequential order of subscriptions saw a higher median number of rounds to reach balance, but a random order led to a greater number of rounds at the high percentiles. For the randomized order, the p99 was 15 rounds with 10 queues and 10 applications, equating to almost 8 minutes (15 rounds × 30 seconds) of queue releases until stability.

This seemed pretty high. But it got worse as the number of queues and applications grew.

Fig 4. At 20 queues, the p99 for a randomized order of subscriptions was 37 rounds with 20 applications, equating to around 18 minutes of queue releases until balance was achieved.

Next I wanted to understand how long it would take for a stable group to reach balance after losing one app (due to it dying or leaving). I called this the “lose-one-app” scenario.
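On top of the earlier reconstruction, the perturbation itself is only a few lines: run to balance, remove one app from every subscriber-queue (as RabbitMQ would when its connection dies), recalculate the ideals for the smaller group, and count the rounds to re-balance. A hypothetical helper:

def lose_one_app(subs, apps, rng):
    """A randomly chosen app dies: it is dropped from every subscriber-queue,
    promoting a new active consumer wherever it was at the head."""
    dead = rng.choice(apps)
    apps.remove(dead)
    for q in subs:
        subs[q].remove(dead)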

Fig 5. “Lose-one-app” scenario. The rounds to balance of a stable group of applications that loses one member.

Again, there was a striking relationship between the number of applications and queues, as well as the subscription order. The peaks indicated a lot of instability, with repeated “queue releasing” going on for up to 15 minutes in the sequential case.

Note: The sequential case is most likely when starting up a group of applications. However, over time, the subscriber-queues should become more randomized. So it is likely that we would see pairings loosely reflect the following:

  • “sequential subscription” with the “start-up” scenario

  • “random subscription” with the “lose-one-app” scenario

Understanding the relationships

If you are curious about what caused these relationships, then read this section; otherwise you can jump to the next. It’s best understood with a diagram; if you’re interested, I documented it in the GitHub repo.

The reason for this relationship all comes down to the subscriber-queues and the ideal number of active consumers per application. In the “lose-one-app” scenario, if we have 10 queues and 11 applications, then 10 applications each have 1 queue active, leaving 1 application with no active consumers. If one application dies, we have the following probabilities:

  • 1/11 probability the dead app is the one with no active consumers, hence the min number of rounds being 0.

  • 10/11 chance of an application with one active consumer being the one to die.

  • 1/10 chance of the application with 0 active consumers now being at the head of the subscriber-queue for that queue, and being made the active consumer straight away.

  • 9/10 chance of the application with 0 active consumers not being at the head of the subscriber-queue. In this case, multiple queue releases by other applications will be required to get it to the head of the subscriber-queue for that queue.

In the 9/10 case, it’s actually worse, because each app randomly chooses a queue to release when it has too many. This causes a chain of releases across multiple queues, as each release can cause another app to have too many active consumers, creating a wave of releases that can take a long time to settle down. It’s worst in the sequential subscription case: if the application that needs one active queue is at the back of every subscriber-queue (because it started last), then we need to go through the most rounds of releases to get it to the front of one subscriber-queue. That is why the worst case is so much worse with sequential than random. See the example in the repo for a clearer description.

The “nonactive-release” optimization and measuring its impact

The problem is that an application that desperately needs an active queue may be at the back of all or many of the subscriber-queues. We really need to jump that application to the front of the subscriber-queues. Enter the “nonactive-release” optimization.

This optimization works as follows. On top of the existing behavior, if an application decides it has the perfect number of active queues then it:

  1. Calculates the ideal number for every other app.

  2. If it detects that one or more other apps do not have enough active queues, then the application releases all queues where it is not the active consumer.

Going back to the example above, given that most of the apps already have the perfect number (bar one in 9/10 cases), most apps release the 9 queues on which they are not active, jumping the needy app to the front.
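In terms of the earlier reconstruction, the per-app check grows roughly like this (again my sketch, not the library’s actual code):

def release_check(a, counts, ideal, subs, rng):
    """The original check, plus the nonactive-release optimization."""
    excess = len(counts[a]) - ideal[a]
    if excess > 0:
        # Too many active queues: release a random selection of them.
        for q in rng.sample(counts[a], excess):
            subs[q].remove(a)
            subs[q].append(a)
    elif excess == 0 and any(len(counts[b]) < ideal[b] for b in ideal):
        # At the ideal, but another app is short: release every queue where
        # this app is not the active consumer, jumping apps behind it forward.
        for q in subs:
            if subs[q][0] != a:
                subs[q].remove(a)
                subs[q].append(a)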

I modeled this change, with the following results for the “lose-one-app” scenario:

Fig 6. “Lose-one-app” scenario. Left: Original algorithm. Right: The non-active release optimization.

The nonactive release optimization had a large beneficial impact on the “lose-one-app” scenario. It also helped with the “start-up” scenario, when subscriptions occurred in a random order. The sequential order subscription variant saw a more muted improvement.

Fig 7. “Start-up” scenario. Left: Original algorithm. Right: The non-active release optimization.

I won’t go into the nuances of why, as that is not the point of this post. You can get a good idea of the problem and how the optimization helped in this example from the README.md.

The TLA+ version

A few weeks later I rewrote the model in TLA+ after being challenged by Markus Kuppe. After some testing, it became clear that TLC, the model checker, needed some additional features to support gathering statistics. Markus added those features and eventually we were able to use TLA+ and TLC to obtain matching data for this cooperating consumers example.

The TLA+ version is very similar, except of course there are no loops.

Next ==
    \/ \E a \in apps : ActiveReleaseCheck(a)
    \/ \E a \in apps : NonActiveReleaseCheck(a)
    \/ \E a \in apps : SubscribeToAllQueues(a)
    \/ \E a \in apps, q \in queues : SubscribeToOneQueue(a, q)

The next-state formula consists of two main groups of actions:

  • Applications subscribing to queues.

    • SubscribeToAllQueues mimics the sequential subscription order (when sequential sub is configured).

    • SubscribeToOneQueue mimics the random subscription order (when random sub is configured).

  • Applications performing a queue release check, where they:

    • Determine the optimum number of queues.

    • Release queues if they have too many.

    • If the non-active optimization is enabled, also perform a non-active queue release if necessary.

Once balance has been achieved, a line of statistics is written to a CSV file.

To mimic the design of the Python script, the queue release checks are executed with the following constraints:

  • A queue release check can only occur when all applications are subscribed to all queues.

  • Applications must perform checks at an equal rate. This is done by tracking the number of checks performed, and ensuring no application can be more than 1 check ahead of any other.

This produced the following results for the “start-up” scenario:

Fig 8. The TLA+ version of the “start-up” scenario. The results are almost identical, despite the vastly different implementations and runtimes.

The “lose-one-app” scenario:

Fig 9. The TLA+ version of fig 5. Again, they are almost identical.

Rather than add the details of how to do it here, see the README.md which contains a short guide on how to add statistics gathering to a TLA+ specification.

TLA+ gotchas

I will include them here in case you try this and can’t figure out why your data doesn’t make sense. There are a few gotchas with using TLA+ for statistics that you need to be careful of:

  1. You absolutely must use TLC’s ‘generate’ mode instead of ‘simulate’ mode. Generate mode has been designed with statistical simulation in mind and, among other things, ensures a uniform distribution over non-deterministic choices (like those existential quantifiers in the next-state formula). Without that, the results get heavily skewed.

  2. You must ensure that TLC will explore enough steps. The default depth is 100, and TLC will stop a run after 100 steps and start the next run. If your traces sometimes require more than 100 steps, then your results will be skewed. I set the ‘-depth’ argument to 1000 and had an invariant that triggered if a trace exceeded 999 steps. This caught spec bugs and prevented bad data.

  3. Do not check temporal properties, as for some reason this also skews the generated stats.

  4. When checking multiple configurations in a single run, always do a sanity check on the distribution of runs per configuration.

Regarding that last point, I have a plot in my R notebook for the number of runs per configuration.
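I did the visualization in R, but the equivalent sanity check is tiny in any language. In Python it might look like this (the CSV column names here are assumptions):

import pandas as pd

df = pd.read_csv("results.csv")
# Each configuration should have received a roughly equal number of runs.
print(df.groupby(["app_count", "algorithm"]).size())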

Fig 10. The number of runs per AppCount-Algorithm combo.

As always with TLA+, be paranoid about success at first and work to get confidence that your spec is accurate. Liveness checks are useful to ensure that all states that should be reachable are in fact reachable. A spec bug that prevents some of the state space from being explored will skew statistics. Just remember to turn off the checks before running ‘generate’ mode.

There are some downsides to adding statistics gathering into a formal model:

  • The stats code can pollute the specification, making it messier.

  • Many optimizations that you want to gather stats on exist below the level of detail of your TLA+ spec.

But the takeaway is that it is possible to obtain statistical properties using TLA+ with TLC.

Conclusions

To summarize the benefits of modeling, it provides a structured way to understand and analyze complex systems by focusing on their components, interactions, and behaviors. It allows system designers to uncover overlooked aspects, test assumptions, and identify potential optimizations or issues early in the design process. By running simulations, engineers can answer specific questions about system behaviors and statistical properties. Designers care not only about safety and liveness properties, but also about statistical properties.

I coded up the original Python script for the cooperating consumers in a couple of hours, and had some plots to share with the library author that day. Simulations don’t need to be sophisticated to be valuable.

The nice thing about using TLA+ to obtain statistical properties is that you also get the correctness and liveness checking, all in a single specification. But if I only want some statistics, I’d probably choose a Python script first (it’s a simpler process). My next question is whether I can apply this technique, or something more sophisticated, with Fizzbee. Watch this space.

See my simulations repo for the code.