To be atomic or non-atomic, that is the question (Fizzbee)

After posting my last Kafka transactions diary entry, JP (the Fizzbee maintainer) wrote a refactored version using non-atomic actions and a different way of representing the network. It’s a very interesting variant and I’m tempted to switch over to his version.

When an action is not atomic, its execution can yield at any moment to a different action in another role instance, or even in the same role instance. With this yielding, we can also replace explicit message passing with direct invocation of a function on another node. Both the invocation and the response can yield to other concurrent events happening across the system.

Let me use a simple example to demonstrate - Ping-Pong.

Fig 1. The basic idea of this ping pong spec.

Ping-pong with atomic actions and explicit message passing

I can write a spec with atomic actions and explicit message passing as follows:

always assertion AllOk:
    for pinger in pingers:
        if pinger.received != None and pinger.received != pinger.expected:
            return False
    return True

role Ping:
    action Init:
        self.name = "pinger-" + str(self.ID)
        self.expected = None
        self.received = None

    atomic action SendPing:
        require self.expected == None
        msg = record(source = self.ID,
                     name = self.name)
        network[ponger.ID].add(msg)
        self.expected = "hello " + self.name

    atomic action ReceivePong:
        recv_msg = any network[self.ID]
        self.received = recv_msg

role Pong:
    atomic action ReceivePing:
        recv_msg = any network[self.ID]
        send_msg = "hello " + recv_msg.name
        network[recv_msg.source].add(send_msg)
        network[self.ID].remove(recv_msg)

action Init:
    network = {}
    pingers = bag()
    for id in range(2):
        pinger = Ping(ID=id)
        pingers.add(pinger)
        network[pinger.ID] = genericset()
    ponger = Pong(ID=100)
    network[ponger.ID] = genericset()

This models each of the following as atomic:

  • Sending the ping

  • Receiving the ping and sending the pong

  • Receiving the pong

The network, modeled as a map, enables this division of work into asynchronous atomic units.

With 2 pingers and 1 ponger, it gives us 16 unique states.

Let’s look at a non-atomic variant of this ping-pong spec.

Ping-pong with non-atomic actions and implicit channels

always assertion AllOk:
    for pinger in pingers:
        if pinger.received != None and pinger.received != pinger.expected:
            return False
    return True

role Ping:
    action Init:
        self.name = "pinger-" + str(self.ID)
        self.expected = None
        self.received = None

    action DoPing:
        require self.expected == None
        self.expected = "hello " + self.name
        result = ponger.ProcessPing(self.name)
        self.received = result

role Pong:
    func ProcessPing(name):
        response = "hello " + name
        return response

action Init:
    pingers = bag()
    for id in range(2):
        pinger = Ping(ID=id)
        pingers.add(pinger)
    ponger = Pong()

In this version, each action is non-atomic, which means the model checker can suspend an action at any time (yield) and start or resume another action.

This enables us to replace the explicit message passing with a direct invocation of ProcessPing in the Pong role. It allows interleavings of multiple pinger calls to ProcessPing without needing a network. This is what JP calls implicit channels. Our three atomic actions are replaced by a single non-atomic action. The atomic action in the Pong role is replaced by a func.

Fig 2. Atomic actions with explicit network vs non-atomic actions with implicit channels (invoking funcs on other roles).

With 2 pingers, we get 34 unique states compared to the 16 for the atomic version. Naturally, this increases the state space as we have many more possible interleavings.

But not so fast: it actually depends on how much concurrency you allow Fizzbee to explore. By default, Fizzbee will only explore histories where up to 2 different actions are concurrently executing. So if we had 3 pingers, Fizzbee would only explore histories where at most two pingers are executing their DoPing action at a time. Therefore we can say that non-atomic Fizzbee does not always do exhaustive model checking. This is a necessary trade-off, as the state space without atomic actions gets really big really fast.
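The limit lives in the fizz.yaml file passed to the model checker. As a rough sketch (the exact key names are my assumption, inferred from the StateSpaceOptions line that Fizzbee prints in its output, shown later in this post; check the Fizzbee docs for the authoritative format), bumping the limit to 3 might look like this:

# fizz.yaml -- sketch only; key names inferred from the printed StateSpaceOptions line
# options:{max_concurrent_actions:2 crash_on_yield:false} deadlock_detection:false
options:
  max_concurrent_actions: 3   # default is 2
deadlock_detection: false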

Let’s explore this a little further.

Action Concurrency

With 8 pingers and 1 ponger, the atomic version took Fizzbee 3m 51s to find 65536 unique states. When I ran the non-atomic version (with a max action concurrency of 2), it ran for 25s and found 35508 unique states. That is fewer states, despite being non-atomic with many more possible interleavings. The reason is the limited concurrency.

These are the results as I gradually increase that concurrency:

  • Concurrency 2: Time=25s, States=35508

  • Concurrency 3: Time=3m 4s, States=180736

  • Concurrency 4: Time=12m 52s, States=545616

I didn’t wait around to see how long it would take for a concurrency of 5-8, as the state space gets very large very quickly.

In the model checker, the concurrency corresponds to threads. With a max concurrency of 2, there are two threads executing actions, yielding, selecting a different action to start or resume, and so on. You will see references to these threads in the error traces.

Using atomic blocks

Let’s make a silly change to our ponger. Before it responds to a pinger, it stores the response in a field, then returns the value of that field.

role Pong:
    action Init:
        self.response = None
 
    func ProcessPing(name):
        self.response = "hello " + name
        return self.response

This is ok when we have one pinger, but when there are two, the first invocation of ProcessPing could yield after setting self.response, and the second invocation could then overwrite self.response, causing the returned response to be wrong for one of the pingers.
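To make the race concrete, here is one interleaving that produces a wrong result (my own annotation of the schedule; it matches the counterexample Fizzbee reports below):

# thread-0: Ping#0 calls ponger.ProcessPing("pinger-0")
#             self.response = "hello pinger-0"            <- yields here
# thread-1: Ping#1 calls ponger.ProcessPing("pinger-1")
#             self.response = "hello pinger-1"
#             returns "hello pinger-1" to Ping#1           (correct)
# thread-0: resumes, returns self.response ("hello pinger-1") to Ping#0
#             but Ping#0 expected "hello pinger-0"         (invariant AllOk violated)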

The invariant finds that issue right away (more on that later).

We can fix this by placing the two lines in an atomic block.

    func ProcessPing(name):
        atomic:
            self.response = "hello " + name
            return self.response

This fixes the issue, and it is also very readable. So Fizzbee provides a nice way to model arbitrary atomic blocks of code that, in an implementation, we might protect with some kind of lock.

This also reduces the state space, as there are now fewer interleavings. If I wrap the lines of the original (non-silly) version in an atomic block, with 8 pingers and a max concurrency of 2, the state space is reduced from 35508 unique states to 16640 states.
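For reference, that change is just the original two-line func with its body wrapped in an atomic block (same shape as the fix above):

    func ProcessPing(name):
        atomic:
            response = "hello " + name
            return response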

Understanding error traces (counterexamples)

Now we get to the bad news: analysing counterexamples with non-atomic actions is extremely painful and slow. Maybe JP will come up with a nice way of fixing this (I hope so).

Let’s go back to the silly variant above, but without the atomic block to protect the returned value. With 2 pingers, the output is as follows:

Model checking /home/jack/tmp/nonatomic/nonatomic.json
configFileName: /home/jack/tmp/nonatomic/fizz.yaml
StateSpaceOptions: options:{max_concurrent_actions:2 crash_on_yield:false} deadlock_detection:false
Nodes: 32, queued: 17, elapsed: 12.861828ms
Time taken for model checking: 12.933343ms
Writen graph dotfile: /home/jack/tmp/nonatomic/out/run_2024-12-06_15-56-53/graph.dot
To generate svg, run:
dot -Tsvg /home/jack/tmp/nonatomic/out/run_2024-12-06_15-56-53/graph.dot -o graph.svg && open graph.svg
FAILED: Model checker failed. Invariant: AllOk
------
Init
--
state: {"pingers":"[role Ping#0 (params(ID = 0),fields(expected = None, name = \"pinger-0\", received = None)), role Ping#1 (params(ID = 1),fields(expected = None, name = \"pinger-1\", received = None))]","ponger":"role Pong#0 (fields(response = None))"}
------
Ping#0.SendPing
--
state: {"pingers":"[role Ping#0 (params(ID = 0),fields(expected = \"hello pinger-0\", name = \"pinger-0\", received = None)), role Ping#1 (params(ID = 1),fields(expected = None, name = \"pinger-1\", received = None))]","ponger":"role Pong#0 (fields(response = None))"}
------
thread-0
--
state: {"pingers":"[role Ping#0 (params(ID = 0),fields(expected = \"hello pinger-0\", name = \"pinger-0\", received = None)), role Ping#1 (params(ID = 1),fields(expected = None, name = \"pinger-1\", received = None))]","ponger":"role Pong#0 (fields(response = \"hello pinger-0\"))"}
------
Ping#1.SendPing
--
state: {"pingers":"[role Ping#0 (params(ID = 0),fields(expected = \"hello pinger-0\", name = \"pinger-0\", received = None)), role Ping#1 (params(ID = 1),fields(expected = \"hello pinger-1\", name = \"pinger-1\", received = None))]","ponger":"role Pong#0 (fields(response = \"hello pinger-0\"))"}
------
thread-1
--
state: {"pingers":"[role Ping#0 (params(ID = 0),fields(expected = \"hello pinger-0\", name = \"pinger-0\", received = None)), role Ping#1 (params(ID = 1),fields(expected = \"hello pinger-1\", name = \"pinger-1\", received = None))]","ponger":"role Pong#0 (fields(response = \"hello pinger-1\"))"}
------
thread-0
--
state: {"pingers":"[role Ping#0 (params(ID = 0),fields(expected = \"hello pinger-0\", name = \"pinger-0\", received = None)), role Ping#1 (params(ID = 1),fields(expected = \"hello pinger-1\", name = \"pinger-1\", received = None))]","ponger":"role Pong#0 (fields(response = \"hello pinger-1\"))"}
------
thread-0
--
state: {"pingers":"[role Ping#0 (params(ID = 0),fields(expected = \"hello pinger-0\", name = \"pinger-0\", received = \"hello pinger-1\")), role Ping#1 (params(ID = 1),fields(expected = \"hello pinger-1\", name = \"pinger-1\", received = None))]","ponger":"role Pong#0 (fields(response = \"hello pinger-1\"))"}
------
Writen graph json: /home/jack/tmp/nonatomic/out/run_2024-12-06_15-56-53/error-graph.json
Writen graph dotfile: /home/jack/tmp/nonatomic/out/run_2024-12-06_15-56-53/error-graph.dot
To generate an image file, run:
dot -Tsvg /home/jack/tmp/nonatomic/out/run_2024-12-06_15-56-53/error-graph.dot -o error-graph.svg && open error-graph.svg
Writen error states as html: /home/jack/tmp/nonatomic/out/run_2024-12-06_15-56-53/error-states.html
To open:
open /home/jack/tmp/nonatomic/out/run_2024-12-06_15-56-53/error-states.html

The nice thing is that it gives you commands to run that open diagrams and show the error trace of the counterexample. Let’s do that.


> open /home/jack/tmp/nonatomic/out/run_2024-12-06_15-56-53/error-states.html

Fig 3. The error-states.html of the non-atomic spec with the silly mistake.

Each of the rows is a yield point that corresponds to a line of code. It is not clear, but I’m guessing that row 5 is where the second thread (probably executing the DoPing of the second pinger) overwrites the self.response of the ponger, thus causing the wrong response to be sent back. Given such a trivial spec, I can handle this. But I could not handle it if it were an impenetrable trace with hundreds of rows.

Let’s compare this to a similar error trace from the atomic version.

Let’s introduce a silly mistake. Now the ponger caches the first response it sends, then it always sends that cached response from then on, regardless of who pings it.
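I haven’t shown that change, but based on the atomic spec above it would look roughly like this sketch (my reconstruction, not the exact code; the self.cached field name is made up):

role Pong:
    action Init:
        self.cached = None

    atomic action ReceivePing:
        recv_msg = any network[self.ID]
        # silly mistake: cache the first response and send it to every subsequent pinger
        if self.cached == None:
            self.cached = "hello " + recv_msg.name
        network[recv_msg.source].add(self.cached)
        network[self.ID].remove(recv_msg)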

The error trace is very clear. I don’t even have to open up the JSON diffs of the various steps.

Fig 4. The error-states.html of the atomic spec with the silly mistake.

Having done formal verification for a few years now, I can tell you that half the time is spent analyzing error traces to understand what went wrong. Right now at least, the error traces of non-atomic actions are too difficult to read, and so my productivity would be too low.

Final thoughts - the good and the bad

I really like JP’s refactor of my initial spec, which uses non-atomic actions with implicit channels for communication. I like being able to control where atomicity actually needs to be placed by using atomic blocks (something also possible with PlusCal). Being able to rein in the massive state space explosion that results from the additional interleavings by limiting action concurrency is also nice. I can start out with low concurrency, as I expect most issues will be detected in low-concurrency scenarios; then, once the spec is more bulletproof, I can dial up the concurrency to get more confidence (and pay the price of the larger state space).

I am left a little unsure about how much of the state space is being left unchecked. With this implicit-channel approach, three atomic actions without concurrency limits collapsed into one non-atomic action with concurrency limits. Both approaches sacrifice some reachable states, but in different ways.

The blocker for me right now is the readability of error traces. It could be as easy as placing the pertinent line of code in each row of the HTML table, or some other such trick. Once that gets fixed I will seriously consider using non-atomic actions. JP is usually pretty fast at closing gaps like this.

But even so… I still like atomic actions. The benefit of atomic actions is that they are easy to reason about. Building formal models can already be difficult, with many subtle challenges to trip you up. I have the feeling that there are a few landmines I’d only really get to know by treading on them. I’m not sure I’m ready to do that just yet.

What do you think? Tell me on Bluesky!
