Verifying Kafka transactions - Diary entry 4 - Writing an initial Fizzbee spec — Jack Vanlightly

Verifying Kafka transactions - Diary entry 4 - Writing an initial Fizzbee spec

I have written a Fizzbee spec that is the equivalent of the TLA+ spec.

Let’s dive into the decisions I had to make with this initial Fizzbee spec.

Decisions

Decision one - Atomic actions

Fizzbee is very similar to TLA+ in many ways, but one of the ways that it differs is that you are not limited to atomic actions. In fact, with Fizzbee, if you want atomic actions then you need to label the actions, and any functions they call, atomic. Else the model checker will explore a state space where an action yields part way through. 

My first decision point was whether to make actions atomic or not. I decided to make them atomic, else I would need to model locks where critical sections exist. This spec is going to be complex enough, so making actions atomic by default seems like the best way to go.

I have matched the TLA+ design, with four actions covering the same behavior and same stored state. This might not always remain the case, as Fizzbee is different and it may not always be the best choice to keep their designs closely aligned.

Decision two - Roles

Fizzbee has the concept of a role, which acts like a class:

  • You can instantiate many of them.

  • Each instantiation can hold local private state.

  • Each instantiation has methods. In Fizzbee these are actions (invoked by the model checker) and funcs, which can be called by actions and other funcs.

The Fizzbee spec has three roles, with associated actions or funcs:

  • Client (role)

    • SendInitPidRequest (action)

    • ReceiveInitPidResponse (action)

  • TransactionLogPartition

    • Append (func)

    • AdvanceHwmAndReturnEntry (func)

  • TransactionCoordinator

    • ReceiveInitPidRequest (action)

    • CompleteInitPidRequest (action)

The initial state instantiates a number of each of these roles, based on the same model parameters as the TLA+ specification. At the top of the file, these model parameters are defined and set (there is no separate cfg file for this).

\* Model parameters
NUM_BROKERS = 2 
NUM_CLIENTS = 2
NUM_TXN_LOG_PARTITIONS = 1
NUM_TRANSACTION_IDS = 1

Decision three - the Network

The network is modeled in a very similar way to TLA+. Where TLA+ has a single set, where each message contains the source and destination, my Fizzbee spec is a map of sets. The keys are the receiving actors (client and TCs) and each value is an unordered set of inflight messages. With Fizzbee I don’t need to model the processed messages set (for easier trace reading), as the traces make that clear without this additional set. This is good as there is no View with Fizzbee that I can use to hide that set from state fingerprinting as I can with TLC (which is needed as it balloons the state space).

# ----------------------------------------------
# Network
# ----------------------------------------------

atomic func NetReply(source, recv_msg, send_msg):
    # Add the reply message to the net buffer of the dest
    net_input_buff[recv_msg.source].add(send_msg)
    # Remove the processed message from the net buffer of the source
    net_input_buff[source].remove(recv_msg)

atomic func NetSend(dest, msg):
    net_input_buff[dest].add(msg)

atomic func NetMsgProcessed(source, msg):
    # Remove the processed message from the net buffer of the source
    net_input_buff[source].remove(msg)

The logic in each action is the equivalent of the TLA+, written in Starlark, a subset of Python (with some Fizzbee DSL as well). I won’t go into the specifics of each action, I’ll leave that to the reader to read and understand it, and compare to the TLA+. 

Some initial comparisons to TLA+

Now perhaps take a moment to read the Fizzbee spec and the TLA+ spec, and convince yourself they are equivalent.

Which did you prefer?

When comparing TLA+ to Fizzbee, in terms of writing specs, there are a couple of points worth noting, besides Fizzbee requiring a shallower learning curve.

Fizzbee doesn’t have to be atomic

I already covered this, but it is a big difference between the two. I have opted for atomic actions because it’s easier to reason about, avoids the needs for locks and reduces the state space. But on simpler specs I imagine it’s a really nice feature.

Regarding my comment on locks, you don’t actually model locks, just mark blocks of code as atomic, for example:

action DoSomething:
    atomic:
        self.x = self.x + 1
        self.y = self.y + 1

    atomic:
        self.a = self.a + 1
        self.b = self.b + 1
    

The above increments x and y atomically, then a and b.

Knowing when state changes

In TLA+, a step takes us from one state to another state. The values of the variables in the current state are immutable, you can only set the primed variables of the next state. Such as ctr’ = ctr + 1. That makes it crystal clear when state is being changed.

In TLA+, you can only assign a value to primed variable once in a single step. That is, within one action, you can’t do:

ctr’ = ctr + 1
ctr’ = ctr + 1

If you need to be able to increment the counter multiple times in a single step, then you need to operate on an intermediate value, then assign the final value to the primed variable. LET can be useful for doing that, for example, to increment it 3 times:

LET ctr1 == ctr + 1
    ctr2 == ctr1 + 1
IN ctr’ = ctr2 + 1

Fizzbee has no such limitation, which can really simplify things. You could simply write:

role Counter:
    action Init:
        self.ctr = 0

    atomic fair action IncTwice:
        self.ctr = self.ctr + 1
        self.ctr = self.ctr + 1

However, it’s a double edged sword. Sometimes you can change state without realizing it.

In Fizzbee, you pass data around by value or by reference like you do in Python. It is possible for one actor (A) to get a reference to data of another actor (B), and for A to inadvertently change the state of B. This happened to me when working on the Paimon or Iceberg spec, I don’t remember which. A client role obtained an object from the object storage role, then the client modified it, ready to write it back as a new object, but it changed the original object in object storage as they both had a reference to the same object! JP provided me with the deepcopy function to allow the object storage role to return a copy of the data, rather than a reference.

role ObjectStore:
    atomic func Read(file_id):
        if file_id in self.objects:
            obj = self.objects[file_id]
            obj_copy = deepcopy(obj)
            return obj_copy
        else:
            return None

Passing by value and by reference can also trip you up within a role, if you are used to TLA+ semantics of where state change is explicit. For example, the following role and invariant:

invariants:
  always c.ctr <= 3  # This statement is always true
  always len(c.arr) <= 3

action Init:
    c = Counter()

role Counter:
    action Init:
        self.ctr = 0
        self.arr = []

    atomic func Inc(ctr, val):
        ctr = ctr + val

    atomic fair action IncTwice:
        self.Inc(self.ctr, 1)
        self.Inc(self.ctr, 1)

    atomic func Append(ar, val):
        ar.append(val)

    atomic fair action AppendTwice:
        self.Append(self.arr, 1)
        self.Append(self.arr, 1)

Running this in the playground will result in:

FAILED: Model checker failed. Invariant:  
------
Init
--
state: {"c":"role Counter#0 (fields(arr = [], ctr = 0))"}
------
Counter#0.AppendTwice
--
state: {"c":"role Counter#0 (fields(arr = [1, 1], ctr = 0))"}
------
Counter#0.AppendTwice
--
state: {"c":"role Counter#0 (fields(arr = [1, 1, 1, 1], ctr = 0))"}
------

See how the ‘ctr’ never got incremented, as it was pass-by-value, but the list did get appended to, because of pass-by-reference.

It’s not a massive criticism of Fizzbee, only that one needs to be careful about how one mutates state. Just like in the real world of programming, know when you are passing by value and by reference. Make sure that data exchanged by roles doesn’t get mutated by the receiver, if it does, get a deepcopy.

Running the spec

You can run Fizzbee in the playground, but I run it locally, either on my work MacOS or a Linux VM on my Windows workstation.

To run it yourself, you’ll need to install Bazel and clone the Fizzbee repo, then run the following command from the Fizzbee repo directory:

bazel build //…

You run a spec via the fizz command, such as:

./fizz path/to/spec.fizz

Before running a spec there are some additional differences to TLA+ that are worth covering:

  • Unlike TLC (TLA+ model checker), Fizzbee does not spill state to disk when it exhausts memory. Therefore, for non-trivial specs you’ll either need lots of memory, or enough swap available for growth in virtual memory. When I set up a Linux VM to run in VMWare Workstation, by default it allocates 2 GB of swap file. I tend to increase this to at least 2-30 GB, to give it some margin before being killed by the OOM Killer. Relying on larger swap is not very practical as it slows down Fizzbee significantly (plus it places some wear on the SSD).

  • Unlike TLC, Fizzbee is single-threaded at the time of writing. It may therefore struggle to brute force check large state spaces. This is one large gap.

  • Liveness checking is a little different in Fizzbee. It is similar, but different enough I will tackle that in a separate post. Today, just typical safety checking.

  • Symmetry is supported by Fizzbee. See Symmetry Reduction. I turn on symmetry by either making the roles symmetric or not. If you want to learn more about symmetry I just wrote a blog post about it: An Introduction to Symmetry in TLA+. I made a separate version of the Fizzbee spec to test out symmetry.

  • Fizzbee doesn’t have Views. For this spec, I don’t need that capability as I don’t need the processed messages set at all like I do with TLA+.

I ran some benchmarks of Fizzbee vs TLC, also checking the number of unique states. The number of unique states should match at this point, given that they are essentially identical.

Given 16 CPU threads and 64 GB of RAM, with a 1 TB Samsung 970 Pro, let’s see how they compared.

Configuration TLC (no symmetry) TLC (symmetry) Fizzbee (no symmetry) Fizzbee (symmetry)
Parts=1, TIDs=1, Clients=1, Brokers=1 States=5 / Time=4s States=5 / Time=4s States=5 / Time=1ms States=5 / Time=1ms
Parts=1, TIDs=1, Clients=2, Brokers=2 States=316 / Time=7s States= 163 / Time=7s States=316 / Time=1s States=162 / Time=1s
Parts=1, TIDs=1, Clients=3, Brokers=3 States=15,498 / Time=8s States=1919 / Time=8s States=15,498 / Time=1m 4s States=1802 / Time=39s
Parts=1, TIDs=1, Clients=4, Brokers=4 States=1,023,928 / Time=23s States= 18,707 / Time=10s See note *1 See note *2
Parts=2, TIDs=2, Clients=4, Brokers=4 States= 17,002,489 / Time=2m 27s States= 150,557 / Time=1m 12s - -
Parts=2, TIDs=2, Clients=4, Brokers=4 States= 77,440,456 / Time=10m 35s States= 713,859 / Time=16m - -
  • Note *1: I killed the process after 5 hours as it had used up almost all the remaining swap (180 GB).

  • Note *2: It took 41 minutes to do its upfront symmetry computation, then afterwards it was very slow so I killed it.

Some points to note:

  • On the small model parameters, without symmetry, both models had identical unique states which is expected.

  • There is a slight difference once symmetry is added and I really don’t have time to dig into that difference right now, but at least they are pretty close.

  • Fizzbee struggled to compute the permutations for symmetry based on 4 clients and 4 brokers (taking 41 minutes). TLC also struggles with this in general, but with slightly larger sets (see my earlier blog post An Introduction to Symmetry in TLA+). TLC managed the permutations of 4 clients and 4 brokers quite easily in this case.

  • Fizzbee basically couldn’t handle 4 clients and 4 brokers with or without symmetry.

Fizzbee struggles with larger state spaces for a number of reasons:

  • The model checker is still single-threaded.

  • It doesn’t spill state to disk when it runs out of memory (you must ensure there is enough swap, and swap is slow).

  • It needs smaller sets when using symmetry.

But all is not lost! We can also fallback to simulation mode, which both TLC and Fizzbee support.

The BFS approach of brute-force model checking exhaustively explores all possible states for full coverage, while simulation mode samples random paths through the state space, making it faster but less exhaustive. It parallelizes well as each worker thread independently explores random paths - i.e there is no shared stack variable for BFS that all workers pull from. Likewise, state fingerprints don’t need to be stored, reducing the storage demands. Finally, symmetry is not needed, so larger sets can be used for model parameters. The downside is that it is not exhaustive. However, I commonly use it on very large models that can’t be brute-forced, and it can still find bugs effectively, sometimes very quickly.

In Fizzbee’s case it is single threaded but we can compare TLC given 14 workers and taking 14000 simulations to completion, against 1 Fizzbee process and 1000 simulations. When I use Fizzbee in simulation mode, I tend to kick off as many Fizzbee processes as there are CPU threads (minus a couple for some headroom).

With the model parameters: Brokers=4, clients=4, TIDs=2, partitions=2, I saw the following running times:

  • TLC with 14 workers completed the 14000 simulations in 20 seconds.

  • Fizzbee completed the 1000 simulations in 4m 4 seconds (and we’ll assume that 14 processes would have completed 1000 simulations each in the same time).

This all correlates with my subjective experience running Fizzbee - it is slower than TLC. But this isn’t surprising as Fizzbee is still pretty new with a single developer behind it. Optimizing the model checker is on the roadmap. I’ll likely be coming back to this performance difference multiple times throughout this formal verification experience.

The final thing I will say about Fizzbee is that you will most likely choose it because you don’t like the syntax of TLA+. You may be more productive with it than TLA+ and are willing to throw some extra hardware at the model checking. That can be a valid trade off to make.

Final thoughts on the Fizzbee spec

While the Fizzbee model checker is lagging TLC, the main value of Fizzbee is being able to write a formal model in Python while retaining TLA+ semantics. I’ll let you be the judge based on the current TLA+ spec and Fizzbee spec.

I still have the subject of liveness to cover with Fizzbee. Writing these diary entries has slowed me down a bit, so to make some progress I’m going to be heads down working on the specs. I’ll report back next week on the latest developments and that post on liveness.

Share