Verifying Kafka transactions - Diary entry 3 - Getting confidence in the TLA+ spec — Jack Vanlightly

Verifying Kafka transactions - Diary entry 3 - Getting confidence in the TLA+ spec

One thing I forgot to write about yesterday was how I get confidence that the TLA+ specification is doing what I expect. The thing about model checking a spec is that because there are usually millions or billions of states to explore, you can’t always see that certain states that should be reachable do get reached. It’s common to have a specification bug that causes your spec to have valid areas of the state space be unreachable, without you knowing it. This is why TLA+ practitioners learn to be a skeptical bunch and treat early success with great cynicism.

This is what I did to get confidence in this initial specification.

Measure 1 - Use a liveness property

As I explained in the last diary entry, the following liveness property showed me that eventually, all clients obtain a PID.

EventuallyAllClientsGetPid ==
    <>[](\A c \in Clients : client[c].state = HasPid)

Liveness issues are really useful for gaining confidence that something that you expect to eventually happen, does happen.

Measure 2 - Use a false invariant to ensure ConcurrentTransactions gets triggered

If client c1 a valid InitPidRequest to the right broker, the TC will write an entry to the transaction log. If client c2 then sends an InitPidRequest to the same broker, it should reject it with a ConcurrentTransactions error.

I tested this with my TestInv invariant set as follows:

TestInv ==
    ~\E msg \in messages :
        /\ msg.type = InitPidResponse
        /\ msg.code = ConcurrentTransactions

The above says: there does not exist an InitPidResponse with code ConcurrentTransactions.

This requires two clients and one broker as the smallest model parameters. It resulted in TLC finding a violation with the following trace:

  1. Initial state

  2. SendInitPidRequest(c2, b1)

  3. SendInitPidRequest(c1, b1)

  4. ReceiveInitPidRequest(b1, c1)

  5. ReceiveInitPidRequest(b1, c2)

Measure 3 - Use a false invariant to ensure NotCoordinator gets triggered

I changed TestInv to be:

TestInv ==
    ~\E msg \in messages :
        /\ msg.type = InitPidResponse
        /\ msg.code = NotCoordinatorForTransactionalId

This requires one client and two brokers, one partition and one TID as the smallest model parameters. It resulted in TLC finding a violation with the following trace:

  1. Initial state

  2. SendInitPidRequest(c1, b1)

  3. ReceiveInitPidRequest(b1, c1)

If we inspect the state of step 3 we see that the TID was mapped to p1 (t_to_p_mapping) and p1 was owned by b2 (tc_part_metadata). Thus b1 rejected the request.

With those three measures, I was pretty confident the spec was correct.

A change to Init

In the initial state, the mappings of TIDs to transaction log partitions and those partitions to brokers used existential quantifiers (\E or there exists). This caused TLC to explore every possible combination of starting mappings. This inflated the state space a lot as the model parameters got larger. I decided to restrict the initial state to a single combination, where both TID-to-partition and partition-to-broker mappings are evenly distributed. To do this, I replaced \E with CHOOSE.

If the set of mappings has two members, then using an existential quantifier will cause the model checker to explore the state space based on choosing one and then on choosing the other, that is, it will explore two alternate paths (two parallel universes where both choices were made). CHOOSE basically makes a single choice out of many. This often makes CHOOSE a poor choice as it will prevent full state exploration. In this case, I’m happy to reduce state space like this, and later I may put the existential quantifiers back.

Original with existential quantifiers (\E). You can think [TransactionIds -> TxnLogPartitions] as: return a set containing every combination of TID -> partition. The existential quantifier expresses that we want to explore every single one of those combinations.

\E tid_to_part_mapping \in [TransactionIds -> TxnLogPartitions] : 
    \E b_partitions \in [Brokers -> SUBSET TxnLogPartitions] :
        ... some constraints here

New version with CHOOSE. I placed the constraints in separate formulae, with additional constraints that the mappings be balanced. I won’t cover that here but you can see it in the spec and try and understand it.

LET tid_to_part_mapping == CHOOSE mapping \in [TransactionIds -> TxnLogPartitions] :
                                            BalancedTidToPartSpread(mapping) 
    b_partitions        == CHOOSE mapping \in [Brokers -> SUBSET TxnLogPartitions] :
                                            BalancedBrokerToPartLeadership(mapping)
IN … rest of init

This has further reduced the state space.


Next up, the Fizzbee equivalent.

Share