Kafka KIP-966 - Fixing the Last Replica Standing issue

Kafka KIP-966 - Fixing the Last Replica Standing issue

The Kafka replication protocol just got a new KIP that improves its durability when running without fsync. As I previously blogged, Why Kafka Doesn’t Need Fsync to be Safe, there are distributed system designs that exist which allow for asynchronous storage engines. Being asynchronous means that the system can reap performance benefits which are not available to a synchronous storage engine.

Kafka vs Redpanda Performance - Do the claims add up?

Apache Kafka has been the most popular open source event streaming system for many years and it continues to grow in popularity. Within the wider ecosystem there are other open source and source available competitors to Kafka such as Apache Pulsar, NATS Streaming, Redis Streams, RabbitMQ and more recently Redpanda (among others).

Redpanda is a source available Kafka clone written in C++ using the Seastar framework from ScyllaDB, a wide-column database. It uses the popular Raft consensus protocol for replication and all distributed consensus logic. Redpanda has been going to great lengths to explain that its performance is superior to Apache Kafka due to its thread-per-core architecture, use of C++, and its storage design that can push high performance NVMe drives to their limits.

They list a bold set of claims and those claims seem plausible. Built in C++ for modern hardware with a thread-per-core architecture sounds compelling and it seems logical that the claims must be true. But are they?

Is sequential IO dead in the era of the NVMe drive?

Is sequential IO dead in the era of the NVMe drive?

Two systems I know pretty well, Apache BookKeeper and Apache Kafka, were designed in the era of the spinning disk, the hard-drive or HDD. Hard-drives are good at sequential IO but not so good at random IO because of the relatively high seek times. No wonder then that both Kafka and BookKeeper were designed with sequential IO in mind.

Both Kafka and BookKeeper are distributed log systems and so you’d think that sequential IO would be the default for an append-only log storage system. But sequential and random IO sit on a continuum, with pure sequential on one side and pure random IO on the other. If you have 5000 files which you are appending to in small writes in a round-robin manner, and performing fsyncs, then this is not such a sequential IO access pattern, it sits further to the random IO side. So just by being an append-only log doesn’t mean you get sequential IO out of the gate.

Why Apache Kafka doesn't need fsync to be safe

Why Apache Kafka doesn't need fsync to be safe

TLDR: Apache Kafka doesn’t need fsyncs to be safe because it includes recovery in its replication protocol. It is a real-world distributed system that uses asynchronous log writing + recovery with some additional extra safety built-in. Asynchronous log writing allows it to provide robust performance on a variety of hardware and with a wide variety of workloads.

Now that the TLDR is done, let’s dive into it.

The fact that by default Apache Kafka doesn’t flush writes to disk is sometimes used as ammunition against it. The argument is that if Kafka doesn’t flush data before acknowledging produce requests then surely the cluster can lose acknowledged data due to crashes and reboots. It sounds plausible and so people may believe it - but I’m here writing this today to explain why that isn’t the case.

Applying Flexible Paxos to Raft

Applying Flexible Paxos to Raft

Flexible Paxos provides us the insight that Paxos (and Raft) only need that election and replication quorums intersect. But standard Raft and Paxos are configured so that every quorum intersects. So what does that mean exactly?

Let’s take the election quorum and Raft. An election quorum is a subset of the set of servers that have voted for the same server in the same election term and that quorum is formed of a majority. For a 3 node cluster we need 2 votes and a 5 node cluster we need 3 votes and so on.

The next question is: what are all the possible quorums that exist and are there any two quorums that do not intersect? The possible majority quorums are {n1, n2}, {n2, n3} and {n1, n3} and there are no two quorums that do not intersect. This is the property we get from majority quorums.

Write for others but mostly for yourself

Write for others but mostly for yourself

I started my blog originally to help me get to the next level in my career and help establish myself as an authority in the areas of tech that I was focusing on. I liked writing and thought I had something to say.

Looking back at my 6 years of blogging now it’s hard to recognise myself from the engineer I was back then before writing was a regular habit for me. It’s funny because in the end my blog was the key to unlock the next door in my career but not necessarily for the reasons I expected. I figured if I could write some interesting posts I could turn up to an interview and use it as a kind of portfolio, but it became so much more than that.

Tweaking the BookKeeper protocol - Unbounded Ledgers

In the last post I described the necessary protocol changes to ensure that all entries in closed ledgers reached Write Quorum (WQ) and all entries in all but the last fragment in open ledgers reach write quorum.

In this post we’re going to look at another tweak to the protocol to allow ledgers to be unbounded and allow writes from multiple clients over their lifetime.

Tweaking the BookKeeper protocol - Guaranteeing write quorum

Introduction

Recently I wrote a blog post on my team blog about the differences between Raft and the Apache BookKeeper replication protocol. In it I covered one difference that surprises people which is that a ledger can have multiple blocks of entries that only ever reach Ack Quorum and not Write Quorum due to how ensemble changes work. A Raft log on the other hand has the property that the replication factor (RF) reached by any given entry matches the following:

Prefix RF >= Entry RF >= Suffix RF

That is to say, if a given entry has reached RF of 3, then the entire log prefix must also be at 3 or above (depending on the desired RF configured). But with BookKeeper that is not the case. For example, with WQ=3/AQ=2, a given entry that has reached RF of 3 may have entries before it that only reached RF of 2

Learn about TLA+ and the formal verification of Apache BookKeeper

At the time of writing I work at Splunk in the messaging-as-a-service team (we offer Apache Pulsar as in internal Splunk service). In late 2020, early 2021 I decided to formally verify the Apache BookKeeper protocol in TLA+. My main objective was to simply learn the protocol by reverse engineering the code into a specification and that worked extremely well. I also found a protocol bug and an implementation bug as a result which was an added bonus.

Using TLA+ to learn how an existing system works is an amazingly effective learning method. Yes you can read code and docs and you might end up with a hand-wavy level of clarity. But modelling a system in something like TLA+ leaves no room for ambiguity. So I highly recommend it.

You can read about it on the Splunk messaging-as-a-service team blog https://medium.com/splunk-maas:

You can see the current state of the project in my GitHub repo: https://github.com/Vanlightly/bookkeeper-tlaplus.

At the time of writing the Splunk messaging-as-a-service is hiring software engineers, so do contact me if you are interested in working on Apache Pulsar, Apache BookKeeper and all the tooling required to run these systems as a service.