How to (not) Lose Messages on an Apache Pulsar Cluster

In this post we’ll put the protocols we covered in the Understanding How Apache Pulsar Works post to the test. As in previous tests of How to Lose Messages on a RabbitMQ Cluster and How to Lose Messages on a Apache Kafka Cluster, I’ll be using Blockade to kill off nodes, slow down the network and lose packets. Unlike in those previous tests, these tests are automated and go further, not only testing for data loss but also correct ordering and duplication.

In each scenario we’ll stand-up a new blockade cluster with a specific configuration of:

  • Apache Pulsar broker count

  • Apache BookKeeper node (Bookie) count

  • Ensemble size (E)

  • Write quorum size (Qw)

  • Ack quorum size (Qa)

Understanding How Apache Pulsar Works

I will be writing a series of blog posts about Apache Pulsar, including some Kafka vs Pulsar posts. First up though I will be running some chaos tests on a Pulsar cluster like I have done with RabbitMQ and Kafka to see what failure modes it has and its message loss scenarios.

I will try to do this by either exploiting design defects, implementation bugs or poor configuration on the part of the admin or developer.

In this post we’ll go through the Apache Pulsar design so that we can better design the failure scenarios. This post is not for people who want to understand how to use Apache Pulsar but who want to understand how it works. I have struggled to write a clear overview of its architecture in a way that is simple and easy to understand. I appreciate any feedback on this write-up.

How to Lose Messages on a Kafka Cluster - Part 1

In my previous post I used Blockade, Python and some Bash scripts to test a RabbitMQ cluster under various failure conditions such as failed nodes, network partitions, packet loss and a slow network. The aim was to find out how and when a RabbitMQ cluster loses messages. In this post we’ll do exactly the same but with a Kafka cluster. We’ll use our knowledge of the inside workings of Kafka and Zookeeper to produce various failure modes that produce message loss. Please read my post on Kafka fault tolerance as this post assumes you understand the basics of the acknowledgements and replication protocol.

How to Lose Messages on a RabbitMQ Cluster

In my RabbitMQ vs Kafka series Part 5 post I covered the theory of RabbitMQ clustering and some of the gotchas. In this post we'll demonstrate the message loss scenarios described in that post using Docker and Blockade. I recommend you read that post first as this post assumes understanding of the topics covered.

Blockade is a really easy way to test out how distributed systems cope with network partitions, flaky networks and slow networks. It was inspired by the Jepson series. In this post we'll either be killing off nodes, partitioning the cluster, introducing packet loss or slowing down the network. So with Blockade, some bash and python scripts we’ll test out some failure scenarios.

RabbitMQ vs Kafka Part 6 - Fault Tolerance and High Availability with Kafka

In the last post we took a look at the RabbitMQ clustering feature for fault tolerance and high availability. In this post we'll dig deep into Apache Kafka and its offering.

With Kafka the unit of replication is the partition. Each topic has one or more partitions and each partition has a leader and zero or more followers. When you create a topic you specify the number of partitions and the replication factor. A replication factor of three is common, this equates to one leader and two followers. Both leaders and followers can be referred to as replicas.

RabbitMQ vs Kafka Part 5 - Fault Tolerance and High Availability with RabbitMQ Clustering

Fault tolerance and High Availability are big subjects and so we'll tackle RabbitMQ and Kafka in separate posts. In this post we'll look at RabbitMQ and in Part 6 we'll look at Kafka while making comparisons to RabbitMQ. This is a long post, even though we only look at RabbitMQ, so get comfortable.

In this post we'll look at the strategies for fault tolerance, consistency and high availability (HA) and the trade-offs each strategy makes. RabbitMQ can operate as a cluster of nodes and as such can be classed as a distributed system. When it comes to distributed data systems we often speak about consistency and availability.

We talk about consistency and availability with distributed systems because they describe how the system behaves under failure. A network link fails, a server fails, a hard disk fails, a server is temporarily unavailable due to GC or a network link is lossy or slow. All these things can cause outages, data loss or data conflicts. It turns out that it is generally not possible to provide a system that is ultimately consistent (no data loss, no data divergence) and available (will accept reads and writes) under all failure modes.

We'll see that consistency and availability are at two ends of a spectrum and you'll need to choose which of those you'll optimize for. The good news is that with RabbitMQ this is a choice that you can make. It gives you the nerd knobs required to tune it for greater consistency or greater availability.

In this post we'll be paying close attention to what configurations produce data loss of acknowledged writes. There is a chain of responsibility between producers, brokers and consumers. Once a message has been handed off to a broker, it is the broker's job not to lose that message. When the broker acknowledges receipt of a message to the publisher, we don't expect that message to be lost. But we'll see that this indeed can happen depending on your broker and publisher configuration.

RabbitMQ Work Queues: Avoiding Data Inconsistency with Rebalanser

With RabbitMQ we can scale-out our consumers by simply adding more, but we can also scale-out our queues. There are a few reasons why scaling out our queues might be preferential to simply adding more consumers to a single queue (competing consumers), one of those reasons is when using the work queue pattern.