Apache Pulsar

Testing Producer Deduplication in Apache Kafka and Apache Pulsar

Failures can induce message duplication on both the producer and consumer side. In this post we’ll focus solely on producer-side duplication, looking at how the deduplication feature works in Apache Pulsar and Apache Kafka. I have run many hours of deduplication tests against both messaging systems, and we’ll see the results of those tests.

On the producer side, when a producer sends a message and an error occurs, such as a TCP connection failure, the producer has no way to know whether the message was persisted. We have two choices: send the message again to ensure it gets delivered and risk duplication, or not send it again and risk the message never being delivered.
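To make the dilemma concrete, here is a minimal sketch of how a broker can make retries safe by remembering the highest sequence number it has persisted per producer. It is a toy in-memory model, not the Pulsar or Kafka implementation; the names `Broker`, `publish` and `send_with_retry` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Toy in-memory broker (hypothetical, for illustration only)."""
    dedup: bool = True
    log: list = field(default_factory=list)
    last_seq: dict = field(default_factory=dict)  # producer name -> highest persisted sequence

    def publish(self, producer: str, seq: int, payload: str) -> str:
        # With deduplication on, drop anything at or below the last persisted sequence.
        if self.dedup and seq <= self.last_seq.get(producer, -1):
            return "duplicate"
        self.log.append(payload)
        self.last_seq[producer] = seq
        return "persisted"


def send_with_retry(broker: Broker, producer: str, seq: int, payload: str) -> None:
    """Simulate a send whose acknowledgement is lost, followed by one retry."""
    broker.publish(producer, seq, payload)  # persisted, but the ack never arrives
    broker.publish(producer, seq, payload)  # the producer cannot tell, so it resends


with_dedup = Broker(dedup=True)
send_with_retry(with_dedup, "producer-1", 0, "order-created")

without_dedup = Broker(dedup=False)
send_with_retry(without_dedup, "producer-1", 0, "order-created")

print(len(with_dedup.log), len(without_dedup.log))
```

Without deduplication the retry writes the message twice; with it, the broker recognises the repeated sequence number and keeps a single copy.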

How to (not) Lose Messages on an Apache Pulsar Cluster

In this post we’ll put the protocols we covered in the Understanding How Apache Pulsar Works post to the test. As in previous tests of How to Lose Messages on a RabbitMQ Cluster and How to Lose Messages on an Apache Kafka Cluster, I’ll be using Blockade to kill off nodes, slow down the network and lose packets. Unlike those previous tests, these tests are automated and go further, testing not only for data loss but also for correct ordering and duplication.

In each scenario we’ll stand up a new Blockade cluster with a specific configuration of:

  • Apache Pulsar broker count

  • Apache BookKeeper node (Bookie) count

  • Ensemble size (E)

  • Write quorum size (Qw)

  • Ack quorum size (Qa)
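As a rough sketch, a scenario configuration like the one listed above could be captured and sanity-checked as follows. The `ScenarioConfig` class and its field names are hypothetical and not taken from the test code; the check assumes the usual BookKeeper relationship of E ≥ Qw ≥ Qa, with at least E bookies available to form an ensemble.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioConfig:
    """Hypothetical container for one test scenario's cluster settings."""
    brokers: int
    bookies: int
    ensemble: int      # E
    write_quorum: int  # Qw
    ack_quorum: int    # Qa

    def validate(self) -> None:
        if self.brokers < 1:
            raise ValueError("at least one broker is required")
        # Assumed constraint: bookies >= E >= Qw >= Qa >= 1
        if not (self.bookies >= self.ensemble >= self.write_quorum >= self.ack_quorum >= 1):
            raise ValueError("expected bookies >= E >= Qw >= Qa >= 1")


config = ScenarioConfig(brokers=3, bookies=3, ensemble=3, write_quorum=3, ack_quorum=2)
config.validate()  # passes silently for a consistent configuration
```

Validating up front keeps a mistyped quorum size from being mistaken for a message loss result later in the run.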

Understanding How Apache Pulsar Works

I will be writing a series of blog posts about Apache Pulsar, including some Kafka vs Pulsar posts. First up, though, I will be running some chaos tests on a Pulsar cluster, as I have done with RabbitMQ and Kafka, to see what failure modes it has and what its message loss scenarios are.

I will try to do this by exploiting design defects, implementation bugs, or poor configuration on the part of the admin or developer.

In this post we’ll go through the Apache Pulsar design so that we can better design the failure scenarios. This post is not for people who want to understand how to use Apache Pulsar, but for those who want to understand how it works. I have struggled to write an overview of its architecture that is clear, simple and easy to understand. I appreciate any feedback on this write-up.