Testing Producer Deduplication in Apache Kafka and Apache Pulsar

Testing Producer Deduplication in Apache Kafka and Apache Pulsar

Failures can induce message duplication on both the producer and consumer side. In this post we’ll focus solely on producer side duplication, looking at how the deduplication feature works in Apache Pulsar and Apache Kafka. I have run many hours of deduplication tests of both messaging systems and we´ll see the results of those tests.

On the producer side, when a producer sends a message and an error occurs, such as a TCP connection failure, the producer has no way to know if the message was persisted or not. We have two choices, send the message again to ensure it gets delivered and risk duplication, or not send it again and risk the message never getting delivered.

How to Lose Messages on a Kafka Cluster - Part 1

In my previous post I used Blockade, Python and some Bash scripts to test a RabbitMQ cluster under various failure conditions such as failed nodes, network partitions, packet loss and a slow network. The aim was to find out how and when a RabbitMQ cluster loses messages. In this post we’ll do exactly the same but with a Kafka cluster. We’ll use our knowledge of the inside workings of Kafka and Zookeeper to produce various failure modes that produce message loss. Please read my post on Kafka fault tolerance as this post assumes you understand the basics of the acknowledgements and replication protocol.

RabbitMQ vs Kafka Part 6 - Fault Tolerance and High Availability with Kafka

In the last post we took a look at the RabbitMQ clustering feature for fault tolerance and high availability. In this post we'll dig deep into Apache Kafka and its offering.

With Kafka the unit of replication is the partition. Each topic has one or more partitions and each partition has a leader and zero or more followers. When you create a topic you specify the number of partitions and the replication factor. A replication factor of three is common, this equates to one leader and two followers. Both leaders and followers can be referred to as replicas.

Event-Driven Architectures - Queue vs Log - A Case Study

In the previous post we looked at relative event ordering and the decoupling of publishers and consumers among other things. In this post we'll take those concepts and look at an example architecture. We'll look at the various modelling possibilities we have with RabbitMQ representing a queue based system, and Kafka representing a log based system.

SQL Server CDC to Redshift Pipeline

In this post we'll take a look at what Change Data Capture (CDC) is and how we can use it to get data from SQL Server into Redshift in either a near real-time streaming fashion or more of a batched approach.

CDC is a SQL Server Enterprise feature and so not available to everyone. Also there are vendors that sell automated change data capture extraction and load into Redshift, such as Attunity and that may be your best option. But if you can't or don't want to pay for another tool on top of your SQL Server Enterprise license then this post may help you.

API Series Part 7 - Inter-Service Communication Overview

In this post we'll explore our options regarding how different services could communicate with each other. There is no code in this post, we'll just explore some different architectures and the trade-offs included with each one. In the end you can make up your own mind about your own specific domain. I have my preference for Govrnanza and in the next part we'll implement that strategy and get into some code.

RabbitMQ vs Kafka Part 4 - Message Delivery Semantics and Guarantees

Both RabbitMQ and Kafka offer durable messaging guarantees. Both offer at-most-once and at-least-once guarantees but kafka offers exactly-once guarantees in a very limited scenario.

Let's first understand what these guarantees mean:

  • At-most-once delivery. This means that a message will never be delivered more than once but messages might be lost.
  • At-least-once delivery. This means that we'll never lose a message but a message might end up being delivered to a consumer more than once.
  • Exactly-once delivery. The holy grail of messaging. All messages will be delivered exactly one time.

Delivery is probably the wrong word for the above terms, instead Processing might be a better way of putting it. After all what we care about is whether a consumer can process a message and whether that is at-most-once, at-least-once or exactly-once. But using the word processing complicates things, exactly-once delivery makes less sense now as perhaps we need it to be delivered twice in order to be able to successfully process it once. If the consumer dies during processing, then we need that the message be delivered a second time for a new consumer.

RabbitMQ vs Kafka Part 3 - Kafka Messaging Patterns

In Part 2 we covered the patterns and topologies that RabbitMQ enables. In this part we'll look at Kafka and contrast it against RabbitMQ to get some perspective on their differences. Remember that this comparison is within the context of an event-driven application architecture rather than data processing pipelines, although the line between them can be a bit grey. Perhaps it is more like a continuum and this comparison focuses on the event-driven applications end of that continuum.

RabbitMQ vs Kafka Part 2 - RabbitMQ Messaging Patterns

In this part we're going to forget about the low level details in the protocols and concentrate on the higher level patterns and message topologies that can be achieved in RabbitMQ. In Part 3 of the series we'll do the same for Apache Kafka.

First we'll cover the building blocks, or routing primitives, of RabbitMQ:

  • Exchange types and bindings
  • Queues
  • Dead letter exchanges
  • Ephemeral exchanges and queues
  • Alternate Exchanges
  • Priortity Queues

Then we'll combine them all into a set of example patterns.