
Kafka vs Redpanda Performance - Part 6 - Draining backlogs

In the last post we saw how only Apache Kafka was able to fully utilize the 2 GB/s throughput limit of the i3en.6xlarge. In this post we’re going to test the ability of Kafka and Redpanda to drain a backlog while under continued producer load.

This test starts the producers and consumers at the target throughput, then pauses the consumers until consumer lag (the backlog) has built up to a desired size. The consumers are then resumed and we measure how long it takes for them to catch up and return to sub-second end-to-end latency.
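In Kafka consumer terms, the pause-then-drain mechanic looks roughly like the sketch below. This is only a minimal illustration of the idea, not the benchmark code itself; the bootstrap address, topic name, pause duration and "caught up" check are placeholder assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class BacklogDrainSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "backlog-test");            // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("benchmark-topic")); // placeholder topic
            while (consumer.assignment().isEmpty()) {
                consumer.poll(Duration.ofMillis(200)); // join the group and wait for partition assignment
            }

            // Pause all assigned partitions so lag (the backlog) builds up while producers keep writing.
            consumer.pause(consumer.assignment());
            long pauseUntil = System.currentTimeMillis() + 5 * 60 * 1000; // e.g. 5 minutes of backlog
            while (System.currentTimeMillis() < pauseUntil) {
                consumer.poll(Duration.ofSeconds(1)); // keeps group membership alive; returns no records while paused
            }

            // Resume and measure how long the backlog takes to drain.
            consumer.resume(consumer.assignment());
            long resumedAt = System.currentTimeMillis();
            long drained = 0;
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                drained += records.count();
                if (records.isEmpty() && drained > 0) {
                    break; // crude proxy for "caught up"; the real test uses end-to-end latency instead
                }
            }
            System.out.printf("Drained %d records in %d ms%n", drained, System.currentTimeMillis() - resumedAt);
        }
    }
}
```

In the actual test, catch-up is judged by end-to-end latency returning to sub-second levels while the producers continue at the target rate, rather than by the consumer simply running out of records.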

Kafka vs Redpanda Performance - Part 5 - Reaching the limits of the NVMe drive

In the previous post we saw how using record keys impacted both Apache Kafka and Redpanda. However, Redpanda struggled more than Kafka with the more numerous, smaller batches that result from key-based partitioning.

Next I decided to see if I could push Apache Kafka and Redpanda to the absolute limit of the NVMe drive throughput on the i3en.6xlarge: 2 GB/s. To do this I deployed both systems without TLS and modified the Redpanda 1 GB/s benchmark to attempt 2 GB/s, with 10 producers/consumers and acks=1 instead of acks=all.
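For context, the acks change is a single producer setting. Below is a minimal sketch of the relevant producer configuration, assuming a plaintext (non-TLS) listener; the broker address, topic, payload size and batching values are placeholders rather than the benchmark's exact settings.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class Acks1ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder plaintext listener, no TLS
        props.put("acks", "1");                         // leader-only acknowledgement instead of acks=all
        props.put("linger.ms", "1");                    // placeholder batching settings
        props.put("batch.size", "131072");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // With acks=1 the broker responds as soon as the partition leader has written
            // the record, without waiting for follower replication.
            producer.send(new ProducerRecord<>("benchmark-topic", new byte[1024])); // placeholder topic/payload
        }
    }
}
```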

Kafka vs Redpanda Performance - Part 4 - Impact of record keys

In the last post we saw that Redpanda latency can jump sharply once data retention limits kick in. In this post we’re going to look at the impact of using record keys.

When we don’t use record keys, the producer’s default partitioner accumulates messages into batches on a first-come-first-served basis and picks a partition at random to send each batch to. The newer Uniform Sticky Partitioner can also probabilistically favour less loaded partitions. This is good for performance: even with a short linger.ms, the producer can create large batches. Conversely, sending more numerous, smaller batches can hurt performance.
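To make the difference concrete, here is a minimal producer sketch contrasting unkeyed and keyed sends. It is illustrative only; the broker address, topic, key and batch settings are placeholder assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class KeyedVsUnkeyedSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("linger.ms", "5");                      // a short linger still yields large batches without keys
        props.put("batch.size", "65536");                 // placeholder batch size
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = new byte[1024];

            // No key: the default (sticky) partitioner fills one batch at a time for a single
            // partition, so batches stay large even with a short linger.ms.
            producer.send(new ProducerRecord<>("benchmark-topic", null, payload));

            // With a key: the partition is chosen by hashing the key, so records spread across
            // all partitions and each partition receives smaller, more numerous batches.
            byte[] key = "customer-42".getBytes(StandardCharsets.UTF_8); // placeholder key
            producer.send(new ProducerRecord<>("benchmark-topic", key, payload));
        }
    }
}
```

The reason keys are used at all is that every record with the same key lands on the same partition, preserving per-key ordering; the cost is the smaller, more numerous batches discussed above.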

Kafka vs Redpanda Performance - Part 3 - Hitting the retention limit

In the last post we saw that Redpanda performance degraded over time and that we needed to include a certain amount of drive over-provisioning to cope with the random IO nature of Redpanda. In this post we’re going to look at a phenomenon I see in every high throughput test I run which reaches the retention limit.

Kafka vs Redpanda Performance - Part 1 - 4 vs 50 producers

The Redpanda benchmark and TCO analysis claim that Redpanda needs only 3 brokers on the i3en.6xlarge to reach 1 GB/s while Apache Kafka needs 9, and that even then Kafka shows inferior end-to-end latency to Redpanda. I decided to see if I could reproduce those claims and whether the Redpanda performance was generalizable. That is, does 1 GB/s with 4 producers and consumers translate to other workloads? Is it a benchmark you could base decisions on?

I ran the Redpanda 1 GB/s benchmark against Redpanda and Apache Kafka on identical hardware at 6 different throughputs: 500, 600, 700, 800, 900 and 1000 MB/s. I also ran it with the original 4 producers and consumers, then with 50 producers and consumers. The result was significant performance degradation for Redpanda with 50 producers. The other noteworthy result was that Redpanda was unable to reach 1000 MB/s with TLS, which conflicts with the Redpanda benchmarks.