Kafka vs Redpanda Performance - Part 4 - Impact of record keys

In the last post we saw that Redpanda latency can jump sharply once data retention limits kick in. In this post we’re going to look at the impact of using record keys.

When we don’t use record keys, the producer’s default partitioner accumulates messages into batches in a first-come-first-served manner and picks a partition at random to send each batch to. The newer Uniform Sticky Partitioner can also probabilistically favour less overloaded partitions. This is good for performance: even with a short linger.ms, the producer can build large batches. Conversely, sending more numerous, smaller batches hurts performance.

The problem with null record keys is that we don’t get message ordering. So it is more common to use record keys, which give us key-based ordering - that is, messages with the same key are delivered in order (a partial order). The price we pay is that each message is routed to a partition based on its key, so a producer’s traffic is spread across many partitions and its batches end up more numerous and smaller.
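
To make the contrast concrete, here is a minimal sketch using the standard Kafka Java client (this is not code from the benchmark); the broker address, topic name, key and batch settings are illustrative placeholders.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedVsUnkeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("linger.ms", "1");       // short linger (illustrative)
        props.put("batch.size", "131072"); // 128 KB batch ceiling (illustrative)

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the default (sticky) partitioner fills one partition's batch at a
            // time, so consecutive records share large batches.
            producer.send(new ProducerRecord<>("events", "some payload"));

            // With a key: the record is routed by hash(key) % numPartitions, so traffic
            // fans out across all partitions and each batch holds fewer records.
            producer.send(new ProducerRecord<>("events", "customer-42", "some payload"));
        }
    }
}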

What we’ll find is that when we use record keys, 4 producers can’t get close to 1 GB/s. In fact, 1 GB/s on this hardware ends up being a bit of a tall order. So I decided to see if I could get Kafka and Redpanda to 1 GB/s by using more producers.

Trying to get to 1 GB/s with up to 400 producers

I changed two things about the 1 GB/s benchmark. I added the config keyDistributor: "KEY_ROUND_ROBIN" to the workload file and changed the number of producers, testing between 50 and 400 producers.
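
Sketched out, the modified workload file looks something like this. The keyDistributor and producersPerTopic lines are the two changes; the remaining values are placeholders standing in for the 1 GB/s baseline rather than a copy of the exact file.

name: 1gb-sec-288-partitions-keyed

topics: 1
partitionsPerTopic: 288
messageSize: 1024                   # placeholder: ~1 KB messages
payloadFile: "payload/payload-1Kb.data"

subscriptionsPerTopic: 1            # placeholder consumer layout
consumerPerSubscription: 4

producersPerTopic: 400              # swept from 50 to 400 across runs
keyDistributor: "KEY_ROUND_ROBIN"   # added: assign record keys round-robin

producerRate: 1000000               # ~1 GB/s at 1 KB messages (aggregate rate)
testDurationMinutes: 60             # placeholder durations
warmupDurationMinutes: 5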

Fig 1. Neither Kafka nor Redpanda reaches 1 GB/s, but Kafka achieves about 200 MB/s more than Redpanda.

We see that neither could reach 1 GB/s; however, Kafka managed about 200 MB/s more than Redpanda. So I tried a lower target throughput of 500 MB/s.

Trying to get to 500 MB/s with up to 100 producers

Fig 2. Kafka manages the target 500 MB/s with 100 producers. Redpanda tops out at 330 MB/s.

This time Kafka reached the target throughput but Redpanda fell far short. 

Trying again with roughly a third of the partitions improved results for Redpanda (though they were still lower than Kafka’s).

Fig 3. Reducing the partition count to 100 helped Redpanda a lot and it was able to reach 500 MB/s with 100 producers. Kafka also benefited, reaching 500 MB/s with just 40 producers.

This may be due to the larger, less numerous batches that come with a lower partition count: when messages are routed by key, a producer’s traffic is divided across every partition, so the fewer partitions there are, the more records accumulate per partition and the larger each batch becomes. The larger the partition count, the larger the producer-side penalty. You might think this is simple to fix - just use fewer partitions - but that is not always possible because you are usually consumer-bound. That is, consumption is usually more intensive than production, so you scale your partitions to the needs of your consumers, not always your producers.

We see that by using record keys to obtain message ordering, Redpanda can take a big hit in performance, especially at higher throughputs and higher partition counts - it simply can’t push the same amount of data as Kafka.

Changing tack, I moved away from the Redpanda benchmark baseline and designed a workload that could be considered more typical of an average topology.

200 MB/s, 40 and 80 topic workloads

With 200 MB/s spread over 40 topics of 10 partitions each (400 partitions in total), and a second test with 80 topics (800 partitions), the message batches will be numerous and small.
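
Expressed as an OMB workload, the 40 topic variant looks roughly like the sketch below (the 80 topic run simply doubles the topic count). The rate assumes ~1 KB messages, and the producer and consumer counts are placeholders rather than the exact values I used.

name: 200mb-sec-40-topics-keyed

topics: 40                          # 80 for the 800 partition run
partitionsPerTopic: 10              # 400 partitions in total
messageSize: 1024                   # placeholder: ~1 KB messages

subscriptionsPerTopic: 1            # placeholder consumer layout
consumerPerSubscription: 1

producersPerTopic: 1                # placeholder producer count
keyDistributor: "KEY_ROUND_ROBIN"

producerRate: 200000                # ~200 MB/s at 1 KB messages (aggregate rate)
testDurationMinutes: 60             # placeholder durations
warmupDurationMinutes: 5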

Fig 4. Redpanda continues to fare poorly compared to Kafka when using record keys, even on topics with only 10 partitions.

The Redpanda 400 partition test showed lower end-to-end latencies than the Redpanda 800 partition test at the lower percentiles, then shot past it at the higher percentiles. To see why, we can look at the latency of a particular percentile over time. We should note that the OMB “latency over time” charts are based on 10 second windows and therefore show lower numbers than the final results, which are calculated over the whole test period.

If we look at p99, calculated at 10 second intervals, we see the cause - two large latency spikes. These were enough to push the tail latencies higher than those of the 800 partition test.

Fig 5. The Redpanda 400 partition test with two latency spikes.

Again, Redpanda just doesn’t seem to handle these workloads of numerous, small message batches as well as Kafka.

50 MB/s, 40 and 80 topic workloads

This trend continues even as we reduce the throughput further.

Fig 6. The trend continues. Kafka saw a latency spike in its 40 topic (400 partition) test which drove up its tail latency, bringing it to just under Redpanda’s. The 80 topic test saw much better results for Kafka.

In this test, the Kafka 400 partition run saw a higher tail latency than its 800 partition counterpart, though still lower than Redpanda’s. If we look at the p99.99 chart, we see that a lone latency spike just before the 2000 second mark was responsible for pushing the Kafka 400 partition tail latency above that of the 800 partition test.

Fig 7. p99.99 end-to-end latency over time. Kafka showed the lowest, most stable p99.99 latencies, with a single latency spike in the Kafka 400 partition test.

Conclusions

Despite running on three i3en.6xlarge instances, whose NVMe drives are capable of 2 GB/s, Redpanda really struggled with the key-based partitioning strategy, leaving Kafka to win on both throughput and latency. The smaller batches seemed to add a lot of overhead for Redpanda.

Key-based partitioning is very common, probably more common than using null record keys. For the 1 GB/s and 500 MB/s tests we simply added the config keyDistributor: "KEY_ROUND_ROBIN" to the workload file and changed the number of producers. 

I come back to the point that the Redpanda benchmark, based on 1 GB/s, 288 partitions, 4 producers, 4 consumers and null record keys, is not generalizable. So far I have shown that simply adding more producers or changing the partitioning strategy can completely undo the Redpanda results, with much stronger results for Kafka.

As always, benchmark your own workload; don’t rely on someone else’s benchmarks, including these. This blog post exists to bring a dose of reality to the Redpanda benchmarks and TCO claims. Of course, you can take Redpanda, Apache Pulsar, RabbitMQ Streams or NATS Streaming and find workloads where they beat Kafka. Each system has its own performance profile across different workloads and hardware, not to mention the host of tunables available. But Kafka is a powerhouse in the streaming world for a reason: it has robust performance across a wide variety of workloads and hardware.

How to run these tests

You can take the original Redpanda benchmarks and simply add the config keyDistributor: "KEY_ROUND_ROBIN" to the workload file and change the number of producers. I recommend running the Kafka tests from my OMB repo, which has Kafka configured correctly and Java 17 installed. I will be contributing these improvements back to the original OMB repo.
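
For reference, a manual OMB run from the client machine looks something like the line below; the driver and workload file names are placeholders, so substitute the files from your own deployment.

sudo bin/benchmark \
  --drivers driver-kafka/kafka.yaml \
  workloads/1gb-sec-288-partitions-keyed.yaml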

You can find the workload scripts I ran in my repo here.

In the next post we set acks=1 and try to reach the 2 GB/s limit of the NVMe drives.

Series links: