Kafka vs Redpanda Performance - Do the claims add up?

Apache Kafka has been the most popular open source event streaming system for many years and it continues to grow in popularity. Within the wider ecosystem there are other open source and source available competitors to Kafka such as Apache Pulsar, NATS Streaming, Redis Streams, RabbitMQ and more recently Redpanda (among others).

Redpanda is a source-available Kafka clone written in C++ using the Seastar framework from ScyllaDB (a wide-column database). It uses the popular Raft consensus protocol for replication and all distributed consensus logic. Redpanda has been going to great lengths to explain that its performance is superior to Apache Kafka’s due to its thread-per-core architecture, its use of C++, and a storage design that can push high-performance NVMe drives to their limits.

They claim to be the latest stage in the evolution of distributed log systems, painting a picture of Kafka and other systems as designs for a bygone era of spinning disks whose time has passed. They make a compelling argument not only for better performance but also for lower Total Cost of Ownership (TCO), and their benchmarks seem to back it all up.

The interesting thing about Redpanda and Kafka is that they are both distributed, append-only log systems that typically run with a replication factor of 3, are both leader-follower based and use the same client protocol. Additionally, both chose to map one partition to one active segment file rather than use a shared-log storage model such as that of Apache Pulsar and Pravega. For the same throughput, they write the same amount of data to disk (to the same number of files) and they transmit the same amount of data over the network. So regarding the superior throughput claims, the question is: given that we typically size our cloud instances based on network and disk IO capabilities, will a CPU-optimized broker make a big difference to cost? Or is it that Redpanda can simply write data to disk faster than Kafka? When it comes to latency, will the thread-per-core architecture make a difference? Do Kafka’s lock-based concurrent data structures represent a sizeable portion of the end-to-end latency? These are some of the questions I had when I began.

The Redpanda claims

According to their Redpanda vs Kafka benchmark and their Total Cost of Ownership analysis, if you have a 1 GB/s workload, you only need three i3en.6xlarge instances with Redpanda, whereas Apache Kafka needs nine and still delivers poor performance. These are bold claims, and they seem plausible. Built in C++ for modern hardware with a thread-per-core architecture sounds compelling, and it seems logical that the claims must be true. But are they?

Since joining the Apache Kafka community in February 2022, I haven’t seen any interest from the community in verifying these Redpanda claims of superior performance. Even ChatGPT seems to think Redpanda has superior performance to Kafka, because basically no-one has actually tested whether it’s true. So I decided to put Redpanda to the test myself and see if it all stands up to close scrutiny. I didn’t test nine-node Kafka clusters - all tests were run with three brokers, always on the i3en.6xlarge.

I’m a distributed systems engineer with a large amount of benchmarking experience and I’ve previously contributed to the OpenMessagingBenchmark project. I spent months of my life benchmarking RabbitMQ quorum queues and RabbitMQ Streams while at VMware. While at Splunk I spent significant time benchmarking Apache BookKeeper as part of my performance and observability work as a BookKeeper committer. Now I work at Confluent, with a focus on Apache Kafka and our cloud service. So I’m pretty used to this kind of project, and the great thing is that the OpenMessagingBenchmark makes it pretty easy to run.

What did I find?

I ran benchmarks against Redpanda and Kafka on identical hardware: three i3en.6xlarge instances, the same client configurations, and either both with TLS or both without.

I can tell you, now having wrapped up my benchmarking work, that Redpanda’s claims regarding Kafka and even their own performance are greatly exaggerated. This is probably no surprise given all this is really just benchmarketing, but as I stated before, if no-one actually tests this stuff out and writes about it, people will just start believing it. We need a reality check.

As you’ll see from this analysis, no-one should use three i3en.6xlarge instances for a 1 GB/s workload. Redpanda manages it with the given partition and client counts, but quickly stops managing it once you start making minor modifications to that workload.

I started testing Redpanda back in March and have been running the same tests against both Kafka and Redpanda since then. Some notable findings are:

  • The 1 GB/s benchmark is not at all generalizable as Redpanda performance deteriorated significantly with small tweaks to the workload, such as running it with 50 producers instead of 4.

  • Redpanda performance during their 1 GB/s benchmark deteriorated significantly when run for more than 12 hours.

  • Redpanda end-to-end latency of their 1 GB/s benchmark increased by a large amount once the brokers reached their data retention limit and started deleting segment files. Current benchmarks are based on empty drive performance.

  • Redpanda struggled when producers set record keys, causing messages to be dispatched to partitions based on those keys, which resulted in message batches that were both smaller and more numerous.

  • Redpanda was unable to push the NVMe drives to their throughput limit of 2 GB/s with acks=1 but Kafka was.

  • Kafka was able to drain large backlogs while under constant 800 MB/s or 1 GB/s producer load but Redpanda was not. Its backlogs continued to grow or entered a stable equilibrium where the backlog drained partially but never fully.

In all the above cases, Kafka usually outperformed Redpanda to a large degree, both reaching higher throughput and achieving lower end-to-end latency, even the tail latencies - on identical hardware. In other tests I ran Redpanda outperformed Kafka (though never on throughput) - so yes Redpanda “can” be “faster” than Kafka but the reverse is also true.

I will briefly explain these findings below with data. I will also publish deeper-dive blogs that explore the results in more depth and if possible discuss potential underlying reasons behind each finding. 

I hope you come away with a new appreciation that trade-offs exist; there is no free lunch, regardless of the implementation language or algorithms used. Optimizations exist, but you can’t optimize for everything. In distributed systems you won’t find companies or projects claiming they optimized for all of consistency, availability and partition tolerance in the CAP theorem. Equally, we can’t optimize for high throughput, low latency, low cost, high availability and high durability all at the same time. As system builders we have to choose our trade-offs; the single silver-bullet architecture that requires none is still out there somewhere, and we haven’t found it yet.

First, an audit of the benchmark code

I started with Redpanda’s OpenMessagingBenchmark repository, as their benchmark doesn’t exist in the official OMB repository. Before I began, I did an audit of the code to make sure everything looked good. Unfortunately for Kafka, it was misconfigured, with a few issues.

Issue #1 is that Kafka’s server.properties file has the line log.flush.interval.messages=1, which forces Kafka to fsync on each message batch. So all tests, even those where this is not configured in the workload file, get this fsync behavior. I have previously blogged about how Kafka uses recovery instead of fsync for safety. Setting this line will degrade Kafka performance and it is relatively uncommon for people to set it. So I removed that line from the server.properties file. Redpanda incorrectly claim Kafka is unsafe because it doesn’t fsync - this is not true.
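
For reference, the change is a one-line edit to the broker configuration. A sketch of the relevant part of server.properties (comments mine, not part of the original file):

```properties
# server.properties (Kafka broker) - sketch of the relevant setting only

# This line was present in the Redpanda OMB fork. It forces an fsync after
# every message batch and was removed for my tests, since Kafka's replication
# protocol relies on recovery rather than per-batch fsync for durability.
# log.flush.interval.messages=1

# With the flush settings left at their defaults, the OS flushes the page
# cache in the background, which is standard Kafka practice.
```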

Issue #2 is that Java 11 is still used. To be fair, this is how it is in the upstream OMB repository and in the Confluent fork. However, Java 11 is already 5 years old and Kafka works well with Java 17. Kafka especially benefits from Java 17 with TLS. So I updated the Ansible script to install Java 17 on both the clients (Redpanda included) and Kafka server.
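
The deployment change is equally small. As a rough sketch (the package name depends on the base image, so treat this as illustrative rather than the exact task in my repo), the Ansible update amounts to swapping the Java package on both the client and broker hosts:

```yaml
# Illustrative Ansible task - package name varies by distro
# (e.g. java-17-openjdk on RHEL-family images, openjdk-17-jdk on Debian/Ubuntu)
- name: Install Java 17 on clients and brokers
  package:
    name: java-17-openjdk
    state: present
```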

Issue #3 is that the Redpanda driver (the clients) is hard-coded to asynchronously commit offsets every 5 seconds. The Kafka driver, on the other hand, by default asynchronously commits on each poll. So I updated the Redpanda driver to match the Kafka driver, where the commit behavior is controlled by configuration, and ensured the two matched.
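
Conceptually, the difference between the two commit policies looks like this - a simplified sketch using the plain Kafka consumer API, not the actual OMB driver code:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitBehaviourSketch {
    // Sketch only: illustrates the two commit policies, not the OMB driver internals.
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sketch");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("enable.auto.commit", "false");

        boolean commitEveryPoll = true; // Kafka driver default; the Redpanda driver hard-coded a 5s timer

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("test-topic"));
            long lastCommitNanos = System.nanoTime();
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
                // ... record processing / latency measurement would go here ...
                if (commitEveryPoll) {
                    consumer.commitAsync(); // commit asynchronously on every poll
                } else if (System.nanoTime() - lastCommitNanos > TimeUnit.SECONDS.toNanos(5)) {
                    consumer.commitAsync(); // commit asynchronously at most every 5 seconds
                    lastCommitNanos = System.nanoTime();
                }
            }
        }
    }
}
```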

I made my own copy of the repo where all these changes exist.

Critical bug fixes to OMB

Throughout my time benchmarking Kafka and Redpanda I encountered a couple of bugs in the benchmark code along the way. The worst was one I hit a few times which caused the majority of the latency data to be lost, making the high-percentile results significantly lower than they should have been. The root cause was that when the benchmark code attempts to collect histograms from all the client machines, it retries if there is an error. The problem is that each call to get the histogram returns a copy and resets the original. So by the time the second attempt occurs, the original histogram only has a few seconds of data, producing bad results. I only discovered this when doing a statistical analysis on some results that looked strange. I will be submitting a fix for this to the upstream OMB repository.
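
To illustrate the failure mode (this is a sketch of the pattern, not the actual OMB code, and the class and method names here are hypothetical), the bug boils down to the histogram provider resetting itself on every collection attempt:

```java
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

public class HistogramRetrySketch {
    private final Recorder endToEndLatencyRecorder = new Recorder(5);

    // Buggy pattern: every collection attempt drains (and resets) the recorder,
    // so a retry after a failed attempt only sees the few seconds of data
    // recorded since the first attempt - the rest of the run's data is gone.
    public Histogram collectWithRetry() {
        try {
            Histogram h = endToEndLatencyRecorder.getIntervalHistogram(); // resets the recorder
            sendToCoordinator(h);                                         // may fail...
            return h;
        } catch (RuntimeException e) {
            return endToEndLatencyRecorder.getIntervalHistogram();        // ...and the retry returns a nearly empty histogram
        }
    }

    // One way to avoid it: snapshot once, then retry sending the same snapshot.
    public Histogram collectOnceRetrySend() {
        Histogram snapshot = endToEndLatencyRecorder.getIntervalHistogram();
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                sendToCoordinator(snapshot);
                break;
            } catch (RuntimeException e) {
                // retry sending the unchanged snapshot
            }
        }
        return snapshot;
    }

    private void sendToCoordinator(Histogram h) {
        // placeholder for the transfer of results back to the benchmark coordinator
    }
}
```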

Finding 1 - 500 MB/s and 1 GB/s, 4 vs 50 producers

I ran the Redpanda 1 GB/s benchmark at six different throughputs: 500, 600, 700, 800, 900 and 1000 MB/s. I also ran it with the original 4 producers and consumers, then with 50 producers and consumers. The result was significant performance degradation for Redpanda with 50 producers. The other noteworthy result was that Redpanda was unable to reach 1000 MB/s with TLS, which conflicts with Redpanda’s published benchmarks.
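
For context, an OMB workload definition is just a short YAML file, and the only thing that changed between the two variations was the producer and consumer count. The field names below come from the OMB workload format; the values are illustrative rather than an exact copy of the Redpanda workload file:

```yaml
name: 1gbps-288-partitions
topics: 1
partitionsPerTopic: 288
messageSize: 1024
payloadFile: "payload/payload-1Kb.data"
subscriptionsPerTopic: 1
producersPerTopic: 4        # changed to 50 in the second variation
consumerPerSubscription: 4  # changed to 50 in the second variation
producerRate: 1000000       # ~1 GB/s at 1 KB messages
consumerBacklogSizeGB: 0
warmupDurationMinutes: 15
testDurationMinutes: 60
```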

Fig 1. Redpanda end-to-end latencies with 50 producers reach 24 seconds.

I was unable to get Redpanda to 1000 MB/s with TLS, even with the sweet spot of 4 producers. I ran this three times and the chart below shows the best Redpanda result. 

Fig 2. Redpanda was unable to reach 1000 MB/s with TLS. With 50 producers it only managed 850 MB/s.

See Kafka vs Redpanda Performance - Part 1 - 4 vs 50 producers for more details and how to run this benchmark.

Finding 2 - Performance deterioration on long running tests

I ran the 1 GB/s, 288 partition, 4 producer/consumer benchmark for 24 and 36 hours. 

After around 12 hours, Redpanda end-to-end latencies jumped.

Fig 3. Redpanda p50-p90 end-to-end latency jumped after 12 hours.

The increase in tail latencies was huge.

Fig 4. Redpanda end-to-end tail latencies shot up to tens of seconds after 12 hours of sustained load.

End-to-end latency significantly increased at the higher percentiles. The p99 measurements hit 3.5s, while p99.99 went as high as 26s.

The culprit turned out to be the NVMe drives which started exhibiting very high write latencies after 12 hours of constant load. This manifested in the Redpanda metrics as “time spent starving for disk”.

Fig 5. One Redpanda broker starts reporting being starved for disk after 12 hours.

Both write and read IO time increased at the 12 hour mark.

Fig 6. The average write and read time per op jumped on one broker after 12 hours.

I was able to reproduce this on three out of three deployments, running the 1 GB/s workload. I am satisfied this is reproducible by anyone.

While Redpanda deteriorated over time, I saw that Kafka actually improved over time.

Fig 7. Kafka p99 end-to-end latencies for a 24 hour period.

Including the tail latencies.

Fig 8. Kafka p99.99 end-to-end latencies for a 24 hour period.

The NVMe drive degradation for Redpanda came down to how SSDs work under the hood and the random IO nature of Redpanda. It writes data in small 16 KB chunks which, when combined with 288 partitions, each with its own active segment file, results in an IO access pattern that sits towards the random end of the spectrum. Random IO places a greater burden on SSDs due to write amplification, which occurs as a result of the drive’s internal GC rewriting blocks of data - causing extra latency. This is an example of a system reaching the NVMe drive GC bottleneck.
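
If you want to see the drive-level effect independently of either broker, a long small-block random-write soak with a tool like fio is one way to observe SSD GC behaviour directly. This is a sketch with illustrative parameters - not the methodology I used, which was simply to run the broker workload itself - and it is destructive to any data on the target device:

```bash
# Sustained 16 KB random writes directly against an NVMe device (destroys its data).
# Over hours of constant load the drive's internal GC has to rewrite blocks,
# and write latency climbs - the same effect the Redpanda brokers ran into.
sudo fio --name=randwrite-soak --filename=/dev/nvme1n1 \
  --rw=randwrite --bs=16k --iodepth=32 --numjobs=4 \
  --ioengine=libaio --direct=1 \
  --time_based --runtime=43200 --group_reporting
```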

I go into more detail on this in Kafka vs Redpanda Performance Part 2 - Long running tests.

Finding 3 - Redpanda latency shoots up when hitting the retention limit

Retention limits exist in event streaming systems as without them, the disks would fill up and the servers would crash. Even when using tiered storage, we need to have a local retention limit.
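
Both systems expose the limit as ordinary topic configuration through the Kafka API, so the benchmark topics can be capped with something like the following (the sizes are illustrative):

```bash
# Kafka: cap a topic's local data with a size-based retention limit
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name benchmark-topic \
  --add-config retention.bytes=1500000000000

# Redpanda exposes the same topic property through rpk
rpk topic alter-config benchmark-topic --set retention.bytes=1500000000000
```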

When I ran the Redpanda tests for long enough to reach the retention limit, I saw a strange stepwise increase in end-to-end latency. The size of the step seemed to be affected by the amount of disk used and the throughput of the test. Low throughput tests didn’t exhibit the behavior, but the medium and high throughput ones did.

In the case below, the retention limit was reached just before 22:00.

Fig 9. The Redpanda stepwise increase in end-to-end latency once the brokers start deleting segment files.

This is reproducible on all the mid-to-high throughput tests I ran. A production system is unlikely to see all partitions hit their retention limits at once, so you may only see this stepwise increase in a benchmark - but the retention limit performance penalty will still exist, it will just be harder to spot.

Unfortunately for me most of my results were obtained before I made this discovery. However, I reran the 1 GB/s Redpanda benchmark, allowing the warm-up period to run long enough to fill the disks to reach the retention limit and I got very different results for Redpanda.

Fig 10. Redpanda end-to-end latency results for the 1 GB/s benchmark, when recording latencies after the data retention limit had come into effect.

The extra burden of deleting segment files has a large impact on end-to-end latency. This effect manifested on the higher throughput tests and did not always appear on lower throughput workloads (<100 MB/s). If you are considering running Kafka vs Redpanda benchmarks, do remember to measure performance on a cluster that has already reached its retention limit.

See more detail in Kafka vs Redpanda Performance - Part 3 - Hitting the retention limit where I also explain how to reproduce this.

Finding 4 - The impact of record keys

Using record keys is very common as it gives us message ordering per key. Messages are dispatched to partitions based on the hash of the record key - a pretty standard data distribution technique. The downside is that it is typically harder on the brokers, as it results in smaller, more numerous message batches, which creates more work.
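
On the client side, the only change is supplying a key; the partitioning consequences follow from the default partitioner. A minimal sketch with the standard Java producer (not the OMB driver; the topic and key names are made up):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = new byte[1024];

            // No key: recent Kafka producers use sticky partitioning, sending records
            // to one partition until a batch fills up, which produces large batches.
            producer.send(new ProducerRecord<>("orders", null, payload));

            // With a key: the partition is chosen by hashing the key (murmur2 by default),
            // so records spread across all partitions and batches become smaller and more numerous.
            producer.send(new ProducerRecord<>("orders", "customer-42", payload));
        }
    }
}
```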

I found that with record keys, Redpanda couldn’t match Kafka on throughput when trying to reach 1 GB/s and 500 MB/s using the 288 partition benchmark as a base.

Fig 11. Kafka was able to reach the target 500 MB/s with 100 producers using record keys. Redpanda topped out at 330 MB/s.

I changed tack and designed a workload that was a bit more common than a single topic with hundreds of partitions. With 40 and 80 topics, of 10 partitions each, Kafka achieved lower end-to-end latencies, even on the high percentiles.

Fig 12. Redpanda struggled with more producers which used record keys for message ordering.

The pattern is that as the message batch rate increases and batch size drops, Redpanda’s advantage shrinks or even disappears completely. The fsync rate is typically very high (>200K/s) for high-partition, high-batch-rate benchmarks, even for the 50 MB/s test, which may explain the higher latencies.

See Kafka vs Redpanda Performance - Part 4 - Impact of record keys for more details.

Finding 5 - Redpanda was unable to reach the NVMe drive limit with acks=1

I took the Redpanda 1 GB/s benchmark and modified it to start at 1 GB/s and increment in steps until it reached 2 GB/s, with 10 producers and acks=1.
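
The producer side of this test is controlled by a handful of settings; a sketch of the relevant properties (the batching values are illustrative, not an exact copy of the driver config):

```properties
# Producer settings for the drive-limit test (illustrative values)
acks=1                 # wait for the leader only - the producer does not wait for replication to followers
linger.ms=1
batch.size=131072
# The workload then steps the target producer rate from 1 GB/s up towards 2 GB/s.
```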

Fig 13. Redpanda only reached 1400 MB/s with acks=1, whereas Kafka reached 1900 MB/s.

The result was that Kafka peaked at 1900 MB/s of producer throughput, at which point it had actually reached the physical NVMe drive limit of 2 GB/s.

Fig 14. When Kafka topped out at 1.9 GB/s, it had actually reached the 2 GB/s physical limit of the NVMe drives on the i3en.6xlarge.

It seems ironic that in fact only Kafka could fully utilize the NVMe drives in the end.

See Kafka vs Redpanda Performance - Part 5 - Reaching the NVMe drive limit for the details.

Finding 6 - Redpanda unable to drain backlogs while under 1 GB/s producer load

I ran a suite of tests with a producer rate of either 800 MB/s or 1 GB/s and paused consumers in order to build up a backlog. Then I resumed the consumers and measured the time for them to drain the backlog and return to sub-second end-to-end latency - all while under constant producer load.
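
OMB supports this kind of test natively: when `consumerBacklogSizeGB` is set, the generator holds the consumers back until the producers have built roughly that much backlog, then releases them and measures the drain while producer load continues. A sketch of the workload shape (the values are illustrative):

```yaml
name: backlog-drain-1gbps
topics: 1
partitionsPerTopic: 288
messageSize: 1024
subscriptionsPerTopic: 1
producersPerTopic: 4
consumerPerSubscription: 4
producerRate: 1000000        # ~1 GB/s of constant producer load
consumerBacklogSizeGB: 500   # hold consumers back until ~500 GB of backlog has built up
testDurationMinutes: 60
```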

Redpanda was unable to drain any of the backlogs: they either continued to grow after the consumers resumed, or drained only partially, with the brokers and consumers entering an equilibrium state in which a stable backlog remained. Kafka was able to drain the backlogs in every case.

Balancing producer vs consumer rate can be challenging for system builders. Under stressful loads, do you prioritize consumers or producers, or just allow for emergent behavior based on competition for resources inside the brokers? If you prioritize producers, then consumers can end up lagging so far behind that data ends up being lost. If you prioritize consumers, then producers might see a lot of back pressure, and vital data that needs to get into the system can’t get in. There’s no easy answer, but if you are an organization deploying these systems you had better figure out how your chosen cluster size handles things like backlogs after consumer outages. A Redpanda cluster of three i3en.6xlarge is clearly not going to cut it in this case.

See Kafka vs Redpanda Performance - Part 6 - Draining backlogs for the details.

Conclusions

All these tests were run on identical hardware (three i3en.6xlarge), with the same client configurations and TLS configurations.

It is clear that the Redpanda benchmarks are not generalizable and there exists a large spread of workloads where it performs significantly worse than Apache Kafka.

To summarize what I found in these tests:

  • Keeping throughput the same but simply changing the producer and consumer count from 4 to 50 caused Redpanda performance to drop significantly below Kafka’s.

  • Running the Redpanda benchmark for 24 hours with constant load caused the NVMe drives on the Redpanda brokers to slow down due to internal drive GC, resulting in huge latencies - even with the AWS recommended 10% over-provisioning of the drives. No such slow down occurred with Kafka due to its more sequential IO access pattern.

  • Running the benchmarks and measuring latency after the retention limit had been reached showed much higher end-to-end latencies for Redpanda, with no change for Kafka. End-to-end latency results for Redpanda are really only valid once the retention size has been reached.

  • Using record keys reduced Redpanda throughput and increased latencies significantly. Kafka comfortably beat Redpanda on the record key tests. Smaller and more numerous batches cause Redpanda to choke on many small tasks and high fsync rates across many files.

  • Only Kafka was able to fully utilize the 2 GB/s throughput of the NVMe drives using acks=1.

  • While under constant producer load, consumers only managed to drain backlogs with Kafka.

Going back to the original TCO analysis claim - that you only need three Redpanda brokers on i3en.6xlarge instances for a 1 GB/s workload, whereas Kafka needs nine - it seems clear this is not the case. The truth is that a correctly provisioned production cluster would have more capacity than these three instances for a 1 GB/s workload, whether you chose Kafka or Redpanda. The fact that Redpanda cannot drain a backlog while under 1 GB/s producer load on this amount of hardware is a clear indication that three i3en.6xlarge is not good sizing. Likewise, if you bring down a Redpanda broker, it is unable to maintain the 1 GB/s load. In fact, if you read the Redpanda docs on sizing, they say to run five i3en.12xlarge for this workload, which takes us from a 72-core cluster to a whopping 240-core cluster!

These tests are easy to run yourself; each blog post explains how. I recommend running Kafka via my OMB repo rather than the Redpanda OMB, as I have Kafka correctly configured and running on Java 17.
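
For reference, once the Terraform and Ansible deployment from the repo has provisioned the client and broker machines, a run boils down to invoking the benchmark with a driver file and a workload file from the client host, along these lines (the workload file name here is illustrative):

```bash
# Run from the OMB client machine after deployment
sudo bin/benchmark \
  --drivers driver-kafka/kafka.yaml \
  workloads/1gbps-288-partitions-4-producers.yaml
```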

Some reflections

After having spent a fair amount of time benchmarking both, I would say that when Redpanda gets the right workload it can really shine - the problem is that there are many workloads where it doesn’t. Its performance is a little more precarious than Apache Kafka’s. Batch sizes can’t be too small, throughput shouldn’t be too high on high-partition workloads, and the drives need to be adequately provisioned with enough empty space to allow for the random IO nature of its storage layer. Low throughput is where Redpanda has its most robust end-to-end latency results; as you increase throughput, the specifics of the workload can have a disproportionate impact on performance. It is also extremely sensitive to drive latency.

Kafka has its issues too. The page cache is a double-edged sword: it’s great for robust performance across a wide variety of workloads, but it can cause end-to-end latency spikes that affect tail latencies. As I said before, each system chose its trade-offs. For those who are extremely end-to-end latency sensitive, a beefy Redpanda cluster might be what you want. It should be beefy, because once Redpanda starts stressing the hardware, a lot of its advantages seem to disappear. You might also need to tailor your workload to match Redpanda - some record key workloads may simply not perform.

What does built for modern hardware really mean?

Redpanda say they are the next stage in the evolution of distributed log systems… but is that actually true? They claim Kafka is now outdated, but is it? The argument seems to boil down to being “built for modern hardware”.

The question that I am mulling over is what does “built for modern hardware” really mean for a distributed append-only log system? Redpanda uses a thread-per-core architecture but is it CPU that an append-only log system should be optimizing for?

Redpanda’s claim about being built for modern NVMe drives is true. The whole design is predicated on leveraging the fact that NVMe drives can handle random IO far more effectively than spinning disks. But just because these drives can handle more random IO than an HDD doesn’t mean it’s a better architecture. As I recently blogged, sequential IO is far from dead when it comes to NVMe drives; it’s as important as ever.

In fact, I don’t believe the Redpanda storage architecture is optimal for a log system; it is probably Redpanda’s biggest weakness. With Redpanda, the partition is both a user-facing abstraction and a replication and storage abstraction. It maps one partition to one Raft cluster, which writes to its own physical log on disk. This is an obvious design choice, but one that perhaps misses out on where innovation can really be found for a distributed log storage system. Mapping one partition to one active segment file and performing fsyncs on top of large numbers of individual files has a cost - even with high performance NVMe drives. I saw 200K fsyncs per second and very high self-reported CPU utilization at just 50 MB/s of throughput due to a high partition count and small batch size. The trade-off of high fsync rates across a multitude of active files is that it can simply be too slow - even on modern hardware.

Kafka also uses this one-partition-one-active-file model but was built to mitigate the downsides of this architecture while benefiting from its strengths. It relies on the OS to flush the page cache with a more sequential IO pattern and it avoids fsyncs by using recovery in its replication protocol. The benefit of this approach is robust performance across most workloads. The trade-offs are that high-percentile latencies can be higher than in a system that constantly performs smaller flushes, and that for best durability we need multiple fault domains.

There are other options where the logical log model and the physical log model are decoupled, which introduces some benefits (and challenges). If you want more sequential write patterns, you'd want the storage abstraction to be coarser-grained than the user-facing abstraction (e.g. a shared log that stores writes from many partitions). Apache BookKeeper (used by Apache Pulsar and Pravega) is an example of this: it decouples logical logs from physical logs by taking the shared-log approach in its storage engine, which gives it a purely sequential disk IO pattern no matter how many Pulsar partitions it must support. This allows BookKeeper to decouple performance from partition count to a great extent. The trade-off is that all data is written twice - once to the write-optimized WAL and once to the read-optimized long-term ledger storage - but always sequentially.

Personally, I am a big fan of the shared-log design in the storage layer despite its other issues, such as needing more indexes for reads and more drives. Its choice of trade-offs can give it good performance across a wide range of workloads and on both SSDs and HDDs. Apache Kafka sits in the middle of this sequential/random IO continuum, with trade-offs somewhere in between BookKeeper and Redpanda.

When it comes to distributed logs, I would argue that for log based workloads, “built for modern hardware” means largely the same as it always did - using sequential IO which has better mechanical sympathy even for modern SSDs. Given that the product is a log, why not leverage that in your storage architecture? Kafka and BookKeeper don’t need to reserve large amounts of NVMe drive space for over-provisioning for a reason.

Does Redpanda deserve to sit at the apex of distributed log design? I don’t think it does. But it’s not at the back either. We see distributed log storage systems of all shapes and sizes, each making different trade-offs and all benefiting and paying the price for those decisions - Redpanda is no different.

Apache Kafka has pedigree for a reason; it isn’t old-fashioned or outdated, despite what competitors may want you to believe. Its design is as relevant today, with NVMe drives, as it was when it was conceived. That’s not to say it doesn’t have warts - it does - but the Kafka community hasn’t stopped improving the broker (such as adding Queues for Kafka). Kafka is alive and well in 2023.

So which is faster?

Redpanda can show benchmarks where Redpanda performs better. I can show you benchmarks where Kafka performs better by simply tweaking the workloads slightly or using things like record keys. The point is that benchmarks are really only useful when you run them yourself and for your own specific workload.

Competition is good, but false narratives don’t help anyone. Which is faster? As always the reality is that “it depends”.

Series links: