Kafka vs Redpanda Performance - Part 6 - Draining backlogs

In the last post we saw how only Apache Kafka was able to fully utilize the 2 GB/s throughput limit of the i3en.6xlarge. In this post we’re going to test the ability of Kafka and Redpanda to drain a backlog while under continued producer load.

This test starts the producers and consumers at the target throughput, then pauses the consumers until consumer lag builds up to a desired amount (also known as a backlog). The consumers are then resumed and we measure how long it takes for them to catch up and return to sub-second end-to-end latency.

This test is an important one when doing a sizing analysis as this is an event that can happen in production. Some kind of outage affects your consumers but you don’t want to (or can’t) stop your producers. The ability of your consumers to drain the backlog that has built up and return to normal is critical.

Measuring lag

See the Redpanda docs for measuring consumer lag. Alternatively, you can simply use the client end-to-end metrics on the clients dashboard of my OMB repository.

For Kafka, you can deploy the Kafka Lag Exporter or simply use end-to-end metrics as a guide.
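If you go the Kafka Lag Exporter route, it exposes per-partition lag as Prometheus metrics that you can sum per consumer group. Below is a minimal PromQL sketch; the metric and label names are taken from the Kafka Lag Exporter and may differ by version, and the group name is a placeholder.

# Total consumer group lag in messages, summed across all topics and partitions.
# kafka_consumergroup_group_lag is exposed by the Kafka Lag Exporter;
# "sub-000" is a placeholder consumer group name.
sum by (group) (kafka_consumergroup_group_lag{group="sub-000"})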

Noteworthy behavior

The Redpanda backlog draining tests I ran all showed a common pattern of behavior. In the early phase of draining, the brokers read from disk because those records were no longer cached in memory, then switched to serving reads from their caches in the final stage, once they could. In every case, the period of reading from disk was the most rapid, and once the disk reads subsided, the backlog draining slowed. In some cases, as soon as reads were being served from the caches, the drain rate slowed so much that the backlog started to grow again. In one case Redpanda entered an equilibrium where the backlog partially drained and then remained fixed at a certain size.

1 GB/s producer load, 5TB backlog, consumer fanout of 1

As usual, we’re using three i3en.6xlarge instances, matching the Redpanda benchmarks and TCO analysis.

In this test the consumers are down for 84 minutes, which at 1 GB/s creates a 5TB backlog on disk.

Redpanda run #1 - After 4 hours, it was unable to drain the backlog

The first run was not making progress after 4 hours.

Fig 1. The backlog is not draining after 4 hours.

During the accelerated period of disk reading, the Redpanda consumers were unable to catch up, so I figured that once the slower catch-up period started, the backlog would start growing again. Therefore I stopped the test at the 4 hour mark.

Fig 2. When Redpanda reads from disk, the catch-up or drain rate is more rapid.

Redpanda run #2 - Unable to drain

After 9 hours, Redpanda reached an equilibrium where the backlog remained partially drained with end-to-end latency being stable at 30 minutes.

Fig 3. Redpanda reaches an equilibrium where the backlog becomes stable but never drains.

Catch-up was accelerated during the read-from-disk phase. Once reads were served from the caches, catch-up slowed and Redpanda settled into the strange equilibrium.

Fig 4. The early read-from-disk phase subsides, and so too does the catch-up rate.

Kafka run #1 - Drained in 3 hours

Fig 5. Kafka drains the backlog in 3 hours. We see consumer lag build up, then start to drop almost as quickly as it grew. Kafka also saw a slowdown in catch-up towards the end here.

1 GB/s producer load, 2.5TB backlog drain test, fanout 2

Consumer downtime of 42 minutes creates a 2.5TB backlog on disk. With two consumer groups, a total of 5TB of backlog reads must be performed. We measure the time for the backlog to reach zero while the cluster still receives 1 GB/s of ingress.

Redpanda run - Backlog immediately grows - unable to drain

Fig 6. Redpanda didn’t get close to being able to drain the backlog with two consumer groups and the constant 1 GB/s producer load.

Kafka - Drained in 130 minutes

Fig 7. Kafka drains the backlog in 130 minutes.

800 MB/s, 1.67 TB backlog drain test, fanout 3

Consumer downtime of 34 minutes creates a 1.67TB backlog on disk. With three consumer groups, a total of 5TB of backlog reads must be performed. We measure the time for the backlog to reach zero while the cluster still receives 800 MB/s of ingress.

Redpanda - Backlog drops then grows - unable to drain

Fig 8. At 800 MB/s, Redpanda initially does well, but then stalls and the backlog starts to grow again.

Kafka - Drained in 74 minutes

Fig 9. With constant 800 MB/s producer load, Kafka drains the backlog in 74 minutes.

Conclusions

The ability of consumers to catch up after downtime, while producers continue to apply load, is important for any event streaming system. Redpanda showed that, under the 800 MB/s and 1 GB/s workloads in this post, consumers were unable to catch up after a brief interruption. Kafka, on the other hand, was able to drain the backlogs and return to normal in all cases.

Backlog draining tests are an important part of a sizing analysis but are easy to overlook. It seems this test was overlooked when the Redpanda TCO analysis was performed.

How to run this test

To run a backlog draining test, simply add the following line to your workload file:

consumerBacklogSizeGB: 5000

The above line tells OMB to pause consumers until 5TB (5000 GB) of consumer lag has accumulated. It counts each subscription in the calculation, so 5TB with 3 subscriptions means that 1.67TB of backlog is built up on disk.
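For context, a full workload file looks something like the sketch below. The field names are standard OMB workload parameters, but the values are purely illustrative and not the exact workloads I ran (those are in my OMB repo).

# Illustrative OMB workload sketch - field names are standard OMB parameters,
# values are examples only.
name: 1gb-per-sec-backlog-drain
topics: 1
partitionsPerTopic: 288
messageSize: 1024                # 1 KB messages
payloadFile: "payload/payload-1Kb.data"
subscriptionsPerTopic: 1         # fanout of 1
consumerPerSubscription: 64
producersPerTopic: 64
producerRate: 1000000            # ~1 GB/s at 1 KB per message
consumerBacklogSizeGB: 5000      # pause consumers until ~5 TB of lag accumulates
testDurationMinutes: 300
warmupDurationMinutes: 5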

You can track the consumer lag of Redpanda with a PromQL query.
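A minimal sketch of such a query, assuming Redpanda’s public metrics endpoint is being scraped by Prometheus, is shown below. The metric names are from Redpanda’s public metrics and may differ by version; the topic name is a placeholder.

# Approximate consumer lag in messages for one topic: latest offsets minus the
# offsets committed by the consumer group, summed across partitions.
sum(redpanda_kafka_max_offset{redpanda_topic="benchmark-topic"})
  - sum(redpanda_kafka_consumer_group_committed_offset{redpanda_topic="benchmark-topic"})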

For Kafka, see the Kafka Lag Exporter GitHub repo for instructions.

You can see the workloads I ran in my OMB repo here.

Series links: