Kafka Share Groups and Parallelizing Consumption - Part 3: Client-local parallelism

In the last post, Broker-Visible vs Client-Local Parallelism, we looked at two ways of scaling Kafka consumption. The final unit of parallelism can be visible to the broker, as consumers, or it can be local to the client, as threads, virtual threads, async tasks, or some other execution mechanism hidden behind a smaller number of consumers.

Broker-visible parallelism is simple to reason about: if each consumer processes records serially, we add more consumers to increase parallelism. But each consumer adds overhead to the brokers: broker-side protocol state, TCP connections, group membership, fetch state, and participation in the consumer or share group protocol. With long processing times and/or high throughput, the required number of parallel workers can easily exceed what is practical to model as broker-visible consumers.
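As a back-of-the-envelope illustration of how quickly the worker count grows, Little's law says the average number of records in flight equals the arrival rate multiplied by the processing time. The sketch below applies it with made-up numbers; the function name and the example rates are hypothetical, not measurements from this series.

```python
import math


def required_workers(records_per_second: int, processing_ms: int) -> int:
    """Little's law: average in-flight work = arrival rate x time in system.

    Rounded up, this is the minimum number of serial workers needed to keep up.
    """
    return math.ceil(records_per_second * processing_ms / 1000)


# Illustrative numbers only (assumptions, not benchmark results).
print(required_workers(1_000, 50))    # 1,000 records/s at 50 ms each
print(required_workers(20_000, 200))  # 20,000 records/s at 200 ms each
```

With serial processing per consumer, the second case would need thousands of broker-visible consumers, which is the situation the paragraph above describes.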

That is where client-local parallelism becomes important. Instead of scaling by adding more consumers, each consumer application can poll records and process them concurrently inside the client. This allows a smaller number of Kafka consumers to drive a much larger amount of parallel work.
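A minimal sketch of the idea, assuming a thread pool as the local execution mechanism. It does not use a real Kafka client: the batch is a hypothetical stand-in for whatever a poll would return, and `process` and `process_batch_concurrently` are illustrative names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process(record: int) -> int:
    """Stand-in for real work: simulate I/O-bound processing of one record."""
    time.sleep(0.01)
    return record * 2


def process_batch_concurrently(batch: list[int], max_workers: int) -> list[int]:
    """Process one polled batch with a local thread pool.

    Executor.map preserves input order and this function returns only once
    every record in the batch has been processed.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, batch))


# A hypothetical polled batch; a real client would get this from a poll call.
batch = list(range(16))
results = process_batch_concurrently(batch, max_workers=8)
print(results[:4])
```

Here a single consumer drives eight workers, so the broker sees one group member while the client processes up to eight records at a time.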

In this post, we’ll compare client-local parallelism with consumer groups and share groups using Dimster, the benchmarking tool used throughout this series, which uses the official Apache Kafka clients under the hood. The main comparison is between two styles of client-local parallelism: blocking and continuous.

Broker-Visible vs Client-Local Parallelism

This post is a little side-quest from my “Kafka Share Groups and Parallelizing Consumption” series.

My “Kafka Share Groups and Parallelizing Consumption” series (part 1, part 2) has been laser-focused on how different configurations and behaviors affect parallel consumption in share groups. So far I’ve shown that you most definitely can hold share groups wrong. You could quite easily and inadvertently create a work queue and, with the right combination of things going against you, see a small number of consumers dominate, leaving most consumers starved of messages. All the while, lag builds and builds. You need to know the settings and what they do.

But it’s worth asking the question: is parallelizing consumption what share groups are for?

Kafka Share Groups and Parallelizing Consumption - Part 2: Producer Batches and share.acquire.mode

In the last post we used simulated consumer processing time to reveal how important it is to set an appropriate value for max.poll.records. The rule of thumb was a value somewhat lower than:

group.share.partition.max.record.locks / number of consumers per partition

But there’s more to parallel consumption than max.poll.records. The size of producer batches also plays a role when using the default share.acquire.mode (batch_optimized).
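To make the rule of thumb concrete, here is a small worked calculation. The lock limit, consumer count, and the `headroom_percent` knob are assumptions chosen for illustration; only the division itself comes from the rule above.

```python
def max_poll_records_ceiling(max_record_locks: int, consumers_per_partition: int) -> int:
    """Upper bound from the rule of thumb: record locks / consumers per partition."""
    if consumers_per_partition < 1:
        raise ValueError("consumers_per_partition must be at least 1")
    return max_record_locks // consumers_per_partition


def suggested_max_poll_records(
    max_record_locks: int, consumers_per_partition: int, headroom_percent: int = 20
) -> int:
    """Pick a value 'somewhat lower' than the ceiling.

    headroom_percent is a hypothetical knob for this sketch, not a Kafka setting.
    """
    ceiling = max_poll_records_ceiling(max_record_locks, consumers_per_partition)
    return max(1, ceiling * (100 - headroom_percent) // 100)


# Illustrative values (assumptions): 2,000 record locks per partition,
# 10 consumers reading from each partition.
print(max_poll_records_ceiling(2000, 10))    # ceiling: 200
print(suggested_max_poll_records(2000, 10))  # 160 with 20% headroom
```

How much headroom is enough depends on the workload; the 20% here is only a placeholder.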

Kafka Share Groups and Parallelizing Consumption - Part 1: Tuning max.poll.records

All tests were executed against Kafka 4.2.0 using Dimster.

In the last post we measured the overhead that the mechanics of share groups add, and saw that it is pretty small. Likewise, raw throughput was comparable to consumer groups, and in one test it even exceeded consumer group throughput.

In this post we’re going to simulate processing time in the consumers to make these benchmarks more realistic and show the utility of share groups (namely the ability to parallelize processing beyond the partition count).

We’ll see how the following two configurations play an important role in parallelizing consumption with share groups:

  • max.poll.records (consumer config)

  • group.share.partition.max.record.locks (broker-side config)

Benchmarking Apache Kafka Consumer Groups vs Share Groups (overhead test)

In my last blog post I introduced Dimster (DIMensional teSTER), a performance benchmarking tool for Apache Kafka with a specific set of philosophies.

In this first share group benchmarking post, we’re going to use share groups as they are not intended to be used, but for a good reason. Share groups allow you to move past partitions as the unit of parallelism by allowing multiple consumers to read from the same partition, using message queue semantics. We’ll run those kinds of tests in the next post. In this post I just want to understand if the mechanics of how share groups work add any additional overhead compared to consumer groups. So we’ll use share groups as if they were consumer groups (by capping consumer count to partition count).

Objective: Use synthetic tests to measure the overhead of share groups compared to consumer groups in identical conditions.

How: Like-for-like tests that use an identical workload and topology, with consumerType (CONSUMER_GROUP|SHARE_GROUP) as a dimension. Given identical producer/consumer counts, producer rate, and topic/partition counts, do share groups scale as well as consumer groups? Do they add any latency overhead?
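One way to picture a dimension is as a list of values that gets crossed with the other dimensions while everything else stays fixed. The sketch below is a generic illustration of that idea. It is not Dimster's configuration format, and every name and value other than consumerType and its two settings is made up.

```python
from itertools import product

# Hypothetical dimensions for illustration; this is not Dimster's config format.
dimensions = {
    "consumerType": ["CONSUMER_GROUP", "SHARE_GROUP"],
    "partitions": [1, 10, 100],
}

# Everything held constant across test points (made-up values).
fixed = {"producers": 4, "producerRatePerSec": 10_000}


def expand(dimensions: dict, fixed: dict) -> list[dict]:
    """Return one test point per combination of dimension values."""
    names = list(dimensions)
    return [
        {**fixed, **dict(zip(names, values))}
        for values in product(*(dimensions[n] for n in names))
    ]


test_points = expand(dimensions, fixed)
for point in test_points:
    print(point)
```

Each partition count appears once per consumer type, so any difference between a matched pair of results can be attributed to the group type rather than the workload.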

Introducing Dimster, a performance benchmarking tool for Apache Kafka

Most of my career in distributed systems has been as a tester, performance engineer and formal verification specialist. I’ve written performance benchmarking tools in the past for RabbitMQ and Apache Pulsar, but in recent years I’ve used OpenMessagingBenchmark (OMB) to run benchmarks against Apache Kafka and other messaging systems. But OMB is hard to deploy and has several limitations compared to more sophisticated benchmarking systems I’ve developed in the past. With Claude becoming so much better since Christmas, I decided to write a Kafka-centric performance benchmarking tool, with a lot of inspiration from OMB. I took the bits I like about OMB and the things I like about the tooling I’ve built in the past to make a performance testing tool for Apache Kafka.

In this post I’ll introduce some aspects of Dimster that are core to its design:

  1. Dimensional testing

  2. Shareable, self-contained results with reproducibility in mind

  3. Test modes

  4. Benchmark prep and post-processing

  5. Kubernetes as a standardized runtime

The Three Durable Function Forms

Durable execution engines (DEEs) talk about “workflows”, “activities”, “virtual objects”, “handlers”, and “functions”, but they’re often describing the same underlying execution patterns. This post proposes a model that extends the generic durable function into three forms: stateless functions, sessions, and actors. This complements my previous posts (on determinism and durable function trees) in this series I dub “The Theory of Durable Execution”.

I’ll cover this in three parts:

  1. The behavior-state continuum

  2. The three durable function forms and associated properties

  3. Mapping the DE frameworks to these forms

The Durable Function Tree - Part 2

In part 1 we covered how durable function trees work mechanically and the importance of function suspension. Now let's zoom out and consider where they fit in broader system architecture, and ask what durable execution actually provides us.

Function Trees and Responsibility Boundaries

Durable function trees are great, but they aren’t the only kid in town. In fact, they’re like the new kid on the block, trying to prove themselves against other more established kids.

Earlier this year I wrote Coordinated Progress, a conceptual model exploring how event-driven architecture, stream processing, microservices and durable execution fit into architecture, within the context of multi-step business processes, aka workflows. I also wrote about responsibility boundaries, exploring how multi-step work is made reliable inside and across boundaries. I’ll revisit that now, with this function tree model in mind.

The Durable Function Tree - Part 1

In my last post I wrote about why and where determinism is needed in durable execution (DE). In this post I'm going to explore how workflows can be formed from trees of durable function calls based on durable promises and continuations. 

Here's how I'll approach this:

  • Part 1

    • Building blocks: Start with promises and continuations and how they work in traditional programming.

    • Making them durable: How promises and continuations are made durable.

    • The durable function tree: How these pieces combine to create hierarchical workflows with nested fault boundaries.

    • Function trees in practice: A look at Temporal, Restate, Resonate and DBOS.

  • Part 2

    • Responsibility boundaries: How function trees fit into my Coordinated Progress model and responsibility boundaries.

    • Value-add: What value does durable execution actually provide?

    • Architecture discussion: Where function trees sit alongside event-driven choreography, and when to use each.

Demystifying Determinism in Durable Execution

Determinism is a key concept to understand when writing code using durable execution frameworks such as Temporal, Restate, DBOS, and Resonate. If you read the docs, you’ll see that some parts of your code must be deterministic while other parts do not have to be. This can be confusing to a developer new to these frameworks.

This post explains why determinism is important, where it is needed, and where it is not. Hopefully, you’ll come away with a better mental model that makes things less confusing.