Event-Driven Architectures - The Queue vs The Log

A messaging system is at the heart of most event-driven architectures and there are a plethora of different technologies in the space and they can be classified as either queue based or log based.

Queue based: RabbitMQ, ActiveMQ, MSMQ, AWS SQS, JMQ and many more.

Log based: Apache Kakfa, Apache Pulsar, AWS Kinesis, Azure Event Hubs and many more.

Each messaging system has different features but at the heart are their data structures: queue or log. In this post we'll take a look at how the underlying data structure affects your event-driven architecture.

The Queue

Queues are fundamentally transient in nature. The very reading from a queue removes the data.

Different applications cannot share a queue because then they would compete to consume the messages. They need their own queue.

If we want to permit multiple applications to consume messages then we need the publish-subsribe pattern. Now we don't want to publish to queues but a topic. A topic is a somewhat abstract concept and allows us to decouple publishing from consumption. A routing mechanism is required to route messages from topics to physical queues.

An application can have its own queue and route multiple events types from multiple topics to a single queue. This allows applications to keep the ordering of related events. Which events it wants to combine can be configured differently for each application.

The Log

Logs are persistent in nature and a shared resource. Reading from logs does not remove the data. Multiple consumers can read from the same log at the same time from different positions. There is no routing of messages to per-application queues. This means that modelling is now based on a shared scenario. We cannot mix together different events according to the needs of each application. We need to make decisions on what related events should be grouped per topic, as part of a wider system architecture.

This means that modelling an event-driven architecture using a queue based messaging platform is very different to a log based platform.

Why Send Multiple Different Events to the Same Queue/Log?

Let's imagine we have the following events: OrderPlaced, OrderModified and OrderCancelled. A user could easily cancel an order shortly after placing the order. For each OrderCancelled event we get 1000 OrderPlaced events. If we have a separate queue or log for each event then there is a very good chance we can process cancellations out of order. All it would take is a backlog of OrderPlaced events to accrue in the OrderPlaced queue/log and an empty OrderCancelled queue/log.

I can think of many examples:

Modification gets processed before creation event
New customer address event processed before the new customer event
Modifications get processed out of order leaving the data in an inconsistent state

So if we can, we really want to be able to group related events into a single queue/log. When all related events, such as all event related to a customer, or an order or a flight exist in the same sequence we get ordering guarantees that are not possible when we store them separately.

The Decoupling of Publishers and Consumers

With queue systems that support publish-subscribe, we can publish to topics (or exchanges like in RabbitMQ) and use routing to move those messages to physical queues that are not shared between applications. This decoupling of publishing to consumption is powerful and makes an architecture easy to evolve and customize. We can publish to topics without forcing any grouping of events. We can have a separate topic per event and just publish each event to its corresponding topic. The routing to queues is not fixed and can be changed without impacting other applications. This gives us a lot of freedom to evolve and change our architecture. Additionally, each application can get relative ordering of any arbitrary set of events, thus avoiding many pain points that can occur in an event-driven architecture.

With a log, publishing and consumption are more coupled regarding who consumes what. A publisher publishes to a topic and a consumer consumes all the messages of that topic. It is true that a consumer could read from multiple topics and achieve something similar to the routing of topics to queues but those events cannot be ordered relative to each other like when events are mixed into a single queue. The choice of which events to mix into a given topic are made for the greater good. Once a decision has been made, any changes to that have long lasting consequences as some logs may store data for a long long time. We can't go changing which events go to which log all the time. This puts the log at a distinct disadvantage to queues in this regard.

But a log decouples publishing from consuming in a way that a queue cannot. We are free from temporal constraints. No longer are consumers constrained to consume messages once, from around the same time the publisher sent the messages out. Consumers can go back in time and consume events from the past. New consumers can come online and read messages from back before the application was even developed! This is a massive improvement over queues.

Scaling and Ordering

Where queues might be able to give you ordering across arbitrary sets of events, they might lose those guarantees when you start scaling.

The easy way to scale queue consumers is to create competing consumers that process the messages of a single queue in parallel. Processing a single sequence in parallel means you lose the ordering. I need to study the options with other queuing technologies, but with RabbitMQ we do have a solution.

The solution is to mimick what log based systems do. Kafka, Pulsar, Kinesis and Event Hubs all have log partitions and you can route messages to partitions based on a message key. This means that all messages of a given entity, like an order, go to the same partition. We only have a single consumer per partition which means we keep the ordering guarantees. This allows for scaling and ordering.

RabbitMQ has the Consistent Hashing Exchange. We can create multiple queues and bind them to the Consistent Hashing Exchange and perform hash based routing on the routing key, a message header or a message property. So far so good. Just like Kafka really. Except that Kafka also manages our consumers and performs automatic partition assignment for our consumers. This is where RabbitMQ isn't so great. The logic of queue assignment is not there, we have to develop it ourselves. But this is simply a technical challenge, not an architecture challenge. It is something I will be looking into further in the future.

But there is an additional problem. Now that our events are partitioned into multiple queues, if we want to maintain ordering across an arbitrary set of events, we'll need those events to use the same type of partitioning key. For example, an order id, a customer id etc. But it is unlikely that an arbitrary set of events will all be related to same entity.

So ultimately, we cannot claim queues can provide ordering across an arbitrary set of events when we perform partitioning. Only across groups of events that have the same partitioning key.

Something you can do to maximize the set of events that can be ordered together is to add multiple message headers. For example, a customer address event could have a Customer Id and an Address Id message header. That way different applications can include this event in an ordered grouping by either id. It increases the numbers of events that can be grouped together.

Just take into account that with Kafka for example, we have the same issue of message keys. If we mix five events into a given log (topic) then we need to make sure they all have the same message key, Order Id for example. Else we'll get no benefit of mixing them together.

The Decision - Queue or Log

So when evaluating your next messaging system, think about whether it is a queue based system or a log based system. Ask yourself what you value more?

Queues can lead to an architecture that is easier to evolve and customize to each application. Each application can get ordering guarantees across the events it needs (with caveats around scaling). But data is transient. Once it has been consumed, it is gone. If you need it again you'll need to find another source.

Logs are going to force you to take the time to carefully design your grouping of events per log. You have to make the decision for your overall system, you cannot optimize the grouping for each consumer. Making changes is more difficult because your logs are a shared resource. Messages may be stored for a long time and so backwards compatibility is going to be an issue. But you get the kick ass benefits of having persistent data:

time travel
event sourcing
source for data replication to other systems

So hopefully I've given you something to think about. In the next post, we've built out some queue based and log based topologies based on what has been described in this post.