Event-Driven Architectures - Queue vs Log - A Case Study

In the previous post we looked at relative event ordering and the decoupling of publishers and consumers among other things. In this post we'll take those concepts and look at an example architecture. We'll look at the various modelling possibilities we have with RabbitMQ representing a queue based system, and Kafka representing a log based system.

Our example architecture has four services that interact via events.

The Sales and Inventory service emits three events:

order.placed
order.modified
order.cancelled

The Billing service consumes those three events, and either charges the client or refunds the client, emitting:

order.billed
modification.billed
order.refunded

The Fulfillment service consumes:

order.placed
order.modified
order.cancelled
order.billed
modification.billed

When an order.placed event is received the shipment starts getting prepared. Only once an order.billed event is received does the shipment get shipped. If an order.modification is received and the shipment has not shipped yet, then it can be modified, else a new shipment starts getting prepared. Only once a modification has been billed can the shipment take place. If an order.cancelled event is received and the shipment has not shipped yet, then it is cancelled and the items returned to inventory.

The Notifications service simply sends email to the client on certain important events. Currently those are:

order.placed
order.shipped
order.refunded

Queue Based Topology

With a publish-subcribe queue based topology each publisher publishes to a topic. We can have a topic per event. In the case of RabbitMQ, we have exchanges. We could create a fanout exchange per event and this mimicks topics that you get with other queue based pub-sub systems.

We really want related events to be ordered relative to each other, so each application will have a single queue which it will bind to the event exchanges it needs.

By routing the events it needs to a single queue we ensure each application consumes those events in the correct relative order. If we route the events to a separate queue per event we can easily process an order.cancelled before an order.placed whenever the order.placed queue gets backed up a little.

Fig 3. RabbitMQ topology with fanout exchange per event, single queue per application

But as we saw in the last post, this does not scale so well. If we want strong ordering guarantees we can stick to a single queue and single consumer of that queue. But when one consumer is not enough, then we can create competing consumers for the queue, but our ordering guarantees are lost.

So we need to partition our queues in order to get scale and ordering guarantees. We can use the Consistent Hashing Exchange for this.

Currently we have our fulfillment queue bound directly to the five exchanges it needs.

Fig 4. Fulfillment consumer binds its queue directly to event exchanges

We change that to having five queues bound to a Consistent Hashing Exchange which itself binds to the five exchanges.

Fig 5. Fulfillment consumer binds a Consistent Hashing Exchange to the event exchanges it wants to consume, scales out its queue to 5 partitioned queues and has a single consumer instance per queue. Messages have an OrderId header which is used for … — Fig 5. Fulfillment consumer binds a Consistent Hashing Exchange to the event exchanges it wants to consume, scales out its queue to 5 partitioned queues and has a single consumer instance per queue. Messages have an OrderId header which is used for routing to the partitioned queues.

We can use this topology for each consumer that needs to scale out while maintaining correct relative ordering of various events.

How we match consumer instances to queues is out of scope for this post. But you could make something dynamic, using Consul or ZooKeeper to help guarantee the correct queue assignment. Or you could go for something static, using your configuration system and deployment pipeline.

Log Based Topology

We no longer have the luxury of the decoupled architecture we had with RabbitMQ where we published to event exchanges and let the applications decide how they want to mix events in their queues. With Kafka our topics are physical data structures which are shared by producers and consumers alike. We must think hard and choose our topology well. Changing it later could be painful.

We could go for a single Orders topic, guaranteeing relative ordering of all the events related to a given order. Or we could have three topics related to orders, billing and shipping. Or a topic per event. I think a topic per event is a poor choice due to processing related events out of order.

Option 1 - Single Orders Topic

Fig 6. A single topic for all order related events. Consumers read all messages and ignore the events they are not interested in.

With this topology we get guaranteed ordering of all the various events related to orders. Because this is Kafka, we also get scaling and ordering out-of-the-box, if the producers all use the partitioning by message key method of choosing a partition. In this case the message id would be the OrderId or even CustomerId.

You may be thinking that it is wasteful for each consumer to be reading messages that they are not interested in. Even if each consumer simply ignores the events it is not interested in. The Billing service is even forced to read the events it itself published! But correctness is not a nice-to-have. Correctness is a property we should all have in mind whether it is designing an architecture or writing a single function. Even when we decide to optimize for a different property, we should do so consciously, knowing that we sacrifice correctness for another more important consideration.

When this shared topic can be problematic is if you have a situation where one or two events dominate, with high volume. A consumer that is only interested in a single event that has a low message volume is forced to scale out in order to be able to read the fire hose of order events. In this case we might decide to loosen the grouping of related events as we'll see in Option 2.

Option 2 - Multiple Topics, Mini Groupings

In this topology we still group related events, but the groupings are not global to all events related to orders.

Fig 7. Using multiple topics wuith smaller groupings of related events

Even though we don't get total ordering of all order related events, we still maintain strong relative ordering of events. We also reduce the demands on consumers that only need a low volume event.

One last thing to mention is the Option 3, one topic per event. This may well be optimal in a data analytics pipeline. Data analytics systems tend to have less stringent requirements regarding event ordering. But this comparison is all about an event-driven architecture where ordering matters.

Comparing the Topologies

With Kafka, whether or not you go for option 1 or 2, or something else again, you still are now coupling your producers and consumers to the same topic. If you want to change your topics later on that can be a painful exercise. Especially if you use event sourcing, or you use your topics for data replication to other systems. So while you can get ordering and scaling, your architecture is less easy to evolve and hard to optimize for any given application.

With RabbitMQ, we have better decoupling and can change our topology relatively easily. It is doubtful we would ever need to move away from an exchange per event, this mimicks the pub-sub topic concept nicely. We may well decide to change the backend architecture though. The number and type of consumers, how we scale out consumers. How consumers group related events. All these things can be done independently per consumer application without impacting the wider architecture.

But don't forget the killer features of Kafka: scale and persistence. You might find RabbitMQ is not stable enough for the scale at which you operate. You might want to use an event sourcing model for replicating state changes across multiple applications, data analytics platforms. Being able to reconsume 24 hours of messages because a bug was introduced in a consumer is a really nice capability to have.

So as always it's all about trade-offs and the architect needs to evaluate the trade-offs against their specific situation.