Bringing Order to Chaos: Fixing a Broken Messaging System
When a messaging system lies at the heart of your application architecture, you need to make it easy to respond to message processing failures. The more queues you have, the more important a coherent incident response capability becomes. Unfortunately, it is all too common to see a chaotic policy, or no policy at all, for handling messages that cannot be processed successfully. Messages get delayed or lost on a regular basis, and no one is even sure how many.
The usual approach is to create a dead letter queue on a queue-by-queue basis and send messages there that cannot be processed. But what do you do from there? In this post we'll look at a message lifecycle, baked into the messaging architecture, that can solve this problem.
Dumb Dead Letter Queues Don't Work
Incident Response and Root Cause Diagnosis
You have a bunch of dead letter queues with messages in them, so what now? What possible causes could there be for each message being there?
The original queue was full.
The message or queue TTL expired.
It failed due to a transient error.
It failed because it is unprocessable (bad data, a message versioning issue, a serialization error, a business validation failure, etc.).
It failed because of a bug in the consumer.
Depending on the cause, you either want to put the message back in the main queue or discard/archive the message.
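One reasonable mapping from cause to action can be sketched as follows. This is an illustrative Python sketch, not the author's implementation; the cause and action names are my own:

```python
from enum import Enum

class Cause(Enum):
    QUEUE_FULL = "queue-full"
    TTL_EXPIRED = "ttl-expired"
    TRANSIENT_ERROR = "transient-error"
    UNPROCESSABLE = "unprocessable"
    CONSUMER_BUG = "consumer-bug"

class Action(Enum):
    RETURN_TO_QUEUE = "return"
    DISCARD_OR_ARCHIVE = "discard-or-archive"

def action_for(cause: Cause) -> Action:
    """Map a failure cause to a triage action: environmental or transient
    problems send the message back to the main queue; permanently bad
    messages get discarded or archived."""
    if cause in (Cause.QUEUE_FULL, Cause.TTL_EXPIRED, Cause.TRANSIENT_ERROR):
        return Action.RETURN_TO_QUEUE
    return Action.DISCARD_OR_ARCHIVE
```

A consumer-bug case is debatable: once the bug is fixed and deployed, those messages become returnable, which is exactly why the triage step below matters.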
How do you diagnose the cause? If all you have is the message, then you have to hope that your error logs contain information that can be correlated to a specific message in the dead letter queue. That takes time.
Or maybe you just set up some automation that periodically puts messages back in the original queue. You'll need to ensure that this automation can keep track of the messages, so that unprocessable messages don't enter an infinite loop.
Too Many Dead Letter Queues
A few dead letter queues are manageable, but when you get into the high tens, hundreds or even thousands, having a single dead letter queue per normal queue starts to get out of control. If you are moving messages manually, the position is untenable.
Developers Don't Think About Reliability
The developer adds an exception handling block and simply sends the message to the dead letter queue. Job done, no more to think about.
The developer is exactly the person with the most power to take the correct course of action. They can write code to detect transient and persistent errors and respond accordingly. But the "send everything to dead letter" architecture robs them of the opportunity to act intelligently.
An Alternative - Create a Standard Message Lifecycle
What does a lifecycle consist of? Essentially, it is a workflow of the possible paths a message can take. It starts with the publishing of a message and covers successful processing; retries in case of transient failure; a destination for unprocessable messages; a place to archive failed messages; the option to discard; and the return of failed messages to the original consumers.
Core to this concept is that failed messages carry with them as much information as practically possible to diagnose the failure. This means:
The consumer application
The exception or error message
The publisher of the message
The number of times the message has been processed
Build the lifecycle into your messaging code library with a super simple code API for developers to use. Additionally, ensure that the code API forces them to think about the nature of the failure - is it transient or persistent? Do we retry, discard, send to a failed message queue?
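A minimal sketch of such an API in Python (the names `FailureAction`, `consume` and `classify` are my own; the real library is presumably .NET). The key design point is that there is no catch-all default: the developer must return an explicit action for every failure:

```python
from enum import Enum

class FailureAction(Enum):
    RETRY = "retry"      # transient error: requeue via a delay mechanism
    FAIL = "fail"        # persistent error: send to the failed message queue
    DISCARD = "discard"  # not worth keeping: drop the message

def consume(message, process, classify):
    """Run the business handler. On error, the developer-supplied
    `classify` function MUST map the exception to a FailureAction --
    reliability becomes a forced, explicit decision rather than an
    afterthought in a catch-all block."""
    try:
        process(message)
        return "ack"  # success: acknowledge the message
    except Exception as exc:
        action = classify(exc)
        if not isinstance(action, FailureAction):
            raise TypeError("classify() must return a FailureAction")
        return action
```

A consumer would then register something like `classify = lambda exc: FailureAction.RETRY if isinstance(exc, TimeoutError) else FailureAction.FAIL`.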
Doing It With RabbitMQ
RabbitMQ makes this message lifecycle straightforward with its powerful routing capabilities, custom message headers and dead letter exchanges.
The diagram below shows the message lifecycle I have implemented at my current client. In green is the common infrastructure shared across all consumers. To keep things simple, the diagram shows only a single publisher and consumer.
Messages flow from publishers to subscribers. When a subscriber cannot process a message, it can either send it for a retry, send it to the failed exchange, or discard it. Messages sent for a retry go to a delay exchange/queue, where each message waits for the time the subscriber has dictated and is then returned to the consumer queue.
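The delay queue pattern can be built from standard RabbitMQ queue arguments: a queue with no consumers, a message TTL, and a dead letter exchange pointing back at the exchange that feeds the consumer queue. A sketch of the declaration arguments (the exchange names are hypothetical; the `x-` argument names are standard RabbitMQ):

```python
def delay_queue_args(return_exchange: str, ttl_ms: int) -> dict:
    """Arguments for declaring a RabbitMQ delay queue. Messages sit here
    unconsumed until their TTL expires, at which point RabbitMQ
    dead-letters them to `return_exchange`, which routes them back to
    the original consumer queue."""
    return {
        "x-message-ttl": ttl_ms,                    # how long to hold the message
        "x-dead-letter-exchange": return_exchange,  # where expired messages go
    }

# e.g. a 30-second retry delay that routes back via a "retry-return" exchange
args = delay_queue_args("retry-return", 30_000)
```

With a client library such as pika, these arguments would be passed to the queue declaration (e.g. `channel.queue_declare(queue="orders.delay", arguments=args)`).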
Messages that are sent to the failed exchange are wrapped in a FailedEvent which gives some context to the failure. Failed messages are consumed and written to SQL Server.
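The FailedEvent envelope might look something like this. This is a Python sketch of the shape, not the actual implementation; field names are my own, chosen to match the diagnostic information listed earlier:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FailedEvent:
    """Envelope that wraps a failed message with the context needed to
    diagnose the failure without hunting through error logs."""
    consumer_application: str  # which consumer failed to process it
    error: str                 # exception type and message
    publisher: str             # who published the original message
    processed_count: int       # how many processing attempts were made
    payload: bytes             # the original message body, untouched
    failed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```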
Dead lettered messages are also written to SQL Server. A message is dead lettered when it expires or its queue is full. Messages are dead lettered from the head of the queue only, so when a message is sent to a full queue, the message at the front of the queue is dead lettered (not the new message).
Application support engineers triage the incoming failed and dead lettered messages and can either send messages back to the consumer or archive them. The FailedEvent consumers fingerprint the FailedEvents to allow support engineers to perform bulk operations. For example, if a consumer gets updated but has a bug and all messages fail with the same NullReferenceException, then each FailedEvent gets the same fingerprint. The support or development team can fix the bug and deploy, then a support engineer uses the triage-and-response web UI to select the fingerprint and return all messages with that fingerprint to the consumers.
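One way to compute such a fingerprint, sketched in Python (the field choice is an assumption on my part): hash the stable parts of the failure, not the full stack trace, so that every message hitting the same bug collapses to one fingerprint.

```python
import hashlib

def fingerprint(consumer: str, message_type: str, error_type: str) -> str:
    """Derive a stable fingerprint for bulk triage. Hashing only the
    consumer, message type and exception type (not the stack trace or
    payload, which vary per message) means every message failing with
    the same bug in the same consumer gets the same fingerprint."""
    key = f"{consumer}|{message_type}|{error_type}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()[:12]
```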
Rules Based Automation
Depending on the size of your system you may want to offload as much triage as you can to automation. The support engineers themselves can create the rules to say when to discard, archive or return a message to the original consumer.
That is where a DSL comes in. I have not implemented this yet, but it would look something like the examples below. The rules would be executed before persisting messages to SQL Server, and the output of a rule would be an action to take instead of persisting the message.
MATCH FailedEvent WHERE MsgType = 'order' AND ReturnCount < 3 THEN RETURN ELSE PERSIST
MATCH FailedEvent WHERE Application = 'MyCompany.LogAnalyser' THEN DISCARD
MATCH DeadLetterMessage WHERE MsgType = 'flight-boarding-notification' AND Cause = TTL THEN DISCARD ELSE PERSIST
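A first-match rules engine behind such a DSL can be tiny. A Python sketch of the evaluation semantics (how the actual DSL would be parsed is out of scope; the rules here are hand-translated equivalents of the examples above):

```python
def evaluate(rules, event):
    """Apply the first matching rule to a failed or dead-lettered event,
    represented as a dict of fields. Each rule is a (predicate, action)
    pair; if nothing matches, the default is to persist the message for
    manual triage."""
    for predicate, action in rules:
        if predicate(event):
            return action
    return "PERSIST"

# Hand-translated equivalents of the DSL example rules above:
rules = [
    (lambda e: e.get("MsgType") == "order" and e.get("ReturnCount", 0) < 3,
     "RETURN"),
    (lambda e: e.get("Application") == "MyCompany.LogAnalyser",
     "DISCARD"),
    (lambda e: e.get("MsgType") == "flight-boarding-notification"
               and e.get("Cause") == "TTL",
     "DISCARD"),
]
```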
With our current usage of SQL Server, it would make sense to put this in the FailedEvent and DeadLetter consumers, but other data stores offer other options. Elasticsearch has percolator queries; each DSL-based rule could be translated to a percolator query, which would allow us to automate a response.
You need to be able to track a message throughout its entire lifetime. When a support engineer or rules-based automation makes a decision, it needs to know whether it has seen this message before. There's no sense in returning a message to the consumer again and again if it always ends up back at the triage center.
Message Lifecycle Id
For this purpose, in our system we add a MessageLifecycleId header to all messages. Every time a message is sent, even for a retry, it gets a new MessageId, but we always carry over the MessageLifecycleId header. No matter how many times a message bounces around the system, it keeps its MessageLifecycleId.
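The header-propagation logic is simple enough to sketch. A Python illustration (the helper name `prepare_headers` is my own):

```python
import uuid

def prepare_headers(incoming=None):
    """Build headers for an outgoing message. The MessageId is fresh on
    every send; the MessageLifecycleId is carried over from the incoming
    message's headers, or minted on first publish, so the message stays
    traceable across retries and failed-event hops."""
    incoming = incoming or {}
    return {
        "MessageId": str(uuid.uuid4()),  # new on every send
        "MessageLifecycleId": incoming.get(
            "MessageLifecycleId", str(uuid.uuid4())),
    }
```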
Routing Slips
A routing slip is a list that travels with each message, recording the places it has been. Each entry can include the application, server, time, action and so on. It can be very useful when triaging a specific message, as the message carries its complete history with it. From a support engineer's perspective, it doesn't get much better than that. But it does add bytes to each message, so there is a cost. We are looking at this as a feature for our messaging system, but there are alternatives.
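Appending a hop to a routing slip might look like this in Python (the header name `RoutingSlip` and entry fields are illustrative, matching the entry fields mentioned above):

```python
from datetime import datetime, timezone

def add_routing_slip_entry(headers, application, server, action):
    """Return new headers with a hop record appended to the RoutingSlip
    header. Each entry records where the message was and what happened
    to it there, so the slip becomes the message's complete history."""
    slip = list(headers.get("RoutingSlip", []))
    slip.append({
        "Application": application,
        "Server": server,
        "Time": datetime.now(timezone.utc).isoformat(),
        "Action": action,  # e.g. "Published", "Retried", "Failed"
    })
    return {**headers, "RoutingSlip": slip}
```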
Polymorphic Routing and Wire Taps
We have implemented polymorphic routing to give us some interesting wire tap possibilities. Every message that gets published is also routed to exchanges representing the message type's base classes and interfaces; I got the idea from NServiceBus. What is great is that we have created a RabbitMQ Beat that we can attach to any exchange we wish, in real time, and start collecting the messages in Elasticsearch. Currently the support engineers use this capability rather surgically, to see what messages specific publishers are publishing and/or which messages specific consumers are receiving. But in the future we might just hook up the Beat to the "System.Object" exchange and persist everything to Elasticsearch.
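The idea behind polymorphic routing, sketched in Python rather than .NET: derive one exchange name per type in the message's class hierarchy, so a wire tap bound to a base-class exchange sees every derived message. Here Python's `object` root plays the role that `System.Object` plays in .NET:

```python
def routing_exchanges(message_type):
    """Exchange names a message of this type should be published to
    under polymorphic routing: one per class in the method resolution
    order, most-derived first, ending at the root type. Binding a wire
    tap to a base-class exchange captures every derived message."""
    return [cls.__name__ for cls in message_type.__mro__]
```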
Once you have everything in Elasticsearch, the MessageLifecycleId lets you reconstruct routing-slip-like information.
One Last Note of Caution
If you run at reasonable scale, be careful with your data store. Your scale will determine your choice of data store and, just as importantly, how you use it. The messaging system can likely be scaled up more easily than a relational data store, for example.
When a third party service goes offline, a disk fills up, or some other bad event happens, you will get flooded with messages. Think about the following:
Your data retention policy. In relational databases, deleting data can be costly. There's a transaction log to keep up to date. If you use a relational store then use table partitioning, so that removing old messages is as simple and quick as dropping a partition. Other stores like Elasticsearch have indexes that can be dropped in a similar manner.
The data is essentially time series. Consider a data store that works well with time series data.
There are many more details I could cover, but this gives a blueprint for a messaging system that lowers the total cost of ownership, as all applications follow a standard protocol for how messages are dealt with. This message lifecycle is designed to make life easier for your support engineers and ultimately to make the business more successful, which is always the yardstick at the end of the day.