September 6, 2019

A Look at Multi-Topic Subscriptions with Apache Pulsar

This is a sister post to one I am writing about multi-topic subscriptions with Apache Kafka, which you will soon be able to read on the CloudKarafka blog (link coming soon). I will provide a summary of those results before we get started with Apache Pulsar. I run the same tests against both technologies.

The objective is to understand what to expect from multi-topic subscriptions; specifically, we are testing message ordering. Message ordering is a fundamental component of messaging systems, and even though cross-topic ordering is not guaranteed by Pulsar or Kafka, I find it interesting and useful to know what to expect.

Jack Vanlightly

February 10, 2019

Distributed Systems

Building A "Simple" Distributed System - It's the Logs Stupid

In previous posts we covered designing the protocol and verifying it with TLA+. Then we designed the implementation with Apache ZooKeeper. In this post we’ll look at a very important prerequisite for testing and release to production - good logging. The links to the rest of the series are at the bottom of this post.

It’s the Logs Stupid

Without good logging, you’re in for a world of pain and wasted hours trying to figure out why something failed. Forget the debugger, put it to one side and embrace logging as part of the development and test process. The logs will be the way in which you can identify what was going on in the environment and in each node at the time of failure. Your code will fail, over and over again, in new and surprising ways until finally towards the end of the development process you start to see it cope with everything you throw at it. We’ll be throwing a lot of nasty behaviour at the code and it will need to handle it.

Jack Vanlightly

February 2, 2019

Distributed Systems

Building A "Simple" Distributed System - The Implementation

In previous posts we’ve identified the requirements for our distributed resource allocation library, including one invariant (No Double Resource Access) that must hold 100% of the time no matter what and another (All Resources Evenly Accessed) that must hold after a successful rebalancing of resources. We documented a protocol that describes how nodes interact with a central registry to achieve the requirements, including how they deal with all conceived failure scenarios. Then we built a TLA+ specification and used the model checker to verify the designed protocol, identifying a defect in the process.

In this post we’ll tackle the implementation and in the next we’ll look at testing.

Jack Vanlightly

January 28, 2019

Distributed Systems

Building A "Simple" Distributed System - Formal Verification

In the last post, we described a protocol that should satisfy the requirements and invariants established in the first post. Today we will look at formal verification with TLA+.

Formal verification is just another (niche) tool in the toolbox. Some tools require more skill than others to use. Some tools are more expensive than others. It is up to the practitioner to decide if/when/how to use them.

The hard part is that you won't necessarily know if it is beneficial to a given problem you face, if you aren't already skilled in it. If a tool is very difficult to learn, then you might never invest in it enough to be able to make that call. Or you might invest a lot of time into it, to find it isn't a great match for your problem. At which point it gets stowed in your toolbox where it may or may not get used again. I expect many software engineers see learning formal methods as a difficult (it is) and high risk venture.

So, given the above, my aim with this post is for software engineers without prior experience of TLA+ to be able to get the gist of the spec and see why it was useful for this project. Please let me know whether I succeeded.

Jack Vanlightly

January 27, 2019

Distributed Systems

Building A "Simple" Distributed System - The Protocol

In the last post we covered what our distributed resource allocation library, Rebalanser, should do. In this post we’ll look at a protocol that could achieve those requirements, always respecting our invariants (described in the last post).

A protocol is basically a set of rules which govern how each node in a Rebalanser group acts in order to achieve the desired behaviours. Each node must communicate with the others in such a way that it can achieve consensus about the resource allocations and also guarantee that it does not start accessing a resource until another node has stopped accessing it.

Jack Vanlightly

January 26, 2019

Distributed Systems

Building a "Simple" Distributed System - The What

This is a blog series where I share my approach and experience of building a distributed resource allocation library. As far as distributed systems go, it is a simple one and ideal as a tool for learning about distributed systems design, programming and testing.

The field of distributed systems is large, encompassing a myriad of academic work, algorithms, consistency models, data types, testing tools/techniques, formal verification tools and more. I will be covering just the theory, tools and techniques that were relevant for my little project.

Jack Vanlightly

November 20, 2018

Messaging Systems

Quorum Queues - Making RabbitMQ More Competitive in Reliable Messaging

The multiple design defects of RabbitMQ Mirrored Queues have been well documented by the community and acknowledged by the RabbitMQ team. In an age where new messaging systems are appearing that compete in the reliable messaging space, it is critical for RabbitMQ to improve its replicated queue story in order to continue to compete in that space. That is why it is so exciting to see that the RabbitMQ team have been working hard to deliver a new replicated queue type based on the Raft consensus algorithm. Quorum queues are still in beta and as such are subject to change before release. Likewise, their capabilities will no doubt evolve and improve over future releases. There are currently limitations to the features of Quorum Queues, but if data safety is your most important requirement then they aim to satisfy your needs.

In this post we're going to look at the design of Quorum Queues and then in a later post we'll run a series of chaos tests to test the durability claims of this new queue type.

Jack Vanlightly

November 14, 2018

Messaging Systems

Why I Am Not a Fan of the RabbitMQ Sharding Plugin

I recently spoke at the RabbitMQ Summit in London about using the Consistent Hash Exchange to maintain processing order guarantees while scaling out consumers. Afterwards I was asked why I don’t opt for the Sharding Plugin instead. One of the downsides of the Consistent Hash Exchange I spoke of in the talk was that you don’t get automatic queue assignment for your consumers. The Sharding Plugin makes an attempt to address this problem but doesn’t go all the way. In this post I’ll describe my issues with the Sharding Plugin.

Jack Vanlightly

November 2, 2018

Messaging Systems

Testing Producer Deduplication in Apache Kafka and Apache Pulsar

Failures can induce message duplication on both the producer and consumer side. In this post we’ll focus solely on producer-side duplication, looking at how the deduplication feature works in Apache Pulsar and Apache Kafka. I have run many hours of deduplication tests of both messaging systems and we’ll see the results of those tests.

On the producer side, when a producer sends a message and an error occurs, such as a TCP connection failure, the producer has no way to know if the message was persisted or not. We have two choices: send the message again to ensure it gets delivered and risk duplication, or not send it again and risk the message never getting delivered.
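The dilemma above can be sketched with a toy model. This is only an illustration of the general idea of deduplicating retries by sequence number, not the actual implementation in Pulsar or Kafka; `ToyBroker`, `send_with_retry` and the `lose_first_ack` flag are hypothetical names invented for this sketch.

```python
class ToyBroker:
    """Hypothetical in-memory broker that drops already-seen sequence numbers."""

    def __init__(self):
        self.log = []        # persisted payloads, in arrival order
        self.last_seq = {}   # producer name -> highest sequence number persisted

    def publish(self, producer, seq, payload):
        # A retry carries the same sequence number as the original attempt,
        # so anything at or below the last persisted sequence is a duplicate.
        if seq <= self.last_seq.get(producer, -1):
            return "duplicate"
        self.log.append(payload)
        self.last_seq[producer] = seq
        return "persisted"


def send_with_retry(broker, producer, seq, payload, lose_first_ack=False):
    """Send once; if the ack is lost, resend the same (producer, seq) pair."""
    result = broker.publish(producer, seq, payload)
    if lose_first_ack:
        # The message was persisted but the producer never saw the ack,
        # so from its point of view the send failed and it retries.
        result = broker.publish(producer, seq, payload)
    return result


broker = ToyBroker()
send_with_retry(broker, "p1", 0, "a")
send_with_retry(broker, "p1", 1, "b", lose_first_ack=True)
send_with_retry(broker, "p1", 2, "c")
print(broker.log)
```

Without the sequence check in `publish`, the retried message would be appended twice; with it, the producer can always retry and the log still holds each message once.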

Jack Vanlightly

October 21, 2018

Messaging Systems

How to (not) Lose Messages on an Apache Pulsar Cluster

In this post we’ll put the protocols we covered in the Understanding How Apache Pulsar Works post to the test. As in previous tests of How to Lose Messages on a RabbitMQ Cluster and How to Lose Messages on an Apache Kafka Cluster, I’ll be using Blockade to kill off nodes, slow down the network and lose packets. Unlike in those previous tests, these tests are automated and go further, not only testing for data loss but also correct ordering and duplication.

In each scenario we’ll stand up a new Blockade cluster with a specific configuration of:

Apache Pulsar broker count
Apache BookKeeper node (Bookie) count
Ensemble size (E)
Write quorum size (Qw)
Ack quorum size (Qa)
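As a rough sketch, a scenario configuration like the one listed above could be captured and sanity-checked as follows. The `ClusterConfig` class and its field names are hypothetical and are not taken from the test code; the checks encode the usual BookKeeper ordering constraint that the ensemble is at least as large as the write quorum, which in turn is at least as large as the ack quorum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical description of one test scenario's cluster shape."""
    brokers: int
    bookies: int
    ensemble: int      # E
    write_quorum: int  # Qw
    ack_quorum: int    # Qa

    def problems(self):
        """Return a list of human-readable configuration problems (empty if none)."""
        issues = []
        if self.brokers < 1:
            issues.append("at least one broker is required")
        if self.ensemble > self.bookies:
            issues.append("ensemble size (E) cannot exceed the bookie count")
        if self.write_quorum > self.ensemble:
            issues.append("write quorum (Qw) cannot exceed ensemble size (E)")
        if self.ack_quorum > self.write_quorum:
            issues.append("ack quorum (Qa) cannot exceed write quorum (Qw)")
        if self.ack_quorum < 1:
            issues.append("ack quorum (Qa) must be at least 1")
        return issues


valid = ClusterConfig(brokers=3, bookies=3, ensemble=2, write_quorum=2, ack_quorum=1)
invalid = ClusterConfig(brokers=1, bookies=3, ensemble=3, write_quorum=2, ack_quorum=3)
print(valid.problems())
print(invalid.problems())
```

Checking the shape up front means a failed scenario can be attributed to the injected fault rather than to a configuration that could never have worked.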