Building A "Simple" Distributed System - The Protocol

Building A "Simple" Distributed System - The Protocol

In the last post we covered what our distributed resource allocation library, Rebalanser, should do. In this post we’ll look at a protocol that could achieve those requirements, always respecting our invariants (described in the last post).

A protocol is basically a set of rules which govern how each node in a Rebalanser group acts in order to achieve the desired behaviours. Each node must communicate with the others in such a way that it can achieve consensus about the resource allocations and also guarantee that it does not start accessing a resource until another node has stopped accessing it.

Building a "Simple" Distributed System - The What

Building a "Simple" Distributed System - The What

This is a blog series where I share my approach and experience of building a distributed resource allocation library. As far as distributed systems go, it is a simple one and ideal as a tool for learning about distributed systems design, programming and testing.

The field of distributed systems is large, encompassing a myriad of academic work, algorithms, consistency models, data types, testing tools/techniques, formal verification tools and more. I will be covering just the theory, tools and techniques that were relevant for my little project.

Quorum Queues - Making RabbitMQ More Competitive in Reliable Messaging

Quorum Queues - Making RabbitMQ More Competitive in Reliable Messaging

The multiple design defects of RabbitMQ Mirrored Queues have been well documented by the community and acknowledged by the RabbitMQ team. In an age where new messaging systems are appearing that compete in the reliable messaging space, it is critical for RabbitMQ to improve its replicated queue story in order to continue to compete in that space. Which is why it is so exciting to see that the RabbitMQ team have been working hard to deliver a new replicated queue type based on the Raft consensus algorithm. Quorum queues are still in beta and as such are subject to change before release. Likewise, their capabilities will no doubt evolve and improve over future releases. There are currently limitations to the features of Quorum Queues but if data safety is your most important requirement then they aim to satisfy your needs.

In this post we'll going to look at the design of Quorum Queues and then in a later post we'll run a series of chaos tests to test the durability claims of this new queue type.

Why I Am Not a Fan of the RabbitMQ Sharding Plugin

Why I Am Not a Fan of the RabbitMQ Sharding Plugin

I recently spoke at the RabbitMQ Summit in London about using the Consistent Hash Exchange to maintain processing order guarantees while scaling out consumers. Afterwards I was asked why I don’t opt for the Sharding Plugin instead. One of the downsides of the Consistent Hash Exchange I spoke of in the talk was that you don’t get automatic queue assignment for your consumers. The Sharding Plugin makes an attempt to address this problem but doesn’t go all the way. In this post I’ll describe my issues with the Sharding Plugin.

Testing Producer Deduplication in Apache Kafka and Apache Pulsar

Testing Producer Deduplication in Apache Kafka and Apache Pulsar

Failures can induce message duplication on both the producer and consumer side. In this post we’ll focus solely on producer side duplication, looking at how the deduplication feature works in Apache Pulsar and Apache Kafka. I have run many hours of deduplication tests of both messaging systems and we´ll see the results of those tests.

On the producer side, when a producer sends a message and an error occurs, such as a TCP connection failure, the producer has no way to know if the message was persisted or not. We have two choices, send the message again to ensure it gets delivered and risk duplication, or not send it again and risk the message never getting delivered.

How to (not) Lose Messages on an Apache Pulsar Cluster

How to (not) Lose Messages on an Apache Pulsar Cluster

In this post we’ll put the protocols we covered in the Understanding How Apache Pulsar Works post to the test. As in previous tests of How to Lose Messages on a RabbitMQ Cluster and How to Lose Messages on a Apache Kafka Cluster, I’ll be using Blockade to kill off nodes, slow down the network and lose packets. Unlike in those previous tests, these tests are automated and go further, not only testing for data loss but also correct ordering and duplication.

In each scenario we’ll stand-up a new blockade cluster with a specific configuration of:

  • Apache Pulsar broker count

  • Apache BookKeeper node (Bookie) count

  • Ensemble size (E)

  • Write quorum size (Qw)

  • Ack quorum size (Qa)

Understanding How Apache Pulsar Works

Understanding How Apache Pulsar Works

I will be writing a series of blog posts about Apache Pulsar, including some Kafka vs Pulsar posts. First up though I will be running some chaos tests on a Pulsar cluster like I have done with RabbitMQ and Kafka to see what failure modes it has and its message loss scenarios.

I will try to do this by either exploiting design defects, implementation bugs or poor configuration on the part of the admin or developer.

In this post we’ll go through the Apache Pulsar design so that we can better design the failure scenarios. This post is not for people who want to understand how to use Apache Pulsar but who want to understand how it works. I have struggled to write a clear overview of its architecture in a way that is simple and easy to understand. I appreciate any feedback on this write-up.

How to Lose Messages on a Kafka Cluster - Part 1

In my previous post I used Blockade, Python and some Bash scripts to test a RabbitMQ cluster under various failure conditions such as failed nodes, network partitions, packet loss and a slow network. The aim was to find out how and when a RabbitMQ cluster loses messages. In this post we’ll do exactly the same but with a Kafka cluster. We’ll use our knowledge of the inside workings of Kafka and Zookeeper to produce various failure modes that produce message loss. Please read my post on Kafka fault tolerance as this post assumes you understand the basics of the acknowledgements and replication protocol.

How to Lose Messages on a RabbitMQ Cluster

In my RabbitMQ vs Kafka series Part 5 post I covered the theory of RabbitMQ clustering and some of the gotchas. In this post we'll demonstrate the message loss scenarios described in that post using Docker and Blockade. I recommend you read that post first as this post assumes understanding of the topics covered.

Blockade is a really easy way to test out how distributed systems cope with network partitions, flaky networks and slow networks. It was inspired by the Jepson series. In this post we'll either be killing off nodes, partitioning the cluster, introducing packet loss or slowing down the network. So with Blockade, some bash and python scripts we’ll test out some failure scenarios.