Distributed Systems

A Cost Analysis of Replication vs S3 Express One Zone in Transactional Data Systems

Is it economical to build fault-tolerant transactional data systems directly on S3 Express One Zone, instead of using replication? Read on for an analysis.

Cloud object storage is becoming the universal storage layer for a wealth of cloud data systems. Some systems use object stores as their only storage layer; these tend to be analytics systems that can tolerate multi-second latencies. Transactional systems want single-digit millisecond latencies, or latencies in the low tens of milliseconds, and therefore don’t write to object stores directly. Instead, they land data on a fast replicated write-ahead log (WAL) and offload data to an object store for read-optimized, long-term, economical storage. Neon is a good example of this architecture: writes hit a low-latency replicated WAL based on Multi-Paxos and data is eventually written to object storage.
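
To make the shape of that write path concrete, here is a minimal Python sketch of the general pattern. It is an illustration only, not Neon’s implementation; ReplicatedWal and ObjectStore (and their methods) are hypothetical stand-ins.

```python
import queue
import threading

class TransactionalWritePath:
    """Illustrative sketch: acknowledge writes off a low-latency replicated
    WAL, then offload sealed segments to object storage in the background.
    The wal and object_store objects are hypothetical stand-ins."""

    def __init__(self, wal, object_store):
        self.wal = wal                    # e.g. a Multi-Paxos or Raft replicated log
        self.object_store = object_store  # e.g. any S3-compatible client
        self.sealed = queue.Queue()
        threading.Thread(target=self._offload_loop, daemon=True).start()

    def write(self, record: bytes) -> None:
        # Acknowledgement latency is bounded by WAL replication,
        # not by an object storage round trip.
        self.wal.append(record)
        if self.wal.segment_full():
            self.sealed.put(self.wal.seal_segment())

    def _offload_loop(self) -> None:
        while True:
            segment = self.sealed.get()
            # Read-optimized, long-term, economical storage.
            self.object_store.put(segment.key, segment.to_bytes())
            self.wal.truncate_up_to(segment.last_offset)
```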

Learning and reviewing system internals: tactics and psychology

Every now and then I get asked for advice on how to learn about distributed system internals and protocols. Over the course of my career I've picked up a learning and reviewing style that works pretty well for me.

To define these terms, learning and reviewing are similar but not the same:

  • Learning about how a system works is the easier of the two. By the means available to you (books, papers, blogs, code), you study the system to understand how it works and why it works that way.

  • Reviewing a system requires learning but also involves opinions, taking positions, making judgments. It is trickier to get right, more subjective, and often only time can show you if you were right or wrong about it and to what degree.

We all review systems to one degree or another, even if it's just a casual review whose results are some loosely held opinions shared by the coffee machine. But when it comes to sharing our opinions in more formal contexts, such as an architecture meeting, a blog post, a conference talk or a job interview, the stakes are higher and so are the risks. If you review a system and come to some conclusions, how do you know if you are right? What happens if you are wrong? Someone could point out your flawed arguments. You could make a bad decision. Not only can reviewing complex systems be hard, it can be scary too.

The importance of liveness properties (with TLA+ Part 2)

In part 1 we introduced the concepts of safety and liveness properties, and then a stupidly simple gossip protocol called Gossa. Our aim is to find liveness bugs in the design and improve it until all liveness issues are fixed.

Gossa had some problems. First, it had cycles due to nodes contesting whether a peer was dead or alive. We fixed that by making deadness take precedence over aliveness, but still the cluster could not converge. The next problem was that a falsely accused dead node was unable to refute its deadness as no one would pay attention to it - deadness ruled.

The proposed fix I mentioned in part 1 was to allow a falsely accused node to refute its deadness via the introduction of a monotonic counter.
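
As a rough illustration of that fix, here is a Python sketch of the merge rule. It is not the actual Gossa specification; the field names and functions are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class PeerState:
    """Per-peer state carried in gossip messages (illustrative fields)."""
    alive: bool
    version: int  # monotonic counter, only ever bumped by the peer itself

def merge(local: PeerState, remote: PeerState) -> PeerState:
    # Higher version always wins: a falsely accused node bumps its own
    # counter and gossips (alive=True, version+1), which overrides any
    # older deadness claim.
    if remote.version > local.version:
        return remote
    if remote.version < local.version:
        return local
    # Same version: deadness takes precedence over aliveness, which was
    # the part-1 fix that broke the alive/dead contest cycles.
    return remote if not remote.alive else local

def refute(own: PeerState) -> PeerState:
    # A node that learns it has been declared dead refutes the claim
    # by incrementing its monotonic counter.
    return PeerState(alive=True, version=own.version + 1)
```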

The importance of liveness properties (with TLA+ Part 1)

Invariants get most of the attention because they are easy to write, easy to check and find those histories which lead to really bad outcomes, such as lost data. But liveness properties are really important too and, after years of writing TLA+ specifications, I couldn’t imagine having confidence in a specification without one. This post and the next are a random walk through the world of model checking liveness properties in TLA+.

The outline is like this:

  • Part 1: I (hopefully) convince you that liveness properties are important. Then I implement a gossip algorithm in TLA+ and use liveness properties to find problems.

  • Part 2: I continue evolving the algorithm, finding more and more liveness problems, overcoming some challenges such as an infinite state space, and imparting some helpful principles - hopefully making you a better engineer and thinker by the end.
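
Before diving in, it is worth recalling the standard temporal-logic shape of the two kinds of property. The property names below are illustrative placeholders, not formulas taken from the spec in the posts.

```latex
% Safety / invariant: nothing bad ever happens (violations show up on finite prefixes).
\text{Safety} \;\triangleq\; \Box\, \mathit{NoDataLoss}

% Liveness: something good eventually happens (and, here, keeps holding);
% checking it requires fairness assumptions.
\text{Liveness} \;\triangleq\; \Diamond\Box\, \mathit{Converged}
```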

Kafka KIP-966 - Fixing the Last Replica Standing issue

The Kafka replication protocol just got a new KIP that improves its durability when running without fsync. As I previously blogged in Why Kafka Doesn’t Need Fsync to be Safe, there are distributed system designs that allow for asynchronous storage engines. Being asynchronous means that the system can reap performance benefits that are not available to a synchronous storage engine.

Kafka vs Redpanda Performance - Do the claims add up?

Apache Kafka has been the most popular open source event streaming system for many years and it continues to grow in popularity. Within the wider ecosystem there are other open source and source available competitors to Kafka such as Apache Pulsar, NATS Streaming, Redis Streams, RabbitMQ and more recently Redpanda (among others).

Redpanda is a source-available Kafka clone written in C++ using the Seastar framework from ScyllaDB, a wide-column database. It uses the popular Raft consensus protocol for replication and all distributed consensus logic. Redpanda has been going to great lengths to explain that its performance is superior to Apache Kafka’s due to its thread-per-core architecture, its use of C++, and a storage design that can push high-performance NVMe drives to their limits.

Redpanda lists a bold set of claims, and those claims seem plausible. Being built in C++ for modern hardware with a thread-per-core architecture sounds compelling, and it seems logical that the claims must be true. But are they?

Is sequential IO dead in the era of the NVMe drive?

Two systems I know pretty well, Apache BookKeeper and Apache Kafka, were designed in the era of the spinning disk: the hard drive, or HDD. Hard drives are good at sequential IO but not so good at random IO because of their relatively high seek times. No wonder then that both Kafka and BookKeeper were designed with sequential IO in mind.

Both Kafka and BookKeeper are distributed log systems and so you’d think that sequential IO would be the default for an append-only log storage system. But sequential and random IO sit on a continuum, with pure sequential IO on one side and pure random IO on the other. If you have 5000 files that you are appending to with small writes in a round-robin manner, and performing fsyncs, then this is not such a sequential IO access pattern; it sits further towards the random IO side. So being an append-only log doesn’t mean you get sequential IO out of the gate.
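
To picture that access pattern, here is a small, purely illustrative Python sketch of round-robin appends with per-write fsyncs across many files. The paths, record size and counts are arbitrary, and running it may require a raised open-file limit.

```python
import os

# Purely illustrative: small appends, round-robin across many files, with an
# fsync after every write. Each file is append-only, yet the device sees
# writes scattered across thousands of positions - far from pure sequential IO.
NUM_FILES = 5000
RECORD = b"x" * 128

fds = [os.open(f"/tmp/seg-{i}.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
       for i in range(NUM_FILES)]

for _ in range(10):           # a few rounds of writes
    for fd in fds:            # round-robin over the 5000 files
        os.write(fd, RECORD)  # small append
        os.fsync(fd)          # forced flush after every small write

for fd in fds:
    os.close(fd)
```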

Why Apache Kafka doesn't need fsync to be safe

TLDR: Apache Kafka doesn’t need fsyncs to be safe because it includes recovery in its replication protocol. It is a real-world distributed system that uses asynchronous log writing plus recovery, with some additional safety built in. Asynchronous log writing allows it to provide robust performance on a variety of hardware and with a wide variety of workloads.

Now that the TLDR is done, let’s dive into it.

The fact that by default Apache Kafka doesn’t flush writes to disk is sometimes used as ammunition against it. The argument is that if Kafka doesn’t flush data before acknowledging produce requests then surely the cluster can lose acknowledged data due to crashes and reboots. It sounds plausible and so people may believe it - but I’m here writing this today to explain why that isn’t the case.
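
As a grounding example (not taken from the post itself), here is a sketch of where acknowledged durability actually comes from, using the confluent-kafka Python client. The setting names are real Kafka configs; the values, topic and payload are illustrative.

```python
from confluent_kafka import Producer

# Broker/topic side (shown for context, not set here):
#   replication.factor=3
#   min.insync.replicas=2
#   unclean.leader.election.enable=false
#   log.flush.interval.messages / log.flush.interval.ms left at their defaults,
#   i.e. Kafka does not fsync on every produce request.

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",  # acknowledged once the in-sync replicas have the data
})

producer.produce("payments", value=b"order-42")
producer.flush()  # waits for broker acknowledgements, not for a disk fsync
```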

Applying Flexible Paxos to Raft

Flexible Paxos gives us the insight that Paxos (and Raft) only require that election quorums and replication quorums intersect. But standard Raft and Paxos are configured so that every quorum intersects with every other. So what does that mean exactly?

Let’s take the election quorum in Raft. An election quorum is a subset of the set of servers that have voted for the same server in the same election term, and that quorum is formed of a majority. For a 3 node cluster we need 2 votes, for a 5 node cluster we need 3 votes, and so on.

The next question is: what are all the possible quorums, and are there any two quorums that do not intersect? For a 3 node cluster, the possible majority quorums are {n1, n2}, {n2, n3} and {n1, n3}, and there are no two quorums that do not intersect. This is the property we get from majority quorums.
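
A quick way to see this is to enumerate the majority quorums and check every pair for overlap, as in this small Python sketch.

```python
from itertools import combinations

def majority_quorums(nodes):
    """All minimal majority quorums over the given node set."""
    majority = len(nodes) // 2 + 1
    return [frozenset(q) for q in combinations(sorted(nodes), majority)]

quorums = majority_quorums({"n1", "n2", "n3"})
for q in quorums:
    print(sorted(q))   # ['n1', 'n2'], ['n1', 'n3'], ['n2', 'n3']

# Every pair of majority quorums overlaps in at least one node, which is why
# any new leader's election quorum must include a node from any previously
# used replication quorum.
assert all(q1 & q2 for q1, q2 in combinations(quorums, 2))
```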

Tweaking the BookKeeper protocol - Unbounded Ledgers

In the last post I described the necessary protocol changes to ensure that all entries in closed ledgers reach Write Quorum (WQ), and that all entries in all but the last fragment of open ledgers reach write quorum.

In this post we’re going to look at another tweak to the protocol to allow ledgers to be unbounded and allow writes from multiple clients over their lifetime.