Jack Vanlightly

Distributed Systems

Log Replication Disaggregation Survey - Apache Pulsar and BookKeeper

In this latest post of the disaggregated log replication survey, we’re going to look at the Apache BookKeeper Replication Protocol and how it is used by Apache Pulsar to form topic partitions.

Raft blends the roles and responsibilities into one monolithic protocol, MultiPaxos separates the monolithic protocol into separate roles, and Apache Kafka separates the protocol and roles into control-plane/data-plane. How do Pulsar and BookKeeper divide and conquer the duties of log replication? Let’s find out.

Share

Log Replication Disaggregation Survey - Neon and MultiPaxos

Over the next series of posts, we'll explore how various real-world systems and some academic papers have implemented log replication with some form of disaggregation. In this first post we’ll look at MultiPaxos. There are no doubt many real-world implementations of MultiPaxos out there, but I want to focus on Neon’s architecture as it is illustrative of the benefits of thinking in terms of logical abstractions and responsibilities when designing complex systems.

Share

How to disaggregate a log replication protocol

This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or just when the log itself is the product (such as Kafka). So far I’ve been looking at Virtual Consensus, but now I’m going to widen the view to look at how log replication protocols can be disaggregated in general (there are many ways). In the next post, I’ll do a survey of log replication systems in terms of the types of disaggregation described in this post.

Share

Steady on! Separating Failure-Free Ordering from Fault-Tolerant Consensus

"True stability results when presumed order and presumed disorder are balanced. A truly stable system expects the unexpected, is prepared to be disrupted, waits to be transformed."Tom Robbins

This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or just when the log itself is the product (such as Kafka). I’m going to cover some of the same ground from the Introduction to Virtual Consensus in Delos post, but focus on one aspect specifically and see how it generalizes.

Share

An Introduction to Virtual Consensus in Delos

This is the first of a number of posts looking at log replication protocols, mainly in the context of state machine replication (SMR). This first post will look at a log replication protocol design called Virtual Consensus from the paper: Virtual Consensus in Delos.

In 2020, a team of researchers and engineers from Facebook, led by Mahesh Balakrishnan, published their work (linked above) on a log replication design called Virtual Consensus that they had built as the log replication layer of their database, Delos.

As an Apache BookKeeper committer (non-active), I immediately saw the similarities to BookKeeper. Yet, the Virtual Consensus paper went further than BookKeeper, describing clean abstractions with clear separations of concerns. Just as the Raft paper has helped a lot of engineers implement SMR over the last 10 years, I believe the Virtual Consensus paper could do the same for the next 10. There are a few reasons to believe this that I will explain in this post.

Share

The Law of Large Numbers: A Foundation for Statistical Modeling in Distributed Systems

The Law of Large Numbers: A Foundation for Statistical Modeling in Distributed Systems

In my recent blog post, Obtaining Statistical Properties Through Modeling and Simulation, I described how we can use modeling and simulation to better understand both proposed and real systems. Not only that, but it can be extremely useful when assessing the effectiveness of optimizations.

However, in that post I missed a couple of additional interesting points that I think are worth covering.

Share

A Cost Analysis of Replication vs S3 Express One Zone in Transactional Data Systems

A Cost Analysis of Replication vs S3 Express One Zone in Transactional Data Systems

Is it economical to build fault-tolerant transactional data systems directly on S3 Express One Zone, instead of using replication? Read on for an analysis.

Cloud object storage is becoming the universal storage layer for a wealth of cloud data systems. Some systems use object stores as the only storage layer, accepting the higher latency of object storage, and these tend to be analytics systems that can accept multi-second latencies. Transactional systems want single-digit millisecond latencies or latencies in the low tens of milliseconds and therefore don’t write to object stores directly. Instead, they land data on a fast replicated write-ahead-log (WAL) and offload data to an object store for read-optimized long-term, economical storage. Neon is a good example of this architecture. Writes hit a low-latency replicated write-ahead-log based on Multi-Paxos and data is eventually written to object storage.

Share

Learning and reviewing system internals: tactics and psychology

Learning and reviewing system internals: tactics and psychology

Every now and then I get asked for advice on how to learn about distributed system internals and protocols. Over the course of my career I've picked up a learning and reviewing style that works pretty well for me.

To define these terms, learning and reviewing are similar but not the same:

  • Learning about how a system works is the easier of the two. By the means available to you (books, papers, blogs, code), you study the system to understand how it works and why it works that way.

  • Reviewing a system requires learning but also involves opinions, taking positions, making judgments. It is trickier to get right, more subjective, and often only time can show you if you were right or wrong about it and to what degree.

We all review systems to one degree or another, even if it's just a casual review where the results are some loosely held opinions shared by the coffee machine. But when it comes to sharing our opinions in more formal contexts, an architecture meeting, a blog post, a conference talk or a job interview, the stakes are higher and the risks are also greater. If you review a system and come to some conclusions, how do you know if you are right? What happens if you are wrong? Someone could point out your flawed arguments. You make a bad decision. Not only can reviewing complex systems be hard, it can be scary too.

Share

The importance of liveness properties (with TLA+ Part 2)

The importance of liveness properties (with TLA+ Part 2)

In part 1 we introduced the concept of safety and liveness properties, then a stupidly simple gossip protocol called Gossa. Our aim is to find liveness bugs in the design and improve the design until all liveness issues are fixed.

Gossa had some problems. First it had cycles due to nodes contesting whether a peer was dead or alive. We fixed that by making deadness take precedence over aliveness but still the cluster could not converge. The next problem was that a falsely accused dead node was unable to refute its deadness as no-one would pay attention to it - deadness ruled.

The proposed fix I mentioned in part 1 was to allow a falsely accused node to refute its deadness via the introduction of a monotonic counter.