This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or just when the log itself is the product (such as Kafka). So far I’ve been looking at Virtual Consensus, but now I’m going to widen the view to look at how log replication protocols can be disaggregated in general (there are many ways). In the next post, I’ll do a survey of log replication systems in terms of the types of disaggregation described in this post.
Steady on! Separating Failure-Free Ordering from Fault-Tolerant Consensus
"True stability results when presumed order and presumed disorder are balanced. A truly stable system expects the unexpected, is prepared to be disrupted, waits to be transformed." — Tom Robbins
This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or just when the log itself is the product (such as Kafka). I’m going to cover some of the same ground from the Introduction to Virtual Consensus in Delos post, but focus on one aspect specifically and see how it generalizes.
An Introduction to Virtual Consensus in Delos
This is the first of a number of posts looking at log replication protocols, mainly in the context of state machine replication (SMR). This first post will look at a log replication protocol design called Virtual Consensus from the paper: Virtual Consensus in Delos.
In 2020, a team of researchers and engineers from Facebook, led by Mahesh Balakrishnan, published their work (linked above) on a log replication design called Virtual Consensus that they had built as the log replication layer of their database, Delos.
As an Apache BookKeeper committer (non-active), I immediately saw the similarities to BookKeeper. Yet, the Virtual Consensus paper went further than BookKeeper, describing clean abstractions with clear separations of concerns. Just as the Raft paper has helped a lot of engineers implement SMR over the last 10 years, I believe the Virtual Consensus paper could do the same for the next 10. I will explain a few reasons for this belief in this post.
Why Snowflake wants streaming
Rumors are swirling that Snowflake intends to acquire Redpanda and many are questioning why and what impact this might have on Confluent. First, let’s remember that these are just rumors and there’s nothing official. But given that people are speculating, here are my thoughts on how to interpret such an acquisition, whether it ends up happening or not.
There are a number of market trends in play right now, such as the rise of Iceberg and open data, as well as the war with Databricks and Snowflake’s refocus on AI. While it may not be evident at first, these are all driving Snowflake towards streaming.
AI Agents in 2025
Two interesting blog posts about AI agents have caught my attention over the last few weeks.
Anthropic wrote Building Effective Agents.
Chip Huyen wrote Agents.
Ethan Mollick has also written some excellent blog posts recently:
In this post, I’ll explore what some of the leading experts in this area are saying about AI agents and the challenges ahead.
On 1 million page views
I just read Phil Eaton’s post on reaching the 1 million page views milestone, which he was inspired to blog about due to Murat Demirbas doing the same thing back in 2017.
I just checked my all-time blog stats and it turns out I can write one of these too 😄
To be atomic or non-atomic, that is the question (Fizzbee)
After posting my last Kafka transactions diary entry, JP (the Fizzbee maintainer) wrote a refactored version using non-atomic actions and a different way of representing the network. It’s a very interesting variant and I’m tempted to switch over to his version.
When an action is not atomic, its execution can yield at any moment to a different action in a different role instance, or even in the same role instance. With this yielding we can also replace explicit message passing with direct invocation of a function on another node. The invocation and the response can both yield to other concurrent events happening across the system.
Let me use a simple example to demonstrate: Ping-Pong.
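The Ping-Pong spec itself lives in the full post. As a rough stand-in for the yielding behavior described above, here is a minimal Python sketch (not Fizzbee, and not JP's model). It assumes a hypothetical `increment` action that reads a shared counter, yields, and then writes. The names `increment`, `run`, and the `shared` dictionary are all illustrative. Enumerating every schedule shows what a model checker has to consider once an action is no longer atomic.

```python
from itertools import permutations


def increment(shared):
    """A non-atomic action: read, yield to the scheduler, then write."""
    local = shared["counter"]        # step 1: read
    yield                            # yield point: any other action may run here
    shared["counter"] = local + 1    # step 2: write back a possibly stale value


def run(schedule):
    """Run two increment actions, stepping them in the order given by schedule."""
    shared = {"counter": 0}
    procs = {0: increment(shared), 1: increment(shared)}
    for pid in schedule:
        next(procs[pid], None)       # advance that action by one step
    return shared["counter"]


# Each action has two steps, so a schedule is an ordering of [0, 0, 1, 1].
schedules = sorted(set(permutations([0, 0, 1, 1])))
outcomes = {run(s) for s in schedules}
print(len(schedules), outcomes)
```

If both steps ran as one atomic action, only the two serial schedules would exist and the counter would always end at 2. With the yield point, interleaved schedules also appear, and some of them lose an update.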
An introduction to symmetry in TLA+
Symmetry reduction in TLA+ is a clever trick for cutting down the size of the state space we need to explore during model checking. Think about a distributed system with interchangeable components—servers, nodes, or processes that behave identically. Without symmetry reduction, the model checker wastes time exploring states that are essentially duplicates, differing only in the labels we’ve assigned to these components. Symmetry reduction says, “Hey, if swapping the identities of these components doesn’t change the behavior of the system, let’s treat those states as one.” This massively reduces the computational effort while keeping the results valid.
In this post, I’ll show some simple examples of symmetry using trivial specs where we can actually visualize the state space. The idea is to build a mental model of how symmetry reduction works.
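To make the idea concrete outside of TLA+, here is a small Python sketch under simplifying assumptions: three interchangeable servers, each either `idle` or `busy`, with the global state being just the tuple of local states. It counts all label-distinct states rather than reachable ones, and it uses sorting as a stand-in for picking one representative per equivalence class. This is only an illustration of the counting argument, not how TLC implements symmetry sets. The function names are hypothetical.

```python
from itertools import product


def all_states(num_servers, local_states):
    """Every assignment of a local state to each labelled server."""
    return set(product(local_states, repeat=num_servers))


def canonical(state):
    """Pick one representative per class by forgetting which server is which."""
    return tuple(sorted(state))


def reduced_states(num_servers, local_states):
    """States that remain once permutations of server identities are merged."""
    return {canonical(s) for s in all_states(num_servers, local_states)}


full = all_states(3, ("idle", "busy"))
reduced = reduced_states(3, ("idle", "busy"))
print(len(full), len(reduced))  # 8 labelled states collapse to 4
```

The four survivors correspond to "how many servers are busy" (0, 1, 2, or 3), which is all that matters when the servers behave identically.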
Dismantling ELT: The Case for Graphs, Not Silos
ELT is a bridge between silos. A world without silos is a graph.
I’ve been banging my drum recently about the ills of Conway’s Law and the need for low-coupling data architectures. In my Curse of Conway and the Data Space blog post, I explored how Conway’s Law manifests in the disconnect between software development and data analytics teams. It is a structural issue stemming from siloed organizational designs, and it not only causes inefficiencies and poor collaboration but ultimately hinders business agility and effectiveness.
The Law of Large Numbers: A Foundation for Statistical Modeling in Distributed Systems
In my recent blog post, Obtaining Statistical Properties Through Modeling and Simulation, I described how we can use modeling and simulation to better understand both proposed and real systems. Not only that, but it can be extremely useful when assessing the effectiveness of optimizations.
However, in that post I missed a couple of additional interesting points that I think are worth covering.




