July 15, 2025

Responsibility Boundaries in the Coordinated Progress model

July 15, 2025

Building on my previous work on the Coordinated Progress model, this post examines how reliable triggers not only initiate work but also establish responsibility boundaries. Where a reliable trigger exists, a new boundary is created where that trigger becomes responsible for ensuring the eventual execution of the sub-graph of work downstream of it. The boundaries can even layer and nest, especially in orchestrated systems that overlay finer-grained boundaries.

Jack Vanlightly

June 11, 2025

Distributed Systems

Coordinated Progress – Part 4 – A Loose Decision Framework

Jack Vanlightly

June 11, 2025

Distributed Systems

Microservices, functions, stream processors and AI agents represent nodes in our graph. An incoming edge represents a trigger of work in the node, and the node must do the work reliably. I have been using the term reliable progress but I might have used durable execution if it hadn’t already been used to define a specific type of tool.

Jack Vanlightly

June 11, 2025

Distributed Systems

Coordinated Progress – Part 3 – Coupling, Synchrony and Complexity

Jack Vanlightly

June 11, 2025

Distributed Systems

In part 2, we built a mental framework using a graph of nodes and edges to represent distributed work. Workflows are subgraphs coordinated via choreography or orchestration. Reliability, in this model, means reliable progress: the result of reliable triggers and progressable work.

In part 3 we refine this graph model in terms of different types of coupling between nodes, and how edges can be synchronous or asynchronous. Let’s set the scene with an example, then dissect that example with the concepts of coupling and communication styles.

Jack Vanlightly

June 11, 2025

Distributed Systems

Coordinated Progress – Part 2 – Making Progress Reliable

Jack Vanlightly

June 11, 2025

Distributed Systems

In part 1, we described distributed computation as a graph and constrained the graph for this analysis to microservices, functions, stream processing jobs and AI Agents as nodes, and RPC, queues, and topics as the edges.

Within our definition of The Graph, a node might be a function (FaaS or microservice), a stream processing job, an AI Agent, or some kind of third-party service. An edge might be an RPC channel, a queue or a topic.

For a workflow to be reliable, it must be able to make progress despite failures and other adverse conditions. Progress typically depends on durability at the node and edge levels.

Jack Vanlightly

June 11, 2025

Distributed Systems

Coordinated Progress – Part 1 – Seeing the System: The Graph

Jack Vanlightly

June 11, 2025

Distributed Systems

At some point, we’ve all sat in an architecture meeting where someone asks, “Should this be an event? An RPC? A queue?”, or “How do we tie this process together across our microservices? Should it be event-driven? Maybe a workflow orchestration?” Cue a flurry of opinions, whiteboard arrows, and vague references to sagas.

Now that I work for a streaming data infra vendor, I get asked: “How do event-driven architecture, stream processing, orchestration, and the new durable execution category relate to one another?”

These are deceptively broad questions, touching everything from architectural principles to practical trade-offs. To be honest, I had an instinctual understanding of how they fit together but I’d never written it down, so this series is how I see it, my mental framework, and hopefully it will be useful and understandable to you.

Jack Vanlightly

March 13, 2025

Distributed Systems

Log Replication Disaggregation Survey - Apache Pulsar and BookKeeper

Jack Vanlightly

March 13, 2025

Distributed Systems

In this latest post of the disaggregated log replication survey, we’re going to look at the Apache BookKeeper Replication Protocol and how it is used by Apache Pulsar to form topic partitions.

Raft blends the roles and responsibilities into one monolithic protocol, MultiPaxos separates the monolithic protocol into separate roles, and Apache Kafka separates the protocol and roles into control-plane/data-plane. How do Pulsar and BookKeeper divide and conquer the duties of log replication? Let’s find out.

Jack Vanlightly

February 21, 2025

Distributed Systems

Log Replication Disaggregation Survey - Kafka Replication Protocol

Jack Vanlightly

February 21, 2025

Distributed Systems

In this post, we’re going to look at the Kafka Replication Protocol and how it separates control plane and data plane responsibilities. It’s worth noting there are other systems that separate concerns in a similar way, with RabbitMQ Streams being one that I am aware of.

Jack Vanlightly

February 19, 2025

Distributed Systems

Log Replication Disaggregation Survey - Neon and MultiPaxos

Jack Vanlightly

February 19, 2025

Distributed Systems

Over the next series of posts, we'll explore how various real-world systems and some academic papers have implemented log replication with some form of disaggregation. In this first post we’ll look at MultiPaxos. There are no doubt many real-world implementations of MultiPaxos out there, but I want to focus on Neon’s architecture as it is illustrative of the benefits of thinking in terms of logical abstractions and responsibilities when designing complex systems.

Jack Vanlightly

February 10, 2025

Distributed Systems

How to disaggregate a log replication protocol

Jack Vanlightly

February 10, 2025

Distributed Systems

This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or just when the log itself is the product (such as Kafka). So far I’ve been looking at Virtual Consensus, but now I’m going to widen the view to look at how log replication protocols can be disaggregated in general (there are many ways). In the next post, I’ll do a survey of log replication systems in terms of the types of disaggregation described in this post.

Jack Vanlightly

February 6, 2025

Distributed Systems

Steady on! Separating Failure-Free Ordering from Fault-Tolerant Consensus

Jack Vanlightly

February 6, 2025

Distributed Systems

"True stability results when presumed order and presumed disorder are balanced. A truly stable system expects the unexpected, is prepared to be disrupted, waits to be transformed." — Tom Robbins

This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or just when the log itself is the product (such as Kafka). I’m going to cover some of the same ground from the Introduction to Virtual Consensus in Delos post, but focus on one aspect specifically and see how it generalizes.