Jack Vanlightly

To be atomic or non-atomic, that is the question (Fizzbee)

After posting my last Kafka transactions diary entry, JP (the Fizzbee maintainer) wrote a refactored version using non-atomic actions and a different way of representing the network. It’s a very interesting variant and I’m tempted to switch over to his version.

When an action is not atomic, its execution can yield at any moment to a different action in another role instance, or even in the same role instance. With this yielding, we can also replace explicit message passing with direct invocation of a function on another node; both the invocation and the response can interleave with other concurrent events happening across the system.

Let me use a simple example to demonstrate: Ping-Pong.
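The full Ping-Pong spec is in Fizzbee, but the core idea can be sketched in plain Python (my illustration, not JP's code): treat each action as a generator, where every `yield` marks a point at which the scheduler may interleave a step from another action, just as a non-atomic Fizzbee action can yield mid-execution.

```python
import random

def pinger(state):
    # A non-atomic action: each yield is a point where the scheduler
    # may switch to a step of another concurrently running action.
    state["pings_sent"] += 1      # step 1: send the ping
    yield
    state["pongs_received"] += 1  # step 2: receive the pong reply
    yield

def ponger(state):
    state["pings_received"] += 1  # single step: observe the ping
    yield

def run(seed=42):
    rng = random.Random(seed)
    state = {"pings_sent": 0, "pings_received": 0, "pongs_received": 0}
    runnable = [pinger(state), ponger(state)]
    while runnable:
        action = rng.choice(runnable)  # pick an arbitrary interleaving
        try:
            next(action)               # advance that action by one step
        except StopIteration:
            runnable.remove(action)
    return state

print(run())
```

This toy scheduler picks one random interleaving per run; a model checker explores every possible interleaving, which is exactly why the atomic vs non-atomic choice matters so much for state-space size.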

An introduction to symmetry in TLA+

Symmetry reduction in TLA+ is a clever trick for cutting down the size of the state space we need to explore during model checking. Think about a distributed system with interchangeable components—servers, nodes, or processes that behave identically. Without symmetry reduction, the model checker wastes time exploring states that are essentially duplicates, differing only in the labels we’ve assigned to these components. Symmetry reduction says, “Hey, if swapping the identities of these components doesn’t change the behavior of the system, let’s treat those states as one.” This massively reduces the computational effort while keeping the results valid.

In this post, I’ll show some simple examples of symmetry using trivial specs where we can actually visualize the state space. The idea is to build a mental model of how symmetry reduction works.
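To get a taste of the idea before opening TLA+ itself, here is a tiny Python sketch (my illustration, not from the post): three interchangeable servers give eight labelled states, but only four genuinely distinct ones once permutations of server identities are treated as equivalent. In TLC you would declare this with a SYMMETRY set built from Permutations over the model values.

```python
from itertools import product

# Toy system: three interchangeable servers, each "idle" or "busy".
SERVERS = 3
STATUSES = ("idle", "busy")

# Full state space: one status per labelled server.
full = set(product(STATUSES, repeat=SERVERS))

# Symmetry reduction: if servers are interchangeable, a state is
# equivalent to every permutation of itself. Sorting the tuple picks
# one canonical representative per equivalence class.
reduced = {tuple(sorted(state)) for state in full}

print(len(full))     # 8 labelled states (2^3)
print(len(reduced))  # 4 canonical states: 0, 1, 2 or 3 busy servers
```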

Dismantling ELT: The Case for Graphs, Not Silos

ELT is a bridge between silos. A world without silos is a graph.

I’ve been banging my drum recently about the ills of Conway’s Law and the need for low-coupling data architectures. In my Curse of Conway and the Data Space blog post, I explored how Conway’s Law manifests in the disconnect between software development and data analytics teams. It is a structural issue stemming from siloed organizational designs, and it not only causes inefficiencies and poor collaboration but ultimately hinders business agility and effectiveness. 

The Law of Large Numbers: A Foundation for Statistical Modeling in Distributed Systems

In my recent blog post, Obtaining Statistical Properties Through Modeling and Simulation, I described how we can use modeling and simulation to better understand both proposed and real systems. Not only that, but they can be extremely useful when assessing the effectiveness of optimizations.

However, in that post I missed a couple of additional interesting points that I think are worth covering.
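Before getting to those points, a quick refresher on the law itself, with an example of my own rather than one from the post: the sample mean of independent, identically distributed draws converges to the true expectation as the sample size grows, which is what makes statistics gathered from many simulation runs trustworthy.

```python
import random

random.seed(1)

def sample_mean(n):
    # Mean of n simulated fair die rolls; the true expectation is 3.5.
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}  mean={sample_mean(n):.3f}")
```

As n grows, the printed means settle ever closer to 3.5.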

Obtaining statistical properties through modeling and simulation

Sophisticated, simulations need not be. Valuable insights, even simple scripts reveal. — Formal Methods Yoda

A couple of weeks ago I was a guest on The Geek Narrator to talk about formal verification. I spoke a lot about how modeling and simulation are tremendously powerful tools, whether you use a formal verification language (such as TLA+) or just a Python script.

This post goes through a real world example of how I used modeling and simulation to understand the statistical properties of a proposed distributed system protocol, using both Python and TLA+. There is a talk version of this post from TLA+ Conf 2022.
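The protocol itself is covered in the post; to show the general shape of the technique, here is a minimal Monte Carlo sketch in Python (a hypothetical placement example, not the protocol from the talk): run a randomized experiment many times, collect a statistic from each run, and examine its distribution.

```python
import random
from statistics import mean

def max_load(keys, nodes, rng):
    # Place each key on a uniformly random node and return the busiest
    # node's load. Load imbalance is the statistical property of interest.
    counts = [0] * nodes
    for _ in range(keys):
        counts[rng.randrange(nodes)] += 1
    return max(counts)

def simulate(runs=1_000, keys=10_000, nodes=10, seed=7):
    rng = random.Random(seed)
    samples = [max_load(keys, nodes, rng) for _ in range(runs)]
    print(f"ideal per-node load:  {keys / nodes:.0f}")
    print(f"mean max load:        {mean(samples):.1f}")
    print(f"worst max load seen:  {max(samples)}")

simulate()
```

Even a script this small answers a real design question: how far above the ideal load should we expect the hottest node to sit?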

Incremental Jobs and Data Quality Are On a Collision Course - Part 2 - The Way Forward

So what should we do instead?

This is less of a technology problem and more of a structural one. We can’t just add some missing features to data tooling; it’s about solving a people problem: how we organize together, how team incentives line up, and how we apply well-established software engineering principles that have yet to be realized in the data analytics space.

Incremental Jobs and Data Quality Are On a Collision Course - Part 1 - The Problem

Big data isn’t dead; it’s just going incremental

If you keep an eye on the data ecosystem like I do, then you’ll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and associated data sets), one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on its platform, and some surprising conclusions were drawn, one being that most queries ran over quite small data. DuckDB’s conclusion was that big data was dead and that you could use simpler query engines rather than a data warehouse. It’s far more nuanced than that, but the data does show that most queries run over smaller datasets.

Why?

Forget the table format war; it’s open vs closed that counts

Apache Iceberg is a hot topic right now, and looks to be the future standard for representing tables in object storage. Hive tables are overdue for a replacement. People talk about table format wars: Apache Iceberg vs Delta Lake vs Apache Hudi and so on, but the “war” at the forefront of my mind isn’t which table format will become dominant, but the battle between open vs closed – open table formats vs walled gardens.
