The Law of Large Numbers: A Foundation for Statistical Modeling in Distributed Systems

In my recent blog post, Obtaining Statistical Properties Through Modeling and Simulation, I described how we can use modeling and simulation to better understand both proposed and real systems. Modeling and simulation can also be extremely useful for assessing the effectiveness of optimizations.

However, in that post I missed a couple of additional interesting points that I think are worth covering.
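To make the titular law concrete, here is a minimal Python sketch, assuming a hypothetical request latency uniformly distributed between 10 and 100 ms (the distribution and its range are illustrative assumptions, not from the post): as the number of samples grows, the sample mean converges on the true expectation.

```python
import random

# Law of large numbers in miniature: estimate the mean of a hypothetical
# request latency, uniformly distributed between 10 ms and 100 ms.
# The true mean is 55 ms; larger sample counts land closer to it.
def mean_latency(num_samples: int) -> float:
    samples = [random.uniform(10, 100) for _ in range(num_samples)]
    return sum(samples) / num_samples

for n in (10, 1_000, 100_000):
    print(f"{n:>7} samples -> estimated mean {mean_latency(n):.2f} ms")
```

This convergence is what makes even simple simulations trustworthy: run enough trials and the estimated statistics settle near their true values.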

Obtaining statistical properties through modeling and simulation

Sophisticated, simulations need not be. Valuable insights, even simple scripts reveal. — Formal Methods Yoda

A couple of weeks ago I was a guest on The Geek Narrator to talk about formal verification. I spoke a lot about how modeling and simulation are tremendously powerful tools, whether you use a formal verification language (such as TLA+) or just a Python script.

This post goes through a real-world example of how I used modeling and simulation to understand the statistical properties of a proposed distributed system protocol, using both Python and TLA+. There is a talk version of this post from TLA+ Conf 2022.
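To give a flavor of the Python side, here is a minimal Monte Carlo sketch. It is not the protocol from the post or the talk; it estimates the availability of a generic majority-quorum protocol, where the replica counts and the per-replica up-probability are illustrative assumptions.

```python
import random

# Monte Carlo sketch: estimate how often a majority quorum of replicas
# is reachable, given each replica is independently up with probability p_up.
def quorum_available(num_replicas: int, p_up: float) -> bool:
    up = sum(random.random() < p_up for _ in range(num_replicas))
    return up > num_replicas // 2  # strict majority

def estimate_availability(num_replicas: int, p_up: float,
                          trials: int = 100_000) -> float:
    hits = sum(quorum_available(num_replicas, p_up) for _ in range(trials))
    return hits / trials

for n in (3, 5, 7):
    print(f"{n} replicas: quorum availability ~ {estimate_availability(n, 0.99):.5f}")
```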

Incremental Jobs and Data Quality Are On a Collision Course - Part 2 - The Way Forward

So what should we do instead?

This is less of a technology problem and more of a structural one. We can’t just add missing features to data tooling; we have to solve a people problem: how we organize, how team incentives line up, and how we apply well-established software engineering principles that have yet to take hold in the data analytics space.

Incremental Jobs and Data Quality Are On a Collision Course - Part 1 - The Problem

Big data isn’t dead; it’s just going incremental

If you keep an eye on the data ecosystem like I do, then you’ll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and associated datasets), one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on its platform and drew some surprising conclusions, one being that most queries ran over quite small data. DuckDB's takeaway was that big data was dead and that you could use a simpler query engine rather than a data warehouse. The reality is more nuanced than that, but the data does show that most queries run over smaller datasets.

Why?
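As a hedged illustration of the "simpler query engine" idea, the sketch below uses DuckDB's Python API to aggregate a local Parquet file directly, with no warehouse involved. The file name and column names are hypothetical placeholders.

```python
import duckdb  # pip install duckdb

# Aggregate a local Parquet file in place; DuckDB reads the file directly,
# so no cluster or warehouse is needed for small data like this.
# 'events.parquet' and its columns are hypothetical placeholders.
result = duckdb.sql("""
    SELECT event_type, count(*) AS n
    FROM 'events.parquet'
    GROUP BY event_type
    ORDER BY n DESC
""")
result.show()
```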

Forget the table format war; it’s open vs closed that counts

Apache Iceberg is a hot topic right now, and looks to be the future standard for representing tables in object storage. Hive tables are overdue for a replacement. People talk about table format wars: Apache Iceberg vs Delta Lake vs Apache Hudi and so on, but the “war” at the forefront of my mind isn’t which table format will become dominant, but the battle between open vs closed – open table formats vs walled gardens.

The teacher's nemesis

A few months ago I wrote Learning and Reviewing System Internals - Tactics and Psychology. One thing I touched on was how it is necessary to create a mental model in order to grok a codebase, or learn how a complex system works. The mental model gets developed piece by piece, using a layer of abstractions.

Today I am also writing about mental models and abstractions, but from the perspective of team/project leaders and their role in onboarding new team/project members. In this context, the team lead and senior engineers are teachers, and how effective they are has a material impact on the success of the team. However, there are real challenges, and leaders can fail without being aware of it, with potentially poor outcomes if left unaddressed.

The curse of Conway and the data space

Conway’s Law:

"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."

This is playing out worldwide across hundreds of thousands of organizations, and nowhere is it more evident than in the split between software development and data analytics teams. These two groups usually have different reporting structures, right up to, or immediately below, the executive team.

This is a problem now, and it is only growing.