In my recent blog post, Obtaining Statistical Properties Through Modeling and Simulation, I described how we can use modeling and simulation to better understand both proposed and real systems. Not only that, but it can be extremely useful when assessing the effectiveness of optimizations.
However, in that post I missed a couple of additional interesting points that I think are worth covering.
Empirical data vs math
The first point is that many of us don’t have super strong math skills. Going back to the gossip protocol SWIM from my prior blog post, the formula for the spreading of information over time is described as follows:
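(As a rough illustration only, and not necessarily the exact form given in the post or the SWIM paper: the classic epidemic-style result says that with n members, each gossiping to one random peer per round, the expected fraction x_t of members that have heard a piece of information after t rounds is approximately)

$$ x_t \approx \frac{1}{1 + (n - 1)\,e^{-t}} $$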
I can tell you that I cannot produce formulas like that. I’m just not that good at math! I could no doubt plug that formula into a spreadsheet and produce some charts, but coming up with the formula itself is not in my wheelhouse. I always wished my math skills were better, but my brain doesn’t work that way.
But all is not lost. I’m a decent programmer, and I know enough about statistics to do simulation. This is where the Law of Large Numbers comes in.
A friend who read my last blog post said to me:
I came into this field from a physics education, so had a lot more exposure to stats, and it's still never intuitive for me. But I did learn to feel good about trusting the law of large numbers.
The Law of Large Numbers
Hopefully I’ve convinced you that modeling and simulation can help uncover crucial statistical properties. But why do these simulations provide reliable insights? The answer lies in the Law of Large Numbers.
The law of large numbers (LLN) is a fundamental principle in probability and statistics that describes the result of performing the same experiment many times. It states that, as the number of trials or observations increases, the average of the observed results will converge to the expected or theoretical average.
The classic example is flipping a fair coin repeatedly. The proportion of heads will get closer and closer to 0.5 (the expected, theoretical probability) as the number of flips increases.
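As a quick sketch of that convergence in action (a minimal Python example of my own, not from any particular library or post):

```python
import random

# Flip a fair coin num_flips times and return the proportion of heads.
def proportion_of_heads(num_flips: int, rng: random.Random) -> float:
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

rng = random.Random(42)
for n in (10, 100, 10_000, 1_000_000):
    print(f"{n:>9,} flips -> proportion of heads = {proportion_of_heads(n, rng):.4f}")
```

With only 10 flips the estimate is often way off; by a million flips it will typically sit very close to 0.5.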
There are some constraints to LLN:
The observations must be independent.
They should be drawn from the same probability distribution (identically distributed).
The population must have a finite expected value and variance.
Using the example of a fair coin, the "population" refers to the theoretical set of all possible outcomes of flipping the coin an infinite number of times. For a fair coin, this population has two equally likely outcomes: heads and tails, each with a theoretical probability of 0.5.
My cooperating queue consumer example satisfied all of these constraints:
Each run (or trial) was completely independent (no influence from any prior runs).
All non-deterministic behaviors were sourced using the same distribution — a uniform distribution (i.e. the random stuff was truly random and not skewed).
The measured quantity (the number of rounds to reach balance) had a finite expected value and variance.
Applying LLN to (Distributed) System Simulations
Systems are far more complex than fair coins, which have only two outcomes. With a fair coin, the proportion of heads (and of tails) approaches 0.5 asymptotically as we perform more and more flips. But complex systems have much richer statistical properties than that. In the example of a gossip protocol, or the cooperating consumers of my prior blog post, there is a large amount of variance in the results. Understanding the size and shape of that variance can be critical.
Visualizing probability distributions plays a crucial role in understanding experiment results, because a good chart can clearly show the likelihood of different outcomes. There are a number of chart types for showing distributions; in my last post I included a violin plot and a simple line plot with a number of percentiles. With enough simulations, the distribution of outcomes will approach the expected theoretical distribution.
Understanding these probability distributions is critical. Outliers in individual runs gain context as aggregate trends emerge, which keeps your conclusions statistically robust. It’s really useful (and fun) to get good at using data visualization libraries and to learn which types of plot suit which kinds of data.
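As a sketch of what that workflow can look like (the run_simulation() function below is a hypothetical stand-in, not the actual model from my last post):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for a single simulation run; in practice this would
# be your own model (e.g. the number of rounds the consumers take to balance).
def run_simulation(rng: np.random.Generator) -> float:
    return rng.gamma(shape=4.0, scale=2.5)  # dummy right-skewed metric

rng = np.random.default_rng(seed=1)
results = np.array([run_simulation(rng) for _ in range(10_000)])

# Percentiles summarize the size and shape of the variance numerically...
for p in (50, 75, 90, 99, 99.9):
    print(f"p{p}: {np.percentile(results, p):.2f}")

# ...and a violin plot shows the full shape of the distribution.
fig, ax = plt.subplots()
ax.violinplot(results, showmedians=True)
ax.set_ylabel("metric of interest")
ax.set_title("Distribution over 10,000 simulated runs")
plt.show()
```

Swap in your own simulation and the percentiles and violin plot immediately show you the size and shape of the variance.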
Final thoughts
In the world of distributed systems, the Law of Large Numbers isn’t just a theoretical tool; it’s what lets us extract meaningful data from simulations, and in my opinion it’s a foundational part of the system designer’s toolkit. What’s more, simulation can make up for any shortcomings in your math skills. Don’t be ashamed, math is hard!
I’ve used simulation on a number of occasions when I didn’t feel confident in my combinatorics and spreadsheet formulae:
For understanding some properties of distributing data over a cluster of nodes under different failure scenarios, when I didn’t trust my own combinatorics. It allowed me to try out optimizations and distribution strategies such as Copysets, and to assess the impact of cell-based architectures (without inscrutable Google Sheets formulae).
For understanding basic disk and network resource usage, given different cluster sizes, replication factors, amounts of skew, and scenarios such as an AZ going down (there is a toy sketch of this after the list).
For doing cost analysis. I wrote my own cost analyzer based on simulation data. The pricing team found out and asked to compare our results, as they had some reservations about their own formulas. The results were identical, which made us both feel a lot more confident.
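To give a flavor of the resource-usage case, here is a deliberately simplistic toy of my own (hypothetical node counts, Pareto-skewed partition sizes, and replicas simply re-placed on the surviving nodes), not the model I actually used:

```python
import random
from collections import defaultdict

# Toy Monte Carlo model (hypothetical numbers): place replicas of skewed
# partitions on random nodes and compare per-node disk usage when all AZs
# are up vs. when one AZ's nodes are removed (i.e. a fully rebalanced state).
NUM_NODES = 12
NUM_AZS = 3
REPLICATION_FACTOR = 3
NUM_PARTITIONS = 1_000

def per_node_usage_gb(rng: random.Random, azs_available: int) -> list:
    nodes = [n for n in range(NUM_NODES) if n % NUM_AZS < azs_available]
    usage = defaultdict(float)
    for _ in range(NUM_PARTITIONS):
        size_gb = rng.paretovariate(2.0)  # skewed partition sizes
        for node in rng.sample(nodes, REPLICATION_FACTOR):
            usage[node] += size_gb
    return sorted(usage.values())

rng = random.Random(7)
for azs, label in ((NUM_AZS, "all AZs up"), (NUM_AZS - 1, "one AZ down")):
    usage = per_node_usage_gb(rng, azs)
    print(f"{label:>11}: min={usage[0]:.0f} GB, max={usage[-1]:.0f} GB")
```

A real analysis would run many trials and look at the distribution of per-node usage, in the spirit of the Law of Large Numbers, rather than a single run per scenario.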
But be careful:
If your model is fundamentally wrong, it will produce plausible-looking garbage. Being conscious of the weaknesses of your model is important. As the famous phrase goes: “All models are wrong, but some are useful”.
Scrutinize the results. Unexpected results could be an indicator that your model is wrong, but they could also be uncovering unexpected behaviors that really do exist. So apply some scepticism to the process.
Now go out there and have some fun doing modeling and simulation :)