Learning and reviewing system internals: tactics and psychology

Every now and then I get asked for advice on how to learn about distributed system internals and protocols. Over the course of my career I've picked up a learning and reviewing style that works pretty well for me.

To define these terms, learning and reviewing are similar but not the same:

  • Learning about how a system works is the easier of the two. By the means available to you (books, papers, blogs, code), you study the system to understand how it works and why it works that way.

  • Reviewing a system requires learning but also involves opinions, taking positions, making judgments. It is trickier to get right, more subjective, and often only time can show you if you were right or wrong about it and to what degree.

We all review systems to one degree or another, even if it's just a casual review whose results are some loosely held opinions shared by the coffee machine. But when it comes to sharing our opinions in more formal contexts, such as an architecture meeting, a blog post, a conference talk or a job interview, the stakes and the risks are higher. If you review a system and come to some conclusions, how do you know if you are right? What happens if you are wrong? Someone could point out your flawed arguments, or you might make a bad decision. Not only can reviewing complex systems be hard, it can be scary too.

Hybrid Transactional/Analytical Storage

Confluent has made two key feature announcements in the spring of 2024:

  • Freight Clusters, a new cluster type that writes directly to object storage. It is aimed at the “freight” of data streaming workloads (log ingestion, clickstreams, large-scale ETL and so on) that can be cost-prohibitive with a low-latency multi-AZ replication architecture in the cloud.

  • Tableflow, an automated feature that provides seamless materialization of Kafka topics as Apache Iceberg tables (and vice-versa in the future). 

This trend towards object storage is not just happening at Confluent but across the data ecosystem.

Tableflow: the stream/table, Kafka/Iceberg duality

Confluent just announced Tableflow, the seamless materialization of Apache Kafka topics as Apache Iceberg tables. This has to be the most impactful announcement I’ve witnessed while at Confluent. This post is about why Iceberg tables aren’t just another destination to sync data to; they fundamentally change the world of streaming. It’s also about the macro trends that have led us to this point and why Iceberg (and the other table formats) are so important to the future of streaming.
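
Tableflow's own mechanics are Confluent-internal, but the duality the post is about is easy to picture: the same records read once as a stream (a Kafka consumer) and once as a table (an Iceberg scan). The sketch below is only an illustration of that idea; the broker address, topic name, catalog configuration and table identifier are all hypothetical.

```python
# Sketch of the stream/table duality: the same data read two ways.
# Broker, topic, catalog config and table identifier are hypothetical.
from confluent_kafka import Consumer
from pyiceberg.catalog import load_catalog

# Stream view: consume the Kafka topic record by record.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption: a local broker
    "group.id": "duality-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])               # hypothetical topic

for _ in range(10):                          # read a handful of events
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    print("stream:", msg.value())
consumer.close()

# Table view: scan the Iceberg table materialized from the same topic.
catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})  # hypothetical catalog
table = catalog.load_table("kafka.orders")   # hypothetical table identifier
print("table:", table.scan().to_arrow().to_pandas().head())
```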

The beauty of writing

I woke up this morning, sleep-deprived after my cat woke me repeatedly last night, and discovered I needed to write something about writing. Perhaps it's because I'm reading "Bird by Bird" by Anne Lamott again. So here is another post about writing as a software engineer.

I love writing; I love the feeling of knowing there's a seed of a story or an idea inside. With some writing projects I know something is there worth telling, but I don't always know enough about it yet to be sure. Sometimes it requires upfront research, and sometimes I think I know enough to start writing and see the idea emerge as I type. The piece advances in fits and starts as I go back and research gaps, often discarding the whole thing and rewriting it... until it has formed itself into something worthwhile. Then I send it to people for review, and their ideas and insights make the piece better again.

S3 Express One Zone, not quite what I hoped for

AWS just announced a new lower-latency S3 storage class, and for those of us in the data infrastructure business, this is big news. It’s not a secret that a low-latency object storage primitive has the potential to change how we build cloud data systems forever. So has this new world arrived with S3 Express One Zone?

The answer is no, but this is a good time to talk about cloud object storage, its role in modern cloud data systems and the potential future role it can take.
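
As context for why latency matters so much here, this is roughly how you might probe the difference yourself with plain boto3. It is only a sketch: the bucket names are hypothetical, both buckets must already exist, and directory buckets need a recent boto3 version.

```python
# Rough GET-latency comparison sketch; bucket names are hypothetical.
import time
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

BUCKETS = {
    "standard": "my-standard-bucket",                 # hypothetical S3 Standard bucket
    "express": "my-express-bucket--usw2-az1--x-s3",   # hypothetical Express One Zone directory bucket
}
KEY = "probe/one-kib-object"

def time_get(bucket: str, key: str, iterations: int = 20) -> float:
    """Return the mean GET latency in milliseconds over `iterations` requests."""
    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        total += time.perf_counter() - start
    return (total / iterations) * 1000

for name, bucket in BUCKETS.items():
    s3.put_object(Bucket=bucket, Key=KEY, Body=b"x" * 1024)  # seed a 1 KiB object
    print(f"{name}: {time_get(bucket, KEY):.1f} ms mean GET latency")
```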

The Architecture of Serverless Data Systems

I recently blogged about why I believe the future of cloud data services is large-scale and multi-tenant, citing, among others, S3. 

“Top tier SaaS services like S3 are able to deliver amazing simplicity, reliability, durability, scalability, and low price because their technologies are structurally oriented to deliver those things. Serving customers over large resource pools provides unparalleled efficiency and reliability at scale.” So I said in that post.

To further explore this topic, I am surveying real-world serverless, multi-tenant data architectures to understand how different types of systems, such as OLTP databases, real-time OLAP, cloud data warehouses, event streaming systems, and more, implement serverless multi-tenancy.
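
To make the efficiency claim concrete, here is a small self-contained simulation of my own (it mirrors no specific system): it compares the capacity needed when every tenant is provisioned for its own peak against the capacity needed when all tenants share one pool sized for their combined peak.

```python
# Illustration of statistical multiplexing: bursty tenants need far less
# combined capacity when they share one pool than when each is provisioned
# for its own peak. Numbers are arbitrary and illustrative only.
import random

random.seed(42)
TENANTS = 200
TICKS = 1_000

# Each tenant is mostly idle with occasional bursts of demand.
demand = [
    [random.choice([0, 0, 0, 0, 10]) for _ in range(TICKS)]
    for _ in range(TENANTS)
]

# Dedicated model: every tenant is sized for its individual peak.
dedicated_capacity = sum(max(series) for series in demand)

# Pooled model: the shared pool is sized for the peak of the summed demand.
pooled_capacity = max(
    sum(demand[t][tick] for t in range(TENANTS)) for tick in range(TICKS)
)

print(f"dedicated capacity: {dedicated_capacity}")
print(f"pooled capacity:    {pooled_capacity}")
print(f"pooling needs ~{pooled_capacity / dedicated_capacity:.0%} of the dedicated total")
```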

The importance of liveness properties (with TLA+ Part 2)

In part 1 we introduced the concept of safety and liveness properties, then a stupidly simple gossip protocol called Gossa. Our aim is to find liveness bugs in the design and improve the design until all liveness issues are fixed.

Gossa had some problems. First, it had cycles caused by nodes contesting whether a peer was dead or alive. We fixed that by making deadness take precedence over aliveness, but the cluster still could not converge. The next problem was that a node falsely accused of being dead was unable to refute its deadness, as no one would pay attention to it: deadness ruled.

The proposed fix I mentioned in part 1 was to allow a falsely accused node to refute its deadness via the introduction of a monotonic counter.
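
The actual design work happens in TLA+ in the post itself; purely as an illustration of the rule, here is a toy Python sketch of my own. State about a peer is a (status, counter) pair: deadness wins when counters are equal, but a higher counter always wins, so a falsely accused node can refute the accusation by bumping its own counter.

```python
# Toy sketch of the refutation rule: state = (status, counter).
# Deadness wins ties, but a higher counter always wins, so a falsely
# accused node can refute its deadness by bumping its own counter.

def merge(local: tuple[str, int], remote: tuple[str, int]) -> tuple[str, int]:
    """Pick the winning claim about a peer when two gossip states meet."""
    l_status, l_counter = local
    r_status, r_counter = remote
    if l_counter != r_counter:
        return local if l_counter > r_counter else remote
    # Equal counters: deadness takes precedence over aliveness.
    return local if l_status == "dead" else remote

# Node B is falsely accused dead at counter 3.
accusation = ("dead", 3)
stale_alive = ("alive", 3)
print(merge(stale_alive, accusation))   # ('dead', 3): deadness rules at equal counters

# B refutes by re-announcing itself alive with a higher counter.
refutation = ("alive", 4)
print(merge(accusation, refutation))    # ('alive', 4): the refutation wins
```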

The importance of liveness properties (with TLA+ Part 1)

Invariants get most of the attention because they are easy to write, easy to check, and find those histories that lead to really bad outcomes, such as lost data. But liveness properties are really important too, and after years of writing TLA+ specifications, I couldn’t imagine having confidence in a specification without them. This post and the next are a random walk through the world of model checking liveness properties in TLA+.
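
For readers new to the distinction, the two kinds of properties have different temporal shapes. As a generic illustration (the predicate names are placeholders of my own, not taken from any spec in these posts):

```latex
% Safety / invariant: nothing bad ever happens.
\Box\, \mathit{NoDataLoss}

% Liveness: something good eventually happens, e.g. the system
% eventually reaches and stays in a converged state.
\Diamond\Box\, \mathit{Converged}
```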

The outline is like this:

  • Part 1: I (hopefully) convince you that liveness properties are important. Then I implement a gossip algorithm in TLA+ and use liveness properties to find problems.

  • Part 2: I continue evolving the algorithm, finding more and more liveness problems, overcoming challenges such as an infinite state space, and imparting some helpful principles, making you a better engineer and thinker by the end.