Learning and reviewing system internals: tactics and psychology

Every now and then I get asked for advice on how to learn about distributed system internals and protocols. Over the course of my career I've picked up a learning and reviewing style that works pretty well for me.

To define these terms, learning and reviewing are similar but not the same:

  • Learning about how a system works is the easier of the two. By the means available to you (books, papers, blogs, code), you study the system to understand how it works and why it works that way.

  • Reviewing a system requires learning but also involves opinions, taking positions, making judgments. It is trickier to get right, more subjective, and often only time can show you if you were right or wrong about it and to what degree.

We all review systems to one degree or another, even if it's just a casual review where the results are some loosely held opinions shared by the coffee machine. But when it comes to sharing our opinions in more formal contexts, an architecture meeting, a blog post, a conference talk or a job interview, the stakes are higher and the risks are also greater. If you review a system and come to some conclusions, how do you know if you are right? What happens if you are wrong? Someone could point out your flawed arguments. You could make a bad decision. Not only can reviewing complex systems be hard, it can be scary too.

In this post, I’ll share some of the tricks and mindset that I’ve developed regarding learning and reviewing system internals, distributed system protocols and so on. Whether it’s learning or reviewing, I initially thought to break this topic down into two broad categories:

  • How to understand a complex system.

  • How to share one’s understanding with others.

Not long into writing the first draft, I realized that I couldn’t separate them; the two are entwined. Just as the saying goes that “writing is thinking”, I think that “sharing is understanding”. You can only take your understanding so far with a passive approach. There is no better way of testing your understanding than sharing it with others. Without sharing and being challenged, I would have lower confidence in my own opinions and understanding of things.

With that preamble out of the way, I’ll get started.

Background learning

The more widely read you are, the easier it gets to tackle any one given system.

Academics are constantly reading papers and keeping up with the latest research, and the same applies to the best doctors, the best sports therapists, the best software engineers. If you don’t read, then you end up like the senior engineer with 10 years of experience who really just lived 1 year and repeated it 10 times.

My approach to reading is as follows:

  • In areas where I have little to no experience, I start with books if they exist. Books don’t always cover internals, but if they do, then they are a great place to start as they are generally pretty high level and well-written (they got past the editor's quality control).

  • In areas where I have some experience, I focus mostly on engineering papers and high quality blogs.

The general strategy is: Start wide and shallow, learn the landscape. After you have a broad understanding then start diving deeper into specific areas that interest you. 

The aim of background learning is to gain a broad understanding of what is possible, when approaches work, when they don’t work. You are learning the tools of the trade, the recognizable patterns. Often these patterns and concepts have names. Later when you are learning a software system, or protocol, you can say, “Ah, I’ve seen this pattern before” or “I have seen this tool in use before” and it makes it easier to understand a complex system. 

But where do I get time for all this reading? Don’t be afraid to spend some work hours on learning. Your employer would thank you if they knew how much more effective in your job you were because of it.

The learner/reviewer frame of mind

When you know nothing, it is hard or impossible to come at learning with a critical mind. When I really started to take learning data system internals seriously, back in 2013, I am sure I believed everything I read. The reading that was formative for me back then was a book on SQL Server locking fundamentals, a book on Apache Cassandra internals and also the Jepsen tests. Stumbling on the Jepsen tests early in my career was a stroke of luck as I like to think it engendered in me a certain level of healthy skepticism and critical thinking.

With experience, the engineer learns to think critically and not drink the Kool-Aid. The effective engineer learns a certain mindset which is both open and critical. When an engineer reviews a system design, he or she must do so with an open mind while also taking an adversarial position. An open mind because the system design often includes ingenious approaches that are new to the reviewer. The reviewer can be pleasantly surprised to learn something new that can be added to their toolkit of knowledge. But the reviewer also takes an adversarial position because the system may also have warts and weak points. The reviewer focuses on what can go wrong, what the edge cases are and so on. The best reviewers combine this open-mindedness and adversarial position to the benefit of themselves and whatever system they are analyzing.

Mental models and abstractions

“All models are wrong, but some are useful” ― George Box

If you are reviewing a well-structured system design with good abstractions, then you are in luck: it’s all laid out on paper for you already. But if you want to understand and review an existing system, then things can get a bit messy.

Learning a large, complicated existing system can seem a bit daunting at first. The docs and blog posts may shed some light on the design but leave some patches in the dark. Building your own mental model out of this large and complicated system can be hard. You will need to get comfortable reading code, and reaching out to the community (whoever that may be) for help. Start at a high level, read the docs, read the blogs if they exist, then fill in the gaps reading code and talking to people.

How do software developers handle complexity in software? They write abstractions. How should a reviewer handle the inherent complexity of a system they are studying? Through abstractions. Some abstractions will be right there for you, built into the software by the system designers and engineers. Some will need to be invented by the reviewer to wrangle this thing into a more manageable set of parts.

“The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise.” ― Edsger Dijkstra

That is what you are trying to do. Build a mental model where you can be absolutely precise and also keep things small and simple enough to keep in your head. The art of a good system internals blog post is to build that mental model for the reader, even if the system itself lacks any such model. From the safety of a high level mental model, you can dive off into the depths and find your way back again without becoming hopelessly lost.

Write it down

If you are trying to understand an existing software system, it's hard to build a solid understanding if you don't write it all down. Writing is thinking and if you're not thinking then you aren't doing a very good review. Sometimes it’s simply too big to keep in your head at first. If you don’t write it down, you’ll forget, get lost and generally do a poor job of it.

An informal review may just be a scratchpad of diagrams, notes, questions, answers and more questions still unanswered. A more formal review can go as far as a nicely written up and well structured analysis. The reason to do the second is that you can share it with others, which is especially important if your analysis is an opinionated review.

Don’t fall for your own hubris

“No theory adequately covers reality.” ― Anthony de Mello.

You’ve read the docs, the spec, the blog posts and read all the code. You think you know this system and are ready to pour forth your opinions and give your judgment. 

Stop, take a moment.

Is your mental model correct? Is it even possible to come to a judgment based on the level of analysis you’ve applied? Systems are complex, and design decisions can be inscrutable because someone forgot to explain them anywhere. Do you think you have figured out all the reasons why this part is this way and that other part is another way? You have an open-minded but adversarial mindset, but do you also realize that you are almost certainly wrong about something?

Humility.

Hubris is thinking you figured it all out by yourself and can rain judgments from atop your ivory tower. Humility is accepting that you are probably wrong to a certain degree and need help to understand how and where. You all might be wrong, and only find out through painful experiences down the road.

What level of understanding are you trying to reach?

There are multiple levels of understanding of a software system:

  • The mental model for how it works.

  • Its performance characteristics.

  • Its behavior in production settings.

And so on. 

Do you think, after reading the docs, you’ll know all of its performance characteristics? Do you think after reading a book you’ll be ready to run the thing in production? Of course not.

There are limits to the level of understanding you will be able to achieve, based on your time constraints and method of learning. This is where speaking to people is so useful. You alone won’t know the warts and wrinkles of a system in production through a theoretical analysis. Sure, you might be able to paint some broad strokes in some cases, based on previous experience, but in other cases you won’t have a clue. Sometimes the docs, the paper, or whatever has frustrating gaps! Know the limits of what you can achieve, based on the time, effort and methodology used. Make up for what you can’t know alone by talking to others that do know, through their own hard-won experience.

Test your understanding and the positions you have taken

As I have just said, don’t fool yourself. There are always a number of mistakes in your understanding. This is where sharing your analysis with others comes in. Ask them to review your review, challenge your understanding, your assumptions and conclusions. You will have gotten some things wrong or missed important points. The single most important rule regarding getting things wrong is to be open-minded and not be defensive (more on that later).

Talking to the system designers themselves (if you can) is a great way to test your understanding. They likely have been working on it for months if not years. They usually know all the problems; they know why something you highlighted as a potential problem is not so in the majority of real world scenarios. It's rare you uncover something they were not aware of or didn't treat seriously enough.

Make a model

Another way I test my understanding is by writing a TLA+ model. I don't always have time for that so I only reach for it when I really need to. It doesn’t have to be TLA+; sometimes I write a simple model or simulation in Python. Reverse engineering a design into some kind of executable model or simulation is an incredibly effective way of internalizing the high level design or algorithm. Importantly, it turns a passive understanding into an active implementation of that understanding. It makes things concrete, it strengthens your grasp of the components and relationships, and it shows you where some of the gaps in your understanding lie.
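To give a flavor of what a small Python model can look like, here is a minimal sketch. It is not taken from any real system: the toy design (two clients incrementing a shared counter with a non-atomic read followed by a write) and every name in it are hypothetical, chosen only for illustration. The model enumerates every interleaving of client steps, breadth first, and checks an invariant, which is roughly what a model checker does for a TLA+ spec, only far more crudely.

```python
from collections import deque

# Hypothetical toy design: two clients each increment a shared counter
# using a non-atomic "read, then write" sequence.
CLIENTS = ("a", "b")


def initial_state():
    # State = (counter, per-client program counter, per-client local copy)
    return (0, ("read",) * len(CLIENTS), (None,) * len(CLIENTS))


def next_states(state):
    """Yield every state reachable when exactly one client takes one step."""
    counter, pcs, local_copies = state
    for i in range(len(CLIENTS)):
        if pcs[i] == "read":
            # The client copies the shared counter into its local variable.
            new_pcs = pcs[:i] + ("write",) + pcs[i + 1:]
            new_copies = local_copies[:i] + (counter,) + local_copies[i + 1:]
            yield (counter, new_pcs, new_copies)
        elif pcs[i] == "write":
            # The client writes back its (possibly stale) local copy plus one.
            new_pcs = pcs[:i] + ("done",) + pcs[i + 1:]
            yield (local_copies[i] + 1, new_pcs, local_copies)


def invariant(state):
    """Once every client is done, no increment may have been lost."""
    counter, pcs, _ = state
    if all(pc == "done" for pc in pcs):
        return counter == len(CLIENTS)
    return True


def check():
    """Explore all interleavings; return a violating trace, or None."""
    start = initial_state()
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            # Walk the parent links back to the start to rebuild the trace.
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return list(reversed(trace))
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


trace = check()
for step in trace or []:
    print(step)
```

Running it prints a shortest sequence of states in which both clients read the counter before either writes, so one increment is lost. Even a toy like this forces you to decide what the state is, what a step is, and what "correct" means, and those are exactly the places where gaps in understanding tend to show up.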

Get your hands dirty

There’s nothing like running the system you are learning about. Make it do stuff and inspect the file system. Run tests and use a debugger to follow the chain of calls. Run it in debug mode and inspect the logs.

Overcoming your fear of sharing your understanding/opinions with others

Just as hubris is a problem, so too is excessive self-doubt.

Speaking up in a meeting, writing a blog, doing a conference talk can all be scary. Some people struggle more with confidence and self-doubt than others. But I have to think that we all experience, or have experienced, the fear of other people’s negative opinions of us. I don’t want to get too “zen” here, but speaking for myself, I have found that a certain level of spiritual maturity has had a large impact on my career and my enjoyment of my profession. Freeing myself of the fear of criticism or imagined criticism was a key part of my professional maturity. But it goes both ways: a promotion can be as difficult as a cutting remark, as you feel the pressure of inflated expectations. You had a great year, got promoted, but now you have to go and pull it off again.

“Another illusion: You are all those labels that people have put on you, or that you have put on yourself. You’re not, you’re not! So you don’t have to cling to them. The day that somebody tells me I’m a genius and I take that seriously, I’m in big trouble. Can you understand why? Because now I’m going to start getting tense. I’ve got to live up to it, I’ve got to maintain it. I’ve got to find out after every lecture: ‘Did you like the lecture? Do you still think I’m a genius?’ See? So what you need to do is smash the label! Smash it, and you’re free! Don’t identify with those labels.” ― Anthony de Mello.

Let's take this back to sharing (and therefore testing) your understanding of a system and any positions you may have taken. If you find yourself being defensive, or scared of even sharing your thoughts, then knock it off! Reviewing a system design is a learning process. You will make mistakes and you want people to call them out. When someone tells you that you got some key part of the analysis wrong, thank them and ask for more detail!

You don't need a thick skin when you ask people to challenge you, you just need to change how you view the relationship. Whether the criticism was invited by you or not changes nothing. Criticism of your review is not criticism of you, your worth or core identity. There is no honor to defend! If you asked for help, they are giving it to you. If you didn’t ask for help, then great, this person took the time to write to you anyway. You are likely wrong about some things, and they can help you see that. We're all wrong about a lot of stuff all the time. Sometimes we're wrong about things that are in our own area of expertise! Sometimes we’re wrong even after someone reads our review! No-one is safe. The key to learning is to drop the pretenses and accept the help you are given with an open mind and an open heart.

If you feel nervous sharing a written review, then it can help to tell people something like "Look, I expect I got some things wrong in my analysis, could you do me a favor and point out the mistakes you see?". If you're nervous about speaking up in a meeting then you can start with something like "I might be misunderstanding something important here but ...". This also communicates to those around you that you know you might be wrong and you're open to being told so. It actually shows more strength and maturity, which makes others feel more comfortable and more likely to tell you your mistakes (rather than just thinking them in the privacy of their own heads). 

No-one likes defensiveness and it only makes you look immature and weak. Be the grown-up in the room that people can trust (it feels good too). If you feel defensiveness welling up, remember humility, stop clinging to this facade that you are trying to maintain. The defensiveness comes because you are defending and clinging to some set of labels, some image of yourself. Let it go.

When I feel defensiveness, I personally like to remember my humility with the humor of Anthony de Mello. He jokes:

“That’s the most liberating, wonderful thing in the world, when you openly admit you are an ass. It’s wonderful. Then, when people tell me ‘you’re wrong’ I reply, ‘Well, what can you expect of an ass?’”

Getting started

If you want to begin understanding system internals but are not sure how, then there are two things you can start with:

  • Start with the background reading. Start reading some books, some blogs and some engineering papers. The only aim is to learn. There are no deadlines, only the rigor to keep at it, little by little. 

  • Learn a system that is relevant to your job now. For me, it was reading the book “SQL Server Concurrency: Locking, Blocking, and Row Versioning” in 2013, and then a load of SQL Server performance blogs. It made an immediate impact on my job at the time as I became “the guy” for SQL Server performance.

If you’re lucky, you’ll do both.

The end

Looking over my list of things, I dedicated as much time to psychological aspects as anything else. I think the psychological side is probably the most important and hardest to get right. Our own foolishness often gets in the way of learning. We can form opinions then feel the need to defend them when we should be welcoming counter opinions. If I could summarize it, it would be: humility, humility, humility.

The ability to learn effectively must itself be learned. Software is complicated, it's messy and it changes so quickly. How can any of us be so sure we’re right? With this level of change, the ability to learn is the most important skill of all. Tackle learning with humility, don’t be afraid to share your understanding - let people help you there. If you got something wrong, it wasn’t you, it was your understanding that was wrong.

It might seem at times that there is too much to learn, a system is too big, there are too many moving parts - relax! Take it easy and bite off one piece at a time. Build the mental model piece by piece, it will be your map so you don’t get lost. Starting to read a new codebase is always an exercise in flailing and groping in the dark at first. Be patient and keep going, you’ll soon look back at the valley where you started and be amazed at how far below you it has become.


PS: I should probably give some recommendations for distributed systems reading. There are already plenty of lists you can find with a Google search; one that I used to refer to a lot is https://github.com/asatarin/testing-distributed-systems. In terms of broad, high level learning to start with, I don’t think you can go wrong with the book Database Internals by Alex Petrov. Martin Kleppmann did an excellent YouTube playlist, wrote the book Designing Data-Intensive Applications, and has a number of great blog posts. I also read Marc Brooker, Mahesh Balakrishnan, and Phil Eaton, who is very active in getting a lot of discussion going. Finally, there is a wealth of learning in the Jepsen analyses. There will of course be lots more out there.