Jack Vanlightly

Building A "Simple" Distributed System - It's the Logs Stupid

Building A "Simple" Distributed System - It's the Logs Stupid

In previous posts we covered designing the protocol and verifying it with TLA+. Then we designed the implementation with Apache ZooKeeper. In this post we’ll look at a very important prerequisite for testing and release to production - good logging. The links to the rest of the series are at the bottom of this post.

It’s the Logs Stupid

Without good logging, you’re in for a world of pain and wasted hours trying to figure out why something failed. Forget the debugger, put it to one side and embrace logging as part of the development and test process. The logs will be the way in which you can identify what was going on in the environment and in each node at the time of failure. Your code will fail, over and over again, in new and surprising ways until finally towards the end of the development process you start to see it cope with everything you throw at it. We’ll be throwing a lot of nasty behaviour at the code and it will need to handle it.

Building A "Simple" Distributed System - The Implementation

Building A "Simple" Distributed System - The Implementation

In previous posts we’ve identified the requirements for our distributed resource allocation library, including one invariant (No Double Resource Access) that must hold 100% of the time no matter what and another (All Resources Evenly Accessed) that must hold after a successful rebalancing of resources. We documented a protocol that describes how nodes interact with a central registry to achieve the requirements, including how they deal with all conceived failure scenarios. Then we built a TLA+ specification and used the model checker to verify the designed protocol, identifying a defect in the process.

In this post we’ll tackle the implementation and in the next we’ll look at testing.