Building A "Simple" Distributed System - It's the Logs Stupid

In previous posts we covered designing the protocol and verifying it with TLA+. Then we designed the implementation with Apache ZooKeeper. In this post we’ll look at a very important prerequisite for testing and release to production - good logging. The links to the rest of the series are at the bottom of this post.

It’s the Logs Stupid

Without good logging, you’re in for a world of pain and wasted hours trying to figure out why something failed. Forget the debugger, put it to one side and embrace logging as part of the development and test process. The logs will be the way in which you can identify what was going on in the environment and in each node at the time of failure. Your code will fail, over and over again, in new and surprising ways until finally towards the end of the development process you start to see it cope with everything you throw at it. We’ll be throwing a lot of nasty behaviour at the code and it will need to handle it.

What makes “good logging”? We need to be able to see into the system to see the key actions and decisions that went on. Generally speaking that means knowing when, where, who, what and why.

Below are the logs of three nodes during a very short randomized test run that started with six resources. During the 1 minute test, the test runner decided to add a resource, add another resource, shutdown a node and shutdown a second node. FInally the test runner tells the final node to shutdown and ends the test. This was not real test, I just needed an example set of logs. It has the host name omitted.

Node 0 log

10:31:34,760 : INFO :  : Will try to open a new session in 243ms
10:31:35,135 : INFO :  : Initializing zookeeper client paths
10:31:38,433 : INFO : c_0000000000 : Client znode registered with Id c_0000000000
10:31:38,474 : INFO : c_0000000000 : Looking for a next smaller sibling to watch
10:31:38,545 : INFO : c_0000000000 : Becoming coordinator
10:31:38,820 : INFO : c_0000000000 : Have successfully become the coordinator
10:31:38,832 : INFO : c_0000000000 : Coordinator - Rebalancing triggered
10:31:38,933 : INFO : c_0000000000 : Coordinator - Get clients and resources list
10:31:39,148 : INFO : c_0000000000 : Coordinator - Assign resources (res5,res4,res1,res0,res3,res2) to clients (c_0000000002,c_0000000000,c_0000000001)
10:31:39,354 : INFO : c_0000000000 : Coordinator - Putting barriers on resources res4,res3
10:31:39,417 : INFO : c_0000000000 : Coordinator - Put barriers on 2 resources of 2
10:31:39,417 : INFO : c_0000000000 : Coordinator - Invoking OnAssignment with resources res4,res3
10:31:39,419 : INFO : c_0000000000 : Coordinator - OnAssignment complete
10:31:39,426 : INFO : c_0000000000 : Coordinator - Rebalancing complete
10:32:04,849 : INFO : c_0000000000 : Coordinator - KEEPER EVENT SyncConnected - NodeChildrenChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:04,904 : INFO : c_0000000000 : Coordinator - Rebalancing triggered
10:32:05,011 : INFO : c_0000000000 : Coordinator - Get clients and resources list
10:32:05,093 : INFO : c_0000000000 : Coordinator - Assign resources (res5,res4,res6,res1,res0,res3,res2) to clients (c_0000000002,c_0000000000,c_0000000001)
10:32:05,111 : INFO : c_0000000000 : Coordinator - Invoking OnUnassignment actions. Unassigned resources res4,res3
10:32:05,117 : INFO : c_0000000000 : Coordinator - Removing barriers on resources res4,res3
10:32:05,304 : INFO : c_0000000000 : Coordinator - Removed barriers on 2 resources of 2
10:32:05,304 : INFO : c_0000000000 : Coordinator - OnUnassignment complete
10:32:05,304 : INFO : c_0000000000 : Coordinator - Putting barriers on resources res4,res0
10:32:06,672 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res0
10:32:08,718 : INFO : c_0000000000 : Coordinator - Put barriers on 2 resources of 2
10:32:08,718 : INFO : c_0000000000 : Coordinator - Invoking OnAssignment with resources res4,res0
10:32:08,718 : INFO : c_0000000000 : Coordinator - OnAssignment complete
10:32:08,719 : INFO : c_0000000000 : Coordinator - Rebalancing complete
10:32:15,019 : INFO : c_0000000000 : Coordinator - KEEPER EVENT SyncConnected - NodeChildrenChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:26,006 : INFO : c_0000000000 : Coordinator - Rebalancing triggered
10:32:26,082 : INFO : c_0000000000 : Coordinator - Get clients and resources list
10:32:28,020 : INFO : c_0000000000 : Coordinator - Assign resources (res5,res4,res7,res6,res1,res0,res3,res2) to clients (c_0000000002,c_0000000000,c_0000000001)
10:32:30,779 : INFO : c_0000000000 : Coordinator - Invoking OnUnassignment actions. Unassigned resources res4,res0
10:32:30,779 : INFO : c_0000000000 : Coordinator - Removing barriers on resources res4,res0
10:32:30,862 : INFO : c_0000000000 : Coordinator - Removed barriers on 2 resources of 2
10:32:30,862 : INFO : c_0000000000 : Coordinator - OnUnassignment complete
10:32:30,862 : INFO : c_0000000000 : Coordinator - Putting barriers on resources res4,res1,res2
10:32:31,453 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:33,927 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:36,861 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:40,325 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:53,846 : WARN : c_0000000000 : Coordinator - Rebalancing cancelled
10:32:56,811 : INFO : c_0000000000 : The coordinator has exited for reason Cancelled
10:32:56,811 : INFO : c_0000000000 : Client terminating due: cancellation
10:32:56,811 : INFO : c_0000000000 : Client terminated

Node 1 log

10:31:34,790 : INFO :  : Will try to open a new session in 502ms
10:31:37,024 : INFO :  : Initializing zookeeper client paths
10:31:38,911 : INFO : c_0000000001 : Client znode registered with Id c_0000000001
10:31:38,958 : INFO : c_0000000001 : Looking for a next smaller sibling to watch
10:31:39,057 : INFO : c_0000000001 : Becoming a follower
10:31:39,104 : INFO : c_0000000001 : Follower - Set a watch on sibling node /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/clients/c_0000000000
10:31:39,182 : INFO : c_0000000001 : Follower - Set a watch on resources node
10:31:39,182 : INFO : c_0000000001 : Have become a follower, starting follower event loop
10:31:39,290 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:31:40,447 : INFO : c_0000000001 : Follower - Rebalancing triggered
10:31:40,508 : INFO : c_0000000001 : Follower - Putting barriers on resources res1,res2
10:31:40,545 : INFO : c_0000000001 : Follower - Put barriers on 2 resources of 2
10:31:40,545 : INFO : c_0000000001 : Follower - Invoking OnAssignment with resources res1,res2
10:31:40,545 : INFO : c_0000000001 : Follower - OnAssignment complete
10:31:40,545 : INFO : c_0000000001 : Follower - Rebalancing complete
10:32:05,219 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:06,297 : INFO : c_0000000001 : Follower - Rebalancing triggered
10:32:06,496 : INFO : c_0000000001 : Follower - Invoking OnUnassignment actions. Unassigned resources res1,res2
10:32:06,496 : INFO : c_0000000001 : Follower - Removing barriers on resources res1,res2
10:32:06,684 : INFO : c_0000000001 : Follower - Removed barriers on 2 resources of 2
10:32:06,684 : INFO : c_0000000001 : Follower - OnUnassignment complete
10:32:06,721 : INFO : c_0000000001 : Follower - Putting barriers on resources res6,res3
10:32:06,802 : INFO : c_0000000001 : Follower - Put barriers on 2 resources of 2
10:32:06,802 : INFO : c_0000000001 : Follower - Invoking OnAssignment with resources res6,res3
10:32:06,802 : INFO : c_0000000001 : Follower - OnAssignment complete
10:32:06,802 : INFO : c_0000000001 : Follower - Rebalancing complete
10:32:28,774 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:30,797 : INFO : c_0000000001 : Follower - Rebalancing triggered
10:32:30,800 : INFO : c_0000000001 : Follower - Invoking OnUnassignment actions. Unassigned resources res6,res3
10:32:30,800 : INFO : c_0000000001 : Follower - Removing barriers on resources res6,res3
10:32:30,877 : INFO : c_0000000001 : Follower - Removed barriers on 2 resources of 2
10:32:30,877 : INFO : c_0000000001 : Follower - OnUnassignment complete
10:32:30,919 : INFO : c_0000000001 : Follower - Putting barriers on resources res7,res0
10:32:31,456 : INFO : c_0000000001 : Follower - Put barriers on 2 resources of 2
10:32:31,456 : INFO : c_0000000001 : Follower - Invoking OnAssignment with resources res7,res0
10:32:31,456 : INFO : c_0000000001 : Follower - OnAssignment complete
10:32:31,456 : INFO : c_0000000001 : Follower - Rebalancing complete
10:32:56,819 : INFO : c_0000000001 : Follower - KEEPER EVENT Disconnected - None
10:32:56,850 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - None
10:33:02,016 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDeleted - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/clients/c_0000000000
10:33:02,808 : INFO : c_0000000001 : Follower - Invoking OnUnassignment actions. Unassigned resources res7,res0
10:33:02,808 : INFO : c_0000000001 : Follower - Removing barriers on resources res7,res0
10:33:02,845 : INFO : c_0000000001 : Follower - Removed barriers on 2 resources of 2
10:33:02,845 : INFO : c_0000000001 : Follower - OnUnassignment complete
10:33:02,855 : INFO : c_0000000001 : The follower has exited for reason PossibleRoleChange
10:33:02,888 : INFO : c_0000000001 : Looking for a next smaller sibling to watch
10:33:02,920 : INFO : c_0000000001 : Becoming coordinator
10:33:02,998 : INFO : c_0000000001 : Have successfully become the coordinator
10:33:02,999 : INFO : c_0000000001 : Coordinator - Rebalancing triggered
10:33:03,051 : INFO : c_0000000001 : Coordinator - Get clients and resources list
10:33:03,107 : INFO : c_0000000001 : Coordinator - Assign resources (res5,res4,res7,res6,res1,res0,res3,res2) to clients (c_0000000001)
10:33:03,131 : INFO : c_0000000001 : Coordinator - Putting barriers on resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:03,386 : INFO : c_0000000001 : Coordinator - Put barriers on 8 resources of 8
10:33:03,386 : INFO : c_0000000001 : Coordinator - Invoking OnAssignment with resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:03,386 : INFO : c_0000000001 : Coordinator - OnAssignment complete
10:33:03,386 : INFO : c_0000000001 : Coordinator - Rebalancing complete
10:33:08,479 : INFO : c_0000000001 : The coordinator has exited for reason Cancelled
10:33:08,479 : INFO : c_0000000001 : Client terminating due: cancellation
10:33:08,479 : INFO : c_0000000001 : No role - Invoking OnUnassignment actions. Unassigned resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:08,479 : INFO : c_0000000001 : No role - Removing barriers on resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:14,680 : INFO : c_0000000001 : No role - OnUnassignment complete
10:33:14,680 : INFO : c_0000000001 : Client terminated

Node 2 log

10:31:34,825 : INFO :  : Will try to open a new session in 739ms
10:31:37,196 : INFO :  : Initializing zookeeper client paths
10:31:38,455 : INFO : c_0000000002 : Client znode registered with Id c_0000000002
10:31:38,507 : INFO : c_0000000002 : Looking for a next smaller sibling to watch
10:31:38,584 : INFO : c_0000000002 : Becoming a follower
10:31:38,655 : INFO : c_0000000002 : Follower - Set a watch on sibling node /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/clients/c_0000000001
10:31:38,707 : INFO : c_0000000002 : Follower - Set a watch on resources node
10:31:38,709 : INFO : c_0000000002 : Have become a follower, starting follower event loop
10:31:39,230 : INFO : c_0000000002 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:31:39,813 : INFO : c_0000000002 : Follower - Rebalancing triggered
10:31:39,884 : INFO : c_0000000002 : Follower - Putting barriers on resources res5,res0
10:31:40,003 : INFO : c_0000000002 : Follower - Put barriers on 2 resources of 2
10:31:40,003 : INFO : c_0000000002 : Follower - Invoking OnAssignment with resources res5,res0
10:31:40,003 : INFO : c_0000000002 : Follower - OnAssignment complete
10:31:40,006 : INFO : c_0000000002 : Follower - Rebalancing complete
10:32:05,213 : INFO : c_0000000002 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:06,517 : INFO : c_0000000002 : Follower - Rebalancing triggered
10:32:06,526 : INFO : c_0000000002 : Follower - Invoking OnUnassignment actions. Unassigned resources res5,res0
10:32:06,535 : INFO : c_0000000002 : Follower - Removing barriers on resources res5,res0
10:32:06,634 : INFO : c_0000000002 : Follower - Removed barriers on 2 resources of 2
10:32:06,637 : INFO : c_0000000002 : Follower - OnUnassignment complete
10:32:06,706 : INFO : c_0000000002 : Follower - Putting barriers on resources res5,res1,res2
10:32:06,791 : INFO : c_0000000002 : Follower - Put barriers on 3 resources of 3
10:32:06,791 : INFO : c_0000000002 : Follower - Invoking OnAssignment with resources res5,res1,res2
10:32:06,791 : INFO : c_0000000002 : Follower - OnAssignment complete
10:32:06,791 : INFO : c_0000000002 : Follower - Rebalancing complete
10:32:26,116 : INFO : c_0000000002 : Follower - Invoking OnUnassignment actions. Unassigned resources res5,res1,res2
10:32:26,116 : INFO : c_0000000002 : Follower - Removing barriers on resources res5,res1,res2
10:32:26,734 : INFO : c_0000000002 : Follower - Aborted removal of barriers due to cancellation of the client.
10:32:29,766 : INFO : c_0000000002 : Follower - OnUnassignment complete
10:32:31,448 : INFO : c_0000000002 : The follower has exited for reason Cancelled
10:32:31,450 : INFO : c_0000000002 : Client terminating due: cancellation
10:32:31,450 : INFO : c_0000000002 : Client terminated

Each log clearly describes the triggering Apache ZooKeeper events, then the actions and decisions made by each node.

From node 0’s perspective, we see it become the coordinator (leader), see ZooKeeper events related to the resource and client changes that trigger multiple rebalancings. What is interesting is that we see that the node seems to get blocked in the final rebalancing, waiting for node 2 to release its barrier on resource 1.

From node 1’s perspective, it starts up, does not have the smallest node id and places a watch on the coordinator and resources znode. It gets notified by ZooKeeper of changes (KEEPER EVENT lines) and reacts to those notifications according to the protocol. The final actions are it responding to a call to the StopAsync method which is logged as cancellation. It safely shutsdown, calling OnUnassigned and removing its barriers.

From node 2’s perspective, it also becomes a follower and has a similar story to node 1. The difference is that it was told to shutdown mid-rebalancing and did not remove its barriers. This is what caused node 0 to get blocked during its final rebalancing. If node 0 had not also been told to shutdown it would have eventually have completed its rebalancing as the barriers belonging to node 2 would have been removed by ZooKeeper. Barriers are ephemeral nodes and automatically removed once a node’s session has expired, but the test did not wait for that. This exposes a probable improvement we could make - once cancellation has been issued, still allow barriers to be removed in order to reduce the impact of cancellation during rebalancings.

It’s when we see the logs combined that the series of events is easier to understand. If your nodes are deployed to different hosts/containers, then you’ll need to gather the logs and interleave them into a single log. If you run all nodes as part of a single test process, then you get this automatically.

10:31:34,760 : INFO :  : Will try to open a new session in 243ms
10:31:34,790 : INFO :  : Will try to open a new session in 502ms
10:31:34,825 : INFO :  : Will try to open a new session in 739ms
10:31:35,135 : INFO :  : Initializing zookeeper client paths
10:31:37,024 : INFO :  : Initializing zookeeper client paths
10:31:37,196 : INFO :  : Initializing zookeeper client paths
10:31:38,433 : INFO : c_0000000000 : Client znode registered with Id c_0000000000
10:31:38,455 : INFO : c_0000000002 : Client znode registered with Id c_0000000002
10:31:38,474 : INFO : c_0000000000 : Looking for a next smaller sibling to watch
10:31:38,507 : INFO : c_0000000002 : Looking for a next smaller sibling to watch
10:31:38,545 : INFO : c_0000000000 : Becoming coordinator
10:31:38,584 : INFO : c_0000000002 : Becoming a follower
10:31:38,655 : INFO : c_0000000002 : Follower - Set a watch on sibling node /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/clients/c_0000000001
10:31:38,707 : INFO : c_0000000002 : Follower - Set a watch on resources node
10:31:38,709 : INFO : c_0000000002 : Have become a follower, starting follower event loop
10:31:38,820 : INFO : c_0000000000 : Have successfully become the coordinator
10:31:38,832 : INFO : c_0000000000 : Coordinator - Rebalancing triggered
10:31:38,911 : INFO : c_0000000001 : Client znode registered with Id c_0000000001
10:31:38,933 : INFO : c_0000000000 : Coordinator - Get clients and resources list
10:31:38,958 : INFO : c_0000000001 : Looking for a next smaller sibling to watch
10:31:39,057 : INFO : c_0000000001 : Becoming a follower
10:31:39,104 : INFO : c_0000000001 : Follower - Set a watch on sibling node /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/clients/c_0000000000
10:31:39,148 : INFO : c_0000000000 : Coordinator - Assign resources (res5,res4,res1,res0,res3,res2) to clients (c_0000000002,c_0000000000,c_0000000001)
10:31:39,182 : INFO : c_0000000001 : Follower - Set a watch on resources node
10:31:39,182 : INFO : c_0000000001 : Have become a follower, starting follower event loop
10:31:39,230 : INFO : c_0000000002 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:31:39,290 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:31:39,354 : INFO : c_0000000000 : Coordinator - Putting barriers on resources res4,res3
10:31:39,417 : INFO : c_0000000000 : Coordinator - Put barriers on 2 resources of 2
10:31:39,417 : INFO : c_0000000000 : Coordinator - Invoking OnAssignment with resources res4,res3
10:31:39,419 : INFO : c_0000000000 : Coordinator - OnAssignment complete
10:31:39,426 : INFO : c_0000000000 : Coordinator - Rebalancing complete
10:31:39,813 : INFO : c_0000000002 : Follower - Rebalancing triggered
10:31:39,884 : INFO : c_0000000002 : Follower - Putting barriers on resources res5,res0
10:31:40,003 : INFO : c_0000000002 : Follower - Put barriers on 2 resources of 2
10:31:40,003 : INFO : c_0000000002 : Follower - Invoking OnAssignment with resources res5,res0
10:31:40,003 : INFO : c_0000000002 : Follower - OnAssignment complete
10:31:40,006 : INFO : c_0000000002 : Follower - Rebalancing complete
10:31:40,447 : INFO : c_0000000001 : Follower - Rebalancing triggered
10:31:40,508 : INFO : c_0000000001 : Follower - Putting barriers on resources res1,res2
10:31:40,545 : INFO : c_0000000001 : Follower - Put barriers on 2 resources of 2
10:31:40,545 : INFO : c_0000000001 : Follower - Invoking OnAssignment with resources res1,res2
10:31:40,545 : INFO : c_0000000001 : Follower - OnAssignment complete
10:31:40,545 : INFO : c_0000000001 : Follower - Rebalancing complete
10:32:04,849 : INFO : c_0000000000 : Coordinator - KEEPER EVENT SyncConnected - NodeChildrenChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:04,904 : INFO : c_0000000000 : Coordinator - Rebalancing triggered
10:32:05,011 : INFO : c_0000000000 : Coordinator - Get clients and resources list
10:32:05,093 : INFO : c_0000000000 : Coordinator - Assign resources (res5,res4,res6,res1,res0,res3,res2) to clients (c_0000000002,c_0000000000,c_0000000001)
10:32:05,111 : INFO : c_0000000000 : Coordinator - Invoking OnUnassignment actions. Unassigned resources res4,res3
10:32:05,117 : INFO : c_0000000000 : Coordinator - Removing barriers on resources res4,res3
10:32:05,213 : INFO : c_0000000002 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:05,219 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:05,304 : INFO : c_0000000000 : Coordinator - Removed barriers on 2 resources of 2
10:32:05,304 : INFO : c_0000000000 : Coordinator - OnUnassignment complete
10:32:05,304 : INFO : c_0000000000 : Coordinator - Putting barriers on resources res4,res0
10:32:06,297 : INFO : c_0000000001 : Follower - Rebalancing triggered
10:32:06,496 : INFO : c_0000000001 : Follower - Invoking OnUnassignment actions. Unassigned resources res1,res2
10:32:06,496 : INFO : c_0000000001 : Follower - Removing barriers on resources res1,res2
10:32:06,517 : INFO : c_0000000002 : Follower - Rebalancing triggered
10:32:06,526 : INFO : c_0000000002 : Follower - Invoking OnUnassignment actions. Unassigned resources res5,res0
10:32:06,535 : INFO : c_0000000002 : Follower - Removing barriers on resources res5,res0
10:32:06,634 : INFO : c_0000000002 : Follower - Removed barriers on 2 resources of 2
10:32:06,637 : INFO : c_0000000002 : Follower - OnUnassignment complete
10:32:06,672 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res0
10:32:06,684 : INFO : c_0000000001 : Follower - Removed barriers on 2 resources of 2
10:32:06,684 : INFO : c_0000000001 : Follower - OnUnassignment complete
10:32:06,706 : INFO : c_0000000002 : Follower - Putting barriers on resources res5,res1,res2
10:32:06,721 : INFO : c_0000000001 : Follower - Putting barriers on resources res6,res3
10:32:06,791 : INFO : c_0000000002 : Follower - Put barriers on 3 resources of 3
10:32:06,791 : INFO : c_0000000002 : Follower - Invoking OnAssignment with resources res5,res1,res2
10:32:06,791 : INFO : c_0000000002 : Follower - OnAssignment complete
10:32:06,791 : INFO : c_0000000002 : Follower - Rebalancing complete
10:32:06,802 : INFO : c_0000000001 : Follower - Put barriers on 2 resources of 2
10:32:06,802 : INFO : c_0000000001 : Follower - Invoking OnAssignment with resources res6,res3
10:32:06,802 : INFO : c_0000000001 : Follower - OnAssignment complete
10:32:06,802 : INFO : c_0000000001 : Follower - Rebalancing complete
10:32:08,718 : INFO : c_0000000000 : Coordinator - Put barriers on 2 resources of 2
10:32:08,718 : INFO : c_0000000000 : Coordinator - Invoking OnAssignment with resources res4,res0
10:32:08,718 : INFO : c_0000000000 : Coordinator - OnAssignment complete
10:32:08,719 : INFO : c_0000000000 : Coordinator - Rebalancing complete
10:32:15,019 : INFO : c_0000000000 : Coordinator - KEEPER EVENT SyncConnected - NodeChildrenChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:26,006 : INFO : c_0000000000 : Coordinator - Rebalancing triggered
10:32:26,082 : INFO : c_0000000000 : Coordinator - Get clients and resources list
10:32:26,116 : INFO : c_0000000002 : Follower - Invoking OnUnassignment actions. Unassigned resources res5,res1,res2
10:32:26,116 : INFO : c_0000000002 : Follower - Removing barriers on resources res5,res1,res2
10:32:26,734 : INFO : c_0000000002 : Follower - Aborted removal of barriers due to cancellation of the client.
10:32:28,020 : INFO : c_0000000000 : Coordinator - Assign resources (res5,res4,res7,res6,res1,res0,res3,res2) to clients (c_0000000002,c_0000000000,c_0000000001)
10:32:28,774 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:29,766 : INFO : c_0000000002 : Follower - OnUnassignment complete
10:32:30,779 : INFO : c_0000000000 : Coordinator - Invoking OnUnassignment actions. Unassigned resources res4,res0
10:32:30,779 : INFO : c_0000000000 : Coordinator - Removing barriers on resources res4,res0
10:32:30,797 : INFO : c_0000000001 : Follower - Rebalancing triggered
10:32:30,800 : INFO : c_0000000001 : Follower - Invoking OnUnassignment actions. Unassigned resources res6,res3
10:32:30,800 : INFO : c_0000000001 : Follower - Removing barriers on resources res6,res3
10:32:30,862 : INFO : c_0000000000 : Coordinator - Removed barriers on 2 resources of 2
10:32:30,862 : INFO : c_0000000000 : Coordinator - OnUnassignment complete
10:32:30,862 : INFO : c_0000000000 : Coordinator - Putting barriers on resources res4,res1,res2
10:32:30,877 : INFO : c_0000000001 : Follower - Removed barriers on 2 resources of 2
10:32:30,877 : INFO : c_0000000001 : Follower - OnUnassignment complete
10:32:30,919 : INFO : c_0000000001 : Follower - Putting barriers on resources res7,res0
10:32:31,448 : INFO : c_0000000002 : The follower has exited for reason Cancelled
10:32:31,450 : INFO : c_0000000002 : Client terminating due: cancellation
10:32:31,450 : INFO : c_0000000002 : Client terminated
10:32:31,453 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:31,456 : INFO : c_0000000001 : Follower - Put barriers on 2 resources of 2
10:32:31,456 : INFO : c_0000000001 : Follower - Invoking OnAssignment with resources res7,res0
10:32:31,456 : INFO : c_0000000001 : Follower - OnAssignment complete
10:32:31,456 : INFO : c_0000000001 : Follower - Rebalancing complete
10:32:33,927 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:36,861 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:40,325 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:53,846 : WARN : c_0000000000 : Coordinator - Rebalancing cancelled
10:32:56,811 : INFO : c_0000000000 : The coordinator has exited for reason Cancelled
10:32:56,811 : INFO : c_0000000000 : Client terminating due: cancellation
10:32:56,811 : INFO : c_0000000000 : Client terminated
10:32:56,819 : INFO : c_0000000001 : Follower - KEEPER EVENT Disconnected - None
10:32:56,850 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - None
10:33:02,016 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDeleted - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/clients/c_0000000000
10:33:02,808 : INFO : c_0000000001 : Follower - Invoking OnUnassignment actions. Unassigned resources res7,res0
10:33:02,808 : INFO : c_0000000001 : Follower - Removing barriers on resources res7,res0
10:33:02,845 : INFO : c_0000000001 : Follower - Removed barriers on 2 resources of 2
10:33:02,845 : INFO : c_0000000001 : Follower - OnUnassignment complete
10:33:02,855 : INFO : c_0000000001 : The follower has exited for reason PossibleRoleChange
10:33:02,888 : INFO : c_0000000001 : Looking for a next smaller sibling to watch
10:33:02,920 : INFO : c_0000000001 : Becoming coordinator
10:33:02,998 : INFO : c_0000000001 : Have successfully become the coordinator
10:33:02,999 : INFO : c_0000000001 : Coordinator - Rebalancing triggered
10:33:03,051 : INFO : c_0000000001 : Coordinator - Get clients and resources list
10:33:03,107 : INFO : c_0000000001 : Coordinator - Assign resources (res5,res4,res7,res6,res1,res0,res3,res2) to clients (c_0000000001)
10:33:03,131 : INFO : c_0000000001 : Coordinator - Putting barriers on resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:03,386 : INFO : c_0000000001 : Coordinator - Put barriers on 8 resources of 8
10:33:03,386 : INFO : c_0000000001 : Coordinator - Invoking OnAssignment with resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:03,386 : INFO : c_0000000001 : Coordinator - OnAssignment complete
10:33:03,386 : INFO : c_0000000001 : Coordinator - Rebalancing complete
10:33:08,479 : INFO : c_0000000001 : The coordinator has exited for reason Cancelled
10:33:08,479 : INFO : c_0000000001 : Client terminating due: cancellation
10:33:08,479 : INFO : c_0000000001 : No role - Invoking OnUnassignment actions. Unassigned resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:08,479 : INFO : c_0000000001 : No role - Removing barriers on resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:14,680 : INFO : c_0000000001 : No role - OnUnassignment complete
10:33:14,680 : INFO : c_0000000001 : Client terminated

The unified log makes it easy to understand the interactions of the nodes and why they behaved like they did. But we’re still missing information. The test runner is adding/removing clients and resources. It might also be injecting faults. It would be nice to see those in the logs as well.

10:31:34,760 : INFO :  : Will try to open a new session in 243ms
10:31:34,790 : INFO :  : Will try to open a new session in 502ms
10:31:34,825 : INFO :  : Will try to open a new session in 739ms
10:31:35,135 : INFO :  : Initializing zookeeper client paths
10:31:37,024 : INFO :  : Initializing zookeeper client paths
10:31:37,196 : INFO :  : Initializing zookeeper client paths
10:31:38,433 : INFO : c_0000000000 : Client znode registered with Id c_0000000000
10:31:38,455 : INFO : c_0000000002 : Client znode registered with Id c_0000000002
10:31:38,474 : INFO : c_0000000000 : Looking for a next smaller sibling to watch
10:31:38,507 : INFO : c_0000000002 : Looking for a next smaller sibling to watch
10:31:38,545 : INFO : c_0000000000 : Becoming coordinator
10:31:38,584 : INFO : c_0000000002 : Becoming a follower
10:31:38,655 : INFO : c_0000000002 : Follower - Set a watch on sibling node /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/clients/c_0000000001
10:31:38,707 : INFO : c_0000000002 : Follower - Set a watch on resources node
10:31:38,709 : INFO : c_0000000002 : Have become a follower, starting follower event loop
10:31:38,820 : INFO : c_0000000000 : Have successfully become the coordinator
10:31:38,832 : INFO : c_0000000000 : Coordinator - Rebalancing triggered
10:31:38,911 : INFO : c_0000000001 : Client znode registered with Id c_0000000001
10:31:38,933 : INFO : c_0000000000 : Coordinator - Get clients and resources list
10:31:38,958 : INFO : c_0000000001 : Looking for a next smaller sibling to watch
10:31:39,057 : INFO : c_0000000001 : Becoming a follower
10:31:39,104 : INFO : c_0000000001 : Follower - Set a watch on sibling node /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/clients/c_0000000000
10:31:39,148 : INFO : c_0000000000 : Coordinator - Assign resources (res5,res4,res1,res0,res3,res2) to clients (c_0000000002,c_0000000000,c_0000000001)
10:31:39,182 : INFO : c_0000000001 : Follower - Set a watch on resources node
10:31:39,182 : INFO : c_0000000001 : Have become a follower, starting follower event loop
10:31:39,230 : INFO : c_0000000002 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:31:39,290 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:31:39,354 : INFO : c_0000000000 : Coordinator - Putting barriers on resources res4,res3
10:31:39,417 : INFO : c_0000000000 : Coordinator - Put barriers on 2 resources of 2
10:31:39,417 : INFO : c_0000000000 : Coordinator - Invoking OnAssignment with resources res4,res3
10:31:39,419 : INFO : c_0000000000 : Coordinator - OnAssignment complete
10:31:39,426 : INFO : c_0000000000 : Coordinator - Rebalancing complete
10:31:39,813 : INFO : c_0000000002 : Follower - Rebalancing triggered
10:31:39,884 : INFO : c_0000000002 : Follower - Putting barriers on resources res5,res0
10:31:40,003 : INFO : c_0000000002 : Follower - Put barriers on 2 resources of 2
10:31:40,003 : INFO : c_0000000002 : Follower - Invoking OnAssignment with resources res5,res0
10:31:40,003 : INFO : c_0000000002 : Follower - OnAssignment complete
10:31:40,006 : INFO : c_0000000002 : Follower - Rebalancing complete
10:31:40,447 : INFO : c_0000000001 : Follower - Rebalancing triggered
10:31:40,508 : INFO : c_0000000001 : Follower - Putting barriers on resources res1,res2
10:31:40,545 : INFO : c_0000000001 : Follower - Put barriers on 2 resources of 2
10:31:40,545 : INFO : c_0000000001 : Follower - Invoking OnAssignment with resources res1,res2
10:31:40,545 : INFO : c_0000000001 : Follower - OnAssignment complete
10:31:40,545 : INFO : c_0000000001 : Follower - Rebalancing complete
10:32:04,825 : INFO : TEST RUNNER : Test 0
10:32:04,827 : INFO : TEST RUNNER : Adding a resource
10:32:04,849 : INFO : c_0000000000 : Coordinator - KEEPER EVENT SyncConnected - NodeChildrenChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:04,904 : INFO : c_0000000000 : Coordinator - Rebalancing triggered
10:32:04,968 : INFO : TEST RUNNER : Added a resource
10:32:05,011 : INFO : c_0000000000 : Coordinator - Get clients and resources list
10:32:05,093 : INFO : c_0000000000 : Coordinator - Assign resources (res5,res4,res6,res1,res0,res3,res2) to clients (c_0000000002,c_0000000000,c_0000000001)
10:32:05,111 : INFO : c_0000000000 : Coordinator - Invoking OnUnassignment actions. Unassigned resources res4,res3
10:32:05,117 : INFO : c_0000000000 : Coordinator - Removing barriers on resources res4,res3
10:32:05,213 : INFO : c_0000000002 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:05,219 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:05,304 : INFO : c_0000000000 : Coordinator - Removed barriers on 2 resources of 2
10:32:05,304 : INFO : c_0000000000 : Coordinator - OnUnassignment complete
10:32:05,304 : INFO : c_0000000000 : Coordinator - Putting barriers on resources res4,res0
10:32:06,297 : INFO : c_0000000001 : Follower - Rebalancing triggered
10:32:06,496 : INFO : c_0000000001 : Follower - Invoking OnUnassignment actions. Unassigned resources res1,res2
10:32:06,496 : INFO : c_0000000001 : Follower - Removing barriers on resources res1,res2
10:32:06,517 : INFO : c_0000000002 : Follower - Rebalancing triggered
10:32:06,526 : INFO : c_0000000002 : Follower - Invoking OnUnassignment actions. Unassigned resources res5,res0
10:32:06,535 : INFO : c_0000000002 : Follower - Removing barriers on resources res5,res0
10:32:06,634 : INFO : c_0000000002 : Follower - Removed barriers on 2 resources of 2
10:32:06,637 : INFO : c_0000000002 : Follower - OnUnassignment complete
10:32:06,672 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res0
10:32:06,684 : INFO : c_0000000001 : Follower - Removed barriers on 2 resources of 2
10:32:06,684 : INFO : c_0000000001 : Follower - OnUnassignment complete
10:32:06,706 : INFO : c_0000000002 : Follower - Putting barriers on resources res5,res1,res2
10:32:06,721 : INFO : c_0000000001 : Follower - Putting barriers on resources res6,res3
10:32:06,791 : INFO : c_0000000002 : Follower - Put barriers on 3 resources of 3
10:32:06,791 : INFO : c_0000000002 : Follower - Invoking OnAssignment with resources res5,res1,res2
10:32:06,791 : INFO : c_0000000002 : Follower - OnAssignment complete
10:32:06,791 : INFO : c_0000000002 : Follower - Rebalancing complete
10:32:06,802 : INFO : c_0000000001 : Follower - Put barriers on 2 resources of 2
10:32:06,802 : INFO : c_0000000001 : Follower - Invoking OnAssignment with resources res6,res3
10:32:06,802 : INFO : c_0000000001 : Follower - OnAssignment complete
10:32:06,802 : INFO : c_0000000001 : Follower - Rebalancing complete
10:32:08,718 : INFO : c_0000000000 : Coordinator - Put barriers on 2 resources of 2
10:32:08,718 : INFO : c_0000000000 : Coordinator - Invoking OnAssignment with resources res4,res0
10:32:08,718 : INFO : c_0000000000 : Coordinator - OnAssignment complete
10:32:08,719 : INFO : c_0000000000 : Coordinator - Rebalancing complete
10:32:14,971 : INFO : TEST RUNNER : Test 1
10:32:14,971 : INFO : TEST RUNNER : Adding a resource
10:32:15,019 : INFO : c_0000000000 : Coordinator - KEEPER EVENT SyncConnected - NodeChildrenChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:15,123 : INFO : TEST RUNNER : Added a resource
10:32:25,125 : INFO : TEST RUNNER : Test 2
10:32:25,127 : INFO : TEST RUNNER : Stopping client
10:32:26,006 : INFO : c_0000000000 : Coordinator - Rebalancing triggered
10:32:26,082 : INFO : c_0000000000 : Coordinator - Get clients and resources list
10:32:26,116 : INFO : c_0000000002 : Follower - Invoking OnUnassignment actions. Unassigned resources res5,res1,res2
10:32:26,116 : INFO : c_0000000002 : Follower - Removing barriers on resources res5,res1,res2
10:32:26,734 : INFO : c_0000000002 : Follower - Aborted removal of barriers due to cancellation of the client.
10:32:28,020 : INFO : c_0000000000 : Coordinator - Assign resources (res5,res4,res7,res6,res1,res0,res3,res2) to clients (c_0000000002,c_0000000000,c_0000000001)
10:32:28,774 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDataChanged - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/resources
10:32:29,766 : INFO : c_0000000002 : Follower - OnUnassignment complete
10:32:30,779 : INFO : c_0000000000 : Coordinator - Invoking OnUnassignment actions. Unassigned resources res4,res0
10:32:30,779 : INFO : c_0000000000 : Coordinator - Removing barriers on resources res4,res0
10:32:30,797 : INFO : c_0000000001 : Follower - Rebalancing triggered
10:32:30,800 : INFO : c_0000000001 : Follower - Invoking OnUnassignment actions. Unassigned resources res6,res3
10:32:30,800 : INFO : c_0000000001 : Follower - Removing barriers on resources res6,res3
10:32:30,862 : INFO : c_0000000000 : Coordinator - Removed barriers on 2 resources of 2
10:32:30,862 : INFO : c_0000000000 : Coordinator - OnUnassignment complete
10:32:30,862 : INFO : c_0000000000 : Coordinator - Putting barriers on resources res4,res1,res2
10:32:30,877 : INFO : c_0000000001 : Follower - Removed barriers on 2 resources of 2
10:32:30,877 : INFO : c_0000000001 : Follower - OnUnassignment complete
10:32:30,919 : INFO : c_0000000001 : Follower - Putting barriers on resources res7,res0
10:32:31,448 : INFO : c_0000000002 : The follower has exited for reason Cancelled
10:32:31,450 : INFO : c_0000000002 : Client terminating due: cancellation
10:32:31,450 : INFO : c_0000000002 : Client terminated
10:32:31,453 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:31,456 : INFO : c_0000000001 : Follower - Put barriers on 2 resources of 2
10:32:31,456 : INFO : c_0000000001 : Follower - Invoking OnAssignment with resources res7,res0
10:32:31,456 : INFO : c_0000000001 : Follower - OnAssignment complete
10:32:31,456 : INFO : c_0000000001 : Follower - Rebalancing complete
10:32:31,459 : INFO : TEST RUNNER : Stopped client
10:32:33,927 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:36,861 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:40,325 : INFO : c_0000000000 : Waiting for c_0000000002 to release its barrier on res1
10:32:41,460 : INFO : TEST RUNNER : Test 3
10:32:41,460 : INFO : TEST RUNNER : Stopping client
10:32:53,846 : WARN : c_0000000000 : Coordinator - Rebalancing cancelled
10:32:56,811 : INFO : c_0000000000 : The coordinator has exited for reason Cancelled
10:32:56,811 : INFO : c_0000000000 : Client terminating due: cancellation
10:32:56,811 : INFO : c_0000000000 : Client terminated
10:32:56,811 : INFO : TEST RUNNER : Stopped client
10:32:56,819 : INFO : c_0000000001 : Follower - KEEPER EVENT Disconnected - None
10:32:56,850 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - None
10:33:02,016 : INFO : c_0000000001 : Follower - KEEPER EVENT SyncConnected - NodeDeleted - /rebalanser/722bec51-5e25-47d2-94b6-7307a67b318e/clients/c_0000000000
10:33:02,808 : INFO : c_0000000001 : Follower - Invoking OnUnassignment actions. Unassigned resources res7,res0
10:33:02,808 : INFO : c_0000000001 : Follower - Removing barriers on resources res7,res0
10:33:02,845 : INFO : c_0000000001 : Follower - Removed barriers on 2 resources of 2
10:33:02,845 : INFO : c_0000000001 : Follower - OnUnassignment complete
10:33:02,855 : INFO : c_0000000001 : The follower has exited for reason PossibleRoleChange
10:33:02,888 : INFO : c_0000000001 : Looking for a next smaller sibling to watch
10:33:02,920 : INFO : c_0000000001 : Becoming coordinator
10:33:02,998 : INFO : c_0000000001 : Have successfully become the coordinator
10:33:02,999 : INFO : c_0000000001 : Coordinator - Rebalancing triggered
10:33:03,051 : INFO : c_0000000001 : Coordinator - Get clients and resources list
10:33:03,107 : INFO : c_0000000001 : Coordinator - Assign resources (res5,res4,res7,res6,res1,res0,res3,res2) to clients (c_0000000001)
10:33:03,131 : INFO : c_0000000001 : Coordinator - Putting barriers on resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:03,386 : INFO : c_0000000001 : Coordinator - Put barriers on 8 resources of 8
10:33:03,386 : INFO : c_0000000001 : Coordinator - Invoking OnAssignment with resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:03,386 : INFO : c_0000000001 : Coordinator - OnAssignment complete
10:33:03,386 : INFO : c_0000000001 : Coordinator - Rebalancing complete
10:33:08,479 : INFO : c_0000000001 : The coordinator has exited for reason Cancelled
10:33:08,479 : INFO : c_0000000001 : Client terminating due: cancellation
10:33:08,479 : INFO : c_0000000001 : No role - Invoking OnUnassignment actions. Unassigned resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:08,479 : INFO : c_0000000001 : No role - Removing barriers on resources res5,res4,res7,res6,res1,res0,res3,res2
10:33:14,680 : INFO : c_0000000001 : No role - OnUnassignment complete
10:33:14,680 : INFO : c_0000000001 : Client terminated
TEST COMPLETE

This gives us even more perspective on what was going on. In fact, now we can see something seemly strange went on around TEST 1 and 2. Perhaps TEST is the wrong word, it should be ACTION perhaps. Anyway, test 1 involves adding a resource and this should trigger a rebalancing but no rebalancing was triggered. We see that with Test 0, we also add a resource and the rebalancing is immediate, so what is going on?

Each test is likely to test some kind of scenario and/or configuration. In this case we have a minimum rebalancing period of 20 seconds. Test 1 falls within that period and so no rebalancing is triggered. Internally, the coordinator is simply reposting the trigger event to itself and each second measures the time since the last rebalancing until it is time. But it would be good to know that is what is happening from the outside, we should add a log line similar to “Coordinator : Rebalancing event detected, delaying due to minimum rebalancing period”.

It is a good idea to include the test configuration as part of the logging of the test. So our final reconstituted log may look like:

10:31:31,889 Test run with rebalancing mode:ResourceBarrier
10:31:31,917 Test configuration:
{
  "GroupName": "722bec51-5e25-47d2-94b6-7307a67b318e",
  "OutputPath": "/home/jack/tmp/rebalanser-zk/output-722bec51-5e25-47d2-94b6-7307a67b318e.log",
  "SessionTimeout": "00:00:20",
  "ConnectTimeout": "00:00:20",
  "OnAssignmentDelay": "00:00:00",
  "MinimumRebalancingInterval": "00:00:20",
  "StartUpClientInterval": "00:00:00",
  "ClientCount": 3,
  "ResourceCount": 6,
  "TestDuration": "00:01:00",
  "CheckType": 1,
  "RandomiseInterval": false,
  "MaxInterval": "00:00:10",
  ... there are more configurations
}
... the combined log of the three nodes

Now we can easily review the test configurations along with the logs themselves.

Logging and Testing

When a test case fails, add that to the log. It will likely be the last entry in the log. If you are automating your tests, then gather the logs of various nodes, the test runner and combine them into a single log for review. When a test fails it may be simple or it may be hideously complex. We’re not talking about unit testing a single method with a single assert. We’re testing the system as a whole and it may be a simple crafted test or a long running randomized test with fault injection. Regardless of the type, the logging needs to allow you to reconsitute the environmental events, the actions and the decisions of each node.

You might also consider capturing host/container metrics such as CPU, disk IO, network IO, memory Usage etc that could be useful for correlation.

Reviewing Behaviour Via the Logs

Sometimes a suite of tests will pass, but you review the logs and you see something strange. May be your test missed something, or may be there is an optimization to be made. I have found a surprising number of issues by simply reading the logs meticulously. One example is what we saw in the logs above. When cancellation is issued, all further communications with ZooKeeper are aborted, raising a ZkOperationCancelledException if a request to ZooKeeper is attempted. But this could cause a rebalancing to get stalled for up to a minute if a node is shutdown mid rebalancing. It isn’t unsafe behaviour but it is not optimal.

Allow Hooks Into Your Logging

Make sure that you can customize the log output. Choosing the verbosity of course and even what/where data is logged. For example, when running integration tests, a special test logger is injected that adds the test configuration to the log, writes the logs to a specific directory where they can be collected and built into a report.

Log Levels

Try to avoid writing an ERROR entry when it is not one. Apache ZooKeeper is great a logging errors that are not really errors. Perhaps its a philosophical thing, but I treat an error as something exceptional, a bad thing happened. Logging an error because a connection failed, a ZooKeeper session expired etc will lead to confusion. These events are to be treated as normal behaviour. Yes a connection failed, so we reopened a new one, that is INFO.

An example of a real error in Rebalanser would be if we tried to create a barrier, but the parent barrier znode does not exist. That is an error as it should not happen and indicates something has gone awry in the system.

Another way of thinking about it is that there will be human operators of your software. The logs are for them as well, and when they see ERROR they see that as an indication to them that something has gone wrong in the system.

If your project is open source and used by others, then logging ERROR for normal situations will confuse your users and cause pointless tickets to be opened. So do yourself a favour, be careful what you log as an error.

Final Thoughts

Building a distributed system is hard and even simple ones will fail in unexpected ways. The logs will be the main clue as to what caused a failure. Whether you are developing the core product, testing it or resolvings bugs, the logs will be your goto resource. You are a detective and the code is the crime scene, a detective needs all the evidence he/she can get. Good logs will greatly speed up your investigation.

The way to making great logs is to depend on the logs during the development and test process. The debugger, if you use one, may be comfortable when running the logic of a single node, but it doesn’t help you write log entries that give insight into what a node was thinking when it did something.

What is great about using logging in this way is that it makes issue resolution in production a lot simpler. If you get an issue reported, then request the logs and perform the same process that you are accustomed to during development and testing. You might not have logs related to the wider environment, but by then you should be used to troubleshooting from only your logs.

We spent a bit of time with logging, which may or may not have surprised you, given I promised a post on testing. But I hope I’ve convinced you that without a good logging strategy you’re in deep water.

Next post will be about testing. Testing is a wide and massive subject. There are countless tools and techniques and a fair amount of controversy at the moment regarding unit vs integration testing. Luckily for me, the unit vs integration test debate is not so relevant in distributed systems as much of the bad behaviour is emergent where only system testing can give real confidence.

Links to the rest of this series:

Banner image credit: G. Brammer/ESO . Link to image.