Building A "Simple" Distributed System - The Protocol

In the last post we covered what our distributed resource allocation library, Rebalanser, should do. In this post we’ll look at a protocol that could achieve those requirements, always respecting our invariants (described in the last post).

A protocol is basically a set of rules which govern how each node in a Rebalanser group acts in order to achieve the desired behaviours. Each node must communicate with the others in such a way that it can achieve consensus about the resource allocations and also guarantee that it does not start accessing a resource until another node has stopped accessing it.

Decision one was not to implement an algorithm like Raft myself. Rebalanser is too simple to warrant the implementation of Raft and it can simply use a consensus/meta-data service like Apache ZooKeeper, Etcd or Consul instead. Using a service like this solves a few problems:

vastly simplified algorithm
node discovery
simpler network partition modelling

So our protocol has N nodes and a single registry. The registry is the centralized meta-data and consensus service.

I will now describe one of the protocols that could work. There are three different protocols that I identified and I have implemented two of them. The simplest is the one I describe in this post.

Protocol - Leader-based Resource Barrier

Decision 1 - Leader/follower.

Given that leader election is so simple with Apache ZooKeeper, Etcd and Consul, making one node be the leader does not add too much additional complexity. The benefits a leader brings are:

triggering a rebalancing is simpler. The leader just needs to watch for changes in the registry and then notify the other nodes of new allocations.
the resource allocation code is simpler, as the leader simply tells the followers what resources they get
adding minimum time limits between rebalancings is simpler, as the leader, who has detected the need for a rebalancing, simply holds off until the minimum period has passed

Decision 2 - Barriers on Resources

The idea is that before a node A can invoke the OnStart event in its host application, it must place a barrier on each of the resources it has been allocated. If node B still has a barrier on that resource, then node A must wait until it removes the barrier. When a rebalancing occurs and the leader has notified all nodes of their new allocations, each node performs the following actions:

Call OnStop in host application. The application code programmer should have code in their event handler that stops accessing all allocated resources. Note that we wait for the OnStop event handler to finish.
Remove all barriers from current resources.
Add barriers to new resources.
Call OnStart in host application which will start the application’s resource accessing logic.

This way there can be no deadlock and there can be no concurrent access of a single resource by two applications.

Fig 1. The registry stores all meta-data about a Rebalanser group and acts as a communication medium between nodes.

When a node starts, it registers itself with the Registry and gets an Id. It then becomes either a leader or a follower. Leaders monitor the nodes and resources and trigger a rebalancing when a change occurs. A rebalancing is simply where the leader calculates an even balancing of resources over the current nodes and then writes that map back to the Registry. Each follower is monitoring the allocations in the Registry and when it changes it starts the four step process of stopping and starting resource access.

When node A becomes leader it increments the term which is a monotonic counter, and monitors it for changes. If for any reason the term is modified, it means that another node believes it is leader. Then node A reverts to being a follower.

Fig 2. The interactions between nodes and the registry

Fig 3. A flow diagram that describes the sequence of actions. Failure can occur at any moment and is described in a separate diagram.

Note that:

the Monitor Trigger step is an event that can occur at any time. This makes each node a concurrent system in itself.
all yellow steps are abortable rebalancing steps. We never abort OnStart and OnStop steps which will execute application code. We can only recommend that programmers limit the amount of work that goes into their event handlers. Also that programmers MUST ensure that resource access has stopped at the end of OnStop event handler. If the programmer makes it an asynchronous execution then Invariant 1 could be violated.

Handling Failure

At any moment there can be a failure of a node, the network or the registry. The invariants must hold in all situations. Some extra rules that nodes must adher to are required.

Unreachable

The registry will remove a node id after a period of time without a keep alive from the node. When that happens its barriers are also removed. This will trigger a rebalancing and another node will get allocated the resources of the missing node. If the node failed and is no longer accessing the resources then no problem, but if the problem is that the missing node is on the other side of a network partition, and is still accessing the resource, then we have a problem.

The only way to avoid this is to ensure that the node detects that the Registry is unreachable and stops resource access before the registry removes the node. All the protocol need dictate is that this is the case, not how that is achieved.

Fig 4. Nodes stop resource access and revert to no role when detecting that the registry is unreachable.

The definition of unreachable, as far as our protocol is concerned, is a length of time that is shorter than the registry node expiry time period. In the implementation, we’ll want brief losses of connectivity to only require retries, in order to prevent cluster instability. It is unfortunate but our Invariant 1 is vulnerable to network partitions and some kind of delay in the partitioned node. A long GC for example could cause the node to realise that the registry is unreachable after the registry has already removed it. The application programmer will need to use configuration of the two time limits to reduce the likelihood of this occuring, making the node revert to no role using a far shorter time period than the registry.

Node Failure

Nodes should be able to die at any moment. If the leader dies while performing rebalancing then the protocol dictates that a follower becomes the new leader and performs a new rebalancing. Any in progress rebalancing is aborted. All that is required to abort an in progress rebalancing is that no more steps are taken. The stop then start design avoids any kind of deadlock or blocking behaviour. A series of rebalancings could be interrupted forever and the invariants will hold.

So node failure is not a problem as any time a node fails it triggers a rebalancing which is basically a wiping clean and start again process.

Changing resources at any time

Changing the resources also triggers a rebalancing. An administrator cannot guarantee that a rebalancing isn’t already in progress when they decide to add or remove resource identifiers in the registry. The protocol can handle the addition/removal of resources because when the leader makes the allocations, it does so with a snapshot of the resources at that moment. If a moment later, even before it has written the new allocations, a new reource is added, then all that can happen is a new rebalancing occurs. The new rebalancing may or may not interrupt the current one (depends on if a minimum period between rebalancings is configured).

One important part of the protocol is that barriers must be independent of resources. That is they are not simply a property of a resource. The first version of the protocol had the barrier as a property of the resource. So when an admin removed a resource, it removed the barrier implicitly. It turns out that this can lead to our “Invariant 1 - No concurrent access to a resource” to be violated. I only discovered this defect in the first version by model checking my TLA+ specification, which we’ll look at in the next post. We’ll cover the defect in that post.

Application Code Event Handler Failure

Rebalanser will execute application code when it fires the OnStart and OnStop events. If an unhandled error happens in this code then Rebalanser treats this as a node entering potentially inconsistent state. In this case it invoke OnStop (if that wasn’t what failed) and then if it has connectivity to the registry, informs the registry it is shutting down and reverts to no role. It then determines which role it should have and begins from scratch. This is not a perfect response but Rebalanser simply cannot know the state of the access to the resources currently allocated. It is the programmer’s job to ensure that no matter what, under failure, that resource access is stopped. Else invariant 1 will be violated, outside the control of Rebalanser.

Summary

So we have identified a protocol that will ensure our invariants as far as is possible. But how do we know there aren’t any avoidable corner cases that violate our invariants? The different permutations of all the different sequences of steps is infinite and perhaps there is something we have overlooked? Perhaps even some critical failing in our protocol that should be addressed before we travel too far down the implemention road.

This is where formal verification of our algorithm comes in. TLA+ is a high level language that uses mathematic notation to describe a system. It also has a model checker that explores the state space (all the permutations of sequences of actions) in a brute force manner. We can mathematically state our invariants and ensure that they hold in all the different permutations. But beware, TLA+ for a software developer without a strong math background (like me) is pretty hard core stuff. But, if you persevere, in a couple of weeks you can produce something useful. But it takes some mind bending and mental sweat. So next up is the TLA+ spec.

Part 1 - The What
Part 2 - Designing the protocol (this post)

Part 3 - Formally verifying the protocol with TLA+
Part 4 - The implementation
Part 5 - Logging
Part 6 - Testing the implementation (coming soon)

Banner image credit: V. Forchi/ESO. Link to image