Responsibility Boundaries in the Coordinated Progress model

Building on my previous work on the Coordinated Progress model, this post examines how reliable triggers not only initiate work but also establish responsibility boundaries. Wherever a reliable trigger exists, a new boundary is created: that trigger becomes responsible for ensuring the eventual execution of the sub-graph of work downstream of it. These boundaries can layer and nest, especially in orchestrated systems, where an outer boundary overlays finer-grained internal ones.

Failures happen, and distributed business service architectures need ways of recovering from those failures so they can continue to make progress. In the Coordinated Progress model, reliable progress is defined as the combination of:

  • Reliable Triggers, to ensure that work is initiated in a way that survives failure. Should the work fail before completion, it can be retriggered with the same initial state as the first invocation.

  • Progressable Work, to ensure that repeated invocations do not result in inconsistent or duplicated state. Work either:

    • Progresses incrementally, durably logging progress so that, after being retriggered, it can resume from where it left off.

    • Is re-executed in full, but is idempotent, so repeating the work causes no duplicate effects.

Fig 1. Reliable progress is made via reliable triggers and progressable work.
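To make these two patterns concrete, here is a minimal TypeScript sketch of a hypothetical order-processing flow. The progress log (an in-memory Map standing in for a durable store) and the chargeCustomer/scheduleShipment calls are assumptions for illustration only.

```typescript
// Hypothetical in-memory progress log; a real system would use a durable store.
const progress = new Map<string, boolean>();

async function chargeCustomer(orderId: string): Promise<void> {
  console.log(`charging customer for order ${orderId}`);
}

async function scheduleShipment(orderId: string): Promise<void> {
  console.log(`scheduling shipment for order ${orderId}`);
}

// Pattern 1: incremental progress. Each step consults the progress log first,
// so a retriggered invocation resumes from where the previous attempt left off.
async function processOrderIncrementally(orderId: string): Promise<void> {
  if (!progress.get(`${orderId}:charged`)) {
    await chargeCustomer(orderId);
    progress.set(`${orderId}:charged`, true); // durably record the completed step
  }
  if (!progress.get(`${orderId}:shipped`)) {
    await scheduleShipment(orderId);
    progress.set(`${orderId}:shipped`, true);
  }
}

// Pattern 2: full re-execution. Every step is idempotent (keyed on orderId),
// so re-running the whole function after a failure cannot duplicate its effects.
const chargedOrders = new Set<string>();

async function processOrderIdempotently(orderId: string): Promise<void> {
  if (!chargedOrders.has(orderId)) {   // de-duplicate on the order id
    chargedOrders.add(orderId);
    await chargeCustomer(orderId);
  }
  await scheduleShipment(orderId);     // assumed idempotent downstream as well
}

processOrderIncrementally("order-42");
```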

Not all units of work have their own reliable trigger, but reliable progress needs a reliable trigger somewhere, even if that's a human clicking a button. That human, in their role as button clicker, acts as the root of a potentially large responsibility boundary encompassing multiple services.

How a service uses reliable triggers to ensure the reliability of its own work and of its downstream work is the subject of this post. Let's examine how a given service can use reliable triggers to ensure that a downstream service completes its work reliably.

Caller vs Callee vs Upstream Triggers

As a developer, you can't stop at making your own service reliable; you also need to think about whether your downstream invocations are truly reliable. How can a given service ensure the reliable execution of its own work and of its downstream dependencies?

The answer is to use reliable triggers and progressable work. But who places the trigger? The figure below outlines three patterns in which a caller reliably invokes a callee. In each case, the callee fails the first two times but succeeds on the third attempt.

Fig 2. Different ways of relying on reliable triggers to make downstream work reliable.

Let’s look at the different trigger placement types for any specific caller->callee pairing:

  1. Caller Placement. In this placement model, the caller delegates responsibility for invoking the callee service to a middleware, such as a queue, topic, or reliable RPC. Each acts as a reliable trigger for the callee service to perform its work. Once the middleware has acknowledged receipt of the message/RPC, the caller context is free to proceed, or if there’s nothing left to do, to terminate. This is ideal for triggering asynchronous work.

  2. Callee Placement. In this placement model, the downstream service (the callee) registers its invocation durably, then immediately responds to the caller with a success response. The callee can then proceed (having durably logged the intent to do the work), carrying out the work asynchronously. Should it fail before completing its work, the durably stored invocation state is used to retrigger the work. This is ideal for ensuring asynchronous work, triggered by unreliable RPC, is reliably executed.

  3. Upstream Placement. In this placement type, the caller->callee edge is unreliable, such as an RPC. Essentially, no trigger is placed on this edge at all; instead, the caller relies solely on its own reliable trigger (or an upstream reliable trigger if it has none itself) to ensure that the unreliable caller-to-callee invocation eventually succeeds. This is ideal for synchronous work, where we assume there is a reliable trigger somewhere upstream (even if that is a human).

In the case of Caller Placement, a middleware sits between the caller and the callee. It is the caller's responsibility to set a reliable trigger via this middleware. In the case of Callee Placement, there is no middleware between the caller and the callee, and the caller must assume the callee takes responsibility for itself.
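As a rough illustration of the first two placements, here is a minimal TypeScript sketch. The queue (an array), the invocation log (a Map), and the handler names are hypothetical stand-ins; a real system would use a durable broker and a database.

```typescript
import { randomUUID } from "node:crypto";

type Message = { invocationId: string; payload: string };

// Caller placement: the caller hands the message to middleware and is free to
// proceed once the enqueue is acknowledged; the queue is now the callee's
// reliable trigger. (An array stands in for a durable queue or topic.)
const calleeQueue: Message[] = [];

function callerPlacesTrigger(payload: string): void {
  calleeQueue.push({ invocationId: randomUUID(), payload }); // "ack" = enqueue succeeded
  // The caller context may proceed or terminate; redelivery and retries are the queue's job.
}

// Callee placement: the callee durably records the invocation, replies success
// immediately, and performs the work asynchronously. A crash before completion is
// recovered by retriggering from the invocation log. (A Map stands in for that log.)
const invocationLog = new Map<string, { payload: string; done: boolean }>();

function calleeHandlesRequest(payload: string): string {
  const id = randomUUID();
  invocationLog.set(id, { payload, done: false }); // durable write before the ack
  void performWork(id);                            // kick the work off asynchronously
  return id;                                       // immediate success response to the caller
}

async function performWork(id: string): Promise<void> {
  const entry = invocationLog.get(id);
  if (!entry || entry.done) return;                // already completed on a prior attempt
  console.log(`doing the work for ${entry.payload}`);
  entry.done = true;                               // mark complete in the log
}

callerPlacesTrigger("create-order");
calleeHandlesRequest("create-order");
```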

Triggers and Boundaries

As we saw above, a node in the graph can perform a handoff of responsibility to a downstream node via a reliable trigger, or it can depend on its own trigger or even an upstream trigger. The point is that somewhere, a reliable trigger must exist.

Let’s take the following example of six nodes forming a synchronous graph of execution. If a node is hollow, it has no reliable trigger; if it is opaque, it does have a reliable trigger.

Fig 3. Synchronous execution graph without a reliable trigger.

Node A (a function in a microservice) kicks off the work. But this work is not reliably triggered: if it fails, there is nothing and no one to retrigger it. Furthermore, none of the downstream steps are reliably triggered either. There is no responsibility boundary at all; any failure amongst these six nodes will leave the work partially completed, most likely leaving the system in a globally inconsistent state.

Now, let’s place a reliable trigger for Node A. It could just be a human who clicks a button and waits to confirm the operation completed successfully, ready to try again if needed. Or the work could be triggered from some other durable source, such as a queue, topic, or database.

Fig 4. Synchronous execution graph with a root reliable trigger, constituting a responsibility boundary.

Things just got a lot better from a reliability perspective; we now have a responsibility boundary wrapping this entire graph. Should any of the nodes in this boundary fail, the system re-executes the entire operation, ensuring data integrity and task completion. While potentially resource-intensive, this approach prevents leaving the system in a partially executed or corrupted state (as long as each node implements a progressable-work pattern). We might be able to leave things here, having clearly defined a responsibility boundary that takes over should failures occur.

However, what if Nodes D and F are long-running and are therefore performed asynchronously? When D and F receive a request, they immediately acknowledge it by sending a success response, then proceed to execute the work asynchronously. If the work then fails, we have a problem. The single responsibility boundary is predicated on upstream nodes detecting when downstream nodes fail, and with asynchronous work that assumption no longer holds.

We can address this by having Node B send Node D a message over a queue or topic, and the same for Node C and F. Now D and F each exist in their own responsibility boundary. Should either one fail before completion, it does not depend on Node A retriggering the whole graph of execution.

Fig 5. Synchronous and asynchronous work in the execution graph is separated into different responsibility boundaries.

Nodes A, B, C, and E form a synchronous flow of execution. Synchronous flows don’t benefit from balkanized responsibility boundaries. Typically, synchronous work involves a single responsibility boundary, where the root caller is the reliable trigger.

But it might be acceptable to make this entire flow asynchronous.

Fig 6. The entire execution graph is executed asynchronously, with each node (service) existing in its own boundary.

Now, all inter-node communication happens over queues or topics. Each node is wrapped in its own responsibility boundary, meaning failures only require retriggering one node in the graph.
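Here is a minimal sketch of this fully asynchronous shape, with hypothetical in-memory topics and handlers standing in for a real broker and real services: each node is triggered only by a message on its own topic, and a failing handler is redelivered (retriggered) in isolation.

```typescript
// Hypothetical in-memory topics and per-node handlers; a real system would use a
// broker (Kafka, SQS, etc.) with its own redelivery mechanics.
type Handler = (payload: string) => Promise<void>;

const topics = new Map<string, string[]>();   // topic name -> pending messages
const handlers = new Map<string, Handler>();  // topic name -> the node subscribed to it

function publish(topic: string, payload: string): void {
  topics.set(topic, [...(topics.get(topic) ?? []), payload]);
}

// Deliver pending messages to one node, retrying only that node on failure.
async function deliver(topic: string): Promise<void> {
  const handler = handlers.get(topic);
  if (!handler) return;
  for (const payload of topics.get(topic) ?? []) {
    for (;;) {
      try { await handler(payload); break; }  // success: move on to the next message
      catch { /* only this node is retriggered; the rest of the graph is untouched */ }
    }
  }
  topics.set(topic, []);
}

// Node B does its own work, then hands off to node D via D's topic (a reliable trigger).
handlers.set("node-b", async (payload) => {
  console.log(`B processed ${payload}`);
  publish("node-d", payload);
});
handlers.set("node-d", async (payload) => {
  console.log(`D processed ${payload}`);
});

async function main() {
  publish("node-b", "order-123");  // the root reliable trigger for this run
  await deliver("node-b");
  await deliver("node-d");
}
main();
```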

It turns out that asynchronous two-way communication can be viewed as a special case of one-way communication (specifically, two one-way invocations). Viewed this way, we see that the callee happens to invoke a handler of the caller service (with the final response) as part of its asynchronous work.

Fig 7. The same execution graph from fig 6, where each two-way communication acts like two one-way communications.

The asynchronous response is received by a handler in the caller, and this handler context is a separate responsibility boundary from that of the initial calling context. 

Request/response over queues and topics requires wiring event handlers and correlation IDs, which can be more complex than just writing the synchronous version from figure 4. Durable execution engines (DEEs) are emerging as a way to build these kinds of flows using simpler procedural code. 
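To illustrate that wiring, here is a hedged sketch of request/response over topics using a correlation ID; the topic arrays, the pendingRequests map, and the handler names are all hypothetical stand-ins.

```typescript
import { randomUUID } from "node:crypto";

// Request/response as two one-way invocations: the caller publishes a request
// carrying a correlation id, and a separate response handler (a separate
// responsibility boundary) later matches the reply back to the waiting request.
type Envelope = { correlationId: string; body: string };

const requestTopic: Envelope[] = [];   // hypothetical in-memory topics
const responseTopic: Envelope[] = [];
const pendingRequests = new Map<string, (body: string) => void>();

function sendRequest(body: string): Promise<string> {
  const correlationId = randomUUID();
  requestTopic.push({ correlationId, body });  // one-way invocation #1: the request
  return new Promise((resolve) => pendingRequests.set(correlationId, resolve));
}

// Callee side: consume requests and emit responses as its own one-way messages.
function calleeConsume(): void {
  for (const { correlationId, body } of requestTopic.splice(0)) {
    responseTopic.push({ correlationId, body: `result of ${body}` }); // one-way invocation #2
  }
}

// Caller's response handler: runs in a different context from the original call
// site and uses the correlation id to find the request it belongs to.
function handleResponses(): void {
  for (const { correlationId, body } of responseTopic.splice(0)) {
    pendingRequests.get(correlationId)?.(body);
    pendingRequests.delete(correlationId);
  }
}

async function main() {
  const reply = sendRequest("do-something");
  calleeConsume();     // the callee picks up the request
  handleResponses();   // the caller's handler correlates the response
  console.log(await reply);
}
main();
```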

The execution graph with Durable Execution 

So far, we’ve approached this with the mindset of individual services forming a graph of execution, where it’s all about collaboration between services. When we use queues and topics to drive asynchronous work, this collaboration is event-driven: each unit of work is triggered by a message on a queue or topic.

Durable execution works differently. 

In durable execution, it’s not so much collaboration between services as a boss telling each subordinate service what to do. It is polling-driven: the DEE polls (triggers and retriggers) each actor when it fails, serving cached values where prior executions have already made incremental progress up to that point.

Fig 8. How a Durable Execution Engine (DEE) drives work, with a polling and caching approach.

In the above figure, we have Function X (Fx) and Function Y (Fy). Fx is invoked (somehow) and it immediately registers its invocation with the DEE (callee trigger placement). Fx performs three blocks of progressable work, using the DEE to register the results of each code block. Code block B requests that Function Y (Fy) do some work and send a response. With the DEE also mediating this communication, it looks something like this:

  1. Fx is triggered, and it registers its invocation with the DEE (callee trigger placement).

  2. Fx reaches code block B and instructs the DEE to execute Fy and return a response. Fx waits.

  3. The DEE invokes Fy, and either the DEE places a reliable trigger here, or Fy registers its invocation to place the trigger.

  4. Fy completes the work, providing the DEE with its response.

  5. Fx, which has been waiting, receives the response from the DEE; it then executes code block C and terminates.
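Expressed as code, the shape described above might look roughly like the sketch below. The step helper and its in-memory cache are a toy stand-in for a DEE, not any real engine's API; the block names A, B, and C follow the figure.

```typescript
// Toy in-memory "DEE": `step` returns a cached result if this block has already
// completed, otherwise runs it and records the result as durable progress.
const stepCache = new Map<string, unknown>();

async function step<T>(key: string, fn: () => Promise<T>): Promise<T> {
  if (stepCache.has(key)) return stepCache.get(key) as T; // serve prior progress
  const result = await fn();
  stepCache.set(key, result);                             // log progress via the "DEE"
  return result;
}

// Fy: the downstream function the DEE invokes on Fx's behalf.
async function Fy(input: string): Promise<string> {
  return step(`Fy:${input}`, async () => `Fy processed ${input}`);
}

// Fx: synchronous-looking code, but every block is registered with the DEE, so a
// retriggered invocation skips blocks that already completed.
async function Fx(invocationId: string): Promise<string> {
  const a = await step(`${invocationId}:A`, async () => "result of A"); // block A
  const b = await step(`${invocationId}:B`, () => Fy(a));               // block B: call to Fy, mediated by the DEE
  return step(`${invocationId}:C`, async () => `C combined "${b}"`);    // block C
}

Fx("inv-1").then(console.log);
```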

This all appears to be very synchronous in nature, but because the invocations are registered with the DEE (reliable triggers), progress is logged via the DEE, and communication is mediated by the DEE, synchronous-looking code can actually be extremely long-running and make progress despite multiple failures. 

For example, if Fx fails while waiting for a response:

  1. The DEE will retrigger Fx. 

  2. Fx will reuse the cached result of code block A.

  3. Fx will request the work of block B again, and if Fy has completed, the DEE can serve Fx the cached response immediately.

  4. Fx then executes block C and terminates.

Likewise, if Fy fails in block B3, the DEE retriggers Fy, and Fy can use any cached results of blocks B1 and B2. In fact, both functions can fail repeatedly, and the DEE acts as the poller, reliably invoking them and providing the results of previously executed blocks, until the whole thing finally completes.
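Here is a hedged sketch of that polling-and-caching behaviour in isolation: a function that fails on its first two attempts, while the cached results of earlier blocks survive each retrigger. The names and the in-memory cache are hypothetical.

```typescript
// The "DEE" as a poller: keep retriggering a flaky function; each retrigger reuses
// the cached results of blocks that already completed.
const blockCache = new Map<string, string>();
let attempts = 0;

async function flakyFx(): Promise<string> {
  if (!blockCache.has("A")) blockCache.set("A", "result of A");       // block A, done once
  attempts += 1;
  if (attempts < 3) throw new Error(`crash on attempt ${attempts}`);  // fails the first two times
  if (!blockCache.has("B")) blockCache.set("B", "result of B");       // block B, runs on attempt 3 only
  return `C combined ${blockCache.get("A")} + ${blockCache.get("B")}`; // block C
}

async function pollUntilComplete(): Promise<string> {
  for (;;) {
    try { return await flakyFx(); }                      // finished: stop polling
    catch (err) { console.log(`retriggering after: ${(err as Error).message}`); }
  }
}

pollUntilComplete().then(console.log);
```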

The DEE is orchestrating everything, ensuring forward progress by polling each service repeatedly until a flow of work is complete. What does this do to our concept of responsibility boundaries? 

On one hand, Fx and Fy have their own reliable triggers (the DEE). But at the same time, the DEE ensures the entire downstream graph of Fx is reliably executed. Therefore, we can think of an orchestrated flow as one outer responsibility boundary that overlays finer-grained internal boundaries. Each micro-boundary acts as a local fault domain, so that, should a failure occur, the DEE only needs to retrigger a local part of the wider workflow.

Fig 9. Orchestrated flows consist of an outer responsibility boundary that wraps nested boundaries based on reliable triggers.

This might seem overly theoretical, but it actually aligns with the discussion of choreography versus orchestration in Coordinated Progress. 

Choreography is about highly decoupled microservices that independently react to their input events as they arrive. There is no blocking or waiting, and all consumers operate independently of any upstream producers or subsequent downstream consumers. This lines up with each node in the graph operating in its own responsibility boundary. 

Orchestration is about centralized logic driving a procedural workflow (if-this-then-that). The orchestrator keeps track of which parts of the workflow have been completed, which are in progress, and which have yet to be started. It keeps track of the commands sent to the subordinate services as well as the responses from those services. It operates inside a single outer responsibility boundary.

Responsibility boundaries vs Business process boundaries

In choreographed, event-driven architectures, each node in the execution graph has its own reliable trigger, and so each node owns its own recovery. This creates fine-grained responsibility boundaries, which can offer strong decoupling and smaller fault domains. Failures are contained; retries are local. However, these boundaries may not line up with business workflow boundaries. A failure in a small boundary may leave the broader business workflow partially complete and harder to reason about.

In contrast, orchestrated workflows create coarse-grained responsibility boundaries that typically align with business workflows. A durable orchestrator owns retries, state, and the responsibility for the entire workflow’s success. It can also utilize finer-grained responsibility boundaries, allowing it to retrigger only small subgraphs of the entire workflow graph. This simplifies the developer’s mental model and makes recovery of complex processes more predictable.

But you can’t wrap an entire architecture in one responsibility boundary. As the boundary grows, so does the frequency of change, making coordination and releases increasingly painful. What should be loosely connected services become tightly coupled through shared state, shared versions, and shared failure modes. This defeats the point of distributed systems: flexibility, independence, and isolation. This leads me back to the idea of mixing choreography with orchestration for a best-of-both-worlds approach.

Final thoughts

Reliability isn’t free; someone must always be responsible for ensuring work is eventually completed. A reliable trigger marks the point where that responsibility begins or ends, shaping how systems recover from failure and continue making progress. Clear responsibility boundaries make these obligations explicit, providing answers when failures occur: where to look, who owns recovery, and how far the impact spreads. Without them, distributed systems remain fragile and unpredictable. 

Aligning responsibility boundaries with business process boundaries is often beneficial, but those boundaries can only grow so large before they become a liability. In reality, business processes are naturally divided into specific domains; organizations don’t run giant processes that encompass the entire business. Instead, processes are focused on particular areas, triggered by specific events, and may in turn trigger other processes. Responsibility boundaries can be aligned to these business boundaries, using a mix of choreography and orchestration.