The Durable Function Tree - Part 2

In part 1 we covered how durable function trees work mechanically and the importance of function suspension. Now let's zoom out and consider where they fit in broader system architecture, and ask what durable execution actually provides us.

Function Trees and Responsibility Boundaries

Durable function trees are great, but they aren't the only game in town. In fact, they're the new kid on the block, trying to prove themselves against the other, more established kids.

Earlier this year I wrote Coordinated Progress, a conceptual model exploring how event-driven architecture, stream processing, microservices and durable execution fit into system architecture in the context of multi-step business processes, aka workflows. I also wrote about responsibility boundaries, exploring how multi-step work is made reliable inside and across boundaries. I'll revisit that now, with this function tree model in mind.

In these works I described how reliable triggers not only initiate work but also establish responsibility boundaries. A reliable trigger could be a message in a queue or a function backed by a durable execution engine. The reliable trigger ensures that the work is retriggered should it fail.

Fig 1. A tree of work kicked off by a root reliable trigger, for example a queue message kicks off a consumer that executes a tree of synchronous HTTP calls. Should any downstream nodes fail (despite in situ retries), the whole tree must be re-executed from the top.

Where a reliable trigger exists, a new boundary is created, one where that trigger becomes responsible for ensuring the eventual execution of the sub-graph of work downstream of it. A tree of work can be arbitrarily split up into different responsibility boundaries based on the reliable triggers that are planted.

Fig 2. Nodes A, B, C, and E form a synchronous flow of execution. Synchronous flows don’t benefit from balkanized responsibility boundaries. Typically, synchronous work involves a single responsibility boundary, where the root caller is the reliable trigger. Nodes D and F are kicked off by messages placed on queues, each functioning as a reliable trigger.

Durable function trees also operate within this model of responsibility boundaries. Each durable function in the tree has its own reliable trigger (managed by the durable execution engine), creating a local fault domain.

Fig 3. A durable function tree from part 1

As I explained in part 1: if func3 crashes, only func3 needs to retry; func2 remains suspended with its promise unresolved; func4's completed work is preserved; and func1 doesn't even know a failure occurred.

The tree structure creates natural fault boundaries where failures are contained to a single branch and don't cascade upward unless that branch exhausts its retries or reaches a defined timeout. These boundaries are nested like an onion: each function owns its immediate work and the completion of its direct children.

Fig 4. A function tree consists of an outer responsibility boundary that wraps nested boundaries based on reliable triggers (one per durable function).

When each of these nodes is a fully fledged function (rather than a local-context side effect), A's boundary encompasses B's boundary, which in turn encompasses C's, and so on. Each function owns its invocation of child functions and must handle their outcomes, but the DEE drives the actual execution of child functions and their retries. This creates a nested responsibility model where parents delegate execution of children to the DEE but remain responsible for reacting to results. In the above figure, if C exhausts its retries, that error propagates up to B, which must handle it (perhaps triggering compensation logic) and resolve its promise to A (possibly with an error in turn). Likewise, as errors propagate up, cancellations propagate down the tree.
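
As a rough sketch of what that reaction might look like inside B (DurableContext, ctx.call and ChildFailedException are illustrative names, not any particular engine's API; returning normally here stands in for B resolving its promise to A):

// B delegates funcC's execution and retries to the DEE, but owns the reaction.
Result funcB(DurableContext ctx, Input input) {
    try {
        // Awaiting funcC's durable promise may suspend funcB until funcC completes.
        return ctx.call("funcC", input).await();
    } catch (ChildFailedException e) {
        // funcC exhausted its retries: run compensation, then let the error
        // propagate upward so A can react in turn (with any cancellations
        // flowing back down the tree).
        ctx.call("compensateC", input).await();
        throw e;
    }
}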

This single outer boundary model contrasts sharply with choreographed, event-driven architectures (EDA). In choreography, each node in the execution graph has its own reliable trigger, and so each node owns its own recovery. The workflow as a whole emerges from the collective behavior of independent services reacting to events as reliable triggers.

Fig 5. The entire execution graph is executed asynchronously, with each node existing in its own boundary with a Kafka topic or queue as its reliable trigger.

EDA severs responsibility completely: once the event is published, the producer has no responsibility for consumer outcomes. The Kafka topic itself is the guarantor in its role as the reliable trigger for each consumer that has subscribed to it. This creates fine-grained responsibility boundaries with decoupling. Services can be deployed independently, failures are isolated, and the architecture scales naturally as new event consumers are added.

If we zoom into any one node, which might carry out multiple local-context side effects (including publishing an event), we can view the boundaries as follows:

Fig 6. Each consumer is invoked by a topic event (a reliable trigger) and executes a number of local-context side effects.

If a failure occurs in one of the local side effects, the event is not acknowledged and can be processed again. But without durable execution’s memoization, the entire sequence of local side effects inside a boundary must either be idempotent or tolerate multiple executions. This can be more difficult to handle than implementing idempotency or duplication tolerance at the individual side effect level (as with durable execution).
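
As a rough Spring Kafka sketch of what that means in practice (OrderRequested, processedOrderIds and the helper calls are illustrative placeholders, and redelivery on failure is assumed as described above), the handler has to guard the whole sequence itself:

// Without memoization, the whole handler must guard itself against redelivery.
@KafkaListener(topics = "order-requests")
void onOrderRequested(OrderRequested event) {
    // A redelivered event re-runs everything below, so the sequence is
    // guarded with an idempotency key (or every side effect must tolerate
    // being executed more than once).
    if (processedOrderIds.contains(event.orderId())) {
        return; // already handled on a previous delivery
    }
    reserveInventory(event); // re-executed on redelivery without the guard
    chargePayment(event);    // re-executed on redelivery without the guard
    processedOrderIds.add(event.orderId());
}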

What is the value-add of durable execution?

The bigger the responsibility boundary, the larger the graph of work it encompasses and the more tightly coupled things get. You can't wrap an entire architecture in one nested responsibility boundary. As the boundary grows, so does the frequency of change, making coordination and releases increasingly painful. Large function trees are an anti-pattern: the larger the tree, the wider the net of coupling spreads, the more reasons a given workflow has to change, and the more frequently it needs versioning. The bigger the tree, the more scope there is for non-determinism to creep in, causing failures and weird behaviors.

Ultimately, you can achieve multi-step business processes through other means, such as via queues and topics. You can wire up SpringBoot with annotations and Kafka.

// Spread across multiple handlers, files, possibly services
@KafkaListener(topics = "order-requests")
void step1(String message) {
    // Execute a local-context side effect
}

@KafkaListener(topics = "customer-responses")
void step2(String message) {
    // Execute a local-context side effect
}

@KafkaListener(topics = "inventory-responses")
void step3(String message) {
    // Execute a local-context side effect
}

We can even wire up compensation steps.

@KafkaListener("payment-failures")
void onPaymentFailed(PaymentFailure failure) {
    // Execute a local-context compensation side effect
}

Kafka acts as the reliable trigger for each step in the workflow. I think that's why I see many people asking what makes durable execution valuable. What is the value-add? I can already build reliable workflows, and I can even make them look quite procedural, as each step can be programmed procedurally even if the wider flow is reactive.

The way I see it is that:

  • EDA focuses on step-level reliability (each consumer handles retries, each message is durable), which results in step decoupling. Because Kafka is reliable, we can build reliable workflows from reliable steps. Because each node in the graph of work is independent, we get a decoupled architecture.

  • Durable execution focuses on workflow-level reliability. The entire business process is an entity itself (creating step coupling). It executes from the root function down to the leaves, with visibility and control over the process as a whole. But it comes with the drawback of greater coupling and the thorn of determinism. As long as progress is made by re-executing a function from the top using memoization, the curse of determinism will remain. Everything else can hopefully be abstracted.

We can build reliable workflows the event-driven way or the orchestration way. For durable execution engines to be widely adopted, they need to make durability invisible, letting you write code that looks synchronous but survives failures, retries, and even migration across machines. Allowing developers to write normal-looking code (that can magically be scheduled across several servers, suspending and resuming when needed) is nice. But more than that, durable execution as a category should make workflows more governable: that is the true value-add in my opinion.
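
To make that concrete, here is a minimal sketch of such normal-looking orchestration code, written against an imagined durable-execution API (DurableContext, ctx.call and the domain types are illustrative, not any particular engine's interface):

// The whole business process is one function: workflow-level reliability.
OrderResult processOrder(DurableContext ctx, Order order) {
    // Each call returns a durable promise; awaiting it may suspend this
    // function, and completed results are memoized across re-executions.
    Payment payment       = ctx.call("chargePayment", order).await();
    Reservation inventory = ctx.call("reserveInventory", order).await();
    Confirmation confirm  = ctx.call("confirmOrder", order, payment, inventory).await();
    return new OrderResult(payment, inventory, confirm);
}

The same steps as the Kafka listeners above, but here the workflow is a single, observable entity rather than something that emerges from scattered handlers.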

In practice, many organizations could benefit from a hybrid coordination model. As I argued in the Coordinated Progress series, orchestration (such as durable functions) should focus on the direct edges (the critical path steps that must succeed for the business goal to be achieved). In an orders workflow, payment processing, inventory reservation, and order confirmation form a tightly coupled sequence where failure at any step means the whole operation fails. It makes sense to maintain this coupling.

But orchestration shouldn't try to control everything. Indirect edges (such as triggering other related workflows or any number of auxiliary actions) are better handled through choreography. Having workflows directly invoke other workflows only expands the function tree. Instead, an orchestrated order workflow can emit an OrderCompleted event that any number of decoupled services and workflows can react to, without the orchestrator needing to know or care.

Fig 7. Orchestration employed in bounded contexts (or just core business workflow) with events as the wider substrate.
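
Continuing the same illustrative style (the topic name, event type and helper calls are made up for the example), the handoff might look like this:

// The orchestrator owns the critical path, then hands off via an event.
void finishOrder(DurableContext ctx, Order order) {
    // Final critical-path step: publish the fact, then the orchestrator is done.
    ctx.call("publishOrderCompleted", order.id()).await();
}

// Auxiliary reactions live outside the function tree as choreography.
@KafkaListener(topics = "order-completed")
void onOrderCompleted(OrderCompleted event) {
    // Loyalty points, notifications, analytics: none of which the
    // orchestrator needs to know about.
    awardLoyaltyPoints(event.customerId());
}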

Note also that workflows invoking other workflows directly can be a result of the constrained workflow→step/activity model. Sometimes it makes sense to split a large monolithic workflow into parent and child workflows, yet both still form the critical path of a single business process.

Final thoughts

The durable function tree in summary:

  • Functions call functions, each returning a durable promise

  • Execution flows down; promise resolution flows back up

  • Local side effects run synchronously; remote side effects enable function suspension

  • Continuations are implemented via re-execution + memoization

  • Nested fault boundaries: 

    • Each function ensures its child functions are invoked

    • The DEE drives progress

    • Parent functions handle the outcomes of their children

The durable function tree offers a distinct set of tradeoffs compared to event-driven choreography. Both can build reliable multi-step workflows; the question is which properties matter more for a given use case.

  • Event-driven architecture excels at decoupling: services evolve independently, failures are isolated, new consumers can be added without touching existing producers. With this decoupling comes fragmented visibility as the workflow emerges from many independent handlers, making it harder to reason about the critical path or enforce end-to-end timeouts.

  • Durable function trees excel at governance of the workflow as an entity: the workflow is explicit, observable as a whole, and subject to policies that span all steps. But this comes with coupling, as the orchestrated code must know about all services in the critical path, plus the curse of determinism that comes with re-execution + memoization.

The honest truth is you don't need durable execution. Event-driven architecture derives the same reliability from durability. You can wire up a SpringBoot application with Kafka and build reliable workflows through event-driven choreography. Many successful systems do exactly this.

The real value-add of durable execution, in my opinion, is treating a workflow as a single governable entity. For durable execution to be successful as a category, it has to be more than just allowing developers to write normal-ish looking code that can make progress despite failures. If we only want procedural code that survives failures, then I think the case for durable execution is weak.

When durable execution is employed, keep it narrow, aligned to specific core business flows where the benefits of seeing the workflow as a single governable entity make it worth it. Then use events to tie the rest of the architecture together as a whole.