In my last post I wrote about why and where determinism is needed in durable execution (DE). In this post I'm going to explore how workflows can be formed from trees of durable function calls based on durable promises and continuations.
Here's how I'll approach this:
Part 1
Building blocks: Start with promises and continuations and how they work in traditional programming.
Making them durable: How promises and continuations are made durable.
The durable function tree: How these pieces combine to create hierarchical workflows with nested fault boundaries.
Function trees in practice: A look at Temporal, Restate, Resonate and DBOS.
Part 2
Responsibility boundaries: How function trees fit into my Coordinated Progress model and its responsibility boundaries.
Value-add: What value does durable execution actually provide?
Architecture discussion: Where function trees sit alongside event-driven choreography, and when to use each.
Introduction
At their core, most durable execution frameworks organize work as hierarchical trees of function calls. A root function invokes child functions, which may invoke their own children, forming a tree. In some frameworks such as Temporal, the workflow task is the parent function and each activity is a leaf function in a two-level tree. Other frameworks support arbitrary function trees where each function returns a durable promise to its caller. When a child function completes, its promise resolves, allowing the parent to resume. The durable execution engine manages this dance of invoking functions, caching results, handling retries, and suspending functions that are waiting on remote work.
I'll refer to this pattern as the durable function tree, though it manifests differently across frameworks.
In this series, I use the term side effect to mean any operation whose result depends on the external world rather than the function’s explicit inputs. That includes the obvious mutations such as writing to a database or sending an email, but also non-mutating operations whose results are not guaranteed to be the same across re-execution (such as retrieving a record from a database). Even though these operations look like pure reads, they are logically side effects because they break determinism (ah yes, the curse of determinism): the value you obtain today may differ from the value you obtain when the function is retried tomorrow. So in these posts, side effect means: anything external that must be recorded (and replayed) because it cannot be deterministically re-executed.
Promises and Futures: A Quick Refresher
Promises and futures are programming language constructs that act as handles or placeholders for a future result. They are coordination primitives.
The promise and the future are closely related concepts:
A promise is a write-once container of a value, where the writer sets the value now or at some point in the future. Setting the value resolves the promise.
A future is the read-only interface to the promise: the bearer can check whether it has been resolved and read the value, but cannot set it.
Fig 1. The promise/future as a container for a future value.
The bearer can await the promise/future, which blocks until it has been resolved. While technically distinct (a promise is writable, a future is read-only), most languages and frameworks blur this distinction. For simplicity, I'll use the term "promise" throughout.
Here's the basic pattern in pseudocode:
Promise<Customer> getCustomer(int id) {
    Promise<Customer> promise = new Promise<>();
    runAsync(() -> {
        Customer customer = db.getCustomer(id);
        promise.resolve(customer);
    });
    return promise;
}
The function creates a promise, launches asynchronous work that will eventually resolve it, and returns immediately. The caller can await the promise right away or continue with other work:
// Await immediately
customer = getCustomer(id).await();

// Or defer the await
var custPromise = getCustomer(id);
// ... do other work ...
customer = custPromise.await();
Developers are generally comfortable with functions returning promises: invoke a function, get a handle, await its result. Usually we're waiting for some IO to complete or a call to an API. In fact, when a function executes, it might be the root of a tree of function calls, each passing back promises/futures to its caller, forming a promise chain.
Continuations: What Happens Next
Promises and continuations are related but distinct concepts:
A promise is a synchronization primitive for a value that doesn't yet exist
A continuation is a control-flow primitive representing what the program should do once that value exists
In JavaScript-style APIs, continuations appear explicitly in then, catch, and finally:
foo().then(x =>
  bar(x).then(y =>
    baz(y)
  )
)
Modern async/await syntax hides this continuation-passing behind synchronous-looking code:
const x = await foo();
const y = await bar(x);
return baz(y);
When execution hits await, the current function suspends and everything after the await becomes the continuation: the code that will resume once the promise resolves.
Making Promises Durable
A durable promise is a promise that survives process crashes, machine failures, and even migration to different servers. We can model this as a durable write-once register (WOR) with a unique, deterministic identity.
The key properties:
Deterministic identity: The promise ID is derived deterministically from the execution context (e.g., from the parent function ID and the position in the code), or is explicitly defined by the developer.
Write-once semantics: Can only be resolved once.
Durable storage: Both the creation and resolution are recorded persistently.
Globally accessible: Any service that knows the promise ID can resolve it or await it.
Durable execution engines (DEEs) generally implement this logical WOR by recording two entries in the function's execution history: one when the promise is created, another when it's resolved. This history is persisted and used to reconstruct state during replay. There are also additional concerns beyond creation and resolution, such as promise timeouts and cancellation.
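To make the write-once register idea concrete, here is a minimal sketch of a durable promise store. It is not the API of any particular framework: the table layout, method names and SQLite backing are all illustrative assumptions, and real engines persist this in their own replicated log or database.

import json
import sqlite3

class DurablePromiseStore:
    """A toy write-once register (WOR) store for durable promises."""

    def __init__(self, path: str = "promises.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS promises "
            "(id TEXT PRIMARY KEY, state TEXT, value TEXT)"
        )

    def create(self, promise_id: str) -> None:
        # Record the promise's creation. INSERT OR IGNORE keeps this
        # idempotent when the same code path is replayed.
        self.db.execute(
            "INSERT OR IGNORE INTO promises (id, state, value) "
            "VALUES (?, 'PENDING', NULL)",
            (promise_id,),
        )
        self.db.commit()

    def resolve(self, promise_id: str, value) -> None:
        # Write-once semantics: only a PENDING promise can be resolved.
        self.db.execute(
            "UPDATE promises SET state = 'RESOLVED', value = ? "
            "WHERE id = ? AND state = 'PENDING'",
            (json.dumps(value), promise_id),
        )
        self.db.commit()

    def get(self, promise_id: str):
        # Returns (resolved?, value); unknown and pending both read as unresolved.
        row = self.db.execute(
            "SELECT state, value FROM promises WHERE id = ?", (promise_id,)
        ).fetchone()
        if row and row[0] == "RESOLVED":
            return True, json.loads(row[1])
        return False, None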
When you write:
async def process_order(order, customer_id):
    customer = await get_customer(customer_id)
    # ... rest of logic
Behind the scenes, the framework SDK:
Checks whether a durable promise for this get_customer call already exists.
If resolved: returns the stored value immediately.
If unresolved or non-existent: executes (or re-executes) the work, then resolves the promise and records the result (sketched below).
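Here's a rough sketch of that lookup-or-execute logic, building on the toy store above. The deterministic promise ID here is just the parent function's ID plus an invocation counter, which is one possible scheme among several; nothing about this is specific to a real framework.

class ExecutionContext:
    """Tracks one durable function invocation so that each awaited call
    gets a deterministic promise ID: "<function id>/<sequence number>"."""

    def __init__(self, store: DurablePromiseStore, function_id: str):
        self.store = store
        self.function_id = function_id
        self.seq = 0

    def durable_call(self, fn, *args):
        # The same code path on replay yields the same sequence numbers,
        # and therefore the same promise IDs.
        self.seq += 1
        promise_id = f"{self.function_id}/{self.seq}"
        resolved, value = self.store.get(promise_id)
        if resolved:
            return value                      # memoized: skip re-execution
        self.store.create(promise_id)
        value = fn(*args)                     # execute the side effect
        self.store.resolve(promise_id, value)
        return value

On a crash and re-execution, the counter walks through the same calls in the same order, so each call finds its earlier result in the store. That is the memoization the next section builds on.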
Durable Continuations Through Re-execution and Memoization
With traditional promises, suspending execution means keeping the call stack and heap state in memory; everything is still running in a single process. Durable execution engines typically don't do this, as capturing and persisting arbitrary program state is complex and fragile.
Instead, they implement continuations through replay and memoization:
The function executes from the top
Each await checks if its durable promise is already resolved
If yes: use the stored result and continue (this is fast, it’s just a lookup)
If no: execute the work, resolve the promise, record the result
On failure: restart from step 1
Consider this example:
async def process_order(order):
    customer = await get_customer(order.customer_id)         # Promise 1
    inventory = await check_inventory(order.item_id)         # Promise 2
    payment = await charge_customer(customer, order.amount)  # Promise 3
    await send_confirmation(customer, order)                 # Promise 4
First execution:
Executes get_customer, resolves Promise 1, stores result
Executes check_inventory, resolves Promise 2, stores result
Starts charge_customer, crashes mid-execution
Second execution (after crash):
Re-runs from top
get_customer: Promise 1 already resolved → returns stored result instantly
check_inventory: Promise 2 already resolved → returns stored result instantly
charge_customer: Promise 3 unresolved → executes the work
Completes successfully
This is why determinism matters (from the previous post). The function must take the same code path on replay to encounter the same promises in the same order. If control flow were non-deterministic, replayed execution might skip a promise or try to await a different promise entirely, breaking the memoization.
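As a contrived illustration of the failure mode, imagine control flow driven by something non-deterministic (reusing the functions from the example above):

import random

async def process_order(order):
    # BAD: this branch may be taken on the first execution but not on the
    # replay (or vice versa), so the replay awaits a different sequence of
    # promises and the memoized results no longer line up.
    if random.random() < 0.5:
        customer = await get_customer(order.customer_id)
    inventory = await check_inventory(order.item_id)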
Let’s now introduce the durable function tree.
Durable Function Trees and Promise Chains
Durable functions can call other durable functions, creating trees of execution. Each function invocation returns a durable promise to the caller.
Fig 2. A tree of function calls, returning durable promises.
Execution flows down the tree; promise resolution flows back up.
This produces a tree of function calls, where each function is a piece of control flow that executes various side effects. Side effects can run in the local context or in a reliable remote context (such as another durable function), and the difference matters.
| Local-Context Side Effects | Remote-Context Side Effects |
|----------------------------|-----------------------------|
| Runs in-process            | Runs elsewhere              |
| Cannot suspend             | Enables suspension          |
| Retries via parent         | Retries independently       |
Local-context side effects run within the function's execution context:
Database queries
S3 operations
HTTP calls to external APIs
Local computations with side effects
Local-context side effects have these characteristics:
Execute synchronously (even if using async syntax, the result is received by the same context)
Cannot be retried independently (only by replaying the parent function)
Require the function to keep running (e.g., maintaining a TCP connection for a database response)
Remote-context side effects run in a separate reliable context:
Another durable function.
A durable timer (managed by the DEE).
Work queued for external processing with an attached durable promise for the 3rd party to resolve.
Remote-context side effects behave differently:
Can be retried independently of the caller.
Continue progressing even if the caller crashes or suspends.
The caller awaits a promise, not a direct response. It is asynchronous: the caller context that receives the result may be a re-execution running on a different server hours, days or months later.
The distinction between local and remote matters because remote-context side effects create natural suspension points, which become important for durable function trees.
Let’s use the tree from fig 2. It has a mix of local-context side effects (such as db commands and HTTP calls) and remote-context side effects, that is, calls to other durable functions (or timers).
func1
├─ local: db.getCustomer(id)
├─ remote: func2
│ ├─ local: db.checkInventory(item)
│ ├─ remote: func3
│ │ └─ local: http.paymentAPI.charge(amount)
│ └─ remote: func4
│ ├─ local: s3.uploadInvoice(invoice)
│ └─ local: db.updateOrder(order)
└─ local: db.logAudit(event)
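In code, func1 might look something like the sketch below. Here ctx.call is a made-up stand-in for however a given engine invokes another durable function and hands back its durable promise, and db is an ordinary (non-durable) database client.

async def func1(ctx, order_id):
    # Local-context side effect: runs in-process; func1 must stay alive
    # until the database responds, so it cannot suspend here.
    customer = db.get_customer(order_id)

    # Remote-context side effect: invoking another durable function.
    # ctx.call (hypothetical) returns a durable promise. Once func1 is only
    # waiting on this promise, the engine can suspend it entirely.
    result = await ctx.call(func2, order_id)

    # This line may run in a later, re-executed invocation of func1,
    # with func2's result served from the resolved promise.
    db.log_audit({"order": order_id, "result": result})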
When a function is waiting only on promises from remote side effects, it can be suspended (meaning terminated, with all execution state discarded). The function doesn't need to sit in memory burning resources while waiting hours or days for remote work to complete.
Fig 3. Our durable function tree seen as a tree with local-context and remote-context side effects
Let’s imagine that the payment provider is down for two hours, so func3 cannot complete. The execution flow of the tree:
Func1 runs:
Executes getCustomer (local work, cannot suspend here)
Invokes func2, and receives a durable promise.
There is no other local work to run right now. Only waiting on remote-context side effects.
Func1 suspends—completely terminated, no resources held
Func2 runs:
Executes checkInventory (local work, cannot suspend here)
Invokes func3 and func4, receiving durable promises.
There is no other local work to run right now. Only waiting on remote-context side effects.
Func2 suspends—completely terminated, no resources held
Func3 runs (concurrently with func4)
Payment provider down, so fails payment.
Func3 is retried repeatedly by the DEE.
Two hours later, func3 completes, resolves the promise
Func4 runs (concurrently with func3)
Executes uploadInvoice (local work, cannot suspend here)
Executes updateOrder (local work, cannot suspend here)
Resolves its promise.
Func2 resumes–re-executed from the top by the DEE.
checkInventory: already resolved → instant return
Func3: already resolved → instant return
Func4: already resolved → instant return
Resolves its promise to func1.
Func1 resumes–re-executed from the top by the DEE.
getCustomer: already resolved → instant return
func2: already resolved → instant return with result
Continues to logAudit (local work) and completes.
Without suspension, either:
The whole tree would need to be re-executed from the top repeatedly until func3 completes after two hours.
Or, each function in the tree, from func3 upwards, would need to re-execute independently every few minutes for the two hours the payment provider is down, just to check whether its child promises have been resolved.
With function suspension, we avoid the need to repeatedly retry over long time periods and only resume a function once its child promise(s) have been resolved, consuming zero resources in the meantime.
Local side effects don't allow suspension because the function must remain running for the side effect to complete. You can't suspend while waiting for a database response: the TCP connection would be lost and the response would never arrive. The same goes for API calls that are not directly managed by the durable execution engine; these are treated like any other locally-run side effect.
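One common way to implement suspension (a sketch, not any specific engine's internals) is for the SDK to raise a special signal when the function awaits a remote promise that isn't resolved yet and has no local work left, and for the engine to record who is waiting:

class Suspended(Exception):
    """Raised to tear the function down; all in-memory state is discarded.
    The engine re-invokes (replays) the function once the promise resolves."""
    def __init__(self, promise_id: str):
        super().__init__(promise_id)
        self.promise_id = promise_id

def await_remote(store, waiters: dict, function_id: str, promise_id: str):
    resolved, value = store.get(promise_id)
    if resolved:
        return value                                # memoized: keep going
    # Unresolved remote work: register interest, then suspend completely.
    waiters.setdefault(promise_id, set()).add(function_id)
    raise Suspended(promise_id)

def on_promise_resolved(waiters: dict, promise_id: str, schedule):
    # Engine side: when func3 or func4 resolves its promise, wake each
    # waiting function (func2) by scheduling a fresh, replayed invocation.
    for waiting_function in waiters.pop(promise_id, set()):
        schedule(waiting_function)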
What makes this durable function tree structure particularly powerful for fault tolerance is that each node can fail, retry, and recover independently without affecting its siblings or ancestors. If func3 crashes, only func3 needs to retry:
func2 remains suspended.
func4's completed work is preserved.
func1 remains suspended and doesn't even know a failure occurred.
The tree structure creates natural fault boundaries: failures are contained to a single branch and don't cascade upward until that branch exhausts its retries or reaches a defined timeout. This means a complex workflow with dozens of steps can have a single step fail and retry hundreds of times without forcing the entire workflow to restart from scratch. Portions of the tree can remain suspended indefinitely, until a dependent promise allows resumption of the parent function.
Function Trees in Practice
Different durable execution engines make different choices about tree depth and suspension points.
Temporal
Temporal uses a two-layer model where workflows orchestrate activities. The workflow is the root function (run as a workflow task) and each activity is a leaf function (each run as a separate activity task). Each activity is considered a single side effect. Child workflows add depth to the tree as a parent workflow can trigger and await the result of the child.
Fig 4. Temporal’s two layer workflow→activity model.
Because each activity is a separately scheduled task that could run on any worker, activities are remote-context side effects from the workflow's perspective, which allows the workflow task to be suspended. In fact, if a workflow has three activities to execute sequentially, the workflow will be executed across four workflow tasks in order to complete (as the first three workflow tasks end by suspending on an activity invocation).
1. Workflow task 1
   - Activity 1: Invoke
   - Suspend (complete workflow task)
2. Activity 1 task: executes, completes.
3. Workflow task 2
   - Activity 1: already completed → instant return
   - Activity 2: Invoke
   - Suspend (complete workflow task)
4. Activity 2 task: executes, completes.
And so on…
Fig 5. Workers poll Temporal Server task queues for tasks, and then execute those tasks. Activities are invoked via commands, from which Temporal Server derives events and tasks.
Even when an activity fails, Temporal re-executes the parent workflow from the top, which re-encounters the failed activity. In Temporal, the workflow task, run on workers, drives forward progress. If an activity needs to be scheduled again, that is driven from the workflow task. In turn, Temporal detects the need to reschedule a workflow task when an activity times out (rather than detecting the error directly).
Temporal is a push/pull model where:
Workers pull tasks (workflow/activity) from task queues in Temporal Server.
Workers drive forward progress by pushing (sending) commands to Temporal Server (which in turn leads to the server creating more tasks to be pulled).
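For a concrete feel of the two-level model, here is roughly what a workflow and activity look like in Temporal's Python SDK (simplified; worker registration and retry options are omitted, so treat it as a sketch rather than a complete program):

from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def charge_customer(amount: float) -> str:
    # Leaf function: a single side effect, retried independently as an
    # activity task on any worker.
    return f"charged {amount}"

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, amount: float) -> str:
        # Parent function: execute_activity hands back a durable promise;
        # the workflow task completes (suspends) while the activity runs.
        return await workflow.execute_activity(
            charge_customer,
            amount,
            start_to_close_timeout=timedelta(seconds=30),
        )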
Restate
Restate supports arbitrary tree depth: functions calling functions calling functions. Each function execution can progress through multiple side effects before suspending when awaiting remote Restate services (durable functions), timers, or delegation promises. Failed functions are retriggered independently by the Restate Server rather than requiring parent re-execution.
Where Temporal drives progress of an activity via scheduling a workflow task, Restate drives progress by directly invoking the target function from the engine itself. This makes sense as there is no separate workflow and activity task specialization. If func1 is waiting on func2, then func1 can suspend while Restate executes (and possibly retries) func2 independently until it completes or reaches a retry or time limit, only then waking up func1 to resume.
Therefore we can say Restate is purely a push model. The Restate Server acts as a man-in-the-middle: it invokes functions, and functions send commands and notifications back to the Restate Server, which reacts to them. In its man-in-the-middle position, it can also subscribe to Kafka topics and invoke a function for each event.
Fig 6. Invocations are driven by the Restate Server. Functions will suspend when they await Restate-governed remote side effects (and have no local side effects left to run). Restate detects when a suspended function should be resumed and invokes it. Note this diagram omits the notifications sent from the Restate client back to the Restate Server relating to the start and end of each local side effect.
Resonate is definitely worth a mention here too: it falls into the arbitrary function tree camp, and goes further by defining a protocol for this pattern. The Resonate model looks the simplest (everything is a function, either local or remote), though I haven’t played with it yet. I recommend reading Dominik Tornow’s writing and talks on distributed async/await as trees of functions returning promises.
DBOS
DBOS has some similarities with Temporal in that it is also a two-level model with workflows and steps, except most steps are local-context (run as part of the parent function). DBOS workflows mostly operate as a single function with local-context side effects, except for a few cases like durable timers, which act as remote-context side effects and provide suspension points. A DBOS workflow can also trigger another workflow and await the result, providing another suspension point (as the other workflow is a remote-context side effect). In this way, DBOS can form function trees via workflows invoking workflows (as workflows are basically functions).
DBOS also uses a worker model, where workers poll Postgres for work (similar to Temporal workers polling task queues). Because steps are local-context side effects (such as db commands and API calls), a typical workflow does not suspend (unless it awaits a timer or another workflow). This differentiates it from Temporal, which schedules all activities as remote-context side effects (activity tasks run as independent units of work on any worker).
Fig 7. DBOS workers poll Postgres for work. Functions will suspend when they await a timer or another DBOS workflow. The logic is mostly housed in the DBOS client library, where polling detects when to resume a suspended workflow.
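A rough equivalent in DBOS's Python SDK might look like the following. I'm going from memory of the DBOS docs here, so the decorator and sleep names should be treated as approximate, and configuration/launch boilerplate is omitted:

from dbos import DBOS

@DBOS.step()
def charge_customer(amount: float) -> str:
    # A step: a local-context side effect, checkpointed in Postgres so it
    # is not re-executed when the workflow is recovered.
    return f"charged {amount}"

@DBOS.workflow()
def order_workflow(amount: float) -> str:
    result = charge_customer(amount)   # runs in-process, recorded durably
    DBOS.sleep(60)                     # durable timer: a suspension point
    return result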
Despite their differences, Temporal, Restate and DBOS suspend execution for the same fundamental reason: the distinction between locally-run and remotely-run side effects. Temporal makes activities explicitly remote but only ever one layer deep; Restate and DBOS generally make side effects local-context but support remote-context in the form of timers and other durable workflows/functions.
A Continuum of Constrained to Compositional Trees
Durable execution frameworks sit on a continuum from more constrained to more flexible, compositional models:
On the left, frameworks like Temporal and DBOS use two distinct abstractions: workflows (control flow logic) and activities/steps (side effects). Activities/steps are terminal leaves; only workflows can have children. This constraint provides helpful structure. It's clear what should be a workflow (multi-step coordination) versus an activity (a single unit of work). The tradeoff is less flexibility: if your "single unit of work" needs its own sub-steps, you must either break it into multiple activities or promote it to a child workflow.
On the right, frameworks like Resonate treat everything as functions calling functions. There's no distinction between "orchestration" and "work". Any function can call any other function to arbitrary depth. This provides maximum composability but requires discipline to avoid overly complex trees.
Restate straddles both camps: it offers multiple building blocks, which makes it harder to pin down on this continuum.
All positions on this continuum support function trees; the difference is how much structure the framework imposes versus how much freedom it provides. Constrained models offer guardrails against complexity, forcing you to think in terms of workflows and steps. Resonate and Restate provide more flexibility, functions calling functions, but inevitably this requires a bit more discipline.
Next in part 2
Using what we’ve covered in part 1, in part 2 we’ll take a step back and:
Look at how durable execution compares to event-driven architecture in terms of fault domains / responsibility boundaries.
Ask the question: what does durable execution actually provide us that we can’t achieve by other means?
Finally, look at how durable execution fits into the wider architecture, including event-driven architecture.