Demystifying Determinism in Durable Execution

Determinism is a key concept to understand when writing code using durable execution frameworks such as Temporal, Restate, DBOS, and Resonate. If you read the docs, you will see that some parts of your code must be deterministic while other parts do not have to be. This can be confusing to a developer new to these frameworks.

This post explains why determinism is important, where it is needed, and where it is not. Hopefully, you’ll come away with a mental model that makes things less confusing.

We can break down this discussion into:

  1. Recovery through re-execution.

  2. Separation of control flow from side effects.

  3. Determinism in control flow.

  4. Idempotency and duplication tolerance in side effects.

This post uses the terms “control flow” and “side effect”, but there is no agreed-upon set of terms across the frameworks. Temporal uses “workflow” and “activity” respectively. Restate uses terms such as “handler”, “action” and “durable step”. Each framework uses different vocabulary and has a different architecture behind it. There isn’t a single overarching concept that covers everything, but the one outlined in this post provides a simple way to think about determinism requirements in a framework-agnostic way.

1) Recovery through re-execution

Durable execution takes a function that performs some side effects, such as writing to a database, making an API call or sending an email, and makes it reliable via recovery (which in turn depends on durability).

For example, a function with three side effects:

  1. Step 1, make a db call.

  2. Step 2, make an API call.

  3. Step 3, send an email.

If step 2 fails (despite in situ retries) then we might leave the system in an inconsistent state (the db call was made but not the API call).

In durable execution, recovery consists of executing the function again from the top, and using the results of previously run side effects if they exist. For example, we don’t just execute the db call again, we reuse the result from the first function execution and skip that step. This becomes equivalent to jumping to the first unexecuted step and resuming from there.

Fig 1. A function is retried, using the results of the prior partial execution where available.

So, durable execution ensures that a function can progress to completion via recovery, which is a retry of the function from the top. The code is executed again, but stored results are used wherever they exist, so the net effect is to resume from the point of failure. In my Coordinated Progress model, this is the combination of a reliable trigger and progressable work.
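To make the replay mechanics concrete, here is a minimal sketch in Python. It is not how any particular framework is implemented: the Journal class, its step method and the three stub side effects are hypothetical names, and an in-memory dict stands in for durable storage.

```python
class Journal:
    """Hypothetical in-memory stand-in for a framework's durable result store."""

    def __init__(self):
        self.results = {}  # step name -> recorded result

    def step(self, name, fn):
        # Reuse the recorded result if this step completed in a prior execution.
        if name in self.results:
            return self.results[name]
        result = fn()  # may raise; nothing is recorded in that case
        self.results[name] = result
        return result


calls = []                      # which side effects actually ran, in order
state = {"api_is_down": True}   # simulates step 2 failing on the first execution


def db_call():
    calls.append("db")
    return "row-1"


def api_call():
    calls.append("api")
    if state["api_is_down"]:
        raise ConnectionError("API unavailable")
    return "ok"


def send_email():
    calls.append("email")
    return "sent"


def three_steps(journal):
    journal.step("db", db_call)
    journal.step("api", api_call)
    journal.step("email", send_email)


journal = Journal()
try:
    three_steps(journal)        # execution 1: fails at step 2
except ConnectionError:
    pass

state["api_is_down"] = False
three_steps(journal)            # execution 2: re-run from the top
print(calls)                    # ['db', 'api', 'api', 'email']
```

The db call appears only once in the output: the second execution found its recorded result and skipped it, then carried on from the API call.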

2) Control flow is separate from side effects

A function is a mix of executing control flow and side effects. The control flow itself may include state, and branches (if/then/else) or loops execute based on that state. The control flow decides which side effects to execute based on this looping and branching.

Fig 2. Control flow and side effects

In Temporal, the bad_login function would be a workflow and the block_account and send_warning_email would be activities. The workflow and activity work is separated into explicit workflow and activity tasks, possibly run on different workers. Other frameworks simply treat this as a function and wrap each side effect to make it durable. I could get into durable promises and continuations but that is a topic I will cover in a future post.

So let’s look at another example.

def process_order(order, customer_id):
    order_date = now()
    customer = get_customer(customer_id)

    if order_date <= order.promo.end_date:
        charge_card_with_disc(order.amount, order.promo.disc, customer)
    else:
        charge_card(order.amount, customer)

    send_receipt_email(order, customer)

First we record the order date and retrieve the customer record. Then we check whether the order date falls on or before the promo end date: if so, we charge the card with the promo discount (10% in this example), otherwise we charge the full amount. Finally, we send a receipt email. This code contains a bug that we’ll cover in the next section.

Fig 3. process_order function as a mix of control flow (green) and side effects (grey)

Durable execution treats the control flow differently from the side effects, as we’ll see in sections 3 and 4.

3) Deterministic control flow

Determinism is required in the control flow because durable execution re-executes code for recovery. While any stored results of side effects from prior executions are reused, the control flow is executed in full.

Let’s look at an example:

Fig 4. Double charge bug because of a non-deterministic if/else

In the first execution, the current time is before the promo end date, so the then-branch is executed, charging the card with the discount. Suppose a later step then fails and the function is retried. On the second execution, the current time is after the promo end date, causing the else-branch to execute. No result was ever recorded for the full-price charge, so it runs, and the customer is charged twice.

Fig 5. A non-deterministic control flow causes a different branch to execute during the function retry.

This is fixed by making now() deterministic: we turn it into a durable step whose result is recorded. When the function is executed a second time, the step returns the same datetime as the first time. The various SDKs provide deterministic dates, random numbers and UUIDs out of the box.

id = ctx.random.uuid4()
now = ctx.random.now()
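As a rough illustration of why recording the clock reading fixes the branch, here is a small Python sketch. The Journal class, the choose_branch function, the promo date and the fake clock are all assumptions made up for this example, not any SDK's API.

```python
from datetime import datetime


class Journal:
    """Hypothetical in-memory stand-in for a framework's durable result store."""

    def __init__(self):
        self.results = {}

    def step(self, name, fn):
        if name not in self.results:
            self.results[name] = fn()  # first execution: run and record
        return self.results[name]      # replay: return the recorded value


PROMO_END = datetime(2025, 1, 31)
# Fake clock: the first reading is inside the promo, the next one is after it.
clock = iter([datetime(2025, 1, 31), datetime(2025, 2, 1)])


def choose_branch(journal):
    # The clock is read inside a durable step, so replays see the first reading.
    order_date = journal.step("now", lambda: next(clock))
    return "discount" if order_date <= PROMO_END else "full_price"


journal = Journal()
first = choose_branch(journal)   # reads the clock and records the result
second = choose_branch(journal)  # replay: the clock is not consulted again
print(first, second)             # discount discount
```

Both executions take the discount branch because the second one never looks at the wall clock; it only sees the recorded value.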

Another fun example is if we make the decision based on the customer record retrieved from the database.

def process_order(order, customer_id):
    customer = get_customer(customer_id)

    if customer.points >= order.points_value:
        pay_with_points(order.points_value, customer)
    else:
        charge_card(order.amount, customer)

    send_receipt_email(order, customer)

In this variant, the decision is made based on the loyalty points the customer currently has. Do you see the problem?

If the send email side effect fails, then the function is retried. However, the points value of the order was deducted from the customer in the last execution, so that in execution 2, the customer no longer has enough loyalty points! Therefore the else-branch is executed, charging their credit card! Another double payment bug.

We must remember that the durable function is not an atomic transaction. It could be considered a transaction which has guarantees around making progress, but not one atomic change across systems.

We can fix this new double charge bug by ensuring that the same customer record is returned on each execution. We can do that by treating the customer record retrieval as a durable step whose result will be recorded.

Fig 6. Make the customer retrieval deterministic if the control flow depends on it.
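The loyalty points fix can be sketched the same way. This is a toy model under stated assumptions: Journal is a hypothetical in-memory result store, db is a dict standing in for the customer database, and the receipt email is rigged to fail once.

```python
class Journal:
    """Hypothetical in-memory stand-in for a framework's durable result store."""

    def __init__(self):
        self.results = {}

    def step(self, name, fn):
        if name not in self.results:
            self.results[name] = fn()  # run and record; a raise records nothing
        return self.results[name]


db = {"points": 100}            # stand-in for the customer database
payments = []                   # every payment that actually happened
state = {"email_down": True}    # the receipt email fails on the first execution


def get_customer():
    return dict(db)             # snapshot of the record at read time


def pay_with_points(points_value):
    db["points"] -= points_value
    payments.append("points")


def charge_card():
    payments.append("card")


def send_receipt_email():
    if state["email_down"]:
        raise ConnectionError("mail server unavailable")


def process_order(journal, points_value):
    # Recorded once, so every replay branches on the same customer record.
    customer = journal.step("get_customer", get_customer)
    if customer["points"] >= points_value:
        journal.step("pay_with_points", lambda: pay_with_points(points_value))
    else:
        journal.step("charge_card", charge_card)
    journal.step("send_receipt_email", send_receipt_email)


journal = Journal()
try:
    process_order(journal, 80)  # execution 1: pays with points, email fails
except ConnectionError:
    pass

state["email_down"] = False
process_order(journal, 80)      # execution 2: same branch, payment skipped
print(payments, db["points"])   # ['points'] 20
```

The database now holds only 20 points, yet the replay still takes the points branch because it sees the recorded snapshot of 100. If get_customer were called directly instead of through the journal, the second execution would read 20 points and charge the card.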

Re-execution of the control flow requires determinism: it must execute based on the same decision state every single time and it must also pass the same arguments to side effect code every single time. However, side effects themselves do not need to be deterministic. They only require idempotency or duplication tolerance.

4) Side effect idempotency and duplication tolerance

Durable execution re-executes the control flow as many times as is needed for the function to make progress to completion. However, it typically avoids executing the same side effects again if they were previously completed. The result of each side effect is durably stored by the framework and a replay only needs the stored result.

Therefore side effects do not need to be deterministic, and often determinism there would be undesirable anyway. A db query that retrieves the current number of orders or the current address of a customer may return a different result every time. That’s a good thing, because the number of orders might change, and an address might change. If the control flow depends on the number of orders, or the current address, then we must ensure that the control flow always receives the same answer. This is achieved by storing the result of the first execution, and using that result for every replay (making the control flow deterministic).

Now to idempotency. What if a side effect does complete, but a failure of some kind causes the result to not be stored by the framework? Well, the durable execution framework will replay the function, see no stored result and execute the side effect again. For this reason we want side effects to either be idempotent or otherwise tolerate running more than once. For example, we might decide that sending the same email again is ok, because the cost of reliable idempotency might not be worth it. On the other hand, a credit card payment most definitely should be idempotent.
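One common way to make a payment idempotent is an idempotency key that the downstream system deduplicates on. The sketch below assumes a hypothetical PaymentProvider class and a made-up key format; real payment APIs differ in the details.

```python
class PaymentProvider:
    """Hypothetical payment API that deduplicates requests by idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency key -> charge record

    def charge(self, idempotency_key, amount):
        # A repeated key returns the original charge instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "charge_id": len(self.charges) + 1,
                "amount": amount,
            }
        return self.charges[idempotency_key]


provider = PaymentProvider()

# Derived from stable inputs (order id and step name), so a retry reuses it.
key = "order-42:charge_card"

first = provider.charge(key, 50)   # completes, but assume the result was lost
second = provider.charge(key, 50)  # the framework re-executes the step
print(first == second, len(provider.charges))  # True 1
```

The key must itself be deterministic. If it were a fresh random value on each execution, the provider would see two unrelated requests and charge twice.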

Implicit vs explicit control flow / side effect separation

Some frameworks, notably Temporal, make the separation of control flow from side effects explicit. In the Temporal programming model, the workflow definition is the control flow and each activity is a side effect (or some sort of non-deterministic operation).

Other frameworks such as Resonate and Restate are based on functions that can call other functions, which can result in a tree of function calls. Each function in this tree has a portion of control flow and side effects (either executed locally or via a call to another function).

Fig 7. A tree of function calls, with control-flow in each function.

Each of these functions has the same need for determinism in its control flow. This is achieved by ensuring the same inputs, and by replacing non-deterministic operations (such as date/times, random numbers, ids and retrieved objects) with deterministic ones.

Conclusions

Our mental model is built on separating a durable function into the control flow and the side effects. Some frameworks actually explicitly separate the two (like Temporal) while others are more focused on composable functions.

The need for determinism in control flow is a by-product of recovery being based on retries of the function. If we could magically reach into the function, to the exact line to resume from, reconstructing the local state and executing from there, we wouldn’t need deterministic control flow code. But that isn’t how it works. The function is executed again from the top, and it had better make the same decisions again, or else you might end up with weird behaviors, inconsistencies or even double charging your customers.

The side effects can be, and often should be, non-deterministic. That is fine because they should generally only be executed once, even if the function itself is executed many times. For those failure cases where the result is not durably stored, we rely on idempotency or duplication tolerance.

This is a pretty generalized model. There are a number of nuances and differences across the frameworks. Some of the examples would actually result in a non-determinism error in Temporal, due to how it records event history and expects a matching replay. The developer must learn the peculiarities of each framework. Hopefully this post provides a general overview of determinism in the context of durable execution.