In part 2, we built a mental framework using a graph of nodes and edges to represent distributed work. Workflows are subgraphs coordinated via choreography or orchestration. Reliability, in this model, means reliable progress: the result of reliable triggers and progressable work.
In part 3, we refine this graph model in terms of the different types of coupling between nodes, and how edges can be synchronous or asynchronous. Let’s set the scene with an example, then dissect it using the concepts of coupling and communication styles.
E-commerce Example: Choreography vs Orchestration
Let's say we have these services:
Inventory Service (checks/reserves stock)
Payment Service (processes payment)
Shipping Service (arranges delivery)
Notification Service (sends confirmations)
Choreography Example
Each service reacts independently to events, with no central coordinator driving the flow (see the sketch below).
Compensations are also event-driven. If a payment fails, the Payment Service publishes a PaymentFailed event. The Inventory Service listens for this and releases the reserved stock. The Shipping Service simply ignores the order, since it never sees both of the events it requires.
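To make the wiring concrete, here is a minimal, runnable sketch of choreographed handlers. The in-memory EventBus is only a stand-in for a durable broker, and the event names StockReserved and PaymentProcessed are my own assumptions; only OrderPlaced and PaymentFailed appear in the description above.

import asyncio
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a durable event broker (illustrative only)."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type):
        def register(handler):
            self.handlers[event_type].append(handler)
            return handler
        return register

    async def publish(self, event_type, payload):
        for handler in self.handlers[event_type]:
            await handler(payload)

bus = EventBus()

@bus.subscribe("OrderPlaced")
async def inventory_on_order_placed(event):
    # Inventory Service: reserve stock, then announce the result.
    await bus.publish("StockReserved", {**event, "reservation_id": "res-1"})

@bus.subscribe("StockReserved")
async def payment_on_stock_reserved(event):
    # Payment Service: charge the card, publish success or failure.
    succeeded = event.get("card_valid", True)
    await bus.publish("PaymentProcessed" if succeeded else "PaymentFailed", event)

@bus.subscribe("PaymentFailed")
async def inventory_on_payment_failed(event):
    # Compensation: release the reserved stock.
    print("releasing reservation", event["reservation_id"])

asyncio.run(bus.publish("OrderPlaced", {"order_id": "o-42", "card_valid": False}))

Notice that no handler calls another service directly; each one only knows about the events it consumes and emits.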
Key traits in action:
Decoupled: Payment Service doesn't know or care about Inventory Service
Temporal decoupling: If Shipping Service is down, events wait in the queue.
Reliable progress: Durable events act as reliable triggers of work.
Emergent workflow: Easy to add new services (like Fraud Detection) that react to existing events.
Orchestration Example
A central Order Orchestrator manages the entire flow:
async def process_order(order_data):
    try:
        # Step 1: Reserve inventory
        stock_reservation = await inventory_service.reserve_stock(order_data.items)

        # Step 2: Process payment
        payment_result = await payment_service.charge_card(
            order_data.card, order_data.total
        )

        # Step 3: Arrange shipping
        shipping_info = await shipping_service.schedule_delivery(
            order_data.address, stock_reservation.warehouse
        )

        # Step 4: Send notifications
        await notification_service.send_confirmation(
            order_data.email, payment_result, shipping_info
        )
    except PaymentFailedException:
        # Compensation: release reserved stock
        await inventory_service.release_stock(stock_reservation.id)
        await notification_service.send_failure_email(order_data.email)
        raise
Key traits in action:
Centralized control: Easy to see the entire workflow in one place.
Clear compensation: If payment fails, the orchestrator explicitly releases stock.
Reliable progress: a Durable Execution Engine (DEE) ensures the workflow resumes exactly where it left off after crashes.
Tight coupling: Orchestrator must know about all participating services.
The Contrast
Adding a new direct requirement (e.g., fraud check)
Choreography: Add a Fraud Service that listens to OrderPlaced and publishes FraudCheckPassed/FraudCheckFailed. Update the Payment Service to wait for fraud clearance. No changes to other services.
Orchestration: Modify the central orchestrator code to add the fraud check step. Deploy the new version carefully to handle in-flight workflows (versioning is one serious challenge for orchestration-based workflows).
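Here is a rough sketch of the difference, reusing the illustrative bus from the choreography sketch above and the earlier orchestrator. The names fraud_service, looks_legit, and FraudCheckFailedException are made up for this example.

# Choreography: a brand-new service simply subscribes to an existing event.
# (The Payment Service would also be updated to wait for FraudCheckPassed.)
@bus.subscribe("OrderPlaced")
async def fraud_on_order_placed(event):
    passed = await looks_legit(event)
    await bus.publish("FraudCheckPassed" if passed else "FraudCheckFailed", event)

# Orchestration: the central function gains a new awaited step and must be
# redeployed, taking care over workflows that are already in flight.
async def process_order(order_data):
    stock_reservation = await inventory_service.reserve_stock(order_data.items)
    fraud_result = await fraud_service.check(order_data)  # the new step
    if not fraud_result.passed:
        await inventory_service.release_stock(stock_reservation.id)
        raise FraudCheckFailedException(order_data.order_id)
    ...  # payment, shipping and notification steps continue as before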
Adding indirect dependencies
The fraud check is probably connected by a direct edge, i.e., it is necessary for the order process to complete. Now let’s add some tangential actions, such as updating a CRM, updating a reporting service, and logging the order in an auditing service. Each gets added one at a time over a period of weeks.
Choreography: No changes to existing services. Make the CRM, reporting service, and auditing service listen for the OrderPlaced and OrderCancelled events.
Orchestration: Modify the central orchestrator code to add the call to the CRM and deploy; then repeat for the financial reporting service, and again for the auditing service.
Debugging a failed order
Choreography: Check logs across multiple services, correlate by order ID, reconstruct the event flow.
Orchestration: Look at the orchestrator's execution history.
As you can see from even this limited example, there is a wealth of pros and cons to consider.
Now let's use this example, along with additional concepts around coupling and communication styles, to refine our mental framework further.
Refining the model further
Coupling Types
Coupling is a key consideration when designing service-to-service communication. It comes in different forms, most notably design-time coupling and runtime coupling.
Design-time coupling refers to how much one service (a node in our graph) must know about another in order to interact. RPC introduces strong design-time coupling: the caller must know the callee's interface, expected request structure, response shape, and often its semantic behavior (such as throttling and latency profile). Even changes to the internal implementation of the callee can break the caller if not carefully abstracted.
Code that triggers work via RPC must be changed every time a work dependency is added, removed, or changed.
Events also introduce design-time coupling, primarily around shared schemas. However, they reduce coupling in other forms. For example, in a choreographed architecture, producers don’t need to know who consumes an event or how it’s processed. This allows new consumers to be added later without changes to the producer. Services evolve more independently, as long as schema evolution practices (e.g. schema versioning, compatibility guarantees) are respected. For example, the service emitting the PaymentProcessed event doesn’t need to know whether shipping, analytics, or notification systems will consume it, or even whether those systems exist yet. This contrasts with orchestration, where the orchestrator typically must know about these services and must be updated when new dependencies are added (as in our e-commerce example).
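A small illustration of the difference, using made-up names (payment_service_client, and the bus from the earlier sketch) rather than any particular framework:

async def take_payment_via_rpc(order):
    # Design-time coupling: the caller depends on the callee's client, its
    # method signature, and its error model.
    return await payment_service_client.charge_card(card=order.card, amount=order.total)

async def announce_payment_via_event(order):
    # The producer depends only on the event schema; it neither knows nor cares
    # which services (if any) consume PaymentProcessed.
    await bus.publish("PaymentProcessed", {"order_id": order.id, "amount": order.total})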
Runtime coupling is about whether services need to be up and available at the same time. RPC is tightly runtime-coupled: if service A calls service B synchronously, B must be up, fast, and reliable. Chains of RPC calls can create fragile failure modes, where one slow or unavailable service can stall or crash an entire request path.
As Leslie Lamport famously said:
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” — Leslie Lamport
He must have been thinking about RPC call chains.
But request/response implemented over queues can suffer from similar runtime coupling if the timing expectations are the same as with RPC. We’ll get into timing expectations in the next section.
In contrast, publish-subscribe using events is free from runtime coupling between services. A service can emit an event and move on, trusting that any consumers (if they exist) will eventually process it when ready.
Both RPC and events come with coupling trade-offs. Events reduce direct entanglement between services, especially at runtime, while still requiring careful attention to schema contracts. RPC is simpler in some use cases, but creates tighter coupling that can make systems more brittle over time.
Sync vs Async: A Communication Continuum
My colleague Gunnar Morling wrote a nice piece Synchrony Budget that is relevant here.
Communication between services isn’t just synchronous vs asynchronous; it exists on a continuum of timing and response expectations. I will differentiate here between a response and a confirmation.
I use the term confirmation for an immediate reply of “I’ve got this, trust me to get it done”. A service that carries out the work asynchronously will still respond with a 200 to confirm to the caller that the request was received successfully. Kafka, for example, sends a producer acknowledgement once the batch has been committed.
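As a concrete example of a confirmation (as opposed to a response), here is roughly what a producer acknowledgement looks like with the confluent-kafka Python client; the topic name and payload are made up:

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # This is a confirmation: the broker has durably accepted the event.
    # It says nothing about whether any consumer has processed it yet.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"confirmed: {msg.topic()}[{msg.partition()}] @ offset {msg.offset()}")

producer.produce("orders", key="order-42", value=b'{"status": "placed"}',
                 on_delivery=on_delivery)
producer.flush()  # block until outstanding confirmations have arrived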
I use the term response for a reply, possibly arriving later and even through a different communication medium, that contains data associated with the requested work once it is done.
Responses sit on a continuum.
At one end is immediate response, where the caller needs an answer now to proceed. At the other end is no response at all, where the sender simply informs the system of something that happened. In between lies relaxed response timing, where the result is needed eventually, but not right away.
RPC typically sits at the immediate-response end. It’s used when the caller needs to block and wait for the result, with a short latency requirement. One example is “reserve inventory before payment” in a synchronous checkout flow. This tight timing requirement makes the caller innately dependent on the callee's availability and latency, making RPC a valid choice.
Events sit at the no-response end. They’re used when a service simply emits a fact (e.g. “order placed”) without expecting a reply. Consumers handle the event independently and asynchronously.
Relaxed-timing responses live in the middle. Some use asynchronous RPC (invoking a function and getting the result later, e.g. via a webhook or polling), and some use queue-based patterns. Asynchronous RPC is still runtime-coupled, whereas queues act as fault-tolerant middleware that will eventually deliver the request to the destination.
While RPC and events are often presented as opposites, real systems use hybrids. RPC can be made asynchronous. Events can be used for request/response with correlation IDs. But starting with this continuum helps frame the key trade-offs.
The Reliable RPC Curve Ball
We defined Reliable RPC previously as being “delivered” by a fault-tolerant middleware, such as a Durable Execution Engine (DEE). This gives RPC some queue-like properties. For one, it reduces runtime coupling between the sender and receiver. If the receiver is down when the sender sends it, no worries, the RPC will eventually get delivered to the receiver, just like an event/command on a queue. Using the graph terminology from parts 1 and 2, it has turned an ephemeral edge into a reliable one (a reliable trigger). In fact, I tend to think of Reliable RPC as another implementation of point-to-point queues, which are either one-way or request/response. Of course, this is most useful in relaxed timing scenarios.
But what about resumability? If a Reliable RPC is like a synchronous RPC but with reliable delivery (of both request and response), then what about the sender’s context and resuming once the result has been received?
The sender keeps a bunch of variables and context in its stack memory, and if the host fails, it cannot resume once the Reliable RPC has completed.
Let’s take an example of a two-step process:
Make a call to reserve stock.
Make a call to the payment service.
This is a relaxed-timing scenario; the user has been told the order is in progress. Let’s say the first call is made and the stock gets reserved, but the calling context dies before the response is received. How does this work get resumed?
Resumability with asynchronous RPC or queues/events
Asynchronous RPC and request/response over queues address resumability in the above scenario in the following ways:
The response is not received directly by the calling context, but via an event handler that receives it from a queue or via a webhook RPC. So it doesn’t matter that the original context is dead; a completely different instance of the application could receive the response.
The state necessary for moving on to step 2 is either contained in the response, or a correlation ID is included in both the request and the response and the necessary state is written to a data store keyed by that correlation ID, so the receiver can retrieve it.
The response handler then calls the payment service.
The downside of this approach is that the code is spread across response handlers, making it more complex to write, read, and debug.
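A minimal sketch of the pattern, assuming a hypothetical send_to_queue helper and an in-memory dict standing in for a real data store keyed by correlation ID:

state_store = {}  # in reality, a database keyed by correlation ID

async def send_to_queue(queue_name, message):
    # Stand-in for producing to a real queue or topic.
    print(f"-> {queue_name}: {message}")

async def start_order(order):
    correlation_id = order["order_id"]
    state_store[correlation_id] = order  # persist what step 2 will need
    # Step 1: request the stock reservation, then return. This context can die;
    # the reply will be handled elsewhere, possibly by another instance.
    await send_to_queue("reserve_stock_requests",
                        {"correlation_id": correlation_id, "items": order["items"]})

async def on_stock_reserved(reply):
    # Runs on whichever instance consumes the reply from the queue.
    order = state_store[reply["correlation_id"]]  # recover the needed state
    # Step 2: proceed with payment.
    await send_to_queue("payment_requests",
                        {"correlation_id": reply["correlation_id"],
                         "card": order["card"], "amount": order["total"]})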
Reliable RPC + Resumability
With queues or webhooks, the response arrives out-of-band, and correlation IDs must be managed manually. Some DEEs eliminate this complexity by letting you write code as if it were synchronous. The framework persists intermediate state and re-invokes the function so that it can make progress.
The DEE acts as the reliable delivery middleware for requests and responses (like a queue), but also abstracts away all the response handler stuff.
All the code exists in one function, and if the function fails, the DEE invokes it again. Crucially, the code is written using the DEE SDK, which silently retrieves the state of the function's progress (including prior responses), so the code essentially resumes from where it left off by using persisted state to skip prior work.
Determinism is required for this strategy to work. For example, if the function starts by generating a random UUID as the identifier for the work (such as an order), then a second invocation would generate a different UUID. This is a problem if half the steps have already been carried out using the first UUID. To cover such needs, the SDK provides deterministic UUIDs, datetimes, integers, and so on (basically, it durably stores the result of the first generation and replays it on subsequent invocations).
Reliable RPC coupled with resumability simplifies the code (no callbacks, response handlers, or correlation IDs). All the code can exist in one function, yet it can resume from a different application instance after failure. With a reductionist mindset, we could say that Reliable RPC + resumability is just a convenience that avoids building complex asynchronous response handling. But this is not a trivial aspect at all: it can make the developer's life easier and make for more readable code.
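A sketch of what this can look like. The @dee.workflow decorator, ctx.step, and ctx.uuid are invented names for this illustration; real engines such as Temporal or Restate differ in their APIs, but the shape is similar: each step's result is durably recorded, and a re-invocation replays recorded results instead of redoing the work.

@dee.workflow
async def process_order(ctx, order_data):
    order_id = ctx.uuid()  # deterministic: the same value on every re-invocation

    # Each step is persisted; after a crash, completed steps are skipped and
    # their recorded results are returned instead of re-executing the calls.
    reservation = await ctx.step(inventory_service.reserve_stock, order_data.items)
    payment = await ctx.step(payment_service.charge_card,
                             order_data.card, order_data.total)
    shipping = await ctx.step(shipping_service.schedule_delivery,
                              order_data.address, reservation.warehouse)
    await ctx.step(notification_service.send_confirmation,
                   order_data.email, payment, shipping)
    return order_id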
Coming back to Direct vs Indirect edges
Bringing this back to the graph model, not all edges are created equal.
Direct edges represent dependencies where the workflow cannot proceed without some kind of response. In the e-commerce flow, the order cannot be completed without reserving inventory and processing the payment first. These are blocking dependencies, though the timing requirements could be short (ideal for RPC) or long (best suited to reliable forms of communication such as a queue, event stream or Reliable RPC).
When a workflow has blocking dependencies, the coupling between steps already exists at the business logic level. The question becomes whether to make this coupling explicit through orchestration or implicit through event choreography.
The coupling cost of orchestration may be worth paying because these services already have a runtime dependency: they need each other to function. Making this explicit through orchestration may reduce overall system complexity compared to managing the same coordination through distributed event flows.
However, even for a core workflow of direct edges, the decoupled nature of choreography might win out. It may simply not make sense for one piece of a wider workflow to be modeled as an orchestrated workflow, with its own technologies, support issues, versioning strategies, and deployment procedures. Team autonomy and team boundaries may also cut across such a workflow, such that the decoupling of choreography is still the better fit.
Indirect edges represent actions that are operationally or even strategically important but don't block the immediate workflow from completing successfully. The reliable indirect edge ensures that important work eventually gets carried out. In our e-commerce example, CRM updates, financial reporting, and audit logging might be business-critical or legally required, but the customer's order can be fulfilled even if these systems are temporarily unavailable.
Indirect edges will also often cross more granular organizational boundaries, where the cost of inter-team coordination is higher. The order processing logic is likely concentrated in a small number of teams in the software development org, whereas the financial and auditing services/systems are likely managed by a different set of teams, potentially under different management hierarchies.
For indirect edges, events match this natural business decoupling because:
Separate concerns are decoupled:
If CRM updates, financial reporting, and audit logging are separate business concerns, they shouldn't be embedded in order processing logic.
Every time a new system needs to know about orders (fraud detection, analytics, customer success tools), the core orchestrator would need modification, deployment, and versioning.
Reduces coordination overhead: Teams owning peripheral systems can independently subscribe to order events without requiring changes from the order processing team.
Prevents scope creep: The core workflow stays focused on its essential purpose rather than accumulating tangential responsibilities.
The principle is that orchestration should be limited to direct-edge subgraphs, i.e., the minimal set of services that must coordinate to complete the core business function. Everything else should use choreography to preserve the business-level decoupling that already exists.
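As a closing sketch of that principle, reusing the illustrative names from earlier: the orchestrator covers only the direct-edge subgraph, then publishes a fact at its boundary, and the indirect edges hang off that event. The OrderCompleted event name is my own invention.

async def process_order(order_data):
    # Direct edges: the minimal set of steps the order cannot complete without.
    reservation = await inventory_service.reserve_stock(order_data.items)
    payment = await payment_service.charge_card(order_data.card, order_data.total)
    shipping = await shipping_service.schedule_delivery(order_data.address,
                                                        reservation.warehouse)
    # Boundary: publish a fact for everything that sits outside the core workflow.
    await bus.publish("OrderCompleted", {"order_id": order_data.order_id})

# Indirect edges: owned by other teams, added at any time, no orchestrator changes.
@bus.subscribe("OrderCompleted")
async def update_crm(event): ...

@bus.subscribe("OrderCompleted")
async def record_for_financial_reporting(event): ...

@bus.subscribe("OrderCompleted")
async def write_audit_log(event): ...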
In Part 4, we’ll finish the series with some last reflections and a loose decision framework.
Coordinated Progress series links: