In this final post, we’ll review the mental framework and consider how it can translate into a decision framework.
Reliability and execution models
Microservices, functions, stream processors and AI agents represent nodes in our graph. An incoming edge represents a trigger of work for the node, and the node must do the work reliably. I have been using the term reliable progress, but I might have used durable execution if that term hadn’t already come to define a specific category of tooling.
In this light, stream processors like Flink and Kafka Streams are particularly interesting. They’ve had durable execution built in from the start. The model assumes that processing logic is applied to an ordered, durable log of events. Progress is explicitly tracked through durable checkpoints or changelogs. Failures are expected and routinely recovered from. Many of the promises of durable execution are simply native assumptions in the stream processing world.
You can think of stream processors as always-on orchestrators for data-in-motion:
The durable trigger is the event log itself.
The progressable work is defined through operators, keyed state, and fault-tolerant state backends.
This makes stream processing a powerful, production-tested example of how durable execution/reliable progress can be built into the foundation of an execution model.
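To make that concrete, here is a minimal Kafka Streams sketch. The topic names and the counting logic are illustrative, not from the series: the point is that the input topic is the durable trigger, and the keyed count is progressable work backed by a local store and a changelog topic, so a restarted instance resumes from its last committed position and state.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class OrderCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count"); // also prefixes the changelog topics
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // The durable trigger: an ordered, replayable log of events.
        KStream<String, String> orders = builder.stream("orders");

        // Progressable work: keyed state kept in a local store and backed by a
        // changelog topic, so a restarted instance resumes where it left off.
        KTable<String, Long> ordersPerCustomer = orders
                .groupByKey()
                .count();

        ordersPerCustomer.toStream()
                .to("orders-per-customer", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```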
Microservices and functions have not historically had a durable execution/reliable progress foundation built into their execution model (except perhaps for some actor frameworks). Reliable triggers exist in the form of queues and event streams, but progressable work has been limited to idempotency. The Durable Execution category aims to close this reliable progress gap for microservices and functions. It can provide an additional reliable trigger in the form of Reliable RPC, and it provides progressable work by durably persisting all progress, with resume functionality.
The question then comes down to preferences and constraints:
What’s your coding style preference?
What's your coupling tolerance?
What infrastructure dependencies can you tolerate?
Stream processors like Flink and Kafka Streams use a continuous dataflow programming model, with durability "batteries included". You likely already have a dependency on an event streaming system like Kafka. Flink is an additional runtime to manage, while Kafka Streams is just a library but still comes with some operational challenges related to state management.
Imperative microservices and functions offer familiar procedural programming but typically lack native durability mechanisms. To achieve reliable progress, you can:
Couple reliable triggers (like events from queues/topics) with manual progressability patterns such as idempotency and spreading logic across response handlers (see the sketch after this list).
Or, add a durable runtime in the form of a Durable Execution Engine for automatic state persistence and resumability (another infrastructure dependency).
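As a rough sketch of the first option, an idempotent handler records the ids of messages it has already processed, so that redelivery after a failure does not repeat the work. The message type and the processed-id store below are illustrative placeholders; in production the store would be durable (e.g. a database table) and ideally updated atomically with the work itself.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// A minimal sketch of the "reliable trigger + idempotent handler" pattern.
// The message type and the in-memory processed-id set are placeholders for
// a durable store committed together with the work.
public class PaymentHandler {

    record Message(String id, String orderId) {}

    interface PaymentService { void charge(String orderId); }

    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final PaymentService payments;

    PaymentHandler(PaymentService payments) {
        this.payments = payments;
    }

    // Invoked once per delivery. The broker redelivers on failure, so the
    // handler must tolerate seeing the same message more than once.
    public void onMessage(Message msg) {
        if (processedIds.contains(msg.id())) {
            return; // duplicate delivery after a crash or timeout: safe to skip
        }
        payments.charge(msg.orderId()); // the actual work
        processedIds.add(msg.id());     // ideally committed atomically with the work
        // Only acknowledge the message to the broker once both have succeeded.
    }
}
```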
Start with a mental framework
Building reliable distributed systems requires thoughtful choices about coordination and progress. Architecture decisions are often about managing complexity in systems that span teams, services, and failure domains. We can simplify the decision-making process with a mental framework based on a graph of nodes, edges, and workflows. The reliability of distributed work in this graph comes from coordinated reliable progress (of reliable triggers, progressable work, choreography and orchestration).
To summarize our model, the graph is composed of:
Nodes are microservices, FaaS functions, stream processing jobs and AI agents.
Edges are RPCs, queues, and event streams. These vary widely in semantics: some are ephemeral, others durable, which affects reliability. There are direct and indirect edges.
Workflows are sub-graphs (connected components, in graph-theory terms) of the graph.
Coordination strategies shape how workflows are built and maintained:
Choreography (reactive, event-driven) provides high decoupling and flexibility.
Orchestration (centralized, procedural) offers greater clarity and observability.
Reliable progress hinges on two core concepts:
Durable Triggers: Work must be initiated in a way that survives failure (e.g., Kafka, queues, reliable RPC).
Progressable Work: Once started, work must be able to make progress under adverse conditions via replayability using patterns such as idempotency, atomicity, or the ability to resume from saved state.
Coupling is an ever-present property that must be balanced with other needs and constraints.
Apply the model via your own decision framework
Using this graph model, we can apply a loose decision framework:
Ask whether an edge should be a reliable trigger, and if so, what form best serves your reliability and coupling requirements. Is it a direct or indirect edge?
Ask whether a node needs progressable work capabilities, and whether idempotency, transactions, or durable state persistence best fits your context. What programming style are you comfortable with? What infrastructure dependencies are you willing to take on?
Consider the coordination trade-offs: does this workflow need choreography's flexibility or orchestration's clarity? Is there a core workflow in here, with only direct edges, that could be spun out? Is the graph highly connected and dynamic?
Consider the programming model. Do you prefer procedural code, or does a continuous dataflow programming model better suit you, or the problem being solved?
Also consider both the complexity of the code developers write and the total complexity of the system. Durable execution can make some code simpler to write, but it’s also another piece of middleware to support, with failure modes of its own, just like a distributed queue or event stream.
Your needs will be specific and complex, with their own constraints. But you can use this mental framework to define your own, more rigorous decision framework and make more informed, balanced architecture decisions.
I hope this graph model has been useful for you, whatever your preferences regarding coordination, communication and programming styles.
A theme I had not expected when I started out writing this analysis was the unifying thread of durability.
Durability is behind the idea that distributed work should not vanish or halt when something fails. Whether in communication (reliable triggers), in execution (progressable work), or in system design (trees and logs), durability underpins coordination and recoverability. Durability isn’t just about data, but about progress too. It’s a foundational property that can be built into functions, microservices, and stream processors alike, the only question is in what form.
Coordinated Progress series links: