Coordinated Progress – Part 1 – Seeing the System: The Graph

At some point, we’ve all sat in an architecture meeting where someone asks, “Should this be an event? An RPC? A queue?”, or “How do we tie this process together across our microservices? Should it be event-driven? Maybe a workflow orchestration?” Cue a flurry of opinions, whiteboard arrows, and vague references to sagas.

Now that I work for a streaming data infra vendor, I get asked: “How do event-driven architecture, stream processing, orchestration, and the new durable execution category relate to one another?”

These are deceptively broad questions, touching everything from architectural principles to practical trade-offs. To be honest, I had an instinctual understanding of how they fit together, but I’d never written it down. Coordinated Progress is a four-part series that lays out my mental framework, and I hope you find it useful and understandable.

I anchor the mental framework in the notion of a workflow, using the term broadly to mean any distributed work that spans multiple services (such as a checkout flow, a booking process, or a loan application pipeline). Many people use the term “saga” to describe long-running workflows that span multiple services and require coordination and compensation; this analysis uses the more general term workflow to capture a broader class of distributed work.

Series TLDR

Modern systems are no longer built as monoliths; they are sprawling graphs of computation, stitched together by APIs, queues, and streams, and implemented across microservices, functions, stream processors, and AI agents. Complex workflows cross service boundaries, requiring coordination that must be both reliable and understandable, as well as flexible and adaptable.

Within these graphs, the concepts of coordination and reliable progress are critically important.

Coordination strategies shape how workflows are built and maintained:

  • Choreography (reactive, event-driven) provides high decoupling and flexibility.

  • Orchestration (centralized, procedural) offers greater clarity and observability.

But not all edges in the graph are equal. Some edges are direct, defining the critical path of a workflow where failure means failure. Others are indirect, triggering auxiliary actions in adjacent or even far away services. A good mental model distinguishes between the two: orchestration should focus on direct edges, while choreography handles both naturally.

Reliable progress hinges on two core concepts:

  • Durable Triggers: Work must be initiated in a way that survives failure (e.g., Kafka, queues, reliable RPC).

  • Progressable Work: Once started, work must be able to make progress under adverse conditions, relying on replayability and patterns such as idempotency, atomicity, or the ability to resume from saved state.

While stream processors (e.g. Flink, Kafka Streams) and event-driven systems based on queues and topics (e.g. Kafka, RabbitMQ) have durability built in, imperative code typically does not. Durable Execution Engines (DEEs), such as Temporal, Restate, DBOS, Resonate, and LittleHorse (among many others), aim to fill that gap in the world of imperative functions. They provide varying tooling and language support for adding durable triggers and progressable work to imperative, procedural code.

This analysis constructs a conceptual framework for understanding both coordination and reliable progress in modern distributed architectures composed of microservices, functions, stream processing, and AI agents, including new building blocks made available by DEE frameworks.

It’s graphs all the way down

Neo: The Graph.

Morpheus: Do you want to know what it is?

Neo: Yes.

Morpheus: The Graph is everywhere. It is all around us. Even now, in this very server room. You can see it when you look at your IDE or when you design your Flink topology. You can feel it when you work on microservices... when you handle incidents... when you deploy.

At every level of abstraction, computation reveals itself as a graph. A function in a microservice contains a control flow graph (comprising branches, loops, and conditionals) that describes its execution logic. A Flink job is explicitly a directed graph of operators and stateful nodes connected by streams. A workflow that ties together multiple services, whether via orchestration or choreography, is also a graph, one that represents dependencies, event flows, or command sequences.

Coordination plays a critical role in this graph of graphs and is also present at every layer. Within a single Flink job, it is Flink itself that coordinates work across multiple task managers. Within a microservice, the executable code acts as a linear or concurrent coordination of multiple steps that may invoke other services or data systems. However, it is the coordination required for workflows across multiple systems that presents the largest challenge. Distributed coordination across heterogeneous systems, programming models, and environments is a strategic concern for organizations, with far-reaching consequences.

This graph, made up of workflows spanning multiple systems, can make discussions of programming models and coordination methods confusing. For example, one consumer in an event-driven architecture may execute its code procedurally, in an imperative style, yet play the role of a node in a reactive architecture. A workflow might be triggered by an event, which acts as a durable trigger enabling retries, or by an ephemeral RPC. The types of the nodes and edges of the graph matter.

Nodes, edges, and sub-graphs

While everything is a graph (even code), for this analysis, let’s confine things so that:

  • Nodes are microservice functions, FaaS functions, stream processing jobs, and AI agents.

  • Edges are RPCs, queues, and event streams. These vary widely in semantics: some are ephemeral, others durable, which affects reliability (these distinctions are modeled in the sketch below).

  • Workflows are sub-graphs (or connected components as in graph theory) of the Graph.

The Graph. Nodes connected by edges. Workflows as sub-graphs.
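
To ground this, here is a minimal sketch of the model in Python. The class names and fields are illustrative only, not a real library; they just encode the properties of nodes and edges that matter for the rest of this series.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "microservice" | "faas" | "stream-job" | "agent"

@dataclass(frozen=True)
class Edge:
    source: Node
    target: Node
    medium: str    # "rpc" | "queue" | "event-stream"
    durable: bool  # ephemeral RPC vs durable queue/log
    direct: bool   # critical path vs auxiliary fan-out (next section)

@dataclass
class Workflow:
    # A workflow is a sub-graph of the wider Graph.
    name: str
    edges: list = field(default_factory=list)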

Direct / indirect edges

What constitutes a workflow is debatable, especially with the broad meaning I’m using in this analysis. But I like to characterize a workflow by the types of its edges, which can be direct or indirect. Edges have other properties too, such as request/response vs one-way and synchronous vs asynchronous, but for now we’ll keep the model simple and think only about whether edges are direct or indirect.

Direct edges trigger work that is central to the goal being performed. Take an Order Placed workflow as an example: there might be a set of microservices that handle the payment, the reservation of stock, and the initiation of shipment preparation, all directly tied to the order. These are connected by direct edges and form the core Order Placed workflow.

Indirect edges trigger tangential, auxiliary work in the order workflow, such as notifying the CRM for customer management, or a finance/reporting system or auditing service for compliance. An indirect edge could even trigger a secondary core workflow, such as a just-in-time inventory process.

Whether an edge is direct or indirect will influence what kind of communication medium is chosen (more on that in parts 2 and 3).
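
Continuing the sketch from above, the Order Placed example might be encoded like this (the service names are hypothetical):

orders   = Node("orders", "microservice")
payments = Node("payments", "microservice")
stock    = Node("stock", "microservice")
shipping = Node("shipping", "microservice")
crm      = Node("crm", "microservice")

order_placed = Workflow("order-placed", edges=[
    # Direct edges: the critical path, where failure means failure.
    Edge(orders, payments, medium="queue", durable=True, direct=True),
    Edge(payments, stock, medium="queue", durable=True, direct=True),
    Edge(stock, shipping, medium="queue", durable=True, direct=True),
    # Indirect edge: auxiliary fan-out to the CRM.
    Edge(orders, crm, medium="event-stream", durable=True, direct=False),
])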

Coordination

Workflows require coordination, whether that coordination is just a dance between independent services or something more centralized and controlled. There are two main coordination strategies: choreography and orchestration. 

Choreography: Event-driven workflow (reactive). Highly decoupled microservices that independently react to their input events as they arrive. There is no blocking or waiting, and all consumers operate independently of any upstream producers or subsequent downstream consumers.

Coordination via publish-subscribe semantics. The entire impact of an upstream event can spread far and is dynamic over time. The boundaries of any given workflow within that wider event flow can be fuzzy and hard to define (though coupling stays low).
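
As a minimal sketch of choreography, consider an in-memory publish-subscribe bus (standing in for Kafka, RabbitMQ, or similar; the event names and handlers are hypothetical):

from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    # Fan out to whoever subscribed; the producer neither knows nor cares.
    for handler in subscribers[event_type]:
        handler(payload)

# Each service reacts only to its input events, unaware of the wider flow.
subscribe("OrderPlaced",   lambda o: publish("PaymentTaken", o))
subscribe("PaymentTaken",  lambda o: publish("StockReserved", o))
subscribe("StockReserved", lambda o: print(f"prepare shipment for {o}"))

publish("OrderPlaced", {"order_id": 42})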

Orchestration: Procedural workflow (if-this-then-that). Logic is centralized in some kind of orchestrator (which may itself be a plain microservice or function) that issues commands to subordinate worker microservices and awaits their responses. The orchestrator tracks which parts of the workflow have completed, which are in progress, and which have yet to start, as well as the commands sent to each subordinate service and the responses received.

Coordination via procedural orchestration semantics. The entire impact of an upstream workflow can spread far, as individual nodes can still emit events. The boundaries of a given workflow are clearly encoded in the orchestration code, albeit at the cost of increased coupling.
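
By contrast, a minimal orchestration sketch centralizes the sequence and the bookkeeping in one place (the async stubs are hypothetical stand-ins for RPC or command/response exchanges with worker services):

import asyncio

async def take_payment(order):     return "payment-ok"    # stub worker
async def reserve_stock(order):    return "stock-ok"      # stub worker
async def prepare_shipment(order): return "shipment-ok"   # stub worker

async def place_order(order):
    completed = {}  # the orchestrator tracks progress explicitly
    for name, step in [("payment", take_payment),
                       ("stock", reserve_stock),
                       ("shipment", prepare_shipment)]:
        completed[name] = await step(order)  # command out, response back
    return completed

print(asyncio.run(place_order({"order_id": 42})))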

We’ll cover choreography and orchestration in more detail in part 2.

The Role of Stream Processing in the Graph

Stream processing frameworks like Apache Flink and Kafka Streams can be thought of as microservices with a configurable blend of a continuous dataflow programming model and reactive event handling, designed to transform and react to streams of events in real time. Like microservices, stream processors form logical graphs of computation, using branching, joining, and aggregation to process data. However, their programming model is more constrained, optimized for data-centric transformations of event streams rather than complex control flow or handling individual requests on demand.

In the context of workflows and sagas, stream processors fit naturally into event-driven choreography as nodes in the event graph, not only performing transformations or enrichments, but also taking on roles that overlap with those traditionally handled by microservices, including stateful business logic and triggering downstream effects.

Just as modern microservices are decomposed into bounded contexts following domain-driven design principles, so too should stream processors be scoped narrowly. Embedding an entire business workflow (e.g., shopping cart checkout, payment, shipping, fulfillment) into a single Flink or Kafka Streams job is generally discouraged. Instead, stream processors work best as individual nodes in a choreographed system, each independently reacting to events.

Beyond participating as choreographed actors, stream processors play two valuable roles in saga and workflow architectures:

  • Real-time triggers for workflows: Detecting event patterns (e.g., "user added to cart but didn’t check out in 1 hour") and emitting signals to start or branch workflows.

  • Aggregated state for decisions: Continuously computing derived state (e.g., fraud scores, user behavior patterns) that orchestrators or services can query to guide workflow logic.

In summary, stream processing can replace traditional microservices in choreographed workflows and enhance orchestrated workflows with real-time insights, triggers, and data transformations. However, one stream processing job would rarely include an entire workflow, just as a microservice would not.
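
To make the first of those roles concrete, here is the abandoned-cart trigger written as plain Python over an ordered event iterator; a real Flink or Kafka Streams job would express the same logic with keyed state and timers (the event shapes here are hypothetical):

from datetime import timedelta

TIMEOUT = timedelta(hours=1)

def abandoned_cart_trigger(events, emit):
    pending = {}  # user -> time the cart was last touched (keyed state)
    for event in events:  # assumed ordered by event["time"] in this sketch
        if event["type"] == "cart_item_added":
            pending[event["user"]] = event["time"]
        elif event["type"] == "checkout":
            pending.pop(event["user"], None)
        # Fire a signal for any cart that has gone stale (a timer, in Flink).
        for user, touched in list(pending.items()):
            if event["time"] - touched >= TIMEOUT:
                emit({"type": "cart_abandoned", "user": user})
                del pending[user]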

Durability as a First-Class Concern

In distributed systems, durability is not just about data but about progress. A workflow that performs critical operations must either complete or fail in a controlled, recoverable way. Durable coordination ensures that steps don’t vanish into the void after a crash or network fault. No matter the execution model (procedural, event-driven, or dataflow), durability is the mechanism that transforms ephemeral logic into reliable systems.

Choreography in the form of event-driven architectures (EDA) offers durability by default. Events are stored durably in a queue or log (e.g., Kafka), enabling reactive systems to recover from crashes, replay history, and trigger retries. Each service reacts independently, and progress is tracked implicitly in the event stream. In this model, the event log acts as both coordination medium and source of truth, encoding the causal structure of the system’s behavior.
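
The mechanics are worth spelling out: a consumer commits its position in the log only after the work succeeds, so a crash mid-processing means the event is redelivered and retried. A sketch using kafka-python (the topic, group, and handler are illustrative):

from kafka import KafkaConsumer

def handle_order(payload):
    ...  # hypothetical business logic; should be idempotent, since a
         # crash after handling but before the commit causes redelivery

consumer = KafkaConsumer(
    "orders",
    group_id="payment-service",
    enable_auto_commit=False,  # commit manually, only after the work is done
)

for record in consumer:
    handle_order(record.value)
    consumer.commit()  # progress is durably recorded in the log itself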

Imperative code, by contrast, lacks built-in durability. A service running procedural logic (e.g., "do A, then B, then C") typically keeps its state in memory and relies on external systems for persistence of selected state. When a crash occurs mid-execution, everything in the call stack is lost unless explicitly saved. This gap gave rise to the Durable Execution product category, which brings event-log-like durability to imperative workflows. Durable Execution Engines (such as Temporal, Restate, DBOS) persist the workflow’s progress (key variables, intermediate results, responses from other services, and so on), allowing it to be retried and to resume exactly where it left off. The pseudocode below sketches the shape of such a workflow:

@durable_workflow
async def refund_order(context, order_id):
    # Each completed call is journaled by the engine; on crash and retry,
    # the workflow resumes from its last recorded step.
    await context.call("cancel_shipping", order_id)
    try:
        await context.call("refund_payment", order_id)
    except Exception:
        # Compensate by escalating rather than leaving the refund half-done.
        await context.call("escalate_issue", order_id)

Durable Execution Engines are, in effect, the Kafka of imperative coordination (aka coordination via orchestration).

Durability is also foundational in stream processing. Frameworks like Apache Flink and Kafka Streams include native durability through state persistence mechanisms (such as checkpointing, changelogs, and recovery logs), ensuring that event transformations and stateful aggregations survive failures. While the paradigm is data-centric and continuous, the core concept is the same: progress is recorded durably so that computation can continue reliably.
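
A sketch of the changelog idea behind those mechanisms: every state change is appended to a durable log before it is applied, so in-memory state can always be rebuilt by replaying that log (the class is illustrative, not any framework’s actual API):

class ChangelogStore:
    def __init__(self, changelog):
        self.changelog = changelog  # durable, append-only; a topic in practice
        self.state = {}

    def put(self, key, value):
        self.changelog.append((key, value))  # record the change durably first
        self.state[key] = value

    def restore(self):
        # Recovery after a crash: replay the changelog to rebuild state.
        self.state = {}
        for key, value in self.changelog:
            self.state[key] = value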

Ultimately, everything is also a log. Whether it's a sequence of domain events, a durable workflow history, or a changelog backing a stream processor, the underlying idea is the same: encode system activity as a durable, append-only record of what has happened (and possibly what should happen next). 

Making durability a first-class concern allows systems to be:

  • Recoverable after crashes.

  • Observable through replayable history.

  • Reliable across asynchronous boundaries.

  • Composable across execution models.

This perspective is a start, but we need to break this down into more precise terms by creating a simple model for thinking about reliable execution and coordinated progress. Let’s do this in part 2.

Coordinated Progress series links:

  1. Seeing the System: The Graph

  2. Making Progress Reliable

  3. Coupling, Synchrony and Complexity

  4. A Loose Decision Framework