Processing Pipelines Series - Concepts

In this series we'll look at few different technologies we can use to process streams of data in processing pipelines and directed acyclic graphs (DAGs). First we'll look at real-time processing and computation running on a single machine using .NET Core based libraries such as TPL Dataflow and Reactive Extensions. Then we'll move onto distributed systems like Kafka Streams. We'll take a single scenario and build it out multiple times, each with a different technology.

The scenario we'll use is inspired by a recent project where a program had to read metrics in real-time from a very complex and large piece of machinery. It had to perform various computations and output multiple feeds of data across network sockets to other applications on the network. While this is just one scenario, I think it represents a few interesting challenges. Feel free to skip the scenario description, it just gives some interesting and fun background to our technical challenge.

The Scenario

Our organization has multiple astronomical observatories situated in remote mountainous areas. Each observatory has complex machinery that needs monitoring both locally and from a centralized cloud platform. Connectivity to the cloud is patchy so we don't publish data in real-time to the cloud but record data locally and upload data periodically.

From inside the facility, engineers can monitor the machinery in both real-time and see statistics published in time windows. The metrics published by the machinery are published as as a byte array which must be decoded in order to obtain meaningful data. While the real-time and statistics feeds are important, our first priority is to record the data to disk. A separate batch process uploads the binary files to the cloud periodically where the centralized platform can consume the data. The centralized platform allows the global infrastructure team to monitor all installations and perform analytics on the data such as predictive maintenance.

The machinery publishes messages at a rate of 100/second. Each message contains readings from multiple sensors such as temperature, pressure, vibration etc. Each message contains the current sensor values for around 100 sensors, meaning total metrics per second reaches 10000. We perform batch inserts of the decoded data to a local database periodically and generate statistics based on multiple short time windows. Both the real-time and the statistics feeds are published over network sockets.

So that is our scenario that we'll build repeatedly with different technologies.

Processing Pipelines - Important Concepts

When considering any processing pipeline we need to take into account certain risks and constraints.

Fast Producer, Slow Consumer

When your producer of data publishes faster than your pipeline can process it then you have a problem. How you deal with that depends on multiple factors:

Is the excess bursty or constant?
Do you have plenty of memory or are you in a memory constrained environment?
Do you have a "real-time" constraint on your output data, or can you accept some lag?
Does your output data have ordering constraints, or can we output loosely ordered data?
Can we drop data or must we guarantee it gets processed?

Buffers

Buffers between stages can help deal with bursty traffic loads without needing to either slow down the producer or drop data. However, if your producer is constantly producing data faster than you can process it you'll end up out of memory. Buffers can also fill up adding large latencies to the processing times. So buffers are best for handling either bursty incoming data or variable timing in your processing stages. For example, perhaps the network gets saturated sometimes slowing down a pipeline stage.

Buffers can either be unbounded or bounded. Unbounded means there is no logical limit on the size, the only limit is the memory available. Bounded buffers have a logical limit that cannot be exceeded.

Large buffers can complicate a safe shutdown of the pipeline as we may be forced to wait a long time for the data in all the buffers to be processed.

Back-Pressure and Load-Shedding

If you have no buffers, or your buffers are full then something has to be sacrificed. Either we need to slow down the producer (back-pressure) or we need to discard data (load-shedding).

Back-pressure is the concept of the producer getting backwards pressure from the pipe it shoves data down, forcing it to slow down until the pressure alleviates. Back pressure is a good option when data loss is unacceptable and the extra latency is not an issue.

Load-shedding means we discard data in order to keep up. This can be a great option for metrics that are published at high velocity. We might receive a metric at a 1000 updates per second, but only need 10 per second in order to meet our SLA. We can even prioritize our load-shedding intelligently so we only discard less valuable data.

Concurrency

Concurrency can help speed up our pipeline so we can keep up with our fast producer. There are diferent types of concurrency for different types of load.

If a stage needs to do IO, such as network or disk activity, then we can use asynchrony to make sure the CPU is able to work while we wait. We can use async/await or plain old Task based concurrency.

If a stage needs to do a lot of CPU intensive work then we can parallelize the work over multiple cores.

But concurrency adds it own problems:

Data ordering. Data that is processed in parallel/asynchronously can end up out of order.
Complexity. Concurrency can often lead to complexity and strange behaviours. Great care needs to be taken.

How we manage concurrency all depends on the underlying tech. With Kafka Streams we'll uses multiple topic partitions to parallelize the work while maintaining correctly ordered data. In TPL Dataflow and Reactive Extensions we'll make use of async/await and each library's specific concurrency model.

Graceful Shutdown

At some point the application will need to stop and do a graceful shutdown. How you do that depends on:

the technology
your concurrency model
whether you can drop data or not

Technologies like TPL Dataflow and Reactive Extensions (Rx) both have the concept of completing a source with the option to propagate that signal down the pipeline. We'll be looking at how to safely shutdown our pipelines in an orderly and controlled manner.

Error Handling

Errors can occur at any point in the pipeline. If an error occurs during a certain stage, do we:

Simply log it and continue?
Pass down an error to the next stage?
Do retries?

We'll look at the error handling specifics of each technology.

Technologies Covered in this Series

For sure we will cover:

TPL Dataflow
Reactive Extensions (Rx.NET)
Kafka Streams

We may cover:

Akka.NET. I have not used Akka.NET before but it has been on my todo list for years. So if I can I will see if an actor pattern could be a good fit.
ZeroMQ. I love ZeroMQ, though it is not overly used in the .NET world, but I am interested in seeing if I could build a reliable pipeline with it.

Scenario Constraints

We'll close off with a summary of the constraints of our scenario. We'll be using this list when building out the pipeline with each technology:

Must write each raw message to disk with an expected message rate of one message per 10 milliseconds.
Data must be written to disk in the correct order.
The real-time feed must preserve correct ordering though data can be discarded if the network or receiver cannot support the data velocity.
Best effort reliability, latency and ordering with statistics feeds and database persistance. Load shedding can be used but the producer cannot be slowed down.
Statistics to be published for 1 second and 30 second intervals (non-overlapping windows is acceptable).

We'll cover the TPL Dataflow library first, in the next part.

Posts in series: