Incremental Jobs and Data Quality Are On a Collision Course - Part 1 - The Problem

Big data isn’t dead; it’s just going incremental

If you keep an eye on the data ecosystem like I do, then you’ll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and their associated datasets), one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on its platform, and some surprising conclusions were drawn, one being that most queries run over quite small data. DuckDB’s conclusion was that big data was dead and that you could use a simpler query engine rather than a data warehouse. The reality is more nuanced than that, but the data does show that most queries run over smaller datasets.

Why?

On the one hand, many datasets are inherently small, corresponding to things like people, products, marketing campaigns, sales funnels, win/loss rates, and so on. On the other hand, there are inherently large datasets (such as clickstreams, logistics events, IoT and sensor data, etc.) that are increasingly being processed incrementally.

Why the trend towards incremental processing?

Incremental processing has a number of advantages:

  • It can be cheaper than recomputing the entire derived dataset from scratch, especially if the source data is very large (see the sketch after this list).

  • Smaller precomputed datasets can be queried more often without huge costs.

  • It can lower the time to insight. Rather than a batch job running on a schedule that balances cost against timeliness, an incremental job keeps the derived dataset up to date so that it is only minutes or a few hours behind the real world.

  • More and more software systems act on the output of analytics jobs. When the output was a report, once a day was enough. When the output feeds into other systems that take actions based on the data, these arbitrary delays caused by periodic batch jobs make less sense.
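
To make the first point concrete, below is a minimal sketch of the difference between a full recompute and an incremental job. It is illustrative only: the in-memory source partitions, the watermark file, and the aggregate file are hypothetical stand-ins for whatever storage and orchestration you actually use.

```python
import json
from pathlib import Path

# Hypothetical daily partitions of a large source table (date -> rows).
SOURCE_PARTITIONS = {
    "2024-06-01": [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 1}],
    "2024-06-02": [{"user": "a", "clicks": 2}],
    "2024-06-03": [{"user": "c", "clicks": 5}],
}

WATERMARK_FILE = Path("watermark.json")       # remembers what was already processed
AGGREGATE_FILE = Path("clicks_by_user.json")  # the derived dataset


def full_recompute() -> dict:
    """Rebuild the derived dataset by scanning every partition on every run."""
    totals: dict = {}
    for rows in SOURCE_PARTITIONS.values():
        for row in rows:
            totals[row["user"]] = totals.get(row["user"], 0) + row["clicks"]
    return totals


def incremental_update() -> dict:
    """Fold only the partitions that arrived since the last run into the
    existing derived dataset, then advance the watermark."""
    watermark = json.loads(WATERMARK_FILE.read_text()) if WATERMARK_FILE.exists() else ""
    totals = json.loads(AGGREGATE_FILE.read_text()) if AGGREGATE_FILE.exists() else {}

    for date in sorted(d for d in SOURCE_PARTITIONS if d > watermark):
        for row in SOURCE_PARTITIONS[date]:
            totals[row["user"]] = totals.get(row["user"], 0) + row["clicks"]
        watermark = date

    AGGREGATE_FILE.write_text(json.dumps(totals))
    WATERMARK_FILE.write_text(json.dumps(watermark))
    return totals


if __name__ == "__main__":
    print("full recompute:     ", full_recompute())
    print("incremental update: ", incremental_update())
```

The cost difference is the whole point: the full recompute scans every partition on every run, while the incremental job touches only what changed since the last run and can therefore run far more frequently for the same budget.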

Going incremental, while cheaper in many cases, doesn’t mean we’ll use less compute, though. The Jevons paradox is an economic concept in which technological advancements that make the use of a resource more efficient lead, paradoxically, to an increase in the overall consumption of that resource rather than a decrease. Greater efficiency leads people to believe that less of the resource will be used, but in reality the lower cost often stimulates enough new demand that total consumption rises.

Applying the Jevons paradox here, we can expect the trend towards incremental computation to lead to more computing resources being used in analytics, rather than fewer.

We can now:

  • Refresh dashboards more frequently.

  • Generate reports sooner.

  • Utilize analytical data in more user-facing applications.

  • Utilize analytical data to drive actions in other software systems.

As we make incremental, lower-latency analytics workloads more cost-efficient, demand for those workloads will undoubtedly increase as new use cases that weren’t previously economically viable open up. The rise of GenAI is another driver of demand (though it is definitely not making analytics cheaper!).

Many data systems and data platforms already support incremental computation:

  • Real-time OLAP:

    • ClickHouse, Apache Pinot, and Apache Druid all provide incrementally maintained precomputed tables.

  • Cloud DWH/lakehouse:

    • Snowflake materialized views.

    • Databricks DLT (Delta Live Tables).

    • dbt incremental models.

    • Apache Spark jobs.

    • Incremental ingestion jobs.

  • Stream processing:

    • Apache Flink.

    • Spark Structured Streaming (see the sketch after this list).

    • Materialize (a streaming database that maintains materialized views over streams).
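
To ground the list with one concrete example, here is a minimal sketch of incremental computation in Spark Structured Streaming. It uses only built-in pieces, the rate source and the console sink, so it needs nothing beyond a local Spark installation; a real job would read from something like Kafka or a table and write to a proper sink, but the shape is the same: the aggregate is maintained incrementally as new rows arrive rather than being recomputed over all history on a schedule.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-sketch").getOrCreate()

# The built-in "rate" source emits a timestamp and an incrementing value.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# An incrementally maintained aggregate: counts per one-minute window are
# updated as new rows arrive, not recomputed from all history.
counts = (
    events
    .withWatermark("timestamp", "2 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

# The "update" output mode emits only the rows whose aggregates changed.
query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start()
)

query.awaitTermination()  # runs until the stream is stopped
```

Flink jobs, Materialize views, and warehouse materialized views offer the same basic contract in their own ways: define the derived dataset once, and let the engine keep it up to date incrementally.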

I’ve been writing about the table formats regularly this year, and one of the topics that most interested me was change queries and incremental reads. While support varies across the table formats, they will converge eventually. Incremental computation is clearly important, and it will only become more so over time.

While the technology for incremental computation is already largely here, many organizations aren’t actually ready for the switch from periodic batch to incremental.

The collision course

Modern data engineering is emancipating ourselves from an uncontrolled flow of upstream changes that hinders our ability to deliver quality data. – Julien Le Dem

The collision:

Bad things happen when uncontrolled changes collide with incremental jobs that feed their output back into other software systems or pollute other derived data sets. Reacting to changes is a losing strategy. — Jack Vanlightly

Many, if not most, organizations are not equipped to realize this future where analytics data drives actions in other software systems and is exposed to users in user-facing applications. A world of incremental jobs raises the stakes on reliability, correctness, uptime (freshness), and the general trustworthiness of data pipelines. The problem is that today’s data pipelines are neither reliable enough nor cost-effective enough (in terms of the human cost of keeping them running) to meet this incremental computation trend.

We need to rethink the traditional data warehouse architecture where raw data is ingested from across an organization and landed in a set of staging tables to be cleaned up serially and made ready for analysis. As we well know, that leads to constant break-fix work as data sources regularly change, breaking the data pipelines that turn the raw data into valuable insights. That may have been tolerable when analytics was about strategic decision support (like BI), where a delay of a few hours or a day might not be a disaster. But in an age where analytics is becoming relevant to operational systems and powering more and more real-time or near-real-time workloads, it is clearly not a robust or effective approach.
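
A deliberately naive, hypothetical sketch of that failure mode (the field names and records are invented for illustration): an incremental job that trusts whatever shape the raw upstream data arrives in. With a periodic full rebuild, a broken run can be fixed and rerun from the source; with an incremental job, the corrupted increments are folded straight into the derived dataset and may already have been acted on downstream.

```python
# The derived dataset, updated incrementally as new raw records arrive.
revenue_by_country: dict = {}


def process_new_orders(new_orders: list) -> None:
    """Fold only the newly arrived raw records into the running totals."""
    for order in new_orders:
        country = order.get("country", "unknown")
        amount = order.get("amount", 0.0)
        revenue_by_country[country] = revenue_by_country.get(country, 0.0) + amount


# Day 1: the upstream system emits the schema the job was written against.
process_new_orders([
    {"country": "DE", "amount": 120.0},
    {"country": "FR", "amount": 80.0},
])

# Day 2: an upstream team renames fields ("country" -> "country_code",
# "amount" -> "amount_cents") without telling anyone. Nothing crashes; the
# job quietly folds garbage into a dataset that other systems act on.
process_new_orders([
    {"country_code": "DE", "amount_cents": 9900},
])

print(revenue_by_country)  # {'DE': 120.0, 'FR': 80.0, 'unknown': 0.0}
```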

The ingest-raw-data->stage->clean->transform approach has a huge amount of inertia and a lot of tooling behind it, but it is becoming less and less suitable as time passes. For analytics to be effective in a world of lower-latency incremental processing and more operational use cases, it has to change. The barrier to improving data pipeline reliability and enabling more business-critical workloads mostly comes down to how we organize teams and the data architectures we design. The technical aspects of the problem are well known, and long-established engineering principles exist to tackle them.

In part 2, I’ll describe how we can apply long-standing software engineering principles to the data analytics space to make the incremental computation trend more successful and more relevant to the enterprise as a whole.