Incremental Jobs and Data Quality Are On a Collision Course - Part 2 - The Way Forward

In part 1, I discussed how big data isn’t dead per se; it’s just going incremental. I also described the collision course between lower-latency analytics and data quality.

Bad things happen when uncontrolled changes collide with incremental jobs that feed their output back into other software systems or pollute other derived data sets. Reacting to changes is a losing strategy. — Jack Vanlightly

My argument was that the current data warehouse architecture of ingest-raw-data->stage->clean->transform is unsuitable when business expectations of reliability and timeliness are higher. The current approach leaves data teams widely exposed to changes in source data systems. Teams have the constant drain of reacting to change because they have little to no control over the timing or shape of those changes. When an unexpected change occurs, it can either cause a pipeline to break, interrupting the incremental flow of derived datasets, or worse, it can fail silently and spread bad data across a chain of incremental jobs.

The fact that we have vendors selling data canaries that warn when bad data arrives is a symptom of this situation. It’s all about reacting to changes fast and recovering fast to minimize the damage – not just damage to the pipelines but also the damage to credibility. Trust takes a long time to build but can be lost in a single data quality/accuracy incident. We’re just bandaging the symptom, not solving the problem.

So what should we do instead?

This is less of a technology problem and more of a structural problem. We can’t just add some missing features to data tooling; it’s about solving a people problem: how we organize, how team incentives line up, and how we apply well-established software engineering principles that have yet to be realized in the data analytics space.

What we keep overlooking is that the very foundations analytics is built on are not stable. The onus is on the data team to react quickly to changes in upstream applications and databases. This is clearly not going to work for analytics built on incremental jobs, where expectations of timeliness are more easily compromised. Even for batch workloads, the constant break-fix work is a drain on resources and leads end users to question the trustworthiness of reports and dashboards.

The current approach of reacting to changes in raw data has come about largely because of Conway’s Law: how the different reporting structures have isolated data teams from the operational estate of applications and services. Without incentives for software and data teams to cooperate, data teams have, for years and years, been breaking one of the cardinal rules for how software systems should communicate. Namely, they reach in to grab the private internal state of applications and services. In the world of software engineering, this is an anti-pattern of epic proportions! 

It’s all about “coupling”

I could make a software architect choke on their coffee if I told them my service was directly reading the database of another service owned by a different team.

Why is this such an anti-pattern? Why should it result in spilled coffee and dumbfounded shock? It’s all about coupling. This is a fundamental property of software systems that all software engineering organizations take heed of.

When services depend on the private internal workings of other services, even small changes in one service's internal state can propagate unpredictably, leading to failures in distant systems and services. This is the principle of coupling, and we want low coupling. Low coupling means that we can change individual parts of a system without those changes propagating far and wide. The more coupling you have in a system, the more coordination and work are required to keep all parts of the system working. This is the situation data teams still find themselves in today.

For this reason, software services expose public interfaces (such as a REST API, gRPC, GraphQL, a schematized queue, or a Kafka topic) that are carefully modeled, stable, and evolved with care to avoid breaking changes. A system with many breaking changes has high coupling. In a high-coupling world, every time I change my service, I force all dependent services to update as well. We then either have to perform costly coordination between teams to update services in lock-step, or we get a nasty surprise in production.

That is why in software engineering, we use contracts, and we have versioning schemes such as SemVer to govern contract changes. In fact, we have multiple ways of evolving public interfaces without propagating those changes further than they need to. It’s why services depend on contracts and not private internal state.
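To make non-breaking evolution concrete, here is a minimal sketch in Python that uses plain dictionaries to stand in for serialized records (the order fields, defaults, and version names are hypothetical and not tied to any particular schema registry). Adding an optional field with a default is the kind of change that doesn’t force consumers to update; removing or renaming a field would.

```python
# Minimal sketch of a backward-compatible contract change. Plain dictionaries
# stand in for serialized records; field names are hypothetical.

# v1 contract: the fields consumers rely on today.
ORDER_V1_DEFAULTS = {"order_id": None, "amount": None}

# v2 contract: a new optional field is added WITH a default, so v1 consumers
# keep working (they ignore it) and v2 consumers can still read v1 records
# (the default fills the gap).
ORDER_V2_DEFAULTS = {"order_id": None, "amount": None, "currency": "USD"}


def read_order(record: dict, defaults: dict) -> dict:
    """Read a record against a contract version: unknown fields are ignored,
    missing optional fields fall back to their documented defaults."""
    return {field: record.get(field, default) for field, default in defaults.items()}


# A v2 producer emits the new field; a v1 consumer is unaffected.
v2_record = {"order_id": "o-123", "amount": 42.0, "currency": "EUR"}
print(read_order(v2_record, ORDER_V1_DEFAULTS))   # {'order_id': 'o-123', 'amount': 42.0}

# A v2 consumer reading an old v1 record gets the documented default.
v1_record = {"order_id": "o-124", "amount": 10.0}
print(read_order(v1_record, ORDER_V2_DEFAULTS))   # includes 'currency': 'USD'
```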

Not only do teams build software that communicates via stable APIs, but they also collaborate to provide the APIs that other teams require. This need for APIs and collaboration has only grown over time. The average enterprise application or service used to be a bit of an island: it had its ten database tables and didn't really need much more. Increasingly, these applications draw on much richer sets of data and form much more complex webs of dependencies. Given this web of dependencies between applications and services, (1) the number of consumers of each API has risen, and (2) the chance of some API change breaking a downstream service has also risen massively.

Stable, versioned APIs between collaborating teams are the key.

Data products (seriously)

This is where data products come in. Like or loathe the term, it’s important.

Rather than a data pipeline sucking out the private state of an application, it should consume a data product. Data products are very similar to the REST APIs on the software side. They aren’t totally the same, but they share many of the same concerns:

  • Schemas. The shape of the data, both in terms of structure (the fields and their types) and the legal values (not null, credit card numbers with 16 digits, etc.).

  • Careful evolution of schemas to prevent changes from propagating (we want low coupling). Avoiding breaking changes as much as humanly possible.

  • Uptime, which for data products becomes “data freshness”. Is the data arriving on time? Is it late? Perhaps an SLO or even an SLA determines the data freshness goals (a minimal freshness check is sketched just after this list).
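As an illustration of the freshness point above, here is a minimal sketch assuming a hypothetical 15-minute SLO. It takes the latest event timestamp (the watermark) as a parameter; a real check would pull that watermark from the topic or table’s metadata.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLO: the newest record in the data product must be
# no more than 15 minutes old.
FRESHNESS_SLO = timedelta(minutes=15)


def freshness_ok(latest_event_time: datetime, slo: timedelta = FRESHNESS_SLO) -> bool:
    """Return True if the data product currently meets its freshness SLO."""
    lag = datetime.now(timezone.utc) - latest_event_time
    return lag <= slo


# A watermark that is 5 minutes old meets the SLO; one from yesterday does not.
now = datetime.now(timezone.utc)
print(freshness_ok(now - timedelta(minutes=5)))   # True
print(freshness_ok(now - timedelta(hours=24)))    # False
```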

Concretely, data products are consumed as governed data-sharing primitives such as Kafka topics for streaming data and Iceberg/Hudi tables for tabular data. While the public interface may be a topic or a table, the logic and infrastructure that produce it can vary. We REALLY don’t want to just emit events that mirror the private schema of the source database tables (due to the high coupling it causes). Just as REST APIs are not mirrors of the underlying database, the data product also requires some level of abstraction and internal transformation. Gunnar Morling wrote an excellent post on this topic, focused on CDC and how to avoid breaking encapsulation.
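As a sketch of what that abstraction layer might look like (the table columns and event fields below are invented for illustration, not a prescribed format), the mapping step is where internal bookkeeping columns are deliberately kept out of the public shape:

```python
# Sketch of the boundary between a private table schema and a public data
# product schema. Column and field names are hypothetical; the point is that
# internal columns never leak downstream.

def to_public_order_event(row: dict) -> dict:
    """Map a raw change-capture row from a private `orders` table into a
    published `OrderPlaced` event shape."""
    return {
        # Stable, documented public fields...
        "order_id": row["id"],
        "customer_id": row["cust_fk"],
        "total": row["amt_cents"] / 100.0,   # expose a unit the contract documents
        "placed_at": row["created_ts"],
        # ...while internal bookkeeping columns (row version, soft-delete
        # flags, denormalized caches) are deliberately not exposed.
    }


raw_row = {"id": "o-123", "cust_fk": "c-9", "amt_cents": 4200,
           "created_ts": "2024-05-01T12:00:00Z", "row_ver": 7, "is_deleted": 0}
print(to_public_order_event(raw_row))
```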

These data products should be capable of real-time or near-real-time delivery because downstream consumers may also be real-time or incremental. As incremental computation spreads, it becomes a web of incremental vertices with edges between them: a graph of incremental computation that spans the operational and analytical estates. While the vertices and edges are different from the web of software services, the underlying principles for building reliable and robust systems are the same – low-coupling architectures based on stable, evolvable contracts.

Because data flows across boundaries, data products should be based on open standards, just as software service contracts are built on HTTP and gRPC. They should come with tooling for schema evolution, access controls, encryption/data masking, data validation rules, etc. More than that, they should come with an expectation of stability and reliability – which comes about from mature engineering discipline and prioritizing these much-needed properties.
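For example, the validation rules mentioned above might look something like the following sketch, which reuses the earlier not-null and 16-digit examples and masks the card number at the boundary (the field names and masking policy are assumptions for illustration):

```python
import re

# Sketch of validation rules a data product can enforce at its boundary.
# Field names and the masking policy are illustrative assumptions.

def validate_and_mask_payment(record: dict) -> dict:
    """Reject records that violate the contract; mask sensitive fields."""
    if record.get("customer_id") is None:
        raise ValueError("customer_id must not be null")
    card = str(record.get("card_number", ""))
    if not re.fullmatch(r"\d{16}", card):
        raise ValueError("card_number must be exactly 16 digits")
    # Only the masked form ever leaves the data product.
    return {**record, "card_number": "**** **** **** " + card[-4:]}


print(validate_and_mask_payment({"customer_id": "c-9",
                                 "card_number": "4111111111111111"}))
```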

These data products are owned by the data producers rather than the data consumers (who have no power to govern application databases). A data team cannot own a data product whose source is another team’s application or database and expect it to be both sustainable and reliable. Again, I could make a software architect choke on their coffee by suggesting that my software team should build and maintain a REST API (that we desperately need) serving the data of another team’s database.

Consumers don’t manage the APIs of source data; it’s the job of the data owner, aka the data producer. This is a hard truth for data analytics but one that is unquestioned in software engineering.

The challenge ahead

What I am describing is Shift Left applied to data analytics. The idea of shifting left is acknowledging that data analytics can’t be a silo where we dump raw data, clean it up, and transform it into something useful. It has been done that way with multi-hop architectures for so long that it’s really hard to consider something else. But look at how software engineers build a web of software services that consume each other's data (in real-time) – software teams are doing things very differently.

It’s great that the data ecosystem has been adopting more and more engineering principles, but this last one is perhaps the most important, and also the most difficult, to take on. The most challenging aspect of Shift Left is that it changes roles and responsibilities that are now ingrained in the enterprise; this is simply how things have been done for a long time. That’s why I think Shift Left will be a gradual trend: it has to overcome this huge inertia.

Organizations that shift responsibility for data to the left will build data analytics pipelines that source their data from reliable, stable sources. Rather than sucking in raw data from across the enterprise and dealing with change as it happens, we can now build lower latency, incremental analytics workloads that are robust in the face of changing applications and databases. The challenges of data modeling haven’t gone away, but the foundations that our data modeling is based on will now be far more stable.

The role of these data-driven systems has expanded from reporting alone to include run-the-business applications. Delaying the delivery of a report by a few hours was tolerable, but in operational systems, hours of downtime can mean huge amounts of lost revenue, so the importance of building reliable (low-coupling) systems has only increased.

What is holding back analytics right now is that it isn’t reliable enough, it isn’t fast enough, and it suffers the constant drain of reacting to change (with no control over the timing or shape of those changes). Ultimately, it’s about solving a people problem and applying sound engineering practices to create robust, low-coupling data architectures that are fit for purpose for more business-critical workloads.

The trend of incremental computation is great, but it only raises the stakes.


Note: More needs to be written about the hands-on work of building data products, which I plan to focus on in the coming months. In the meantime, I can recommend:

  • My colleague Adam Bellemare has written a number of articles [1, 2, 3] and books [4] related to data products and engineering principles behind streaming.

  • Gunnar Morling has written two great blog posts [5, 6] on CDC, the outbox pattern, and avoiding breaking encapsulation, which are highly relevant to the topic of data products.
