The curse of Conway and the data space

This post was triggered by, and riffs on, the “Beware of silo specialisation” section of Bernd Wessely’s post Data Architecture: Lessons Learned. It brings together a few trends I am seeing, plus my own opinions after twenty years of experience working on both sides of the software / data team divide.

Conway’s Law:

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.
— Melvin Conway

This is playing out worldwide across hundreds of thousands of organizations, and nowhere is it more evident than in the split between software development and data analytics teams. These two groups usually have different reporting structures, right up to, or immediately below, the executive team.

This is a problem now and is only growing.

Jay Kreps remarked five years ago that organizations are becoming software.

It isn’t just that businesses use more software, but that, increasingly, a business is defined in software. That is, the core processes a business executes—from how it produces a product, to how it interacts with customers, to how it delivers services—are increasingly specified, monitored, and executed in software.
— Jay Kreps

The effectiveness of this software is directly tied to the organization’s success. If the software is dysfunctional, the organization is dysfunctional. The reverse also holds: dysfunction in the organizational structure plays out in the software. All this means that a company that wants to win in its category can end up executing poorly compared to its competitors and responding too slowly to market conditions. This kind of thing has been said umpteen times, but it is a fundamental truth.

When “software engineering” teams and the “data” teams operate in their own bubbles within their own reporting structures, a kind of tragic comedy ensues where the biggest loser is the business as a whole.

The winds of change are blowing

There are more and more signs that point to a change in attitudes to the current status quo of “us and them”, of software and data teams working at cross purposes or completely oblivious to each other’s needs, incentives, and contributions to the business's success. There are three key trends that have emerged over the last two years in the data analytics space that have the potential to make real improvements. Each is still quite nascent but gaining momentum:

  • Data engineering is a discipline of software engineering.

  • Data contracts and data products.

  • Shift Left.

After reading this article, I think you’ll agree that all three are tightly interwoven.

Data engineering is a discipline of software engineering

Data engineering has evolved as a separate discipline from that of software engineering for numerous reasons:

  • Data analytics / BI, where data engineering is practiced, has historically been a separate business function from software development. This has caused a cultural divergence where the two sides don’t listen to or learn from each other.

  • Data engineering solves a different set of problems from traditional software development and thus has different tools.

  • Data engineering has changed dramatically over the last 25 years. Many new problems arose that required rethinking the technologies from the ground up, which resulted in a long, chaotic period of experimentation and innovation.

The dust has largely settled, though technologies are still evolving. We’ve had time to consolidate and take stock of where we are. The data community is starting to realize that many of the current problems are not actually so different from the problems of the software development side. Data teams are writing software and interacting with software systems just as software engineers do.

The types of software can look different, but many of the practices from software engineering apply to data and analytics engineering as well:

  • Testing.

  • Good stable APIs.

  • Observability/monitoring.

  • Modularity and reuse.

  • An understanding that fixing bugs late in the development process is more costly than addressing them early on.

It’s time for data and analytics engineers to identify as software engineers and regularly apply the practices of the wider software engineering discipline to their own sub-discipline. 

Data Contracts and Data Products

Data contracts exploded onto the data scene in 2022/2023 as a response to the frustration of the constant break-fix work of broken pipelines and underperforming data teams. The idea went viral, and everyone was talking about data contracts, though concrete details of how one would implement them were scarce. But the objective was clear: fix the broken-pipelines problem.

Pipelines were breaking for many reasons:

  • Software engineers had no idea what data engineers were building on top of their application databases, and therefore provided no guarantees around table schema changes, nor any warning of impending changes that would break the pipelines.

  • Data engineers had been largely unable (due to organizational dysfunction or organizational isolation) to develop healthy peer relationships with the software teams they depend on. Or if relationships could be built, there wasn’t buy-in from software team leaders to help data teams get the data they needed beyond giving them database credentials. The result was to just reach in and grab the data at the source, breaking the age-old software engineering practice of encapsulation in the process (and suffering the results).

I recently listened to Super Data Science E825 with Chad Sanderson, a big proponent of data contracts. I loved how he defined the term:

My definition of data quality is a bit different from other people’s. In the software world, people think about quality as, it’s very deterministic. So I am writing a feature, I am building an application, I have a set of requirements for that application and if the software no longer meets those requirements that is known as a bug, it’s a quality issue. But in the data space you might have a producer of data that is emitting data or collecting data in some way, that makes a change which is totally sensible for their use case. As an example, maybe I have a column called timestamp that is being recorded in local time, but I decide to change that to UTC format. Totally fine, makes complete sense, probably exactly what you should do. But if there’s someone downstream of me that’s expecting local time, they’re going to experience a data quality issue. So my perspective is that data quality is actually a result of mismanaged expectations between the data producers and data consumers, and that is the function of the data contract. It’s to help these two sides actually collaborate better with each other.
— Chad Sanderson
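
To make Chad’s example concrete, here is a minimal sketch in plain Python. The FieldSpec structure and the breaking_changes check are invented for illustration (they are not from any particular data contract tool); the point is that once the expectation is written down, a perfectly sensible producer change shows up as a visible breaking change rather than a silent downstream quality issue.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass(frozen=True)
    class FieldSpec:
        name: str
        dtype: str
        timezone: Optional[str] = None  # the expectation is written down, not implied

    # Version 1 of the producer's contract: local timestamps.
    orders_v1 = [
        FieldSpec("order_id", "string"),
        FieldSpec("created_at", "timestamp", timezone="America/Los_Angeles"),
    ]

    # The producer's proposed change: same column, now in UTC.
    orders_v2 = [
        FieldSpec("order_id", "string"),
        FieldSpec("created_at", "timestamp", timezone="UTC"),
    ]

    def breaking_changes(old: List[FieldSpec], new: List[FieldSpec]) -> List[str]:
        """List changes that violate what downstream consumers were promised."""
        new_by_name = {f.name: f for f in new}
        issues = [f"column removed: {f.name}" for f in old if f.name not in new_by_name]
        for f in old:
            g = new_by_name.get(f.name)
            if g and (g.dtype, g.timezone) != (f.dtype, f.timezone):
                issues.append(f"{f.name}: {f.dtype}/{f.timezone} -> {g.dtype}/{g.timezone}")
        return issues

    print(breaking_changes(orders_v1, orders_v2))
    # ['created_at: timestamp/America/Los_Angeles -> timestamp/UTC']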

What constitutes a data contract is still somewhat open to interpretation and implementation in terms of concrete technologies and patterns. Schema management is a central theme, though it is only one part of the solution. A data contract is not only about specifying the shape of the data (its schema); it’s also about trust and dependability, and we can look to the REST API community to understand this point (a sketch of the data-side equivalent follows the list below):

  • REST APIs are regularly documented with OpenAPI, a specification format for describing REST APIs. An OpenAPI document is essentially the schema of the request and the response, as well as the security schemes.

  • REST APIs are versioned, and great care is taken to version them without making breaking changes. When breaking changes do occur, the API releases a new major version. The topic of API versioning is deep, with a long history of debate about which options are best. But the point is that the software engineering community has thought long and hard about how to evolve APIs.

  • A REST API that is constantly changing and releasing new major versions due to breaking changes is a poor API. Organizations that publish APIs for their customers must ensure that they create not only a well-modeled, well-specified API but also a stable one that does not change too frequently.
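
By analogy, a data contract can be written down just as explicitly as an OpenAPI document. Below is a minimal sketch expressed as a plain Python dictionary; the structure and field names (schema, guarantees, owner) are illustrative assumptions, not any particular data contract standard.

    # A hypothetical data contract for an "orders" data product, by analogy
    # with an OpenAPI document: the shape plus the guarantees consumers rely on.
    orders_contract = {
        "name": "orders",
        "version": "2.1.0",   # semantic versioning: breaking changes bump the major
        "owner": "checkout-team@example.com",
        "schema": {
            "order_id":   {"type": "string", "required": True},
            "created_at": {"type": "timestamp", "timezone": "UTC", "required": True},
            "amount":     {"type": "decimal(10,2)", "required": True},
        },
        "guarantees": {
            "freshness": "no more than 15 minutes behind the source",
            "uniqueness": ["order_id"],
            "deprecation_notice": "90 days before any breaking change",
        },
    }

As with a REST API, the value is less in the format and more in the commitment: a stable, versioned shape, plus guarantees that consumers can depend on.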

In software engineering, when Service A needs the data of Service B, what Service A absolutely doesn’t do is just access the private database of Service B. What happens instead is the following (see the sketch after this list):

  1. The engineering leaders/teams of the two services open a line of communication, likely a physical conversation to begin with.

  2. The team of Service B arranges a well-designed interface for Service A that doesn’t break the encapsulation of Service B. This may result in a REST API, or perhaps an event stream or queue that Service A can consume.

  3. The team of Service B commits to maintaining this API/stream/queue going forward. This involves the discipline of evolving it over time, providing a stable and predictable interface for Service A to use. Some of this maintenance can fall on a platform team whose responsibility is to provide building-block infrastructure for development teams to use.
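
To sketch the contrast, with a hypothetical endpoint, parameter, and response shape: the anti-pattern reaches into Service B’s private database, while the alternative consumes the versioned interface that Service B’s team has committed to maintain.

    # What Service A should not do: open a connection to Service B's Postgres
    # and run "SELECT * FROM orders_raw" directly. It works today and breaks
    # the day Service B refactors its internal schema.

    # What Service A does instead: consume the published, versioned interface.
    import requests  # the widely used HTTP client; assumed to be installed

    resp = requests.get(
        "https://service-b.internal/api/v2/orders",        # hypothetical v2 endpoint
        params={"updated_since": "2024-06-01T00:00:00Z"},  # hypothetical parameter
        timeout=10,
    )
    resp.raise_for_status()
    orders = resp.json()  # the shape is governed by Service B's published v2 contract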

Why does the team of Service B do this for the team of Service A? Is it out of altruism? No. They collaborate because it is valuable for the business for them to do so. A well-run organization is run with the mantra of #OneTeam, and the organization does what is necessary to operate efficiently and effectively. That means the team of Service B sometimes has to do work for the benefit of another team. It happens because incentives are aligned going up the management chain.

It is also well known in software engineering that fixing bugs late in the development cycle, or worse, in production, is significantly more expensive than addressing them early on. It is disruptive to the software process to go back to previous work from a week or a month before, and bugs in production can lead to all manner of ills. A little upfront work on producing well-modeled, stable APIs makes life easier for everyone. There is a saying for this: an ounce of prevention is worth a pound of cure.

These APIs are contracts. They are established by opening communication between software teams and implemented when it is clear that the ROI makes it worth it. It really comes down to that. It generally works like this inside a software engineering department due to the aligned incentives of software leadership.

Data products

The term API (Application Programming Interface) doesn’t quite fit “data”. Because the product is the data itself, rather than an interface over some business logic, the term “data product” fits better. The word product also implies that there is some kind of quality attached, some level of professionalism and dependability. That is why data contracts are intimately related to data products, with data products being a materialization of the more abstract data contract.

Data products are very similar to REST APIs on the software side. It comes down to opening up communication channels between teams, rigorously specifying the shape of the data (including the time zone from Chad’s example earlier), evolving it carefully as inevitable changes occur, and the commitment of the data producers to maintain stable data APIs for the consumers. The difference is that a data product will typically be a table or a stream (the data itself), rather than an HTTP REST API, which typically drives some logic or retrieves a single entity per call.

Another key insight is that just as APIs make services reusable in a predictable way, data products make data processing work more reusable. In the software world, once the Orders API has been released, all downstream services that need to interact with the orders sub-system do so via that API. There aren’t a handful of single-use interfaces set up for each downstream use case. Yet that is exactly what we often see in data engineering, with single-use pipelines and multiple copies of the source data for different use cases.

Simply put, software engineering promotes reusability in software through modularity (be it actual software modules or APIs). Data products do the same for data.
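
As a toy illustration of that reuse, with all names invented and an in-memory stand-in for a governed table or stream: two downstream use cases read the same published data product instead of each maintaining its own extract of the source database.

    # One published "orders" data product, maintained by the owning team...
    ORDERS_PRODUCT = [
        {"order_id": "o-1", "created_at": "2024-06-01T10:15:00Z", "amount": 42.00},
        {"order_id": "o-2", "created_at": "2024-06-02T09:05:00Z", "amount": 19.50},
    ]

    # ...reused by multiple consumers, rather than one pipeline per use case.
    def revenue_report(orders):
        # First consumer: finance reporting reads the shared product.
        return sum(o["amount"] for o in orders)

    def orders_per_day(orders):
        # Second consumer: another team reuses the same product for ML features.
        counts = {}
        for o in orders:
            day = o["created_at"][:10]
            counts[day] = counts.get(day, 0) + 1
        return counts

    print(revenue_report(ORDERS_PRODUCT))   # 61.5
    print(orders_per_day(ORDERS_PRODUCT))   # {'2024-06-01': 1, '2024-06-02': 1}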

Shift Left

Shift Left came out of the cybersecurity space. Security has historically been another silo, where software and security teams operate under different reporting structures, use different tools, have different incentives, and share little common vocabulary. The result has been a growing security crisis, one we have become so accustomed to that the next multi-million-record breach barely gets reported. We might not even consider it a crisis anymore, but when you look at the trail of destruction left by ransomware gangs, information stealers, and extortionists, it’s hard to argue that this should be business as usual.

The idea of Shift Left is to shift the security focus left, to where software is being developed, rather than applying it after the fact by a separate team with little knowledge of the software being developed, modified, and deployed. It is not only about integrating security earlier in the development process; it’s also about improving the quality of cyber telemetry. The heterogeneity and general “messiness” of cyber telemetry drive this movement of shifting processing, cleanup, and contextualization to the left, where the data is produced. Reasoning about this data becomes far more challenging once provenance is lost. While cyber data is unusually challenging, the lessons learned in this space generalize to other domains, such as data analytics.

The similarity of the silos of cybersecurity and data analytics is striking. Silos assume that the silo function can operate as a discrete unit, separated from other business functions. However, both cybersecurity and data analytics are cross-functional and must interact with many different areas of a business. Cross-functional teams can’t operate to the side, behind the scenes, or after the fact. Silos don’t work, and shift-left is about toppling the silos and replacing them with something less centralized and more embedded in the process of software development.

The data analytics silo is so ingrained that the current practices are not questioned. Bernd Wessely wrote a fantastic article on TowardsDataScience about the silo problem. It’s a member-only article (and well worth the Medium subscription), so I hope he doesn’t mind me quoting him here:

It made us believe that ingestion is the unavoidable first step of working with data, followed by transformation before the final step of data serving concludes the process. It almost seems like everyone accepted this pattern to represent what data engineering is all about.

The fact that we have to extract data from a source application and feed it into a data processing tool, a data or machine learning platform or business intelligence (BI) tools to do something meaningful with it, is actually only a workaround for inappropriate data management. A workaround necessary because of the completely inadequate way of dealing with data in the enterprise today.
— Bernd Wessely

The sad thing is that none of this is new. I’ve been reading articles about breaking silos all my career, and yet here we are in 2024, still talking about the need to break them! But break them we must!

If the data silo is the centralized monolith, separated from the rest of an organization’s software, then shifting left is about integrating the data infrastructure into where the software lives, is developed, and operated.

Service A didn’t just reach into the private internals of Service B; instead, an interface was created that allowed Service A to get data from Service B without violating encapsulation. This interface, an API, queue, or stream, became a stable method of data consumption that didn’t break every time Service B needed to change its hidden internals. The burden of providing that interface was placed on the team of Service B because it was the right solution, and there was also a business case for doing so. The same applies with Shift Left: instead of placing the ownership of making data available on the person who wants to use the data, you place that ownership upstream, where the data is produced and maintained.

At the center of this shift to the left is the data product. The data product, be it an event stream or an Iceberg table, is often best managed by the team that owns the underlying data. This way, we avoid the kludges, the rushed, jerry-rigged solutions that bypass good practices.

To make this a reality, we need the following:

  • Communication and alignment between the parties involved. It takes a level of business maturity to get there, but until we do, we’ll be talking about breaking the silos in ten or twenty years' time or until AI replaces us all.

  • Technological solutions to make it easier to produce, maintain, and support data products. 

We see a lot happening in this space, from catalogs and governance tooling to table formats such as Apache Iceberg and a wealth of event streaming options. There is a lot of open source here, but also a large number of vendors. The technologies and practices for building data products are still early in their evolution, but expect this space to develop rapidly.

Conclusions

You'd think that the majority of data platform engineering is solving tech problems at large scale. Unfortunately it's once again the people problem that's all-consuming.
— Birdy

Organizations are becoming software, and software is organized according to the communication structure of the business; ergo, if we want to fix the software/data/security silo problem, then the solution is in the communication structure.

The most effective way to make data analytics more impactful in the enterprise is to fix the Conway’s Law problem. That problem has led to a cultural and technological separation of data teams from the wider software engineering discipline, as well as weak communication structures and a lack of common understanding.

The result has been:

  1. Poor cooperation and coordination between the two sides, leading to:

    1. Kludgey integrations between the operational plane (the software services) and the data analytics plane.

    2. Constant break-fix work in the analytics plane in response to changes made in the operational plane.

  2. The wealth of practices that software engineers use to make software development less costly and more reliable being overlooked.

The barriers to achieving the vision of a more integrated software and data analytics world are the continued isolation of data teams and the misalignment of incentives that impedes cooperation between software and data teams. I believe that organizations that embrace #OneTeam and get these two sides talking, collaborating, and perhaps even merging to some extent will see the greatest ROI. Some organizations may already have done so, but it is by no means widespread.

Things are changing; attitudes are changing. The recognition that data engineering is software engineering, the rise of data contracts and data products, and the emergence of Shift Left are all leading indicators.


A final note. I’ve started a publication called Humans of the Data Sphere (HOTDS). It aims to bring together people and insights from across the data ecosystem: the transactional, operational side as well as the data analytics and AI side. My hope is that it can bring software engineers, database specialists, data/analytics engineers, and AI/ML engineers together to understand what is happening across the whole data sphere. I think some common understanding goes a long way.