ELT is a bridge between silos. A world without silos is a graph.
I’ve been banging my drum recently about the ills of Conway’s Law and the need for low-coupling data architectures. In my Curse of Conway and the Data Space blog post, I explored how Conway’s Law manifests in the disconnect between software development and data analytics teams. It is a structural issue stemming from siloed organizational designs, and it not only causes inefficiencies and poor collaboration but ultimately hinders business agility and effectiveness.
I generally advocate for breaking down silos between software development and data analytics teams by improving collaboration, aligning team incentives, and adopting engineering practices like data contracts and data products. In a nutshell, "Shift Left" applied to data analytics. I also wrote about how the rise of incremental processing (across all data platforms) only makes it more important to do this.
With all that in mind, let’s now look at ELT (and its cousins ETL and Reverse ETL), where the letters stand for “Extract”, “Load”, and “Transform” (ETL simply swaps the order of the Load and Transform steps).
The ELT mindset
In a world of silos, with software in one silo and data analytics in another, ELT makes a lot of sense.
We see Conway’s Law played out right there in that diagram. The E in ELT is also a symptom of the disconnect between software and data teams.
I looked at some dictionary definitions of Extract:
To pull or take out forcibly
To obtain by much effort from someone unwilling
If you extract something from a place, you take it out or pull it out.
This matches the practice of ELT/ETL quite nicely. The Extract phase of an ELT/ETL pipeline pulls data out of the source system (often without the knowing participation of the data owner).
The Extract phase is the result of software teams not knowing or caring what data teams are doing. Without alignment, data teams get database credentials and simply extract the raw data. Once it’s extracted, engineering principles start being applied, but by now, it’s too late.
It’s Conway’s Law in action.
The tools, the practices, all built around the premise that software and data teams don’t work closely together. That software teams won’t provide “data APIs” to data teams, even though they provide software APIs to fellow software teams.
But even that aside, if we look at the world today, it is much more complex than what ETL was originally designed for. It’s not just moving data from many relational databases to one data warehouse. Data doesn’t just live in the operational and the analytics estates; we now have SaaS representing a third data estate. Data flows across regions and clouds, from backend systems to SaaS and vice versa. There are probably 100x more applications now than there used to be. Organizations are becoming software, with ever more complex webs of relationships between software systems. ETL, ELT, and Reverse ETL are looking at this problem from a silo mindset, but modern data architectures need to think in graphs.
The Graph mindset
Now, let’s imagine that data teams were aligned with software teams, and everyone was invested in building low-coupling software and data architectures.
First of all, without the silos, I think we would stop seeing the world as two silos connected by ELT jobs where data is extracted then landed, and instead see the world as a graph of derived data sets where each node in the graph “consumes” from the nodes it is connected to.
By creating data products, we move from extraction to consumption.
I looked up some definitions of “consume”:
to utilize as a customer (consume goods and services)
to use fuel, energy, time, or a product
We also use the term to describe the usage of APIs and queues/topics. Software that uses an API is an API consumer. Software that reads from a queue is a queue consumer. Note that APIs, queues, and Kafka topics are all sharing primitives. They abstract the source data, providing a public version of private data.
One person's consumer is another’s producer, as applications tend to consume data, but also produce new data sets that others consume. This naturally forms a network or graph of data producers and consumers. The data of each node in this graph is inherently write-once, consume many times, as published datasets are reusable by many consumers.
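The producer/consumer graph described above can be sketched as a tiny dependency graph, where each derived data set is built by consuming its upstream nodes, and each published node is written once and reused by any number of downstream consumers (the node names and transform functions are invented for illustration):

```python
def build_dataset(graph, datasets, name):
    """Resolve a node by consuming its upstream nodes first.

    `graph` maps each derived data set to (upstream_names, transform);
    `datasets` holds the already-published (write-once) data sets.
    """
    if name in datasets:          # already published: reused, not re-extracted
        return datasets[name]
    upstream_names, transform = graph[name]
    upstream = [build_dataset(graph, datasets, u) for u in upstream_names]
    datasets[name] = transform(*upstream)   # publish once, consume many times
    return datasets[name]

# Hypothetical graph: raw orders -> cleaned orders -> daily revenue.
datasets = {"orders": [{"day": "mon", "total": 10},
                       {"day": "mon", "total": -1}]}
graph = {
    "orders_clean": (["orders"],
                     lambda o: [r for r in o if r["total"] >= 0]),
    "daily_revenue": (["orders_clean"],
                      lambda o: {r["day"]: r["total"] for r in o}),
}
```

Calling `build_dataset(graph, datasets, "daily_revenue")` resolves the chain and publishes each intermediate node, so a second consumer of `orders_clean` reuses the published result rather than extracting from the raw source again.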
The idea of data analytics as a silo is further weakened when we consider that the line between analytics and applications is blurring. Applications now have analytics embedded (analytical queries with operational SLAs). AI and machine learning are largely about automating decision-making and pushing this decision-making down to a more operational level. More and more, analytics as a silo seems contrived only by Conway’s Law and inertia.
In this new mindset of consuming data products, with the silos gone or weakened, we can embrace graph thinking, or the graph mindset.
The graph mindset is about:
More consumption.
Less extraction.
More data modeling over high-quality data sets.
Less landing raw data, less data cleaning, and less reacting to breaking changes.
More incremental processing, using incremental or streaming data sources.
Fewer batch processes that require a validation phase before further processing can be performed (but by no means an end to batch).
More utilization of analytics in operational use cases.
Less use of analytics purely as a tool for strategic decision support (BI).
More end-to-end data quality.
Less fixing broken pipelines after a data quality issue.
…and finally, more team collaboration and alignment, less working at cross-purposes or in a bubble.
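One way to picture “more incremental processing” is a consumer that tracks an offset into a published changelog and processes only what is new, rather than re-extracting and re-validating a full batch on every run (the changelog and offset handling here are invented for illustration):

```python
class IncrementalConsumer:
    """Consume a published changelog from a stored offset onward."""

    def __init__(self, process):
        self.offset = 0          # position of the next unread record
        self.process = process   # transformation applied to each record

    def poll(self, changelog):
        """Process only records appended since the last poll."""
        new_records = changelog[self.offset:]
        results = [self.process(r) for r in new_records]
        self.offset = len(changelog)
        return results
```

Each poll picks up where the last one stopped, which is the basic contract behind queue and topic consumers, and the reason a graph of derived data sets can stay continuously up to date without full reloads.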
Conclusions
Once you’ve seen that Conway’s Law exists, you see how it has manifested itself in myriad ways in the data analytics space: in team culture, in tooling, in practices, and in the pains that follow. The ELT mindset is plainly a repercussion of this disconnect between software and data teams.
Do we need to throw away all our ETL/ELT tools? I don’t think so, but we need to dismantle our current thinking! Let’s stop talking about ETL and ELT and start talking about consuming data sources and building reliable, derived data sets through good data modeling and data platform tooling. Let’s stop extracting data and switch to a consumption model where data teams spend their time on data modeling rather than firefighting. Data practitioners can then get busy building their portion of a data graph that spans the entire organization.
Reality check – we’re still in the aspirational phase of Shift Left. There is a long road ahead. There is still a lot of legacy infrastructure, and plenty of integrations still run over FTP! Most data teams are still isolated! It’s a messy world out there. But the organizations that embrace this graph mindset the most will be the organizations that get the most out of their data, the most out of AI, the most ROI, and will see the best business outcomes.
I’ll leave you with this wise quote:
“Consume”, there is no “Extract” – Data Yoda