This post contains a mix of technology insight and some Confluent strategy commentary.
Technology changes can be sudden (like generative AI) or slower juggernauts that kick off a chain reaction taking years to play out. We could place object storage, and its enablement of disaggregated architectures, in that latter category. The open table formats, such as Apache Iceberg, Delta Lake, and Apache Hudi, form part of this chain reaction, but things aren’t stopping there.
I’ve written extensively about the open table formats (OTFs). In my original Tableflow post, I wrote that shared tables were one of the major trends enabled by the OTFs. But why do OTFs make for a good sharing primitive? I have been focused mainly on the separation of compute and storage: OTFs allow for a headless architecture where different platforms can bring their own compute to the same data. This is all true.
But we can also view OTFs as enabling a kind of virtualization. In this post, I will start by explaining my take on OTFs and virtualization. Then I’ll bring it back to Confluent, the Confluent/Databricks partnership, and the future of composable data platforms.
Table Virtualization
Virtualization in software refers to the creation of an abstraction layer that separates logical resources from their physical implementation. The abstraction may allow one physical resource to appear as multiple logical resources, or multiple physical resources to appear as a single logical resource.
At every layer of computing infrastructure - from storage arrays that pool physical disks into flexible logical volumes, to network overlays that create programmable topologies independent of physical switches, to device virtualization that allows hardware components to be shared and standardized - virtualization provides a powerful separation of concerns. This abstraction layer lets resources be dynamically allocated, shared, and managed with greater efficiency. There are many types of hardware and software virtualization, and even data can be virtualized.
The term data virtualization has been around for at least two decades, mainly in the data integration space. In the data analytics world, data virtualization has been a compute abstraction over distributed data sources. Presto was the first open-source analytics project I am aware of to use the term. That type of data virtualization had limited impact and adoption because it was too high up the stack. You had to use Presto/Trino, which is just one hammer (and often a good one!) in a large toolbox.
The OTFs introduce a new abstraction layer that can be used to virtualize table storage. The key is that this layer allows for the separation of data from metadata, and of shared storage from compute. Through metadata, one table can appear in two data platforms without data copying. To avoid overloading the term data virtualization any further, I will use the term Table Virtualization.
Fig 1. Table virtualization, enabled by shared storage and open table formats
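To make that concrete, here is a minimal sketch using the open-source PyIceberg library. The catalog URI and the sales.orders table are hypothetical, and I’m assuming a shared REST catalog that both platforms can reach, but it shows the essential point: two engines resolve the same table through metadata alone, with no data copied.

```python
# A minimal sketch of table virtualization with PyIceberg.
# Assumptions: a REST catalog at the given URI and a table named
# "sales.orders" already exist (both are hypothetical, for illustration).
from pyiceberg.catalog import load_catalog

# "Platform A" resolves the table through the shared catalog...
catalog_a = load_catalog(
    "platform_a",
    **{"type": "rest", "uri": "https://catalog.example.com"},
)
orders_a = catalog_a.load_table("sales.orders")

# ...and so can "Platform B", pointing at the same metadata.
catalog_b = load_catalog(
    "platform_b",
    **{"type": "rest", "uri": "https://catalog.example.com"},
)
orders_b = catalog_b.load_table("sales.orders")

# Both see the same physical data files in object storage -
# the table is "virtualized" purely through metadata.
assert orders_a.metadata_location == orders_b.metadata_location
print(orders_a.scan(limit=10).to_arrow())
```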
Table virtualization manifests in several key ways:
Single source of truth: An OTF is a protocol, an agreed-upon method of organizing data into data files, metadata files, index files, and directories. This physical layout is the single source of truth - the data exists in only one location, and all logical representations interact with it. Changes made through any logical representation are reflected in all others (see the sketch after this list).
Neutral ground: Cloud Service Provider (CSP) object storage becomes the neutral ground where different vendors can interoperate, without either side taking on responsibility for complex storage management.
Metadata-driven collaboration: Catalogs form the top layer; they are pure metadata. Tables can be federated across multiple catalogs, exposed as the platform sees fit.
Ownership separated from presentation: One platform may ultimately control the cloud account where the data of a given table is stored, or it could even be a customer’s account where the data is stored. One platform may act as the primary catalog for a table. This can be abstracted from users, who only see tables.
Standardization: Collaboration occurs because of standardization. Both parties embrace a common open standard. Multiple compute/query engines can speak the same language.
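To see where the catalog ends and the physical data begins, here’s another small PyIceberg sketch (same hypothetical catalog and table as above). The catalog only hands back a pointer to the current metadata file; everything below that pointer is just files in object storage.

```python
# A sketch of the data/metadata separation in an Iceberg table.
# Assumes the hypothetical REST catalog and "sales.orders" table from before.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "platform_a",
    **{"type": "rest", "uri": "https://catalog.example.com"},
)
table = catalog.load_table("sales.orders")

# The catalog's contribution is just a pointer to the current metadata file.
print(table.metadata_location)  # e.g. a path to a .metadata.json file in object storage

# The metadata tree, in turn, tracks snapshots and the physical data files.
snapshot = table.current_snapshot()
print(snapshot.snapshot_id, snapshot.timestamp_ms)

for task in table.scan().plan_files():
    print(task.file.file_path)  # the actual data files in object storage
```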
While attempts to virtualize data within compute engines never gained widespread adoption, data virtualization has found its natural home at the storage layer, where it can serve the diverse ecosystem of query engines and data platforms.
Thinking about OTFs in terms of table virtualization can be helpful in the context of collaboration between data platforms. Being able to surface the same table in two platforms, as if it were native in both, is powerful. With that capability, there is no need to move tabular data between platforms; each platform only needs to expose the key tables that feed the other.
Tables are not typically data-sharing primitives. A table is usually a long-term store of data, serving a specific purpose for a single system. Table storage platforms face the opposite dynamic to streaming platforms - they accumulate valuable data assets, which become more valuable in combination with other data assets, creating strong gravitational pulls. This data gravity easily leads to resistance to sharing. It’s easy to raise the walls and focus on getting data in, rather than making it easy to share.
But customers often don’t want one data platform. A large portion of Snowflake and Databricks customers actually run workloads on both platforms. At the same time, customers also don’t want a swarm of platforms. The Modern Data Stack (MDS) failed to sustain itself. Few wanted to compose a data architecture from 10-15 different vendors. People want to choose a small number of trusted vendors, and they want them to work together without a lot of toil and headaches.
Table virtualization gives us a bridge across the natural tension between data gravity and the need for ecosystem collaboration. It elegantly reduces this conflict by allowing the physical data to remain within a gravity well while enabling multiple logical access points (as virtualized tables) in other platforms. This approach works within the constraint that many platforms want to maintain control over their valuable data assets, while still enabling the collaborative ecosystem that modern data architectures require. In effect, table virtualization transforms "gravitational" data assets into a composable resource, where historically, data sharing might otherwise have been hard or impractical.
Data gravity is real, and pretending it doesn’t exist is like trying to swim against the current. We need data gravity platforms to interoperate, and table virtualization looks like the ideal solution.
Ownership and access
One table can be exposed in two data platforms, but typically, one platform owns the data. The owner platform considers this a native table, and the other platform may choose to consider it an “external” table. It doesn’t have to be this way, but this may be the common case.
This brings us to the question of ownership and access:
Can both platforms write to the same table?
Who maintains the catalog? Iceberg requires a catalog for consistent reads and writes. We can’t have two primary catalogs, or else the table will become inconsistent.
How do vendors handle support issues for shared tables?
Sharing is two-way.
Out -> An owner platform exposes table metadata for other platforms.
In -> A platform allows users to add “external” tables from other platforms.
Databricks allows for “out” sharing via its Delta Sharing protocol. Snowflake allows for “in” sharing via external Iceberg tables.
Fig 2. It takes two platforms to tango.
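As a concrete illustration of the “out” direction, Delta Sharing has an open Python connector. The sketch below assumes a hypothetical profile file and share/schema/table names issued by the provider; the recipient reads the shared table while the physical data stays put in the provider’s storage.

```python
# A minimal sketch of "out" sharing via the Delta Sharing protocol,
# using the open delta-sharing Python connector.
# The profile file and share/schema/table names are hypothetical -
# a real provider would issue these to the recipient.
import delta_sharing

profile = "config/provider.share"  # endpoint + credentials, issued by the provider
table_url = f"{profile}#retail_share.sales.orders"

# List everything the provider has chosen to expose.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Read the shared table into a pandas DataFrame - the physical data
# stays in the provider's storage; only metadata and file access flow out.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```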
We are seeing a common pattern regarding ownership and access among the big players. BigQuery and Snowflake, so far, operate with the rule that external tables are read-only, and native tables are exposed to other platforms as read-only. In other words, if you want to write to a Snowflake table, you do it directly via Snowflake APIs, and Snowflake won’t write to another platform’s table. Not all vendors need to choose this approach, but it is understandable why BigQuery and Snowflake would operate this way. For one, letting other platforms write to your own OTF tables, which you have a support team behind, could be risky and costly. Likewise, writing to an external table, without any control over how that table is run, could also be risky. Does the other platform even expose a catalog for external writers? Platforms that operate at scale want stability and control.
Also, don’t forget data gravity. Supporting outward and inward writes dilutes data gravity, which is a real concern to tabular data platforms.
This read-only limitation is not actually so bad. We don’t usually need two platforms to write to the same table. Instead, each platform can maintain derived data sets based on the data the other platform shares. The virtualized table becomes a sharing primitive for one platform to share its insights with another.
This way, we can have bidirectional flows, or even graphs, of data enrichment and derived data sets between platforms.
Fig 3. Native and external tables, still a powerful model for collaboration.
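Here’s a rough sketch of that derived-data-set pattern using PyIceberg and PyArrow: read another platform’s virtualized table (read-only), derive an aggregate from it, and land the result in a native table that this platform owns. The catalog endpoints, table names, and columns are all hypothetical.

```python
# A sketch of the "derived data set" pattern: read a shared (external)
# table read-only, compute something, and land the result in a native
# table owned by this platform. All names and columns are hypothetical.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

external_catalog = load_catalog(
    "partner", **{"type": "rest", "uri": "https://partner-catalog.example.com"}
)
native_catalog = load_catalog(
    "ours", **{"type": "rest", "uri": "https://our-catalog.example.com"}
)

# Read-only access to the partner's virtualized table.
orders = external_catalog.load_table("sales.orders").scan().to_arrow()

# Derive something from it (here: order counts per country).
counts = orders.group_by("country").aggregate([("order_id", "count")])

# Write the derived data into a table we own and can share back.
# (Recent PyIceberg versions accept a PyArrow schema here.)
derived = native_catalog.create_table("insights.orders_by_country", schema=counts.schema)
derived.append(counts)
```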
Plugging in the operational streaming world
So far, I’ve been describing how analytics platforms can be composed to make it easier for customers on multiple platforms (such as Snowflake and BigQuery, or Snowflake and Databricks) to compose a data architecture. But this only encompasses the analytical estate. What about the operational estate, where the data-sharing primitive is the event stream? How can we compose the data assets of the operational and analytical estates?
This is what Tableflow has been developed for: the bidirectional materialization of streams as (virtual) tables and (virtual) tables as streams.
Fig 4. Stream/table materialization working in combination with table virtualization.
This stream-to-table materialization uses table virtualization as the sharing mechanism between the streaming and analytics platforms. It gives us some elementary building blocks for better data composability across an entire organization.
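To be clear, Tableflow is a managed capability and the snippet below is emphatically not how it works internally. It’s just a hand-rolled sketch of the stream-to-table idea, using the confluent-kafka client and PyIceberg, with hypothetical broker, topic, and table names, and none of the schema management, batching, or exactly-once machinery a real pipeline needs.

```python
# A hand-rolled sketch of stream-to-table materialization: consume JSON
# events from a Kafka topic and append them to an Iceberg table.
# This is NOT how Tableflow works internally - names are hypothetical,
# and real pipelines need schemas, error handling, and exactly-once care.
import json
import pyarrow as pa
from confluent_kafka import Consumer
from pyiceberg.catalog import load_catalog

consumer = Consumer({
    "bootstrap.servers": "broker.example.com:9092",
    "group.id": "orders-materializer",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["orders"])

catalog = load_catalog("analytics", **{"type": "rest", "uri": "https://catalog.example.com"})
table = catalog.load_table("sales.orders")  # assumed to exist with a matching schema

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    batch.append(json.loads(msg.value()))
    if len(batch) >= 1000:
        # Materialize the micro-batch of events as rows in the Iceberg table.
        table.append(pa.Table.from_pylist(batch))
        batch.clear()
        consumer.commit()
```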
Taking another step closer to the graph mindset
Recently, I advocated for shifting our mindset away from seeing the world as two silos connected by ELT jobs, where data is extracted and then landed, and towards seeing the world as a graph of derived data sets.
Fig 5. From the ELT mindset (extract from left side to deposit on right side), towards the graph mindset (a graph of derived data sets).
In this model, we primarily focus on how to compose data. It becomes less about how we extract data and move data and more about how we want to compose different types of data, across different platforms and different forms (such as streams and tables).
This is an attractive idea. If we can pull it off, we can open up data like never before. There are a few barriers to achieving this goal, some rooted in Conway’s law, and some technological. What I am describing in this post are the technological means of achieving this vision.
Bringing it back to Confluent strategy
As I warned at the beginning of the post, I do want to talk about Confluent.
Last week the CEOs of Confluent and Databricks announced their strategic partnership, centered around building a bidirectional data flow between the two platforms. First, Confluent topics will appear directly as Delta tables in Databricks, using a tight integration between Confluent Tableflow and Databricks Unity Catalog. Later, Delta tables will be exposed directly in Confluent as Kafka topics, likely as change streams of mutable tables.
Confluent and Databricks will provide a unified product experience while remaining separate vendors, each focusing on their core strengths. It’s actually quite elegant; there are few examples of two platforms so well suited to collaborating in such a cohesive and coherent manner.
What makes this possible is precisely what this blog post has been explaining:
Stream-to-table materialization
Table virtualization
Fig 6. Planned bidirectional flow between Confluent and Databricks. Kafka topics exposed as Delta tables and vice versa.
With Confluent’s Tableflow and Databricks Unity Catalog, the analytics plane simply sees tabular data (as Delta tables) from the operational plane appearing in Unity Catalog. The operational plane, in turn, will see Kafka topics of analytics-derived data appearing in the Stream Catalog.
Confluent and Databricks are two great examples of platforms that are complementary and should be composable. This partnership enables that composability, with each platform focusing on its core strengths and values, benefiting joint customers who just want the two to work together coherently with as little BS work as possible to make it happen.
Stream-to-table materialization and table virtualization represent a fundamental shift in how we think about data integration and interoperability. Streams and virtualized tables are the two data-sharing primitives that make composable data architectures possible. Streams do what they've always done best - connecting systems of record with real-time events. But virtualized tables are the missing piece, turning tabular data that wants to stay put in gravity wells into a composable resource that can be shared between platforms. Together, they give us the building blocks to start fitting data platforms together like Lego bricks, rather than being glued together with ETL.