Forget the table format war; it’s open vs closed that counts

Apache Iceberg is a hot topic right now, and looks to be the future standard for representing tables in object storage. Hive tables are overdue for a replacement. People talk about table format wars: Apache Iceberg vs Delta Lake vs Apache Hudi and so on, but the “war” at the forefront of my mind isn’t which table format will become dominant, but the battle between open vs closed – open table formats vs walled gardens.

The two walled gardens I am predominantly thinking about are the two most prominent platforms in the data analytics world: Snowflake and Databricks, though we can also include their equivalents in the hyperscalers.

These platforms have traditionally been closed platforms:

  • Snowflake gave you a SQL interface. Other tools couldn’t query your data directly; all access went through the SQL layer.

  • Databricks, though it coined the lakehouse term, was also closed to the outside world; only its compute could query your data.

If we go back a while, we can see that “lakehouse” originally meant a data warehouse experience built on top of object storage using an open table format.

“A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse directly on top of low cost cloud storage in open formats.” – What is a lakehouse? (Databricks blog in 2020)

The technology underpinning the cloud data warehouse (e.g. Snowflake) and the lakehouse are not so different from one another. Both leverage a compute layer to query columnar data files in object storage. The compute layer differs somewhat (there are so many query engines these days), but the storage is much the same. A critical difference between Snowflake and Databricks at that time was that Databricks gave their storage layer a name and open-sourced it as Delta Lake. Databricks made the separation of compute and storage explicit in their external messaging to the world. The message was that they were built on open-source data storage.

But that didn’t make the platform open in the sense of it being accessible. You couldn’t use your own query engine, only Databricks compute. It made sense: they didn’t make money on storage, they made money on compute. This is changing now with their serverless offerings, but in the beginning, they were the original bring-your-own-cloud (BYOC) vendor, and only Amazon made money from the data stored on S3.

Snowflake never lifted the covers on its own storage layer, except to describe the platform’s high-level design in an industry paper. Imagine if Snowflake had named its storage subsystem and open-sourced it. How different would Snowflake and Databricks really have been? Both would have been competing against each other, each platform built on an open-source storage layer, with only its own compute layer able to query it.

What is a lakehouse then?

Ok, so is a lakehouse really just about a data warehouse experience using open table formats as the storage layer? It has to be more than that.

For my part, I had internalized the lakehouse as being an open system of tables. Open not just because the storage layer is open-source, but because it is literally open to all and any query engines! I wrote this point in my Tableflow blog post earlier this year:

“With storage and compute disaggregated, each organization gets to choose the query engines/BI tools that suit their needs, putting more power into the hands of the organization to choose where they place their data and how they access it.” – Tableflow: the stream/table, Kafka/Iceberg duality

This is the promise of the open table formats after all – a common standard for interoperability between compute engines. What is the benefit of a standardized open-source storage layer if no one but the vendor can access it?

Ideally, a lakehouse needs both an open-source storage layer (such as Iceberg/Delta) and open access. It is about a shared storage layer that democratizes access, giving organizations a degree of freedom they didn’t have before.

Do you see where I’m going now? Are we being distracted by the table format war when the real question is how “open” do we want our data to be?

Open vs closed

Since its initial offering, Databricks has added external tables (tables not managed by Databricks) and Delta Sharing (a read-only interface that lets external compute engines read managed Delta Lake tables). This has improved the openness of the platform, but I still can’t read and write Delta tables using an external compute engine.

Snowflake is adopting Apache Iceberg in a big way and opening up. They have open-sourced the Apache Polaris catalog and offer it as a managed service. Not only that, but Snowflake is creating connectors for external compute engines such as Trino and Spark to be able to query Snowflake-managed tables. 
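As a sketch of what this interoperability looks like in practice, here is roughly how an external Spark cluster can be pointed at an Iceberg REST catalog such as Polaris. These are standard Iceberg Spark catalog properties; the endpoint, credential, and catalog name are placeholders, and the exact values depend on your Polaris deployment.

```
# spark-defaults.conf — illustrative values only
# Register an Iceberg catalog named "polaris" backed by a REST catalog endpoint
spark.sql.catalog.polaris               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.polaris.type          rest
spark.sql.catalog.polaris.uri           https://<account>.example.com/api/catalog
spark.sql.catalog.polaris.credential    <client-id>:<client-secret>
spark.sql.catalog.polaris.warehouse     <catalog-name>
# Enable Iceberg's SQL extensions (e.g. MERGE INTO, CALL procedures)
spark.sql.extensions                    org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```

With that in place, Spark can address the catalog's tables directly, e.g. `SELECT * FROM polaris.db.events` — no Snowflake compute in the read path. That is the openness the rest of this post is arguing about.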

As with Databricks, Snowflake’s managed Iceberg tables are read-only for external compute engines (even though Iceberg itself supports concurrent writers from different engines). But it’s a start.

Is Iceberg the driver for this opening up, or is it the epic battle between the two platforms? My feeling is that it’s the latter, and Iceberg arrived at the right time for Snowflake to make a move into the lakehouse arena. But Iceberg is gaining momentum by itself, and demand for Iceberg will begin to dictate the product strategy of these platforms.

As the table formats mature and become more stable with more features, organizations will demand more openness from these data platforms. How far this openness goes will likely be dictated by competing forces of the market (competitor offerings and customer expectations) vs the vendor's own product strategy and the value it provides the customer. 

There will likely be many data lakehouse vendors offering a fully open platform where they manage the underlying tables for you, and provide some form of governance. These platforms will add pressure onto the semi-closed platforms to open up further – openness will be a differentiator for some.

The future

I don’t expect either Snowflake or Databricks to ever open up fully. I expect they will always guide customers onto their closed platform pieces, but at least customers are being given a choice. There are, after all, some operational benefits to a closed system; with greater control comes some amount of improved reliability and simplicity. Snowflake is a great example: it’s very much the Apple iPhone of data platforms, closed but simple and polished. People love their iPhones.

In a future world where the lakehouse is a commodity, the lakehouse itself will become a feature (not a product) of a larger data platform. The value that providers such as Snowflake and Databricks bring is not limited to tables, so I believe the semi-closed data platform is here to stay. It’s a nice ideal to think that Iceberg et al will result in totally democratized access to data, but the big data platforms guard their data gravity well; open table formats with open access threaten to steal some of that gravity away. The tension will be on how much value the data platforms can provide to compensate for being less open than fully open lakehouse vendors.

What do you think? Will open table formats bring a new era of data freedom, or will we fall into a balance somewhere between open and closed?