Table format interoperability: future or fantasy?

In the world of open table formats (Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon, etc.), an emerging trend is to provide interoperability between table formats by cross-publishing metadata. This allows a table to be written in table format X but read in format Y or Z.

Cross-publishing is the idea of a table having:

  • A primary table format that you write to.

  • Equivalent metadata files of one or more secondary formats that allow the table to be read as if it were of that secondary format. 

This allows you to write the Parquet files only once (the most costly bit) and reuse them for the secondary formats. Writing the metadata once per supported format is comparatively cheap, and that metadata is typically published asynchronously. This makes one table appear as any table format you want. Write in one format, read as another - table format interoperability. Apache XTable (incubating) and Delta Lake UniForm are two projects that help you perform this cross-publishing.
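As a concrete illustration, here is a minimal PySpark sketch of cross-publishing with Delta Lake UniForm: Delta is the primary format, and Iceberg metadata is generated alongside the Delta log so Iceberg readers can consume the same Parquet files. The table name and schema are illustrative, and it assumes a Spark setup with the Delta Lake jars available.

```python
# A minimal sketch of cross-publishing via Delta Lake UniForm.
# Assumes the Delta Lake jars are on the classpath; the table is illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("uniform-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE events (id BIGINT, ts TIMESTAMP, payload STRING)
    USING DELTA
    TBLPROPERTIES (
      'delta.columnMapping.mode' = 'name',
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")

# Writes go through Delta as usual; the equivalent Iceberg metadata is
# published asynchronously after each commit - the data files are not rewritten.
spark.sql("INSERT INTO events VALUES (1, current_timestamp(), 'hello')")
```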

But cross-publishing across table formats doesn't come without cost. You choose one primary format (e.g., Iceberg) and write the data once, then choose secondary formats (e.g., Delta or Hudi) for which only the metadata is written. But here's the kicker: how many awesome features do you lose in the process?

  • Iceberg has hidden partitioning; Delta and Hudi don't. If Iceberg is the primary, but you need to support Delta, oops, sorry, you can’t use hidden partitioning (see the sketch after this list).

  • Delta has liquid clustering; can Iceberg, as a secondary, benefit from that, or even be compatible with it?

  • Merge-on-read. No support for this yet because, oops, everyone did it a little differently. We’d have to cross-publish the small data files and the delete/DV files too. Oops, now compaction needs to run for the secondary formats - this is getting expensive.

  • Iceberg has equality deletes; oops, none of the others do.

  • Hudi and Delta both materialize CDC files, but in different formats.

  • Paimon has partial updates and also uses LSM trees… Not possible to just cross-publish.

  • …and so on
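To make the hidden-partitioning point concrete, here is a minimal PySpark sketch of an Iceberg table partitioned by a transform. The catalog name, warehouse path, and schema are illustrative assumptions, and it assumes the Iceberg Spark runtime is on the classpath.

```python
# A sketch of Iceberg hidden partitioning - the kind of feature with no
# direct Delta or Hudi equivalent to cross-publish into.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hidden-partitioning-sketch")
    .config("spark.sql.catalog.ice", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice.type", "hadoop")
    .config("spark.sql.catalog.ice.warehouse", "/tmp/ice-warehouse")
    .getOrCreate()
)

# The partition is derived from ts via a transform; there is no physical
# ts_day column that a secondary format could fall back on.
spark.sql("""
    CREATE TABLE ice.db.events (id BIGINT, ts TIMESTAMP, payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Readers filter on the source column and Iceberg prunes the day partitions
# itself - exactly the behavior that is lost if Delta must read this table.
spark.sql("""
    SELECT count(*) FROM ice.db.events
    WHERE ts >= TIMESTAMP '2024-01-01 00:00:00'
""").show()
```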

Don’t we end up with a least-common-denominator world? Or a poor imitation for the unlucky secondary format? Or worse, do we end up rewriting files and running redundant housekeeping jobs at the data layer? It all seems complex and costly.

Cross-publishing seems to promise seamless interoperability between table formats, but the truth is that the features of the table formats vary, and you can't support them all while writing the data once.

What are the alternatives? Here are some thoughts.

Interoperability via multi-format support in the compute layer

If all the compute engines support all table formats, why bother cross-publishing at all? You have an Iceberg table or a Hudi table, so what? If BigQuery speaks Iceberg, Delta, and Hudi natively, there is nothing to cross-publish: a table is written as Iceberg and consumed by the compute/query engine as Iceberg, perhaps even joined with a Delta table. Let the compute engine utilize all the features of each table format, rather than a least common denominator or the second-class imitation a secondary format gets.
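As a sketch of what this looks like in practice, here is a single Spark session configured to speak both Delta and Iceberg natively, so a join across formats needs no cross-publishing at all. The catalog names, paths, and tables are illustrative assumptions.

```python
# A minimal sketch of multi-format support in the compute layer:
# one engine, two table formats, each read natively.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multi-format-sketch")
    # Delta support via its session extension and catalog implementation,
    # Iceberg support via its own extension and a separate named catalog.
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension,"
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.ice", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice.type", "hadoop")
    .config("spark.sql.catalog.ice.warehouse", "/tmp/ice-warehouse")
    .getOrCreate()
)

# Each table keeps its native format and its native features intact.
joined = spark.sql("""
    SELECT o.id, o.total, c.segment
    FROM ice.db.orders o              -- an Iceberg table
    JOIN delta.`/tmp/customers` c     -- a Delta table, read by path
      ON o.customer_id = c.id
""")
joined.show()
```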

Consolidation - the market decides

The second alternative will be dictated by the market. In the end, we will likely see some consolidation, and table formats will end up like file formats: predominantly Parquet (for now), with some ORC on the side. Continued fragmentation with interoperability seems like a poorer way forward, not to mention the cost of these fat libraries that must be reimplemented in each language. With a consolidated table format ecosystem, the fierce competition is left to the compute/query engines. The community as a whole can then evolve the shared table storage standards, ensuring interoperability between data platforms rather than between table formats.

I don’t have a favorite, honestly. I’ve spent the last four months getting to know each format, and each has some really nice aspects and features. However, I also don’t think we need four or more formats. The capabilities will converge, leaving perhaps two formats that take 99% of the market.

I don’t want to pick the winners, but here’s what I think will influence the outcome:

  • Technology: How each format evolves. In this fierce table format battle, having the right features is important. But the tech alone isn’t everything.

  • Compute-engine support: Some of the differences in (claimed) performance come down to how well each compute engine can leverage the features of each table format. There is still so much low-hanging fruit left to pick.

  • Openness: The more open the format, the better. Table formats are about building shared table storage and avoiding vendor lock-in. This is the whole idea behind the headless data architecture, where you bring your own compute to shared storage.

  • Business: How Snowflake, Databricks, Confluent, Google Cloud, AWS, and Azure put their weight behind the formats will count.

  • Grassroots: Never discount the impact of a few famous use cases inside companies like Netflix, Uber, Apple, Airbnb, and so on.

The future I see is one where data platforms compete, and data interoperability comes not from table format interoperability but from using a common utility standard for representing tables. In networking, we’ve got TCP and UDP as the main standards. In data formats, right now at least, we have Parquet and ORC. Companies are not competing at these levels, nor are they trying to be interoperable across these standards. 

Alignment at the data layer

The third alternative is to align the table formats at the data layer so that cross-publishing can cover the vast majority of features, support merge-on-read without rewriting delete/DV files, and so on. If cross-publishing table formats ever really works well, it will be because the remaining table formats have standardized some things, like partitioning, clustering, delete files, and so on. There is also the potential for common standards for things like secondary indexing. This is similar to the standardized protocols that sit above TCP and UDP, like DNS or BGP, which support interoperability and core workflows; but currently there is no standardization mechanism like RFCs for the open table formats' data layer.
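To make the idea concrete, here is a purely hypothetical sketch of what a shared positional-delete representation might look like. Today, Iceberg's position delete files and Delta's deletion vectors encode essentially the same fact - "row N of data file F is deleted" - in incompatible shapes. Nothing below is a real spec; all names and types are invented for illustration.

```python
# A purely hypothetical sketch of an aligned positional-delete standard.
# Not a real spec from any format; names are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class PositionalDelete:
    data_file: str  # path of the Parquet data file the delete applies to
    position: int   # zero-based row ordinal within that file

def apply_deletes(data_file, rows, deletes):
    """Filter the rows of one data file (in file order) against a shared delete set."""
    dead = {d.position for d in deletes if d.data_file == data_file}
    return [row for pos, row in enumerate(rows) if pos not in dead]

# Example: rows 0 and 2 of part-0001.parquet are marked deleted.
deletes = [PositionalDelete("part-0001.parquet", 0),
           PositionalDelete("part-0001.parquet", 2)]
print(apply_deletes("part-0001.parquet", ["a", "b", "c", "d"], deletes))  # ['b', 'd']
```

If every format read and wrote one such representation, merge-on-read tables could be cross-published without rewriting delete/DV files.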

But if all that did happen, why have a bunch of competing formats at all? 

Final thoughts

We have all these competing formats; some customers use Iceberg, some use Delta, and some use Hudi, but they want interoperability. Cross-publishing seems like a way to get there; after all, you only need to write some additional metadata, not the data itself. But it has real costs, and it's not clear to me that cross-publishing has a long-term future. Over the next couple of years, while the table formats go through their rapid maturing phase, we may see cross-publishing like XTable/UniForm play a role for organizations that need to consume table format X but are currently offered Y. However, in the long term, either the table formats will align their internals to make metadata cross-publishing work well, or the compute engines will take on interoperability and the market will pick a table format winner or two.

We’re still in the early adoption phase of the table formats; it’s the Wild West in shared table storage land.