Tableflow: the stream/table, Kafka/Iceberg duality

Confluent just announced Tableflow, the seamless materialization of Apache Kafka topics as Apache Iceberg tables. It has to be the most impactful announcement I’ve witnessed while at Confluent. This post is about why Iceberg tables aren’t just another destination to sync data to; they fundamentally change the world of streaming. It’s also about the macro trends that have led us to this point and why Iceberg (and the other table formats) are so important to the future of streaming.

The world is changing

The data infrastructure world is undergoing multiple paradigm shifts simultaneously, all driven by S3 and the other cloud object storage services (GCS, Azure Blob Storage). While the trends we are seeing now have been in the making for a number of years, it seems that 2024 will be the year that the shift really gathers pace. As they say, change comes slowly, then all at once.

There are three principal paradigm shifts going on:

  1. The rise of object storage as the primary storage layer for cloud-based data systems.

  2. Tables as a sharing primitive, enabled by open table formats, with Apache Iceberg being the market leader.

  3. Streaming as a generalization of batch.

Paradigm-shift #1 - Object storage. In data system design, cost is a kind of unstoppable force that rewrites architectures every decade or so, and that is happening right now. AWS, GCP, and Azure have consistently driven down the cost and driven up the reliability of their object storage services over time. S3 and its peers are the only game in town and the cheapest form of storage; any system therefore has to build on them, because it cannot realistically be cheaper or more reliable than they are. You can’t compete with S3 because AWS doesn’t expose the cloud primitives needed to rebuild it; your only alternative is to get some space in a co-lo and start racking your own hard drives and solid-state drives. S3 may or may not be the ideal storage interface for a diverse set of data systems from a design point of view, but in terms of economics, it is unbeatable and inevitable.

Paradigm-shift #2 - Tables as a sharing primitive. Before the open table formats, such as Apache Iceberg, the most common way of moving data between the operational estate (think microservices, back-end systems, OLTP databases, etc.) and the analytics estate (think data warehouses, OLAP, etc.) was Apache Kafka plus ETL jobs. Kafka topics act as primitives for sharing data between systems and even between organizations. Iceberg et al have added another solid sharing primitive, one ideal for the analytics estate with its multitude of query engines and BI tools that can read and write these tables. The first manifestation of this new sharing primitive is the commoditization of the data lake and the rise of the headless data architecture. With storage and compute disaggregated, each organization gets to choose the query engines and BI tools that suit its needs, putting more power into its hands to decide where its data lives and how it is accessed. Data warehouse vendors like Snowflake are also joining the fray by adding the ability to query “external” tables using the Iceberg table format.
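To make the sharing primitive concrete, here is a minimal sketch in Java using the Iceberg API: any engine or tool that can reach the catalog and the object store resolves the same table, with no copies and no per-tool export pipelines. The catalog endpoint, warehouse path, and table name below are hypothetical, and this is an illustration of the idea rather than any vendor’s specific setup.

```java
import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.rest.RESTCatalog;

public class SharedTableRead {
    public static void main(String[] args) {
        // One shared catalog that every engine, BI tool, or custom job points at.
        RESTCatalog catalog = new RESTCatalog();
        catalog.initialize("shared", Map.of(
            "uri", "https://catalog.example.com",             // hypothetical REST catalog endpoint
            "warehouse", "s3://analytics-bucket/warehouse"));  // hypothetical warehouse location

        // Each consumer resolves the same table through the catalog:
        // no copies, no per-tool export pipelines.
        Table orders = catalog.loadTable(TableIdentifier.of("sales", "orders"));
        System.out.println("Current snapshot: " + orders.currentSnapshot().snapshotId());
    }
}
```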

Paradigm-shift #3 - Streaming as a generalization of batch. So far, we have seen that Iceberg is the bridge between the first two technology shifts, but it also plays a key role in making the third practical. The phrase “batch is a special case of streaming” was coined almost a decade ago, but in practice the idea has been hard to realize. Unbounded data is a superset of bounded data, and batch jobs and streaming jobs both have windows, with a batch job simply using one global window. While logically sound, this has been difficult to pull off when the historical source and the real-time source live in different systems. Enter Iceberg tables, which can serve the same dataset as both a table and a stream, bringing the principled generalization already present in the stream processing layer down into the storage tier.

One thing we’ve realized is that to make stream processing truly powerful, it should also encompass the capabilities of a batch processing system: the ability to manage massive tables of state and the ability to reprocess massive volumes of historical data as logic changes. The hard part has been combining the two modes without a jarring switch-over from one to the other.
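As a rough sketch of what this looks like in practice, here is the same Iceberg table read once as a bounded, point-in-time table and once as an unbounded stream of newly committed snapshots, using Flink’s Table API and the Iceberg Flink connector. The catalog, database, and table names are made up, and the exact connector options can vary by version; this is illustrative, not a description of Tableflow’s internals.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BoundedAndUnbounded {
    public static void main(String[] args) {
        TableEnvironment env =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // Allow SQL hints (required on some Flink versions).
        env.getConfig().getConfiguration()
           .setString("table.dynamic-table-options.enabled", "true");

        // Register an Iceberg catalog (warehouse location is hypothetical).
        env.executeSql(
            "CREATE CATALOG lake WITH (" +
            " 'type'='iceberg'," +
            " 'catalog-type'='hadoop'," +
            " 'warehouse'='s3://analytics-bucket/warehouse')");

        // Bounded: a point-in-time scan of the table, i.e. the classic batch query.
        env.executeSql("SELECT count(*) FROM lake.sales.orders").print();

        // Unbounded: the same table consumed as a stream of newly committed
        // snapshots; the bounded query above is just the global-window case.
        env.executeSql(
            "SELECT * FROM lake.sales.orders " +
            "/*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */").print();
    }
}
```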

The things we can do with the stream-table duality

The stream-table duality is the concept that a stream can be projected as a table, and the changes applied to a table can be materialized as a stream. The stream and the table are two sides of the same coin. Iceberg brings another level to this duality, as it can represent the same dataset as both a table and a stream. The wonderful thing about how Iceberg metadata works is that by controlling snapshot references, we can create multiple representations of the same underlying data. One data consumer can treat a table as an unbounded stream while another sees a point-in-time, bounded table; a stream can even diverge into branches that share a common ancestor.
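To make the snapshot-reference idea a little more concrete, here is a sketch using the Iceberg Java API (assuming an Iceberg version with branch and tag support): the same underlying data files back a pinned point-in-time tag, a diverging branch, and the main line that a streaming reader keeps following. The table, catalog, and reference names are hypothetical.

```java
import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.rest.RESTCatalog;

public class SnapshotReferences {
    public static void main(String[] args) {
        RESTCatalog catalog = new RESTCatalog();
        catalog.initialize("shared", Map.of(
            "uri", "https://catalog.example.com",             // hypothetical
            "warehouse", "s3://analytics-bucket/warehouse"));  // hypothetical
        Table orders = catalog.loadTable(TableIdentifier.of("sales", "orders"));

        long current = orders.currentSnapshot().snapshotId();

        // One consumer pins a bounded, point-in-time view of the table...
        orders.manageSnapshots()
              .createTag("eod-2024-03-19", current)
              .commit();

        // ...while an experiment diverges on its own branch, sharing all
        // history up to this snapshot with main.
        orders.manageSnapshots()
              .createBranch("backfill-experiment", current)
              .commit();

        // A streaming reader, meanwhile, simply keeps following new snapshots
        // committed to main: same underlying files, different representations.
    }
}
```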

What is really powerful is the merging of the Kafka topic and the Iceberg table into one coherent dataset. The operational estate of microservices and OLTP databases can continue to produce to streams with the Kafka API as it always has. Data consumers can choose to consume that stream using the Kafka API or as an Iceberg table using SQL. Consumers in the operational estate will likely carry on with the Kafka API, whereas tooling in the analytics estate can choose the Kafka API for low latency and Iceberg for SQL-based workloads. Because the underlying Iceberg data files are Parquet, we now have “columnar streams”, where the stream processor reads only the columns it needs, greatly increasing the efficiency of large jobs. Flink can filter, transform, aggregate, and join streams, writing the results back to streams via the Kafka API or materializing them directly as Iceberg tables. This opens up the reuse of data by both sides of the organization, whether they prefer the Kafka API or Iceberg.
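Here is a sketch of what those two access modes look like from a developer’s point of view: the operational side keeps using the plain Kafka producer API, while the analytics side reads the materialized Iceberg table and projects only the columns it cares about. The topic, table, column names, and endpoints are hypothetical, the schema mapping is glossed over, and the materialization itself is assumed to be handled by the platform rather than by this code.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.rest.RESTCatalog;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TwoAccessModes {
    public static void main(String[] args) throws Exception {
        // Operational side: an unchanged Kafka producer writing to the topic.
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // hypothetical bootstrap server
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"amount\": 99.5}"));
        }

        // Analytics side: the same dataset as an Iceberg table, projecting
        // only the columns this reader needs ("columnar streams").
        RESTCatalog catalog = new RESTCatalog();
        catalog.initialize("shared", Map.of(
            "uri", "https://catalog.example.com",             // hypothetical
            "warehouse", "s3://analytics-bucket/warehouse"));  // hypothetical
        Table orders = catalog.loadTable(TableIdentifier.of("sales", "orders"));
        try (CloseableIterable<Record> rows =
                 IcebergGenerics.read(orders).select("order_id", "amount").build()) {
            for (Record row : rows) {
                System.out.println(row.getField("order_id") + " -> " + row.getField("amount"));
            }
        }
    }
}
```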

Integrating Iceberg into the Kora storage engine itself

Next, I want to cover the motivations for integrating Iceberg into the Kora storage engine itself, rather than relying on yet another connector. It’s a key point to understand, because we can’t simply replace every connector with deep integration into the storage engine. So why build Iceberg into the storage engine? Why go multi-modal?

Using an Iceberg connector has its limits and forces you to choose between two competing concerns:

  1. If you prioritize latency, then you write early, which means many small Parquet files. This becomes inefficient for reads, which must now deal with masses of small files.

  2. If you prioritize read efficiency, then you accumulate data until you can write large Parquet files, but this introduces much higher latency.

What we really want to do is optimize for both latency and read efficiency by being able to write small files and use compaction to merge these small files into larger read-optimized files in the background. If you have read my recent write-up on ClickHouse, you’ll recognize this pattern. Write small for latency, then compact to large in the background for read efficiency.
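For reference, this is roughly what the compaction step looks like using Iceberg’s standard rewrite-data-files maintenance action, shown here via the Spark action API purely as an illustration; the point of the storage-engine integration is that Kora schedules the equivalent work itself in the background. The catalog/table identifier and target file size are illustrative, and a Spark session with an Iceberg catalog named "lake" is assumed.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class CompactSmallFiles {
    public static void main(String[] args) throws Exception {
        // Assumes a Spark session with an Iceberg catalog named "lake" configured.
        SparkSession spark = SparkSession.builder()
            .appName("compact-orders")
            .getOrCreate();

        Table orders = Spark3Util.loadIcebergTable(spark, "lake.sales.orders");

        // Merge the many latency-optimized small files into ~512 MB read-optimized files.
        RewriteDataFiles.Result result = SparkActions.get(spark)
            .rewriteDataFiles(orders)
            .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
            .execute();

        System.out.printf("Rewrote %d small files into %d larger files%n",
            result.rewrittenDataFilesCount(), result.addedDataFilesCount());
    }
}
```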

Using an Iceberg connector only solves half the problem, and the much easier half at that. It also means duplicating data, as Kora already aggressively tiers data to object storage. 

By integrating Iceberg into the storage engine as a first-class storage primitive, we can do amazing things. Firstly, we can replace the native tiered storage format with Parquet files and Iceberg metadata directly. This creates a zero-copy storage representation of a stream as a single coherent dataset across Kora brokers and object storage. The Kora storage engine handles this Iceberg/Parquet storage tiering, object storage file compaction, retention, storage optimization, and schema evolution between Kafka topic schemas and Iceberg tables. Data consumers can access the tiered data directly as Iceberg tables, or consume the entire stream from Kora brokers using the Kafka API.

There is a big difference between having an Iceberg connector and a multi-modal storage engine. Once you have this multi-modal storage engine, you can think of the rest as “apps” that sit on top.

In this layered, modular platform, the “apps” on top just get stream and table abstractions.

Final thoughts

Data streaming vendors, such as Confluent, are fundamentally different from many others in the data infrastructure sector. Streaming vendors don’t rely on data gravity—data comes in and then leaves again. The streaming platform is a data platform, but it’s an open one by design. It makes money by enabling the sharing of data by opening it up to other systems. Where the shared table paradigm shift can threaten some portion of the data infrastructure sector, for streaming, it only represents an opportunity.

Kafka occupies a unique position: it is used heavily in the operational estate, but it is also used as a way of getting data into the analytics estate. Multi-modal streams that speak the language of both sides of the organization bring the analytics and operational sides closer together. At first, streams will materialize as tables, but Tableflow will later gain the ability to materialize Iceberg tables as Kafka topics: true bi-directional communication between the operational and analytics estates.

This makes reuse trivial, because the lowest-friction path is simply enabling Iceberg on a Kafka topic or enabling Kafka on an Iceberg table. No ETL, no data duplication: just a bi-directional, multi-modal stream/table duality.