I've created this page to make it easier for me to share links to my writing on table format internals. Currently, it covers Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon.
The teacher's nemesis
A few months ago I wrote Learning and Reviewing System Internals - Tactics and Psychology. One thing I touched on was how it is necessary to create a mental model in order to grok a codebase or learn how a complex system works. The mental model gets developed piece by piece, built up from layers of abstraction.
Today I am also writing about mental models and abstractions, but from the perspective of team/project leaders and their role in onboarding new team/project members. In this context, the team lead and senior engineers are teachers, and how effective they are has a material impact on the success of the team. However, there are real challenges, and leaders can fail without being aware of it, with potentially poor outcomes if left unaddressed.
The curse of Conway and the data space
Conway’s Law:
"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."
This is playing out worldwide across hundreds of thousands of organizations, and nowhere is it more evident than in the split between software development and data analytics teams. These two groups usually have different reporting structures, right up to, or immediately below, the executive team.
This is already a problem, and it is only growing.
Table format interoperability, future or fantasy?
In the world of open table formats (Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon, etc), an emerging trend is to provide interoperability between table formats by cross-publishing metadata. It allows a table to be written in table format X but read in format Y or Z.
Cross-publishing is the idea of a table having:
A primary table format that you write to.
Equivalent metadata files of one or more secondary formats that allow the table to be read as if it were of that secondary format.
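To make this concrete, here is a hypothetical sketch of what cross-publishing can look like in object storage (all paths and file names are invented for illustration): one set of data files, a primary Iceberg metadata tree that writers commit to, and a cross-published Delta log describing the same files.

```python
# Hypothetical layout (paths and names invented): a single table whose
# data files are shared, readable through two formats' metadata.
table_layout = {
    # Parquet data files, shared by every format that reads the table
    "s3://lake/orders/data/": [
        "part-00000.parquet",
        "part-00001.parquet",
    ],
    # Primary format: Iceberg metadata, the only tree writers commit to
    "s3://lake/orders/metadata/": [
        "v3.metadata.json",
        "snap-123.avro",
    ],
    # Secondary format: cross-published Delta log referencing the same data files
    "s3://lake/orders/_delta_log/": [
        "00000000000000000002.json",
    ],
}
```

After each commit to the primary format, a translation step regenerates the secondary metadata, so a Delta reader sees the same data files as an Iceberg reader.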
Table format comparisons - Change queries and CDC
This post, and its associated deep dives, will look at how changes made to an Iceberg/Delta/Hudi/Paimon table can be emitted as a stream of changes. In the context of the table formats, it is not a continuous stream but rather the capability to incrementally consume changes by performing periodic change queries.
These change queries can return full Change Data Capture (CDC) data or just the latest data written to the table. When people think of CDC, they might initially think of tools such as Debezium that read the transaction logs of OLTP databases and write a stream of change events to something like Apache Kafka. From there the events might get written to a data lakehouse. But the lakehouse table formats themselves can also generate a stream of change events that can be consumed incrementally. That is what this post is about.
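As a concrete illustration, here is a minimal sketch of a periodic change query using Delta Lake's Change Data Feed in PySpark (the table name and version numbers are invented, and CDF must be enabled on the table). Each poll asks for the changes committed since the last version the consumer processed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Version the consumer processed last time, stored between polls
last_processed_version = 41

# Ask the table for all changes committed after that version
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version + 1)
    .table("lakehouse.orders")
)

# Rows carry CDC metadata columns: _change_type ('insert',
# 'update_preimage', 'update_postimage', 'delete'),
# _commit_version and _commit_timestamp
changes.show()
```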
BYOC, not “the future of cloud services” but a pillar of an everywhere platform
In 2023, I wrote a long post about why I don’t think the future of cloud data services is BYOC but large-scale multi-tenant SaaS. BYOC stands for Bring Your Own Cloud, and is the practice of deploying a co-managed service into a customer VPC. It’s somewhere between self-hosted and fully-managed SaaS. In my previous writing, I wrote in detail about the drawbacks of this deployment model from the perspective of both the vendor and the customer.
Since then, I've been involved in multiple calls with customers and prospective customers, where BYOC has been a large discussion point. When we lost deals to BYOC competitors, there were often valid reasons. A year on, my position on BYOC hasn't really changed, though I would clarify that my position has been focused on a BYOC flavor where the vendor co-manages a complex, stateful single-tenant service. Confluent could have decided to package up Confluent Platform, its single-tenant self-hosted service, put it on Kubernetes with an operator and give it to customers as BYOC. But it wasn't the right route for building out a BYOC offering at scale. Then WarpStream came along and showed another way of doing BYOC, one that avoids many of the pitfalls that make scaling a BYOC fleet so difficult.
In this post, I will reflect on my last year of customer conversations, movements in the market, Confluent's acquisition of WarpStream, and its embrace of BYOC as a third deployment model.
Constraints breed innovation and so does tenacity
A number of years ago, I got a bit addicted to coding challenges on HackerRank. It was kind of intense: fun learning the algorithms, frustrating seeing my early attempts crash and burn, but ultimately triumphant when I finally got the thing to run in under the N seconds demanded.
Something that has always stuck with me from those days was how easy it would have been to settle if I hadn’t known it was possible. In real life, on each iteration, as I slowly improved the running time, I could have settled. But in a coding challenge, if the target was 4 seconds then I knew it was possible, and even though it seemed impossible after my initial attempts, I carried on. I carried on despite frustration and exasperation as my attempts continued to fail.
Another example of this was the 1 Billion Row Challenge. You could see other people’s results, so you knew you could do better.
In the real world it's so easy to settle. You write an algorithm, come up with an architecture, design a protocol and so on; it works and has reasonable performance, reasonable properties. In the real world sometimes that's enough and you move on. But sometimes perhaps it's worth striving and not settling for your early ideas. What if, instead of accepting your design, you asked yourself for something better, maybe something out of left field? Maybe add a constraint that, if you could pull it off, would be amazing. Maybe it requires some extra reading, like when I would dip into my various algorithms books for another way, another strategy.
It's something I seriously think about whenever it comes to software design and implementation. It's a sort of ongoing epiphany that surfaces anytime I need to design something - remember HackerRank. I ask: well, that's good, but what if you had to come up with something better than this?
Table format comparisons - Streaming ingest of row-level operations
In the previous post, I covered append-only tables, a common table type in analytics, often used for ingesting data into a data lake or modeling streams between stream processor jobs. I had promised to cover native support for changelog streams, aka change data capture (CDC), but before I do so, I think we should first look at how the table formats support the ingestion of data with row-level operations (insert, update, delete) rather than the query-level operations commonly used in SQL batch commands.
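To illustrate the distinction, here is a hedged sketch (field names and table names are invented; the row-level side uses Flink-style changelog notation): the same changes expressed as explicit row-level operations versus a single query-level SQL command whose affected rows are computed by the engine.

```python
# Row-level operations: one explicit operation per row, in Flink-style
# changelog notation (+I insert, -U/+U update before/after image, -D delete)
row_level_batch = [
    {"op": "+I", "id": 1, "amount": 10},
    {"op": "-U", "id": 2, "amount": 15},
    {"op": "+U", "id": 2, "amount": 20},
    {"op": "-D", "id": 3, "amount": 7},
]

# Roughly the same intent as a single query-level operation
# (hypothetical tables; the engine works out which rows are affected)
query_level = """
MERGE INTO orders t
USING staged_changes s ON t.id = s.id
WHEN MATCHED AND s.op = '-D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount)
"""
```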
Table format comparisons - Append-only tables and incremental reads
This post is about how the table formats support append-only tables and incremental reads. Streaming is becoming more and more important in the data analytics stack and the table formats all have some degree of support for streaming. One of the pillars of a streaming workload based on table formats is the append-only table. There are other pillars, such as changelog streams, and I’ll cover those in another post.
Incremental reads allow compute engines to perform repeated queries that return new records or changes to records that have occurred since the last query was executed. Basically, a table client polls the table on an interval, receiving the latest data on each occasion. Much like a Kafka consumer, albeit with a lot more end-to-end latency.
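As a minimal sketch of such a poll, here is Iceberg's incremental read support in PySpark (the table name and snapshot id are invented; Iceberg's incremental reads cover append snapshots only). Each poll reads just the records appended since the snapshot the consumer last saw.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Snapshot id persisted by the consumer after its previous poll
last_seen_snapshot = 8744736658442914487

# Read only the records appended after that snapshot (exclusive),
# up to the table's current snapshot
new_rows = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", last_seen_snapshot)
    .load("lakehouse.events")
)
new_rows.show()

# After processing, store the table's current snapshot id for the next poll
```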
Table format comparisons - How do the table formats represent the canonical set of files?
This is the first in a series of short comparisons of table format internals. While I have written in some detail about each format, I think it's interesting to look at what is the same or similar across them and what sets them apart.
Question: How do the table formats represent the canonical list of data and delete files?
All the table formats store references to a canonical set of data and delete files within a set of metadata files. Each table format takes a slightly different approach, but I'll classify them into two categories:
The log of deltas approach (Hudi and Delta Lake)
The log of snapshots approach (Iceberg and Paimon)
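As a simplified sketch (not any format's actual metadata schema), the difference between the two approaches can be expressed like this:

```python
# Log of deltas (Hudi, Delta Lake): each commit records which files were
# added and removed; readers replay the log (typically from the last
# checkpoint) to reconstruct the current set of files.
delta_log = [
    {"add": ["f1.parquet", "f2.parquet"], "remove": []},
    {"add": ["f3.parquet"], "remove": ["f1.parquet"]},
]

def files_from_delta_log(log):
    current = set()
    for entry in log:
        current |= set(entry["add"])
        current -= set(entry["remove"])
    return current

# Log of snapshots (Iceberg, Paimon): each commit produces a complete
# snapshot referencing the full set of live files; readers load the latest.
snapshots = [
    {"id": 1, "files": ["f1.parquet", "f2.parquet"]},
    {"id": 2, "files": ["f2.parquet", "f3.parquet"]},
]

def files_from_snapshots(snaps):
    return set(snaps[-1]["files"])

# Both approaches yield the same canonical set of files
assert files_from_delta_log(delta_log) == files_from_snapshots(snapshots)
```

The trade-off, roughly: a log of deltas keeps commits small but requires readers to replay the log (mitigated by checkpoints), while a log of snapshots gives readers a complete view in one place at the cost of writing out full file listings (mitigated by reusing unchanged manifest files).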