This is the Apache Paimon deep dive associated with the table format comparisons - Change queries and CDC blog post, which looks at how each table format supports change queries, including full CDC. This is not a how-to guide but an examination of Apache Paimon capabilities through the lens of its internal design. Note that this post is not about CDC log ingestion but about Paimon’s support for querying the changes of the table itself.
Change query support in Apache Hudi (0.15)
This is the Apache Hudi deep dive associated with the table format comparisons - Change queries and CDC blog post, which looks at how each table format supports change queries, including full CDC. This is not a how-to guide but an examination of Apache Hudi capabilities through the lens of its internal design.
Change query support in Delta Lake (3.2.0)
This is the Delta Lake deep dive associated with the table format comparisons - Change queries and CDC blog post, which looks at how each table format supports change queries, including full CDC. This is not a how-to guide but an examination of Delta Lake capabilities through the lens of its internal design.
Change query support in Apache Iceberg v2
This is the Apache Iceberg deep dive associated with the table format comparisons - Change queries and CDC blog post, which looks at how each table format supports change queries, including full CDC. This is not a how-to guide but an examination of Apache Iceberg capabilities through the lens of its internal design.
Understanding Apache Iceberg’s Consistency Model Part 3
Understanding Apache Iceberg’s Consistency Model Part 2
Understanding Apache Iceberg's Consistency Model Part 1
Apache Iceberg is the last table format I am covering in this series and is perhaps the most widely adopted and well-known of the table formats. I wasn’t going to write this analysis originally as I felt the book Apache Iceberg: The Definitive Guide was detailed enough. Now, having gone through the other formats, I see that the book is too high-level for what I have been covering in this series—so here we go—a deep dive into Apache Iceberg internals to understand its basic mechanics and consistency model.
Understanding Apache Paimon's Consistency Model Part 3
In this final part of the Apache Paimon series, I’ll go over the formal verification with Fizzbee.
Normally I use TLA+ for formal verification but this time I decided to try out Fizzbee, a language and model checker that maps closely to TLA+ semantics but uses a subset of Python called Starlark. Fizzbee is still relatively immature but it shows a lot of potential. I’ll be writing about my experiences with Fizzbee in a future blog post.
Understanding Apache Paimon's Consistency Model Part 2
Understanding Apache Paimon's Consistency Model Part 1
Apache Paimon is an open-source table format that has come after the more established Apache Iceberg, Delta Lake and Apache Hudi projects. It was born in the Apache Flink project where it was known as Flink Table Store, but has since spun out as a top-level Apache project. When I first started digging into Paimon I remarked that if Iceberg, Delta and Hudi had a baby, it might be Paimon. But Paimon has a number of its own innovations that set it apart from the Big Three table formats.
Understanding Delta Lake's consistency model
A few days ago I released my analysis of Apache Hudi’s consistency model, with the help of a TLA+ specification. This post will do the same for Delta Lake. Just like the Hudi post, I will not comment on subjects such as performance, efficiency or how use cases such as batch and streaming are supported. This post focuses solely on the consistency model using a logical model of the core Delta Lake protocol.
Understanding Apache Hudi's Consistency Model Part 3
In part 1 we built a logical model for how copy-on-write tables work in Apache Hudi, and posed a number of questions regarding consistency with regard to types of concurrency control, timestamp monotonicity and more. In part 2 we studied timestamp collisions, their probabilities and how to avoid them (and be conformant to the Hudi spec). In part 3, we’ll be focusing on the results of model checking the TLA+ specification, and answering those questions.
Understanding Apache Hudi's Consistency Model Part 2
In part 1 we built up an understanding of the mechanics of Hudi copy-on-write tables, with special regard to multi-writer scenarios, using a simplified logical model. In this part we’ll look at:
Understanding why the Hudi spec instructs the use of monotonic timestamps, by looking at the impact of timestamp collisions.
The probability of collisions in multi-writer scenarios where writers use their local clocks as timestamp sources.
Various options for avoiding collisions.
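The intuition behind that collision probability is a birthday-problem calculation. The sketch below is my own illustration, not the analysis from the post itself; the function name and the parameter values are hypothetical, assuming millisecond-precision timestamps and a fixed contention window.

```python
import math

def collision_probability(writes: int, slots: int) -> float:
    """Birthday-problem approximation: probability that at least two of
    `writes` commit timestamps land on the same value, given `slots`
    distinct timestamp values available in the contention window."""
    return 1.0 - math.exp(-writes * (writes - 1) / (2.0 * slots))

# Example: writers collectively issue 100 commits within a one-second
# window, with millisecond precision (1000 distinct timestamp values).
p = collision_probability(writes=100, slots=1000)
print(f"P(collision) ~ {p:.3f}")  # roughly 0.993 - a collision is near-certain
```

Even modest write rates against low-precision clocks make collisions likely, which is why the post goes on to examine mitigation options such as higher-precision or externally issued monotonic timestamps.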
Understanding Apache Hudi's Consistency Model Part 1
Apache Hudi is one of the leading three table formats (Apache Iceberg and Delta Lake being the other two). Whereas Apache Iceberg internals are relatively easy to understand, I found that Apache Hudi was more complex and hard to reason about. As a distributed systems engineer, I wanted to understand it and I was especially interested to understand its consistency model with regard to multiple concurrent writers. Ultimately, I wrote a TLA+ specification to help me nail down the design and understand its consistency model.