This is the Apache Paimon deep dive associated with the table format comparisons - Change queries and CDC blog post, which looks at how each table format supports change queries, including full CDC. This is not a how-to guide but an examination of Apache Paimon capabilities through the lens of its internal design. Note that this post is not about CDC log ingestion but about Paimon’s support for querying the changes of the table itself.
Change query support in Apache Hudi (0.15)
This is the Apache Hudi deep dive associated with the table format comparisons - Change queries and CDC blog post, which looks at how each table format supports change queries, including full CDC. This is not a how-to guide but an examination of Apache Hudi capabilities through the lens of its internal design.
Change query support in Delta Lake (3.2.0)
This is the Delta Lake deep dive associated with the table format comparisons - Change queries and CDC blog post, which looks at how each table format supports change queries, including full CDC. This is not a how-to guide but an examination of Delta Lake capabilities through the lens of its internal design.
Change query support in Apache Iceberg v2
This is the Apache Iceberg deep dive associated with the table format comparisons - Change queries and CDC blog post, which looks at how each table format supports change queries, including full CDC. This is not a how-to guide but an examination of Apache Iceberg capabilities through the lens of its internal design.
Understanding Apache Iceberg’s Consistency Model Part 3
Understanding Apache Iceberg’s Consistency Model Part 2
Understanding Apache Iceberg's Consistency Model Part 1
Apache Iceberg is the last table format I am covering in this series and is perhaps the most widely adopted and well-known of the table formats. I wasn’t going to write this analysis originally as I felt the book Apache Iceberg: The Definitive Guide was detailed enough. Now, having gone through the other formats, I see that the book is too high-level for what I have been covering in this series—so here we go—a deep dive into Apache Iceberg internals to understand its basic mechanics and consistency model.
Understanding Apache Paimon's Consistency Model Part 3
In this final part of the Apache Paimon series, I’ll go over the formal verification with Fizzbee.
Normally I use TLA+ for formal verification but this time I decided to try out Fizzbee, a language and model checker that maps closely to TLA+ semantics but uses a subset of Python called Starlark. Fizzbee is still relatively immature but it shows a lot of potential. I’ll be writing about my experiences with Fizzbee in a future blog post.
Understanding Apache Paimon's Consistency Model Part 2
Understanding Apache Paimon's Consistency Model Part 1
Apache Paimon is an open-source table format that has come after the more established Apache Iceberg, Delta Lake and Apache Hudi projects. It was born in the Apache Flink project where it was known as Flink Table Store, but has since spun out as a top-level Apache project. When I first started digging into Paimon I remarked that if Iceberg, Delta and Hudi had a baby, it might be Paimon. But Paimon has a number of its own innovations that set it apart from the Big Three table formats.
Understanding Delta Lake's consistency model
A few days ago I released my analysis of Apache Hudi’s consistency model, with the help of a TLA+ specification. This post will do the same for Delta Lake. Just like the Hudi post, I will not comment on subjects such as performance, efficiency or how use cases such as batch and streaming are supported. This post focuses solely on the consistency model using a logical model of the core Delta Lake protocol.
Understanding Apache Hudi's Consistency Model Part 3
In part 1 we built a logical model for how copy-on-write tables work in Apache Hudi, and posed a number of questions regarding consistency with regard to types of concurrency control, timestamp monotonicity and more. In part 2 we studied timestamp collisions, their probabilities and how to avoid them (and be conformant to the Hudi spec). In part 3, we’ll be focusing on the results of model checking the TLA+ specification, and answering those questions.
Understanding Apache Hudi's Consistency Model Part 2
In part 1 we built up an understanding of the mechanics of Hudi copy-on-write tables, with special regard to multi-writer scenarios, using a simplified logical model. In this part we’ll look at:
Understanding why the Hudi spec instructs the use of monotonic timestamps, by looking at the impact of timestamp collisions.
The probability of collisions in multi-writer scenarios where writers use their local clocks as timestamp sources.
Various options for avoiding collisions.
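The intuition behind that collision probability is a birthday-problem calculation. The sketch below is my own illustration, not the analysis from the post itself; the function name and the parameter values are hypothetical, assuming millisecond-precision timestamps and a fixed contention window.

```python
import math

def collision_probability(writes: int, slots: int) -> float:
    """Birthday-problem approximation: probability that at least two of
    `writes` commit timestamps land on the same value, given `slots`
    distinct timestamp values available in the contention window."""
    return 1.0 - math.exp(-writes * (writes - 1) / (2.0 * slots))

# Example: writers collectively issue 100 commits within a one-second
# window, with millisecond precision (1000 distinct timestamp values).
p = collision_probability(writes=100, slots=1000)
print(f"P(collision) ~ {p:.3f}")  # roughly 0.993 - a collision is near-certain
```

Even modest write rates against low-precision clocks make collisions likely, which is why the post goes on to examine mitigation options such as higher-precision or externally issued monotonic timestamps.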
Understanding Apache Hudi's Consistency Model Part 1
Apache Hudi is one of the leading three table formats (Apache Iceberg and Delta Lake being the other two). Whereas Apache Iceberg internals are relatively easy to understand, I found that Apache Hudi was more complex and hard to reason about. As a distributed systems engineer, I wanted to understand it and I was especially interested to understand its consistency model with regard to multiple concurrent writers. Ultimately, I wrote a TLA+ specification to help me nail down the design and understand its consistency model.