Table format comparisons - How do the table formats represent the canonical set of files?

This is the first in a series of short comparisons of table format internals. While I have written in some detail about each, I think it’s interesting to look at what is the same or similar and what sets them apart from each other. These comparisons should be light on judgments and opinions, and focus on objective facts about how they work. I don’t mean to participate in any table format battles!

I’ve previously written about table format internals in my series on the consistency models of Apache Hudi, Delta Lake, Apache Iceberg and Apache Paimon.

Disclaimer: I work at Confluent and we’re investing heavily in the materialization of Kafka topics as Apache Iceberg tables (Tableflow). We didn’t choose Iceberg because we believe it’s better from a technology perspective, but simply because that is what customers are asking for. I’m not going to weigh in on table format wars here. All my writing on the table formats should be objective and for the love of understanding technology alone.

I use the terms copy-on-write (COW) and merge-on-read (MOR) a few times, so just keep those two abbreviations in mind.

Question: How do the table formats represent the canonical list of data and delete files?

All the table formats store references to a canonical set of data and delete files within a set of metadata files. Each table format takes a slightly different approach, but I’ll classify them into two categories (though reality is a little messier than that):

  • The log of deltas approach (Hudi and Delta Lake)

  • The log of snapshots approach (Iceberg and Paimon)

The log of deltas approach

The log of deltas approach consists of writing new changes as log entries that reference only what has changed, such as added and deleted files, schema changes and so on.

  • Delta Lake calls the log the Delta Log and each entry a Delta Entry.

  • Apache Hudi calls the log the Timeline and each entry an Instant.

Delta Lake

The Delta Log contains an ordered sequence of log entries as JSON files. 

Each log entry is of a specific action type, such as:

  • Change metadata

  • Add and remove files

  • Add CDC file

  • and more

Fig 1. Depicts a Delta Log with a set of four “Add/Remove Files” actions.

Each Delta Log entry and its log prefix represent the table at that point in time. A write operation produces an “Add/Remove Files” log entry containing a set of files that were added and a set of files that were logically deleted. To learn the canonical set of files, a reader reads all the log entries up to the entry that corresponds to the table version it wishes to read and rolls them up into a single logical snapshot of the table. Delta Lake keeps the cost of this log reading small by periodically writing a checkpoint (a Parquet file) to the log, consisting of a rolled-up snapshot of the metadata (including all the live data and deletion vector files) at that point in time.

Fig 2. A table scan would only need to read three metadata files of the log: 00005.checkpoint.parquet, 00006.json and 00007.json.

A reader builds a logical snapshot by loading the most recent checkpoint and then reading forwards through the log entries that follow it.
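
To make this concrete, here is a minimal Python sketch of the roll-up: build the canonical file set for a target version by starting from the latest checkpoint at or before that version and applying the Add/Remove Files entries that follow it. The file names and action shapes are simplified assumptions, not the actual Delta Lake protocol structures.

```python
# Hypothetical sketch only; the action shapes are simplified, not the real protocol.

def canonical_files(checkpoint_files: set[str],
                    checkpoint_version: int,
                    log_entries: dict[int, list[dict]],
                    target_version: int) -> set[str]:
    """Roll up Add/Remove Files actions from a checkpoint up to a target version."""
    live = set(checkpoint_files)  # live files as of the checkpoint
    for version in range(checkpoint_version + 1, target_version + 1):
        for action in log_entries.get(version, []):
            if "add" in action:
                live.add(action["add"]["path"])          # file added in this version
            elif "remove" in action:
                live.discard(action["remove"]["path"])   # file logically deleted
    return live

# Mirroring Fig 2: a checkpoint at version 5, then JSON entries 6 and 7.
checkpoint = {"part-001.parquet", "part-002.parquet"}
entries = {
    6: [{"add": {"path": "part-003.parquet"}}],
    7: [{"remove": {"path": "part-001.parquet"}},
        {"add": {"path": "part-004.parquet"}}],
}
print(canonical_files(checkpoint, 5, entries, 7))
# {'part-002.parquet', 'part-003.parquet', 'part-004.parquet'}
```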

Delta Lake is conceptually pretty simple with regard to how it represents the canonical set of files.

Apache Hudi

Apache Hudi takes a bit more explaining, so I’ll need more diagrams and words to describe it than the other formats.

The canonical set of files is represented by two components:

  • The timeline (the log of deltas).

  • The metadata table, which, among other things, acts as a file index for Hudi tables.

The Hudi metadata table contains a list of all the committed data files that make up a Hudi table. Each table commit writes to the timeline and to the metadata table. This list of committed data files can be combined with the timeline to return the set of files for any given table version stored in the timeline.

This might not be clear yet; I need to describe a few more concepts before we come back to the paragraph above: how Hudi distributes data across files and how the timeline works.

File groups, file slices, base and log files

A Hudi table is divided into a number of file groups, and one way of thinking about file groups is that they act as a kind of storage sharding mechanism. Primary keys are mapped to these shards and that mapping is stored in an index.

Fig 3. A table is divided into a number of file groups.

These file groups can be fixed in number or can be increased dynamically.

A file group consists of one or more file slices, and a file slice consists of a single base file (a Parquet file) and a number of log files (also Parquet, though they can be Avro). Log files are written in MOR tables and contain deltas (new rows and deletion vectors). For COW tables, a file slice is just a single base file.

Fig 4. Depicts two file slices of a MOR table. A compaction job compacts a file slice with three log files into a new base file.
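
As a rough mental model (my own sketch, not Hudi’s actual classes or file naming), the layout could be expressed like this:

```python
# Illustrative data model only; Hudi's real classes and file naming differ.
from dataclasses import dataclass, field

@dataclass
class FileSlice:
    commit_ts: int          # timestamp of the instant that created the slice
    base_file: str          # the Parquet base file
    log_files: list[str] = field(default_factory=list)  # MOR deltas; empty for COW

@dataclass
class FileGroup:
    file_group_id: str
    slices: list[FileSlice] = field(default_factory=list)

# A MOR file group with an older slice (base + three log files) and a newer
# slice produced by compaction, as in Fig 4.
fg = FileGroup("file-group-1", [
    FileSlice(commit_ts=1, base_file="base_ts1.parquet",
              log_files=["log_ts2", "log_ts3", "log_ts4"]),
    FileSlice(commit_ts=5, base_file="base_ts5.parquet"),
])
```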

Timestamps play a critical role in Hudi, and to understand that role, we’ll move on to the timeline.

The timeline

Each write operation goes through a process of writing a sequence of instants to the timeline, with the following action states:

  1. Requested

  2. Inflight

  3. Completed

There are a number of action types, such as:

  • Commit (copy-on-write op)

  • Delta Commit (merge-on-read op)

  • Compaction

  • Clean

  • and more

Each instant is written to the timeline with a name formed of:

  • The instant timestamp

  • The action

  • The action state

The completed instants for Commit and Delta Commit actions contain the list of files that were added. Data files can be base or log files.

Fig 5. An Apache Hudi timeline of a COW table. Depicts a table with 5 file groups and a log of completed commit instants where each instant lists a set of added file slices. The requested and inflight instants are omitted here.

Apache Hudi timeline commit and delta-commit instants list only the files that were added; there are no logically deleted files, which sets Hudi apart from the other table formats.

File slices (and even log files within file slices) are filtered out based on timestamps, instead of using an explicit logical delete mechanism. In Iceberg and Delta Lake, a COW operation will load a data file, perform some row-level mutation and write it back as a new data file. Then the operation will register the new file as added and the original file as logically deleted. Likewise, a compaction operation for both COW and MOR tables will result in logical deletions. In Hudi, the timeline does not contain any logically deleted files, as timestamps determine which base and log files are read from each file group in a table scan.

For example:

  1. A writer writes file-slice-1.parquet at ts=1 to file-group-1. It commits the operation by adding a completed commit instant with ts=1 that references file-slice-1.parquet.

  2. Next, the writer wants to delete one row from file-slice-1.parquet, so it loads the file slice, deletes the row, and writes it back as file-slice-2.parquet with ts=2. It commits the operation by adding a completed commit instant with ts=2 that references file-slice-2.parquet.

When a reader performs a table scan at ts=1, for file group 1 it will read file-slice-1.parquet as its timestamp is less than or equal to the timestamp of the scan. A table scan at ts=2 will read file-slice-2.parquet, as it is the file slice with the highest timestamp that is less than or equal to the scan timestamp. This obviates the need for explicit logical deletes of files.
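
The selection rule is essentially “the file slice with the highest timestamp at or below the scan timestamp”, applied per file group. A minimal sketch of that rule (hypothetical shapes, not Hudi’s API):

```python
# Hypothetical sketch of timestamp-based file slice selection for one file group.

def visible_slice(slices: list[dict], scan_ts: int) -> dict | None:
    """Pick the file slice with the highest commit_ts that is <= scan_ts."""
    candidates = [s for s in slices if s["commit_ts"] <= scan_ts]
    return max(candidates, key=lambda s: s["commit_ts"]) if candidates else None

file_group_1 = [
    {"commit_ts": 1, "base_file": "file-slice-1.parquet"},
    {"commit_ts": 2, "base_file": "file-slice-2.parquet"},
]
print(visible_slice(file_group_1, scan_ts=1)["base_file"])  # file-slice-1.parquet
print(visible_slice(file_group_1, scan_ts=2)["base_file"])  # file-slice-2.parquet
```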

Finally, how Hudi clients learn of the canonical set of files

A Hudi client has two ways of discovering the file slices that make up the table:

  1. If the client only wishes to know the file slices of the latest table version (known as snapshot queries in Hudi), it only needs to read the Hudi metadata table which contains information on all committed file slices. It simply needs to take the file slice of each file group with the highest timestamp.

  2. If the client wishes to know the file slices of a prior table version (known as time-travel queries), it performs the same metadata table read process, except that it filters out file slices, and log files within file slices, based on the commit timestamp of the last committed instant of that table version. That information comes from the timeline.

The timeline is not the source of the canonical set of files for the latest table version, but it is required for filtering in time-travel queries.
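
Putting the two query types together, here is a small sketch of the resolution logic, with the metadata table modelled as a mapping from file group id to its file slices (my own simplification, not Hudi’s API). For a time-travel query, the as-of timestamp would come from the timeline: the timestamp of the last completed instant of the requested table version.

```python
# Hypothetical sketch; metadata_table is modelled as
# {file_group_id: [{"commit_ts": int, "base_file": str}, ...]}.

def snapshot_query(metadata_table: dict) -> dict:
    """Latest table version: the highest-timestamp file slice per file group."""
    return {fg: max(slices, key=lambda s: s["commit_ts"])
            for fg, slices in metadata_table.items()}

def time_travel_query(metadata_table: dict, as_of_ts: int) -> dict:
    """Prior table version: keep only slices committed at or before as_of_ts."""
    result = {}
    for fg, slices in metadata_table.items():
        candidates = [s for s in slices if s["commit_ts"] <= as_of_ts]
        if candidates:
            result[fg] = max(candidates, key=lambda s: s["commit_ts"])
    return result
```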

Hudi keeps the size of the active timeline from growing too large via the process of timeline archiving. This keeps a certain number of completed instants in the active timeline by moving older instants to the archived timeline. The archived timeline is not referred to by regular operations but is there as a history of the table for other purposes.

Timeline archiving does not affect the ability of clients to learn of the file slices of the latest table version; it just bounds how far back time-travel and incremental queries can go. Only the timeline has the history of file changes; the metadata table acts as a current snapshot.

The log of snapshots approach

The log of snapshots approach involves writing a new tree of metadata files (a snapshot) on each commit. The root node of a specific table version is either a snapshot, or another metadata file that points to that snapshot. A reader learns of the snapshot (for the table version it wishes to read) and explores the child nodes to discover the canonical set of files. With the log of deltas approach, a new commit only adds a delta and a reader must roll up the log of deltas to make a logical snapshot. With the log of snapshots approach, this roll-up process into snapshots is already done during the writing phase.

Apache Iceberg

Iceberg stores a log of snapshots, and each snapshot represents a point in time (a table version).

Fig 6. Each table version is stored as a snapshot. A snapshot contains a single manifest-list file, which contains a list of entries that point to a set of manifest files. Each manifest file contains a list of entries that point to a set of data files that were added, deleted or existing.

Each commit writes a new metadata file that stores the snapshot log (with the new snapshot appended to the end) and information about the schema. The Iceberg catalog contains the path of the current metadata file.

Fig 7. An Iceberg client will start by consulting the catalog to learn of the current metadata file. It then loads that metadata file, which contains all live snapshots.

The metadata files themselves do not constitute a log that can be read. Instead, the current metadata file contains the log of live snapshots (one per table version). A reader can perform a read based on any of those snapshots; the current (most recent) snapshot is typically read for normal queries, while older snapshots can be read for time travel queries.
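
The read path can be summarised as a walk down this metadata tree. Here is a minimal, runnable sketch with the catalog, metadata file, manifest list and manifests modelled as plain dictionaries in a toy object store; it is not Iceberg’s real API, and real metadata carries far more (sequence numbers, partition specs, statistics and so on).

```python
# Hypothetical sketch of the Iceberg read path:
# catalog -> metadata file -> snapshot -> manifest list -> manifests -> data files.

def scan_files(store: dict, catalog: dict, table: str,
               snapshot_id: int | None = None) -> set[str]:
    metadata = store[catalog[table]]              # current metadata file via the catalog
    snapshots = metadata["snapshots"]             # the embedded log of snapshots
    snapshot = (snapshots[-1] if snapshot_id is None
                else next(s for s in snapshots if s["id"] == snapshot_id))

    files = set()
    for manifest_path in store[snapshot["manifest_list"]]:
        for entry in store[manifest_path]:
            if entry["status"] in ("ADDED", "EXISTING"):   # skip DELETED entries
                files.add(entry["path"])
    return files

# A toy "object store" with one snapshot, one manifest list and one manifest.
store = {
    "v2.metadata.json": {"snapshots": [{"id": 1, "manifest_list": "ml-1.avro"}]},
    "ml-1.avro": ["m-1.avro"],
    "m-1.avro": [{"status": "ADDED", "path": "data-001.parquet"}],
}
catalog = {"db.events": "v2.metadata.json"}
print(scan_files(store, catalog, "db.events"))    # {'data-001.parquet'}
```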

The expireSnapshots operation drops snapshots based on their age, thus keeping the snapshot log from growing too large.

This is all a simplification; if you want a deeper understanding of these metadata files, I recommend reading Apache Iceberg’s Consistency Model Part 1.

Apache Paimon

Paimon metadata has some similarities to Iceberg but does not use a catalog to point to a metadata file with the latest snapshot. Instead, it numbers its snapshots (much as Delta Lake numbers its log entries), and the Paimon client can determine the current snapshot by listing and sorting the snapshot files.

Fig 8. Snapshot-4 is the current snapshot.
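
As a tiny illustration of that discovery step (the file naming is an assumption, and I’m ignoring any optimisations a real client might use):

```python
# Hypothetical sketch: determine the current snapshot by listing and sorting
# the numbered snapshot files.

def current_snapshot(snapshot_files: list[str]) -> str:
    return max(snapshot_files, key=lambda name: int(name.rsplit("-", 1)[1]))

print(current_snapshot(["snapshot-1", "snapshot-2", "snapshot-3", "snapshot-4"]))
# snapshot-4
```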

Each snapshot is a tree similar to that of Iceberg except that a Paimon snapshot has two manifest-list files:

  • Base manifest list: The tree of files that represent the table at the beginning of the operation.

  • Delta manifest list: Lists the manifest files of the data files written and logically deleted in this operation.

Fig 9. Depicts a snapshot that was written by a compaction operation that deleted two existing data files and rewrote them as a single data file.

A Paimon snapshot separates the table's manifest files into those that existed when the operation began and the manifest files that list the data or deletion vector files that were added or logically deleted by the operation. Regular writes don’t logically delete files, as all deletes from regular writes are simply row-level deletes in a level-0 data file (level-0 of an LSM tree). However, compaction jobs do logically delete files.
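
To make the base/delta split concrete, here is a minimal sketch of deriving a snapshot’s full file set from its two manifest lists (the entry shapes are simplified assumptions, not Paimon’s real structures):

```python
# Hypothetical sketch: a snapshot's full file set is the base manifest list's
# files, plus the delta manifest list's additions, minus its logical deletions.

def snapshot_files(base_entries: list[dict], delta_entries: list[dict]) -> set[str]:
    files = {e["path"] for e in base_entries}    # the table as of the start of the operation
    for e in delta_entries:
        if e["kind"] == "ADD":
            files.add(e["path"])
        elif e["kind"] == "DELETE":
            files.discard(e["path"])             # logically deleted (e.g. by compaction)
    return files

# Mirroring Fig 9: compaction deletes two data files and adds their compacted replacement.
base = [{"path": "data-1.parquet"}, {"path": "data-2.parquet"}, {"path": "data-3.parquet"}]
delta = [{"kind": "DELETE", "path": "data-1.parquet"},
         {"kind": "DELETE", "path": "data-2.parquet"},
         {"kind": "ADD", "path": "data-4.parquet"}]
print(snapshot_files(base, delta))   # {'data-3.parquet', 'data-4.parquet'}
```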

The snapshot log is prevented from growing too large by snapshot retention configurations. Typically, snapshots that are older than the retention period are deleted synchronously during a write operation.

Read Apache Paimon’s Consistency Model Part 1 for a more in-depth description of Paimon metadata.

Conclusions

While each table format represents the canonical set of data and delete files differently, they all share the following characteristics:

  • Each table commit results in a new table version and queries can choose to read from specific versions (time travel).

  • Immutable versioned metadata points to immutable data files. Once a table version has been created, none of its metadata or data files are mutated. Eventually, older versions are expired by removing their associated metadata and physically deleting the data and delete files.

  • The metadata versions are stored as either a log of deltas or a log of snapshots.

  • Ultimately, each of the approaches mixes in some aspects of its opposite. The log of deltas formats maintain some kind of latest snapshot, and the log of snapshots formats record in each snapshot the delta that produced it:

    • Delta Lake periodically writes a checkpoint to the log which rolls up all deltas to make a snapshot as a Parquet file.

    • Hudi maintains the current snapshot in the metadata table.

    • Iceberg and Paimon use a log of snapshots but register the changes made in each snapshot.

Next up in this table format comparison series, how do the table formats support append-only tables and incremental reads?