My career in data started as a SQL Server performance specialist, which meant I was deep into the nuances of indexes, locking and blocking, execution plan analysis and query design. These days I’m more in the world of the open table format such as Apache Iceberg. Having learned the internals of both transactional and analytical database systems, I find the use of the word “index” interesting as they mean very different things to different systems.
I see the term “index” used loosely when discussing open table format performance, both in their current designs and in speculation about future features that might make it into their specs. But what actually counts as an index in this world?
Some formats, like Apache Hudi, do maintain record-level indexes such as, primary-key-to-filegroup maps that enable upserts and deletes to be directed efficiently to the right filegroup in order to support primary key tables. But they don’t help accelerate read performance across arbitrary predicates like the secondary indexes we rely on in OLTP databases.
Traditional secondary indexes (like the B-trees used in relational databases) don’t exist in Iceberg, Delta Lake, or even Hudi. But why? Can't we solve some performance issues if we just added secondary indexes to the Iceberg spec?
The short answer is: “no and it's complicated”. There are real and practical reasons why the answer isn’t just "we haven't gotten around to it yet."