Have Your Iceberg Cubed, Not Sorted: Meet Qbeast, the OTree Spatial Index

In today’s post I want to walk through a fascinating indexing technique for data lakehouses that flips the role of the index in open table formats like Apache Iceberg and Delta Lake.

We are going to turn the tables on two common assumptions:

  1. Indexes are primarily for reads. Indexes are usually framed as read optimizations paid for by write overhead: they make read queries fast, but inserts and updates slower. That isn’t the full story, since indexes also support writes, for example through faster uniqueness enforcement and reduced lock contention (such as avoiding range locks during table scans). Still, the dominant mental model is that indexing serves reads while writes pay the bill.

  2. OTFs don’t use tree-based indexes. Open table format indexes are data-skipping indexes scoped to data files, or even to blocks within data files: a loose collection of column statistics and Bloom filters (a minimal sketch follows this list).
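To make the data-skipping idea concrete, here is a minimal Python sketch of min/max pruning. It is my own simplified illustration, not code from any table format spec; the FileStats fields and prune_files function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    """Per-data-file column statistics of the kind OTFs keep in metadata."""
    path: str
    min_value: int  # minimum of the filtered column within this file
    max_value: int  # maximum of the filtered column within this file

def prune_files(files: list[FileStats], value: int) -> list[str]:
    """Keep only the files whose [min, max] range could contain `value`;
    every other file is skipped without ever being opened."""
    return [f.path for f in files if f.min_value <= value <= f.max_value]

files = [
    FileStats("part-00.parquet", min_value=0, max_value=499),
    FileStats("part-01.parquet", min_value=500, max_value=999),
]
# A query like `WHERE id = 742` only needs to scan part-01.parquet:
print(prune_files(files, 742))  # ['part-01.parquet']
```

Bloom filters play a similar skipping role for equality predicates on high-cardinality columns, where min/max ranges alone prune poorly.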

Qbeast, a start-up with a presence here in Barcelona, where I live, is reimagining indexes for open table formats, showing that neither assumption has to be true.

How Would You Like Your Iceberg, Sir? Stream or Batch Ordered?

Today I want to talk about stream analytics, batch analytics and Apache Iceberg. Stream and batch analytics work differently, yet both can be built on top of Iceberg, and those differences can create a tug-of-war over the Iceberg table itself. In this post I am going to use two real-world systems, Apache Fluss (streaming tabular storage) and Confluent Tableflow (Kafka-to-Iceberg), as a case study of these tensions between stream and batch analytics.

  • Apache Fluss uses zero-copy tiering to Iceberg. Recent data is stored on Fluss servers (using the Kafka replication protocol for high availability and durability) and is then moved to Iceberg for long-term storage. This results in one copy of the data.

  • Confluent Kora and Tableflow use internal topic tiering and Iceberg materialization, copying Kafka topic data to Iceberg, such that we have two copies (one in Kora, one in Iceberg). The sketch after this list contrasts the two approaches.
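To make the one-copy versus two-copy distinction concrete, here is a toy Python sketch. It is my own simplified model, not actual Fluss or Tableflow code; the function names and list-based “stores” are hypothetical.

```python
# Toy model: the "hot" streaming store and Iceberg are each just a list.
hot_log: list[str] = ["r1", "r2", "r3", "r4"]  # recent records
iceberg: list[str] = []                        # long-term lakehouse storage

def tier_zero_copy(retain: int) -> None:
    """Fluss-style zero-copy tiering: older records are *moved* to
    Iceberg, so each record lives in exactly one store (one copy)."""
    while len(hot_log) > retain:
        iceberg.append(hot_log.pop(0))

def materialize(last_synced: int) -> int:
    """Tableflow-style materialization: records are *copied* to Iceberg
    while also remaining in the Kafka/Kora log (two copies). Returns
    the new sync position."""
    for record in hot_log[last_synced:]:
        iceberg.append(record)
    return len(hot_log)
```

The move-versus-copy distinction is the crux: tiering keeps storage costs down but couples the two systems, while materialization duplicates data in exchange for keeping them independent.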

This post will explain why the two projects have chosen different approaches and why both are sane, defensible decisions.

A Fork in the Road: Deciding Kafka’s Diskless Future

“The Kafka community is currently seeing an unprecedented situation with three KIPs (KIP-1150, KIP-1176, KIP-1183) simultaneously addressing the same challenge of high replication costs when running Kafka across multiple cloud availability zones.” — Luke Chen, The Path Forward for Saving Cross-AZ Replication Costs KIPs

At the time of writing, the Kafka project finds itself at a fork in the road: choosing the right path forward for implementing S3 topics has implications for the long-term success of the project. Not just the next couple of years, but the next decade. Open-source projects live and die by these big decisions, and as a community we need to make sure we make the right one.

This post explains the competing KIPs, but goes further and asks bigger questions about the future direction of Kafka.

Why I’m not a fan of zero-copy Apache Kafka-Apache Iceberg

Over the past few months, I’ve seen a growing number of posts on social media promoting the idea of a “zero-copy” integration between Apache Kafka and Apache Iceberg. The idea is that Kafka topics could live directly as Iceberg tables. On the surface it sounds efficient: one copy of the data, unified access for both streaming and analytics. But from a systems point of view, I think this is the wrong direction for the Apache Kafka project. In this post, I’ll explain why. 

Beyond Indexes: How Open Table Formats Optimize Query Performance

My career in data started as a SQL Server performance specialist, which meant I was deep into the nuances of indexes, locking and blocking, execution plan analysis and query design. These days I’m more in the world of open table formats such as Apache Iceberg. Having learned the internals of both transactional and analytical database systems, I find the use of the word “index” interesting, as it means very different things in different systems.

I see the term “index” used loosely when discussing open table format performance, both in their current designs and in speculation about future features that might make it into their specs. But what actually counts as an index in this world?

Some formats, like Apache Hudi, do maintain record-level indexes, such as primary-key-to-filegroup maps that enable upserts and deletes to be directed efficiently to the right filegroup in order to support primary key tables. But these don’t help accelerate read performance across arbitrary predicates like the secondary indexes we rely on in OLTP databases.
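As an illustration of the idea, a record-level index is essentially a key-to-filegroup map consulted on the write path. This is a minimal sketch of my own, not Hudi’s actual implementation; the names pk_to_filegroup and route_upsert are hypothetical.

```python
# Minimal sketch of a record-level index: a primary-key-to-filegroup map.
pk_to_filegroup: dict[str, str] = {
    "user-001": "filegroup-a",
    "user-002": "filegroup-a",
    "user-003": "filegroup-b",
}

def route_upsert(key: str) -> str:
    """Send an upsert (or delete) straight to the filegroup that already
    holds this key, instead of scanning every filegroup to locate it.
    Unseen keys are routed to a filegroup chosen for new inserts."""
    return pk_to_filegroup.setdefault(key, "filegroup-new")

print(route_upsert("user-003"))  # existing key -> 'filegroup-b'
print(route_upsert("user-999"))  # new key      -> 'filegroup-new'
```

Note that this map only answers “where does key X live?”; a predicate like age > 30 gains nothing from it, which is exactly the gap a secondary index would fill.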

Traditional secondary indexes (like the B-trees used in relational databases) don’t exist in Iceberg, Delta Lake, or even Hudi. But why? Couldn’t we solve some performance issues by just adding secondary indexes to the Iceberg spec?

The short answer is: “no, and it’s complicated”. There are real and practical reasons why the answer isn’t just “we haven’t gotten around to it yet”.

Understanding Apache Fluss

This is a data system internals blog post. So if you enjoyed my table format internals blog posts, or my writing on Apache Kafka internals or Apache BookKeeper internals, you might enjoy this one. But beware, it’s long and detailed. Also note that I work for Confluent, which also runs Apache Flink but neither runs nor contributes to Apache Fluss. However, this post aims to be a faithful and objective description of Fluss.

Apache Fluss is a table storage engine for Flink being developed by Alibaba in collaboration with Ververica. To write this blog post, I reverse-engineered a high-level architecture by reading the Fluss code from the main branch (and running tests) in August 2025. This follows the same approach as my writing about Kafka, Pulsar, BookKeeper, and the table formats (Iceberg, Delta, Hudi and Paimon), as the code is always the true source of information. Unlike with those systems, I have not had time to formally verify Fluss in TLA+ or Fizzbee, though I did not notice any obvious issues that are not already logged in a GitHub issue.

Let’s get started. We’ll begin with some high-level discussion in the Fluss Overview section, then get into the internals in the Fluss Cluster Core Architecture and Fluss Lakehouse Architecture sections.

A Conceptual Model for Storage Unification

Object storage is taking over more of the data stack, but low-latency systems still need separate hot-data storage. Storage unification is about presenting these heterogeneous storage systems and formats as one coherent resource. Not one storage system and storage format to rule them all, but virtualizing them into a single logical view. 

The primary use case for this unification is stitching real-time and historical data together under one abstraction. We see such unification in various data systems:

  • Tiered storage in event streaming systems such as Apache Kafka and Pulsar

  • HTAP databases such as SingleStore and TiDB

  • Real-time analytics databases such as Apache Pinot, Druid and ClickHouse

The next frontier in this unification is lakehouses, where real-time data is combined with historical lakehouse data. Over time we will see greater and greater lakehouse integration with lower-latency data systems.

In this post, I create a high-level conceptual framework for understanding the different building blocks that data systems can use for storage unification, and the trade-offs involved. I’ll cover seven key considerations for evaluating design approaches. I’m doing this because I want to write in the future about how different real-world systems do storage unification, and I want a common set of terms, which I will define in this post.

Remediation: What happens after AI goes wrong?

If you’re following the world of AI right now, no doubt you saw Jason Lemkin’s post on social media reporting how Replit’s AI deleted his production database, despite being told not to touch anything at all due to a code freeze. After deleting his database, the AI even advised him that a rollback would be impossible and the data was gone forever. Luckily, he went against that advice, performed the rollback, and got his data back.

Then, a few days later I stumbled on another case, this time of the Gemini CLI agent deleting Anurag Gupta’s files. He was just playing around, kicking the tires, but the series of events that took place is illuminating.

These incidents show AI agents making mistakes, but they also show agents failing to recover. In both cases, the AI not only broke something, it also failed to fix it. That’s why remediation needs to be a first-class concern in AI agent implementations.

The Cost of Being Wrong

A recent LinkedIn post by Nick Lebesis caught my attention with this brutal take on the difference between good startup founders and coward startup founders. I recommend you read the entire thing to fully understand the context, but I’ve pasted the part that most resonated with me below:

"Real founders? They make the wrong decision at 9am. Fix it by noon. Ship by 5. Coward founders are still scheduling the kickoff meeting. Your job isn't to be liked. Your job is to be clear. Wrong but decisive beats right but timid... every single time. Committees don't build companies. Convictions do."

It's harsh, but there's truth here that extends well beyond startups into how we approach technical decision-making in software development, even in large organizations. 

Responsibility Boundaries in the Coordinated Progress model

Building on my previous work on the Coordinated Progress model, this post examines how reliable triggers not only initiate work but also establish responsibility boundaries. Wherever a reliable trigger exists, a new boundary is created in which that trigger becomes responsible for ensuring the eventual execution of the sub-graph of work downstream of it. These boundaries can even layer and nest, especially in orchestrated systems that overlay finer-grained boundaries.