Object storage is taking over more of the data stack, but low-latency systems still need separate hot-data storage. Storage unification is about presenting these heterogeneous storage systems and formats as one coherent resource. The goal is not one storage system and format to rule them all, but to virtualize them into a single logical view.
The primary use case for this unification is stitching real-time and historical data together under one abstraction (see the sketch after the list below). We see such unification in various data systems:
Tiered storage in event streaming systems such as Apache Kafka and Pulsar
HTAP databases such as SingleStore and TiDB
Real-time analytics databases such as Apache Pinot, Apache Druid, and ClickHouse
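To make that idea concrete, here is a minimal, hypothetical sketch of a unified read path: a single scan interface that serves the older portion of a time range from a cold, object-store-backed tier and the recent portion from a hot, low-latency tier. The SegmentStore and UnifiedStore names, and the single time-based boundary between tiers, are my own illustrative assumptions, not the API of any of the systems listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical read interface shared by both tiers: return records for a key
// within a half-open time range [fromMillis, toMillis).
interface SegmentStore {
    List<String> scan(String key, long fromMillis, long toMillis);
}

// One logical store that stitches a hot tier and a cold tier behind the same
// interface, so callers never need to know which tier holds which data.
final class UnifiedStore implements SegmentStore {
    private final SegmentStore hot;        // e.g. local log segments or memory
    private final SegmentStore cold;       // e.g. files on object storage
    private final long hotBoundaryMillis;  // records newer than this live in the hot tier

    UnifiedStore(SegmentStore hot, SegmentStore cold, long hotBoundaryMillis) {
        this.hot = hot;
        this.cold = cold;
        this.hotBoundaryMillis = hotBoundaryMillis;
    }

    @Override
    public List<String> scan(String key, long fromMillis, long toMillis) {
        List<String> results = new ArrayList<>();
        // Serve the historical slice of the range from the cold tier...
        if (fromMillis < hotBoundaryMillis) {
            results.addAll(cold.scan(key, fromMillis, Math.min(toMillis, hotBoundaryMillis)));
        }
        // ...and the recent slice from the hot tier, then return one combined result.
        if (toMillis >= hotBoundaryMillis) {
            results.addAll(hot.scan(key, Math.max(fromMillis, hotBoundaryMillis), toMillis));
        }
        return results;
    }

    public static void main(String[] args) {
        // Stub tiers that just label which tier served which slice of the range.
        SegmentStore hot = (key, from, to) -> List.of("hot[" + from + "," + to + ") " + key);
        SegmentStore cold = (key, from, to) -> List.of("cold[" + from + "," + to + ") " + key);
        SegmentStore unified = new UnifiedStore(hot, cold, 1_000L);
        // One call spans both tiers but the caller sees a single logical store.
        System.out.println(unified.scan("sensor-1", 0L, 2_000L));
    }
}
```

Real systems differ in where this boundary sits, how it moves, and how they keep the two tiers consistent, but the core trick is the same: the caller sees one logical store rather than two physical ones.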
The next frontier in this unification is the lakehouse, where real-time data is combined with historical lakehouse data. Over time we will see ever deeper lakehouse integration with lower-latency data systems.
In this post, I lay out a high-level conceptual framework for understanding the different building blocks that data systems can use for storage unification and the trade-offs involved, covering seven key considerations for evaluating design approaches. My aim is to establish a common set of terms here, so that in future posts I can discuss how different real-world systems actually do storage unification.