Apache Iceberg is a hot topic right now, and looks to be the future standard for representing tables in object storage. Hive tables are overdue for a replacement. People talk about table format wars: Apache Iceberg vs Delta Lake vs Apache Hudi and so on, but the “war” at the forefront of my mind isn’t which table format will become dominant, but the battle between open vs closed – open table formats vs walled gardens.
The curse of Conway and the data space
Conway’s Law:
"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."
This is playing out worldwide across hundreds of thousands of organizations, and nowhere is it more evident than in the split between software development and data analytics teams. These two groups usually have different reporting structures, right up to, or immediately below, the executive team.
This is a problem now, and it is only growing.
Table format interoperability, future or fantasy?
In the world of open table formats (Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon, etc), an emerging trend is to provide interoperability between table formats by cross-publishing metadata. It allows a table to be written in table format X but read in format Y or Z.
Cross-publishing is the idea of a table having:
A primary table format that you write to.
Equivalent metadata files of one or more secondary formats that allow the table to be read as if it were of that secondary format.
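The idea above can be sketched in a few lines of code. This is a deliberately simplified model (the file names, JSON shapes, and function names are hypothetical, not the real Iceberg or Delta specifications); real cross-publishing tools such as Apache XTable or Delta UniForm translate the actual metadata formats, but the principle is the same: one set of data files, multiple sets of format-specific metadata pointing at them.

```python
import json
import os
import tempfile

def write_iceberg_commit(table_dir, snapshot_id, data_files):
    """Write primary-format (Iceberg-style) metadata for a commit.
    Simplified stand-in, not the real Iceberg metadata schema."""
    meta = {"snapshot-id": snapshot_id, "data-files": data_files}
    path = os.path.join(table_dir, "metadata", f"v{snapshot_id}.metadata.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(meta, f)
    return meta

def cross_publish_delta(table_dir, meta):
    """Derive equivalent secondary-format (Delta-style) metadata.
    The Parquet data files are shared; only metadata is re-expressed."""
    entry = {"add": [{"path": p} for p in meta["data-files"]]}
    path = os.path.join(table_dir, "_delta_log",
                       f"{meta['snapshot-id']:020d}.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(entry, f)
    return entry

table = tempfile.mkdtemp()
m = write_iceberg_commit(table, 1, ["part-0001.parquet"])
d = cross_publish_delta(table, m)
# An Iceberg reader follows metadata/, a Delta reader follows _delta_log/,
# and both resolve to the same underlying data file.
```

The key property is that cross-publishing is cheap: metadata files are tiny relative to the data, so the secondary formats cost almost nothing to maintain.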
BYOC, not “the future of cloud services” but a pillar of an everywhere platform
In 2023, I wrote a long post about why I think the future of cloud data services is not BYOC but large-scale multi-tenant SaaS. BYOC stands for Bring Your Own Cloud: the practice of deploying a co-managed service into a customer's VPC, somewhere between self-hosted and fully-managed SaaS. In that post, I wrote in detail about the drawbacks of this deployment model from the perspective of both the vendor and the customer.
Since then, I’ve been involved in multiple calls with customers and prospective customers where BYOC has been a major discussion point. When we lost a deal to a BYOC competitor, there were often valid reasons behind the choice. A year on, my position on BYOC hasn’t really changed, though I would clarify that it has been focused on a BYOC flavor where the vendor co-manages a complex, stateful single-tenant service. Confluent could have packaged up Confluent Platform, its single-tenant self-hosted service, put it on Kubernetes with an operator, and given it to customers as BYOC, but that wasn’t the right route for building out a BYOC offering at scale. Then WarpStream came along and showed another way of doing BYOC, one that avoids many of the pitfalls that make scaling a BYOC fleet so difficult.
In this post, I will reflect on my last year of customer conversations, movements in the market, Confluent’s acquisition of WarpStream, and its embrace of BYOC as a third deployment model.
A Cost Analysis of Replication vs S3 Express One Zone in Transactional Data Systems
Is it economical to build fault-tolerant transactional data systems directly on S3 Express One Zone, instead of using replication? Read on for an analysis.
Cloud object storage is becoming the universal storage layer for a wealth of cloud data systems. Some systems use object stores as the only storage layer, accepting the higher latency of object storage; these tend to be analytics systems that can tolerate multi-second latencies. Transactional systems want single-digit-millisecond latencies, or at most latencies in the low tens of milliseconds, and therefore don’t write to object stores directly. Instead, they land data on a fast replicated write-ahead log (WAL) and offload it to an object store for economical, read-optimized long-term storage. Neon is a good example of this architecture: writes hit a low-latency replicated write-ahead log based on Multi-Paxos, and data is eventually written to object storage.
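The economic trade-off between the two designs can be sketched with some back-of-envelope arithmetic. All the prices below are illustrative assumptions, not quoted AWS rates: replication pays per-GB cross-AZ transfer to ship each write to followers in other availability zones, while writing directly to S3 Express One Zone pays per-request PUT fees, so the batch size of writes largely decides which is cheaper.

```python
def replication_cost(gb_written, replicas=3, cross_az_gb_price=0.02):
    """Cross-AZ transfer cost of shipping each write to followers.
    Assumes 2 of 3 replicas sit in remote AZs; the per-GB price
    is an illustrative assumption."""
    remote_copies = replicas - 1
    return gb_written * remote_copies * cross_az_gb_price

def s3_express_cost(gb_written, batch_kb, put_price_per_1k=0.0025):
    """PUT-request cost when batching writes into objects of batch_kb.
    The per-1000-requests price is an illustrative assumption."""
    puts = (gb_written * 1024 * 1024) / batch_kb
    return (puts / 1000) * put_price_per_1k

gb = 1000  # 1 TB of writes
rep = replication_cost(gb)                   # dominated by cross-AZ transfer
small = s3_express_cost(gb, batch_kb=16)     # small batches: many PUTs
large = s3_express_cost(gb, batch_kb=1024)   # large batches: far fewer PUTs
```

With these assumed prices, tiny 16 KB writes make the object-store path far more expensive than replication, while 1 MB batches make it far cheaper. This is why latency-tolerant, high-throughput workloads that can batch aggressively are the natural first movers onto object storage, and why transactional systems with small frequent writes keep a replicated WAL in front.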
Hybrid Transactional/Analytical Storage
Confluent has made two key feature announcements in the spring of 2024:
Freight Clusters, a new cluster type that writes directly to object storage. It is aimed at the “freight” of data streaming workloads (log ingestion, clickstreams, large-scale ETL, and so on) that can be cost-prohibitive on a low-latency multi-AZ replication architecture in the cloud.
Tableflow, an automated feature that provides seamless materialization of Kafka topics as Apache Iceberg tables (and vice-versa in the future).
This trend towards object storage is not just happening at Confluent but across the data ecosystem.
The Sisyphean struggle and the new era of data infrastructure
I just started re-reading Start With Why by Simon Sinek, which is a fantastic book on leadership and business strategy. The book’s core argument is that great companies don’t focus on what they do or offer, or how they do it. Instead, they focus on their WHY, their story, and what they stand for.
Tableflow: the stream/table, Kafka/Iceberg duality
Confluent just announced Tableflow, the seamless materialization of Apache Kafka topics as Apache Iceberg tables. It has to be the most impactful announcement I’ve witnessed while at Confluent. This post is about why Iceberg tables aren’t just another destination to sync data to; they fundamentally change the world of streaming. It’s also about the macro trends that have led us to this point and why Iceberg (and the other table formats) are so important to the future of streaming.
S3 Express One Zone, not quite what I hoped for
AWS just announced a new lower-latency S3 storage class and for those of us in the data infrastructure business, this is big news. It’s not a secret that a low-latency object storage primitive has the potential to change how we build cloud data systems forever. So has this new world arrived with S3 Express One Zone?
The answer is no, but this is a good time to talk about cloud object storage, its role in modern cloud data systems and the potential future role it can take.
On the future of cloud services and BYOC
My job at Confluent involves a mixture of research, engineering, and helping us figure out the best technical strategy to follow. BYOC is something I’ve been thinking about recently, so I decided to write down my thoughts on it and on where I think cloud services are going in general.
Bring Your Own Cloud (BYOC) is a deployment model that sits somewhere between a SaaS cloud service and an on-premise deployment. The vendor deploys its software into a VPC in the customer’s account but handles most of the administration for the customer. It’s not a new idea: the term Managed Service Provider (MSP) has been around since the 90s and refers to outsourcing the management and operations of IT infrastructure deployed in customer or third-party data centers.