Jack Vanlightly

Strategy and commentary

Towards composable data platforms

Technology changes can be sudden (like generative AI) or slower juggernauts that kick off a chain reaction that takes years to play out. I would place object storage and its enablement of disaggregated architectures in that latter category. The open table formats, such as Apache Iceberg, Delta Lake, and Apache Hudi, form part of this chain reaction, but things aren’t stopping there.

I’ve written extensively about the open table formats (OTFs). In my original Tableflow post, I wrote that shared tables were one of the major trends, enabled by the OTFs. But why is it that OTFs make for a good sharing primitive? Until now, I have focused mainly on the separation of compute and storage: OTFs allow for a headless architecture where different platforms can bring their own compute to the same data. This is all true.

But we can also view OTFs as enabling a kind of virtualization. In this post, I will start by explaining my take on OTFs and virtualization. Finally, I’ll bring it back to Confluent, the Confluent/Databricks partnership, and the future of composable data platforms.

Why Snowflake wants streaming

Rumors are swirling that Snowflake intends to acquire Redpanda and many are questioning why and what impact this might have on Confluent. First, let’s remember that these are just rumors and there’s nothing official. But given that people are speculating, here are my thoughts on how to interpret such an acquisition, whether it ends up happening or not.

There are a number of market trends in play right now, such as the rise of Iceberg and open data, as well as the war with Databricks and Snowflake’s refocus on AI. While it may not be evident at first, these are all driving Snowflake towards streaming.

Forget the table format war; it’s open vs closed that counts

Apache Iceberg is a hot topic right now, and looks to be the future standard for representing tables in object storage. Hive tables are overdue for a replacement. People talk about table format wars: Apache Iceberg vs Delta Lake vs Apache Hudi and so on, but the “war” at the forefront of my mind isn’t which table format will become dominant, but the battle between open vs closed – open table formats vs walled gardens.

The curse of Conway and the data space

Conway’s Law:

"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."

This is playing out worldwide across hundreds of thousands of organizations, and it is nowhere more evident than in the split between software development and data analytics teams. These two groups usually have different reporting structures, right up to, or immediately below, the executive team.

This is a problem now and is only growing.

Table format interoperability, future or fantasy?

In the world of open table formats (Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon, etc.), an emerging trend is to provide interoperability between table formats by cross-publishing metadata. It allows a table to be written in table format X but read as format Y or Z.

Cross-publishing is the idea of a table having:

  • A primary table format that you write to.

  • Equivalent metadata files of one or more secondary formats that allow the table to be read as if it were of that secondary format. 
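
To make the idea concrete, here is a minimal toy sketch of cross-publishing. Everything in it is hypothetical and for illustration only: the `Commit` and `CrossPublishedTable` names, the in-memory metadata, and the assumption that all formats point at the same shared data files. It does not reflect how any real table format or translation tool lays out its metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """A format-neutral description of one table commit (hypothetical)."""
    version: int
    data_files: list[str]


@dataclass
class CrossPublishedTable:
    """Toy model: writes go to one primary format, secondary metadata is derived."""
    primary_format: str
    secondary_formats: list[str]
    # metadata[fmt] holds the commits visible to readers of that format.
    metadata: dict[str, list[Commit]] = field(default_factory=dict)

    def write(self, data_files: list[str]) -> Commit:
        # Writers only ever commit through the primary format.
        log = self.metadata.setdefault(self.primary_format, [])
        commit = Commit(version=len(log) + 1, data_files=list(data_files))
        log.append(commit)
        return commit

    def cross_publish(self) -> None:
        # Derive equivalent metadata for each secondary format.
        # Assumption: data files are shared between formats, never copied.
        primary_log = self.metadata.get(self.primary_format, [])
        for fmt in self.secondary_formats:
            self.metadata[fmt] = [
                Commit(c.version, list(c.data_files)) for c in primary_log
            ]

    def read(self, fmt: str) -> list[str]:
        # A reader sees only what has been published for its format so far.
        return [f for c in self.metadata.get(fmt, []) for f in c.data_files]
```

In this sketch, a reader of a secondary format lags the primary until the next `cross_publish()` call, which is one of the trade-offs such a scheme has to manage.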

BYOC, not “the future of cloud services” but a pillar of an everywhere platform

In 2023, I wrote a long post about why I don’t think the future of cloud data services is BYOC but large-scale multi-tenant SaaS. BYOC stands for Bring Your Own Cloud, and is the practice of deploying a co-managed service into a customer VPC. It’s somewhere between self-hosted and fully-managed SaaS. In my previous writing, I wrote in detail about the drawbacks of this deployment model from the perspective of both the vendor and the customer.

Since then, I’ve been involved in multiple calls with customers and prospective customers, where BYOC has been a large discussion point. When we lost a deal to a BYOC competitor, there were often valid reasons. A year on, my position on BYOC hasn’t really changed, though I would clarify that my position has been focused on a BYOC flavor where the vendor co-manages a complex, stateful single-tenant service. Confluent could have decided to package up Confluent Platform, its single-tenant self-hosted service, put it on Kubernetes with an operator, and give it to customers as BYOC. But that wasn’t the right route for building out a BYOC offering at scale. Then WarpStream came along and showed another way of doing BYOC, one that avoids many of the pitfalls that make scaling a BYOC fleet so difficult.

In this post, I will reflect on my last year of customer conversations, movements in the market, Confluent’s acquisition of WarpStream, and its embrace of BYOC as a third deployment model.

A Cost Analysis of Replication vs S3 Express One Zone in Transactional Data Systems

Is it economical to build fault-tolerant transactional data systems directly on S3 Express One Zone, instead of using replication? Read on for an analysis.

Cloud object storage is becoming the universal storage layer for a wealth of cloud data systems. Some systems use object stores as the only storage layer, accepting the higher latency of object storage, and these tend to be analytics systems that can accept multi-second latencies. Transactional systems want single-digit millisecond latencies or latencies in the low tens of milliseconds and therefore don’t write to object stores directly. Instead, they land data on a fast replicated write-ahead log (WAL) and offload data to an object store for read-optimized, economical long-term storage. Neon is a good example of this architecture. Writes hit a low-latency replicated write-ahead log based on Multi-Paxos and data is eventually written to object storage.
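
The "land on a replicated WAL, then offload" pattern can be sketched roughly as follows. This is a toy in-memory model, not Neon's design and not Multi-Paxos: the `TieredLog` class, its method names, and the `segment-...` key scheme are all hypothetical, replication is simulated by appending to several lists, and a dictionary stands in for the object store.

```python
from typing import Optional


class TieredLog:
    """Toy model of a replicated WAL with offload to object storage."""

    def __init__(self, replicas: int = 3) -> None:
        # Each replica's WAL is just an in-memory list in this sketch.
        self.wals: list[list[bytes]] = [[] for _ in range(replicas)]
        # Stand-in for an object store: key -> immutable batch of records.
        self.object_store: dict[str, list[bytes]] = {}
        # Number of records already moved to the object store.
        self.offloaded = 0

    def write(self, record: bytes) -> int:
        # Append to every replica. A real system would acknowledge the
        # write once a quorum of replicas has persisted it.
        for wal in self.wals:
            wal.append(record)
        # Return the logical offset of the record across both tiers.
        return self.offloaded + len(self.wals[0]) - 1

    def offload(self) -> Optional[str]:
        # Move everything currently in the WAL into a single object,
        # then truncate the WAL on all replicas.
        batch = list(self.wals[0])
        if not batch:
            return None
        key = f"segment-{self.offloaded:08d}"
        self.object_store[key] = batch
        self.offloaded += len(batch)
        for wal in self.wals:
            wal.clear()
        return key

    def read_all(self) -> list[bytes]:
        # Older records come from objects (in key order), the newest
        # records come from the WAL.
        older = [
            record
            for key in sorted(self.object_store)
            for record in self.object_store[key]
        ]
        return older + list(self.wals[0])
```

The point of the sketch is the split in responsibilities: the write path only touches the low-latency replicated log, while the object store receives larger batches off the critical path.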

Hybrid Transactional/Analytical Storage

Confluent has made two key feature announcements in the spring of 2024:

  • Freight Clusters, a new cluster type that writes directly to object storage. It is aimed at the “freight” of data streaming workloads (log ingestion, clickstreams, large-scale ETL and so on) that can be cost-prohibitive using a low-latency multi-AZ replication architecture in the cloud.

  • Tableflow, an automated feature that provides seamless materialization of Kafka topics as Apache Iceberg tables (and vice-versa in the future). 

This trend towards object storage is not just happening at Confluent but across the data ecosystem.

Tableflow: the stream/table, Kafka/Iceberg duality

Confluent just announced Tableflow, the seamless materialization of Apache Kafka topics as Apache Iceberg tables. This announcement has to be the most impactful announcement I’ve witnessed while at Confluent. This post is about why Iceberg tables aren’t just another destination to sync data to; they fundamentally change the world of streaming. It’s also about the macro trends that have led us to this point and why Iceberg (and the other table formats) are so important to the future of streaming.
