BYOC, not “the future of cloud services” but a pillar of an everywhere platform

In 2023, I wrote a long post about why I don’t think the future of cloud data services is BYOC but large-scale multi-tenant SaaS. BYOC stands for Bring Your Own Cloud and is the practice of deploying a co-managed service into a customer VPC. It’s somewhere between self-hosted and fully-managed SaaS. In that post, I went into detail about the drawbacks of this deployment model from the perspective of both the vendor and the customer.

Since then, I’ve been involved in multiple calls with customers and prospective customers where BYOC has been a large discussion point. When we lost deals to BYOC competitors, the customers often had valid reasons for going that route. A year on, my position on BYOC hasn’t really changed, though I would clarify that it has been focused on the BYOC flavor where the vendor co-manages a complex, stateful single-tenant service. Confluent could have decided to package up Confluent Platform, its single-tenant self-hosted service, put it on Kubernetes with an operator and give it to customers as BYOC. But that wasn’t the right route for building out a BYOC offering at scale. Then Warpstream came along and showed another way of doing BYOC; one that avoids many of the pitfalls that make scaling a BYOC fleet so difficult.

In this post, I will reflect on my last year of customer conversations, movements in the market, Confluent’s acquisition of Warpstream, and its embrace of BYOC as a third deployment model.

What’s wrong with single-tenant BYOC?

My main argument against BYOC, from the perspective of a data infrastructure vendor, is that BYOC has almost exclusively been about running single-tenant distributed systems in the customer’s account. My two biggest criticisms of this are:

  1. Losing the structural benefits of large-scale MT: Single-tenant systems lose the massive structural benefits of large-scale multi-tenant systems in terms of scalability, stability, elasticity, and cost-effectiveness. Gone are the benefits of resource pooling, co-location of workloads that even out in the aggregate, and so on. It is an approach that digs in its heels in the on-premise world rather than embracing the opportunities offered by the cloud.

  2. Cost of support/evolution: Where a large-scale MT SaaS service operates in a pristine, controlled environment, BYOC involves hundreds or thousands (if the vendor is successful) of single-tenant clusters deployed across a heterogeneous set of environments where the vendor has only partial control. Not only is this an operational challenge, and therefore likely to result in more reliability problems than SaaS, but the economics of supporting these environments present a real problem for the vendor. For these reasons, traditional BYOC will likely result in a poorer quality of service for customers, not to mention the difficulty of evolving the service over time (and doing so profitably).

But there can also be valid reasons to want software to run in your own environment. You may have signed agreements with your customers that their data won’t leave your networks. Some customers just like the idea of keeping their data in their own account, and this desire overrides any other arguments. BYOC strikes a chord with some organizations, and in some cases they are legally obligated to either go the self-hosted route or choose BYOC. Despite its issues, BYOC is often perceived to be the better of two hard choices when a customer can’t choose SaaS but doesn’t want to be responsible for operating a complex service. In short, BYOC is a third deployment model that customers ask for, and it’s up to the data service vendors to figure out how to deliver it in a scalable way that meets customer expectations and allows the vendor to turn a profit.

For Confluent, part of its core message to customers is that it runs everywhere. Confluent runs on most of the clouds and also offers a self-hosted on-premise version of Kafka and Flink. Being everywhere is one of the most critical differentiators for a data infra vendor, and many customers choose a platform like Confluent because they want multi-cloud (and on-prem) or don’t want to be tied to a specific CSP’s service. I can’t stress this aspect enough.

Confluent aims to be the everywhere platform; its core value is in connecting systems, whether they be on-premise, in the cloud, or “the edge”. It’s about data sharing, not data gravity, and the market has shown that BYOC is a legitimate third pillar of the everywhere story.

With this in mind, Confluent just announced its acquisition of Warpstream, a Kafka-compatible service that uses cloud object storage as its storage layer and offers BYOC deployments to its customers. Confluent has not only acquired the team but has committed to Warpstream continuing its product development. Confluent will offer Warpstream as a BYOC deployment model to Confluent customers.

The question about this BYOC strategy is: how can BYOC’s operational challenges be addressed so that it can form a profitable and reliable third pillar of the everywhere story?

Making BYOC operationally viable

Before we look at making a cloud data service operationally viable as a fleet of BYOC deployments, I’d like to use AWS RDS as a case study. 

First…a BYOC case study of AWS RDS

AWS chose the large-scale disaggregated MT architecture for S3 and DynamoDB, but it originally chose a single-tenant architecture for the Relational Database Service (RDS). RDS started out single-tenant and deployable in your VPC - which perhaps makes it the first BYOC service in the cloud. Why did AWS choose to offer RDS as a single-tenant, VPC-deployable service but not do the same for DynamoDB or S3?

Before we can answer that question, we should consider what the cloud is and the privileged position of the cloud service provider.

For the CSP, the cloud is everything from the data center buildings and the physical hardware to the operating system software and the proprietary cloud software and hardware that operate the various services and enforce security. The rest of us just see the tip of the iceberg: a giant API with a logical model of virtual networks, virtual hardware, logical security constructs and higher-level building-block services. As Amazon puts it in its Shared Responsibility Model, AWS is responsible for security “of” the cloud whereas the customer is responsible for security “in” the cloud.

When you deploy to the cloud, you are “in” the cloud, or more specifically in the cloud frontend: a logical model where all interactions are with virtual endpoints, secured by a logical security model. The closest you get to seeing real hardware is when you SSH into an EC2 instance, but even then you are in a virtualized (usually multi-tenant) environment. AWS, on the other hand, has full access to and control of the entire backend, from the data center facilities to the software and hardware that bring the frontend logical model to life.

S3 came before RDS and was never offered as a single-tenant solution. It was cloud-native from the start, embracing this giant-API vision of the cloud. Customers didn’t need to think about servers, capacity planning, OS patching or any of the other myriad things that organizations have to do in on-premise environments. The service name says it all: Simple Storage Service (S3). But it wasn’t just simplicity: it was the ability to scale, a design target of 11 nines of durability, 99.99% availability, and all at a lower price than any other storage-related service.

Let’s examine why S3 went the multi-tenant route and RDS the single-tenant route.

RDS looks like BYOC:

  1. It appears to be deployed in your VPC.

  2. It deploys on VM instances whose names look like your regular EC2 instance types but are prefixed with db, such as db.t3.micro, which comes with 2 vCPUs, 1 GB of RAM and EBS storage.

  3. You can control access to it via security groups.

  4. You pay for the instances and EBS volumes.

However, when you look more closely, you see that while an RDS database has network interfaces in your VPC, you won’t see the VM instances or EBS volumes in your EC2 dashboard. You can’t even SSH into the instances. This is where the privileged position of the CSP comes in handy. The CSP controls the frontend (the API) and the backend (which you and I never get to see). What is a VPC? It is a virtual construct, and Amazon is free to deploy software behind it however it wishes and to expose it to the rest of us in whatever way suits it.

Why can’t customers access RDS instances, or see their instances and EBS volumes in EC2? Because Amazon doesn’t want you touching any of it. The control they give you is in the form of the VPC, security group and IAM logical constructs as well as the database wire protocol itself. The rest is hidden from view because operational efficiency and reliability demand it.
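To make that narrow control surface concrete, here is a minimal sketch of provisioning an RDS instance with boto3. Every lever is a logical construct (instance class, engine, subnet group, security group), and all identifiers below are placeholders rather than values from a real account.

```python
import boto3

# Minimal sketch: the only levers RDS exposes are logical constructs - instance
# class, engine, subnet group, security groups, backup retention. There is no
# host, OS or hypervisor to touch. All identifiers here are placeholders.
rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",             # a name, not a host you can SSH into
    DBInstanceClass="db.t3.micro",                # the "VM size" knob: 2 vCPUs, 1 GB RAM
    Engine="postgres",
    AllocatedStorage=20,                          # GB of EBS you pay for but never see in EC2
    MasterUsername="app_admin",
    MasterUserPassword="replace-me",              # placeholder
    DBSubnetGroupName="my-app-subnet-group",      # which of *your* subnets get the network interfaces
    VpcSecurityGroupIds=["sg-0123456789abcdef0"], # network access control
    BackupRetentionPeriod=7,                      # managed backups: another logical knob
)
```

Everything else - the OS, the patching schedule, the physical volumes - stays on AWS’s side of the shared responsibility line.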

The CSPs are known as hyperscalers for a reason: the number of customers, the number of servers, the amount of traffic and the bytes stored are all massive. Imagine, as a hyperscaler, if you deployed each RDS instance with open access for customers, who could configure the database as they wished, deploy their network monitoring software on the VMs, perform OS patches themselves, and indulge any other whim. Could you scale that deployment model, while still offering the same managed service features (such as backups, snapshots, OS and database patching etc) at the same level of reliability at the same price? Of course not; the support staff that would be required to diagnose each snowflake deployment alone makes this a non-starter.

That is why AWS offers RDS as a locked-down service that customers cannot look inside and over which they have limited operational control. Customers can defer certain maintenance operations and decide on the VM size, but they can’t apply any custom OS configuration and most definitely cannot deploy their own software on the instances.

The question remains: why offer RDS as a single-tenant, VPC-deployable system with highly restricted access at all? It likely comes down to multiple reasons:

  1. Customers have a lot of legacy applications they want to run in the cloud using the same database software as before - they just don’t want the hassle of managing these databases (backups, patching etc). Additionally, relational databases are still popular for building modern microservices.

  2. Building a massive scale, multi-tenant relational database system is hard and has taken many years of research and development. None of this was ready yet when RDS was born.

  3. CSPs want to attract customers and lift-and-shift is one way of migrating customers from on-prem to the cloud. The relational database was key to this.

  4. Customers are typically willing to spend significant amounts of money on their databases. The loss of operational efficiency can be offset by the higher costs that customers are willing to pay.

All the above reasons made RDS a no-brainer for AWS. But were they going to architect their other services the same way? The answer is no.

This is all a very long way of saying that AWS had a specific reason to build a BYOC data service, and they made it work reliably and profitably by locking down the environment and using their privileged position to make it appear as another service in a customer’s VPC. However, this is not an available option for ISVs. The data infra vendors play in the cloud frontend just like everyone else, and so if we’re going to make BYOC work, we need a different strategy.

BYOC data services, the sustainable way

We can’t lock down BYOC deployments the way AWS can, but there are still ways of achieving some of the benefits of large-scale SaaS while limiting the operational challenges of BYOC. This is where the Warpstream architecture comes in - gone is the shared-nothing storage architecture, and in its place is a set of stateless agents over shared storage.

Almost all traditional distributed data systems use a shared-nothing storage architecture: data is sharded across a set of servers, with each shard owned by a subset of those servers, usually under leader-follower replication. Examples include Apache Kafka, CockroachDB, ClickHouse and many more. This approach is efficient under normal conditions, as leader-based systems get ordering of operations cheaply and benefit from data locality. However, there are some serious drawbacks too, the most prominent being heat management and auto-scaling.

Heat management is the practice of coping with hot spots caused by load and data skews, by dynamically moving data between nodes (among other techniques). In Apache Kafka, this is known as rebalancing. Auto-scaling stateful systems is also harder than auto-scaling stateless ones. Some data systems can scale out better than others, but scaling in is almost always challenging, as data movement is required.
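To make the heat-management problem concrete, here is a deliberately simplified sketch in plain Python - my own illustration, not Kafka’s or any vendor’s actual balancing algorithm. It finds the hottest broker and plans partition moves toward the coolest one; the planning is trivial, but in a shared-nothing system every planned move implies copying that partition’s data before leadership can shift, and that data movement is the expensive part.

```python
from collections import defaultdict

# Hypothetical inputs: which broker leads each partition, and the observed
# write throughput (MB/s) per partition.
leader_of = {"orders-0": "broker-1", "orders-1": "broker-1", "orders-2": "broker-2",
             "clicks-0": "broker-1", "clicks-1": "broker-3"}
throughput = {"orders-0": 40, "orders-1": 35, "clicks-0": 30, "orders-2": 10, "clicks-1": 5}

def plan_moves(leader_of, throughput, max_moves=1):
    """Greedy sketch: move the busiest partition(s) off the hottest broker."""
    load = defaultdict(float)
    for partition, broker in leader_of.items():
        load[broker] += throughput[partition]

    hottest = max(load, key=load.get)
    coolest = min(load, key=load.get)

    # Partitions led by the hottest broker, busiest first.
    candidates = sorted((p for p, b in leader_of.items() if b == hottest),
                        key=lambda p: throughput[p], reverse=True)

    # Each planned move is cheap to compute but expensive to execute: the
    # destination must copy the partition's data before taking over.
    return [(p, hottest, coolest) for p in candidates[:max_moves]]

print(plan_moves(leader_of, throughput))
# e.g. [('orders-0', 'broker-1', 'broker-3')]
```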

If you surveyed the distributed data systems that incorporate storage, the vast majority would be based on a shared-nothing architecture. Of the five systems covered in my serverless data system series, four use a shared-nothing architecture, including cloud-native services such as Amazon DynamoDB. Building a storage service is complicated and requires a high level of operational excellence to do well.

In a BYOC environment, complexity is perhaps the biggest enemy of a profitable and reliable service. The more complex a system is, the more failure modes it has and the more difficult it is to support. Stateful systems that must dynamically cope with changing load, handle load skews and so on are the most operationally demanding systems to run. Put another way, they are perhaps the last kind of system you want to run as a fleet of BYOC deployments across hundreds of customer accounts, where you don’t have a pristine, controlled environment.

Warpstream took a different approach from shared-nothing, using stateless, leaderless agents that operate over shared object storage. Warpstream is a kind of protocol translation layer that presents the Kafka API to the outside world but speaks object storage on the backend. Offloading the complex storage piece alone reduces the complexity of the service by an order of magnitude. There is still a stateful distributed system in the architecture, but it is run by the CSP as a massive-scale multi-tenant object storage service. The CSP does the hard work of replication, heat management, and scaling, but we still reap the benefits of a large-scale MT system - greater stability under various load conditions, better elasticity, and lower costs.

A Warpstream deployment is a set of stateless compute nodes that operate a much simpler protocol: write events in batches to shared object storage and add those objects to a log via a metadata service. The metadata service is essentially a sequencer and an addressing scheme that maps Kafka topic offsets to locations in object storage. The metadata service, the most complex piece, is hosted in Warpstream’s pristine cloud account, leaving only the stateless agents running in the customer’s cloud account.
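A caricature of that write path helps show the division of labour. This is my sketch, not Warpstream’s actual protocol: the bucket name, the metadata-service endpoint and its payload are all hypothetical, and only the object-storage call is a real API.

```python
import json
import urllib.request
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "customer-streaming-bucket"                  # hypothetical bucket in the customer account
METADATA_URL = "https://metadata.example.com/commit"  # hypothetical vendor-hosted sequencer

def flush_batch(topic: str, records: list[bytes]) -> None:
    """Stateless-agent write path: persist a batch to object storage, then ask
    the metadata service to sequence it. No local disk, no replication protocol."""
    # 1. Write the batch as a single immutable object. Durability and replication
    #    are the object store's problem, not the agent's.
    key = f"{topic}/{uuid.uuid4()}.batch"
    s3.put_object(Bucket=BUCKET, Key=key, Body=b"".join(records))

    # 2. Commit the object's location to the metadata service, which acts as the
    #    sequencer: it appends the object to the topic's log and assigns the Kafka
    #    offsets that map back to this object.
    commit = json.dumps({"topic": topic, "object_key": key,
                         "record_count": len(records)}).encode()
    req = urllib.request.Request(METADATA_URL, data=commit,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)  # the response would carry the assigned base offset
```

The point is what is absent from the customer account: no local disks, no replication protocol, no partition leadership - exactly the pieces that make a single-tenant stateful BYOC fleet so hard to support.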

It’s not a silver bullet, but this lightweight stateless agent architecture avoids the main pitfalls of BYOC that I explained at the top:

  1. Regaining the structural benefits of large-scale MT: While the stateless agent layer is single-tenant, the storage layer is a massive-scale multi-tenant storage system. Scaling stateful services is relatively slow and costly, whereas stateless services auto-scale simply and at lower cost - the comparison isn’t even close. Stateless services still cannot instantly absorb load peaks the way a large-scale MT system can, but they can auto-scale far quicker than a single-tenant stateful system.

  2. Lower cost of support: The stateless agent layer is far simpler than a stateful one. We’re essentially running some proxies in the customer account, but storage and metadata are run in pristine, highly controlled environments. The number of failure modes for the pieces running in customer accounts is far lower, and the support burden is likewise far smaller.

The kicker is the higher latency that comes from using object storage, but as we often see in computer architecture, when we relax constraints, we gain degrees of freedom to implement simpler solutions. The higher latency will be a blocker for some, but I think this is the compromise to be made to make BYOC work at scale. In time, object storage will improve, and we’ll hopefully see the cost of the lower-latency tiers drop or the latency of the low-cost tiers improve. How SSD technology advances in terms of cost per GB is likely a core component of such a trend.

The Warpstream architecture is by no means limited to BYOC, but it is uniquely suited to making BYOC work, thanks to its lightweight, stateless agents operating inside the customer VPC - a BYOC-native design, if you will. It’s not the universal streaming architecture; it trades higher latency for lower cost and simpler operations. For lower-latency workloads, we rely on the low-latency durable caching architecture of Kora (or Confluent Platform if you’re self-hosted), which lands data on SSDs using replication but offloads data to an object store for long-term storage. It’s all part of the everywhere and complete strategy Confluent is consistently following.

To wrap up…

I’m a believer in serverless data services built as disaggregated large-scale multi-tenant systems. People make the mistake of thinking that because building a stateful disaggregated cloud service is hard and takes a high degree of operational excellence, it’s a liability. Quite the opposite! Yes, it’s hard to do, and it takes a large team and years of work, but that is a competitive moat in itself. Object storage is simply not ready to be the universal storage layer for all workloads; most transactional workloads must still be served by servers writing to SSDs that operate a replication protocol. Object storage can play a role, but it can’t serve as the only durable storage medium for all workloads. This is always subject to change, and every data infra vendor needs a flexible architecture that can take advantage of changing cloud primitives and pricing models.

The momentum behind serverless and multitenancy is clear. I believe that the majority of workloads will live there if they don’t already. Just look at how Databricks and AWS are steadily building out the serverless versions of their single-tenant products. Serverless is where the masses are, but on-premise and BYOC are also important pillars for any multi-cloud, multi-environment, everywhere service… if BYOC can be done right at scale.