The Architecture of Serverless Data Systems

Jump to the bottom to see the links to the chapters.

I recently blogged about why I believe the future of cloud data services is large-scale and multi-tenant, citing S3 among other examples. As I wrote in that post:

“Top tier SaaS services like S3 are able to deliver amazing simplicity, reliability, durability, scalability, and low price because their technologies are structurally oriented to deliver those things. Serving customers over large resource pools provides unparalleled efficiency and reliability at scale.”

To further explore this topic, I am surveying real-world serverless, multi-tenant data architectures to understand how different types of systems, such as OLTP databases, real-time OLAP, cloud data warehouses, event streaming systems, and more, implement serverless multi-tenancy (MT). The survey is inspired by the book series The Architecture of Open Source Applications that I read over a decade ago. What I loved about those books, reading them relatively early in my career, was seeing how other people were building software. My aim for this analysis is the same, but applied to cloud-based data systems that implement multi-tenancy and a serverless model.

From my research, some patterns immediately jump out, such as disaggregated architectures with separated storage and compute. But there are many different workloads, and what makes sense for compute-intensive BigQuery does not necessarily make sense for a storage API such as DynamoDB or Kafka, or an elastic OLTP Postgres workload. There is a surprising amount of diversity among the systems I have surveyed, which makes it a fascinating subject.

It is also clear that serverless MT systems are growing in number. Many are now well established, notably Google’s BigQuery and Spanner; Amazon’s DynamoDB and Aurora; and Azure’s CosmosDB, among others. The cloud service provider (CSP) offerings are probably the most mature, followed by those of public tech companies such as Snowflake (a pioneer among the non-CSPs), Cloudflare (R2), MongoDB (MongoDB Atlas), and Confluent with its Kora engine that powers Confluent Cloud. Then there are a number of start-ups bringing serverless data systems to market, such as Neon (serverless Postgres), Cockroach Labs (serverless CockroachDB), and ClickHouse (ClickHouse Cloud), among many others.

My analysis is based on a mix of academic/engineering papers, public blog posts, access to engineers via Slack channels (e.g., ClickHouse Cloud), and direct communication with engineers building these systems, in the case of Neon and CockroachDB (thanks!). As for Kora, I work at Confluent.

At the beginning of this analysis, I started with many questions.

  • What challenges do these systems face that are unique to their workload?

  • What common challenges do all these systems face? Do they all employ the same solutions?

  • Are these systems built from scratch to be cloud-native or can traditionally single-tenant software be modified to work in an elastic cloud-native fashion?

  • Why did these serverless MT systems get built at all? Do the motivations align with my conclusions about single-tenant systems vs large-scale multi-tenant systems?

Over the course of this analysis series, I’ll try to answer those questions.

In this post, I’ll cover what serverless MT is, as well as the generic challenges involved in building these systems. At the end of the post is the list of deep dives into specific systems.

Defining “serverless multi-tenant system”

Defining “multi-tenancy”

Multi-tenancy is ultimately about resource sharing: co-locating workloads on shared hardware. For those of us who operate “in” the cloud, this means that we build our systems such that multiple tenants are served from shared compute instances (like Amazon EC2 or Google Compute Engine) or shared PaaS services like cloud object storage. For the CSPs, it can go deeper, as some services are built directly on physical drives and servers while others are built on top of the same abstractions that CSP customers use.

For this analysis, I’ll define multi-tenancy as “Servicing multiple tenants from shared resources such as (virtualized) servers, drives, and even PaaS building block services such as object storage and queues”.

There are multiple resource-sharing models available and some systems combine multiple sharing models across their components. These sharing models include:

  • Shared processes. The same software process serves multiple tenants. Data and security isolation is logical, enforced in software.

  • Containers. Running single-tenant nodes and packing multiple containers per host. Typically this is via Kubernetes, where any given K8s node hosts the pods of numerous tenants.

  • Virtualization. Running single-tenant nodes in VMs (such as QEMU) or microVMs (such as Firecracker), packing multiple VMs per host. Kubernetes can even be used in conjunction with VMs via Kata containers or KubeVirt.

There are also V8 isolates, where tenants share the same V8 process in separate lightweight contexts, though I haven’t yet seen this approach in data systems.

Defining “serverless”

Customers do not select server types or hardware. Instead, these systems depend on a certain amount of elasticity and mobility to ensure that the demand of any workload is handled without the customer needing to size hardware explicitly. Elasticity refers to the ability of the service to scale up/out and down/in according to the workload's needs. Mobility refers to the ability of the service to move and balance the workload internally to satisfy performance and reliability requirements.

The serverless model uses consumption-based pricing, which is becoming increasingly important to customers. Many customers don’t want to commit to big contracts up front and prefer to simply be billed for what they use (possibly with some commitment coming later on to obtain greater discounts). There are many variants of consumption-based pricing, which depend a lot on the workload and the underlying system implementation:

  • Paying per (million) operation(s).

  • Paying for the CPU and memory consumption of the workload.

  • Paying per GB of storage.

  • Paying for virtual units of performance/capacity that correlate to resource and operation rates (DynamoDB’s read/write capacity units, RCUs/WCUs, for example).

  • Hybrid models where the customer pays for some baseline capacity and pays for consumption above that (known as “Own the base, pay for peak”); see the worked example after this list.
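
To make the hybrid model concrete, here is a minimal sketch of how such a bill could be computed. All prices and capacity units below are invented for illustration and are not taken from any real vendor.

    # Hypothetical "own the base, pay for peak" bill calculation.
    # All prices and capacity units are invented for illustration.
    BASE_CAPACITY_UNITS = 100       # capacity the customer commits to
    BASE_PRICE_PER_UNIT = 0.50      # discounted rate for the committed base
    OVERAGE_PRICE_PER_UNIT = 0.80   # on-demand rate for consumption above base

    def monthly_bill(consumed_units: float) -> float:
        """The base is paid in full whether used or not; peaks are billed on demand."""
        overage = max(0.0, consumed_units - BASE_CAPACITY_UNITS)
        return (BASE_CAPACITY_UNITS * BASE_PRICE_PER_UNIT
                + overage * OVERAGE_PRICE_PER_UNIT)

    print(monthly_bill(80))   # under the base: $50.00 (base still paid in full)
    print(monthly_bill(150))  # 50 units of peak: $50.00 + $40.00 = $90.00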

The common challenges

Working within the constraints imposed by the workload

There are many constraints imposed by the workload of the given data system that are significant drivers of the underlying architecture:

  • Latency/Availability requirements.

  • Consistency requirements.

  • Correlation/dependencies between requests and data.

  • Sequential vs random access patterns.

  • Variability of work done per request.

  • Data size.

  • Session vs request-oriented protocols and push vs pull mechanics.

  • Compute intensity of the work.

Looser latency and consistency requirements give engineers more degrees of freedom. Leveraging the low-cost and high-durability benefits of cloud object storage is a great example of this, as low-latency systems are constrained in how they introduce high-latency components. Eventually consistent systems can avoid this dilemma by writing data asynchronously to object storage instead of including it in the synchronous data hot path. Low-latency, strongly consistent systems get no such get-out-of-jail-free card.

When combined with other constraints (such as low latency), the spatial and temporal locality of workloads can drive architectural choices. For example, a workload characterized by sequential scans benefits from keeping contiguous data ranges together for fast, efficient scanning on disk. While subdividing these ranges into smaller sub-ranges aids hotspot management, the two constitute competing concerns, and a balance between them must be found. More random access patterns, with little correlation between individual requests, benefit from a flat address space that can be evenly and thinly spread over a fleet of servers.
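
To illustrate the trade-off, here is a toy comparison of how many shards a sequential scan touches under range partitioning versus hash partitioning. The shard count and key space are invented for illustration.

    # Toy comparison: shards touched by a sequential scan under range
    # partitioning vs hash partitioning (counts invented for illustration).
    NUM_SHARDS = 8
    KEY_SPACE = 1000

    def range_shard(key: int) -> int:
        # Contiguous key ranges live together: keys 0-124 on shard 0, etc.
        return key // (KEY_SPACE // NUM_SHARDS)

    def hash_shard(key: int) -> int:
        # A flat address space: keys spread thinly and evenly over the fleet.
        return hash(key) % NUM_SHARDS

    scan = range(0, 100)  # a sequential scan over 100 contiguous keys
    print(len({range_shard(k) for k in scan}))  # 1 shard: good locality
    print(len({hash_shard(k) for k in scan}))   # all 8 shards: heat spread, locality lost

Range partitioning preserves scan locality while hashing spreads the heat; dynamically splitting hot ranges is one way systems try to get some of both.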

Session-oriented protocols, which establish persistent connections, are typically more difficult to handle than request-oriented protocols, where each request is independent of the last. Persistent connections may require connection pooling, and perturbations such as rolling nodes and rebalancing data can have an externally visible impact on clients.

Some systems are just storage APIs, such as object storage and the Kafka API, while others are compute-intensive, such as SQL databases. This leads to the topic of the predictability and variability of the amount of work required to service each request. At one extreme is a data streaming API such as Kafka, which must simply retrieve a contiguous block of records. At the other end of the spectrum is SQL, where the work required can vary hugely from one query to another.

As Marc Brooker put it:

“The declarative nature of SQL is a major strength, but also a common source of operational problems. This is because SQL obscures one of the most important practical questions about running a program: how much work are we asking the computer to do?”

In turn, one of my colleagues, Mahesh Balakrishnan, remarked:

“To misuse some terminology from math, SQL is an “ill-conditioned” API: small changes in input can trigger very different amounts of work. The opposite would be block storage, which is “well-conditioned”. Another famous example of an ill-conditioned abstraction is IP Multicast.”

Tenant isolation

While resource sharing is great for hardware utilization, it can introduce resource contention between tenants, where the demands of one tenant's workload interfere with another's. Security isolation is the other key driver.

Just as the Serializable transaction isolation guarantee ensures that concurrent transactions are executed such that they appear to be serial, multi-tenant systems need to ensure that concurrent tenants that are served from shared hardware resources appear to be served from their own dedicated services.
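
One common building block for performance isolation, not drawn from any specific system surveyed here, is a per-tenant token bucket that caps how much shared capacity a single tenant can consume in a burst. A minimal sketch:

    import time

    class TokenBucket:
        """Per-tenant rate limiter sketch; rates and burst sizes are illustrative."""
        def __init__(self, rate: float, burst: float):
            self.rate = rate        # tokens replenished per second
            self.capacity = burst   # maximum burst size
            self.tokens = burst
            self.last = time.monotonic()

        def try_acquire(self, cost: float = 1.0) -> bool:
            now = time.monotonic()
            # Replenish tokens for the elapsed time, capped at the burst size.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False  # throttle this tenant; protect the shared pool

    buckets = {"tenant-a": TokenBucket(rate=100, burst=200)}
    if not buckets["tenant-a"].try_acquire(cost=5):
        print("tenant-a throttled")  # e.g., reject or queue the request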

Separation of storage and compute

The separation of storage and compute is a core design principle that all the systems I have surveyed so far have implemented to some degree. This separation seems to be fundamental to MT designs for several reasons that I’ll explore in this analysis.

Hardware trends are making this architecture more and more viable. Hardware performance continues to advance, but as Roland Dreier recently blogged: “Everything got faster, but the relative ratios also completely flipped.” He was referring to the advances in network throughput compared to memory and storage drive throughput.

The separation of storage and compute is increasingly practical, partly because the network no longer presents the bottleneck it used to.

However, while network throughput keeps increasing, this separation of concerns still presents new challenges, with cloud object storage chief among them.

Cloud object storage is still relatively high latency, and while durable and cheap, it can be hard to introduce into workloads that are typically low latency, such as OLTP databases. The economic model of cloud object storage also punishes designs that rely on many tiny objects, further complicating life for low-latency systems, which must accumulate data into larger objects written with fewer requests.
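
Some back-of-the-envelope arithmetic shows why tiny objects are punished. The prices below are illustrative placeholders roughly in the ballpark of public S3 list prices, not exact figures.

    # Illustrative request-cost arithmetic for writing 1 TB/day to object
    # storage. The PUT price is a placeholder, not an exact vendor figure.
    PUT_PRICE = 0.005 / 1000            # dollars per PUT request
    BYTES_PER_DAY = 1_000_000_000_000   # 1 TB/day of ingest

    def daily_put_cost(object_size_bytes: int) -> float:
        puts = BYTES_PER_DAY / object_size_bytes
        return puts * PUT_PRICE

    print(daily_put_cost(16 * 1024))         # 16 KiB objects: ~$305/day in PUTs alone
    print(daily_put_cost(64 * 1024 * 1024))  # 64 MiB objects: ~$0.07/day

A few thousand times fewer requests makes the request bill vanish, which is why accumulating data into larger objects matters so much.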

Engineers can choose to include object storage in their low-latency system, countering its latency issues by placing a durable, fault-tolerant write cache and a predictive read cache in front of the slower object storage. This durable write cache is essentially a cluster of servers that implements a replication protocol and writes data to block storage. In the background, the cluster asynchronously uploads data to object storage, obeying the economic pattern of writing fewer, larger files.

Low-latency writes are served well by the fault-tolerant write cache; it is the read cache that can present the challenge in this architecture. Sequential workloads such as event streaming are straightforward to prefetch effectively; as long as aggregate prefetching keeps up with demand, reads should always hit the local read cache. Databases have a harder time of it due to more random access patterns, which are harder to predict, though table scans still benefit from readahead operations.

Fig 3. Choices for integrating (or not) cheap, durable cloud object storage.

However, implementing a distributed, fault-tolerant write cache with a replication protocol is non-trivial and can incur other costs, such as cross-AZ charges in multi-AZ environments. But today, there is simply no alternative for low-latency systems that want cheap, durable object storage as the primary data store.
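
To make the batching side concrete, here is a minimal sketch of the background uploader in such a write cache, with replication elided. The object_store client interface, names, and thresholds are all assumptions for illustration.

    # Sketch of a write cache's background uploader: writes are acknowledged
    # once replicated (elided here), then batched into fewer, larger objects.
    # Thresholds and the object_store interface are invented for illustration.
    import io
    import time

    FLUSH_BYTES = 64 * 1024 * 1024   # accumulate up to 64 MiB per object
    FLUSH_SECONDS = 5.0              # or flush at least every few seconds

    class WriteCache:
        def __init__(self, object_store):
            self.object_store = object_store  # assumed to expose put(key, data)
            self.buffer = io.BytesIO()
            self.last_flush = time.monotonic()

        def append(self, record: bytes) -> None:
            # A real system would replicate the record to peers and persist it
            # to local block storage before acknowledging the client.
            self.buffer.write(record)
            if (self.buffer.tell() >= FLUSH_BYTES
                    or time.monotonic() - self.last_flush >= FLUSH_SECONDS):
                self.flush()

        def flush(self) -> None:
            key = f"segment-{time.time_ns()}"
            self.object_store.put(key, self.buffer.getvalue())  # one large PUT
            self.buffer = io.BytesIO()
            self.last_flush = time.monotonic()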

Other low-latency systems must eschew the use of cloud object storage altogether, favoring predictable low latency above all else. Cloud storage, while prevalent, is not universal due to the latency trade-off involved.

Heat management

Heat management refers to balancing load as evenly as possible over a fleet of storage nodes to avoid hotspots that can cause externally visible performance issues, such as latency spikes or drops in operations per second. We could also call this load balancing, but that term usually evokes load balancers in front of stateless nodes. In a stateful system, hotspots can develop where a particular storage node experiences contention due to an unfortunate grouping of high-demand objects. Whereas load balancers can spread load evenly across a set of stateless nodes with simple strategies such as random, least connections, or some FIFO variant, stateful systems must route requests to the nodes where the data resides.

Moving data to redistribute load is often referred to as rebalancing. To further complicate things, load distribution can change over time. Data distribution becomes a dynamic process that must handle everything from short-lived peaks affecting a small subset of data to more significant load changes caused by some diurnal pattern or seasonal event that manifests across multiple tenants.

Large datasets, such as big databases or high-throughput event streams, must be sharded in order to spread the load effectively over the fleet. Rebalancing becomes the rebalancing of shards, and the system may also be able to split and merge shards as the load distribution changes. However, there are competing concerns regarding shard counts and sizes, such as data locality. On the one hand, the more co-located the data is, the more efficient it is to retrieve. On the other hand, the cost of compute tasks having to fetch from too many shards can outweigh the benefits of spreading the load over more servers.
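
As a toy illustration of such a feedback loop, here is a sketch that moves the hottest shard to the coolest node, splitting it first if it is too hot to move whole. The load metric, thresholds, and placement policy are invented; real systems weigh many more signals.

    # Toy heat-management step (all thresholds invented for illustration).
    from dataclasses import dataclass

    SPLIT_THRESHOLD = 0.6  # a single shard this hot is split before moving

    @dataclass
    class Shard:
        id: str
        load: float  # e.g., normalized ops/sec over a recent window

    def rebalance_step(placement: dict[str, list[Shard]]) -> None:
        nodes = sorted(placement, key=lambda n: sum(s.load for s in placement[n]))
        coolest, hottest = nodes[0], nodes[-1]
        shard = max(placement[hottest], key=lambda s: s.load)
        placement[hottest].remove(shard)
        if shard.load > SPLIT_THRESHOLD:
            # Split the hot shard so its load can be spread over two nodes.
            placement[hottest].append(Shard(shard.id + "-a", shard.load / 2))
            placement[coolest].append(Shard(shard.id + "-b", shard.load / 2))
        else:
            placement[coolest].append(shard)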

Heat management can also be necessary in single-tenant systems, so it isn’t a problem unique to multi-tenancy. However, good heat management becomes even more critical in an MT data system to prevent tenants from experiencing quality of service fluctuations.

Obtaining high resource utilization

One of the primary motivations for implementing a serverless multi-tenant architecture is to provide better economic performance by using underlying hardware resources more efficiently. High resource utilization through resource pooling is the name of the game, but doing so with solid tenant isolation and predictable performance is the challenge.

Cold starts

Serverless systems that scale resources to zero on a per-tenant basis can face the challenge of cold starts when a tenant resumes their workload. These cold starts have been a focus of serverless functions from the beginning, and they can also affect some serverless data systems.

Some systems do not suffer cold starts at all, while for others, cold starts are an intractable, inescapable consequence of their architecture and scale-to-zero product offerings. In all cases I have seen, how aggressively to scale down is a product decision, and different plans and pricing tiers may involve different levels of scaling down resources. Ultimately, customers and vendors can choose the trade-offs that suit their needs.

The surveyed systems

Group 1 - Storage APIs (compute-light)

Group 2 - SQL OLTP databases (compute-heavy)

Group 3 - SQL OLAP databases and data warehouses (compute-heavy)

Group 4 - let’s see what happens! I’m open to suggestions.

I imagine this will be an ongoing thing, but some kind of “conclusions” post (sometime in 2024) will be interesting, where we take stock of the various architectures, look at the common patterns and approaches as well as the differences, and search for generalizations and lessons for others starting down the serverless multi-tenant data system path.

I have already learned a lot by seeing what a diverse set of talented engineers has built; I hope you learn something too.

I’m open to contributions, so if you know of a good system to include or you see errors, you can contact me at vanlightly@gmail.com. This will be a curated set; I’m not going to double up on similar systems, as comparison and contrast are often where the insights are found.

Chapter list