Scaling models and multi-tenant data systems - ASDS Chapter 6

What is scaling in large-scale multi-tenant data systems, and how does it compare to scaling single-tenant data systems? How does per-tenant scaling relate to system-wide scaling? How do scale-to-zero and cold starts come into play? Answering these questions is the focus of chapter 6 of The Architecture of Serverless Data Systems.

Rather than looking at a new system, this chapter focuses on some patterns that emerge after analyzing the first five systems. So far, I’ve included storage-oriented systems (Amazon DynamoDB, Confluent Cloud’s Kora engine), OLTP databases (Neon, CockroachDB serverless), and an OLAP database (ClickHouse Cloud). The workloads across these systems diverge a lot, so their architectural decisions also diverge, but some patterns emerge.

Multi-tenancy refers to co-locating the workloads of multiple tenants on shared resource pools, which leads to a better economic and scaling model. The larger the resource pools, the better the resource utilization and the more efficient the scaling.

There are multiple resource-sharing models available, and some systems combine multiple sharing models across their components. The serverless multi-tenant data systems surveyed so far all separate the monolith into multiple layers, such as proxies, compute and storage, where each layer can use a different sharing model. These sharing models can be broadly classified into:

  • Multi-tenant processes:

    • Shared processes. The same software process serves multiple tenants. Data and security isolation is logical.

    • V8 isolates. Multiple tenants can share the same V8 process but in separate lightweight contexts. See Cloudflare’s blog about how they use V8 isolates for workers.

  • Bin-packing single-tenant processes:

    • Containers. Running single-tenant containers and packing multiple containers per host. Typically this is via Kubernetes, where any given K8s node hosts the pods of numerous tenants.

    • Virtualization. Running single-tenant nodes in VMs (such as QEMU) or microVMs (such as Firecracker), packing multiple VMs per host. Kubernetes can even be used in conjunction with VMs via Kata containers or KubeVirt.

I will refer to these approaches broadly as “shared-processes” vs “bin-packing single-tenant processes”.

The following patterns have emerged from the first five chapters of The Architecture of (Multi-tenant) Serverless Data Systems:

  • Proxy layers tend to use the shared-process model.

  • Compute layers can use any sharing model. DynamoDB uses the shared-process model, whereas Neon, CockroachDB and ClickHouse Cloud use bin-packed single-tenant containers or VMs hosted on Kubernetes.

  • Storage layers use the shared-process sharing model.

What are the reasons for choosing a shared-process model or the bin-packing of single-tenant processes? How does this choice affect scaling?

Please note: I’ll be referring to topics already covered in the opening chapter and chapters 1-5, so if you missed those, I recommend going back and reading them. Secondly, the topic of scheduling is both fascinating and deep; this is not a treatise on scheduling in general (that is a large topic in its own right). This post is more concerned with identifying the common patterns among the concrete systems included so far in this series on serverless data systems.

Let’s start with compute layers.

Scaling compute

Neon, CockroachDB Serverless, and ClickHouse Cloud chose single-tenant bin-packed pods or VMs on Kubernetes, whereas DynamoDB chose the shared-process model. There were several driving forces behind these decisions.

The first driving force is SQL. Neon, CockroachDB, and ClickHouse are SQL systems where the amount of work required to execute a query can vary dramatically from one query to another. Multi-tenant systems aim to provide a service to a customer that appears single-tenant, but do so far more efficiently than a single-tenant system. This requires good isolation between workloads, but SQL makes isolation challenging as it is hard to predict how many resources any given query will require. Placing the SQL query processing portions of the system in pods or VMs controlled by Kubernetes allows the service provider to both 1) tightly control the amount of compute and memory utilized by any one tenant and 2) rapidly scale the pods/VMs up or out to account for load changes.

For example, both CockroachDB and ClickHouse Cloud can dynamically scale the number of compute pods according to load (within preconfigured bounds to avoid out-of-control costs). Adequately sizing the allocated compute resources for the workload in real time yields high resource utilization.

DynamoDB uses the shared-process model. It is not an SQL-based database; it is a KV store with far more predictable resource demands per request. Because the amount of work required to satisfy a request is relatively narrow and predictable, the shared-process model can be utilized in the compute layer without bespoke and complex admission control logic. Simpler rate limiting (via token buckets) and effective load balancing are enough.
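
To make the rate-limiting point concrete, here is a minimal sketch of a per-tenant token bucket of the kind such a shared-process layer might use for admission control. It is illustrative only; the class and parameter names are hypothetical, not DynamoDB’s actual implementation.

```python
import time

class TokenBucket:
    """Per-tenant token bucket: tokens refill at a steady rate and short bursts
    are allowed up to the bucket capacity. Illustrative sketch only."""

    def __init__(self, rate_per_sec: float, burst_capacity: float):
        self.rate = rate_per_sec          # sustained request units per second
        self.capacity = burst_capacity    # maximum tokens that can accumulate
        self.tokens = burst_capacity
        self.last_refill = time.monotonic()

    def try_consume(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should throttle or reject this request

# One bucket per tenant (or per table); the shared process admits a request
# only if the tenant still has tokens, otherwise it is throttled.
buckets = {"tenant-a": TokenBucket(rate_per_sec=1000, burst_capacity=3000)}
allowed = buckets["tenant-a"].try_consume(cost=5)  # e.g. a 5-unit read
```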

Scaling bin-packed single-tenant processes in the compute layer

Scaling with this model relies on provisioning enough single-tenant processes to handle the load, and doing so responsively as load changes. This can include scaling to zero after a period without requests. The more closely the scheduler can match compute resources to the workload in real time, the better the resource utilization.

Fig 1. Scaling single-tenant bin-packed pods to match tenant load changes.

There are a number of aspects to such scheduling logic:

  • Matching pods to per-tenant load.

    • As the load of a given tenant grows, the scheduler decides to add another pod for that tenant to take on a share of the load.

    • As the load of a given tenant shrinks, the scheduler decides to drain a pod by routing no new requests to it; once it has no running requests, it can be evicted. The decision to drain could be made based on average utilization, selecting the least loaded pod for draining. A more disruptive and less sophisticated version is simply to evict a pod (without a drain phase) based on some utilization measure. A less disruptive but also less sophisticated version is to wait for a pod to become idle before evicting it; however, scale-down may never occur if load is spread evenly across a number of pods such that all pods have low utilization. The specifics of the scheduling algorithm are highly dependent on the type of workload and the service level the customer expects. A sketch of this kind of scaling logic follows this list.

    • Oversubscribing tenants may not be needed if the scaling is responsive enough and each pod has good utilization of its allocated slice of resources. However, if it is hard to achieve high per-pod utilization, oversubscribing pods may be the only way of reaching high utilization of the hosts. ClickHouse Cloud noted that they deliberately avoided oversubscribing tenants and still achieved ~70% CPU utilization.

    • A number of empty pods can be kept ready per host for when a tenant needs a new pod allocated to it. The scheduling of a pod itself can be the most time-consuming aspect of bringing up a new compute node, so keeping a pre-warmed pool of pods ready reduces the time to get a new tenant pod serving requests.

    • The pods of a tenant can be scaled-to-zero when no requests are received for that tenant. More on scaling to zero below.

  • Matching hosts to aggregate load.

    • Using a most-allocated strategy when scheduling pods onto hosts. By preferring the most allocated host, as pods get evicted due to tenant load decreases, lightly loaded hosts naturally become idle and can be shut down without affecting live requests.

    • Waiting for a host to become idle may take too long. The CockroachDB proxy layer is able to migrate pods across hosts without breaking TCP connections. This allows a host that has dropped below a minimum load threshold to have its pods migrated away and then be shut down.

    • One or more hosts can be kept in reserve in case aggregate load exceeds the spare capacity of the current pool of hosts.
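
Below is a minimal sketch, in Python, of the kind of per-tenant reconciliation and most-allocated host placement described above. The thresholds, data structures and function names are hypothetical; real schedulers (Kubernetes plus custom operators) are far more involved.

```python
from dataclasses import dataclass, field

@dataclass
class Pod:
    tenant: str
    cpu_utilization: float       # 0.0 - 1.0
    draining: bool = False

@dataclass
class Host:
    capacity: int                # max pods this host can run
    pods: list = field(default_factory=list)

SCALE_UP_THRESHOLD = 0.8         # hypothetical thresholds
SCALE_DOWN_THRESHOLD = 0.3

def reconcile_tenant(pods: list[Pod]) -> str:
    """Decide whether a tenant needs another pod, should drain one, or is fine."""
    active = [p for p in pods if not p.draining]
    if not active:
        return "add-pod"         # e.g. resuming a scaled-to-zero tenant
    avg = sum(p.cpu_utilization for p in active) / len(active)
    if avg > SCALE_UP_THRESHOLD:
        return "add-pod"
    if avg < SCALE_DOWN_THRESHOLD and len(active) > 1:
        # Drain the least loaded pod: stop routing new requests to it and
        # evict it once its in-flight requests have completed.
        min(active, key=lambda p: p.cpu_utilization).draining = True
        return "drain-pod"
    return "no-change"

def pick_host(hosts: list[Host]) -> Host:
    """Most-allocated placement: prefer the fullest host that still has room,
    so that lightly loaded hosts can drain to empty and be shut down."""
    candidates = [h for h in hosts if len(h.pods) < h.capacity]
    return max(candidates, key=lambda h: len(h.pods) / h.capacity)
```

A real scheduler would also enforce the preconfigured per-tenant bounds on pod count mentioned earlier and take memory, disk and placement constraints into account.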

Scaling a tenant to zero is good for lowering costs but also introduces the cold-start problem. When a new request comes in, a pod (or a tenant VM) must be provisioned in the hot path, which inevitably leads to extra latency. The pre-warmed pools of empty pods/VMs mentioned above mitigate these cold starts by reducing the time required to allocate a pod to the scaled-to-zero tenant. CockroachDB refers to this process as stamping an empty pod with a tenant. Likewise, some spare capacity of underlying hosts will also be necessary so that blocking on starting a host (which is very slow) does not occur. The pre-warmed pool reduces the size of the latency spike but adds to costs. The larger these pools, the higher the costs but the lower the chance of a bad latency spike occurring.
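
As a rough illustration of the pre-warmed pool idea, here is a minimal sketch of “stamping” an idle pod with a tenant. The names and structure are hypothetical and are not how CockroachDB actually implements it.

```python
import queue

class PreWarmedPool:
    """A pool of empty, already-scheduled pods that can be 'stamped' with a
    tenant on demand, so a resuming tenant skips the slow pod-scheduling step."""

    def __init__(self, target_size: int, provision_fn):
        self.target_size = target_size
        self.provision_fn = provision_fn   # slow path: schedule a brand-new pod
        self.idle_pods = queue.Queue()

    def refill(self):
        # Intended to run in the background so the hot path rarely waits.
        while self.idle_pods.qsize() < self.target_size:
            self.idle_pods.put(self.provision_fn())

    def stamp(self, tenant_id: str) -> dict:
        try:
            pod = self.idle_pods.get_nowait()  # fast path: reuse a warm pod
        except queue.Empty:
            pod = self.provision_fn()          # cold path: pay the full latency
        pod["tenant"] = tenant_id
        return pod

pool = PreWarmedPool(target_size=4, provision_fn=lambda: {"tenant": None})
pool.refill()
pod = pool.stamp("tenant-42")  # resuming a scaled-to-zero tenant
```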

Cold starts can be avoided to a large extent by scaling a tenant to one rather than zero. Workloads of an intermittent nature may prefer scaling to one over scaling to zero to avoid unpredictable latency spikes. Much depends on the speed of provisioning a new compute process, the impact of a cold cache on performance and the customer’s tolerance regarding latency. This becomes a product decision.

Fig 2. CockroachDB auto-scaling with a pre-warmed pod pool. Tile 1: when scaling in, pods are first drained to avoid disruption. Tile 2: scaling out adds a new pod for the tenant (by stamping a pod from the pre-warmed pool). Tile 3: a tenant is scaled to zero after a period without requests. Tile 4: an inactive tenant is resumed by stamping a pod in the pre-warmed pool. The CockroachDB proxy layer can also move pods between hosts without terminating TCP connections.

Scaling shared-processes in the compute layer

With this model, the resource pool as a whole is sized according to aggregate load. As aggregate load increases, the pool may grow, and when aggregate load decreases, the pool may shrink. For completely stateless compute, where any node in the pool can process the requests of any tenant, there is no scale-to-zero, only aggregate scaling, good load distribution, and rate-limiting.

In such a system, the larger the resource pool, the better, as it allows for the instantaneous absorption of per-tenant load spikes. Imagine a system with 5 small servers that is designed to operate at 70% CPU and memory utilization. Such a system may have little buffer capacity to absorb a large load spike from one or two tenants, and relying on aggregate load scaling (based on provisioning more hardware) is slow, which would result in large latency spikes. Now imagine a system with 1000 servers at the same target resource utilization; 30% of 1000 servers is a lot of headroom and can absorb even large per-tenant load spikes instantly. This leaves the slower-to-respond aggregate-level scaling to ensure this buffer capacity remains adequate (like a PID controller trying to reach its target of 70% utilization). This is a simplistic model but holds true in broad strokes. All things being equal, if possible, this is the best place to be in terms of scaling a system.
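
A minimal sketch of that controller idea, using a simple proportional rule rather than a full PID controller; the numbers, bounds and function name are hypothetical.

```python
TARGET_UTILIZATION = 0.70   # the target from the example above

def desired_pool_size(current_servers: int, observed_utilization: float,
                      min_servers: int = 5, max_servers: int = 2000) -> int:
    """Proportional controller: resize the shared pool so that aggregate
    utilization returns to the target. Per-tenant spikes are absorbed by the
    existing headroom; this slower loop merely restores that headroom."""
    desired = round(current_servers * observed_utilization / TARGET_UTILIZATION)
    return max(min_servers, min(max_servers, desired))

# A pool of 1000 servers observed at 77% utilization would be grown to 1100,
# restoring roughly the 30% buffer used to absorb per-tenant spikes.
print(desired_pool_size(1000, 0.77))  # -> 1100
```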

More challenging compute workloads exist, such as long-running tasks with variable load over time, which can make load balancing more difficult. Examples include Neon, with its long-running Postgres instances, or Apache Flink jobs, which run indefinitely. More sophisticated techniques are necessary for these cases, such as being able to seamlessly move these tasks between compute nodes (as Neon does) or mixing long-running and short-duration tasks on the same hosts.

Scaling proxies

Proxies are similar to compute in my model except for two key differences:

  1. Typically there is low variance in the work required to serve each request; it is not a compute-intensive layer.

  2. In scale-to-zero compute layers, the proxy layer is responsible for triggering a pre-warmed pod/VM to be allocated to a scaled-to-zero tenant in order to satisfy a new request. This means there is no scale-to-zero for proxy layers.

For these reasons, proxies tend to use the shared-process model which scales according to aggregate load.

Scaling storage

All the systems surveyed so far, except for ClickHouse Cloud, have a dedicated replicated storage layer. The reason for this is the predictable low-latency demands of these systems:

  • DynamoDB places predictable single-digit millisecond latency as a main driving factor.

  • Kora (Confluent Cloud) uses its storage layer as a replicated, fault-tolerant, low-latency cache in front of object storage.

  • Neon uses storage as a replicated, fault-tolerant, low-latency write cache and for longer-term read-optimized storage (and also tiers to object storage).

  • CockroachDB places all data in a replicated storage layer with no object storage tiering because of its emphasis on predictable latency.

Note: S3 Express One Zone has arrived and does offer compelling latency, though at a much higher cost. The calculations made by my colleagues at Confluent demonstrate that replication over disks still offers the best price/performance. However, S3 standard does make sense when optimizing for costs above all else, given non-trivial throughputs. Low throughput workloads will not see much benefit. 

Scaling storage in terms of bytes on disk depends a lot on data size and whether storage tiering is used. As the amount of data grows, DynamoDB and CockroachDB (which don’t tier) continue splitting shards and distributing those shards over more and more storage servers. Kora, which does tier, keeps only a small suffix of each partition log on disk, so no splitting of partitions is necessary to cope with data size.
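
As a toy illustration of the non-tiered case, here is a minimal sketch of split-when-too-large shard logic; the threshold and the integer key ranges are hypothetical simplifications.

```python
MAX_SHARD_BYTES = 10 * 2**30  # hypothetical 10 GiB split threshold

def maybe_split(shard: dict) -> list[dict]:
    """Split a shard that has outgrown the threshold into two halves at the
    midpoint of its (integer, for simplicity) key range. The resulting shards
    can then be distributed across more storage servers."""
    if shard["bytes"] <= MAX_SHARD_BYTES:
        return [shard]
    mid = (shard["start_key"] + shard["end_key"]) // 2
    half = shard["bytes"] // 2
    return [
        {"start_key": shard["start_key"], "end_key": mid, "bytes": half},
        {"start_key": mid, "end_key": shard["end_key"], "bytes": shard["bytes"] - half},
    ]

# A 16 GiB shard covering keys [0, 1000) splits into two ~8 GiB halves.
print(maybe_split({"start_key": 0, "end_key": 1000, "bytes": 16 * 2**30}))
```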

All the above systems use the shared-process model for their storage layer. The reason for choosing a shared-process model is that the data must be replicated and stored either indefinitely (DynamoDB/CockroachDB) or potentially for a number of minutes to hours (Kora). Whereas compute can scale in and out quickly on per-tenant pods, the same is not true for stateful systems, which would require more costly data movement to do so.

If the storage layer used per-tenant bin-packed pods, then scaling would need to be performed per tenant, keeping at least three pods around (for replication) with appropriately sized storage volumes to cope with future data growth. The efficiencies of replication systems that can aggregate common work, such as failure detection and leader-follower keep-alives, would not be available. This is inefficient compared to a shared-process model.

A large storage node can host multiple tenants with advantages such as:

  1. Reduce network traffic (and CPU cycles) by consolidating common communication needs such as failure detection and keep-alives.

  2. Use shared disks as a common resource pool for the data of those tenants, gaining the same efficiency benefits from pooling resources.

  3. Reach higher resource utilization by co-locating tenants with uncorrelated (or weakly correlated) resource access patterns.

The cost of adding one tenant to such a storage layer is relatively small or even tiny. The key is good heat management and load control.

One important caveat to this disk-pooling point is storage tiering. Kora, for example, keeps only the most recent portion (the suffix) of each partition log on local disks and tiers the rest to object storage quite aggressively. Keeping only a small amount of data on disk until it reaches the safety of object storage simplifies many things and lowers costs. Smaller disks are required, and heat management applies only to the log suffix, which is tiny compared to the total size of the log.

Scaling to zero in storage layers

Scaling to zero in storage layers is an interesting topic. For a system such as CockroachDB, which currently has no storage tiering, there is no way of doing it beyond performing a backup and turning off the system. For storage systems that tier to object storage, such as Kora, a tenant could be fully offloaded to object storage and all local state of the tenant cleared from the storage nodes. When a request then came in for that tenant, the proxy layer would need to trigger the provisioning of a virtual cluster from object storage, which would likely be a costly operation in terms of latency.

The question is whether scaling a tenant to zero is worth it in a shared-process multi-tenant storage system that tiers to object storage. Given that a shared-process system is scaled according to aggregate load, the cost of a single idle tenant is zero in terms of CPU and memory (if properly designed for that). The disk storage cost is also small, as only a small log suffix remains on disk. Therefore the upside is relatively small, while the downside, the cold-start problem, is relatively large.

Per-tenant scaling in multi-tenant shared-process systems = rate limits + consumption-based billing

A multi-tenant system based on shared-processes (such as DynamoDB or Kora) can scale according to aggregate load by shaping traffic, performing heat management in the storage layer, rate limiting tenants, and scaling the size of the proxy, compute and storage pools as aggregate load changes.

Given the dynamic, self-healing, self-balancing nature of this type of architecture, per-tenant scaling becomes a matter of setting appropriate rate limits while ensuring the system as a whole is adequately sized for aggregate load. In a sense, for a given tenant there may not actually be any scaling at all. What we are really talking about here is:

  1. Rate limiting with either a fixed rate or a dynamic rate within some fixed range. A fixed rate may not be totally fixed, as bursts should be accommodated.

  2. Recording of usage for billing. (A sketch of both aspects follows below.)
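
As a minimal illustration only: a sketch of per-tenant usage metering combined with a dynamic rate limit that is raised before throttling kicks in. The class, thresholds and multipliers are hypothetical and not how DynamoDB or Kora implement this.

```python
from collections import defaultdict

class TenantMeter:
    """Records per-tenant consumption for billing and adjusts a dynamic rate
    limit within a fixed range, raising it before the tenant is throttled."""

    def __init__(self, floor: float, ceiling: float):
        self.limit = floor                 # current dynamic rate limit
        self.floor, self.ceiling = floor, ceiling
        self.usage = defaultdict(float)    # consumption per billing period

    def record(self, period: str, units: float):
        self.usage[period] += units        # billed later: pure pay-per-use

    def adjust(self, observed_rate: float):
        # Raise the limit before the tenant hits it; shrink back when idle.
        if observed_rate > 0.8 * self.limit:
            self.limit = min(self.ceiling, self.limit * 2)
        elif observed_rate < 0.2 * self.limit:
            self.limit = max(self.floor, self.limit / 2)

meter = TenantMeter(floor=1_000, ceiling=40_000)
meter.record("2024-01", units=350.0)
meter.adjust(observed_rate=900)  # approaching the limit, so the limit doubles
```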

Fig 3. There is no tenant auto-scaling in this picture, only consumption-based pricing with rate limits.

DynamoDB offers provisioned tables which correspond to fixed rate limits, and on-demand tables which correspond to dynamic rates. The aim of on-demand tables is to adjust these dynamic rates before throttling occurs.

With Kora, per-tenant scaling depends on the type of cluster. Basic and Standard clusters have a fixed upper rate limit on throughput, and the customer pays only for what they use of that capacity; if they send no traffic, then billing reflects that. As long as the customer doesn’t need more than 250 MB/s ingress and 750 MB/s egress (which covers the vast majority of customers), the scaling of a single tenant doesn’t even exist; it’s simply consumption-based pricing.

Conclusions

The nature of scaling is highly dependent on whether the particular component is stateless or stateful, as well as its sharing model (which in turn is dependent on the workload). Sometimes, per-tenant scaling is really just an illusion created by rate limits and consumption-based billing.

  1. Scaling stateless shared-process resource pool components (with request/response workloads) is based on aggregate load and is suited to workloads with a low variance of work done per request. The larger the resource pool, the greater the system's ability to instantly absorb per-tenant load spikes while maintaining high resource utilization. The slower aggregate load scaling adjusts capacity to maintain its ideal resource utilization. The general concept of scaling a tenant to zero does not apply to such a system, as a tenant that does not place load on the system causes no work to be done (beyond things such as keeping some metadata in memory).

  2. Scaling stateless bin-packed single-tenant pods may be better suited for unpredictable work such as SQL processing. High resource utilization is obtained by the responsive scaling of per-tenant pods to match the workload demands in real-time. Scaling to zero is possible, and the proxy layer will trigger a new pod to be allocated when a request arrives for a scaled-to-zero tenant. This introduces extra latency and is known as the cold-start problem.

  3. The shared-process resource pool model is the most efficient for storage layers. The same insights from (1) regarding resource pool size apply here but with the additional need to implement responsive heat management and load control. Storage tiering helps here a lot. Scaling a tenant to zero doesn’t really have much upside for shared-process storage layers with tiering, but does introduce the cold start problem. Replicated storage systems should be implemented such that idle tenants cause no additional network traffic or CPU cycles to occur.

How does all this compare to a single-tenant distributed data system? A single-tenant system cannot instantly absorb a load spike unless it is already provisioned adequately for that spike. In this model, you scale for peak even if your average is far lower, which results in poor resource utilization. Growing and shrinking the number of nodes (and underlying servers) is comparatively slow, making it suitable only for slow-moving load changes. Performing this growing and shrinking to track the workload (in order to obtain better resource utilization) can be disruptive to clients, causing additional high latency spikes. Compare this to a large multi-tenant system where the load of one tenant is small compared to the capacity of the system and of the individual nodes in that system. The highs and lows of the tenant can be accommodated without any physical scaling at all, leading to fewer latency issues while also achieving high resource utilization.

This is why I am bullish on multi-tenant serverless data systems compared to single-tenant dedicated, BYOC, and on-premise deployments. I’m not saying those don’t have a place, but I think multi-tenant serverless will be the predominant pattern of the (near) future.