Serverless CockroachDB - ASDS Chapter 4 (part 2)

In part 1 we covered the basics of the CockroachDB (CRDB) single-tenant architecture and the high-level changes for building the multi-tenant serverless architecture. In parts 2 and 3, I’ll start focusing more narrowly on the tenant isolation and scaling mechanisms in multi-tenant serverless CRDB.


Multi-tenancy

Large-scale multi-tenant systems increase resource utilization via resource pooling which drives down operating costs significantly. The key is to ensure that co-located workloads benefit from the economies of scale and the ability of the large system to absorb per-tenant load spikes, without introducing a lot of tenant interference that results in a poor quality of service. In this section, I’ll dive deeper into what Serverless CRDB does to isolate tenants from each other and scale according to workload demands.

Serverless CRDB uses a mix of resource-sharing models across its components:

  • Shared processes with logical isolation in the proxy and storage layers.

  • Container isolation in the compute-intensive SQL layer.

No matter the resource-sharing model, CRDB achieves high resource utilization by oversubscribing the number of tenants (as do all such systems). It employs a number of techniques to oversubscribe tenants while providing strong tenant isolation. These can be categorized as follows:

  1. Rate limiting.

  2. Heat management (part 3).

  3. Auto-scaling (part 3).

Rate limiting (and Admission control)

In the SQL layer, tenants are isolated from one another by Kubernetes pod resource limits. No rate limiting needs to be applied at this layer, as the underlying cgroups act as the limiter, capping CPU, memory, and network bandwidth. However, the storage layer is shared, so rate limiting is required to isolate tenants from each other.

CRDB has a sophisticated mechanism for rate-limiting called Admission Control (AC) which gates work in multiple layers. However, as stated above, explicit rate-limiting isn’t important in the SQL layer with the serverless CRDB model, but it remains very important in the shared storage layer. AC is complex, so I will try to simplify my coverage of it here, focusing on the shared storage, while still extracting the core insights of its design. 

The main priorities of admission control are:

  • Ensuring that the write load on the storage engine (Pebble) is at a sustainable level such that the compaction and garbage collection processes can keep up.

  • Ensuring a fair allocation of CPU across requests and across tenants.

  • Avoiding priority inversion issues, where the locking sub-system interacts with tasks of different priorities such that lower-priority tasks end up blocking higher-priority tasks.

AC consists of a set of admission queues to which work is submitted and from which it is dequeued according to priority and a fair balance of work across tenants. The KV-CPU admission queue rate limits based on slots that correlate to a multiple of the core count, whilst the remaining admission queues rate limit based on token buckets.

Admission queues

Admission queues are priority queues that either gate some specific work, like writing to the Pebble KV store, or prevent CPU utilization from getting so high that it turns into saturation.

The KV-CPU admission queue is a slot-based queue focused on limiting CPU usage. When a work item starts, it is assigned a free slot, and when it completes, the slot becomes free again. Work items only get dequeued when there are free slots available. The number of slots can change dynamically based on conditions, but it loosely correlates to a multiple of the core count.

Fig 1. Admission queues can be based on slots or tokens.
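
To make the slot mechanism more concrete, here is a minimal Go sketch of a slot-based gate (my own illustration with assumed names, not CRDB's actual implementation): a work item acquires a slot before running, releases it on completion, and the slot count can be resized on the fly.

```go
package admission

import "sync"

// SlotGate is a minimal sketch of a slot-based gate, loosely modelled on
// the idea behind the KV-CPU admission queue (not CRDB's implementation).
type SlotGate struct {
	mu       sync.Mutex
	cond     *sync.Cond
	capacity int // total slots; loosely a multiple of the core count
	used     int
}

func NewSlotGate(capacity int) *SlotGate {
	g := &SlotGate{capacity: capacity}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// Acquire blocks until a free slot is available, then claims it.
func (g *SlotGate) Acquire() {
	g.mu.Lock()
	defer g.mu.Unlock()
	for g.used >= g.capacity {
		g.cond.Wait()
	}
	g.used++
}

// Release frees the slot when the work item completes.
func (g *SlotGate) Release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.used--
	g.cond.Signal()
}

// Resize dynamically adjusts the slot count, e.g. when too much admitted
// work is blocking on IO or locks.
func (g *SlotGate) Resize(capacity int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.capacity = capacity
	g.cond.Broadcast()
}
```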

Each Pebble KV Store has its own admission queue, and items are dequeued based on a token bucket design.

Fig 2. Writes must pass through a token-based queue followed by a slot-based queue.

For a write to get executed, it must pass first through the KV Store queue and then the KV-CPU queue. This ordering ensures that a work item does not take up a CPU slot only to then get blocked on IO.

The KV-CPU admission queue buffers goroutines waiting to be executed. When a goroutine completes, another can be started. This makes the KV-CPU admission queue a good candidate for slots that loosely correspond to cores. Due to variable amounts of IO and blocking on locks, the number of slots must be flexible. 

However, slots make a bad proxy for LSM engine work because a single write does relatively little work synchronously and triggers a set of asynchronous activities after the fact. The synchronous portion of a write may have been completed, but the majority of the total work is yet to come. Additionally, the follow-on background work can be delayed to a certain extent, until foreground reads and writes drop to a lower level or the efficiency of the LSM data structure becomes too low. For this reason, the KV Store admission queue uses the token bucket approach, as it allows for some fluctuations in the short term but controls aggregate work over the longer term. We’ll dive deeper into this aspect in the next section. For now, we’ll continue to look deeper into the admission queue itself.

Admission queues are implemented as heaps. Heaps are often chosen for priority queues because they are very efficient for that use case; AC typically uses only about 1% of CPU. In order to balance work across tenants, an admission queue uses one top-level heap and multiple per-tenant heaps.

The top-level heap consists of [tenant -> requested slots/tokens] pairs ordered by the total number of slots or tokens requested by the tenant. The per-tenant heaps consist of the actual work items to be executed, ordered by [priority, transaction start time] precedence.

The process of dequeuing the next work item goes like this:

  1. Dequeue from the top-level heap, which returns the tenant id with the most requested slots/tokens.

  2. Go to the heap of that tenant and dequeue the work item with the highest [priority, transaction start time] precedence.

  3. Re-enqueue the tenant in the top-level heap with a deducted value for requested slots/tokens.

Fig 3. An admission queue is implemented as a set of heaps.
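
The following Go sketch illustrates that three-step dequeue over a top-level tenant heap and per-tenant work heaps. It is a simplified illustration; the types (WorkItem, tenantEntry) and their fields are assumptions for this sketch, not the actual CRDB admission code.

```go
package admission

import "container/heap"

type WorkItem struct {
	Priority int   // higher value = higher priority
	TxnStart int64 // transaction start time, used for FIFO tie-breaking
	Tokens   int64 // slots/tokens this work item needs
}

// tenantWorkHeap orders one tenant's work by [priority, transaction start time].
type tenantWorkHeap []WorkItem

func (h tenantWorkHeap) Len() int      { return len(h) }
func (h tenantWorkHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h tenantWorkHeap) Less(i, j int) bool {
	if h[i].Priority != h[j].Priority {
		return h[i].Priority > h[j].Priority
	}
	return h[i].TxnStart < h[j].TxnStart // FIFO within a priority level
}
func (h *tenantWorkHeap) Push(x any) { *h = append(*h, x.(WorkItem)) }
func (h *tenantWorkHeap) Pop() any {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

type tenantEntry struct {
	id        int
	requested int64 // total slots/tokens requested by this tenant
	work      *tenantWorkHeap
}

// tenantHeap is the top-level heap, ordered by requested slots/tokens.
type tenantHeap []*tenantEntry

func (h tenantHeap) Len() int           { return len(h) }
func (h tenantHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h tenantHeap) Less(i, j int) bool { return h[i].requested > h[j].requested }
func (h *tenantHeap) Push(x any)        { *h = append(*h, x.(*tenantEntry)) }
func (h *tenantHeap) Pop() any {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

// DequeueNext performs the three-step dequeue described above.
func DequeueNext(tenants *tenantHeap) (WorkItem, bool) {
	if tenants.Len() == 0 {
		return WorkItem{}, false
	}
	t := heap.Pop(tenants).(*tenantEntry) // 1. tenant with most requested slots/tokens
	item := heap.Pop(t.work).(WorkItem)   // 2. that tenant's highest-precedence work item
	t.requested -= item.Tokens            // 3. re-enqueue with a deducted value
	if t.work.Len() > 0 {
		heap.Push(tenants, t)
	}
	return item, true
}
```

Popping and re-pushing the tenant entry keeps every step at O(log n), so the per-dequeue cost stays small even with many tenants and deep queues.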

For any given tenant, higher-priority work items are executed before lower-priority ones, and for work items of the same priority, items are chosen using FIFO (based on transaction start time). Under heavy load, this FIFO approach can reduce effective throughput dramatically as all queued items end up exceeding the query deadline by the time they are dequeued.

Therefore, under heavy load, FIFO is swapped for LIFO to prioritize recent items, while older ones expire. However, this still causes recent operations to expire if the incoming rate of work is too high. If every request were independent, this would not be a problem, but the storage layer may be receiving multiple KV operations for the same SQL transaction. The result can be that most transactions end up with one or more KV operations that time out. To avoid this issue, CRDB implements an Epoch-LIFO approach where operations are grouped into blocks of 100 ms (or epochs) and the blocks are executed in LIFO order. This tends to result in transactions either having all their requests serviced or all of them timing out, rather than all transactions suffering a scattering of timed-out operations.
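
As a rough sketch of the idea (illustrative only, with assumed types and field names), an Epoch-LIFO ordering function might look something like this:

```go
package admission

import "time"

const epochLength = 100 * time.Millisecond

// queuedWork is an assumed minimal work descriptor for this sketch.
type queuedWork struct {
	priority int
	txnStart time.Time
}

// epochOf buckets a transaction start time into a 100 ms epoch.
func epochOf(t time.Time) int64 { return t.UnixNano() / int64(epochLength) }

// lessEpochLIFO reports whether a should be dequeued before b:
// newer epochs first (LIFO), then priority, then FIFO within an epoch.
func lessEpochLIFO(a, b queuedWork) bool {
	ea, eb := epochOf(a.txnStart), epochOf(b.txnStart)
	if ea != eb {
		return ea > eb // LIFO across 100 ms epochs
	}
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.txnStart.Before(b.txnStart) // FIFO within an epoch
}
```

Because all KV operations of a transaction tend to fall into the same epoch, the whole transaction is either served or expired together.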

See this talk (15 minutes in) for more detail on switching between FIFO and LIFO based on load, and how it can be combined with Controlled Delay in queue-based algorithms.

Another important thing to note is that work item priorities can get bumped under certain conditions, such as being part of an ongoing transaction (to ensure the transaction, which may be holding locks, doesn’t get blocked). Likewise, transaction start times can be advanced as a result of the consistency protocol, all of which can affect the admission queues.

Let’s now look at LSM trees to understand why the KV Store admission queue uses a token bucket implementation.

A quick guide to LSM trees

The Pebble KV store is an LSM tree storage engine. The LSM tree is optimized for write-intensive workloads by sequentially writing incoming data to a write-ahead log and later compacting and merging data in the background.

An LSM tree consists of a set of levels that are progressively merged and compacted. A write request is typically written to an in-memory data structure called a memtable and to a write-ahead log (WAL) for durability. When the memtable reaches a certain size, it is flushed to disk as an SSTable (Sorted String Table) in the first on-disk level.

Fig 4. A write is a mix of synchronous and asynchronous work.

This first disk level consists of SSTables created from the memtable flushes. These SSTables are numerous and small, with overlapping key ranges, as they are point-in-time flushes of the memtable. These small SSTables get merged and compacted periodically and written to the next level down, and this process continues until data is merged at the bottom level.

A read request must traverse these levels, and possibly multiple SSTables in each level, until it finds the requested key. Each SSTable can be searched quickly thanks to the combination of a Bloom filter and the sorted nature of the file. However, the more SSTables that must be checked, the slower and more costly the read is.

Fig 5. The read of k=6 must check multiple SSTables before it finds the current value of B. The k=7 read gets lucky and finds the latest value in the in-memory memtable.

Read amplification refers to this process of having to check multiple files. The more files the average read must scan, the higher the read amplification and the less efficient reads become. Compaction aims to reduce the number of SSTables that must be searched by consolidating these SSTables through the merging of overlapping ranges of data. This reduces the number of SSTables and improves read performance. Another thing that compaction does is delete tombstones. In LSM trees, deletes are handled by adding tombstones, which are markers indicating that a particular key has been deleted. Compaction processes identify and remove obsolete data, including tombstones, to reclaim disk space.
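
The following toy Go sketch ties the write path and read amplification together. It is an assumed, heavily simplified structure for illustration (no compaction, no tombstones, no real WAL), not Pebble's actual code: writes land in a WAL and memtable and are flushed to L0, while reads count how many SSTables they had to check.

```go
package lsm

import "sort"

type kv struct{ key, value string }

type sstable []kv // sorted and immutable once flushed

type lsmTree struct {
	wal            []kv              // stand-in for the write-ahead log
	memtable       map[string]string // in-memory writes, sorted on flush
	flushThreshold int
	levels         [][]sstable // levels[0] holds flushed memtables with overlapping keys
}

func newLSMTree(flushThreshold int) *lsmTree {
	return &lsmTree{
		memtable:       map[string]string{},
		flushThreshold: flushThreshold,
		levels:         make([][]sstable, 7), // L0..L6
	}
}

// Put appends to the WAL for durability, writes to the memtable, and
// flushes the memtable to an L0 SSTable once it reaches the threshold.
func (t *lsmTree) Put(key, value string) {
	t.wal = append(t.wal, kv{key, value})
	t.memtable[key] = value
	if len(t.memtable) >= t.flushThreshold {
		sst := make(sstable, 0, len(t.memtable))
		for k, v := range t.memtable {
			sst = append(sst, kv{k, v})
		}
		sort.Slice(sst, func(i, j int) bool { return sst[i].key < sst[j].key })
		t.levels[0] = append(t.levels[0], sst) // compaction would later merge this downward
		t.memtable = map[string]string{}
		t.wal = nil // flushed data no longer needs the WAL
	}
}

// Get checks the memtable first, then SSTables from newest to oldest.
// filesChecked is a crude proxy for read amplification.
func (t *lsmTree) Get(key string) (value string, filesChecked int, ok bool) {
	if v, ok := t.memtable[key]; ok {
		return v, 0, true
	}
	for _, level := range t.levels {
		for i := len(level) - 1; i >= 0; i-- { // newest SSTable first
			filesChecked++
			sst := level[i]
			// Binary search works because each SSTable is sorted.
			j := sort.Search(len(sst), func(n int) bool { return sst[n].key >= key })
			if j < len(sst) && sst[j].key == key {
				return sst[j].value, filesChecked, true
			}
		}
	}
	return "", filesChecked, false
}
```

In this toy version nothing ever compacts, so filesChecked grows with every flush; compaction is precisely what keeps that number bounded in a real engine.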

These background processes are executed while trying to minimize their impact on ongoing foreground read and write operations. With this type of storage engine, a burst of writes triggers a series of background processes to maintain the structure's efficiency, keeping read amplification under control.

Fig 6. Read amplification rises and falls as a function of the balance between foreground writes and background work.

The key to predictable, sustainable performance is to ensure that the background processes can keep up with the foreground processes. This leads us back to admission control and the admission queues.

Limiting IO work (balance write workload with background processes)

Each Pebble KV store is gated by a KV Store admission queue that uses token buckets to ensure the write throughput is sustainable, i.e., that read amplification remains within acceptable levels. What follows is a simplified view of the token bucket mechanism; the real implementation is more complex.

Fig 7. The token bucket algorithm. Bursts above the replenishment rate drain the bucket, and periods below the replenishment rate cause the bucket to fill. When the bucket becomes empty, throttling kicks in.
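
A minimal Go sketch of the token bucket algorithm described in Fig 7 (illustrative only, not the CRDB implementation) could look like this:

```go
package admission

import (
	"sync"
	"time"
)

// TokenBucket is a basic token bucket: bursts drain it, quiet periods
// refill it, and an empty bucket means the caller must wait (throttling).
type TokenBucket struct {
	mu         sync.Mutex
	tokens     float64   // current fill level
	capacity   float64   // bucket size: bounds the burst
	refillRate float64   // tokens added per second
	lastRefill time.Time
}

func NewTokenBucket(capacity, refillRate float64) *TokenBucket {
	return &TokenBucket{
		tokens:     capacity,
		capacity:   capacity,
		refillRate: refillRate,
		lastRefill: time.Now(),
	}
}

// TryConsume returns true if n tokens are available and deducts them.
func (b *TokenBucket) TryConsume(n float64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.lastRefill).Seconds() * b.refillRate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.lastRefill = now
	if b.tokens >= n {
		b.tokens -= n
		return true
	}
	return false
}
```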

Token buckets match the needs of an LSM storage engine because the LSM engine needs medium-term write throughput to stay within the bounds where background processes can keep up. Bursts of writes happen and shouldn’t be throttled if the node has capacity; what needs to be controlled is the medium-term aggregate throughput to allow for background processes to kick in when foreground processes die down.

Fig 8. A right-sized token bucket, with the right replenishment rate, can effectively rate limit writes to an LSM-based storage engine to balance writes with read efficiency.

However, if foreground writes do not die down, then they must be throttled to allow for background processes to do their job.

The question is how to set the bucket size and replenishment rate. The strategy here is to calculate the tokens that can be consumed over 15-second periods. The token calculation is based on metrics from the Pebble KV store, and 15 seconds is a period long enough to cover a single compaction run.

During this 15-second period, the replenishment rate and bucket size are continually adjusted on an interval (known as a tick). The interval is set depending on whether the Pebble KV store is overloaded (high read amplification) or not. When it is overloaded, the tick is 1 ms; otherwise, it is set to 250 ms.

A 15-second period can go like this:

  1. At the beginning of this period, the total number of tokens gets calculated based on Pebble metrics. If the read amplification is low, the tokens are unlimited, whereas if the read amplification is higher, the tokens are progressively limited based on the degree of overload.

  2. On each tick, a portion of the tokens is made available for consumption:

    1. The remaining tokens of the period are calculated as the total tokens minus the previously allocated tokens.

    2. The bucket size is dynamically adjusted to be 1/60th of the remaining tokens. This allows a burst to consume 250 ms worth of tokens in a single tick (which may last only 1 ms).

    3. The bucket is replenished with the remaining tokens divided by the remaining ticks in the period, but cannot grow larger than the bucket size.

The key point here is that when read amplification is low, the tokens are effectively unlimited as read operations are still efficient. Only when read amplification starts becoming a problem do the available tokens get progressively limited to ensure that the compaction process can bring read-amplification back down.
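
To show the shape of this scheme, here is a simplified Go sketch of one 15-second period. The structure follows the description above, but the overload scaling in startPeriod is an assumed placeholder rather than the real CRDB formula:

```go
package admission

// periodState is a simplified model of one 15-second token period.
type periodState struct {
	totalTokens     float64 // computed from Pebble metrics at the period start
	allocatedTokens float64 // tokens handed out so far in this period
	ticksRemaining  int     // e.g. 60 ticks of 250 ms, or 15000 ticks of 1 ms
	bucket          float64 // tokens currently available for consumption
	bucketSize      float64 // cap on the bucket
}

const unlimitedTokens = 1 << 50 // effectively unlimited

// startPeriod computes the period's token budget from LSM health metrics.
// Low read amplification means an effectively unlimited budget; as overload
// grows, the budget shrinks (the linear scaling here is only a stand-in).
func startPeriod(readAmp, readAmpThreshold int, compactedBytesLastPeriod float64, ticks int) *periodState {
	total := float64(unlimitedTokens)
	if readAmp > readAmpThreshold {
		total = compactedBytesLastPeriod / float64(readAmp-readAmpThreshold)
	}
	return &periodState{totalTokens: total, ticksRemaining: ticks}
}

// tick releases a slice of the period's remaining tokens into the bucket.
func (p *periodState) tick() {
	if p.ticksRemaining == 0 {
		return
	}
	remaining := p.totalTokens - p.allocatedTokens
	p.bucketSize = remaining / 60 // a burst can consume ~250 ms worth of tokens
	replenish := remaining / float64(p.ticksRemaining)
	p.bucket += replenish
	if p.bucket > p.bucketSize {
		p.bucket = p.bucketSize
	}
	p.allocatedTokens += replenish
	p.ticksRemaining--
}

// consume deducts tokens for a write; a caller that finds the bucket empty
// must queue until a later tick replenishes it.
func (p *periodState) consume(tokens float64) bool {
	if p.bucket < tokens {
		return false
	}
	p.bucket -= tokens
	return true
}
```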

Avoiding priority inversion

Combining work priorities with locks on shared resources can lead to priority inversion, where lower-priority work items can cause higher-priority work items to get delayed because of the lock subsystem. An example is two transactions, t1 and t2, where t1 holds some row locks and t2 also needs to lock some of those same rows. Both have submitted work items to the admission queue but t2, which is waiting on locks, has a higher priority than t1, and is dequeued before t1 even though it cannot progress until t1 has released the locks. To avoid these issues and reduce lock contention in general, the work items of transactions with locks get their priority bumped so that they can ultimately release their locks faster.
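
As a rough illustration of that priority bump (hypothetical types and priority values, not CRDB's actual API):

```go
package admission

// txnState is a hypothetical snapshot of a transaction's admission-relevant state.
type txnState struct {
	holdsLocks bool
	priority   int
}

const lockHolderPriority = 100 // assumed priority level for lock-holding transactions

// admissionPriority returns the priority to use when enqueuing the
// transaction's next work item: lock holders get bumped so they can
// finish and release their locks sooner.
func admissionPriority(t txnState) int {
	if t.holdsLocks && t.priority < lockHolderPriority {
		return lockHolderPriority
	}
	return t.priority
}
```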

Next up in part 3

Rate limiting is about ensuring that storage node performance does not degrade, by protecting the nodes from overload. Additionally, it balances the load across tenants to avoid noisy neighbor problems, using an efficient multi-heap-based admission queuing system. However, ideally, auto-scaling and heat management should work proactively to prevent nodes from experiencing overload, and in the cases where they do, they should minimize the time spent under excessive load. That is what we will look at in part 3.
