Amazon DynamoDB - ASDS Chapter 1

The Architecture of Serverless Data Systems introduction.

DynamoDB is a serverless, distributed, multi-tenant NoSQL KV store that was designed and implemented from day one as a disaggregated cloud-native data system.

The goals were to build a multi-tenant system that had the following properties:

  • Consistent performance with low single-digit millisecond latency.

  • High resource utilization through multi-tenancy, in order to reduce costs and pass those savings on to customers.

  • Unbounded table size, where size does not affect performance.

  • Support for ACID transactions across multiple operations and tables.

The DynamoDB paper states that during the 66-hour Amazon Prime Day, Amazon services “… made trillions of API calls to DynamoDB, peaking at 89.2 million requests per second, while experiencing high availability with single-digit millisecond performance.”

This is a pretty compelling statement and makes it worth digging into the DynamoDB architecture to understand how they achieved it. 

Note! My focus in this write-up is not simply to repeat what is said in the DynamoDB papers and talks, but to examine the architecture with a particular focus on multi-tenancy. I will not describe how Paxos works or how two-phase commit is implemented across the transaction coordinators and storage nodes, for example.

Architecture overview

DynamoDB has a disaggregated architecture where I’ll classify components into the broad categories of proxy, compute, and storage. I differentiate these layers as follows:

  • The proxy layer is concerned with routing, authentication, authorization, coarse-grained rate limiting, and such.

  • The compute layer is concerned with any kind of workload-specific coordination/tasks needed on top of storage. With DynamoDB being a KV workload, the compute layer is pretty light, unlike a distributed SQL service that may need to distribute requests across multiple storage nodes, collect results, and perform compute-intensive operations such as joins.

  • The storage layer is a stateful layer where data resides durably and redundantly.

DynamoDB uses a shared-process resource-sharing model across all its components, with logical isolation between tenants. Every component is a horizontally scalable set of homogeneous nodes, though they may be deployed across heterogeneous compute (EC2) instances.

First, a little on tables, sharding, and replication

While DynamoDB is a NoSQL data system, it still has the concept of tables and rows. Tables come in two flavors:

  • Provisioned tables, which the customer must configure upfront with a provisioned throughput.

  • On-demand tables, which scale according to demand.

DynamoDB tables have either static throughput quotas (provisioned tables) or dynamic quotas (on-demand tables) that control the throughput in terms of Read Capacity Units (RCUs) and Write Capacity Units (WCUs). One read of a 4 KB value equates to 1 RCU, and one write of a 1 KB value equates to 1 WCU. Variations occur depending on whether requests are transactional (costing more units) or strongly consistent (strongly consistent reads cost more than eventually consistent ones).

Throughput is expressed in terms of these RCUs and WCUs, and a provisioned table is configured with a specific provisioned throughput quota. On-demand tables are more flexible, with a dynamic provisioned throughput quota that changes according to load. The aim of on-demand tables is to adjust these dynamic quotas before throttling occurs.
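To make the unit arithmetic concrete, here is a minimal Python sketch of the capacity-unit calculation described above. The exact multipliers for transactional and eventually consistent requests, and the rounding behaviour, are my own simplifications rather than a definitive statement of DynamoDB's billing rules.

```python
import math

# Simplified capacity-unit arithmetic based on the rules described above.
# Assumed multipliers (my simplification): strongly consistent read = 1 RCU
# per 4 KB, eventually consistent = half that, transactional = double;
# write = 1 WCU per 1 KB, transactional write = double.

def read_capacity_units(item_size_bytes: int, consistent: bool = False,
                        transactional: bool = False) -> float:
    units = math.ceil(item_size_bytes / 4096)  # 1 RCU covers up to 4 KB
    if transactional:
        return units * 2
    return units if consistent else units / 2

def write_capacity_units(item_size_bytes: int, transactional: bool = False) -> float:
    units = math.ceil(item_size_bytes / 1024)  # 1 WCU covers up to 1 KB
    return units * 2 if transactional else units

if __name__ == "__main__":
    # A strongly consistent read of a 12 KB row costs 3 RCUs, matching the
    # worked example later in this write-up.
    print(read_capacity_units(12 * 1024, consistent=True))  # 3
    print(write_capacity_units(3 * 1024))                   # 3 WCUs for a 3 KB write
```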

Tables are unbounded in size and can be configured with more throughput than a single storage node could handle. Tables are broken into one or more partitions (shards) so that the data storage and read/write load of the table can be spread evenly across the fleet of storage nodes. Each table partition represents a disjoint and contiguous range of keys (according to the primary key). As a table grows in size, its partitions are split and moved across storage nodes. A partition that experiences excessive load is also split to avoid throttling clients that may be under the aggregate table throughput limit but are performing operations over a hot subset of keys.
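To illustrate how a request could be mapped to the partition that owns a key, here is a hedged sketch. Whether ranges are defined over raw keys or over a hash of the partition key, and the exact metadata format, are not spelled out here, so the hashing and boundary representation below are assumptions for illustration only.

```python
import bisect
import hashlib
from dataclasses import dataclass

# Illustrative only: partitions own disjoint, contiguous ranges of a hashed
# key space. Each boundary is the (exclusive) upper end of a partition's range.

@dataclass
class Partition:
    partition_id: str
    upper_bound: int  # exclusive upper end of the hashed key range

def hash_key(partition_key: str, space: int = 2**32) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % space

def route(partitions: list[Partition], partition_key: str) -> Partition:
    # Partitions must be sorted by upper_bound; binary search finds the
    # first partition whose range contains the hashed key.
    hashed = hash_key(partition_key)
    idx = bisect.bisect_right([p.upper_bound for p in partitions], hashed)
    return partitions[idx]

if __name__ == "__main__":
    parts = [Partition("p0", 2**30), Partition("p1", 2**31),
             Partition("p2", 3 * 2**30), Partition("p3", 2**32)]
    print(route(parts, "customer#1234").partition_id)
```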

Additionally, each table partition is replicated across multiple storage nodes, with placement ensuring that a partition is split across three availability zones.

Fig 1. Each table partition is formed of a Multi-Paxos group.

Each copy of a partition is known as a replica, and a single storage node can host multiple replicas (though never of the same table partition). Replication is based on Multi-Paxos, which is a leader-follower algorithm where writes and consistent reads must go through the leader. Elections occur when leaders become unreachable in order to ensure continued availability.

Fig 2. The table partition replicas are distributed across a set of storage nodes.

Therefore read and write requests must be routed to specific replicas hosted on specific nodes.

High-level architecture

Fig 3. DynamoDB’s disaggregated components.

The service is split into the following components:

  • Request routers are proxies that not only forward any given request to the storage node that hosts the requested key, but also perform authentication, authorization, and table-level rate limiting (via the Global Admission Control component).

  • Global Admission Control (GAC) is a rate-limiting service with table-level granularity.

  • The Metadata service tracks the mapping of primary keys to storage nodes, including which replicas are leaders and followers. Writes, as well as consistent reads, must be routed to the leader replicas. The storage layer is the source of truth for this mapping metadata; storage nodes notify the metadata component of changes to keep the metadata service up-to-date. Request routers consult the metadata component in order to route requests to the correct storage node.

  • Transaction coordinators perform two-phase commit for transactional requests. Non-transactional requests get routed directly to the storage layer.

  • The storage node fleet hosts the table partitions. Each partition is composed of a set of replicas that form a Multi-Paxos group.

  • Autoadmin is the control plane service that is responsible for:

    • Serving DDL-type requests such as create/update/delete table.

    • Autoscaling and moving partitions.

    • Monitoring partition health.

Next, we’ll look at each component in more detail, starting from the bottom up. Then we’ll analyze how DynamoDB overcomes the challenges of multi-tenancy, such as tenant isolation and obtaining high resource utilization.

Storage layer

The storage layer is composed of a fleet of storage nodes where each node stores the data of multiple tenants. The DynamoDB data model consists of tables and rows, where each row has a primary key (the K in KV) and a set of attributes (the V in KV). The primary key can be a scalar value (known as a partition key) or a composite key composed of a partition key and a sort key. The partition key is used to determine which partition the row resides in.

Write requests and strongly consistent read requests are routed to the partition leader. Each replica consists of a write-ahead log (WAL) and a B-tree that stores the partition key-value pairs. On receiving a write, the leader generates a WAL record, which it writes to its own WAL and replicates to its followers. Once a majority of the replicas (including the leader) have confirmed the operation, the operation is acknowledged.

Fig 4. Two storage nodes and two different replicas.
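The quorum rule in this write path can be sketched as: append to the leader's WAL, replicate to the followers, and acknowledge once a majority of the replica group (leader included) has the record. Below is a toy single-process simulation of that rule, not DynamoDB's actual replication code.

```python
from dataclasses import dataclass, field

# A toy simulation of the leader write path: append to the leader's WAL,
# replicate to followers, and acknowledge once a majority of the replica
# group (leader included) has persisted the record.

@dataclass
class Replica:
    name: str
    wal: list = field(default_factory=list)
    reachable: bool = True

    def append(self, record: dict) -> bool:
        if not self.reachable:
            return False
        self.wal.append(record)
        return True

def leader_write(leader: Replica, followers: list[Replica], record: dict) -> bool:
    acks = 1 if leader.append(record) else 0
    for follower in followers:
        if follower.append(record):
            acks += 1
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority  # acknowledged to the client only on quorum

if __name__ == "__main__":
    leader = Replica("az-a")
    followers = [Replica("az-b"), Replica("az-c", reachable=False)]
    ok = leader_write(leader, followers, {"key": "user#42", "value": {"name": "Ada"}})
    print("acknowledged" if ok else "failed")  # acknowledged: 2 of 3 is a majority
```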

There are some additional durability mechanisms at play, such as archiving WAL logs to S3 and creating special log replicas when a follower becomes unreachable. However, I won’t go into detail about these durability mechanisms as my focus is on the high-level architecture and the specific strategies for multi-tenancy. The DynamoDB paper goes into some more detail about durability and availability mechanisms.

Metadata layer

The metadata layer hosts the mapping of primary keys to storage nodes so that requests arriving at the proxy layer can be routed to the correct storage nodes. The metadata service is a distributed in-memory metadata store called MemDS that replicates all metadata across all MemDS nodes. The storage nodes are the source of truth for this mapping and push mapping updates to the MemDS metadata service.

In turn, the request routers also cache this mapping information locally to avoid blocking on MemDS lookups in the hot path. The interesting aspect of this part of the architecture is how DynamoDB avoids metastable failures (that can occur with the use of caches) by performing an async lookup against MemDS on each local cache lookup.

Fig 5. How DynamoDB’s metadata system is designed with predictable performance over pure efficiency.

The traditional approach to caches is to only perform a lookup to a remote data source on a cache miss. However, this can introduce pathological cases: if the local cache is lost, or has become ineffective because the workload interacts badly with its caching strategy (such as LRU), large spikes hit the backend components, which in turn can lead to cascading failures that are hard to recover from. DynamoDB avoids this risk by performing asynchronous lookups against MemDS even on local cache hits, therefore generating a stable load regardless of the cache-hit ratio. This also has the benefit of ensuring that the local caches don’t remain stale for long.
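A minimal sketch of that pattern: even on a cache hit, the router still issues a lookup to MemDS in the background, so MemDS sees a load proportional to the request rate rather than to the miss rate. The MemDSClient interface and the cache structure are hypothetical stand-ins.

```python
import threading

# Sketch of the "always look up, even on a hit" cache pattern described above.
# MemDSClient is a hypothetical stand-in for the real MemDS lookup API.

class MemDSClient:
    def lookup(self, key: str) -> dict:
        # In reality this would be a network call to a MemDS node.
        return {"partition": "p1", "leader": "storage-node-17"}

class RoutingCache:
    def __init__(self, memds: MemDSClient):
        self.memds = memds
        self.cache: dict[str, dict] = {}

    def _refresh(self, key: str) -> dict:
        entry = self.memds.lookup(key)
        self.cache[key] = entry
        return entry

    def get(self, key: str) -> dict:
        entry = self.cache.get(key)
        if entry is None:
            # Cache miss: the caller must wait for the authoritative answer.
            return self._refresh(key)
        # Cache hit: answer immediately, but still refresh asynchronously so
        # the load on MemDS is constant regardless of the hit ratio.
        threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
        return entry

if __name__ == "__main__":
    cache = RoutingCache(MemDSClient())
    print(cache.get("table-a/key-range"))  # miss: synchronous lookup
    print(cache.get("table-a/key-range"))  # hit: async refresh in background
```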

Global Admission Control (GAC)

While provisioned throughput is a table-level configuration, the actual throughput hitting the partitions of a table may not be even, with some hot partitions receiving a disproportionate amount of the load. The other dimension with load skew is that of time itself; while the average load over an arbitrary period may fall within the quota, very short-lived spikes may exceed it and end up being throttled.

Global Admission Control aims to allow the aggregate table throughput to reach the table’s maximum provisioned throughput, despite a load distribution skewed by hot partitions or short-lived spikes.

The request routers apply the throttling to prevent a table from going over its provisioned throughput quota, but it is the horizontally scalable fleet of GAC nodes that controls this throttling, tracking table-level request rates and supplying request routers with the information needed to apply those rate limits dynamically and proportionally.

The various throttling mechanisms are discussed in more detail in the multi-tenancy section further down.

Transaction coordinators

DynamoDB supports transactional reads and writes; these transactional requests do not get routed directly to storage nodes but to transaction coordinators, which form a horizontally scalable set of stateless nodes responsible for two-phase commit (backed by a DynamoDB table). I won’t go into the details of how this works as the focus of this analysis is serverless multi-tenancy. However, it is relevant to point out that this component, like all other DynamoDB components, is a horizontally scalable set of multi-tenant nodes.

Request routers

Requests first land at a request router, where the request is authenticated and authorized, potentially rate-limited, and finally routed either directly to a storage node or to a transaction coordinator if it is a transactional request.

Multi-tenancy

As I discussed in the series overview, large-scale multi-tenant systems can increase resource utilization via resource pooling, bringing down operating costs significantly. Additionally, we can improve reliability as the system can use its scale and elasticity to absorb per-tenant load spikes without causing stress or overload.

DynamoDB uses the same resource-sharing model across all its components - shared processes with logical isolation. It achieves high resource utilization by over-subscribing tenants: the system could not offer low single-digit millisecond per-operation latency if every table were being written to and read from at its provisioned limit. If the system were scaled to that level, resource utilization would be low, because the actual aggregate load is far lower than the aggregate throughput limits. However, over-subscription leaves a system vulnerable to correlated (and stochastic uncorrelated) load spikes that exceed the capacity of localized parts of the system. This overload results in throttling and high latencies, which would violate the core tenet of DynamoDB itself - predictable, low single-digit millisecond latency.

The challenge is high resource utilization, with elasticity for load changes, while providing strong tenant isolation and predictable performance. To achieve this, DynamoDB has a multi-layered and multi-dimensional load limiting and distribution strategy. In an ideal world, the aggregate load of a table would be constant, and the key distribution of that work would be uniform. However, in the real world, neither of these is true; aggregate load can spike and some keys can be disproportionately accessed, creating hotspots in a table.

Rate-limiting

The first concept that DynamoDB uses is the token bucket, a rate-limiting algorithm that limits average throughput but allows for short throughput bursts without throttling. There is a separate bucket for reads and another for writes, given that a table can be provisioned with different read and write throughput.

Let’s take reads as an example. 1 RCU corresponds to 1 read per second of 4 KB of data. If, for example, we read one row per second, and each row were 12 KB, that would equate to 3 RCUs. Converted to tokens, one 12 KB read would be represented by 3 tokens.

The bucket is initially filled with tokens corresponding to a multiple of the provisioned RCUs. For example, with a 100 RCU table, we could fill the bucket with 5 minutes' worth of tokens (300 seconds * 100 = 30,000 tokens). Each of our example reads would drain 3 tokens from the bucket. If the bucket holds 3 or more tokens, the request is serviced; if it holds fewer than 3, the request is throttled. At the same time, the bucket is refilled at a rate of 100 tokens per second (100 RCUs).

Fig 6. The Token Bucket algorithm.

What this means is that in periods where read throughput is less than 100 RCUs, the bucket fills until it reaches capacity, which in this example is 30,000 tokens. When the read throughput is greater than 100 RCUs, the bucket drains. This algorithm enforces the throughput limit over time but allows for short-lived spikes without throttling. However, if load continues above the 100 RCU limit, the bucket will eventually empty and requests will be limited to the rate at which the bucket is filled (100 tokens per second in this case).

Fig 7. Throttling occurs when the bucket empties.
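Below is a minimal token bucket sketch using the figures from the example above (a fill rate of 100 tokens per second and a capacity of 30,000 tokens). The capacity and fill-rate values DynamoDB actually uses internally are not published beyond the paper's examples.

```python
import time

# Minimal token bucket: refills continuously at `rate` tokens/second up to
# `capacity`, and admits a request only if it can drain the requested tokens.

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    def try_acquire(self, tokens: float) -> bool:
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False  # caller should throttle the request

if __name__ == "__main__":
    # 100 RCU table with a 5-minute burst allowance, as in the example above.
    bucket = TokenBucket(rate=100, capacity=30_000)
    print(bucket.try_acquire(3))  # one 12 KB consistent read = 3 tokens
```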

DynamoDB employs token buckets in multiple sites:

  1. At the proxy layer with table-level buckets. This ensures that the aggregate load of the table is controlled.

  2. At the storage node:

    • A node-level bucket to ensure that the aggregate load across its hosted partitions does not exceed the physical capacity of the node (based on the hardware it has been allocated).

    • Per-partition buckets, though the paper does not discuss the capacity or rate at which these buckets are filled*. However, the idea is to prevent one or more partitions from utilizing all available resources on a node.

* A previous design split the available tokens equally across partitions, though this resulted in sub-optimal throttling due to hot partitions and the dilution of tokens as partitions got smaller (due to splitting). Per-partition token buckets were retained in the GAC design, though I presume some kind of heuristic is used to determine capacity and fill rate, potentially with the ability to change partition buckets dynamically on the fly.

Therefore, for a request to be serviced, it must be able to draw tokens from all of the following (a combined sketch follows this list):

  1. The table token bucket in the proxy layer.

  2. The node token bucket on the storage node that admits the request.

  3. The partition token bucket on the storage node that admits the request.
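Putting the three levels together, admission could be sketched as below, reusing the TokenBucket sketch from the rate-limiting section. In reality the table-level bucket lives in the proxy layer while the node and partition buckets live on the storage node; collapsing them into one function is purely for illustration.

```python
# Layered admission: a request must draw tokens from the table-level bucket
# (proxy layer) and from the node-level and partition-level buckets on the
# storage node. The buckets are assumed to be TokenBucket instances from the
# earlier sketch (anything with a try_acquire method works).

def admit(table_bucket, node_bucket, partition_bucket, tokens: float) -> bool:
    # Simplification: tokens drawn at an earlier level are not refunded if a
    # later level throttles the request; a real implementation would have to
    # deal with that.
    for bucket in (table_bucket, node_bucket, partition_bucket):
        if not bucket.try_acquire(tokens):
            return False
    return True
```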

It’s worth going over how the request routers and GAC nodes coordinate to implement distributed rate limiting efficiently.

Each request router maintains a local token bucket for a given table (if it has previously received requests for that table) and draws tokens from that bucket without needing to make a synchronous call to a GAC node. Once the router’s bucket is empty, it requests a set of tokens from a GAC node, which maintains an aggregate token bucket for that table. If the aggregate bucket has available tokens, the GAC node draws a batch for the request router, which adds the tokens to its local bucket. If the aggregate bucket is empty, the request router is left with an empty local bucket and throttles incoming requests for that table.

Fig 8. How request routers and GAC nodes interact to apply distributed rate-limiting via token buckets.
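The router-side logic might look roughly like the sketch below. The GACClient interface, the batch size, and the way the GAC node refills its aggregate bucket are assumptions; the paper describes the behaviour but not the interfaces.

```python
# Sketch of the request-router side of GAC-based rate limiting: draw from a
# local per-table token allowance and, when it runs out, ask a GAC node for a
# new batch. GACClient and the batch size are hypothetical.

class GACClient:
    def __init__(self, table_rate: float):
        # The GAC node's aggregate table bucket, grossly simplified: it hands
        # out whatever it has, up to the requested batch size.
        self.available = table_rate

    def request_tokens(self, table: str, batch: float) -> float:
        granted = min(batch, self.available)
        self.available -= granted
        return granted

class RouterTableLimiter:
    def __init__(self, gac: GACClient, table: str, batch: float = 100.0):
        self.gac = gac
        self.table = table
        self.batch = batch
        self.local_tokens = 0.0

    def try_admit(self, cost: float) -> bool:
        if self.local_tokens < cost:
            # Local allowance exhausted: fetch another batch from GAC.
            self.local_tokens += self.gac.request_tokens(self.table, self.batch)
        if self.local_tokens >= cost:
            self.local_tokens -= cost
            return True
        return False  # GAC's aggregate bucket is empty too: throttle

if __name__ == "__main__":
    limiter = RouterTableLimiter(GACClient(table_rate=300.0), table="orders")
    print([limiter.try_admit(3) for _ in range(5)])  # admits while tokens remain
```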

Heat management (load distribution)

Many of DynamoDB’s components are stateless or fully replicated. Of these, some can service any request (request routers, metadata nodes), while others split the load via hash partitioning, with the nodes forming a ring (GAC). In the case of request routers, each router can service any request, so load changes can be accommodated by horizontal scaling and load balancing. Others, like GAC nodes, can also be horizontally scaled, and the hash partitioning is designed to accommodate changes in node count without too much disruption.
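As an illustration of the hash-ring idea, here is a minimal sketch mapping tables to GAC nodes. The use of consistent hashing with virtual nodes is my assumption; the source only says the GAC nodes form a hash ring.

```python
import bisect
import hashlib

# Sketch of hash-ring placement for GAC: each table maps to the first node
# clockwise from its hash. Virtual nodes (several points per server) are an
# assumption to smooth the distribution.

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 64):
        self.ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def node_for(self, table: str) -> str:
        h = self._hash(table)
        idx = bisect.bisect_right([point for point, _ in self.ring], h)
        return self.ring[idx % len(self.ring)][1]  # wrap around the ring

if __name__ == "__main__":
    ring = HashRing(["gac-1", "gac-2", "gac-3"])
    print(ring.node_for("orders"), ring.node_for("customers"))
```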

However, the storage layer requires a more sophisticated approach to load distribution. Any given write or consistent read can only be serviced by one storage node out of an entire fleet. Requests cannot simply be routed to the least loaded storage node; routing is based on where data resides, which can lead to skewed load distributions focusing too much load on a subset of storage nodes.

Unnecessary throttling is throttling that occurs when the table-level throughput is within the provisioned bounds, but a hotspot has caused localized throttling in the storage layer. Avoiding these hotspots that can cause individual storage nodes to receive requests beyond their performance or storage capacity is the subject of “heat management”.

The unit of distribution of table data is the partition. If a particular partition gets too big, or receives too much read/write throughput, it is split. The split point is determined based on the trigger (size or load). In the case of load, the split point is determined by the load distribution across the keys to ensure that the left-hand and right-hand sides of the split point are balanced.

Fig 9. Partition splits use key distribution to determine the split point.
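One way to picture a load-based split is to choose the boundary that most evenly halves the observed throughput, as in the sketch below. How DynamoDB actually tracks per-key load and selects the split point is not detailed, so this only conveys the idea.

```python
# Illustrative split-point selection: given per-key throughput observations
# for a partition (keys in sorted order), pick the boundary that most evenly
# balances load between the resulting left and right halves.

def choose_split_point(key_load: list[tuple[str, float]]):
    total = sum(load for _, load in key_load)
    best_key, best_imbalance = None, float("inf")
    left = 0.0
    for key, load in key_load[:-1]:  # never split after the last key
        left += load
        imbalance = abs(left - (total - left))
        if imbalance < best_imbalance:
            best_key, best_imbalance = key, imbalance
    return best_key  # keys <= best_key go to the left-hand partition

if __name__ == "__main__":
    observed = [("a", 200.0), ("b", 300.0), ("c", 100.0), ("d", 250.0), ("e", 150.0)]
    print(choose_split_point(observed))  # "b": 500 on each side of the split
```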

Throughput-based splits only occur if the result would improve load distribution. Some cases are pathological, such as when all load is focused on a single key, where no split point would help.

While splitting can even out load over the set of partitions of a table, it doesn’t alone solve the problem of hot storage nodes. To even out the load over storage nodes, partitions are migrated from busy storage nodes to less busy ones. Therefore load is distributed by evening out load over right-sized partitions and evening out right-sized partitions over nodes. Aggregate load changes can be accommodated by scaling out (or in) the storage node fleet and migrating partitions accordingly.
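A greedy sketch of the second half of this strategy: repeatedly move a partition off the hottest storage node onto the coolest one until the spread falls within a tolerance. The actual placement and balancing algorithm is not published, and this ignores constraints such as never co-locating replicas of the same partition.

```python
# A greedy illustration of heat management across storage nodes: repeatedly
# move a partition off the hottest node onto the coolest one until the load
# spread is within a tolerance. node_partitions maps node -> {partition: load}.

def rebalance(node_partitions: dict, tolerance: float = 50.0):
    moves = []

    def node_load(node):
        return sum(node_partitions[node].values())

    while True:
        hottest = max(node_partitions, key=node_load)
        coolest = min(node_partitions, key=node_load)
        gap = node_load(hottest) - node_load(coolest)
        if gap <= tolerance:
            return moves
        # Only move a partition smaller than the gap, otherwise the move
        # would just overshoot and push the imbalance the other way.
        candidates = {p: l for p, l in node_partitions[hottest].items() if l < gap}
        if not candidates:
            return moves
        partition = max(candidates, key=candidates.get)
        load = node_partitions[hottest].pop(partition)
        node_partitions[coolest][partition] = load
        moves.append((partition, hottest, coolest))

if __name__ == "__main__":
    fleet = {
        "node-1": {"orders-p3": 400.0, "users-p1": 150.0},
        "node-2": {"orders-p7": 100.0},
        "node-3": {"events-p2": 120.0},
    }
    print(rebalance(fleet))  # e.g. orders-p3 moves off node-1
```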

Placement and correlated load

DynamoDB places tenants across the storage fleet in order to minimize correlated load changes being focused on a small number of storage nodes. Tables of the same tenant have a high load correlation; that is, the real-time load on one partition of a table has a high correlation with the real-time load on all other partitions of that table. If the partitions of a table are distributed across a small number of storage nodes, then a load spike on that table will likely result in a spike across its partitions, resulting in a focus of load on those nodes. This correlation also includes different tables of the same tenant; any given tenant may be subject to a time of day or seasonal correlations of load across tables.

Therefore, partitions are placed across the fleet to minimize tenant-based correlation. The more uncorrelated the partitions of any given storage node are, the less likely it is that the node will reach its performance capacity and require partition migrations. Another benefit of co-locating uncorrelated partitions is that tenants can be more densely packed (or over-subscribed) on each storage node without sacrificing performance.
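The placement goal could be expressed as scoring each candidate node by how correlated its existing replicas are with the partition being placed, and picking the least correlated node, as in the sketch below. The correlation weights (same table = strongly correlated, same tenant = weakly correlated, otherwise uncorrelated) are purely illustrative; the actual heuristics are not published.

```python
from dataclasses import dataclass

# Illustration of correlation-aware placement: prefer the node whose existing
# replicas are least correlated (by tenant/table) with the partition being
# placed. The weights are arbitrary stand-ins.

@dataclass(frozen=True)
class PartitionReplica:
    tenant: str
    table: str
    partition_id: str

def correlation(a: PartitionReplica, b: PartitionReplica) -> float:
    if a.tenant == b.tenant and a.table == b.table:
        return 1.0   # partitions of the same table spike together
    if a.tenant == b.tenant:
        return 0.5   # different tables of the same tenant are still correlated
    return 0.0       # different tenants: assume uncorrelated

def place(new: PartitionReplica, nodes: dict) -> str:
    def score(node):
        return sum(correlation(new, existing) for existing in nodes[node])
    return min(nodes, key=score)

if __name__ == "__main__":
    fleet = {
        "node-1": [PartitionReplica("acme", "orders", "p1")],
        "node-2": [PartitionReplica("acme", "users", "p4")],
        "node-3": [PartitionReplica("globex", "events", "p9")],
    }
    print(place(PartitionReplica("acme", "orders", "p2"), fleet))  # node-3
```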

The workload

The key-value workload itself, with its four CRUD operations (PutItem, UpdateItem, DeleteItem, and GetItem), is ideal for building a scalable, multi-tenant system with predictable performance (a sketch of these operations follows the list below). The non-transactional operations make for a simple key-based storage API. Even transactional operations are bounded, since all work must be encapsulated in a single client request. The variability in the cost of these operations is low, which leads to a number of benefits:

  1. Operations can be translated, at the proxy layer, into units of work (RCUs and WCUs) which are accurate enough to reflect actual work done by the system. Table throughput provisioning, billing, and rate-limiting can all be applied effectively based on these simplified units of work.

  2. Coarse-grained rate-limiting can be effectively applied at the proxy layer because each request has a low-variability impact on the aggregate system load.

  3. Heat management in the storage layer becomes more manageable because per-table aggregate load can be controlled before reaching storage. The storage layer then only needs to focus on ensuring that the admitted throughput can be distributed as evenly as possible over storage nodes via the splitting and moving of partitions.
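For reference, here is a hedged sketch of that simple API surface using boto3. The table names, key schema, and attribute names are invented for illustration; the point is how narrow and predictable the scope of each request is.

```python
import boto3

# Sketch of the core item operations plus a transactional write, using the
# boto3 DynamoDB client. Table names, keys, and attributes are made up.

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# PutItem: write a full row keyed by partition key (and sort key, if any).
dynamodb.put_item(
    TableName="orders",
    Item={"customer_id": {"S": "c#42"}, "order_id": {"S": "o#1001"},
          "total": {"N": "19.99"}},
)

# GetItem: a strongly consistent read costs more RCUs than an eventually
# consistent one.
dynamodb.get_item(
    TableName="orders",
    Key={"customer_id": {"S": "c#42"}, "order_id": {"S": "o#1001"}},
    ConsistentRead=True,
)

# UpdateItem: modify attributes of an existing row.
dynamodb.update_item(
    TableName="orders",
    Key={"customer_id": {"S": "c#42"}, "order_id": {"S": "o#1001"}},
    UpdateExpression="SET order_status = :s",
    ExpressionAttributeValues={":s": {"S": "SHIPPED"}},
)

# DeleteItem: remove the row.
dynamodb.delete_item(
    TableName="orders",
    Key={"customer_id": {"S": "c#42"}, "order_id": {"S": "o#1001"}},
)

# Transactional work is bounded to the items named in a single request,
# which keeps the server-side coordination predictable.
dynamodb.transact_write_items(
    TransactItems=[
        {"Put": {"TableName": "orders",
                 "Item": {"customer_id": {"S": "c#42"}, "order_id": {"S": "o#1002"},
                          "total": {"N": "5.00"}}}},
        {"Update": {"TableName": "customers",
                    "Key": {"customer_id": {"S": "c#42"}},
                    "UpdateExpression": "SET order_count = order_count + :one",
                    "ExpressionAttributeValues": {":one": {"N": "1"}}}},
    ]
)
```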

Cloud object storage

Finally, DynamoDB uses S3 only for extra durability, not for reducing storage costs by offloading data to cheaper S3 storage. All data is stored across the drives of the storage fleet. Low single-digit millisecond latency is one of the core requirements of this system, and the random-access workload pattern of a distributed key-value store makes cloud object storage unviable as the primary data store. Smart caching can mitigate the latency costs but not hide them completely.

Security isolation

Being primarily a storage API simplifies security isolation, as tenants cannot execute arbitrary code. This makes shared processes with logical isolation a straightforward choice. The only other risk is data corruption causing reads past the end of a row into another customer’s row. Typically this is solved by using a rock-solid storage engine such as Postgres, MySQL, or RocksDB on the storage nodes. For custom storage engines, mitigations include not co-locating data of multiple tenants in the same files and/or using data consistency checks to validate the integrity of the data that is read. I don’t know the details of the actual storage engine on DynamoDB storage nodes, but I have to assume it guards against this.

Conclusions

As I described in my post on how I see the future of cloud data services, we architect our systems for the properties we want those systems to have. DynamoDB was crafted at every level around its predictable low-latency performance requirements. This does not just include the disaggregated architecture, the multi-layered rate-limiting, and storage-level elasticity and mobility; it includes the simple key-value API itself.

DynamoDB epitomizes the large-scale multi-tenant argument that I made about the future of cloud services. It employs a disaggregated architecture with horizontally scalable components that pool resources to serve multiple tenants. The high resource utilization combined with its scale and elasticity makes for a more reliable system from the perspective of any given tenant, as load can scale with the needs of the tenant (at least for on-demand tables). The system utilizes specific strategies to mitigate tenant interference and avoids certain possibilities altogether, such as using cloud object storage as a primary data store, to safeguard the core requirements that initially formed the basis for building the service.