Neon - Serverless PostgreSQL - ASDS Chapter 3

See The Architecture of Serverless Data Systems introduction chapter to find other serverless data systems. This is the first system of group 2 - the SQL OLTP database group.

Before we get stuck into Neon, I’d like to take a brief look at Amazon Aurora. I could have focused only on Aurora, but I’ve already covered an AWS service and thought Neon was interesting. Neon implements a similar architecture to Aurora but has some differences such as a unique point-in-time query model that places different stresses on the storage layer.

First, a detour to Amazon Aurora

The Aurora paper provided us with the insight that the durability and availability of a traditional RDBMS such as MySQL could be improved (and at the same time performance be enhanced*) by separating the monolith into disaggregated storage and compute.

The paper compares its approach with MySQL on EBS volumes using block-level synchronous replication (such as DRBD) to a stand-by replica. This is how MySQL RDS replication works today as far as I know (at least in single stand-by mode). Given that each write involves writing to the redo log (the write-ahead log of InnoDB), the binlog (MySQL's log of committed changes), the data pages themselves, and the double-write buffer, each write ends up being multiplied on disk.

On top of the existing amplification, multiple rounds of replication further amplify the writes (a rough tally is sketched after this list):

  1. The EBS volume on the primary is replicated.

  2. The block-level replication replicates all changes on disk to the secondary.

  3. The EBS volume on the secondary is replicated.
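To put rough numbers on it, here is a back-of-envelope tally in Python. The multipliers are my own illustrative assumptions, not figures from the Aurora paper.

```python
# Rough, illustrative tally of write amplification for mirrored MySQL on EBS
# with block-level replication to a standby. The multipliers are assumptions
# for illustration only, not measurements from the Aurora paper.

writes_per_change = 4        # redo log, binlog, data page, double-write buffer (roughly)
copies_per_write  = 2 * 2    # primary EBS + its mirror, standby EBS + its mirror

disk_writes_per_logical_change = writes_per_change * copies_per_write
print(disk_writes_per_logical_change)   # ~16 disk writes for one logical change,
                                        # and the sequential steps add latency too
```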

Worse, latency is additive. Aurora takes a very different approach.

The Aurora architecture is based on the following main points:

  1. Replicate only the write-ahead log. As the Aurora paper puts it, “the log is the database”. 

  2. Move durable storage (WAL and data pages) from MySQL disks to a separate large-scale multi-tenant storage service. By only replicating the WAL, the total bytes replicated are lower despite a higher replication factor (6-way). The log applicator (which turns WAL records into mutations to data pages) now lives on the storage nodes that host the data pages.

  3. The MySQL database makes network calls to the storage service instead of file system calls. If the database fits in RAM, then no network calls are needed for reads. If a page is needed that is not buffered, the database makes a call to fetch it from the storage service.

  4. Read replicas are maintained by streaming the write-ahead log asynchronously from the primary to the read replicas (called Aurora replicas). Fail-over uses a consensus protocol between the promoted replica and the storage service to ensure that the replica is consistent, despite its replication from the primary being asynchronous.

* I will stay away from the performance claims regarding Aurora. The claims are plausible; foreground and background processes on traditional monolithic RDBMSs can cause contention, which the Aurora model avoids. But if you care about performance, you should always run your specific workload yourself when assessing vendors' performance claims.

What is clear is that this architecture is superior from a resiliency and durability standpoint:

  • Replication to the storage service is synchronous (meaning no lost data if the primary goes down) and spans three AZs with 6-way replication (and a 4/6 write quorum). An entire AZ plus one other replica can perish without data loss (a quorum sketch follows this list).

  • Disk failures do not cause database downtime or fail-over due to quorum writes to the storage service. Additionally any loss of redundancy can be healed in the storage layer without impact to the database or its clients.
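To make the quorum arithmetic concrete, here's a minimal sketch of Aurora-style quorum rules (six copies across three AZs, with a 4/6 write quorum and, per the Aurora paper, a 3/6 read quorum). The helper names are mine, purely for illustration.

```python
# Minimal sketch of Aurora-style quorum arithmetic (V=6, Vw=4, Vr=3).
# Helper names are illustrative, not from any Aurora or Neon codebase.

V, V_WRITE, V_READ = 6, 4, 3

# Two classic quorum conditions: reads see the latest write, and two
# concurrent writes cannot both succeed without overlapping.
assert V_READ + V_WRITE > V      # read quorum intersects write quorum
assert 2 * V_WRITE > V           # write quorums intersect each other

def data_survives(failed_copies: int) -> bool:
    """Data is not lost while a read quorum of copies survives."""
    return V - failed_copies >= V_READ

def writes_available(failed_copies: int) -> bool:
    """New writes can commit while a write quorum survives."""
    return V - failed_copies >= V_WRITE

# Losing an entire AZ (2 copies) plus one more copy: no data loss,
# although write availability is impaired until redundancy is repaired.
assert data_survives(3) and not writes_available(3)
# Losing one AZ only: both reads and writes keep working.
assert data_survives(2) and writes_available(2)
```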

Finally there are the economics of large-scale multi-tenancy, lowering the total cost of ownership due to better resource efficiency through massive resource-pooling.

The Neon architecture takes the same approach, aiming for the same durability improvements but also adds scaling-to-zero and a non-overwriting storage model to support data branching. Let’s dig into the Neon architecture.

Neon, serverless Postgres 

I’d like to thank Neon for their support with access to the engineers behind the service to ensure this description is accurate and complete. Please treat this architecture deep-dive as a point-in-time snapshot; the architecture is evolving fast.

Neon is a serverless Postgres service based on an architecture similar to Amazon Aurora. It separates the Postgres monolith into disaggregated storage and compute. The motivation behind this architecture is four-fold:

  • Aim to deliver the best price-performance Postgres service in the world.

  • Use modern replication techniques to provide high availability and high durability to Postgres.

  • Simplify the life of developers by bringing the serverless consumption model to Postgres.

  • Do all this while keeping the majority of Postgres unchanged. Rather than building a new Postgres-compatible implementation, simply leverage the pluggable storage layer to provide all the above benefits while keeping Postgres Postgres.

While Neon isn’t quite “Just Postgres”, it’s as close as one can get to the spirit of it while still obtaining the benefits of a disaggregated multi-tenant architecture. Neon is banking on developers preferring an actual Postgres database rather than a Postgres-compatible database. The entire architecture is based on that premise: take the Postgres monolith and make it cloud-native using the insights of the Aurora design template (plus unique features such as data branching).

Architecture overview

Neon is composed of a stateless compute layer and a disaggregated storage layer. Postgres instances form the stateless compute layer, while the storage layer consists of a fleet of storage nodes, where each node stores the data of multiple tenants. The fleet is broken into horizontally scalable services for the write-ahead logs (WAL service) and data pages (Page service).

Fig 1. Neon separates the Postgres monolith into a compute and storage layer.

Each Postgres database is stateless because any state it maintains is not required for durability. If a Postgres instance is lost, it can be recreated from the state stored in the storage layer. After a period of inactivity, an idle Postgres instance is taken offline and stood up again when a new query is received. The most common demand from customers is large numbers of small databases rather than one or two very large ones. That said, Neon does have multi-terabyte database customers and the system is designed to support these large databases as much as the more numerous small databases.

Neon has patched Postgres in a number of ways in order to be able to intercept writes to WAL and data files as well as ensure cache evictions incur no writes to local disk. Neon hopes that these patches will be accepted by the Postgres community though there is no guarantee of that.

Fig 2. On the left, we see the traditional Postgres monolith, which reads and writes to local files. On the right, we see Neon’s disaggregated Postgres, where reads and writes go against the remote storage services.

Postgres databases in Neon no longer experience local disk IO as a bottleneck, nor do foreground processes interfere with background processes, because the nodes do not read and write to disk but to a remote storage service. Reads and writes are now spread out over a fleet of storage servers, which helps dissipate load peaks. The new bottleneck is the network and the storage layer’s ability to serve IO requests quickly. Neon's split of the storage layer into separate write and read components should help here, as each component can be optimized and scaled separately.

Fig 3. Writes (WAL records) go to Safekeeper nodes and data page reads go to Pageservers.

Point-in-time recovery, queries, and branching

One of the killer features of Neon, beyond being serverless, is its ability to support point-in-time queries, point-in-time restores, and branching. All this requires that data is never overwritten and is layered such that history is maintained efficiently.

A key component of this layered data approach is the log sequence number (LSN), a monotonic integer assigned to each WAL record that represents a specific point in time in the linearized history of data changes. Reads can be performed against specific LSNs and branches can share a common history based on a common LSN.

This layering is maintained up to a time horizon, beyond which it gets garbage collected. This presents a couple of challenges; the immediate one is that the size of the database does not necessarily correspond to the size of the data stored. Some databases with a high rate of updates can see their storage balloon in size, while other databases that are INSERT heavy but UPDATE light have a very close database-to-stored-data ratio.

The layered data storage model consists of image and delta files. Each image file represents the materialized data for a page range at a particular point in time (LSN). WAL records, which are deltas, are then layered on top of the image files as delta layer files (think of these as an indexed WAL). These deltas get replayed on top of the most recent image file to make a new image file. Reads at a given LSN are served by first finding the most recent image file at or below that LSN and then applying all WAL records between it and the requested LSN. This strategy has similarities to a log-structured merge tree (LSM), where a read must traverse a set of layers to find either the current value of a key or the value of a key at a given point in time. I won’t cover it any further in this post, but I recommend reading Neon's blog post on how they implemented layered reads efficiently. I’ll simply refer to the existence of layer files, which form an LSM-like approach.
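To illustrate the idea (with my own naming, not Neon's actual layer-file format), here is a minimal sketch of serving a page read at a given LSN from an image layer plus the deltas above it:

```python
# Minimal sketch of reading a page at a given LSN from image and delta layers.
# Types and names are illustrative, not Neon's actual on-disk format.
from dataclasses import dataclass, field

@dataclass
class ImageLayer:
    lsn: int          # the LSN at which this page image was materialized
    page: bytes       # fully materialized page contents

@dataclass
class PageHistory:
    images: list[ImageLayer] = field(default_factory=list)   # image layers for this page
    deltas: dict[int, bytes] = field(default_factory=dict)    # lsn -> WAL record for this page

    def read_at(self, lsn: int) -> bytes:
        # 1. Find the most recent image at or below the requested LSN.
        base = max((img for img in self.images if img.lsn <= lsn),
                   key=lambda img: img.lsn)
        page = base.page
        # 2. Replay every delta (WAL record) between that image and the requested LSN.
        for delta_lsn in sorted(self.deltas):
            if base.lsn < delta_lsn <= lsn:
                page = apply_wal_record(page, self.deltas[delta_lsn])
        return page

def apply_wal_record(page: bytes, record: bytes) -> bytes:
    # Stand-in for the Postgres redo (log applicator) logic.
    return page + record  # placeholder, not real redo
```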

The write path

With vanilla Postgres, data is first written to internal shared buffers, applying the changes to the cached 8KB data pages containing the relevant rows. Next, a delta is written to the WAL buffer. Before the write is committed, the WAL writer takes the delta from the WAL buffer and persists it to the WAL file. At this point, the only durable change in state is that of the deltas written to the WAL. The modified 8KB data pages in the shared buffers are “dirty” as they have not been written to disk yet.

The background writer process is responsible for ensuring there is enough space in the shared buffers and periodically flushes dirty data to the data files to make space. The shared buffers are essentially a cache, and data is evicted from the cache to make space for other operations. When dirty data is evicted, it is written to disk. Checkpointing flushes dirty pages to disk and marks the WAL as applied up to that point, allowing the WAL tail to be cleaned.

In Neon, none of these disk operations happen; instead, these operations are turned into network IO requests or no-ops in the case of cache evictions. The only table state maintained locally by the Postgres database is what it stores in its internal buffers. When data must be evicted, only committed data is allowed to be evicted as committed data is now durably and redundantly stored by the remote WAL service.
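As a rough illustration of that eviction rule (my own naming and simplification, not Neon's patched buffer manager), eviction becomes a simple drop once the page's changes are known to be durable in the WAL service:

```python
# Minimal sketch of Neon-style cache eviction as a no-op disk-wise.
# Names are illustrative; the real logic lives in Neon's patched Postgres.

class BufferManager:
    def __init__(self):
        self.committed_lsn = 0     # highest LSN acknowledged by a Safekeeper quorum
        self.buffers = {}          # page_id -> (page_bytes, last_modified_lsn)

    def evict(self, page_id: int) -> None:
        _, page_lsn = self.buffers[page_id]
        # Vanilla Postgres would write a dirty page to disk here. In Neon the
        # WAL service is the source of durability, so eviction is just a drop,
        # but only once the page's changes are durably committed.
        if page_lsn > self.committed_lsn:
            raise RuntimeError("wait for the WAL service to commit up to this LSN")
        del self.buffers[page_id]  # no disk write: the page can be rebuilt from the WAL
```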

Safekeepers

The WAL of a Postgres instance is written to a cluster of three Safekeeper nodes.

Fig 4. A Postgres database writes WAL records to three Safekeepers (using quorum writes).

You’ll notice that the Postgres database is responsible for writing the WAL records to each Safekeeper rather than writing to a leader node which then replicates those records to followers (such as with a Raft cluster).

If you’ve read my posts about Apache Pulsar and Apache BookKeeper, you’ll recognize this pattern as the way Pulsar and BookKeeper work. In this disaggregated model, the leader (or writer) is a stateless actor who serves clients and caches data locally.

Rather than adopt Raft, Neon has chosen a Paxos implementation for its WAL service. Paxos defines the roles of Proposer, Acceptor, and Learner. Each role is responsible for a different part of the protocol and there are no rules regarding where the different roles run. 

  • Proposers. A proposer simply proposes a value to be written. In Multi-Paxos, one proposer at a time is elected as the leader who proposes a sequence of values. This leader is also known as a Distinguished Proposer.

  • Acceptors. Acceptors store the proposed values, and values are committed once accepted by a majority.

  • Learners. A learner learns of committed values from the acceptors.

With Multi-Paxos, one leadership term consists of a Prepare phase where a Proposer is elected as the Distinguished Proposer by the Acceptors. Then the second (steady-state) phase is the Accept phase, where the leader proposes a sequence of values to the Acceptors, who must store the proposed values. Learners learn of the committed values from the acceptors.

Implementations can choose to have these processes running on different machines in a disaggregated way or have a single process act as all three roles. The latter is precisely what Raft does. The Raft leader is the Distinguished Proposer; the leader and followers are all Acceptors, and the state machine on each member that sits atop this replication layer acts as a Learner.

Coming back to Neon, it chose Paxos because of the ability to disaggregate the roles. If a Postgres database fails, a new one must be spun up to take its place. But what happens if the first database node is actually still operating? Now we have two primaries and what is known as split-brain. Split-brain leads to data inconsistency which we really want to avoid. What we need is a way of ensuring that only one primary can write to the WAL service at a time. We can’t prevent two primaries from existing, but we can ensure that only one can make progress while the other remains blocked. Paxos solves this problem.

Each Postgres database is a proposer, each Safekeeper is an acceptor and the Pageservers are learners. Before a Neon Postgres database can write WAL records, it must get elected by the Safekeepers as the leader (distinguished proposer). Once elected, it is free to write (or propose) a sequence of WAL records to the Safekeepers. Once a majority of Safekeepers have acknowledged a WAL record, the database treats that record as committed.
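A minimal sketch of that commit rule, assuming a three-node Safekeeper cluster and a term number that fences out a stale primary (this is my simplification, not Neon's actual protocol code):

```python
# Minimal sketch of the proposer-side commit rule: a WAL record is committed
# once a majority of Safekeepers (acceptors) acknowledge it for the current
# term. Illustration only, not Neon's safekeeper protocol implementation.
from dataclasses import dataclass

@dataclass
class Safekeeper:
    term: int = 0
    flushed_lsn: int = 0

    def append(self, term: int, lsn: int) -> bool:
        # Reject writes from a stale proposer (an old primary after fail-over).
        if term < self.term:
            return False
        self.term = term
        self.flushed_lsn = max(self.flushed_lsn, lsn)
        return True

def commit_wal_record(safekeepers: list[Safekeeper], term: int, lsn: int) -> bool:
    acks = sum(sk.append(term, lsn) for sk in safekeepers)
    return acks >= len(safekeepers) // 2 + 1   # majority quorum (2 of 3)

cluster = [Safekeeper(), Safekeeper(), Safekeeper()]
assert commit_wal_record(cluster, term=1, lsn=100)      # elected primary makes progress
assert not commit_wal_record(cluster, term=0, lsn=101)  # stale primary is fenced out
```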

Fig 5. Paxos roles spread over different components.

There is a consensus protocol between Postgres databases and the Safekeepers that governs the fail-over process, based on the Aurora design. If you’re interested (and motivated), you can learn the fundamentals via Neon’s TLA+ specification.

Pageservers

The Page service is a fleet of Pageservers. Currently, one database is serviced by one Pageserver, and one Pageserver can service multiple databases. A sharding mechanism is being worked on whereby a single database can be spread over multiple Pageservers. Sharding will allow for improved performance and better heat management for large databases.

Pageservers sit in both the read and write path. They are stateless as they do not durably store data, but they play a critical role in the durability chain of responsibility. Just like in vanilla Postgres, the WAL must be cleaned periodically to prevent it from using up all available disk space. The checkpointing process is responsible for this in vanilla Postgres by flushing dirty pages to disk and marking the WAL as applied up to that point. The Pageservers hold this checkpointing role in Neon.

Pageservers learn of the committed WAL records in their role as Paxos learners. WAL records are replayed over the latest image layers, and the resulting layer files are eventually uploaded to S3. The Pageservers then communicate to the Safekeepers the LSN of the last applied WAL record, allowing the Safekeepers to safely garbage collect the WAL below that point.
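A minimal sketch of that feedback loop (illustrative names, not Neon's actual API): the Pageserver reports how far it has durably processed and uploaded, and the Safekeepers truncate the WAL below that LSN.

```python
# Minimal sketch of WAL truncation driven by Pageserver feedback.
# Names are illustrative, not Neon's actual API.

class SafekeeperWal:
    def __init__(self):
        self.records = {}            # lsn -> WAL record bytes

    def truncate_below(self, lsn: int) -> None:
        # Safe to garbage collect: the Pageserver has applied these records
        # and uploaded the resulting layer files to S3.
        self.records = {l: r for l, r in self.records.items() if l >= lsn}

class Pageserver:
    def __init__(self):
        self.last_uploaded_lsn = 0

    def ingest_and_upload(self, lsn: int, record: bytes) -> None:
        # Apply the WAL record to layer files and (eventually) upload them to S3.
        self.last_uploaded_lsn = lsn

wal, ps = SafekeeperWal(), Pageserver()
for lsn in (1, 2, 3):
    wal.records[lsn] = b"record"
    ps.ingest_and_upload(lsn, wal.records[lsn])
wal.truncate_below(ps.last_uploaded_lsn)   # only LSN 3 remains on the Safekeeper
```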

The read path

Reads will first hit the local cache in the Postgres memory buffers, and only in the case of a cache miss will a get page request be sent to the relevant Pageserver. Pageservers are also caches, implementing an LRU cache eviction algorithm. If another cache miss occurs on the Pageserver, it will fetch a block of pages from cloud storage (in the form of layer files).

Fig 6. Multiple layers of caching sit between the query processor and cloud object storage.

The granularity of cache management on the Pageservers is per layer file, which is usually about 128-256MB.
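Putting the read path together, here is a minimal sketch of the three tiers: shared buffers, a Pageserver LRU over layer files, and object storage behind it. The names and structure are illustrative only.

```python
# Minimal sketch of the tiered read path: Postgres shared buffers ->
# Pageserver LRU cache of layer files -> cloud object storage.
from collections import OrderedDict

class PageserverCache:
    """LRU cache of layer files, backed by object storage."""
    def __init__(self, capacity: int, object_store: dict):
        self.layers = OrderedDict()
        self.capacity = capacity
        self.object_store = object_store

    def get_layer(self, layer_id: str) -> bytes:
        if layer_id in self.layers:
            self.layers.move_to_end(layer_id)               # cache hit
            return self.layers[layer_id]
        layer = self.object_store[layer_id]                 # cache miss: fetch from S3
        self.layers[layer_id] = layer
        if len(self.layers) > self.capacity:
            self.layers.popitem(last=False)                 # evict least recently used
        return layer

def get_page(shared_buffers: dict, pageserver: PageserverCache,
             page_id: str, layer_for_page: dict) -> bytes:
    if page_id in shared_buffers:                           # tier 1: Postgres memory
        return shared_buffers[page_id]
    layer = pageserver.get_layer(layer_for_page[page_id])   # tier 2, falling back to tier 3
    page = materialize_page(layer, page_id)
    shared_buffers[page_id] = page
    return page

def materialize_page(layer: bytes, page_id: str) -> bytes:
    return layer                                            # placeholder for image+delta replay
```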

Neon supports read replicas, but unlike Aurora, the database primary does not stream WAL records to the read replicas. Instead, this responsibility is moved to the Safekeepers. This also places read replicas in the role of Paxos learners, alongside the Pageservers.

Fig 7. Safekeepers stream WAL records to read replicas.

All writes go through the Postgres primary instance, so caching can be used to improve performance and reduce load on the storage backend. Many workloads can leverage local caching to the extent that requests to retrieve pages from the Page service are relatively infrequent. Many real-life workloads fit entirely in RAM on the Postgres instance, requiring the backend service only for durability and for standing up a new database instance after a period of inactivity has caused the prior one to be stood down. The local cache still helps when a workload must read from the Pageservers: once a page has been served from its Pageserver, it is unlikely that the same page will need to be retrieved again in the short term. Postgres also does its own readahead, which reduces the impact of the latency of fetches from the Pageservers.

Fig 8. Readahead can reduce latency by prefetching data from Pageservers.
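Here's a minimal sketch of the readahead idea: while scanning sequentially, keep a handful of future page requests in flight so that network latency overlaps with query processing. This is an illustration, not the actual Postgres prefetch machinery.

```python
# Minimal sketch of readahead: request the next few pages from the Pageserver
# before the scan needs them, overlapping network latency with processing.
from concurrent.futures import ThreadPoolExecutor

READAHEAD_DISTANCE = 8   # pages to keep in flight ahead of the scan position

def sequential_scan(num_pages: int, fetch_page):
    with ThreadPoolExecutor(max_workers=READAHEAD_DISTANCE) as pool:
        in_flight = {}
        for page_no in range(num_pages):
            # Keep up to READAHEAD_DISTANCE future pages in flight.
            for ahead in range(page_no, min(page_no + READAHEAD_DISTANCE, num_pages)):
                if ahead not in in_flight:
                    in_flight[ahead] = pool.submit(fetch_page, ahead)
            yield in_flight.pop(page_no).result()   # usually already arrived
```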

Because caching is so effective, a recent addition to Neon allows the Postgres local cache to spill over onto a local NVMe drive. The more data that can be cached on the Postgres instance itself, the better the performance and the lower the load on the storage backend.

Cloud object storage

Neon falls into the Kora camp with respect to leveraging cheap, durable cloud object storage. Neon reduces storage costs by offloading data to cloud storage, with its disaggregated storage service acting as a fast, fault-tolerant cache in front. An important consideration is the storage amplification required by the data branching and point-in-time querying/recovery that Neon offers.

Fig 9. Neon chooses the fast storage in front of higher latency object storage approach.

Neon doesn’t have it quite as easy as Kora when it comes to integrating object storage: Kora benefits from a sequential access pattern that makes prefetching extremely effective, whereas a database's access pattern is less predictable. However, Neon has great opportunities for caching in Postgres itself, and table scans can benefit from readahead that prefetches pages. This makes object storage a no-brainer for Neon.

Multi-tenancy

As I discussed in the introduction chapter, the challenges of serverless multi-tenancy are obtaining high resource utilization while maintaining strong tenant isolation. I’ll discuss these challenges in terms of the compute and storage layers.

Compute layer

Neon achieves multi-tenancy with its own virtualization tech to pack multiple tenants onto shared compute instances. Each Postgres database operates in its own NeonVM based on QEMU (with KVM acceleration when available), which is scheduled in Kubernetes (presumably with something like KubeVirt). This means that each Postgres instance itself is single-tenant, and multi-tenancy is achieved through the smart scheduling of multiple NeonVMs per K8s node as well as the scaling to zero of idle databases.

Tenant isolation and elastic scaling in the compute layer make for an interesting subject in Neon. Each tenant database has only a single primary Postgres instance, so scaling is only possible in one dimension: up and down. This presents the challenge of achieving high hardware utilization while avoiding nasty performance issues due to resource contention, or even pods being killed by the scheduler. Postgres instances are assigned resources (CPU and memory) according to their needs, and those needs change over time, sometimes very quickly. Given that scaling is vertical and multiple VMs are packed onto shared hosts, the underlying hosts can run out of CPU and/or memory. Running out of CPU can cause latency spikes, but running out of memory is generally worse, as it can result in Postgres being OOM-killed or the K8s scheduler killing the pod. Neon handles both these challenges in the following ways.

Each Postgres instance runs in a cgroup in the NeonVM. Alongside Postgres runs the vm-informant process, which receives notifications from the cgroup of memory.high events. The vm-informant then requests additional memory from a shared autoscaler-agent process running on the host. This autoscaler-agent, in turn, coordinates with the (modified) K8s scheduler to get approval to assign more resources to the VM. The K8s scheduler is considered the single source of truth, so all resource limit changes must go through it to avoid race conditions where the scheduler and the autoscaler attempt conflicting changes.
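A minimal sketch of that handshake, with names that mirror the components above but logic that is purely illustrative:

```python
# Minimal sketch of the upscale handshake: a cgroup memory.high event triggers
# the vm-informant, which asks the host's autoscaler-agent, which in turn asks
# the (modified) K8s scheduler for approval before resources are raised.
# Illustration only, not Neon's autoscaler code.
from dataclasses import dataclass

@dataclass
class NeonVM:
    memory_gib: int

class Scheduler:
    """Single source of truth for resource allocations on a host."""
    def __init__(self, host_memory_gib: int):
        self.free_gib = host_memory_gib

    def approve(self, extra_gib: int) -> bool:
        if extra_gib <= self.free_gib:
            self.free_gib -= extra_gib
            return True
        return False            # host is full; live migration may be needed instead

class AutoscalerAgent:
    def __init__(self, scheduler: Scheduler):
        self.scheduler = scheduler

    def request_upscale(self, vm: NeonVM, extra_gib: int) -> bool:
        # All limit changes go through the scheduler to avoid conflicting decisions.
        if self.scheduler.approve(extra_gib):
            vm.memory_gib += extra_gib
            return True
        return False

class VmInformant:
    def __init__(self, vm: NeonVM, agent: AutoscalerAgent):
        self.vm, self.agent = vm, agent

    def on_memory_high(self) -> bool:
        # Fired when the cgroup reports a memory.high event inside the NeonVM.
        return self.agent.request_upscale(self.vm, extra_gib=1)

vm = NeonVM(memory_gib=2)
informant = VmInformant(vm, AutoscalerAgent(Scheduler(host_memory_gib=8)))
assert informant.on_memory_high() and vm.memory_gib == 3
```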

Fig 10. The components involved in auto-scaling Postgres databases.

There is a bidirectional arrow between the autoscaler-agent on the host and the vm-informant in the NeonVM because when the autoscaler-agent wants to downscale a VM, it must first check with the informant that the Postgres instance can tolerate a reduction in memory.

You may wonder why Neon chose QEMU instead of a microVM such as Firecracker. QEMU has a much larger overhead in terms of size and memory footprint, so Firecracker would be attractive, as it could lead to denser packing of VMs per host. However, one of the limitations of using Postgres is that all writes go to a single primary; any given database can only scale its writes to the size of a single server. Given that horizontal scaling is off the cards, scaling up (and down) is all that is left. This is where QEMU becomes important: what do you do when multiple VMs on the same host all want to scale up their resources? The answer is live migration, something that Firecracker does not support.

Neon uses QEMU’s live migration capability to migrate VMs to less loaded hosts. This migration is a phased process where most of the VM is copied while the original VM is still running, and then a last memory copy is performed during the switch-over, which causes a short pause to the Postgres instance but maintains consistency. The IP address is maintained and TCP connections remain intact. It’s a neat approach to handling vertical scaling on shared hosts.
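For illustration, here is the generic pre-copy live migration loop in sketch form (the methods on the collaborating objects are assumptions for the example; this is not QEMU's implementation):

```python
# Minimal sketch of pre-copy live migration: iteratively copy memory while the
# VM keeps running, then pause briefly to copy the remaining dirty pages and
# switch over. Generic algorithm for illustration only; the vm and destination
# interfaces are assumed for the example.

def live_migrate(vm, destination, max_rounds: int = 10, pause_threshold: int = 64):
    dirty_pages = set(vm.all_pages())                  # round 1: everything is "dirty"
    for _ in range(max_rounds):
        if len(dirty_pages) <= pause_threshold:        # remainder small enough to stop-and-copy
            break
        destination.copy(vm.read_pages(dirty_pages))   # VM still running: pages may re-dirty
        dirty_pages = vm.pages_dirtied_since_last_copy()
    vm.pause()                                         # short pause: copy the final dirty set
    destination.copy(vm.read_pages(dirty_pages))
    destination.resume(ip_address=vm.ip_address)       # same IP, TCP connections stay intact
    vm.shutdown()
```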

High resource utilization is achieved by this combination of right-sizing VM resources to the load and moving VMs preemptively to avoid host overload. Additionally, scaling to zero means that hardware resources do not sit idle, wasted on databases that are not receiving traffic. It is a strategy based on elasticity and mobility.

Cold starts are a problem for Neon due to the scale-to-zero architecture. In the compute layer, cold starts are mitigated by keeping a pool of empty VMs around, each ready to take on the identity of a tenant rapidly. The less tractable problem is that of avoiding cold starts in the storage layer.

Storage layer

Neon’s disaggregated storage layer achieves multi-tenancy via a shared-process resource-sharing model with logical isolation. Shared processes with logical isolation are a common pattern for stateful services, as packing tenants into single-tenant VMs or pods would mean moving large amounts of data around, which is expensive. That said, Neon does not currently use sharding, a common technique for spreading load across a larger set of shared servers that generally makes heat management easier. Neon is working on sharding at the time of writing, and by the time you read this, it may already be rolled out. I will try to update the post when the feature hits production.

The separation of the storage backend into WAL and Page services helps with multi-tenancy as each experiences a different type of workload which requires different load balancing and scaling strategies. Each can be deployed on hardware optimized for the workload:

  • The WAL service needs low-latency writes (for faster commits), and its data size stays relatively small. Smaller but faster disks can be provisioned, and servers do not require huge amounts of memory.

  • The Page service needs larger disks and more memory.

Heat management is somewhat limited today due to the lack of sharding, but as I stated earlier, this is in the works. Tenants can be transferred between Safekeepers and between Pageservers, although such transfers are less common with Safekeepers due to their lower contention levels compared to Pageservers. In the case of Pageservers, Neon typically executes "warm" cut-overs by proactively fetching some state from cloud storage before performing the cutover. In addition, during the phased cut-over, both the original and new Pageserver consume the Write-Ahead Log (WAL) to ensure they have the most up-to-date data. Cold cut-overs are less frequent and are primarily reserved for very small databases and free-tier instances.
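A minimal sketch of a warm cut-over (the methods on the collaborating objects are assumptions for illustration, not Neon's actual APIs):

```python
# Minimal sketch of a "warm" Pageserver cut-over. Illustration only; the
# tenant, Pageserver, Safekeeper, and object store interfaces are assumed.

def warm_cutover(tenant, old_pageserver, new_pageserver, safekeepers, object_store):
    # 1. Pre-warm: the new Pageserver fetches the tenant's hottest layer files from S3.
    for layer_id in old_pageserver.hot_layers(tenant):
        new_pageserver.download_layer(object_store, layer_id)

    # 2. Overlap: both Pageservers consume the WAL stream from the Safekeepers,
    #    so the new one catches up to the latest LSN while the old one still serves reads.
    safekeepers.add_wal_consumer(tenant, new_pageserver)

    # 3. Switch: point the tenant's compute at the new Pageserver, then detach the old one.
    tenant.pageserver = new_pageserver
    safekeepers.remove_wal_consumer(tenant, old_pageserver)
    old_pageserver.evict_tenant(tenant)
```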

High resource utilization is achieved by:

  1. Allocating enough Safekeepers and Pageservers to handle the aggregate load. As the aggregate load changes, the number of each node type can be increased or decreased.

  2. Moving tenants between nodes to avoid hotspots or to consolidate traffic when removing excess node capacity.

Cold starts

Cold starts are the thorniest of thorns with the Neon architecture, albeit one of choice. On the compute layer, this is addressed by compute pools which consist of ready-to-go VMs that can assume the identity of a tenant quickly. However, if the Pageserver does not host the layer files needed to serve the initial requests, then the first query will experience the latency cost of cloud object storage. 

Cold starts are a business decision made by Neon, not a fundamental limitation of the architecture. Databases do not have to be scaled down to zero, and Pageservers could host the entire dataset. The reason they don’t is to provide the best price-performance. There is nothing, beyond a product/business decision, stopping Neon from offering varying SKUs that avoid cold starts. Many customers would be willing to pay for additional features such as:

  • Extending the idle time period that triggers scaling a VM to zero.

  • Never scaling the customer to zero but scaling down to a configured minimum size.

  • Pageservers always keep some state locally for a tenant, primarily the layers that contain the most essential metadata: which relations exist and their sizes at the latest LSN. However, Pageservers could keep some additional, configurable amount of data for that customer. Some customers may be willing to pay to keep all data hosted on Pageservers, while others may be happy to keep historical data only in cloud object storage.

I don’t know whether Neon offers any of these; I’m just musing on mitigations that could be possible. Cold starts will likely be something that Neon optimizes over the long term, just as the functions-as-a-service providers have done.

Conclusions

Neon is a great case study for taking a traditionally single-tenant, on-premise piece of software and making it cloud-native. It employs a mixture of common techniques, such as the separation of storage and compute, distributed write-ahead logs, and offloading to cloud object storage, but also unique techniques, such as using virtualization for elasticity and mobility in the compute layer. Neon is also a poster child for why Raft should not simply be the default replication protocol out there. The use of Paxos in this disaggregated architecture demonstrates that there are alternatives that might fit your needs better than the more monolithic Raft.

The architecture is evolving quickly, and it would be interesting to come back a year from now to see the progress that the Neon team has made. I am especially interested in the sharding as it will unlock a number of improvements around scalability and heat management.