Is it economical to build fault-tolerant transactional data systems directly on S3 Express One Zone, instead of using replication? Read on for an analysis.
Cloud object storage is becoming the universal storage layer for a wealth of cloud data systems. Some systems use object stores as the only storage layer; these tend to be analytics systems that can tolerate the multi-second latencies of object storage. Transactional systems want single-digit millisecond latencies, or latencies in the low tens of milliseconds, and therefore don’t write to object stores directly. Instead, they land data on a fast replicated write-ahead-log (WAL) and offload data to an object store for read-optimized, economical long-term storage. Neon is a good example of this architecture: writes hit a low-latency replicated write-ahead-log based on Multi-Paxos, and data is eventually written to object storage.
Both Kafka and Confluent’s Kora also use this architecture, which in streaming tends to be called tiered storage. Essentially, the Kafka/Kora brokers are a replicated write-cache and serving layer, and data gets compacted into larger objects and offloaded to more economical cloud storage asynchronously.
Call it a write-ahead-log or a durable write-cache, the idea is that data lands on a stateful replicated log layer before being offloaded to an object store. For transactional systems that need low latency, this has been the required architecture. The arrival of S3 Express One Zone has raised the question of whether this fast replicated write-ahead-log layer can be replaced by Express One Zone, forgoing the need for the stateful replication layer. S3 Express One Zone is a low-latency tier of S3 that offers single-digit millisecond latency, though it only stores data within a single availability zone. Given that Express One Zone storage is around 7 times more expensive than Standard, it isn’t suitable for long-term storage. But it can be suitable for the durable write-ahead-log/write-cache role that Kafka/Kora/Neon fill with replication today.
Replication and S3 Express One Zone have different cost models, and whether either can deliver a cost-competitive service depends on a number of factors. That is the focus of this blog post.
Five durable write-ahead-log (WAL) options
I work at Confluent, and the vast majority of production dedicated clusters in Confluent Cloud are multi-AZ. Most single-AZ clusters are used for development and QA. Most organizations want high availability so that in the event that a single zone becomes degraded or goes offline, the system continues to operate. While zonal outages do not happen every day, they do happen, and it only takes one long outage to cost affected organizations real money, cause reputational damage, and do material harm to the business - hence the reluctance of organizations to put all their eggs in one zone.
That said, we’ll compare both single-AZ and multi-AZ write-ahead-log options.
The WAL Cost Model
For this cost study, we’ll assume that the WAL keeps the last 6 hours of data. The offloading could happen much earlier, but we’ll price for 6 hours of local retention. We’ll also focus on the write path, as this is the dominant cost driver of both replication and S3 Express One Zone. Replication can use tricks such as fetch-from-follower to avoid cross-AZ data transfers on consumption and the GET costs of S3 Express One Zone are less than a tenth of the cost of PUTs.
The Replication Cost Model
A stateful replication layer has three costs:
Compute: The compute instances (servers).
Storage: The storage drives (such as EBS)
Networking: The cross-AZ data transfer costs. Single-zone clusters incur no cross-AZ charges, and Microsoft Azure recently announced that it does not charge for cross-AZ data transfer at all. So the networking costs can be zero in some deployments.
This analysis ignores the compute instance costs, as the requirements for CPU and memory are wholly dependent on the service and its implementation. We’ll focus on the base storage and network resources required.
To calculate storage requirements:
Total storage throughput (MB/s) = replication factor * aggregate ingress throughput (MB/s).
Total storage size = total storage throughput * 6 hours * 60 minutes * 60 seconds.
Storage throughput per node = Total storage throughput / number of nodes.
Storage size per node = Total storage size / number of nodes.
To calculate the cross-AZ data transfer:
Producer cross-AZ throughput (MB/s) = ⅔ * aggregate ingress throughput MB/s (with producers spread across 3 AZs, the leader is in the producer’s AZ only ⅓ of the time, so on average 2 out of 3 bytes cross an AZ).
Replication cross-AZ = 2 * aggregate ingress throughput MB/s (each byte that lands at a leader will be replicated to two followers).
Total cross-AZ = 2.66 * aggregate ingress throughput MB/s.
For example, given 3 nodes, a replication factor of 3 and an aggregate ingress of 100 MB/s, we get:
Total storage throughput = 300 MB/s, with 100 MB/s per node.
Total storage size = 6.5 TB, with 2.16 TB per node.
Cross-AZ = 2.66 * 100 MB/s = 266 MB/s.
Let’s calculate some examples:
1 MB/s, 3 nodes
Storage: 3x gp3, 125 MB/s, 3000 IOPs, 10 GB = $2 per month.
Cross-AZ: $0.02 * 0.00266 GB/s * 30 * 24 * 60 * 60 = ~$138 per month.
100 MB/s, 3 nodes
Storage: 3x gp3, 125 MB/s, 3000 IOPs, 2.5TB = $600 per month.
Cross-AZ: $0.02 * 0.266 GB/s * 30 * 24 * 60 * 60 = ~$13,800 per month.
1000 MB/s, 9 nodes
Storage: 18x gp3, 350 MB/s, 5000 IOPs, 7.5TB = $1850 per month.
Cross-AZ: $0.02 * 2.66 GB/s * 30 * 24 * 60 * 60 = ~$138,000 per month.
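If you want to sanity-check these numbers, here’s a minimal Python sketch of the sizing and cross-AZ formulas above. It assumes AWS list prices ($0.02/GB cross-AZ, $0.08/GB-month base gp3), 6 hours of retention and decimal units; it ignores the extra gp3 IOPS/throughput provisioning and the volume headroom I priced in, and all names are illustrative.

```python
# Sketch of the replication WAL sizing and cost formulas described above.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60
RETENTION_SECONDS = 6 * 60 * 60
CROSS_AZ_PER_GB = 0.02       # $/GB, list price before discounts
GP3_PER_GB_MONTH = 0.08      # $/GB-month, base gp3 storage only

def replication_wal(ingress_mb_s, nodes, replication_factor=3, xaz_discount=0.0):
    # Storage sizing
    total_storage_mb_s = replication_factor * ingress_mb_s
    total_storage_gb = total_storage_mb_s * RETENTION_SECONDS / 1000
    per_node_gb = total_storage_gb / nodes

    # ~2/3 of produced bytes cross an AZ on the way to the leader,
    # plus two full copies for follower replication.
    xaz_gb_per_s = (2/3 + 2) * ingress_mb_s / 1000
    xaz_cost = xaz_gb_per_s * SECONDS_PER_MONTH * CROSS_AZ_PER_GB * (1 - xaz_discount)

    storage_cost = total_storage_gb * GP3_PER_GB_MONTH
    return per_node_gb, storage_cost, xaz_cost

per_node_gb, storage_cost, xaz_cost = replication_wal(ingress_mb_s=100, nodes=3)
print(f"storage per node: {per_node_gb:,.0f} GB")       # ~2,160 GB
print(f"gp3 storage: ${storage_cost:,.0f}/month")       # ~$520 before volume headroom
print(f"cross-AZ transfer: ${xaz_cost:,.0f}/month")     # ~$13,800
```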
The replication model is dominated by cross-AZ data transfer costs, taking around 90-95% of the storage + networking combined cost with 6 hours of local retention. Perhaps because cross-AZ costs can be so large, the CSPs tend to give out sizable cross-AZ discounts, and Azure simply doesn’t charge for cross-AZ at all. The larger the aggregate data transfer, the larger the discounts an organization can typically obtain. I have seen discounts above 85% for organizations with a large cloud footprint.
Therefore, the dominant cost factors for a multi-AZ replication-based system are the throughput combined with the cross-AZ data transfer discount. It’s worth remembering that on Azure, there are no cross-AZ costs at all.
The Express One Zone Cost Model
Where replication in multi-AZ deployments is dominated by the number of bytes that cross availability zones, the dominant cost driver for Express One Zone is the request rate, with an additional charge that depends on request size.
The S3 Express One Zone pricing:
Storage: $0.16 per GB per month.
PUT requests:
$0.0025 per 1000 PUT requests.
$0.008/GB for all bytes that exceed 512KB in PUT requests.
Any negotiated S3 % discount.
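As a rough sketch, this pricing can be encoded as follows (list prices, no discount). The overage charge is modelled as applying to the bytes of each request beyond 512 KB, and the function names are illustrative rather than any AWS API.

```python
# Sketch of the S3 Express One Zone pricing listed above (list prices, no discount).
EXPRESS_STORAGE_PER_GB_MONTH = 0.16
EXPRESS_PUT_PER_1000 = 0.0025
EXPRESS_PUT_OVERAGE_PER_GB = 0.008   # for PUT bytes beyond 512 KB per request
OVERAGE_THRESHOLD_KB = 512

def express_put_cost(puts_per_month, request_size_kb):
    """Monthly PUT cost for a given request count and request size."""
    request_cost = puts_per_month / 1000 * EXPRESS_PUT_PER_1000
    overage_kb = max(0, request_size_kb - OVERAGE_THRESHOLD_KB) * puts_per_month
    return request_cost + overage_kb / 1_000_000 * EXPRESS_PUT_OVERAGE_PER_GB

def express_storage_cost(ingress_mb_s, retention_hours=6, zones=1):
    """Monthly storage cost for the retained window, per zone written to."""
    retained_gb = ingress_mb_s * retention_hours * 3600 / 1000 * zones
    return retained_gb * EXPRESS_STORAGE_PER_GB_MONTH
```

Multi-zone configurations simply multiply the request (and storage) cost by the number of zones written to.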
Regarding discounts, I don’t believe large customers pay list price for anything. However, S3 discounts are typically a lot lower than cross-AZ data transfer discounts, and Azure effectively has a data transfer discount of 100%. I’ve personally not seen discounts above the 25-30% mark for S3, and those are for the really big cloud footprints, but that’s my own experience.
An S3 Express One Zone write-cache can consist of one, two or three availability zones. Multiple zones can be achieved by writing each object to multiple buckets, where each bucket is in a different zone. A 3-zone configuration can use a majority-quorum approach so that a write is deemed successful once 2 out of 3 writes have been confirmed (to avoid latency spikes), with the third eventually completing.
At the crux of Express One Zone costs is the tension between request rate (and request size) and the tolerated buffering latency. Where replication costs are unaffected by the number of actual requests exchanged between nodes, S3 Express One Zone is all about the number of requests. Given that Express One Zone is a low-latency tier capable of single-digit millisecond writes, it doesn’t make sense to buffer data for a long time before writing a request. For a transactional data system, the buffering time must be kept short. As we’ll see in this analysis, this can be acceptable for high throughput workloads but presents a real problem for low throughput workloads.
The following relationships exist regarding request rate, request size, desired latency and throughput:
There is a linear relationship between request size and the amount of buffering time on the proxy, for a fixed throughput.
There is an inverse linear relationship between request size and request rate for a fixed throughput.
There is a linear relationship between the request rate and the number of proxies, for a fixed buffering time.
These relationships come into play to affect costs. The following charts show the Buffering Time and Put Rate relationships for a 1 GB/s throughput over 9 proxies. For a fixed throughput and proxy count, the longer we buffer, the larger the request sizes and the lower the put rate.
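To make those relationships concrete, here’s a small sketch under the same assumptions as the charts: 1 GB/s spread evenly over 9 proxies, each proxy shipping whatever it has buffered as a single PUT. The function names are illustrative.

```python
# Request size grows linearly with buffering time; the PUT rate falls inversely.
def request_size_kb(throughput_mb_s, proxies, buffer_ms):
    per_proxy_kb_s = throughput_mb_s * 1000 / proxies
    return per_proxy_kb_s * buffer_ms / 1000

def aggregate_put_rate(proxies, buffer_ms):
    return proxies * 1000 / buffer_ms   # each proxy sends one PUT per buffer interval

for buffer_ms in (1, 5, 10, 50, 100):
    size = request_size_kb(1000, 9, buffer_ms)
    rate = aggregate_put_rate(9, buffer_ms)
    print(f"{buffer_ms:>3} ms buffering -> ~{size:,.0f} KB requests at ~{rate:,.0f} PUTs/s")
```

At this throughput and proxy count, roughly 5 ms of buffering is already enough to reach 512 KB requests, which matters for the cost figures below.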
The aggregate PUT rate, plus the aggregate bytes beyond the 512 KB request size, is the main cost driver. In the monthly costs chart below, we see that Express One Zone (1-zone) is half the cost of Standard for requests under 512 KB, which reflects the pricing of the two tiers: $0.0025 per 1000 requests for Express One Zone and $0.005 per 1000 requests for Standard.
Because Express One Zone charges an additional fee for requests over 512KB, costs jump up again once the request size exceeds 512KB and then start to drop slightly. Where Standard benefits from ever-larger and less frequent requests, increasing the request size beyond 512 KB does not benefit Express One Zone at all.
The best-case request costs for each:
Standard: $200 per month, or $0.000072 per GB (64 MB request size)
Express One Zone (1az): $13,000 per month, or $0.005 per GB (512KB request size)
Express One Zone (2az): $26,000 per month, or $0.01 per GB (512KB request size)
Express One Zone (3az): $39,000 per month, or $0.015 per GB (512KB request size)
The good news for this 1 GB/s workload over 9 proxies is that only 5ms of buffering time is required to reach 512 KB requests. However, for lower throughputs, there is a request rate floor based on the maximum buffering time to be tolerated. For example, a 10ms buffering time will result in a minimum rate of 100 requests per second per proxy.
Let’s examine how put rate, put size, and put costs change according to throughput for a fixed number of proxies (6 in this case), and with 1-zone and 3-zone configurations.
We see that with 6 proxies, there is a baseline cost of $3,888 per month (1-zone) and $11,664 per month (3-zone). This baseline affects the lower throughput end, and means that whether the workload is 1 MB/s or 250 MB/s the request costs are the same.
As I said before, the number of proxies impacts the request cost, as for a fixed buffering time, there is a linear relationship between proxy count and put rate (up until 512 KB requests sizes are reached), and therefore request cost. So it would be better to implement auto-scaling, where the number of proxies grows and shrinks in increments of 3 (assuming 3 AZs). Let’s say that each proxy should handle 100 MB/s of the aggregate ingress throughput, then the costs would look like the charts below.
The baseline put rate changed from 600/s (1-zone) and 1800/s (3-zone) to 300/s (1-zone) and 900/s (3-zone) when we allowed the number of proxies to go down to the minimum count of 3. This reduced the baseline cost from $3,888 to $1,944 per month (1-zone) and from $11,664 to $5,832 per month (3-zone). The baseline also now only applies up to 150 MB/s, down from 250 MB/s with the fixed 6-proxy configuration. However, while the cost profile improved with auto-scaling, we’re still paying the flat rate of $5,832 per month in request costs for workloads of both 1 MB/s and 100 MB/s (with a maximum buffering time of 10ms and a maximum request size of 512 KB).
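Here is a small sketch of that request-rate floor and the resulting baseline cost, using the Express One Zone list PUT price with no discount and the proxy counts from the charts above.

```python
# With a maximum buffering time, each proxy must send at least 1000/buffer_ms PUTs
# per second, however low the throughput; extra zones multiply the request count.
PUT_PER_1000 = 0.0025
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

def baseline_monthly_put_cost(proxies, buffer_ms, zones):
    min_puts_per_s = proxies * (1000 / buffer_ms) * zones
    return min_puts_per_s * SECONDS_PER_MONTH / 1000 * PUT_PER_1000

print(baseline_monthly_put_cost(6, 10, 1))   # ~$3,888  (fixed 6 proxies, 1 zone)
print(baseline_monthly_put_cost(6, 10, 3))   # ~$11,664
print(baseline_monthly_put_cost(3, 10, 1))   # ~$1,944  (scaled down to 3 proxies)
print(baseline_monthly_put_cost(3, 10, 3))   # ~$5,832
```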
Given that a write should take single-digit milliseconds, I think 10 ms of buffering already adds a large overhead (already greater than the average end-to-end latency of replication). Because we must fix the maximum buffering latency in order to support a low-latency workload, a proxy has to send 100 (tiny) requests/s at a minimum, even for low throughputs. Let’s look at the same charts for the 1-20 MB/s range.
The cost-per-GB rises to $2.25 for a 1 MB/s workload (2.5 TB per month) with a 3-zone configuration. This is far from the best case of $0.015/GB for a 3-zone configuration at high throughput. Notice how request sizes fall to as little as 3 KB in order to satisfy the 10ms buffering latency at 1 MB/s across 3 proxies.
To summarize, the cost model of a fault-tolerant WAL based on S3 Express One is more complex than that of replication. It all depends on the mix of:
Throughput
Number of proxies
Maximum buffering latency
Maximum request size (economically, 512 KB is the largest one should use; going higher would only make sense if larger requests provided some other benefit to the system despite the higher cost).
It’s also worth noting that object-store-based logs require a low-latency addressing and sequencing component that enforces ordering and contains the logical-to-physical data mapping. This is most likely implemented as a State-Machine-Replication (SMR) service, for example one based on Raft. This splits the data plane into a flat address space of S3 data objects plus an ordered replicated log of object metadata. The costing of this component has been omitted from this analysis.
Fragmentation and small request sizes
This cost model assumes the co-location of data of multiple collections (tables, topic partitions etc) in shared objects. Not doing so simply isn’t cost-effective as it would drive up the cost-per-GB massively. The object sizes of S3 Express One Zone should be <= 512 KB, whereas S3 Standard objects will typically be a lot larger (because we are not penalized for large objects and we are not trying to do low latency writes meaning we can buffer for longer).
This data mixing in small object sizes adds a large amount of data fragmentation. When writing an object, we sort the data of each object by collection (e.g. table or topic partition), which means that we have a number of contiguous blocks of data per object. With large objects, we get fewer, larger contiguous blocks, and with 512 KB objects, we get more numerous, smaller contiguous blocks. The impact of this depends on whether read requests are served from the WAL or not. For an architecture such as Neon, no reads are directed to the WAL at all. However, for a Kafka API workload, catch-up consumers might need to be served from the WAL objects if the requested data is no longer cached locally. Such catch-up consumers would elevate the number of GET requests relative to the bytes read (i.e. they must read small sections of a large number of objects).
Comparing Replication to S3 Express One Zone
There are infinite combinations to compare, so I’ll just take a selection of throughputs, with a limited selection of discounts.
1 MB/s
S3 Express One Zone:
3 proxies, max 10ms buffering, max 512 KB requests
3 proxies, fixed 100ms buffering, request size not limited
Replication: 3 nodes
50 MB/s
S3 Express One Zone:
3 proxies, max 10ms buffering, max 512 KB requests
3 proxies, fixed 100ms buffering, request size not limited
Replication: 3 nodes
500 MB/s
S3 Express One Zone:
6 proxies, max 10ms buffering, max 512 KB requests
6 proxies, fixed 100ms buffering, request size not limited
Replication: 9 nodes
1000 MB/s
S3 Express One Zone:
12 proxies, max 10ms buffering, max 512 KB requests
12 proxies, fixed 100ms buffering, request size not limited
Replication: 9 nodes
Storage
For replication, the EBS drives are sized according to the number of nodes, plus some additional margin. For example, a 1 MB/s throughput only needs three nodes, and each node hosts a full copy of the 6 hours. This equates to roughly 21 GB of storage per node, but I priced it at 3x 30 GB gp3 volumes.
As you can see, the EBS costs are just below those of the 2-AZ Express One Zone configuration. Of course, S3 Standard is far cheaper per GB than EBS or Express One Zone.
Networking vs request costs (+storage)
I compare 3-zone replication vs 3-zone Express One Zone. Additionally, I compare the following discounts:
Replication: Cross-AZ data transfer discounts of 0%, 50%, 80%, 90%.
S3: General S3 discounts of 0%, 25%.
If you wonder why the data transfer discounts are larger, it’s because this reflects the real world, as I mentioned earlier in the post. The data transfer discounts can get so large because of the huge costs involved with a large data transfer footprint and the relatively low cost to the CSP for the networking infrastructure. S3 likely has less wiggle room for discounts while still remaining profitable.
Single-AZ configurations
Note that the replication discount is 0% as it is irrelevant in a single-AZ configuration (or a multi-AZ cluster deployed in Azure).
What we’re seeing here for replication is solely the storage cost, versus the storage + request cost of S3 Express One Zone. Not surprisingly, replication comes out cheaper.
What is interesting is the impact of buffering time and request size on S3 Express One Zone. At 1 MB/s, it is clearly more economical to buffer for 100ms, because this reduces the request rate while still keeping the request size <= 512 KB. However, at some point there is a crossover where buffering for 100ms makes the economics worse. We see this at the 500 MB/s and 1 GB/s workloads, where the long buffering creates 10 MB request sizes.
Multi-AZ configurations
Again, we see that buffering for longer on the lower throughputs is cheaper for S3 Express One Zone. But by 500 MB/s, the longer buffering actually becomes more expensive, as the request sizes exceed 512 KB.
We clearly see that the lower the throughput, the better replication does in terms of cost competitiveness. For the 500 MB/s and 1000 MB/s workloads, it all comes down to the size of the networking data transfer discount. For organizations with the best data transfer discounts, replication remains the cheapest; with lower discounts, S3 Express One Zone can end up cheaper, especially if it sticks to 512 KB request sizes.
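To see that crossover, here is a rough sketch that pits the replication cross-AZ cost (after a discount) against the 3-zone Express One Zone request cost, assuming 512 KB maximum requests, 10 ms maximum buffering, list prices, and ignoring storage and compute. The function names are illustrative.

```python
# Replication cross-AZ cost vs 3-zone Express One Zone PUT cost for the larger workloads.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

def replication_xaz_cost(ingress_mb_s, discount):
    return (2/3 + 2) * ingress_mb_s / 1000 * SECONDS_PER_MONTH * 0.02 * (1 - discount)

def express_3zone_put_cost(ingress_mb_s, proxies, buffer_ms=10, max_request_kb=512):
    request_kb = min(ingress_mb_s * 1000 / proxies * buffer_ms / 1000, max_request_kb)
    puts_per_s = ingress_mb_s * 1000 / request_kb
    return 3 * puts_per_s * SECONDS_PER_MONTH / 1000 * 0.0025

for mb_s, proxies in ((500, 6), (1000, 12)):
    express = express_3zone_put_cost(mb_s, proxies)
    for discount in (0.0, 0.5, 0.8, 0.9):
        repl = replication_xaz_cost(mb_s, discount)
        print(f"{mb_s} MB/s @ {discount:.0%} xAZ discount: "
              f"replication ${repl:,.0f} vs Express 3-zone ${express:,.0f}")
```

In this simplified model, 3-zone Express One Zone undercuts undiscounted cross-AZ at these throughputs, while an 80-90% data transfer discount flips the result back in replication’s favour.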
We should also take into consideration that replication will offer the lowest produce latency and end-to-end latency. With proxies having to buffer writes for 5-10ms in this model, the S3 Express One Zone WAL has already lost the latency battle. Additionally, if the WAL wants multi-megabyte request sizes, then latencies are orders of magnitude worse than replication.
Comparing S3 Express One Zone to S3 Standard with longer buffering times
So far I’ve focused on low-latency workloads, given that this is a cost analysis for a low-latency fault-tolerant WAL. What if we relaxed the latency requirements of our WAL? If we were to use a larger buffering time, say anywhere from 100-250ms, then we’d be in the zone where S3 Standard could also be used (albeit with likely double the end-to-end latency). So how does Express One Zone compare to S3 Standard at larger buffering times, for a 1 GB/s workload (with 9 proxies)?
As we can see, for Express One Zone, the cost sweet spot is 5ms of buffering on the proxy. All buffering time above that actually costs more money because the request size grew larger than 512 KB. S3 Standard does not penalize for larger requests, so as the buffering time goes up, the cost just keeps falling. So we see that if we’re going to buffer for longer periods, we might as well just use S3 Standard.
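That intuition can be checked with a quick sketch for the 1 GB/s, 9-proxy workload, using list PUT prices, a single zone and no discount. Names and structure are illustrative.

```python
# PUT-cost comparison between Express One Zone and Standard as buffering grows.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60
THROUGHPUT_MB_S, PROXIES = 1000, 9

def monthly_put_cost(buffer_ms, express=True):
    request_kb = THROUGHPUT_MB_S * 1000 / PROXIES * buffer_ms / 1000
    puts_per_month = PROXIES * (1000 / buffer_ms) * SECONDS_PER_MONTH
    if express:
        cost = puts_per_month / 1000 * 0.0025
        overage_gb = max(0, request_kb - 512) * puts_per_month / 1_000_000
        return cost + overage_gb * 0.008
    return puts_per_month / 1000 * 0.005     # Standard has no large-request surcharge

for ms in (5, 10, 50, 100, 250):
    print(f"{ms:>3} ms: Express ~${monthly_put_cost(ms):,.0f}, "
          f"Standard ~${monthly_put_cost(ms, express=False):,.0f}")
```

Under these assumptions, Express One Zone bottoms out around the 5 ms mark, while Standard just keeps getting cheaper the longer we buffer.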
Using Express One Zone with longer buffering time makes for a strange kind of system. Probably half the end-to-end latency of Standard, but with a cost profile that is far, far inferior. Therefore I don’t see a clear use case of Express One Zone with long buffering times, unless it is to make low throughput workloads more economical by reducing the request rate. For a high throughput workload, such buffering times are far more costly and add up to the worst of both worlds.
How I generated the data
Last year I wrote a Java program that I use to simulate the hardware resource needs (and costs) of State-Machine-Replication (SMR) workloads. For this analysis I wrote a new one for an S3 Express One Zone log. They spit out CSV files which I then run through some R notebooks. I may make time to clean them up for general consumption, but these calculations can all be done with a spreadsheet, using the pricing formulas available from AWS. It would be great to release a polished cost model, but that is a lot of work, and I don’t want to be responsible for other people’s cost calculations. Not to mention that discount schemes can often make a general-purpose model way too complex to build.
Conclusions
The principal weakness of S3 Express One Zone for a low-latency log is the cost-effectiveness at low to medium throughputs. Given there is a cost penalty to sending requests larger than 512KB, the main cost-driver is the request rate.
At low throughputs, the request rate has a floor determined by the maximum buffering time that can be tolerated. Given that S3 Express One Zone offers single-digit millisecond write latency, it doesn’t make sense to pay its large premium over Standard if you are going to buffer for 100-400ms to make a low throughput workload economical. For this reason, I see Express One Zone as a possible candidate for a write-ahead-log only in high throughput workloads. This eliminates the vast majority of single-tenant workloads, and so for the regular workloads, all that is left is using it within a multi-tenant serverless architecture. In such an architecture you would need to mix the data of different tenants into shared objects. BYOC deployments would not be able to make use of multi-tenancy, so the use of Express One Zone in BYOC would have to be limited to high throughput workloads only.
Replication on the other hand suffers from none of these issues. Its main problem is cross-AZ data transfer, but given the right discounts for the large workloads, it can be made to be cheaper than S3 Express One Zone (and with lower latencies).
There are a number of players in the “write-directly-to-S3” game right now. For my part, I work at Confluent, which recently announced Freight Clusters, a Kafka-over-object-storage implementation. Freight Clusters is a higher-latency system based on standard object store tiers. We’ve been studying S3 Express One Zone closely to see if it would be a good fit as a low-latency write-cache to replace replication, and for us at least, it doesn’t make sense. We can get better performance and better cost with replication (via Kora brokers) than with Express One Zone. That might change, or it might not. It’s possible that AWS could reduce the request costs enough for it to be more competitive, or offer a multi-zone S3 Express, but it’s also possible that they might reduce or eliminate cross-AZ data transfer fees in order to compete better against Azure. Should cross-AZ data transfer become free, then there would be no hope of S3 Express One Zone ever being as cheap as replication. We’ll just have to see how it plays out.