Understanding Apache Hudi's Consistency Model Part 3

In part 1 we built a logical model for how copy-on-write tables work in Apache Hudi, and posed a number of questions about consistency under different types of concurrency control, timestamp monotonicity and more. In part 2 we studied timestamp collisions, their probabilities, and how to avoid them (while remaining conformant to the Hudi spec). In part 3, we'll focus on the results of model checking the TLA+ specification, and answer those questions.

Understanding Apache Hudi's Consistency Model Part 2

In part 1 we built up an understanding of the mechanics of Hudi copy-on-write tables, with special attention to multi-writer scenarios, using a simplified logical model. In this part we'll look at:

  • Understanding why the Hudi spec instructs the use of monotonic timestamps, by looking at the impact of timestamp collisions.

  • The probability of collisions in multi-writer scenarios where writers use their local clocks as timestamp sources (a rough simulation sketch follows this list).

  • Various options for avoiding collisions.
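
To make the collision question concrete, here is a rough Monte Carlo sketch (my own illustration, not code from the post) that estimates how often at least two of N writers pick the same millisecond-resolution commit timestamp when each reads its local clock at a uniformly random point inside a short window; the writer count and window length are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class TimestampCollisionSim {
    public static void main(String[] args) {
        int writers = 10;      // assumed number of concurrent writers
        int windowMs = 1000;   // assumed window in which their commits land
        int trials = 100_000;
        Random rnd = new Random(42);

        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            Set<Integer> timestamps = new HashSet<>();
            boolean collided = false;
            for (int w = 0; w < writers; w++) {
                // Each writer stamps its commit with a millisecond read from its local clock.
                if (!timestamps.add(rnd.nextInt(windowMs))) {
                    collided = true;
                }
            }
            if (collided) collisions++;
        }
        System.out.printf("P(collision) ~ %.4f for %d writers in a %d ms window%n",
                (double) collisions / trials, writers, windowMs);
    }
}
```

With these illustrative numbers the estimate comes out at a few percent per window, which is why the post looks at collision-avoidance options rather than relying on raw local clocks.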

Understanding Apache Hudi's Consistency Model Part 1

Apache Hudi is one of the three leading table formats (Apache Iceberg and Delta Lake being the other two). Whereas Apache Iceberg's internals are relatively easy to understand, I found Apache Hudi more complex and harder to reason about. As a distributed systems engineer I wanted to understand it, and I was especially interested in its consistency model with regard to multiple concurrent writers. Ultimately, I wrote a TLA+ specification to help me nail down the design and understand its consistency model.

Scaling models and multi-tenant data systems - ASDS Chapter 6

What is scaling in large-scale multi-tenant data systems, and how does that compare to single-tenant data systems? How does per-tenant scaling relate to system-wide scaling? How do scale-to-zero and cold starts come into play? Answering these questions is chapter 6 of The Architecture of Serverless Data Systems.

Rather than looking at a new system, this chapter focuses on some patterns that emerge after analyzing the first five systems. So far, I’ve included storage-oriented systems (Amazon DynamoDB, Confluent Cloud’s Kora engine), OLTP databases (Neon, CockroachDB serverless), and an OLAP database (ClickHouse Cloud). The workloads across these systems diverge a lot, and so do their architectural decisions, but some common patterns still emerge.

Serverless ClickHouse Cloud - ASDS Chapter 5 (part 2)

In part 1 we looked at the open-source ClickHouse architecture; in part 2 we dive into the serverless architecture of ClickHouse Cloud.

The serverless ClickHouse architecture

ClickHouse Cloud is an entirely serverless offering of ClickHouse that uses a new MergeTree engine called the SharedMergeTree. Where a traditional on-premises ClickHouse cluster is composed of a set of monolithic servers that do both query processing and storage, ClickHouse Cloud separates storage and compute, with the servers being stateless compute instances and storage served solely by cloud object storage. The SharedMergeTree engine is at the core of this serverless architecture.

Serverless ClickHouse Cloud - ASDS Chapter 5 (part 1)

See The Architecture of Serverless Data Systems introduction chapter to find the other serverless data systems. This chapter covers the first system of group 3 - the analytics database group.

ClickHouse is an open-source, column-oriented, distributed (real-time) OLAP database management system. ClickHouse Cloud offers serverless ClickHouse clusters, billed according to the compute and storage resources consumed.

Before we jump into the architecture of ClickHouse Cloud, it’s worth understanding the architecture of a standard ClickHouse cluster first. The storage architecture of ClickHouse plays a critical role in how the serverless version was designed. Part 1 of this chapter will focus on the standard, fully open-source ClickHouse, and part 2 will focus on the serverless architecture of ClickHouse Cloud.

Serverless CockroachDB - ASDS Chapter 4 (part 3)

In part 3, the focus is on heat management (the mitigation of hot spots in the storage layer) and autoscaling in the compute layer. In part 2 we looked at the Admission Control sub-system of CRDB and how it helps mitigate node overload and noisy neighbor problems. CRDB uses a combination of sharding, shard movement, and lease distribution to avoid hot spots in its shared storage layer.

Serverless CockroachDB - ASDS Chapter 4 (part 1)

CockroachDB is a distributed SQL database that aims to be Postgres-compatible. Over the years, the Postgres wire protocol has become a standard of sorts, with many database products implementing it (much like the Apache Kafka protocol has become a de facto standard in the streaming space).

While it may be Postgres-compatible, there is almost nothing about the serverless CockroachDB architecture that is shared with Neon (serverless Postgres covered in the previous chapter). What they do share, like all the serverless multi-tenant systems in this series, is the separation of storage and compute; the rest is completely different.
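
As a small illustration of what wire-protocol compatibility buys you, here is a minimal sketch (not from the chapter) that queries a CockroachDB node using the stock Postgres JDBC driver; the host, port, database and credentials are placeholder assumptions for a local insecure single-node cluster, and the org.postgresql driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CockroachViaPostgresDriver {
    public static void main(String[] args) throws Exception {
        // CockroachDB speaks the Postgres wire protocol, so the ordinary Postgres
        // JDBC driver connects to it unchanged. The connection details below are
        // placeholders for a local, insecure, single-node cluster.
        String url = "jdbc:postgresql://localhost:26257/defaultdb?sslmode=disable";
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // prints a CockroachDB version string
            }
        }
    }
}
```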

Neon - Serverless PostgreSQL - ASDS Chapter 3

Neon is a serverless Postgres service based on an architecture similar to Amazon Aurora. It separates the Postgres monolith into disaggregated storage and compute. The motivation behind this architecture is four-fold:

  • Aim to deliver the best price-performance Postgres service in the world.

  • Use modern replication techniques to provide high availability and high durability to Postgres.

  • Simplify the life of developers by bringing the serverless consumption model to Postgres.

  • Do all this while keeping the majority of Postgres unchanged. Rather than building a new Postgres-compatible implementation, simply leverage the pluggable storage layer to provide all the above benefits while keeping Postgres Postgres.

Kora - Serverless Kafka - ASDS Chapter 2

This is the second instalment of the Architecture of Serverless Data Systems. In the first post, I covered DynamoDB, and I will refer back to it where comparison and contrast are interesting.

Kora is the multi-tenant serverless Kafka engine inside Confluent Cloud. It was designed to offer virtual Kafka clusters on top of shared physical clusters, based on a heavily modified version of Apache Kafka. Today, as little as 20% of the code is shared with the open-source version, because the demands of large-scale multi-tenant systems diverge from the needs of single-tenant clusters.

The goal of Kora was to avoid stamping out cookie-cutter single-tenant Kafka clusters, which would miss out on the economic and reliability benefits of large-scale multi-tenancy. This architecture is evolving fast, and a year or two from now, this description will likely be stale as we continue to disaggregate the architecture from the original Kafka monolith.

Amazon DynamoDB - ASDS Chapter 1

DynamoDB is a serverless, distributed, multi-tenant NoSQL KV store that was designed and implemented from day one as a disaggregated cloud-native data system.

The goals were to build a multi-tenant system that had the following properties:

  • Consistent performance with low single-digit millisecond latency.

  • High resource utilization through multi-tenancy, in order to reduce costs, with the savings passed on to customers.

  • Unbounded table sizes, where size does not affect performance.

  • Support for ACID transactions across multiple operations and tables (a minimal sketch follows this list).
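
To ground the last bullet, here is a minimal sketch of a multi-table, all-or-nothing write using the TransactWriteItems API from the AWS SDK for Java v2; the table names, keys and attributes are hypothetical, and AWS credentials and region are assumed to be configured in the environment.

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.*;

import java.util.Map;

public class DynamoTxnSketch {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Two writes to two hypothetical tables, committed atomically: either both
        // succeed or neither does (e.g. if the stock condition fails).
        TransactWriteItemsRequest request = TransactWriteItemsRequest.builder()
            .transactItems(
                TransactWriteItem.builder().put(Put.builder()
                    .tableName("orders") // hypothetical table with partition key order_id
                    .item(Map.of(
                        "order_id", AttributeValue.builder().s("o-123").build(),
                        "status", AttributeValue.builder().s("PLACED").build()))
                    .build()).build(),
                TransactWriteItem.builder().update(Update.builder()
                    .tableName("inventory") // hypothetical table with partition key sku
                    .key(Map.of("sku", AttributeValue.builder().s("sku-9").build()))
                    .updateExpression("SET stock = stock - :one")
                    .conditionExpression("stock >= :one")
                    .expressionAttributeValues(Map.of(
                        ":one", AttributeValue.builder().n("1").build()))
                    .build()).build())
            .build();

        ddb.transactWriteItems(request); // all-or-nothing across both tables
    }
}
```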

Kafka vs Redpanda Performance - Part 6 - Draining backlogs

In the last post we saw how only Apache Kafka was able to fully utilize the 2 GB/s throughput limit of the i3en.6xlarge. In this post we’re going to test the ability of Kafka and Redpanda to drain a backlog while under continued producer load.

This test starts the producers and consumers at the target throughput, then pauses the consumers until consumer lag builds up to a desired amount (also known as a backlog). The consumers are then resumed and we see how long it takes for them to catch up and return to sub-second end-to-end latency.
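
For a sense of the mechanics, here is a minimal sketch (not the actual benchmark code) of the consumer side of such a test using the Kafka consumer's pause/resume API; the broker address, topic, pause duration and drain target are assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class BacklogDrainSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "backlog-drain-test");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("load-test-topic")); // hypothetical topic under producer load
            while (consumer.assignment().isEmpty()) {
                consumer.poll(Duration.ofMillis(200));      // wait for partition assignment
            }

            // Phase 1: pause fetching while producers keep writing, so lag (a backlog) builds.
            // Keep calling poll() so the consumer stays in the group; paused partitions
            // simply return no records.
            consumer.pause(consumer.assignment());
            long pauseUntil = System.currentTimeMillis() + Duration.ofMinutes(5).toMillis();
            while (System.currentTimeMillis() < pauseUntil) {
                consumer.poll(Duration.ofSeconds(1));
            }

            // Phase 2: resume and time the catch-up. The real benchmark tracks consumer lag
            // and end-to-end latency; a fixed record count stands in for that here.
            consumer.resume(consumer.assignment());
            long start = System.currentTimeMillis();
            long drained = 0;
            while (drained < 10_000_000L) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                drained += records.count();
            }
            System.out.printf("Drained backlog in %d ms%n", System.currentTimeMillis() - start);
        }
    }
}
```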

Kafka vs Redpanda Performance - Part 5 - Reaching the limits of the NVMe drive

In the previous post we saw how using record keys impacted both Apache Kafka and Redpanda. However, Redpanda struggled more than Kafka with the more numerous, smaller batches that result from key-based partition distribution.

Next I decided to see if I could get Apache Kafka and Redpanda to reach the absolute limit of the NVMe drive throughput on the i3en.6xlarge - 2 GB/s. To do this I deployed both systems without TLS and modified the Redpanda 1 GB/s benchmark to attempt 2 GB/s, use 10 producers/consumers and use acks=1 instead of acks=all.

Kafka vs Redpanda Performance - Part 4 - Impact of record keys

In the last post we saw that Redpanda latency can literally jump once data retention limits kick in. In this post we’re going to look at the impact of using record keys.

When we don’t use record keys, the producer's default partitioner accumulates messages into batches in a first-come-first-served manner and randomly chooses a partition to send each batch to. The new Uniform Sticky Partitioner can also probabilistically favor less overloaded partitions. This is all good for performance: even with a short linger.ms, the producer can create large batches. By contrast, sending more numerous, smaller batches can negatively impact performance.
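
A minimal producer sketch of the two cases, with placeholder broker and topic names and illustrative linger.ms/batch.size values: without a key the partitioner batches freely and picks the partition itself, while a key forces hash-based placement across all partitions, which tends to produce more, smaller batches.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class KeyedVsUnkeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("linger.ms", "5");        // a short linger still yields big batches when unkeyed
        props.put("batch.size", "131072");  // 128 KB upper bound per per-partition batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the default partitioner fills a batch for one partition at a time
            // and picks the partition itself, so batches stay large and efficient.
            producer.send(new ProducerRecord<>("events", "payload-without-key"));

            // With a key: the partition is derived from a hash of the key, so records
            // spread across all partitions and per-partition batches get smaller.
            producer.send(new ProducerRecord<>("events", "user-42", "payload-with-key"));
        }
    }
}
```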

Kafka vs Redpanda Performance - Part 3 - Hitting the retention limit

In the last post we saw that Redpanda performance degraded over time and that we needed to include a certain amount of drive over-provisioning to cope with the random IO nature of Redpanda. In this post we’re going to look at a phenomenon I see in every high-throughput test I run that reaches the retention limit.

Kafka vs Redpanda Performance - Part 1 - 4 vs 50 producers

The Redpanda benchmark and TCO analysis claims that Redpanda needs only 3 brokers on the i3en.6xlarge to reach 1 GB/s, while Apache Kafka needs 9 brokers and even then shows inferior end-to-end latency to Redpanda. I decided to see if I could reproduce those claims and whether the Redpanda performance was generalizable. That is, does the 1 GB/s with 4 producers and consumers translate to other workloads? In other words, is it a useful benchmark you could base decisions on?

I ran the Redpanda 1 GB/s benchmark against Redpanda and Apache Kafka on identical hardware at 6 different throughputs: 500, 600, 700, 800, 900 and 1000 MB/s. I also ran it with the original 4 producers and consumers, then with 50 producers and consumers. The result was significant performance degradation for Redpanda with 50 producers. The other noteworthy result was that Redpanda was unable to reach 1000 MB/s with TLS, which conflicts with the Redpanda benchmark claims.

Paper: VR Revisited - Checkpoint-Based Replica Recovery (part 6)

In part 5 we looked at basic and log-suffix recovery and came to the conclusion that perhaps a persistent State Machine Replication (SMR) protocol should not use asynchronous log flushing with a recovery protocol to handle data loss - at least not one that gets stuck after a cluster crash. Checkpoint-based recovery doesn’t change that position, but it does introduce a very important component of any SMR protocol: checkpointing application state.
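
As a rough illustration of the general idea (a simplified sketch, not the VR Revisited protocol itself), the following shows application state being checkpointed alongside a log position, with recovery loading the latest checkpoint and replaying only the log suffix; the counter state and in-memory "durable" storage are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointRecoverySketch {
    // A durable checkpoint pairs a snapshot of the application state with the
    // index of the last log entry reflected in that snapshot.
    record Checkpoint(long state, int appliedUpTo) {}

    public static void main(String[] args) {
        List<Long> log = new ArrayList<>();   // stands in for the replicated operation log
        long state = 0;                       // the "application state" is just a counter here
        Checkpoint latest = new Checkpoint(0, -1);

        // Normal operation: apply each entry, and checkpoint every 100 entries.
        for (int i = 0; i < 1000; i++) {
            long op = i + 1;
            log.add(op);
            state += op;
            if ((i + 1) % 100 == 0) {
                latest = new Checkpoint(state, i); // durably written in a real system
            }
        }

        // Recovery after losing the in-memory state: load the latest checkpoint and
        // replay only the log suffix after it, rather than the whole log.
        long recovered = latest.state();
        for (int i = latest.appliedUpTo() + 1; i < log.size(); i++) {
            recovered += log.get(i);
        }
        System.out.println("Recovered state matches: " + (recovered == state));
    }
}
```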