S3 Express One Zone, not quite what I hoped for

AWS just announced a new lower-latency S3 storage class, and for those of us in the data infrastructure business this is big news. It’s no secret that a low-latency object storage primitive has the potential to change how we build cloud data systems forever. So has this new world arrived with S3 Express One Zone?

The answer is no, but this is a good time to talk about cloud object storage, its role in modern cloud data systems, and the role it could play in the future.

The Architecture of Serverless Data Systems

I recently blogged about why I believe the future of cloud data services is large-scale and multi-tenant, citing, among others, S3. 

As I wrote in that post: “Top tier SaaS services like S3 are able to deliver amazing simplicity, reliability, durability, scalability, and low price because their technologies are structurally oriented to deliver those things. Serving customers over large resource pools provides unparalleled efficiency and reliability at scale.”

To further explore this topic, I am surveying real-world serverless, multi-tenant data architectures to understand how different types of systems, such as OLTP databases, real-time OLAP, cloud data warehouses, event streaming systems, and more, implement serverless multi-tenancy (MT).

The importance of liveness properties (with TLA+ Part 2)

In part 1 we introduced the concept of safety and liveness properties, and then a stupidly simple gossip protocol called Gossa. Our aim is to find liveness bugs in the design and improve it until all liveness issues are fixed.

Gossa had some problems. First, it had cycles due to nodes contesting whether a peer was dead or alive. We fixed that by making deadness take precedence over aliveness, but still the cluster could not converge. The next problem was that a falsely accused node was unable to refute its deadness, as no one would pay attention to it: deadness ruled.

The proposed fix I mentioned in part 1 was to allow a falsely accused node to refute its deadness via the introduction of a monotonic counter.
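
To make the fix concrete, here is a minimal sketch of the idea in Python. It is not the actual Gossa design from the posts; the names and the merge rule are my own illustration of how a monotonic counter lets a falsely accused node override a stale deadness claim.

    # Hypothetical illustration: each node owns a monotonic counter that only
    # it increments. A deadness claim wins ties, but a higher counter always
    # wins, so a live node can refute an accusation by bumping its counter.
    from dataclasses import dataclass

    @dataclass
    class PeerState:
        alive: bool
        counter: int  # monotonic, incremented only by the peer itself

    def merge(local: PeerState, incoming: PeerState) -> PeerState:
        """Merge gossiped state about one peer into our local view."""
        if incoming.counter > local.counter:
            return incoming   # a newer counter always wins
        if incoming.counter == local.counter and not incoming.alive:
            return incoming   # same counter: deadness takes precedence
        return local

    view = PeerState(alive=True, counter=1)
    view = merge(view, PeerState(alive=False, counter=1))  # accusation sticks
    view = merge(view, PeerState(alive=True, counter=2))   # refutation wins
    assert view.alive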

The importance of liveness properties (with TLA+ Part 1)

Invariants get most of the attention because they are easy to write, easy to check, and find those histories which lead to really bad outcomes, such as lost data. But liveness properties are really important too, and after years of writing TLA+ specifications, I couldn’t imagine having confidence in a specification without them. This post and the next are a random walk through the world of model checking liveness properties in TLA+.
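
To give a flavour of the difference (with hypothetical predicates, not ones from the actual specifications in these posts): an invariant says something bad never happens in any reachable state, while a liveness property says something good eventually happens. In temporal-logic notation:

    % Safety: in every state of every behaviour, no data is ever lost
    \Box\, \neg \mathit{DataLost}

    % Liveness: eventually the cluster converges, and stays converged
    \Diamond\, \Box\, \mathit{Converged}

In TLA+ these would be written []~DataLost and <>[]Converged.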

The outline is like this:

  • Part 1: I (hopefully) convince you that liveness properties are important. Then I implement a gossip algorithm in TLA+ and use liveness properties to find problems.

  • Part 2: I continue evolving the algorithm, finding more liveness problems, overcome challenges such as an infinite state space, and impart some helpful principles, making you a better engineer and thinker by the end.

On the future of cloud services and BYOC

My job at Confluent involves a mixture of research, engineering, and helping us figure out the best technical strategy to follow. BYOC is something I’ve been thinking about recently, so I decided to write down my thoughts on it and on where I think cloud services are going in general.

Bring Your Own Cloud (BYOC) is a deployment model that sits somewhere between a SaaS cloud service and an on-premise deployment. The vendor deploys their software in a VPC in the customer’s account but manages most of the administration for the customer. It’s not a new idea: the term Managed Service Provider (MSP) has been around since the 90s and refers to outsourcing the management and operations of IT infrastructure deployed within customer or third-party data centers.

Kafka KIP-966 - Fixing the Last Replica Standing issue

The Kafka replication protocol just got a new KIP that improves its durability when running without fsync. As I previously blogged in Why Kafka Doesn’t Need Fsync to be Safe, there are distributed system designs that allow for asynchronous storage engines. Being asynchronous means the system can reap performance benefits that are not available to a synchronous storage engine.
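
To see where the performance gap comes from, here is a rough sketch in Python (nothing to do with Kafka’s actual storage code; the paths and sizes are arbitrary) comparing buffered appends with an fsync on every write:

    # Rough illustration only: a buffered write returns once the data is in
    # the OS page cache, while fsync blocks until the data reaches the device.
    import os
    import time

    def append_records(path: str, n: int, fsync_each: bool) -> float:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        start = time.perf_counter()
        for _ in range(n):
            os.write(fd, b"x" * 1024)   # one 1 KiB record
            if fsync_each:
                os.fsync(fd)            # synchronous: wait for the device
        elapsed = time.perf_counter() - start
        os.close(fd)
        return elapsed

    if __name__ == "__main__":
        buffered = append_records("/tmp/buffered.log", 10_000, False)
        synced = append_records("/tmp/synced.log", 10_000, True)
        print(f"buffered: {buffered:.3f}s, fsync per write: {synced:.3f}s")

An asynchronous design pays the fsync cost rarely or never, relying on replication rather than the local disk for durability.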

Kafka vs Redpanda Performance - Do the claims add up?

Apache Kafka has been the most popular open source event streaming system for many years, and it continues to grow in popularity. Within the wider ecosystem there are other open source and source-available competitors to Kafka, such as Apache Pulsar, NATS Streaming, Redis Streams, RabbitMQ and, more recently, Redpanda (among others).

Redpanda is a source-available Kafka clone written in C++ using the Seastar framework from ScyllaDB, a wide-column database. It uses the popular Raft consensus protocol for replication and all distributed consensus logic. Redpanda has been going to great lengths to explain that its performance is superior to Apache Kafka’s due to its thread-per-core architecture, its use of C++, and a storage design that can push high-performance NVMe drives to their limits.

They list a bold set of claims, and those claims seem plausible. Being built in C++ for modern hardware with a thread-per-core architecture sounds compelling, so it seems logical that the claims must be true. But are they?

Is sequential IO dead in the era of the NVMe drive?

Two systems I know pretty well, Apache BookKeeper and Apache Kafka, were designed in the era of the spinning disk: the hard drive, or HDD. Hard drives are good at sequential IO but not so good at random IO because of their relatively high seek times. No wonder, then, that both Kafka and BookKeeper were designed with sequential IO in mind.

Both Kafka and BookKeeper are distributed log systems, so you’d think that sequential IO would be the default for an append-only log storage system. But sequential and random IO sit on a continuum, with pure sequential IO on one side and pure random IO on the other. If you have 5000 files that you are appending to with small writes in a round-robin manner, and performing fsyncs, then this is not such a sequential access pattern; it sits further toward the random IO side. So being an append-only log doesn’t mean you get sequential IO out of the gate.
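
As a concrete illustration, here is a small Python sketch of that access pattern (the file count is reduced from the 5000 in the example above to stay under typical open-file limits; paths and sizes are arbitrary):

    # Round-robin small appends across many files: each write lands in a
    # different file, so at the device level the pattern is far closer to
    # random IO than to sequential IO.
    import os

    NUM_FILES = 500      # the example above uses 5000; kept smaller here
    WRITE_SIZE = 4096    # small 4 KiB appends

    fds = [
        os.open(f"/tmp/log-{i}.seg", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        for i in range(NUM_FILES)
    ]

    for _round in range(10):
        for fd in fds:                       # round-robin over every file
            os.write(fd, b"x" * WRITE_SIZE)  # small append to each file
            os.fsync(fd)                     # force each write to the device

    for fd in fds:
        os.close(fd)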