Jack Vanlightly

Constraints breed innovation and so does tenacity

A number of years ago, I got a bit addicted to coding challenges on HackerRank. It was kind of intense: fun learning the algorithms, frustrating watching my early attempts crash and burn, but ultimately triumphant when I finally got the thing to run in under the N seconds the challenge demanded.

Something that has always stuck with me from those days is how easy it would have been to settle if I hadn’t known it was possible. In real life, on each iteration, as I slowly improved the running time, I could have settled. But in a coding challenge, if the target was 4 seconds then I knew it was possible, and even though it seemed out of reach after my initial attempts, I carried on. I carried on despite the frustration and exasperation as my attempts continued to fail.

Another example of this was the 1 Billion Row Challenge. You could see other people’s results, so you knew you could do better.

In the real world it’s so easy to settle. You write an algorithm, come up with an architecture, design a protocol and so on; it works and has reasonable performance, reasonable properties. Sometimes that’s enough and you move on. But sometimes perhaps it’s worth striving and not settling for your early ideas. What if, instead of accepting your design, you asked yourself for something better, maybe something out of left field? Maybe add a constraint that, if you could pull it off, would be amazing. Maybe it requires some extra reading, like when I would dip into my various algorithms books for another way, another strategy.

It’s something I think about seriously whenever it comes to software design and implementation. It’s a sort of ongoing epiphany that surfaces any time I need to design something: remember HackerRank. I ask myself, well, that’s good, but what if you had to come up with something better than this?

Table format comparisons - Streaming ingest of row-level operations

In the previous post, I covered append-only tables, a common table type in analytics, often used for ingesting data into a data lake or for modeling streams between stream processor jobs. I had promised to cover native support for changelog streams, aka change data capture (CDC), but before I do so, I think we should first look at how the table formats support the ingestion of data with row-level operations (insert, update, delete) rather than the query-level operations commonly used in SQL batch commands.
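
To make the distinction concrete, here is a minimal sketch in Python. The record shape (op/key/row) and the apply_ops helper are my own toy illustration, not any particular table format’s API; it simply contrasts a query-level command with the equivalent stream of row-level operations:

```python
# A query-level operation, as you would issue it in a SQL batch command:
#   UPDATE customers SET tier = 'gold' WHERE lifetime_spend > 10000;
# The engine works out which rows are affected at execution time.

# The same change expressed as row-level operations, the way a changelog/CDC
# stream delivers it: each record names the operation and the row it touches.
row_level_ops = [
    {"op": "update", "key": 42,  "row": {"name": "Dina",  "tier": "gold"}},
    {"op": "update", "key": 77,  "row": {"name": "Arjun", "tier": "gold"}},
    {"op": "insert", "key": 901, "row": {"name": "Lee",   "tier": "bronze"}},
    {"op": "delete", "key": 13,  "row": None},
]

def apply_ops(table, ops):
    """Apply insert/update/delete records to an in-memory 'table' keyed by primary key."""
    for record in ops:
        if record["op"] in ("insert", "update"):
            table[record["key"]] = record["row"]
        else:  # delete
            table.pop(record["key"], None)
    return table

table = {42: {"name": "Dina", "tier": "silver"}, 13: {"name": "Max", "tier": "bronze"}}
print(apply_ops(table, row_level_ops))
```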

Table format comparisons - Append-only tables and incremental reads

This post is about how the table formats support append-only tables and incremental reads. Streaming is becoming more and more important in the data analytics stack and the table formats all have some degree of support for streaming. One of the pillars of a streaming workload based on table formats is the append-only table. There are other pillars, such as changelog streams, and I’ll cover those in another post.

Incremental reads allow compute engines to perform repeated queries that return new records, or changes to records, that have occurred since the last query was executed. Basically, a table client polls the table on an interval, receiving the latest data on each occasion, much like a Kafka consumer, albeit with a lot more end-to-end latency.
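
As a rough sketch of that polling pattern, here is a toy in-memory stand-in for a table. The names here (ToyTable, read_incremental, poll_once) are my own simplification, not any format’s real metadata or client API; the point is that the reader remembers the last snapshot it saw and, on each poll, fetches only what was committed since then:

```python
class ToyTable:
    """A toy stand-in for a table-format table: a list of commits (snapshots),
    each holding the records appended by that commit. Illustrative only."""

    def __init__(self):
        self.snapshots = []  # snapshot N is self.snapshots[N-1]

    def commit(self, records):
        self.snapshots.append(records)

    def current_snapshot_id(self):
        return len(self.snapshots)

    def read_incremental(self, from_snapshot, to_snapshot):
        """Records committed after from_snapshot, up to and including to_snapshot."""
        out = []
        for batch in self.snapshots[from_snapshot:to_snapshot]:
            out.extend(batch)
        return out


def poll_once(table, last_seen, handle):
    """One iteration of the incremental-read loop: much like a Kafka consumer poll,
    but at table-commit granularity. Returns the new high-water mark."""
    latest = table.current_snapshot_id()
    if latest > last_seen:
        for record in table.read_incremental(last_seen, latest):
            handle(record)
    return latest


t = ToyTable()
last_seen = 0
t.commit([{"id": 1}, {"id": 2}])
last_seen = poll_once(t, last_seen, handle=print)  # emits ids 1 and 2
t.commit([{"id": 3}])
last_seen = poll_once(t, last_seen, handle=print)  # emits only id 3
```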

Table format comparisons - How do the table formats represent the canonical set of files?

This is the first in a series of short comparisons of table format internals. While I have written about each in some detail, I think it’s interesting to look at what they have in common and what sets them apart.

Question: How do the table formats represent the canonical list of data and delete files?

All the table formats store references to a canonical set of data and delete files within a set of metadata files. Each table format takes a slightly different approach, but I’ll classify them into two categories (sketched in toy form after the list):

  • The log of deltas approach (Hudi and Delta Lake)

  • The log of snapshots approach (Iceberg and Paimon)
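
To illustrate the difference, here is a deliberately toy sketch in Python. The delta_log and snapshot_log structures are simplifications of my own, not the formats’ actual metadata layouts (Delta’s JSON actions, Iceberg’s manifests, and so on): with a log of deltas, readers derive the current file set by replaying add/remove actions; with a log of snapshots, each commit’s metadata references the complete file set directly.

```python
# Log of deltas (in the spirit of Hudi and Delta Lake): each commit records the
# changes to the file set, and readers replay the log to derive the current set.
delta_log = [
    {"version": 0, "add": ["data-0.parquet"], "remove": []},
    {"version": 1, "add": ["data-1.parquet"], "remove": []},
    {"version": 2, "add": ["data-2.parquet"], "remove": ["data-0.parquet"]},  # e.g. a rewrite
]

def files_from_deltas(log):
    files = set()
    for entry in log:
        files.update(entry["add"])
        files.difference_update(entry["remove"])
    return files

# Log of snapshots (in the spirit of Iceberg and Paimon): each commit produces a
# new snapshot whose metadata points at the complete, canonical file set.
snapshot_log = [
    {"snapshot_id": 0, "files": ["data-0.parquet"]},
    {"snapshot_id": 1, "files": ["data-0.parquet", "data-1.parquet"]},
    {"snapshot_id": 2, "files": ["data-1.parquet", "data-2.parquet"]},
]

def files_from_snapshots(log):
    return set(log[-1]["files"])

assert files_from_deltas(delta_log) == files_from_snapshots(snapshot_log)
print(files_from_deltas(delta_log))  # {'data-1.parquet', 'data-2.parquet'}
```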
