Jack Vanlightly


Incremental Jobs and Data Quality Are On a Collision Course - Part 2 - The Way Forward


So what should we do instead?

This is less a technology problem than a structural one. We can't just add some missing features to data tooling; it's about solving a people problem: how we organize, how team incentives line up, and how we apply well-established software engineering principles that have yet to be realized in the data analytics space.


Incremental Jobs and Data Quality Are On a Collision Course - Part 1 - The Problem


Big data isn’t dead; it’s just going incremental

If you keep an eye on the data ecosystem like I do, you'll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and associated datasets), one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on its platform, and some surprising conclusions were drawn, one being that most queries run over quite small data. DuckDB's conclusion was that big data was dead and that you could use simpler query engines rather than a data warehouse. It's far more nuanced than that, but the data does show that most queries run over smaller datasets.

Why?


Table format comparisons - Change queries and CDC


This post, and its associated deep dives, will look at how changes made to an Iceberg/Delta/Hudi/Paimon table can be emitted as a stream of changes. In the context of the table formats, it is not a continuous stream, but the capability to incrementally consume changes by performing periodic change queries.

These change queries can return full Change Data Capture (CDC) data or just the latest data written to the table. When people think of CDC, they might initially think of tools such as Debezium that read the transaction logs of OLTP databases and write a stream of change events to something like Apache Kafka. From there the events might get written to a data lakehouse. But the lakehouse table formats themselves can also generate a stream of change events that can be consumed incrementally. That is what this post is about.
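To make the periodic change query idea concrete, here is a minimal sketch in Python, assuming a Spark session with Delta Lake configured and a table lake.orders that has the change data feed enabled; the table name, columns and version numbers are my placeholders, not details from the post.

# Minimal sketch: run a periodic change query against a Delta table's change
# data feed (CDF). Assumes a SparkSession with Delta Lake configured and a
# table `lake.orders` created with delta.enableChangeDataFeed = true.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdf-poll").getOrCreate()

last_version = 10  # table version consumed by the previous run (placeholder)

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_version + 1)
    .table("lake.orders")
)

# Each change row carries _change_type (insert / update_preimage /
# update_postimage / delete), _commit_version and _commit_timestamp
# alongside the table's own columns.
changes.show()

The other table formats expose the same idea through their own incremental read options and changelog mechanisms, which is what this comparison and its deep dives dig into.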


Table format comparisons - Streaming ingest of row-level operations


In the previous post, I covered append-only tables, a common table type in analytics, often used for ingesting data into a data lake or modeling streams between stream processor jobs. I had promised to cover native support for changelog streams, aka change data capture (CDC), but before I do, I think we should first look at how the table formats support ingesting data as row-level operations (insert, update, delete) rather than the query-level operations commonly used in SQL batch commands.
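As a rough illustration of what row-level ingestion looks like (my own sketch, not code from the post), here is an upsert and a delete written through Hudi's Spark datasource, where each record is identified by a key and applied individually; the table name, path and columns are assumptions.

# Sketch: ingest row-level operations into a Hudi table via the Spark
# datasource. Each record is keyed and upserted or deleted individually,
# rather than the change being expressed as a hand-written UPDATE/DELETE
# statement. Table name, path and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-row-level-ingest").getOrCreate()

hudi_opts = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
}

# Upsert: rows with new keys are inserted, rows with existing keys are updated.
upserts = spark.createDataFrame(
    [(1, "alice", 100), (2, "bob_v2", 101)], ["id", "name", "ts"]
)
(upserts.write.format("hudi")
    .options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-bucket/lake/customers"))

# Delete: the same shape of write, but each row is treated as a delete by key.
deletes = spark.createDataFrame([(2, "bob_v2", 102)], ["id", "name", "ts"])
(deletes.write.format("hudi")
    .options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save("s3://my-bucket/lake/customers"))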


Table format comparisons - Append-only tables and incremental reads


This post is about how the table formats support append-only tables and incremental reads. Streaming is becoming more and more important in the data analytics stack, and the table formats all support it to some degree. One of the pillars of a streaming workload based on table formats is the append-only table. There are other pillars, such as changelog streams, and I'll cover those in another post.

Incremental reads allow compute engines to perform repeated queries that return the new records, or changes to records, that have occurred since the last query was executed. Basically, a table client polls the table on an interval, receiving the latest data each time, much like a Kafka consumer, albeit with a lot more end-to-end latency.
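As an illustration of that polling pattern (my sketch, not the post's code), an incremental read against an Iceberg table in Spark could look like the following; the table name, poll interval and catalog setup are assumptions.

# Sketch: poll an Iceberg table and read only the data committed since the
# last poll, using Iceberg's incremental read options in Spark. The table
# name, poll interval and catalog configuration are placeholders.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-read").getOrCreate()

TABLE = "lake.events"

def latest_snapshot_id():
    row = spark.sql(
        f"SELECT snapshot_id FROM {TABLE}.snapshots "
        "ORDER BY committed_at DESC LIMIT 1"
    ).first()
    return row.snapshot_id if row else None

last_seen = latest_snapshot_id()

while True:
    time.sleep(60)  # poll interval, far coarser than a Kafka consumer's fetch loop
    current = latest_snapshot_id()
    if current is None or current == last_seen:
        continue  # nothing new committed since the last poll
    if last_seen is not None:
        new_rows = (
            spark.read.format("iceberg")
            .option("start-snapshot-id", last_seen)  # exclusive lower bound
            .option("end-snapshot-id", current)      # inclusive upper bound
            .load(TABLE)
        )
        new_rows.show()
    last_seen = current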


Table format comparisons - How do the table formats represent the canonical set of files?


This is the first in a series of short comparisons of table format internals. While I have written about each in some detail, I think it's interesting to look at what is the same or similar across them and what sets them apart.

Question: How do the table formats represent the canonical list of data and delete files?

All the table formats store references to a canonical set of data and delete files within a set of metadata files. Each table format takes a slightly different approach, but I'll classify them into two categories (a conceptual sketch of the difference follows the list):

  • The log of deltas approach (Hudi and Delta Lake)

  • The log of snapshots approach (Iceberg and Paimon)
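Here is that conceptual sketch, a simplification of my own rather than either family's real metadata layout: the difference comes down to whether a reader replays a log of file additions and removals or simply dereferences the latest snapshot.

# Conceptual sketch only: not any format's real metadata structures, just the
# shape of the two approaches to deriving the canonical set of data files.

# Log of deltas (Hudi / Delta Lake style): each commit records the files it
# added and removed; the current file set is derived by replaying the log
# (in practice, a checkpoint plus the tail of the log).
delta_log = [
    {"add": ["f1.parquet", "f2.parquet"], "remove": []},
    {"add": ["f3.parquet"], "remove": ["f1.parquet"]},
]

def files_from_delta_log(log):
    files = set()
    for commit in log:
        files |= set(commit["add"])
        files -= set(commit["remove"])
    return files

# Log of snapshots (Iceberg / Paimon style): each commit produces a snapshot
# that references the complete file listing; the current file set is whatever
# the latest snapshot points to.
snapshots = [
    {"id": 1, "files": ["f1.parquet", "f2.parquet"]},
    {"id": 2, "files": ["f2.parquet", "f3.parquet"]},
]

def files_from_latest_snapshot(snaps):
    return set(snaps[-1]["files"])

print(files_from_delta_log(delta_log))        # f2.parquet and f3.parquet
print(files_from_latest_snapshot(snapshots))  # f2.parquet and f3.parquet

Both approaches yield the same canonical file list; they differ in how much work a reader must do to derive it.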


SQL Server CDC to Redshift Pipeline

In this post, we'll take a look at what Change Data Capture (CDC) is and how we can use it to get data from SQL Server into Redshift, either in a near real-time streaming fashion or in a more batched approach.

CDC is a SQL Server Enterprise feature, so it isn't available to everyone. There are also vendors, such as Attunity, that sell automated change data capture extraction and load into Redshift, and that may be your best option. But if you can't, or don't want to, pay for another tool on top of your SQL Server Enterprise license, then this post may help you.
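To give a feel for the polling side (a sketch of mine, not the pipeline the post describes), SQL Server's CDC change functions can be queried from Python with pyodbc; the connection string, the Orders table and its dbo_Orders capture instance are placeholders.

# Sketch: poll SQL Server's CDC change functions with pyodbc. The rows
# returned would then be staged (e.g. to S3) and loaded into Redshift with
# COPY. Connection details and the dbo_Orders capture instance are
# placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret"
)
cur = conn.cursor()

# CDC must already be enabled on the database and the source table, e.g.:
#   EXEC sys.sp_cdc_enable_db;
#   EXEC sys.sp_cdc_enable_table @source_schema = N'dbo',
#        @source_name = N'Orders', @role_name = NULL;

cur.execute("""
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
""")

for row in cur.fetchall():
    # __$operation: 1 = delete, 2 = insert, 4 = update (after image)
    print(row)

# A real incremental pipeline would persist the high LSN of each poll and use
# it as the low bound of the next one, rather than starting from the minimum.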

Building Synkronizr - A SQL Server Data Synchronizer Tool - Part 1

Origins of Synkronizr

In a recent post, I described a method I used at work for synchronizing a SQL Server slave with a master. Because the master is hosted by a partner that does not offer any replication, mirroring or log shipping, I opted for a replication technique loosely based on how some distributed NoSQL databases do it: the generation and comparison of hash trees (Merkle trees).
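Boiled down to a toy Python sketch (a simplification of mine, not the Synkronizr implementation), the hash-tree idea is: hash each row, roll the row hashes up into bucket hashes, and only compare rows inside the buckets whose hashes differ.

# Toy sketch of the hash-tree comparison idea (not the Synkronizr code):
# hash rows into keyed buckets on both sides, compare bucket hashes, and only
# the buckets that differ need a row-by-row comparison.
import hashlib

def row_hash(row):
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()

def bucket_hashes(rows, key_fn, num_buckets=16):
    buckets = {i: [] for i in range(num_buckets)}
    for row in rows:
        buckets[key_fn(row) % num_buckets].append(row_hash(row))
    # Each bucket hash is a hash over its sorted row hashes, i.e. a one-level
    # Merkle tree; a real tree would add further levels above the buckets.
    return {
        i: hashlib.sha256("".join(sorted(hashes)).encode()).hexdigest()
        for i, hashes in buckets.items()
    }

master = [(1, "alice"), (2, "bob"), (3, "carol")]
slave = [(1, "alice"), (2, "bob_old"), (3, "carol")]

key_fn = lambda row: row[0]  # bucket by primary key
m, s = bucket_hashes(master, key_fn), bucket_hashes(slave, key_fn)
mismatched = [i for i in m if m[i] != s[i]]
print("buckets needing row-level comparison:", mismatched)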