November 26, 2024

Data

Dismantling ELT: The Case for Graphs, Not Silos

November 26, 2024

Data

Dismantling ELT: The Case for Graphs, Not Silos

ELT is a bridge between silos. A world without silos is a graph.

I’ve been banging my drum recently about the ills of Conway’s Law and the need for low-coupling data architectures. In my Curse of Conway and the Data Space blog post, I explored how Conway’s Law manifests in the disconnect between software development and data analytics teams. It is a structural issue stemming from siloed organizational designs, and it not only causes inefficiencies and poor collaboration but ultimately hinders business agility and effectiveness.

Jack Vanlightly

November 13, 2024

Data

Incremental Jobs and Data Quality Are On a Collision Course - Part 2 - The Way Forward

Jack Vanlightly

November 13, 2024

Data

So what should we do instead?

This is less of a technology problem and more of a structural problem. We can’t just add some missing features to data tooling; it’s about solving a people problem, how we organize together, how team incentives line up, and also about applying well-established software engineering principles that are still to be realized in the data analytics space.

Jack Vanlightly

November 13, 2024

Data

Incremental Jobs and Data Quality Are On a Collision Course - Part 1 - The Problem

Jack Vanlightly

November 13, 2024

Data

Big data isn’t dead; it’s just going incremental

If you keep an eye on the data space ecosystem like I do, then you’ll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and associated data sets), one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on their platforms, and some surprising conclusions were drawn – one being that most queries were run over quite small data. The conclusion (of DuckDB) was that big data was dead, and you could use simpler query engines rather than a data warehouse. It’s far more nuanced than that, but data shows that most queries are run over smaller datasets.

Why?

Jack Vanlightly

October 28, 2024

Data

The ultimate guide to table format internals - all my writing so far

Jack Vanlightly

October 28, 2024

Data

I’ve created this page to make it easier for me to share links about my writing on table format internals. Currently, it includes Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon.

Jack Vanlightly

September 19, 2024

Data

Table format comparisons - Change queries and CDC

Jack Vanlightly

September 19, 2024

Data

Table format comparisons - Change queries and CDC

This post, and its associated deep dives, will look at how changes made to an Iceberg/Delta/Hudi/Paimon table can be emitted as a stream of changes. In the context of the table formats, it is not a continuous stream, but the capability to incrementally consume changes by performing periodic change queries.

These change queries can return full Change Data Capture (CDC) data or just the latest data written to the table. When people think of CDC, they might initially think of tools such as Debezium that read the transaction logs of OLTP databases and write a stream of change events to something like Apache Kafka. From there the events might get written to a data lakehouse. But the lakehouse table formats themselves can also generate a stream of change events that can be consumed incrementally. That is what this post is about.

Jack Vanlightly

August 22, 2024

Data

Table format comparisons - Streaming ingest of row-level operations

Jack Vanlightly

August 22, 2024

Data

Table format comparisons - Streaming ingest of row-level operations

In the previous post, I covered append-only tables, a common table type in analytics used often for ingesting data into a data lake or modeling streams between stream processor jobs. I had promised to cover native support for changelog streams, aka change data capture (CDC), but before I do so, I think we should first look at how the table formats support the ingestion of data with row-level operations (insert, update, delete) rather than query-level operations that are commonly used in SQL batch commands.

Jack Vanlightly

August 13, 2024

Data

Table format comparisons - Append-only tables and incremental reads

Jack Vanlightly

August 13, 2024

Data

Table format comparisons - Append-only tables and incremental reads

This post is about how the table formats support append-only tables and incremental reads. Streaming is becoming more and more important in the data analytics stack and the table formats all have some degree of support for streaming. One of the pillars of a streaming workload based on table formats is the append-only table. There are other pillars, such as changelog streams, and I’ll cover those in another post.

Incremental reads allow compute engines to perform repeated queries that return new records or changes to records that have occurred since the last query was executed. Basically, a table client polls the table on an interval, receiving the latest data on each occasion. Much like a Kafka consumer, albeit with a lot more end-to-end latency.

Jack Vanlightly

August 7, 2024

Data

Table format comparisons - How do the table formats represent the canonical set of files?

Jack Vanlightly

August 7, 2024

Data

Table format comparisons - How do the table formats represent the canonical set of files?

This is the first in a series of short comparisons of table format internals. While I have written in some detail about each, I think it’s interesting to look at what is the same or similar and what sets them apart from each other.

Question: How do the table formats represent the canonical list of data and delete files?

All the table formats store references to a canonical set of data and delete files within a set of metadata files. Each table format takes a slightly different approach but I’ll classify them into two categories:

The log of deltas approach (Hudi and Delta Lake)
The log of snapshots approach (Iceberg and Paimon)

Jack Vanlightly

April 28, 2018

Data

SQL Server CDC to Redshift Pipeline

Jack Vanlightly

April 28, 2018

Data

In this post we'll take a look at what Change Data Capture (CDC) is and how we can use it to get data from SQL Server into Redshift in either a near real-time streaming fashion or more of a batched approach.

CDC is a SQL Server Enterprise feature and so not available to everyone. Also there are vendors that sell automated change data capture extraction and load into Redshift, such as Attunity and that may be your best option. But if you can't or don't want to pay for another tool on top of your SQL Server Enterprise license then this post may help you.

Jack Vanlightly

November 12, 2016

Data

Building Synkronizr - A SQL Server Data Synchronizer Tool - Part 1

Jack Vanlightly

November 12, 2016

Data

Origins of Synkronizr

In a recent post I described a method I recently used at work for synchronizing a SQL Server slave with a master. Because the master is hosted by a partner and that partner does not offer any replication, mirroring or log shipping I opted for a replication technique loosely based on how some distributed NoSQL databases do it - the generation and comparison of hash trees (Merkle Trees).