Data

Table format comparisons - Streaming ingest of row-level operations

In the previous post, I covered append-only tables, a common table type in analytics, often used for ingesting data into a data lake or for modeling the streams between stream processor jobs. I had promised to cover native support for changelog streams, aka change data capture (CDC), but before doing so I think we should first look at how the table formats support ingesting data as row-level operations (insert, update, delete), rather than the query-level operations commonly used in SQL batch commands.
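
To make the distinction concrete, here is a minimal sketch of a stream of row-level operations keyed by primary key, applied to an in-memory table. The RowOp shape and its field names are illustrative assumptions of mine, not any table format's API.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: RowOp is not a table-format API, just the general
# shape of a row-level changelog entry.
@dataclass
class RowOp:
    op: str                          # "insert", "update" or "delete"
    key: int                         # primary key of the affected row
    payload: Optional[dict] = None   # column values; None for deletes

# A small batch of row-level operations, as a streaming writer might receive them.
ops = [
    RowOp("insert", 1, {"name": "alice", "balance": 100}),
    RowOp("update", 1, {"name": "alice", "balance": 80}),
    RowOp("delete", 2),
]

# Apply them to an in-memory "table" keyed by primary key.
table = {2: {"name": "bob", "balance": 50}}
for o in ops:
    if o.op == "delete":
        table.pop(o.key, None)
    else:  # insert and update both upsert by key
        table[o.key] = o.payload

print(table)  # {1: {'name': 'alice', 'balance': 80}}
```

A query-level equivalent would instead describe the same change declaratively in a single batch statement, which is exactly the contrast explored in the post.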

Table format comparisons - Append-only tables and incremental reads

This post is about how the table formats support append-only tables and incremental reads. Streaming is becoming more and more important in the data analytics stack and the table formats all have some degree of support for streaming. One of the pillars of a streaming workload based on table formats is the append-only table. There are other pillars, such as changelog streams, and I’ll cover those in another post.

Incremental reads allow compute engines to perform repeated queries that return new records, or changes to existing records, that have occurred since the last query was executed. Essentially, a table client polls the table on an interval and receives the latest data each time, much like a Kafka consumer, albeit with a lot more end-to-end latency.
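
As a rough sketch of that polling pattern: the read_incremental helper and the version-tagged records below are illustrative assumptions, not any engine's real API; engines typically expose incremental reads through a starting snapshot or version option instead.

```python
import time

def read_incremental(table_state, since_version):
    """Hypothetical helper: return records added after `since_version` and the
    latest version seen. Real engines expose this via a starting snapshot or
    version option rather than a function like this."""
    new_records = [r for r in table_state if r["version"] > since_version]
    latest = max((r["version"] for r in table_state), default=since_version)
    return new_records, latest

# Simulated table contents, each record tagged with the commit/version that added it.
table_state = [
    {"version": 1, "value": "a"},
    {"version": 2, "value": "b"},
]

last_seen = 0
for _ in range(3):  # in practice this loop runs indefinitely on a schedule
    batch, last_seen = read_incremental(table_state, last_seen)
    for record in batch:
        print("consumed", record)
    time.sleep(0.1)  # the poll interval; far coarser than a Kafka consumer's fetch loop
```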

Table format comparisons - How do the table formats represent the canonical set of files?

This is the first in a series of short comparisons of table format internals. While I have written in some detail about each, I think it’s interesting to look at what is the same or similar and what sets them apart from each other.

Question: How do the table formats represent the canonical list of data and delete files?

All the table formats store references to the canonical set of data and delete files within a set of metadata files. Each table format takes a slightly different approach, but I'll classify them into two categories, contrasted in the short sketch after the list:

  • The log of deltas approach (Hudi and Delta Lake)

  • The log of snapshots approach (Iceberg and Paimon)
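
To illustrate the distinction, here is a deliberately simplified sketch; the structures below are stand-ins, not the actual metadata layouts of Hudi, Delta Lake, Iceberg or Paimon. In the log-of-deltas style the canonical file set is derived by replaying add/remove actions (real systems add checkpoints so they don't replay everything), while in the log-of-snapshots style it is read directly from the latest snapshot.

```python
# Log-of-deltas style: each commit records actions; the canonical file set is
# derived by replaying the log.
delta_log = [
    {"add": ["file-1.parquet", "file-2.parquet"], "remove": []},
    {"add": ["file-3.parquet"], "remove": ["file-1.parquet"]},
]

def files_from_delta_log(log):
    files = set()
    for commit in log:
        files.update(commit["add"])
        files.difference_update(commit["remove"])
    return files

# Log-of-snapshots style: each commit writes a complete snapshot of the file
# set; the canonical file set is simply whatever the latest snapshot lists.
snapshots = [
    {"id": 1, "files": ["file-1.parquet", "file-2.parquet"]},
    {"id": 2, "files": ["file-2.parquet", "file-3.parquet"]},
]

def files_from_snapshots(snaps):
    return set(snaps[-1]["files"])

# Both approaches resolve to the same canonical file set.
assert files_from_delta_log(delta_log) == files_from_snapshots(snapshots)
```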

SQL Server CDC to Redshift Pipeline

In this post we'll take a look at what Change Data Capture (CDC) is and how we can use it to get data from SQL Server into Redshift in either a near real-time streaming fashion or more of a batched approach.

CDC is a SQL Server Enterprise feature, so it isn't available to everyone. There are also vendors, such as Attunity, that sell automated change data capture extraction and load into Redshift, and one of those may be your best option. But if you can't, or don't want to, pay for another tool on top of your SQL Server Enterprise license, then this post may help you.
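
To give a flavour of the extraction side, here is a minimal sketch that polls SQL Server's CDC change function with pyodbc. It assumes CDC is already enabled with a capture instance named dbo_Orders; the connection string is a placeholder, and starting from the minimum available LSN on every poll is a simplification (a real pipeline would persist the last LSN it processed).

```python
import pyodbc

# Placeholders: adjust the connection string and capture instance to your environment.
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
CAPTURE_INSTANCE = "dbo_Orders"  # created when CDC was enabled on dbo.Orders

def fetch_changes():
    """Pull all changes currently available for the capture instance."""
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        cursor.execute(f"""
            DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('{CAPTURE_INSTANCE}');
            DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();
            SELECT * FROM cdc.fn_cdc_get_all_changes_{CAPTURE_INSTANCE}(@from_lsn, @to_lsn, N'all');
        """)
        return cursor.fetchall()

# With 'all', __$operation is 1 = delete, 2 = insert, 4 = update (after image).
# These rows would then typically be staged to S3 and loaded into Redshift via COPY.
for row in fetch_changes():
    print(row)
```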

Building Synkronizr - A SQL Server Data Synchronizer Tool - Part 1

Origins of Synkronizr

In a recent post I described a method I used at work for synchronizing a SQL Server slave with a master. Because the master is hosted by a partner, and that partner does not offer any replication, mirroring or log shipping, I opted for a replication technique loosely based on how some distributed NoSQL databases do it: the generation and comparison of hash trees (Merkle trees).

Exploring the use of Hash Trees for Data Synchronization - Part 1

In this post we'll explore a relational database replication strategy that you can use when standard database replication is not an option: no replication feature, no log shipping, no mirroring and so on. The approaches outlined below will only work with a master-slave model where all writes go to the master. Conflict resolution is not addressed in this article.

We’ll cover phase one of a two-phase approach:
1. Generate and compare hash trees to identify blocks of rows that have discrepancies
2. For each block with a different hash value, identify and import the individual changes (insert, update, delete)
This post is really about exploring the approach rather than looking at the implementation details and detailed performance metrics. Perhaps I might share some code and metrics in a later post if people are interested.
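
To give a feel for phase one, here is a tiny sketch (not Synkronizr itself) that hashes fixed-size blocks of rows, ordered by primary key, on both sides and reports which blocks disagree. A real implementation would build these block hashes into a Merkle tree so that matching subtrees can be skipped with a single comparison.

```python
import hashlib

def block_hashes(rows, block_size=2):
    """Hash fixed-size blocks of rows (already ordered by primary key)."""
    hashes = []
    for i in range(0, len(rows), block_size):
        block = rows[i:i + block_size]
        hashes.append(hashlib.sha256(repr(block).encode("utf-8")).hexdigest())
    return hashes

# Rows as (primary_key, value) tuples, ordered by primary key on both sides.
master = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
slave  = [(1, "a"), (2, "b"), (3, "x"), (4, "d")]  # row 3 has drifted

mismatched = [
    i for i, (m, s) in enumerate(zip(block_hashes(master), block_hashes(slave)))
    if m != s
]
print(mismatched)  # [1] -> only the second block needs row-level comparison in phase two
```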