Jack Vanlightly

Incremental Jobs and Data Quality Are On a Collision Course - Part 1 - The Problem

Big data isn’t dead; it’s just going incremental

If you keep an eye on the data space like I do, then you’ll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and associated data sets), one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on its platform, and some surprising conclusions were drawn – one being that most queries were run over quite small data. DuckDB’s conclusion was that big data was dead, and you could use simpler query engines rather than a data warehouse. It’s far more nuanced than that, but the data does show that most queries run over smaller datasets.

Why?

Forget the table format war; it’s open vs closed that counts

Apache Iceberg is a hot topic right now and looks to be the future standard for representing tables in object storage. Hive tables are overdue for a replacement. People talk about table format wars: Apache Iceberg vs Delta Lake vs Apache Hudi and so on, but the “war” at the forefront of my mind isn’t which table format will become dominant; it’s the battle between open and closed – open table formats vs walled gardens.

The teacher's nemesis

A few months ago I wrote Learning and Reviewing System Internals - Tactics and Psychology. One thing I touched on was how it is necessary to create a mental model in order to grok a codebase, or learn how a complex system works. The mental model gets developed piece by piece, using a layer of abstractions.

Today I am also writing about mental models and abstractions, but from the perspective of team/project leaders and their role in onboarding new team/project members. In this context, the team lead and senior engineers are teachers, and how effective they are as teachers has a material impact on the success of the team. However, there are real challenges, and leaders can fail without being aware of it, with potentially poor outcomes if left unaddressed.

The curse of Conway and the data space

Conway’s Law:

"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."

This is playing out worldwide across hundreds of thousands of organizations, and nowhere is it more evident than in the split between software development and data analytics teams. These two groups usually have different reporting structures, right up to, or immediately below, the executive team.

This is a problem now and is only growing.

Table format interoperability, future or fantasy?

In the world of open table formats (Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon, etc), an emerging trend is to provide interoperability between table formats by cross-publishing metadata. It allows a table to be written in table format X but read in format Y or Z.

Cross-publishing is the idea of a table having (see the sketch after this list):

  • A primary table format that you write to.

  • Equivalent metadata files of one or more secondary formats that allow the table to be read as if it were of that secondary format. 
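To make that concrete, here is a minimal, purely conceptual sketch in Python. It does not produce real Iceberg, Delta, Hudi or Paimon metadata; the table path, file names and the write_data_files, commit_primary_metadata and publish_secondary_metadata functions are hypothetical stand-ins. The point it illustrates is the one that matters: the data files are written once, and only lightweight metadata is duplicated per format.

```python
import json
from pathlib import Path


def write_data_files(table_dir: Path, rows: list[dict]) -> list[str]:
    """Write the table's data exactly once (a real table would write Parquet files)."""
    data_dir = table_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    data_file = data_dir / "part-00000.json"  # stand-in for a Parquet data file
    data_file.write_text(json.dumps(rows))
    return [str(data_file)]


def commit_primary_metadata(table_dir: Path, files: list[str]) -> None:
    """Hypothetical commit in the primary format (format X) that you write to."""
    meta_dir = table_dir / "metadata"
    meta_dir.mkdir(exist_ok=True)
    (meta_dir / "v1.metadata.json").write_text(
        json.dumps({"format": "X", "files": files})
    )


def publish_secondary_metadata(table_dir: Path, files: list[str], fmt: str) -> None:
    """Hypothetical cross-publish step: write equivalent metadata for a secondary
    format that points at the *same* data files, rather than copying them."""
    meta_dir = table_dir / f"_{fmt}_metadata"
    meta_dir.mkdir(exist_ok=True)
    (meta_dir / "00000.json").write_text(
        json.dumps({"format": fmt, "files": files})
    )


if __name__ == "__main__":
    table = Path("/tmp/orders_table")
    files = write_data_files(table, [{"id": 1, "amount": 9.99}])
    commit_primary_metadata(table, files)                     # written as format X
    for secondary in ("Y", "Z"):
        publish_secondary_metadata(table, files, secondary)   # readable as Y or Z
```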

Table format comparisons - Change queries and CDC

This post, and its associated deep dives, will look at how changes made to an Iceberg/Delta/Hudi/Paimon table can be emitted as a stream of changes. In the context of the table formats, it is not a continuous stream but rather the ability to consume changes incrementally by performing periodic change queries.

These change queries can return full Change Data Capture (CDC) data or just the latest data written to the table. When people think of CDC, they might initially think of tools such as Debezium that read the transaction logs of OLTP databases and write a stream of change events to something like Apache Kafka. From there the events might get written to a data lakehouse. But the lakehouse table formats themselves can also generate a stream of change events that can be consumed incrementally. That is what this post is about.
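As one concrete flavor of such a periodic change query, here is a hedged PySpark sketch using Delta Lake’s Change Data Feed. The table name "orders", its columns, and the version range are made up for illustration, and the table is assumed to have been created with the delta.enableChangeDataFeed property set to true; the equivalent Iceberg, Hudi and Paimon incantations differ.

```python
# Sketch of a periodic change query against a Delta table's Change Data Feed.
# Assumes a hypothetical table "orders" created with
# TBLPROPERTIES (delta.enableChangeDataFeed = true).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("incremental-change-query")
    # Delta Lake session extensions; the delta-spark package itself must be on
    # the classpath (how depends on your environment).
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read only the changes committed between two table versions.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 10)   # last version processed by the previous run
    .option("endingVersion", 15)     # table version at the time of this run
    .table("orders")
)

# Each returned row carries CDC columns such as _change_type (insert, delete,
# update_preimage, update_postimage), _commit_version and _commit_timestamp.
changes.select("order_id", "_change_type", "_commit_version").show()
```

Each run records the last version it processed and uses it as the starting point of the next query, which is what makes the consumption incremental rather than continuous.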

BYOC, not “the future of cloud services” but a pillar of an everywhere platform

In 2023, I wrote a long post about why I think the future of cloud data services is not BYOC but large-scale multi-tenant SaaS. BYOC stands for Bring Your Own Cloud and is the practice of deploying a co-managed service into a customer VPC. It sits somewhere between self-hosted and fully-managed SaaS. In that post, I covered in detail the drawbacks of this deployment model from the perspective of both the vendor and the customer.

Since then, I’ve been involved in multiple calls with customers and prospective customers where BYOC has been a major discussion point. When we lost deals to BYOC competitors, there were often valid reasons for it. A year on, my position on BYOC hasn’t really changed, though I would clarify that it has been focused on a BYOC flavor where the vendor co-manages a complex, stateful, single-tenant service. Confluent could have decided to package up Confluent Platform, its single-tenant self-hosted service, put it on Kubernetes with an operator, and give it to customers as BYOC. But that wasn’t the right route for building out a BYOC offering at scale. Then Warpstream came along and showed another way of doing BYOC, one that avoids many of the pitfalls that make scaling a BYOC fleet so difficult.

In this post, I will reflect on my last year of customer conversations, movements in the market, Confluent’s acquisition of Warpstream, and its embrace of BYOC as a third deployment model.