The Sisyphean struggle and the new era of data infrastructure

Note: Chris Riccomini also wrote about commoditization in the data space in his interesting article: Databases are commodities, now what? My thoughts broadly match those of Chris, but I want to focus beyond technology here and speak about how strategy and vision play into this dynamic as well.

I just started re-reading Start With Why by Simon Sinek, a fantastic book on leadership and business strategy. The book’s core argument is that great companies don’t focus on what they do or offer, nor how they do it. Instead, they focus on their WHY, their story, and what they stand for.

This quote in particular jumped out at me:

There’s barely a product or service on the market today that customers can’t buy from someone else for about the same price, about the same quality, about the same level of service and about the same features. If you truly have a first-mover’s advantage, it’s probably lost in a matter of months. If you offer something truly novel, someone else will soon come up with something similar and maybe even better. But if you ask most businesses why their customers are their customers, most will tell you it’s because of superior quality, features, price or service. In other words, most companies have no clue why their customers are their customers. This is a fascinating realization. ― Start With Why, Simon Sinek.

For a while now, I’ve been spending a lot of time thinking about technology trends in the data infrastructure space (as a researcher at Confluent). These trends are making some previously difficult things easy and, therefore, commodities. I would go as far as to say that we are witnessing a kind of phase change, a regime shift, at least in the cloud. Almost inevitably, the quote above brought me back to this topic, as it revolves around the subject of commodity and competition.

The man-hours required to build a distributed data system have never been so low. S3 is becoming the universal storage layer. Projects such as Apache Arrow, DataFusion, and Velox are open-source building blocks that system builders can compose into various types of data systems. Open table formats, such as Apache Iceberg, have dramatically reduced the complexity of building a data lake(house). We are seeing the slow but inevitable commoditization of the data primitives used to build a modern data architecture.

If the queue, the topic, the table, and even the database are becoming commodities, how do we build lasting data infrastructure businesses that can differentiate themselves to attract long-term customers and long-term success? When new competitors can spring up by leveraging these new high-level components, where is the moat? Should we expect a sector that resembles the JavaScript framework wars of the 2010s, where seemingly each month a new hot framework became popular?

In some ways, we’ve already been living through this in the data/applications space. The ZIRP (zero interest rate policy) era resulted in a period of “free money” which saw billions invested in venture capital. The result has been a swarm of start-ups with overlapping ideas and a crowded market. The Modern Data Stack epitomized this phenomenon, with a constantly changing pool of competing start-ups that each tried to bite off one specific chunk of the stack.

But the buzz around the Modern Data Stack died when ZIRP died. People realized that many start-ups weren’t businesses; they were features. There are parallels to the data primitives being commoditized by S3, Apache Iceberg, and the database component projects. Building a data lake is a feature; it is no longer a business. Even building a database is becoming a feature. ClickHouse Cloud wrote about how they built their service in a year! Tomorrow, another group of talented engineers may come out of stealth, after a year of hard work, showing off their Postgres-compatible serverless database built on Arrow, DataFusion, et al. with S3 as its storage layer.

Finding (and keeping) the moat in this new era

If we look at the history of software development, we see that languages have become higher and higher level. Software got built on higher-level abstractions, and we developed more sophisticated tooling. The productivity of the individual programmer went up and with that, we built more and more sophisticated software. The commoditization of the data lake, the storage layer, and database components is this trend continuing to play out. The result, as before, will be more sophisticated software. As the cost of building data primitives decreases, more time can be focused on the capabilities of the platform as a whole. If data primitives are features, then platforms are businesses; platforms are the moat.

A platform is a coherent and cohesive set of features that seamlessly work together to provide a service that is more than the sum of its parts. Platforms are difficult to build: they comprise a thousand tiny things that must be developed and maintained, and that must make sense as a whole. They take years to build out and require a careful focus on core revenue streams that enable them to grow and develop over time.

“Beware of distractions disguised as opportunities.” ― Attributed to Jim Ziegler.

The platform must avoid becoming encumbered by a myriad of features and capabilities that ultimately distract from the business's core values. There are always more features: some to please customers, some because a competitor has them, or some because they plausibly look like a way of extending the platform's reach. Every potential feature and capability should be viewed through the lens of a long-term vision, a story of what that platform is about. The world should know what that vision is and should see a platform that is concordant with it.

“All organizations start with WHY, but only the great ones keep their WHY clear year after year.” ― Start With Why, Simon Sinek.

The great platforms in the data infrastructure space, such as Databricks, Snowflake, MongoDB, and Confluent, have a strong and clear WHY. That was what attracted me to Confluent in the first place: how all the products worked together to form one cohesive vision. Confluent has the metaphor of the central nervous system (CNS). The CNS connects all parts of the body together, enabling them to work in unison for a common aim. Confluent uses the phrase Data in Motion over and over to communicate that its value is in the transmission and sharing of data. Everything Confluent does has been in line with its WHY, and its platform is a cohesive set of pieces that act in service of this WHY. Even the recent Tableflow announcement is consistent with this story, as Confluent sees Iceberg as another data-sharing primitive. It is very difficult to hoard data when part of your roadmap is Bring Your Own Bucket (BYOB).

“The only way people will know what you believe is by the things you say and do, and if you’re not consistent in the things you say and do, no one will know what you believe.”  ― Start With Why, Simon Sinek.

There are other examples of great platforms that have executed well over many years. Databricks has always focused on being the one-stop shop for data analytics, catering to the different personas of data engineers, analysts, and data scientists. They have been consistent for years, both in their external messaging and in the products they have developed. They have evolved their platform carefully to encompass a wide array of capabilities while also staying consistent with their WHY. MongoDB, since its inception, has been synonymous with developer productivity, flexible data models, scalability, and performance. Both of these companies have been rewarded for it.

So far I have focused on platforms, but Chris Riccomini argues in his article, Databases are commodities. Now what?, that there are three ways for a database company to stay competitive: 1) build a platform, 2) build a vertical (a niche), and 3) build a multi-modal database. I would classify the third point more broadly (as I am considering the data infra space as a whole), and say that a company can build something more sophisticated, something new and innovative. There are still new technology frontiers out there to be built. The baseline has risen, and with it, the bleeding edge. Chris uses the example of HTAP, the still arguably unachieved dream of seamlessly combining the capabilities of OLTP and OLAP into a single database system. From the perspective of an engineer and researcher, I find the notion that there are still frontiers out there encouraging - I want to keep working on interesting data infrastructure!

“I was worried that database commoditization might leave the space dull. After thinking through these opportunities, I’m excited.” Chris Riccomini.

In the case of Confluent, it has its own new technology frontiers. The principal new frontier in my mind is the unification of streams and tables. Confluent and other vendors in the streaming space are still working on making streaming as easy as batch and even blurring the lines between the two. There is still a lot of ground to cover to reach that goal.

Every plucky start-up that is a feature today should be racing to build a platform, attacking a niche area where the niche itself is their moat, or pushing the frontiers of technology (by enabling new workloads, not optimizing existing ones). A commodity start-up will (hopefully) have some angle that it is using to gain traction. But there is nothing stopping another copycat, or even the larger platforms, from adding that feature or angle to their own suite and extinguishing the unique value of the newcomer (or acquiring them). When the angle is “like X but faster”, or “Y but cheaper”, or “Z but implemented in Rust”, those are not WHYs, they aren’t even very good differentiators, and they certainly are not easily defensible.

(Spinal Tap reference for those of you not born in the prior millennium)

The Sisyphean Struggle

Coming back to my question at the beginning, I don’t see a future of start-ups and copycats turning the industry into a chaotic mess. The building blocks are making it easier to build primitives, but that only raises the table stakes for everyone. A fitting metaphor for the continual race between table stakes and the bleeding edge comes from evolutionary biology: the Red Queen hypothesis.

“Progress and success are always relative. When the land was unoccupied by animals, the first amphibian to emerge from the sea could get away with being slow, lumbering, and fish-like, for it had no enemies and no competitors. But if a fish were to take to the land today, it would be gobbled up by a passing fox as surely as a Mongol horde would be wiped out by machine guns. In history and in evolution, progress is always a futile, Sisyphean struggle to stay in the same relative place by getting ever better at things.

“This concept, that all progress is relative, has come to be known in biology by the name of the Red Queen, after a chess piece that Alice meets in Through the Looking-Glass, who perpetually runs without getting very far because the landscape moves with her.” ― Matt Ridley, The Red Queen: Sex and the Evolution of Human Nature

As the building blocks evolve, companies respond by racing further ahead of the baseline, building ever more powerful platforms or ever more capable primitives that raise the bar again. This dynamic has existed since the dawn of technology. We may be in a phase change, but it’s not the first and won’t be the last. The companies that will thrive in this Sisyphean struggle will be those that embrace this change while also sticking to their WHY year after year. Every company that wants long-term and lasting success should know why its customers chose it, beyond price and product features.

I recommend both Start With Why and The Red Queen; they are excellent books!