The Adoption of Machine Learning Will Resemble the Adoption of Databases

by Bryan Offutt

Image created with Midjourney

Artificial Intelligence has captured our interest and fascination at Index Ventures for many years now. With investments in companies like Scale, Aurora, Covariant, Cohere, Arthur.ai, Lightning, DeepScribe, Gong and many others, we’ve been firm believers and advocates of the incredible potential of this technology.

As we head into 2023, we continue to look at AI as one of our most important investment areas. Our talented data and infrastructure team got together to collectively identify the major trends in AI and distill our insights into four key pieces. We’ll be releasing them once a day for the rest of the week in hopes that the series will be useful to other operators, entrepreneurs, and investors in the space.

The adoption of machine learning will resemble the adoption of databases.

As with databases, every engineer will need to know how to use models, but very few will need to build them from scratch.

The foundation of application software for the past 50 years has been the database, but the foundation for the next 50 will be the machine-learning model. As a result, a basic understanding of machine learning and how models work will become an essential part of every engineer's toolbelt rather than the knowledge of a select few specialists. There will always be a place for machine learning engineers, but, like the folks who build database engines, their numbers will be few and they will work at a small number of large vendors.

In fact, the progress of AI/ML over the past decade already bears a striking resemblance to what transpired in the database world in the late 20th century. Early database progress unfolded in a few distinct phases, each of which played out over roughly a decade.

The Getting Started Phase (1960s): The first databases emerge. They are a powerful new concept, but they are difficult to use. Accessing even simple data is complicated, and all of the onus for efficient retrieval falls on the developer.

The Compute’s Not Cheap Enough Phase (1970s): In 1970, Edgar Codd published his paper outlining the relational model for databases, giving us the row-and-column mental model that we all know and love. The beauty of this model was its flexibility. It provided powerful, simple abstractions that could be expanded upon (via custom schemas) to fit a variety of use cases. Though this was a magical moment in database history, it was originally met with a good deal of skepticism, particularly in the early part of the decade. These systems were considerably easier to use and reason about, but they required a lot more computing power.

The Ease of Use and Commercial Explosion Phase (1980s): This was the decade relational databases boomed. As computing resources became cheaper, relational databases became considerably more cost-effective, and their ease of use catapulted them into the dominant technology they remain today. Particularly important was the way relational databases used query optimization to move much of the burden of performance management into the database and away from the developer, so using these systems required far less expertise than the network and hierarchical databases that preceded them. SQL (developed in the mid-1970s, but not standardized until 1986) became the lingua franca of databases, and large companies such as Oracle emerged as commercial powerhouses.
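
To make the query-optimization point concrete, here is a minimal sketch (in Python, using the built-in sqlite3 module) of the declarative style that SQL made possible. The table, index, and data are purely illustrative: the developer describes what they want, and the database decides how to retrieve it.

```python
# A minimal sketch (not from the original article) of the declarative style
# that SQL and query optimization enabled. Uses Python's built-in sqlite3;
# the "orders" table, index, and rows are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("acme", 120.0), ("globex", 75.5), ("acme", 42.0)],
)

# The developer states *what* they want; the engine decides *how* to get it.
for customer, total in conn.execute(
    "SELECT customer, SUM(total) FROM orders WHERE customer = ? GROUP BY customer",
    ("acme",),
):
    print(customer, total)

# The optimizer's chosen plan lives inside the database, not the application.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'acme'").fetchall())
```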

Ubiquity (1990s–Today): Fast forward 40 years, and virtually every software application in the world uses a database (whether relational or otherwise). How databases work is one of the first things you learn as a programmer, and developers up and down the stack need at least a basic understanding of them to be effective. Additionally, no company would ever think about building its own database; it would not be even remotely cost-efficient to do so. Instead, companies use off-the-shelf products that allow them to layer their use-case-specific data model (schema) on top. Some of these products are free and open source (Postgres); others come from commercial organizations worth billions. This has always been the case (the earliest databases came from vendors like IBM), but we think it’s still worth reiterating.

Interestingly, this progression closely mirrors what we have seen so far in the commercial adoption of machine learning and AI.

The Big-Company-Only Phase (Pre-2017): At first, training a machine-learning model that did anything useful was a highly specialized and very involved endeavor. You had to find your own data. You had to find enough compute to run training on that data. You had to choose an architecture. You had to have a deep understanding of hyperparameter tuning to optimize the output of training. All of this was expensive and, frankly, just plain hard. This was true even if you weren’t using any deep learning techniques.

The Early Transformer Phase (2017–2020): Like databases, the future started with a single paper. In the case of AI, it was a paper called “Attention Is All You Need,” which introduced a model architecture called the transformer. This was shortly followed by a slew of pre-trained foundation models (BERT, XLNet, GPT-2) that used this architecture to achieve state-of-the-art results, particularly in language. These models were subsequently open-sourced, and much as one can add a schema to a relational database to fit it to a use case, users can fine-tune these base models on their own data.
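
To give a rough sense of what fine-tuning looks like in practice, here is a minimal sketch assuming the open-source Hugging Face transformers library and PyTorch; the example texts and labels are purely illustrative.

```python
# A minimal sketch (not from the original article) of fine-tuning an
# open-source base model on your own data, assuming the Hugging Face
# transformers library and PyTorch. The texts and labels are hypothetical.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from a pre-trained foundation model, much like picking an
# off-the-shelf database, then adapt it to your use case.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Your own labeled data plays the role the schema played for databases.
texts = ["I loved this product", "This was a waste of money"]
labels = torch.tensor([1, 0])  # hypothetical labels: 1 = positive, 0 = negative
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):  # a few passes over the (tiny) example dataset
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")
```

In a real project you would use a proper dataset, data loader, and evaluation split, but the shape of the work is the same: start from a general-purpose base model and adapt it with your own data.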

The Billion Parameter and Billion Startup Phase (2020–Today): The current phase started with the launch of GPT-3 by OpenAI. GPT-3 proved that transformer models could be scaled to billions of parameters without performance plateauing, and that a single very large pre-trained base model could perform well on many different tasks. The combination of this generalizability with the fact that OpenAI offered GPT-3 via an API led to an absolute explosion of companies like Jasper and Copy.ai building on top of these models. Just as query optimizers moved performance complexity from the application developer to the database vendor, this phase of AI is seeing the complexity of training move from the end user to the model vendor.
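
To see how little machine-learning expertise this phase demands of the application developer, here is a rough sketch of what calling a hosted model over HTTP looked like when this piece was written, using OpenAI’s completions endpoint; the prompt is illustrative and an API key is assumed to be set in the environment.

```python
# A minimal sketch (not from the original article) of building on a hosted
# foundation model over an API, in the style of OpenAI's completions endpoint
# around the time this piece was written. The prompt is illustrative, and an
# OPENAI_API_KEY environment variable is assumed to be set.
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "text-davinci-003",  # a large pre-trained base model
        "prompt": "Write a short product description for a reusable water bottle.",
        "max_tokens": 100,
    },
    timeout=30,
)
print(response.json()["choices"][0]["text"])
```

From the application developer’s point of view, this is no more exotic than issuing a query to a managed database.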

The Ubiquity Phase (The Future): Before we know it, saying a product “uses AI” will seem as silly and obvious as saying a product “uses a database.” Today, the average developer might not know how a query planner works, but they definitely use a database. Similarly, we don’t expect the average developer in five years to know how a transformer works. But we guarantee they will know how to use a model. Companies like Cohere and Twelve Labs are already creating these foundational building blocks for text and video, respectively, allowing users to access powerful models as simply as they would use a database.

As a testament to the growing ubiquity of these models, countless companies leverage AI as a core component but do not advertise it as a key differentiator. For example, our portfolio company Gong uses speech-to-text extensively in its product, yet the front page of its website makes no mention of ML or AI. The model itself is an implementation detail: important, but not differentiating. What is differentiated is the fantastic product experience that Gong has built around speech-to-text models, in much the same way that Salesforce differentiated itself through the great product experience it built around a database, not through the database itself. It’s this experience that has allowed each of them to build multibillion-dollar businesses.

Published — Dec. 16, 2022