October 2023
From: Brian and Tobias
Subject: Data unlocking enterprise adoption of AI
In this edition, we want to talk about data. We believe that a series of Data Unlocks are needed to bring Generative AI into production in the enterprise.
The importance of data is not a new concept. I was recently speaking with a relative who’s 80 years old and worked with the IBM System/360 in the 70s. He fondly reminisced about his time at IBM as “the era of big data.”
Fast forward to the 21st century and we’re saying the same things. In the 2010s, a canonical “modern data stack” emerged. Many VCs poured thousands of words onto the page explaining what it would look like and why it mattered. MLOps received a boatload of funding, and some incredible companies – Snowflake, Databricks, dbt – were founded. Data has been the driving force behind a nearly century-old IT revolution.
However, the data stack of the 2010s is not the permanent data stack of the 2020s and beyond. The consensus suspicion is that the tooling we use needs to change for LLMs, and that opens up opportunities for us as VCs to invest in new startups.
We won’t claim to know today how the stack needs to change specifically. In fact, engineers building LLM apps are still figuring that out. However, we do know what needs to get unlocked. LLM deployments are stuck in the mud at enterprises, and a lack of data infrastructure is one reason why. We call it “the POC chasm.” Databricks recently published data that confirms this: although 88% of organizations surveyed are using Generative AI, 62% of them (70% of those using it) are still just experimenting, yet to productionalize.
Improving enterprise data is a key bridge to production adoption. The problem is multifaceted and includes qualities like security, speed, cost, lineage, organizational specificity, access, and correctness, among others. Let’s dive into these.
Security:
If AI is as dangerous as many tech luminaries suggest, how to make it safe is of profound significance. Far more practical than alignment questions, enterprises need to contend with new attack vectors from GenAI, govern internal usage of these tools, and protect their most precious asset, their data, when building their own LLM solutions. We’re most interested in the low-level data classification, tracking, and cleansing work that will need to be done for the enterprise to safely leverage its data.
Speed:
The second area is speed, or latency of inference – being able to run models to produce outputs and results quickly enough to deliver a great customer experience. We were talking to a data engineer at a large consulting firm. In her opinion, latency is the biggest issue with current enterprise AI deployments. This is easy to see even in simple consumer examples. Go onto any text-to-image generation platform. Images can take minutes to generate. For companies at scale managing inference across thousands or millions of users, speed is still a big unsolved problem. We’re in the “dial up phase” of GenAI.
Databricks articulates well that a core driver of this problem is the tradeoff on GPUs between latency and throughput (defined as tokens/second evaluated). The more powerful the model, the slower it is to respond. Etched is trying to tackle this problem with new hardware, but there are other approaches too. One gaining a lot of momentum in developer communities is Neural Magic, which enables you to run LLMs on CPUs far more efficiently than standard CPU inference, narrowing the gap with GPUs.
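To make the latency/throughput tension concrete, here’s a toy model of batched decoding. All timing numbers are illustrative assumptions we made up for the sketch, not measurements of any real GPU or model:

```python
# Toy model of the GPU latency/throughput tradeoff under batching.
# base_ms and per_seq_ms are made-up illustrative costs, not benchmarks.

def decode_step_ms(batch_size: int, base_ms: float = 20.0, per_seq_ms: float = 1.5) -> float:
    """Assumed cost of one decode step: a fixed cost (model weights must be
    read regardless of batch size) plus a small per-sequence cost."""
    return base_ms + per_seq_ms * batch_size

def stats(batch_size: int, tokens: int = 100) -> tuple[float, float]:
    """Return (latency for one user's full response in seconds,
    aggregate throughput in tokens/second across the batch)."""
    step = decode_step_ms(batch_size)
    latency_s = tokens * step / 1000
    throughput = batch_size * 1000 / step
    return latency_s, throughput

for b in (1, 8, 32):
    lat, tps = stats(b)
    print(f"batch={b:>2}  latency={lat:5.2f}s  throughput={tps:7.1f} tok/s")
```

Bigger batches amortize the fixed cost and push aggregate tokens/second up, but each individual user waits longer for their tokens, which is exactly the tension inference providers are navigating.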
Cost:
Data cost extends beyond the cost of inference. Inference is just the last leg of a long journey to extract value from data. Even in traditional data analytics, costs are problematic. “If I had a nickel for every time someone complained about their Datadog bill…” is a very un-funny but real joke VCs tell. Across every layer of the stack, costs are painful, and standalone companies have emerged just focused on cost optimization, most notably for storage and observability.
This same dynamic is true for unstructured AI workloads, too. The process of prepping and storing data, and then fine-tuning with it, will be costly. All of these cost problems emerge even before a model is ready for prime time and any inference has occurred.
Cribl is a template we’re considering in the context of AI. It has weirdly flown under the radar. According to a press release from the company in late October:
“Cribl… has surpassed $100 million in annual recurring revenue (ARR), growing from $1 million to $100 million in ARR in less than four years. Cribl becomes the fourth-fastest infrastructure company to reach centaur status, following Wiz (1.5 years), HashiCorp (3 years), and Snowflake (3.5 years), and just ahead of Datadog (4.5 years).”
An investor recently gave us a great explanation of Cribl – it’s a filtering mechanism that lets companies decide what data does or doesn’t get fed into their observability tooling. The first use case was Splunk, but that has since extended to other observability solutions too. Engineering teams don’t want their observability tools to ingest all data. It’s a waste of compute and money. Cribl is an on/off switch for deciding what data gets passed through.
The power of this approach is something we’re considering. Whenever a startup lands in the same category as Wiz, HashiCorp, Snowflake, and Datadog, it’s worth paying attention. We think the Cribl model may have applications to the data challenges emerging in the use of LLMs.
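A minimal sketch of that on/off switch, with hypothetical event fields and routing rules (this is not Cribl’s actual configuration model):

```python
# Route only the log events worth paying to ingest; drop the rest.
# Field names and rules below are hypothetical, for illustration only.

KEEP_LEVELS = {"ERROR", "WARN"}
DROP_NOISY_SOURCES = {"health-check", "debug-probe"}

def should_forward(event: dict) -> bool:
    """Decide whether an event gets forwarded to the observability backend."""
    if event.get("source") in DROP_NOISY_SOURCES:
        return False
    return event.get("level") in KEEP_LEVELS

events = [
    {"level": "ERROR", "source": "api", "msg": "upstream timeout"},
    {"level": "INFO", "source": "api", "msg": "request ok"},
    {"level": "ERROR", "source": "health-check", "msg": "probe flapped"},
]
forwarded = [e for e in events if should_forward(e)]
print(len(forwarded), "of", len(events), "events forwarded")
```

The value is that the decision happens before ingestion, so the downstream tool never bills you for the noise.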
Lineage:
Lineage (or explainability) is also key. In fact, one of the most interesting companies we’ve looked at recently was in this area. Companies need to be able to understand how a model makes decisions and what data it uses, both for correctness and compliance reasons. Explainability comes up regularly as an important roadblock for enterprises to productionalize LLMs. OpenAI came out with a tool in May trying to solve this problem, and lots of other traditional model observability businesses, as well as new startups like Alvin, are tackling this problem.
Organizational specificity:
Since ChatGPT came out, everyone has been talking about the need for fine-tuning and domain-specific models. This is true – we have heard repeatedly that the general-purpose models from providers like OpenAI and Anthropic are a great starting point but not specific enough to drive enterprise value. Just watch Sam Altman’s recent developer day presentation – customization is one of its pillars (check out minute ~10). Fine-tuning and bespoke model deployment is a key initiative.
Many methods are emerging to help re-jigger foundation models to be more use case-specific. The two core methods we’re focused on are (1) methods that incorporate human preferences/feedback into the model’s learning (e.g. supervised fine-tuning and reinforcement learning from human feedback) and (2) retrieval-augmented generation (RAG). Unlike human feedback mechanisms, RAG connects LLMs to external data sources and helps the model retrieve the right data at the right time. The methods here are changing quickly, and more standards will continue to emerge. We think the winning approach will combine many of them, giving enterprises the flexibility and interoperability to use these tools together, each in the right place.
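To show what RAG looks like mechanically, here’s a minimal sketch using toy bag-of-words similarity. A production system would use learned embeddings and a vector store, and the documents here are invented:

```python
# Minimal RAG sketch: retrieve the most relevant document, then prepend
# it as context to the prompt sent to the LLM. Toy similarity only.
import math
import re
from collections import Counter

docs = [
    "refund policy: customers may return items within 30 days",
    "shipping policy: orders ship within 2 business days",
]

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str) -> str:
    """Return the document most similar to the query."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

def build_prompt(query: str) -> str:
    """The retrieved document becomes context the model can ground on."""
    return f"Context: {retrieve(query)}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What is the refund policy?"))
```

The point is the shape of the pipeline: retrieval happens at query time, so the model sees fresh, organization-specific data without any retraining.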
Access:
Companies often don’t even have the right data to fine-tune or build their own models in the first place. There are opportunities to increase this access, particularly in regulated industries. No company represents this better than Dandelion Health, a company in our portfolio that is building the best multimodal clinical healthcare data set in the world, particularly relevant for AI healthcare businesses. We suspect that companies focused on data access will be specialized, like Dandelion. There are likely similar opportunities across verticals like financial services and education.
Synthetic data also plays an important role in the data access problem. Because companies lack the data they need, an ability to create a representative data set using AI is a big opportunity. More companies we speak to are using synthetic data in their approaches, and we think there are good investment opportunities in verticalized synthetic data providers. Applied Intuition is one company we’ve heard is doing well, and one component of their business is providing synthetic data specifically to the automotive industry.
Correctness:
As the old adage goes, “garbage in, garbage out.” Data needs to be accurate to get any value out of AI. We were recently talking to a founder who said that by the time companies experiment with fine-tuned models, they find that LLMs perform 20% worse than they expect because they don’t have rich enough data internally. An important solution is the ability to prep internal data to be usable by LLMs. I recently interviewed Brian Raymond, CEO & Founder of Unstructured. Unstructured is building the “ETL for LLMs,” allowing companies to extract and transform unstructured data sitting in HTML, PDF, and other text-based formats into data that is machine-readable by LLMs. You can read the interview with Brian here – I found it incredibly educational and fun.
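As a rough illustration of the extract-and-transform step (standard library only; Unstructured’s actual API is different and far more capable), here’s how raw HTML might become LLM-ready chunks:

```python
# Hedged sketch of the "ETL for LLMs" idea: pull the text out of an HTML
# document and split it into word-bounded chunks for downstream use.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, dropping the markup."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def html_to_chunks(html: str, max_words: int = 50) -> list[str]:
    """Extract text and split it into chunks of at most max_words words."""
    parser = TextExtractor()
    parser.feed(html)
    words = " ".join(parser.parts).split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

html = "<html><body><h1>Policy</h1><p>Returns accepted within 30 days.</p></body></html>"
print(html_to_chunks(html))
```

Real pipelines also have to handle PDFs, tables, headers/footers, and encoding noise, which is exactly why this is a company-sized problem rather than a weekend script.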
The amount of activity happening at the infra layer of AI is dizzying – observability, reinforcement learning, embeddings generation, etc. To organize the opportunities, we’ve taken a bigger-picture view of the Data Unlocks that can remove bottlenecks from enterprise adoption. We’re continuing to add to the list above, and it’s just the tip of the iceberg of the data problems we see in LLM productionalization.
As always, please share thoughts and feedback, and let us know if there’s anyone in your network we should add to this distribution.
Until next time,
Brian & Tobias