May 2024
From: Brian and Tobias
Subject: The opportunity for orchestration
Hello all,
Welcome back to the May edition of the B&T infra newsletter. This month, we’re doing something a little different: an addendum to a previous edition. In March, we wrote about Kubernetes, Slurm, and the need for better orchestration solutions for AI workloads.
That newsletter struck a nerve. VCs investing in AI infra are awaiting a grand inference deluge as more and more AI solutions gain traction. We're in a fascinating moment where the underlying software – the foundation model – is more commoditized than the hardware infrastructure required to scale it, hence NVIDIA's roughly $2T in value creation over the past 18 months alone and the rise of AI infra founders.
The market leader among these AI clouds is CoreWeave, which offers a fully managed platform purpose-built for AI workloads, through which some of the world’s leading AI labs and enterprises consume GPU infrastructure. They recently closed a round at a $19B valuation, and they've developed an innovative approach to optimizing hardware within their platform: SUNK (aka Slurm on Kubernetes). In many ways, this is the solution we were looking for back in March when we wrote about AI orchestration and the challenge of K8s and Slurm, but we think there's still more to build.
SUNK: Slurm on Kubernetes
In our prior newsletter, we proposed that orchestrating AI training workloads often comes down to a choice between Kubernetes and Slurm, and that companies who pick Kubernetes inevitably end up modifying it. Both solutions have their pros and cons. Kubernetes is great at managing and deploying containerized workloads and scaling them up and down, but weak at provisioning and gang-scheduling the tightly coupled, multi-node jobs that distributed training requires. Slurm is a traditional HPC scheduler that excels at exactly those parallel workloads, but it is bad at scaling machines up and down as needs change; it’s essentially static and inflexible to real-time shifts in workloads.
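To make the contrast concrete, here is a minimal sketch of how the same training job might be handed to each system, using the official Kubernetes Python client on one side and shelling out to Slurm's sbatch on the other. The image name, script names, and resource counts are our own illustrative assumptions, not taken from any real deployment.

```python
import subprocess
from kubernetes import client, config

# Kubernetes: declare a containerized job; the cluster can elastically add or remove nodes,
# but a vanilla Job has no notion of gang-scheduling several nodes as one unit.
def submit_k8s_training_job():
    config.load_kube_config()
    container = client.V1Container(
        name="trainer",
        image="example.com/trainer:latest",   # illustrative image name
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "8"}),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="train-run"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

# Slurm: ask the scheduler for a gang of nodes, carved out of a fixed, pre-provisioned pool.
def submit_slurm_training_job():
    subprocess.run(
        ["sbatch", "--nodes=4", "--gpus-per-node=8", "--time=24:00:00", "train.sbatch"],
        check=True,
    )
```

The Kubernetes path gets containers and elastic scaling but, out of the box, no gang scheduling of a multi-node job as a single unit; the Slurm path gets exactly that gang, but only out of a statically sized partition.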
CoreWeave noticed this problem and built SUNK as the answer. For a great overview of SUNK’s capabilities, check out this video. The idea is to run Slurm on top of Kubernetes, so that users no longer have to decide between the two. Typically, companies with AI workloads keep one pool of compute running Kubernetes and a separate pool running Slurm, and they have to decide which workloads go to which scheduler, facing tradeoffs either way. SUNK removes that tradeoff.
More interesting, though, is the use case SUNK truly unlocks: using the same cluster for both training and production inference. In the past, this would have been really hard to pull off. Training is typically done with Slurm and inference on Kubernetes, and because these schedulers run on separate pools of compute, the two kinds of work had to live on separate resources. More specifically, bursts of inference need hardware to be available just in time. Because Slurm does not auto-scale up and down, accommodating those inference bursts on compute reserved for training was impossible, unless you simply left a large chunk of the resources sitting unused by training jobs.
SUNK gives Slurm workloads the auto-scaling powers of Kubernetes, meaning inference and training workloads can share the same pool of compute. This is a massive benefit for companies, helping them use hardware as efficiently as possible. With SUNK, a company can deprioritize a non-urgent training run in real time in favor of a much more urgent inference job. This unlocks efficiencies for customers with intense training and inference needs.
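To illustrate the economics (and only the economics; this is not how SUNK itself is implemented), here is a toy Python sketch of one shared GPU pool in which an urgent inference burst preempts a lower-priority training job. All class, job, and method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SharedGpuPool:
    """Toy model of one GPU pool shared by training and inference (illustrative only)."""
    total_gpus: int
    allocations: dict = field(default_factory=dict)  # name -> (gpus, priority); higher = more urgent

    def free_gpus(self) -> int:
        return self.total_gpus - sum(gpus for gpus, _ in self.allocations.values())

    def request(self, name: str, gpus: int, priority: int) -> bool:
        # Evict the lowest-priority work until the request fits (or nothing lower remains).
        while self.free_gpus() < gpus:
            victims = [(p, n) for n, (_, p) in self.allocations.items() if p < priority]
            if not victims:
                return False                 # nothing lower priority to preempt
            _, victim = min(victims)
            print(f"preempting {victim} to make room for {name}")
            del self.allocations[victim]
        self.allocations[name] = (gpus, priority)
        return True

pool = SharedGpuPool(total_gpus=64)
pool.request("nightly-training-run", gpus=64, priority=1)  # training fills the whole pool
pool.request("inference-burst", gpus=16, priority=10)      # urgent traffic preempts it just in time
```

In a real cluster the training job would be checkpointed and resumed, or shrunk, rather than simply dropped, but the economics are the point: the burst gets hardware immediately, and no standing reserve of idle GPUs is needed.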
The problem of orchestration is real, but who will solve it?
SUNK shows us that status quo orchestration falls short. CoreWeave is building products to make provisioning GPUs and running AI workloads more efficient because existing solutions don’t do a good enough job.
In fact, all the GPU cloud companies are investing in workload orchestration and optimization, because this capability is a core means of differentiation. The GPUs themselves are the same across cloud players, so they can’t differentiate through the hardware; these companies need to invest in software and optimization capabilities instead. That’s also why there are a lot of GPU clouds with varying prices, driven by how good each is at hardware and workload optimization.
Traditionally, we’ve thought about this solution as something built either for the clouds or for the companies building AI products. We think both could still be customers, although each presents its own challenges.
Threading the needle on a solution here is tough because of the market dynamics: cloud vendors are incentivized to build in-house, and AI builders typically don’t want to manage low-level, close-to-the-metal optimizations.
The opportunity in inference
We think there are still some opportunities despite this challenging reality, particularly in inference. When we talk to cloud providers, the vast majority of their orchestration and optimization attention is on training right now. This makes sense: training was a $47B market in 2023, while inference was only a $6B one. Inference simply lags training by definition – you can only run inference once models have been trained and actually work.
The result is a very robust and effective training stack, but more question marks when it comes to inference. Companies generally run inference on Kubernetes, which works but isn’t optimized for it. Inference-as-a-service companies like Together AI and Baseten are going after the inference opportunity specifically, but it is still early days.
The future that gets us excited is one where inference explodes, and models become fragmented, diverse, and therefore complex to orchestrate. Imagine a company like Superhuman serving different models for different use cases to a bunch of different customers. Tobias has a model for email completion in the tone of Tobias’ emails, but Brian has a differently fine-tuned model for the same purpose. They also have different models recommending who to send emails to, summarizing emails, and so on. This means lots of small models running inference at different times – big spikes that are momentary and fleeting. Now extrapolate that use case to thousands of software companies bringing AI to market for millions of users and you have unprecedented scale. This future will also include local, on-device inference, given the new partnership between OpenAI and Apple. That won’t take away from compute reserved for large-scale inference, but it is part of the story for how inference becomes pervasive.
The orchestration in this world is massively complex. A company would need to serve different customers and different models from the same hardware, keeping as little of its GPU fleet idle as possible. That amounts to an extremely demanding auto-scaling service, which Kubernetes may or may not be equipped to handle. We are exploring whether this particular world, should it come to be, needs a new orchestrator.
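To give a flavor of the scheduling problem, here is a hypothetical sketch of the core loop such an orchestrator might run: map per-model request rates to replica counts, pack replicas onto a finite GPU fleet, and scale idle models to zero to keep utilization high. Every name and number here is an assumption for illustration; this is not Kubernetes’ Horizontal Pod Autoscaler, nor any vendor’s product.

```python
import math

# All numbers below are assumptions for illustration only.
GPUS_IN_FLEET = 128
REQS_PER_REPLICA = 50      # assumed throughput of one replica of a small fine-tuned model
GPUS_PER_REPLICA = 1

def desired_replicas(reqs_per_sec: float) -> int:
    """Scale to zero when a model is idle; otherwise enough replicas to absorb the current rate."""
    return 0 if reqs_per_sec == 0 else math.ceil(reqs_per_sec / REQS_PER_REPLICA)

def reconcile(demand: dict[str, float]) -> dict[str, int]:
    """One planning step: serve the hottest models first, within a fixed GPU budget."""
    plan, gpus_left = {}, GPUS_IN_FLEET
    for model, rate in sorted(demand.items(), key=lambda kv: kv[1], reverse=True):
        replicas = min(desired_replicas(rate), gpus_left // GPUS_PER_REPLICA)
        if replicas > 0:
            plan[model] = replicas
            gpus_left -= replicas * GPUS_PER_REPLICA
    return plan

# A momentary spike: many per-user models, most of them idle, a few suddenly hot.
demand = {"email-tone-tobias": 900.0, "email-tone-brian": 5.0, "summarizer-idle": 0.0}
print(reconcile(demand))   # {'email-tone-tobias': 18, 'email-tone-brian': 1}
```

A real orchestrator also has to handle everything this toy ignores: cold-start time to load model weights, packing differently sized models onto shared GPUs, and fairness across tenants, all continuously and at the scale of thousands of models.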
This world is not fantasy. We are believers in a big inference market to come, and we’re seeing signs of this future already. NVIDIA now attributes ~40% of its data center business to inference, a number that caught everyone by surprise (although Ben Thompson has written brilliantly about how that number is likely skewed by Meta). Some think inference will end up accounting for 80%-90% of AI compute at scale. In this world, we think inference orchestration becomes a massive problem that Kubernetes alone is ill-equipped to solve. We’ll be tracking this market and looking for opportunities to invest in teams building in it.
Before we end, we want to thank Max Hjelm, SVP of Revenue at CoreWeave (and a reader of this newsletter), for many of these insights. He taught us about SUNK and all the awesome work CoreWeave is doing to solve orchestration at scale.
As always, we welcome thoughts and feedback.
Until next time,
B&T