March 2024
From: Brian and Tobias
Subject: Kubernetes for AI workload orchestration
Welcome back to the B&T infra newsletter. For those of you in NYC, hope you’re enjoying our first tastes of Spring.
In this edition, we talk about AI workload orchestration: why it’s an unsolved problem, why Kubernetes is an okay but imperfect solution, and why, despite both of those realities, it will still be extremely difficult to build a better Kubernetes for AI. These questions matter because AI progress is bottlenecked by hardware supply, and every layer of the stack needs to be optimized to get the most performance out of the hardware we have. Let’s dive in…
AI “workload orchestration” – the choice is Kubernetes
At face value, “workload orchestration” is an almost hilariously unclear phrase, but at the highest level it refers to scheduling tasks – deciding when, where, and in what order infrastructure-related jobs run.
To give a concrete example, when Brian ran a dating site circa 2010, the engineering team ran an algorithm multiple times a day to determine how well each user matched with other users. Many coordinated tasks were involved in creating this match score, each of which required compute. The team simultaneously managed a plethora of mundane things like site traffic, email sends, credit card transactions, models checking for spam activity, etc. At the time, the right orchestration software didn’t exist, and all of this was done manually.
Orchestration abstracts this coordination away from developers. The king of software orchestration is Kubernetes (k8s), an open source project that Google released in 2014. K8s uses containers to make deploying, managing, and scaling (i.e. orchestrating) software applications much easier. In the ~10 years since its release, it has become the standard for enterprises orchestrating increasingly complex fleets of software apps. This is no different for AI workloads and applications, where k8s is also the standard. OpenAI famously has one of the biggest k8s fleets in the world and has written about the extent to which it has scaled its k8s infrastructure.
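For readers who haven’t touched k8s directly, here is a minimal sketch of the core idea – you declare a workload and k8s keeps it running and scales it for you. It uses the official Kubernetes Python client; the image name, labels, and replica counts are all illustrative, not anyone’s real setup.

```python
# A minimal sketch of declarative orchestration with the official Kubernetes
# Python client. The image, labels, and replica counts are made up.
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig (e.g. ~/.kube/config)
apps = client.AppsV1Api()

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="match-scorer"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # k8s keeps 3 copies running, restarting them if they die
        selector=client.V1LabelSelector(match_labels={"app": "match-scorer"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "match-scorer"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="scorer",
                        image="registry.example.com/match-scorer:latest",
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "2", "memory": "4Gi"},
                        ),
                    )
                ]
            ),
        ),
    ),
)
apps.create_namespaced_deployment(namespace="default", body=deployment)

# Later, scaling up is a one-line patch; k8s handles placement and rollout.
apps.patch_namespaced_deployment_scale(
    name="match-scorer",
    namespace="default",
    body={"spec": {"replicas": 10}},
)
```

The point is the declarative model: you say what you want running, and the control plane continuously works to make reality match – which is exactly why it became the default for ordinary software fleets.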
OpenAI is not alone. MosaicML used k8s to orchestrate training for all its open source LLMs, and still does as part of Databricks. There is increasing evidence that k8s is the core orchestration solution for AI inference too. We don’t have to look further than GTC to realize this: k8s is a core foundational layer of NVIDIA’s inference stack, including its newly released NIM inference microservices offering (in NVIDIA’s stack diagrams, the big k8s logo sits right above CUDA).
Where k8s falls short, and the return of Slurm
Despite the dominance of k8s in AI workload orchestration, it has a tremendous number of deficiencies. K8s was not built for AI, and therefore many of the features you might want out of an orchestration layer for AI are simply missing. AI training requires provisioning a large fleet of parallelized machines that all do massive amounts of computation at the same time. K8s was not built for this kind of synchronized work, and although it does offer parallel processing, we have heard it is hard to scale. Provisioning the right kind of hardware is also essential for AI – whether the machine spun up for a workload has a GPU or a CPU (and more specifically, an A100 vs. an H100 vs. specialized accelerators, etc.) matters a ton. K8s isn’t great at this either. And finally, AI workloads are really expensive and require careful tuning of when machines scale up and down to optimize for cost and performance. K8s is deficient in this area as well. We’ve seen some startups trying to tackle this set of problems in novel ways. An example is Cedana, which enables checkpointing, pausing, and migrating compute jobs across any instance and cloud provider.
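To make the hardware point concrete: the standard way to ask k8s for specific accelerators today is a combination of node labels and the nvidia.com/gpu extended resource exposed by NVIDIA’s device plugin. A rough sketch follows – the image is hypothetical, and the exact node label varies by cluster and cloud provider.

```python
# A rough sketch of requesting specific accelerators in k8s with the official
# Python client. The image is hypothetical, and the node label shown is the
# one published by NVIDIA's GPU feature discovery; your cluster or cloud
# provider may use a different label entirely.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="train-worker-0"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Pin the pod to nodes carrying a specific GPU model.
        node_selector={"nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB"},
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/trainer:latest",
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    # GPUs are an "extended resource" exposed by the NVIDIA
                    # device plugin; k8s only counts whole devices and knows
                    # nothing about interconnect or topology.
                    limits={"nvidia.com/gpu": "8"},
                ),
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)
```

Note what this doesn’t buy you: there is no native way to say “start all 64 of these workers together or not at all,” and no awareness of GPU generations or network topology beyond whatever labels someone bolted on – which is where the gaps above show up in practice.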
We were also recently talking to a Senior ML Engineer at Bloomberg about their k8s infrastructure. He had just been to KubeCon in Paris, and AI was unsurprisingly a big topic. He said the reason k8s is struggling to meet the needs of AI is that k8s is notoriously difficult to manage if you have multiple clusters (each a collection of nodes running applications in containers). Historically, the k8s control plane sits inside a cluster, not on top of multiple clusters, which makes multi-cluster management difficult regardless of whether the workload is AI-related. This is okay if you are a single-cloud shop running one large cluster, but once you go multi-cloud, you need to run multiple clusters for applications hosted across those different cloud providers.
According to him, this has been a big deal in the context of AI. GPUs are really hard to get, and supply is fragmented across many cloud providers, forcing companies to become more multi-cloud than they’ve ever been in search of whatever GPUs they can find. Multi-cloud means multiple k8s clusters, which means terrible complexity and no unified control plane for managing a fleet holistically. Even within a single cloud, managing fleets of k8s clusters is hard enough, given the operational overhead and cluster maintenance required. Our portfolio company Plural simplifies the intricacies of k8s fleet management by offering automated dependency management for cluster upgrades as well as monitoring and security tools, all within a single interface.
This problem was so severe for MosaicML that they developed a software layer on top of k8s to support all their model training. We recently chatted with a Researcher at MosaicML (now at Databricks) who told us that building a software layer to optimize their existing k8s deployment was a long project involving many senior engineers, and that maintaining that internal product still requires a big team. We assume the same sort of reality exists across the other big model companies like OpenAI, Anthropic, and Cohere: k8s supplemented with bespoke software to orchestrate training (and likely inference) workloads.
We’ve also increasingly been hearing about k8s alternatives that address some of these problems, specifically Slurm. Slurm is a workload manager originally developed for high-performance computing (HPC) use cases. It was released all the way back in 2002, and because it was created to serve supercomputers, it is really good at running many parallel jobs – something that’s also important for AI use cases. Slurm is making a definite comeback: within a two-day span in March, it came up in four different conversations. That’s probably more times than we had heard of Slurm before in our entire lives.
There is also evidence that people aren’t only talking about Slurm, but actually using it for AI use cases. One of our portfolio companies manages open source stacks in support of AI projects for its customers. It recently deployed Slurm for the first time in one of its customers’ environments, and we expect this won’t be the last time. Although Slurm is much less powerful than k8s, it does possess some basic functionality that AI workloads require but that was never architected into k8s.
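For contrast with the k8s sketch above, here is roughly what “give me N GPU nodes and run one coordinated job across all of them” looks like in Slurm. We submit the batch script from Python just to stay consistent with the earlier example; the partition name, resource counts, and training script are made up.

```python
# A minimal Slurm submission sketch: ask for 4 nodes with 8 GPUs each, and let
# srun launch one training process per GPU across all nodes. The partition,
# paths, and training script are illustrative.
import subprocess
import textwrap

batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=llm-pretrain
    #SBATCH --partition=gpu
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=8
    #SBATCH --gres=gpu:8
    #SBATCH --time=24:00:00

    srun python train.py --config configs/pretrain.yaml
""")

# sbatch reads the script from stdin and queues the job. Slurm holds the job
# until all 4 nodes are available, then starts every task at once - the
# all-or-nothing allocation that synchronous training runs need.
subprocess.run(["sbatch"], input=batch_script, text=True, check=True)
```

The appeal is that the all-or-nothing, whole-node allocation is the default behavior rather than something you have to engineer on top.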
Why this doesn’t mean “k8s for AI” will work
The natural reaction to this writeup might be to go start working on an orchestration platform for AI. In fact, we are exploring a version of this idea and think there is a real opportunity here if done thoughtfully, although we still have a lot of questions.
The biggest one is: who would the customer be? The likely reality is that the biggest customers for a really efficient AI orchestrator already have homegrown solutions that combine k8s with some internally built software (like what Mosaic has built). Replacing that kind of internal behemoth is an impossible sale. We believe that training will be won by a select few – there will be a very concentrated number of companies with the capital required to do large-scale training. As a result, the training market for AI orchestration is probably already saturated with homegrown k8s-based products that work well enough.
The hope, then, is that k8s is as deficient for inference as it is for training. Jensen’s depiction of k8s as an essential part of NVIDIA’s inference stack is a promising signal on this front. Our bet is that many, many companies will do massive amounts of inference in the years to come, from the model companies to inference-as-a-service providers, to full-stack apps, to enterprises with their own fine-tuned deployments of models. The universe of customers for an inference-focused orchestration solution is likely much bigger than for a training-focused one, and that could present an interesting opportunity.
With that said, this too has many question marks. Perhaps the inference-as-a-service companies like Together effectively are this solution already, even if they use k8s, just at a higher level of abstraction than companies are currently used to. Perhaps inference orchestration ends up being a less acute problem than training orchestration, eliminating a burning need for this solution. Perhaps something like Anyscale (commercialized Ray) is actually the solution, enabling features like gang scheduling via Ray clusters on top of k8s (a sketch of what that looks like is below). And then of course, the elephant in the room is that no k8s company effectively commercialized the open source technology (with the only potential exception being Red Hat, a unique case). Why would it be different this time around?
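For what it’s worth, here is a minimal sketch of what gang scheduling looks like with Ray: a placement group reserves all of the worker slots as one atomic unit before any work starts. It assumes a running Ray cluster (for example via KubeRay on k8s); the resource shapes and worker function are illustrative.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")  # connect to an existing Ray cluster (e.g. KubeRay on k8s)

# Reserve 8 single-GPU bundles as one atomic unit: the placement group only
# becomes ready once ALL bundles can be scheduled - the all-or-nothing
# behavior that synchronous distributed training wants.
pg = placement_group([{"GPU": 1, "CPU": 4}] * 8, strategy="PACK")
ray.get(pg.ready())

@ray.remote(num_gpus=1)
def train_shard(rank: int) -> str:
    # Placeholder for one worker's slice of a distributed training job.
    return f"worker {rank} done"

# Pin every worker task into the reserved placement group.
strategy = PlacementGroupSchedulingStrategy(placement_group=pg)
results = ray.get([
    train_shard.options(scheduling_strategy=strategy).remote(rank)
    for rank in range(8)
])
print(results)
```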
We’re not sure yet what the answers to these questions are, but we think something will happen in this space. Lots of potential routes and solutions exist, and the ones we’ve touched on here are not an exhaustive list by any stretch. We’re looking to speak to experts in your network who could refine our understanding of the problem space, potential solutions, and what this means for startups.
For this edition, we want to give a special shoutout to Chris Gaun, Sam Weaver, and Michael Guarino for having conversations with us on this topic and providing feedback on the piece.
As always, please reach out with thoughts and feedback. Also, we’ll be at RSA in May and would love to see as many of you there as possible. If you’ll be there too, let us know!
Until next time,
Brian and Tobias