September 2024
From: Brian, Tobias, and Gaby
Subject: The lagging infrastructure surrounding the GPU
Welcome to the better-late-than-never “September” edition of our Infra newsletter. We’re moving this one from B&T (Brian & Tobias) to BT&G, where the ‘G’ is for Gaby Lorenzi, an Analyst at Primary who was instrumental in this month’s newsletter.
Our focus with this piece: the lagging infrastructure surrounding the GPU.
The GPU unlocked today’s AI revolution because of its ability to parallelize matrix multiplication, the operation underlying Transformer models. This is the power move of the GPU. No one could have predicted the scale it would drive – the number of racks being strung together, the volume of power, the memory demands, etc. – and so it’s no wonder that the technology surrounding the GPU is not up to the task. It’s as if the GPU is the LeBron James of hardware, stuck with a JV lineup. Upgrading the team represents a trillion-dollar opportunity.
NVIDIA GPUs are delivered as systems that include memory chips, networking, storage, chassis, motherboard, cooling, power supply, and more, much of it supplied by long-established incumbents like SK hynix, Monolithic Power Systems, and Lattice, all founded before the turn of the century. These companies represent trillions of dollars in market cap, and we think many are ripe for disruption. In this newsletter, we're going to discuss some of the biggest “ex-GPU infra” areas for investment: memory, power, networking, and cooling. For each, we will discuss incumbents, startups, and fresh opportunities.
Memory
Improvements to memory have dramatically lagged behind their compute counterparts, meaning we have hit “the memory wall” – compute speed outpaces the rate at which data can be transferred to and from memory. This has been brewing since the 90s, when the phrase was coined. The result is sprawling GPU clusters for training and a drag on latency during inference, because huge LLMs need fast access to the billions or trillions of parameters stored in memory.
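A back-of-envelope calculation illustrates the wall. During autoregressive decoding at batch size 1, every generated token requires reading essentially all model weights from memory, so per-token latency is bounded below by model size divided by memory bandwidth, regardless of how fast the compute is. The numbers below (a hypothetical 70B-parameter model in FP16, ~3.35 TB/s of HBM bandwidth, roughly one H100) are illustrative assumptions, not figures from this newsletter:

```python
# Memory-bandwidth lower bound on LLM decode latency (illustrative numbers).
params = 70e9             # hypothetical 70B-parameter model
bytes_per_param = 2       # FP16 weights
hbm_bandwidth = 3.35e12   # bytes/s, roughly one H100's HBM3 bandwidth

weights_bytes = params * bytes_per_param             # 140 GB of weights
min_time_per_token = weights_bytes / hbm_bandwidth   # seconds per token

print(f"{min_time_per_token * 1e3:.1f} ms/token minimum")  # ~41.8 ms/token
```

At roughly 42 ms per token (about 24 tokens/s) before any compute or networking overhead, the bandwidth of the memory, not the FLOPs of the GPU, sets the ceiling, which is why bandwidth improvements like photonic interconnects matter so much.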
Samsung (~$300B market cap) and SK hynix ($89B market cap) have dominated the memory market for decades, and these companies are largely dependent on existing architectures, production methods, and, most importantly, revenue streams that come from their current memory chips.
There have been some interesting alternatives to existing types of memory. Photonics is perhaps the most interesting innovation here: using light, not electricity, to move data. Photonic interconnects dramatically improve bandwidth, the rate at which data can be read from and written to memory.
There are well-funded scale-ups building photonic solutions.
Existing memory solutions are up against the laws of physics. Electricity needs a physical conductor to travel, which imposes a transport tax in resistance and heat. Light, conversely, moves without a conductor, loses far less energy in transit, and takes less power to move the same data. Photonics has emerged as a market-making solution here. It is no longer the moonshot idea it was a few years ago, and it represents the best example of a new technology yielding a step-change improvement. There may be more on the horizon.
Power
There are ~12 steps of energy transfer between a utility tower and the chip. From the source, energy moves through the data center, into the racks, into individual servers, onto the chip, and finally into the on-chip power systems that step it down to roughly one volt for the GPU. Not only does every link in this chain have associated energy loss that can be optimized, but the sheer volume of energy required for AI training and inference is breaking the system.
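The compounding effect of those ~12 conversion steps is easy to underestimate: per-stage losses multiply. Even if each stage were 98% efficient (an illustrative number, not a measured one), the chain as a whole would lose more than a fifth of the energy before it ever reaches the GPU:

```python
# Compounding loss across a chain of power-conversion stages
# (illustrative per-stage efficiency, not a measured figure).
stages = 12
stage_efficiency = 0.98

end_to_end = stage_efficiency ** stages
print(f"end-to-end efficiency: {end_to_end:.1%}")   # ~78.5%
print(f"energy lost in transfer: {1 - end_to_end:.1%}")  # ~21.5%
```

This is why shaving even a fraction of a percent of loss per stage, or removing stages outright, is worth real money at data center scale.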
Mark Zuckerberg stated earlier this year that companies like Meta “would probably build out bigger [GPU] clusters than we currently can if we could get the energy to do it.” Note the lengths Elon Musk went to in order to acquire the 700MW needed to power X’s cluster of 100K H100s.
The market moment for power is unique, and likely more interesting than for memory, for two reasons:
On-chip power systems have largely been built by Monolithic Power Systems ($45B market cap) and Infineon ($42B market cap). Both companies sell into diverse markets, such as electronics, automotive, and medical. While power systems of old needed to diversify their customer base, the explosion of AI workloads makes AI-specific power systems potentially viable, opening up opportunities for new entrants. This dynamic is not completely dissimilar from the one in semis, where companies like Etched are now possible because of the demand for a singular architecture. Simultaneously, the need to optimize energy usage on chips has perhaps never been greater.
We’re looking at companies tackling this problem.
We expect to see many more opportunities, and are particularly excited about ones optimizing specifically for the GPU, gaining efficiencies over general purpose systems as a result.
Networking
The future of AI data centers is 100K-GPU superclusters built across entire campuses of buildings. In this future, networking is more crucial than ever. The largest line item in AI infra spend goes to accelerators; the second largest goes to networking.
In 2019, in an effort to secure its fortress, NVIDIA acquired Mellanox for $6.9B. The acquisition cemented NVIDIA, and its InfiniBand technology, as the industry leader. Before InfiniBand’s rise, Ethernet, dominated by Broadcom, had been the market leader, but it has fallen out of favor with NVIDIA’s dominance and the rise of AI. Still, Ethernet is 5x-10x less expensive than InfiniBand and remains important for certain elements of networking within the data center.
We believe there are opportunities in the market for lower-cost solutions that don’t result in vendor lock-in and over-reliance on NVIDIA.
In a market dominated largely by NVIDIA’s expensive InfiniBand offering, startups that can maintain latency while undercutting cost and avoiding NVIDIA lock-in may have a real chance. We’re seeing it unfold with the momentum of Enfabrica and Cornelis.
Cooling
NVIDIA’s latest chip, Blackwell, will be 5x faster than the H100, but with a caveat: it can only achieve this performance if it’s liquid cooled. Most legacy data centers are built around air cooling, while liquid cooling requires radiators and tubing to shuttle water through the machine. Cooling is expensive, in some facilities representing over 50% of a data center’s power requirement.
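One standard way to see the scale of this burden is PUE (power usage effectiveness): total facility power divided by the power delivered to IT equipment, with 1.0 being the unreachable ideal. A minimal sketch, with illustrative numbers of our own rather than figures from the newsletter:

```python
def pue(it_mw: float, cooling_mw: float, other_mw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT power."""
    return (it_mw + cooling_mw + other_mw) / it_mw

# Illustrative: a 100 MW IT load whose cooling draws half as much
# again, plus 10 MW of other overhead (lighting, UPS losses, etc.).
print(round(pue(100, 50, 10), 2))  # 1.6
```

A facility like this spends 60 MW, more than half its IT load again, on overhead, which is why cooling innovation translates so directly into capacity.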
Fans and heatsinks, from the likes of Supermicro, are installed on the system, but facilities bear much of the cooling burden. Similar to the dynamics in power, incumbent cooling providers, like Schneider Electric, have largely served markets broader than just data centers. The need for cooling is increasing because GPUs run so hot, and performance gains are only possible with intense cooling capabilities. As a result, data centers will need to be built with much more sophisticated cooling in mind, and this could create new opportunities.
The biggest reason to believe startups may crack the cooling market is the massive expected growth in data center construction. As an entirely new scale of data centers gets built, new cooling technologies may be necessary as well.
Our Thesis
The AI revolution is underway and today the biggest bottleneck is hardware.
$258B will go into capex investments around semis in 2025! The hyperscalers all plan to continue making comparable investments through 2030, and this will drive the greatest industrial build-out of all time. If startups capture even 1% of that annual spend – roughly $2.6B a year in revenue – then at typical revenue multiples that could translate into 25 unicorns over the next decade. The stakes are high, but the technology, capital, and supply chain obstacles are vast. Startups attacking this opportunity will need a rare combination of qualities.
In many instances, scaled companies and universities have an edge. Most of the startups mentioned in this piece came from alums of Intel, Broadcom, Micron, or other incumbents. We believe that teams that can combine deep industry experience with raw founder ferocity (see the Etched team as an example) will fare the best.
We would be remiss not to mention that Gavin Baker’s recent Invest Like The Best podcast was fundamental to our understanding of this market, and gave us extra ammo to dive into the deep end here. We also want to thank all the folks whose ideas helped us hone our thesis, including Amir Salek, Bill Leszinske, Matej Kajinic, Mason Hale, and many others.
As always, any thoughts and questions are appreciated. We’re excited to continue the conversation on this category, and are eager to meet the best founders building in this space. If you know someone working on the next great startup, please share this newsletter and send an intro.
Until next time,
B&T&G