A technical article from October on how xAI scaled up to the largest Nvidia AI cluster in the world:
First, the cluster is liquid-cooled (not water, btw) and is built with Supermicro's liquid cooling racks:
Here, xAI is using the Supermicro 4U Universal GPU system. These are the most advanced AI servers on the market right now, for a few reasons. One is the degree of liquid cooling. The other is how serviceable they are. … One example of this is how the system is on trays that are serviceable without removing systems from the rack. The 1U rack manifold carries cool liquid in and warmed liquid out for each system. Quick disconnects make it fast to get the liquid cooling out of the way, and we showed last year how these can be removed and installed one-handed. Once these are removed, the trays can be pulled out for service.
There are a few photos of this, too.
Other AI servers in the industry are built as air-cooled designs first, with liquid cooling added afterward. Supermicro's design is liquid-cooled from the ground up, and all from one vendor.
Even for storage, xAI chose Supermicro:
In AI clusters, you generally see large storage arrays. Here, storage software from several different vendors was running, but almost every storage server we saw was Supermicro as well. That should not be a surprise: Supermicro is the OEM for many storage vendors.
And then for networking:
Here, xAI is using NVIDIA BlueField-3 SuperNICs and Spectrum-X networking. NVIDIA has some special sauce in its network stack that helps ensure the right data gets to the right place, navigating around bottlenecks in the cluster.
That is a big deal. Many supercomputer networks use InfiniBand or other technologies, but this is Ethernet. Ethernet means it can scale. Everyone reading this on STH will have the page delivered over an Ethernet network at some point; Ethernet is the backbone of the Internet. As a result, it is an immensely scalable technology. These enormous AI clusters are scaling past the point that some of the more exotic technologies have ever reached. This is a really bold move by the xAI team.
So, even though xAI uses Ethernet instead of InfiniBand, they still chose Nvidia's networking hardware and firmware.
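To make that "special sauce" a bit more concrete, here is a toy sketch of congestion-aware path selection, the general idea behind adaptive routing in a fabric like Spectrum-X. This is not NVIDIA's actual algorithm; the path names and link loads below are made up for illustration.

```python
# Toy illustration of congestion-aware path selection (NOT NVIDIA's
# actual algorithm). Hypothetical equal-cost paths between two GPU
# servers, with the current utilization of each hop on the path
# (0.0 = idle, 1.0 = saturated).
paths = {
    "spine-1": [0.20, 0.85],   # second hop is a bottleneck
    "spine-2": [0.30, 0.25],
    "spine-3": [0.60, 0.55],
}

def pick_path(paths: dict[str, list[float]]) -> str:
    # A path is only as fast as its busiest link, so route around the
    # worst hop: pick the path whose maximum utilization is lowest.
    return min(paths, key=lambda p: max(paths[p]))

print(pick_path(paths))  # -> "spine-2"
```

A static shortest-path scheme would treat all three paths as equal; steering flows by live congestion is what keeps a bottlenecked link from stalling the whole job.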
If your computer uses an Ethernet cable, that is the same base technology as the networking here, except that this is 400GbE, or 400 times faster per optical connection than the common 1GbE networking we see elsewhere. There are also nine of these links per system, which means we have about 3.6Tbps of bandwidth per GPU compute server.
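The math checks out; a quick back-of-the-envelope in Python, using the figures from the article:

```python
# Bandwidth per GPU compute server, per the figures in the STH article.
links_per_server = 9        # nine optical links per system
link_speed_gbps = 400       # each link is 400GbE

total_gbps = links_per_server * link_speed_gbps
print(f"{total_gbps} Gbps = {total_gbps / 1000} Tbps per server")
# -> 3600 Gbps = 3.6 Tbps, with each link 400x a common 1GbE connection
```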
Of course, power is an important aspect of building a large data center. And here Tesla Megapacks come into play:
Outside of the facility, we saw containers with Tesla Megapacks. This is one of the really neat learning points the teams had building this giant cluster. AI servers do not run at 100% of rated power consumption 24×7. Instead, they have many peaks and valleys in power consumption. With so many GPUs on site, the power draw fluctuates as work moves to the GPUs, results are collated, and new jobs are dispatched. The team found that the millisecond-scale spikes and drops in power were stressful enough that putting Tesla Megapacks in the middle, as a buffer, made the entire installation more reliable.
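As a toy illustration of the buffering idea (not Tesla's actual control logic, and with made-up numbers): the batteries absorb the difference between a spiky GPU load and a smooth grid draw, so the utility only ever sees the average.

```python
import statistics

# Hypothetical GPU-cluster draw per interval, in MW: spiky, as training
# workloads ramp up and down. These numbers are illustrative only.
gpu_load_mw = [90, 150, 60, 140, 70, 155, 65, 145]

# Hold the grid connection at the average load; the battery makes up the
# difference, charging in the valleys and discharging on the spikes.
grid_draw_mw = statistics.mean(gpu_load_mw)

for load in gpu_load_mw:
    battery_mw = load - grid_draw_mw  # + = discharging, - = charging
    print(f"load={load:>3} MW  grid={grid_draw_mw:.1f} MW  battery={battery_mw:+.1f} MW")
```

The grid sees a constant ~109 MW instead of swings between 60 and 155, which is exactly the stress the team was trying to remove.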
If you saw Brad Gerstner's interview with Nvidia CEO Jensen Huang (I posted it a few weeks ago), this was the cluster he was talking about. What he didn't say, but what Gavin Baker recently said on another podcast, is that many in the industry thought a cluster of this size was impossible. Follow that link for an interesting explanation and discussion.
xAI built this cluster to train Grok 3. We'll see whether Gavin's prediction that Grok 3 will outperform everything else comes true.
Anyway, this is a lot to digest. While the fallout from SMCI's accounting issues has hurt the stock, it is clear that their products are world class. What that means moving forward is anyone's guess. The easy takeaway is that companies such as xAI, Meta, etc., want more AI compute power, and so Nvidia is still a lock there.
The question of whether Nvidia is cyclical is interesting. Using the obvious internet cycle as an analogy, the money moved from picks and shovels (Cisco and Juniper) to infrastructure (Yahoo, AOL) to applications (Google search, Amazon eCommerce, etc.) to new uses like cloud computing, and then to new products/services supporting those apps, like security, databases, etc. How this plays out for AI is still a bit uncertain, but we've seen APP and PLTR take off (UPST was an early player but fizzled; maybe too soon?), and it looks like Meta, xAI, and Tesla may be the next wave.
What this means for Nvidia is that while it will continue to sell tons of AI chips, the money is going to move downstream (upstream? I can never figure that out). Huang knows this and has already started AI-as-a-service, software, robotics, and other downstream AI plays within Nvidia. But pivoting a $100B-a-year business to new revenue streams isn't easy. Amazon got lucky that AWS came naturally out of its eCommerce support efforts (scale up for the Christmas rush, sell the unused compute the other 11 months of the year), but in general it's hard for companies to pivot into all-new businesses that are large enough to matter. Can Nvidia do it? I give it about a year, maybe two, to figure it out.