Liquid cooling in data centers



Random thoughts:

  1. Looks like the “Cooling Distribution Unit” takes up an entire rack
  2. In addition to all the network, power etc. cables already crammed into your racks, you’re also going to need to find room for the tubes with the water. Glad my rack/stack days are behind me! :smiley:
  3. I wonder how the water is cooled. Are there coils on the back dumping heat into the hot aisle?
  4. …all that said, it looks like (…for this solution, anyway) you don’t need a major infrastructure overhaul to support liquid cooling; you just need the space and layout to accommodate the “Cooling Distribution Unit”

Hmm, I got a different impression with respect to infrastructure requirements. Yeah, of course you need the distribution unit, which I’m sure contains the heat transfer components, but you also need the coolant to be distributed to every liquid cooled device on every rack - in short, that’s a lot of plumbing.

This is precisely the infrastructure that was alluded to during the last SMCI conference call and which is currently not in place. Hence, the bulk of their current order book is composed of air-cooled devices, which can’t support the higher-power AI processors.

It did surprise me that the coolant is distilled water. A water leak would be devastating to all involved devices. I thought that a different coolant that would pose less of a hazard to the devices would be used - though I don’t know what that coolant might be.


There are many ways they can do it, and I don’t know exactly how they plan to do it; we may end up seeing several different approaches. SMCI has a fully enclosed server with built-in liquid cooling and pumps, about the size of a large desktop. That doesn’t seem like what they will use in the data centers because it would get too hot.

There is also immersion-style liquid cooling, where the server is submerged in a bath of non-conductive coolant.

I think if we watch NVDA and what they are doing with their DGX buildout, we can see how the majority of them will do it. NVDA gave all the server companies its reference designs for the Blackwell platform, and I think the majority of them will follow NVDA’s design.

Edit: SMCI’s website mentions NVDA’s cooling method: the NVIDIA HGX B200 8-GPU liquid-cooled system.



@brittlerock I’d have to investigate to know for sure, but my guess is you’d need one of these for each row. Thus no overall plumbing impact beyond running tubes from the CDU to the rest of the row.

Another thing that occurs to me is you can’t fight physics: no matter the mechanism with which they are cooled, the GPUs generate the same amount of heat. So doesn’t that mean the CDUs will need to dump a LOT of heat? Presumably out into the hot aisle?

…and doesn’t that mean that the air handlers will have to be upgraded anyway?
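To put rough numbers on that "can’t fight physics" intuition, here’s a back-of-envelope sketch. The rack power and coolant temperature rise are illustrative assumptions on my part, not figures from any vendor:

```python
# Back-of-envelope: coolant flow needed to carry away a rack's heat.
# All numbers below are illustrative assumptions, not vendor specs.

RACK_POWER_W = 100_000   # assume a 100 kW liquid-cooled GPU rack
WATER_CP = 4186          # specific heat of water, J/(kg*K)
DELTA_T = 10             # assumed coolant temperature rise, K

# Conservation of energy: Q = m_dot * c_p * dT  =>  m_dot = Q / (c_p * dT)
m_dot = RACK_POWER_W / (WATER_CP * DELTA_T)   # kg/s, ~= L/s for water
lpm = m_dot * 60                              # litres per minute

# The same 100 kW expressed as an air-conditioning load:
btu_per_hr = RACK_POWER_W * 3.412

print(f"flow needed: {m_dot:.2f} L/s ({lpm:.0f} L/min)")
print(f"heat to reject: {btu_per_hr:,.0f} BTU/hr")
```

Whatever the numbers turn out to be in practice, the point stands: every watt the GPUs draw has to leave the building as heat, whether the carrier is air or water.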


I can’t find anything from NVDA, but since SMCI does work closely with them, I think this might be the way NVDA is implementing it also. It looks like SMCI is trying to get involved in the data center design of the future. Notice the plumbing.


Here is the cooling tower. It looks like a tower for an evaporative cooling system.

Anyway, here is the whole article.

Supermicro Total Rack Scale Liquid Cooling Solutions
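For a sense of scale on what an evaporative tower actually consumes, here’s a hedged back-of-envelope sketch. The 1 MW load is an assumed figure for illustration, not from the article:

```python
# Rough estimate of evaporative-cooling water consumption.
# Assumed, illustrative numbers -- not from the Supermicro article.

IT_LOAD_W = 1_000_000    # assume a 1 MW pod rejecting heat via the tower
H_VAP = 2.26e6           # latent heat of vaporization of water, J/kg

# Evaporative towers reject heat mostly by boiling off water:
# Q = m_evap * h_vap  =>  m_evap = Q / h_vap
m_evap = IT_LOAD_W / H_VAP          # kg/s (~ L/s for water)
litres_per_day = m_evap * 86_400

print(f"evaporation: {m_evap:.2f} L/s (~{litres_per_day:,.0f} L/day)")
```

That works out to tens of cubic meters of water per day per megawatt, which is why siting and water rights come up so often in data center discussions.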



Absolutely correct, the heat must be dissipated in some manner, but that doesn’t mean it must be internal to the server room. It could be pumped to a vent system that pushes it outside, much the same as heat is vented externally by a central air conditioning unit. Otherwise, liquid cooling would not really provide much of a benefit.


I was wrong; this DOES require hooking into the larger water infrastructure.

Plug-And-Play Flexible Hose Kits
“The Supermicro liquid cooling rack solution includes standard 1.25-inch CDU hose kit connections. The hose kit design makes it easy to connect racks directly to data center primary water supply or cooling tower. The Hose Kit ensures seamless integration with existing facility piping …”
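Out of curiosity, a rough sketch of what a 1.25-inch hose can carry thermally. The flow velocity, temperature delta, and treating 1.25 in as the bore are my assumptions, not Supermicro specs:

```python
import math

# What can a 1.25-inch hose carry thermally? Illustrative assumptions:
# treat 1.25 in as the inner diameter and pick a modest flow velocity.

D = 1.25 * 0.0254        # hose diameter, m (assuming 1.25 in is the bore)
VELOCITY = 1.5           # assumed water velocity, m/s
RHO = 998                # water density, kg/m^3
CP = 4186                # water specific heat, J/(kg*K)
DELTA_T = 10             # assumed supply/return temperature delta, K

area = math.pi * (D / 2) ** 2          # cross-section, m^2
vol_flow = area * VELOCITY             # volumetric flow, m^3/s
heat_kw = RHO * vol_flow * CP * DELTA_T / 1000

print(f"flow: {vol_flow * 1000:.2f} L/s -> ~{heat_kw:.0f} kW per hose pair")
```

Under those assumptions a single supply/return pair handles on the order of tens of kilowatts, i.e., roughly one dense rack, which is consistent with one hose kit per rack.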


I was heavily involved with a “green” HPC supercomputer a few years back that was at the cutting edge of liquid cooling for a TOP500-listed supercomputer.

I assume we’re going down this rabbit hole for rack chassis and cooling equipment from Aspen Systems or Supermicro? Or perhaps NVIDIA’s just-announced first foray into liquid cooling in their new high-end DGX GB200 systems?

I want to squash some incorrect assumptions.

  • Yes, there are a lot of tubes. You have to purposely design a DC building around liquid cooling. There are some band-aid solutions (self-contained cooling units in rack doors, rack boxes, or adjacent units), but going forward, this next wave of liquid-cooled supercomputers has to be pre-planned in the buildout. Having water systems (even self-contained ones) traversing above/below/within your multi-million dollar supercomputer hardware requires a lot of care and safety.

  • CDUs do NOT dump heat into hot air aisles. It is typically a closed-loop system that dumps to some type of cooling equipment outside the building, like a set of chillers or a cooling tower. Something in the loop has to transfer the heat off the water to get it back to the input temperature. This will occur outside the building - there is no way you are pushing that heat into the already overtaxed air cooling system within. Smarter DC designers will fully integrate the liquid cooling into the building environment itself to take double advantage in colder months (heating the building from it).

  • You may not need a DC overhaul to support liquid cooling (again, band-aid solutions exist), but for DC designs going forward you will definitely be factoring in those new pipes, closed-loop water flows, chiller equipment, …

  • More promising band-aids that avoid DC building changes are something like Motivair’s Heat Dissipation Unit (HDU), a self-contained liquid cooling system running through a core air-cooled heat exchanger… which can become an entire self-contained closed loop within a rack or standalone next to a rack. This type of system is nowhere near as energy-efficient as a full DC/building-integrated one, but could provide enough savings that DCs use it as a stop-gap to gain extra life from existing buildouts.

  • NVIDIA is not likely to be using Supermicro (a systems manufacturer) for its DGX cooling parts. SMCI is building solutions focused on its enterprise customers (buying raw systems) that are building their own DCs, in order to be a one-stop shop. (SMCI buys from NVDA, not the other way around.) CEO Huang specifically called out Vertiv as a key partner when talking about liquid cooling during IR day at GTC, so I assume they are the one providing the raw parts for the DGX SuperPOD systems.





Immersion cooling is used on an industrial scale by some Bitcoin miners.
Bitcoin mining rigs run much hotter than most data center servers. This picture is from Riot’s Rockdale facility. I’m not saying data centers would do it like this - just showing that something similar has been done on a large scale.


For my part, I’m mostly trying to learn more about liquid cooling since it’s relevant to $NVDA, $SMCI and (…it sounds like, at this point) computing in general.

And: potentially there is useful alternative data to be had?
…news of new datacenters with infrastructure to support liquid cooling, and/or of datacenters getting retrofitted for it, is probably an indicator of current or upcoming (massive) GPU purchases.

Last but not least, I speculate that datacenter optimization is a perma-trend, and it’ll be interesting to see how liquid cooling plays into it.


Someone more knowledgeable than I am can correct me if I’m wrong, but my guess is that retrofitting existing data centers is not a long-term solution. Heat management requirements are taking a quantum leap forward.

New data centers will be designed from the ground up specifically to address the issue. Existing data centers (largely leasehold buildings) will be returned to their owners when the leases expire (or an early exit is negotiated), but will be retrofitted as a stop-gap measure in the meantime, as the availability of appropriate structures in suitable locations is most certainly limited.


I believe there will be AI data centers and data centers for everything else. You really don’t need water cooling just to move data around the internet.



Also from $VRT’s site:

This surprised me too:
“…for extreme densities, rack size typically increases in height and width.”

…meaning: massive GPU deployments drive not only PLUMBING changes, but they can even drive a change in dimensions of the racks used!

Convincing a data center engineer to switch to a different rack size is not trivial; the standard rack dimensions were set clear back in 1922 by AT&T and haven’t changed much (…if at all) since then. Everything in the layout of a data center floor is driven by the standard rack size. So GPUs are literally driving a from-the-floor-up overhaul of how a data center is built.

Probably no coincidence that $VRT’s stock has TRIPLED since its 2021 highs.


Absolutely correct, Andy. I should have been more explicit, as I implied that every data center would upgrade to support liquid cooling. Conventional computing isn’t going away. It will coexist alongside AI-focused data centers for years to come.


At the risk of going completely off topic, data center sites take a very long time to get built due to power requirements. In my area, it takes about three years just to be able to break ground, waiting for the utility company to provide the requisite power infrastructure.

All of these data centers are being built and planned now and will be for many years. Water, in most cases, won’t be a problem.

If the building design needs to be changed to allow for liquid cooling, it will be done. If water is indeed the means to cool, I have no doubt engineers will figure it out pretty quickly. I think power may end up being the limiting factor.



In my area, Google and Facebook both put in data centers in about a year. Switch, a data center REIT that went private, built its own solar field in my area to power its data centers.



$MSFT is potentially/allegedly prepared to invest 100 BILLION dollars in a data center.

“The planned U.S.-based supercomputer, referred to as ‘Stargate,’ will be designed to house millions of specialized server chips to boost OpenAI’s artificial intelligence capabilities, the people added.”

“Executives expect to launch Stargate as soon as 2028 and expand it through 2030. They have already discussed alternative power sources, such as nuclear energy, for the supercomputer, which will likely require at least several gigawatts—enough to power at least a few large data centers today—to operate.”

MILLIONS of GPUs. In a SINGLE datacenter.
This, to me, is NOT evidence that the AI Hype Cycle has peaked.