Nvidia Blackwell chips overheating

I am curious about the board’s take on today’s stories that Nvidia’s Blackwell NVL72 is having overheating issues.

According to insiders familiar with the situation who spoke with The Information, Nvidia’s Blackwell GPUs for AI and HPC overheat when used in servers with 72 processors inside. These machines are expected to consume up to 120kW per rack. These problems have caused Nvidia to reevaluate the design of its server racks multiple times, as overheating limits GPU performance and risks damaging components. Customers reportedly worry that these setbacks may hinder their timeline for deploying new processors in their data centers.

I expect that Nvidia will be able to resolve the heating issues eventually, but this does seem concerning for future chip iterations. Nvidia billed Blackwell as up to a 30x speedup over Hopper with up to 25x better energy efficiency. I am not sure we will see that level of efficiency gains going forward if they are already running into some of the fundamental constraints of AI server design.

On a side note, this could be beneficial for Astera Labs which is able to diagnose these types of heating issues and bottlenecks.

From the article:

While these adjustments are standard for such large-scale tech releases, they have nonetheless added to the delay, further pushing back expected shipping dates.

In response to the delays and overheating issues, an Nvidia spokesperson reminded Reuters about the collaborative efforts with cloud providers and described the design changes as part of the normal development process. This partnership with cloud providers and suppliers aims to ensure the final product meets performance and reliability expectations as Nvidia continues to work on resolving these technical challenges.

Even if this were a significant delay, all it would mean is that customers needing something that works today will simply buy more Nvidia Hoppers.

You sure about that? Astera Labs can diagnose communication issues (either in their retimers or between servers in their interconnects), but I don’t know how/if that applies to overheating on chip boards. I would assume there are temperature sensors in the boards that Nvidia is making.

I believe a delay in getting the NVL72 out will hurt the company from a competitive standpoint. Nvidia has had a year-plus lead in innovation in the field, and missing a big release allows competitors to gain ground. For example, AMD’s MI355X is supposed to come out next year with a 35x increase in inference performance over their previous model. That said, Nvidia still has a ton of breathing room to press their competitive advantage, as the competition seems to be a full year behind.

Along those lines, if the NVL72 is having trouble getting up and running, sales may skew more toward individual Blackwells, which would lead to more custom configurations from hyperscalers. While the COSMOS system doesn’t directly optimize thermal output, Astera’s other diagnostics and telemetry can help build a system properly.

Nvidia uses their own proprietary system to troubleshoot system design issues, but now I am not as confident it is catching issues soon enough. It sounds like they have made multiple passes at figuring out the problem but are still having heating issues on the NVL72 specifically.

The Dell version doesn’t appear to have issues:

PS: Dell beat SuperMicro to first ship.

These types of issues are really not chip-level issues. They are system issues. The chips are producing more heat than modeled, so the system guys will need to adjust. There are multiple ways to do that, including a slight reduction in clock frequency, or operating the chip at the lower end of its voltage rating (regulating the voltage more tightly), etc.

This may require short-term adjustments (above) and, longer term, design changes to the racks or boxes that are so close to the hairy edge that they are seeing the issue. It’s possible (but unlikely) that NVDA would tweak something at the chip level. But it will not result in customers being unable to use the chips, a mass return of chips, or the other things I’ve read about.

Even with slight moderation of performance to contain a heat issue at the system level, these will still vastly outperform anything else available by many times over.
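
To put rough numbers on the clock and voltage knobs mentioned above, here is a back-of-the-envelope sketch. It assumes the textbook approximation that dynamic CMOS power scales with V^2 * f and ignores leakage and everything else in the rack. The function name and the 5% figures are my own illustration, not anything Nvidia has published.

```python
def relative_dynamic_power(freq_scale: float, volt_scale: float) -> float:
    """Power relative to baseline, assuming dynamic power ~ C * V^2 * f.

    Both arguments are fractions of the baseline value (1.0 = unchanged).
    Leakage/static power is ignored, so treat the result as a rough guide.
    """
    return volt_scale ** 2 * freq_scale


# Hypothetical adjustments, chosen only for illustration.
scenarios = {
    "baseline": (1.00, 1.00),
    "5% lower clock": (0.95, 1.00),
    "5% lower voltage": (1.00, 0.95),
    "5% lower clock and voltage": (0.95, 0.95),
}

for name, (freq_scale, volt_scale) in scenarios.items():
    power = relative_dynamic_power(freq_scale, volt_scale)
    print(f"{name}: {power:.3f}x power at {freq_scale:.2f}x clock")
```

Under this simplified model, a 5% voltage reduction alone trims dynamic power by roughly 10% with no loss of clock speed, which is why tighter voltage regulation is an attractive first adjustment.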
