Why the Blackwell delay?

https://www.theinformation.com/articles/nvidias-new-ai-chip-is-delayed-impacting-microsoft-google-meta

Some good-sounding info from The Information.

This should open a window for AMD to pick up some more business, and to be further along with successors to the current MI300 by the time Blackwell finally shows up.

The Blackwell design problem came up in recent weeks, as engineers at TSMC discovered flaws in preparation for mass production, said the two people involved with the Blackwell chip production.

The GB200 chips contain two connected Blackwell GPUs alongside a Grace central processing unit. The problem involved a processor die—a piece of silicon that holds circuits for a chip—that connected the two Blackwell GPUs. The snag decreased the yield, or number of chips TSMC was able to produce for Nvidia. Such problems typically prompt companies to stop production.

As a result, Nvidia has been making adjustments to the design and will have to conduct a new production test run at TSMC before mass production can begin, the people said.

Nvidia told at least one cloud provider that it might consider producing a version of the chip that only contains one Blackwell chip, in an effort to avoid the die issue and ship chips faster, according to someone who spoke with Nvidia about the delay.

TSMC initially planned to start mass production of the Blackwell chips in the third quarter and ship them en masse to Nvidia customers starting in the fourth quarter. The Blackwell chips are now expected to go into mass production in the fourth quarter, with the servers slated for mass shipment in the subsequent quarters if no further issue arises, they said.
…

Still, it is highly unusual to uncover significant design flaws right before mass production. Chip designers typically work with chip makers like TSMC to conduct multiple production test runs and simulations to ensure the viability of the product and a smooth manufacturing process before taking large orders from customers.

It’s also uncommon for TSMC, the world’s largest chipmaker, to halt its production lines and go back to the drawing board with a high-profile product that’s so close to mass production, according to two TSMC employees. TSMC has freed up machine capacity in anticipation of the mass production of GB200s but will have to let its machinery sit idle until the snags are fixed.

The design flaw will also impact the production and delivery of Nvidia’s NVLink server racks because the companies that work on the servers have to wait for a new chip sample before finalizing a server rack design.


I can’t see how this is a one-quarter slip. If it is just a change to the microcode, maybe. It sounds, however, like the kind of problem* you see in a CPU core where all the pieces work correctly and the test threads pass when the chip is lightly loaded, but an unexpected bottleneck shows up when the actual hardware is tested at full speed.

Those of you who have read GEB (Gödel, Escher, Bach) will understand when I say that any problem not found during simulations is most likely emergent behavior. Any patches to fix the problem can be tested in simulations to avoid regressions. But testing that the actual problem is fixed will require hot-lot test chips. So even if the problem is understood today, it will cause a slip of more than one quarter. A slip of a year, or of much more than six months, is unlikely, since Nvidia can afford to put together a dozen separate teams to find and fix the problem. More than one solution found? Decide which one to use at the highest level that won’t get overridden, and skip any intermediate decision levels.

*Notice I am avoiding the words “bug” and “errata” here. The fix(es) will probably appear in errata lists. It may take months to reduce a problem like this to a series of errata.
