It is being widely reported that the mean time to failure for the new Frontier exascale supercomputer is only a few hours. The plan is to have it fully operational by January 2023. The first round of fixes was for the HPE Slingshot interconnect, which should benefit all three US exascale projects. The current problems appear when the system is heavily loaded.
Though that sounds like a difficult situation, Frontier has 60 million parts, so it’s not surprising there are some “hiccups,” according to Whitt. Despite these issues and COVID-19-related supply chain delays, Whitt says the company is still on track for the rollout date, when Frontier will begin its actual job of running user programs and not just benchmarks.
Not a good headline for AMD I suppose. Is there any evidence that it is AMD technology in particular that is making the system unstable, or is it just the huge number of parts and the overall complexity?
From the article:
Currently, some of the problems are apparently related to the AMD Instinct GPU accelerators. “The issues span lots of different categories, the GPUs are just one,” said Whitt. He said the trouble is pretty evenly spread out amongst Frontier’s various hardware.
There are probably three categories of issues. The first is infant mortality: some components are going to fail early. To put that more carefully, if you test a million components of the same type, there is going to be a constant (hopefully low) rate of failures. In addition, a few components are going to fail almost immediately. That second group is infant mortality.
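A back-of-the-envelope sketch of both effects, in Python. It assumes the 60 million parts fail independently at the same constant rate and that any single failure takes the system down, which is a deliberate oversimplification. The 3-hour figure is only a stand-in for "a few hours", and the Weibull shape and scale values are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365

def system_mtbf_hours(n_parts, part_mtbf_hours):
    """MTBF of n independent parts in series, each with a constant failure rate.

    Failure rates add, so the system MTBF is the part MTBF divided by n.
    """
    return part_mtbf_hours / n_parts

def required_part_mtbf_hours(n_parts, target_system_mtbf_hours):
    """Per-part MTBF needed to reach a given system MTBF (same assumptions)."""
    return target_system_mtbf_hours * n_parts

def weibull_hazard(t_hours, shape, scale_hours):
    """Instantaneous failure rate of a Weibull lifetime model.

    shape < 1 gives a rate that falls with age (infant mortality);
    shape == 1 gives the constant rate used above.
    """
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)

if __name__ == "__main__":
    n_parts = 60_000_000            # part count quoted in the article
    assumed_system_mtbf = 3.0       # hours; stand-in for "a few hours"
    per_part = required_part_mtbf_hours(n_parts, assumed_system_mtbf)
    print(f"per-part MTBF: {per_part / HOURS_PER_YEAR:,.0f} years")

    # Hypothetical infant-mortality population: hazard drops as parts age.
    for t in (1, 10, 100, 1000):
        print(t, weibull_hazard(t, shape=0.5, scale_hours=1000.0))
```

Under these assumptions, a system MTBF of 3 hours corresponds to each part having an MTBF of roughly 20,000 years, so a few-hour system figure does not by itself mean that any individual component is bad.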
The next problem is emergent behavior. Human beings are an emergent behavior of individual cells. The system is more than the sum of its parts. You want emergent behavior. Otherwise, a million PCs or so would provide the same results. But emergent behavior can be new AI results, or it can be crashes. As an example, all the CPUs and GPUs are drawing power from the same wall socket, or more accurately from the system’s own power substation. The power supplies at each level try to isolate the CPUs and GPUs and provide clean power. Emergent behavior can result in communication channels through the power supplies. Am I stretching? No. My father built the power supplies for the UNIVAC I. Oops! Certain programs caused the power to start fluctuating, and if they were not shut down quickly, fuses blew. The fix was to vary the inductance and capacitance in the filters. There were lots of LC choices that worked; you just needed different choices in each of the power supplies. If I had a YouTube channel, I could show the simple experiment with some string and fishing weights that we got to play with while my father was getting ready to demonstrate to Pres Eckert why the power supplies needed different LC constants.
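A minimal sketch of why different LC constants help, assuming each supply's filter can be treated as a simple LC resonator with resonant frequency 1 / (2π√(LC)). The component values are invented for illustration and are not the UNIVAC I values.

```python
import math

def resonant_frequency_hz(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC circuit: 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

def distinct_frequencies(filters):
    """Sorted set of resonant frequencies (rounded to 0.1 Hz) for (L, C) pairs."""
    return sorted({round(resonant_frequency_hz(l, c), 1) for l, c in filters})

# Hypothetical filter values: (inductance in henries, capacitance in farads).
identical = [(0.010, 100e-6)] * 4
varied = [(0.010, 100e-6), (0.012, 100e-6), (0.010, 150e-6), (0.015, 120e-6)]

if __name__ == "__main__":
    # Identical filters all resonate at one frequency, so a load that
    # fluctuates near that frequency excites every supply at once.
    print("identical:", distinct_frequencies(identical))
    # Varied filters spread the resonances out, so no single load pattern
    # hits all of them together.
    print("varied:   ", distinct_frequencies(varied))
```

This is the same idea as the string-and-fishing-weights experiment: pendulums of equal length swing in sympathy, while pendulums of different lengths do not.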
Finally, there are system-level issues. Since the individual nodes have their own OS, there should be no new OS issues. But the network connecting the nodes can behave in ways that were not expected. If you have a queue of messages to send, that queue can overflow. Obviously, this will have been designed for: a node will stop sending, and may even be shut down at the OS level, to prevent overflow. But once the messages get into the network, there need to be queues there too, and at that point you can’t simply tell the senders to stop sending. Well, you can, and early networks were designed that way with two-way signaling. But that is relatively slow.
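The two-way signaling idea can be sketched as credit-based flow control on a single link. This is a toy model, not a description of how Slingshot works; the `CreditLink` class and its method names are hypothetical.

```python
from collections import deque

class CreditLink:
    """Toy model of one link with a fixed-size receive buffer.

    The sender holds one credit per free buffer slot and may only send
    while it has a credit. The receiver hands a credit back each time it
    drains a message, so the buffer can never overflow.
    """

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # credits currently held by the sender
        self.buffer = deque()         # messages waiting at the receiver

    def try_send(self, msg):
        """Send if a credit is available; return False if the sender must wait."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.buffer.append(msg)
        return True

    def drain_one(self):
        """Receiver consumes one message and returns a credit to the sender."""
        if not self.buffer:
            return None
        msg = self.buffer.popleft()
        self.credits += 1
        return msg

if __name__ == "__main__":
    link = CreditLink(buffer_slots=2)
    for m in ("a", "b", "c"):
        print(m, "sent" if link.try_send(m) else "stalled (no credit)")
    print("drained:", link.drain_one())
    print("c", "sent" if link.try_send("c") else "stalled (no credit)")
```

The cost is visible in the model: a stalled sender cannot proceed until a credit makes the trip back from the receiver, which is why this kind of signaling is relatively slow on a large, heavily loaded network.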
That last category sounds like the Slingshot problem. All in all, these are the normal problems with this class of hardware. There are several issues that can’t be fixed until you know they exist, which is why several months of testing were built into the schedule.