Aurora HPC still coming up short vs Frontier

How long does it usually take to deploy a system like this?

The Intel-powered Aurora supercomputer was widely expected to take the top spot from the AMD-powered Frontier, the #1 supercomputer on the Top500 list, but it took second place instead. However, Aurora did take the top spot in the AI-centric HPL-MxP mixed-precision benchmark, allowing Intel to lay claim to powering the fastest AI supercomputer in the world with 10.6 AI Exaflops of performance.

It’s noteworthy that Aurora is still not fully operational, so the entire system wasn’t used for any of the benchmark submissions. Aurora remains beset by numerous issues, including hardware and cooling system failures, operational errors, and network instability (details in the last section below). The continued issues are a bit surprising: the system was first announced nine years ago, the second revision was announced five years ago (the first version was canceled), and the final components were installed eleven months ago.

The system houses 21,248 CPUs and 63,744 GPUs spread across 10,624 compute blades, but Argonne National Laboratory (ANL), which hosts the system, was again unable to submit a full Linpack run for the Top500 list.
Instead, Aurora placed second with 1.012 Exaflops, breaking the Exaflop barrier with 87% of the system active (9,234 of the full 10,624 nodes). This solidifies Aurora’s second-place position: its first submission, made six months ago with only half the system, also took second place at 585.34 petaflops.
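For anyone who wants to sanity-check the figures quoted above, here is a minimal Python sketch of the arithmetic. The input numbers come straight from the article. The variable names and the per-node average are my own illustration, and the sketch assumes that "blade" and "node" mean the same thing here, as the quoted text implies. The per-node figure is a simple average of the reported score over the active nodes, not a measured value.

```python
# Figures quoted in the article
TOTAL_NODES = 10_624      # compute blades in the full system
ACTIVE_NODES = 9_234      # nodes used for the Linpack submission
TOTAL_CPUS = 21_248
TOTAL_GPUS = 63_744
SCORE_EXAFLOPS = 1.012    # reported Top500 result

# Share of the machine that took part in the run
active_fraction = ACTIVE_NODES / TOTAL_NODES

# Hardware per blade, assuming CPUs and GPUs are spread evenly
cpus_per_blade = TOTAL_CPUS // TOTAL_NODES
gpus_per_blade = TOTAL_GPUS // TOTAL_NODES

# Average Linpack throughput per active node, in teraflops
# (1 exaflop = 1,000,000 teraflops); an average, not a measurement
tflops_per_node = SCORE_EXAFLOPS * 1_000_000 / ACTIVE_NODES

print(f"Active share of system: {active_fraction:.1%}")
print(f"CPUs per blade: {cpus_per_blade}, GPUs per blade: {gpus_per_blade}")
print(f"Average per active node: {tflops_per_node:.1f} TFLOPS")
```

The active share works out to roughly 87%, matching the article, and the CPU and GPU totals divide evenly into 2 CPUs and 6 GPUs per blade.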

Ten long months passed between the final Aurora hardware being installed and when ANL submitted its benchmarks, raising questions about the source of the continued delay in standing up the full machine. We followed up with Intel on the matter.

“[…]Since we completed the physical delivery of the last compute node at the end of June 2023 (only 10 months ago), we have been working hand-in-hand with Argonne National Laboratory and HPE to fully stabilize and tune the system, including the compute nodes, storage system, fabric, power delivery, and cooling.

“We are also actively working on addressing stability issues like hardware failures, software bugs, cooling system malfunctions, issues with power supply, networking infrastructure stability, environmental factors, and operational errors,” the Intel representative said to Tom’s Hardware.
