Sigma2, the steward of Norway’s supercomputing resources, has announced the first EPYC “Turin” (“Zen 5”) supercomputer win:
“The procurement competition to secure Norway’s next supercomputer has now been conducted. Hewlett-Packard Norge AS (HPE) won the competition and has consequently been awarded the contract, which has a value of 225 million NOK [USD ~20M]. This will be Norway’s most powerful supercomputer ever and will give a significant boost to national AI research and innovation.”
“HPE will deliver an HPE Cray Supercomputing EX system equipped with 252 nodes, each with two AMD EPYC Turin CPUs of 128 cores each. In total, the system will have 64,512 CPU cores. In addition, the system will be delivered with 76 GPU nodes, each comprising four NVIDIA Grace Hopper Superchips (NVIDIA GH200 96 GB), for a total of 304 GPUs.”
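The quoted totals are internally consistent; here is a minimal sketch in Python, using only the figures from the announcement, that checks the arithmetic:

```python
# Sanity check of the node/core/GPU totals quoted in the announcement.
cpu_nodes = 252        # CPU-only nodes
cpus_per_node = 2      # dual-socket AMD EPYC "Turin"
cores_per_cpu = 128    # cores per EPYC Turin CPU

gpu_nodes = 76         # accelerated nodes
gpus_per_node = 4      # NVIDIA GH200 Superchips per node

total_cores = cpu_nodes * cpus_per_node * cores_per_cpu
total_gpus = gpu_nodes * gpus_per_node

print(total_cores)  # 64512 -> matches the announced 64,512 CPU cores
print(total_gpus)   # 304   -> matches the announced 304 GPUs
```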
I have been awaiting this announcement with bated breath to see whether Norway’s experience with the part-owned AMD-based LUMI supercomputer in Finland would lead them to select a configuration based on AMD GPUs, and in particular the Instinct MI300A APU with unified memory (for simpler programmability). Alas, they instead went for a traditional CPU configuration for the majority of the nodes, while selecting Nvidia for the accelerated nodes, presumably due to the software porting friction cited in their annual report:
“The two major providers of GPUs are AMD and NVIDIA, with NVIDIA being the largest and offering a more comprehensive development environment and software suite. By the end of 2023, supercomputer Betzy had 16 NVIDIA A100/40GB, and Saga had 32 NVIDIA A100/80GB and 32 NVIDIA P100. The largest proportion of our GPUs are on LUMI (10240 AMD MI250X). However, some users encounter difficulties using LUMI’s AMD GPUs. This is primarily because code written for NVIDIA does not always run without modification, and in some cases does not run at all.”
While perhaps of little consequence, I find this disappointing. It indicates that the programmability benefits of unified memory in AMD’s Instinct MI300A APU are not (yet) persuasive for supercomputing. Instead, Sigma2 has selected separate partitions for traditional and accelerated workloads, and has gone with Nvidia for the accelerated partition, seemingly out of software-ecosystem preference. I had hoped that a positive and promising porting experience with LUMI would lead them to see the proprietary ecosystem as the problem. Sadly not.
Notably, the Nvidia GH200’s 96 GB of HBM is also much less memory than the 192 GB of HBM on AMD’s MI300X. But that didn’t sway them either.