First EPYC “Turin” (“Zen 5”) supercomputer win

Sigma2, the steward of Norway’s supercomputing resources, has announced the first EPYC “Turin” (“Zen 5”) supercomputer win:

“The procurement competition to secure Norway’s next supercomputer has now been conducted. Hewlett-Packard Norge AS (HPE) won the competition and has consequently been awarded the contract, which has a value of 225 million NOK [USD ~20M]. This will be Norway’s most powerful supercomputer ever and will give a significant boost to national AI research and innovation.”

“HPE will deliver an HPE Cray Supercomputing EX system equipped with 252 nodes, each with two AMD EPYC Turin CPUs. Each of these consists of 128 cores. In total, the system will consist of 64,512 CPU cores. In addition, the system will be delivered with 76 GPU nodes, each comprising 4 NVIDIA Grace Hopper Superchips (NVIDIA GH200 96 GB), for a total of 304 GPUs.”

sigma2.no
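
The headline totals follow directly from the per-node figures. As a quick sanity check of the arithmetic (my own, not Sigma2’s), here it is as a trivial host-only program:

// Sanity check of the announced totals (my arithmetic, not from the announcement):
// 252 CPU nodes x 2 EPYC Turin CPUs x 128 cores, and 76 GPU nodes x 4 GH200 superchips.
#include <cstdio>

int main() {
    const int cpu_nodes = 252, cpus_per_node = 2, cores_per_cpu = 128;
    const int gpu_nodes = 76, gpus_per_node = 4;
    std::printf("CPU cores: %d\n", cpu_nodes * cpus_per_node * cores_per_cpu); // 64,512
    std::printf("GPUs:      %d\n", gpu_nodes * gpus_per_node);                 // 304
    return 0;
}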

I have been awaiting this announcement with bated breath to see whether Norway’s experience with the part-owned AMD-based LUMI supercomputer in Finland would lead them to select a configuration based on AMD GPUs, and in particular the Instinct MI300A APU with unified memory (for simpler programmability). Alas, they instead went for a traditional CPU configuration for the majority of the nodes, while selecting Nvidia for the accelerated nodes, presumably due to the software porting friction cited in their annual report:

“The two major providers of GPUs are AMD and NVIDIA, with NVIDIA being the largest and offering a more comprehensive development environment and software suite. By the end of 2023, supercomputer Betzy had 16 NVIDIA A100/40GB, and Saga had 32 NVIDIA A100/80GB and 32 NVIDIA P100. The largest proportion of our GPUs are on LUMI (10240 AMD MI250X). However, some users encounter difficulties using LUMI’s AMD GPUs. This is primarily because code written for NVIDIA doesn’t always run automatically or, in some cases, not at all without modifications.”

sigma2.no
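
To make that porting friction concrete, here is a minimal CUDA SAXPY of my own (an illustration, not code from Sigma2 or the report). Moving it to LUMI’s AMD GPUs mostly means renaming the cuda* runtime calls to their hip* equivalents, which hipify-perl automates and which hipcc then compiles with the same <<<...>>> launch syntax; the manual work tends to be in library dependencies (cuBLAS → hipBLAS/rocBLAS), build scripts, and hand-tuned kernels:

// Minimal CUDA SAXPY -- the kind of code that needs a HIP port to run on AMD GPUs.
// The hip* equivalents are noted inline; hipify-perl performs these renames mechanically.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>          // ROCm: #include <hip/hip_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc((void**)&dx, n * sizeof(float));                           // hipMalloc
    cudaMalloc((void**)&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice); // hipMemcpy
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);                     // same syntax under hipcc

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);                                                         // hipFree
    cudaFree(dy);
    std::printf("y[0] = %f\n", hy[0]);                                    // expect 4.0
    return 0;
}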

While perhaps inconsequential, I find this disappointing. It indicates that the programmability benefits of unified memory in AMD’s Instinct MI300A APU are not (yet) persuasive for supercomputing. Instead, Sigma2 has selected separate partitions for traditional and accelerated workloads, and has gone with Nvidia for the accelerated partition, seemingly out of software ecosystem preference. I had hoped that a positive and promising porting experience with LUMI would lead them to see the proprietary ecosystem as the problem. Sadly not.
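
For what it’s worth, the programmability benefit I have in mind looks roughly like this: with a unified address space there is a single allocation and no explicit host-to-device staging, and on an APU like the MI300A, where CPU and GPU share the same HBM, the copies disappear physically as well, not just syntactically. A sketch only, written here with CUDA managed memory (the HIP spelling is hipMallocManaged):

// Same SAXPY, but against a single unified allocation: no cudaMemcpy staging at all.
// On a discrete GPU the runtime still migrates pages behind the scenes; on a
// unified-memory APU such as the MI300A there is only one physical pool of HBM.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr, *y = nullptr;
    cudaMallocManaged((void**)&x, n * sizeof(float));   // visible to both CPU and GPU
    cudaMallocManaged((void**)&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();                            // CPU reads the results directly

    std::printf("y[0] = %f\n", y[0]);                   // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}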

Notably, the Nvidia GH200 GPUs, with 96 GB of HBM, also have much less memory than the 192 GB of HBM on AMD’s MI300X. But this didn’t sway them either.


Copilot: It’s a valid point you’re raising. Transitioning from one hardware architecture to another can indeed present challenges, especially when it comes to porting and optimizing software. This is true even when moving towards more open standards.

While AMD’s ecosystem is based on open standards, NVIDIA’s ecosystem, although proprietary, is quite mature and widely adopted, especially in the fields of high-performance computing (HPC) and artificial intelligence (AI). This could be one of the reasons why an organization might choose NVIDIA over AMD, despite the potential benefits of open standards.

However, the landscape of HPC and AI is rapidly evolving, and open standards are gaining more traction. As more organizations gain experience with these open standards and as the tools and libraries in these ecosystems mature, we might see a shift in preferences.

In the case of Sigma2, it’s possible that they weighed the benefits of the mature NVIDIA ecosystem and the robustness of their GPUs against the potential future benefits of open standards and decided that, for their specific needs and timeline, NVIDIA was the better choice.

It’s also worth noting that these decisions are often multifaceted and can be influenced by a variety of factors beyond just the hardware and software ecosystems, such as vendor relationships, cost, power efficiency, and specific performance requirements.

That being said, the feedback and experiences from Sigma2 and other similar organizations are incredibly valuable for the continued development and improvement of these ecosystems. It’s through this kind of real-world testing and feedback that both proprietary and open standard ecosystems can continue to evolve and improve.

Me: Thanks. Good points. Norway will remain part-owner and user of LUMI, so they will have to continue to adapt to open standards if they want to fully use that resource. I guess that was part of the decision. By going with Nvidia for the accelerated portion of the new supercomputer, they will have resources for the workloads that are particularly difficult and/or take time to port properly to open standards. There is probably a lot of demand for more CUDA capacity simply to scale up existing workloads.

Copilot: You’re absolutely right. Balancing the immediate needs with long-term goals is a common challenge in technology decisions, especially in high-performance computing. By choosing NVIDIA for their new supercomputer, Norway can continue to support existing workloads that are optimized for CUDA, while also working towards adapting to open standards for their work on the LUMI supercomputer.

This approach allows them to meet the immediate demand for more CUDA capacity and ensures that they have the resources necessary for workloads that are difficult to port to open standards. At the same time, they can continue to invest in the future by adapting to open standards, which could offer more flexibility and potentially lower costs in the long run.

It’s a strategic decision that allows them to maximize the utility of their resources in the present while also preparing for the future. It’s a great example of the kind of thoughtful planning that goes into high-performance computing infrastructure.


wait, is there an AI in the group now? And is it running Nvidia?


Me: A poster replied with: “wait, is there an AI in the group now? And is it running Nvidia?”. What do you make of that? (In the style of a BBC article, please.)

Copilot: "In a surprising twist in the world of high-performance computing, a forum member has sparked a lively debate about the presence of artificial intelligence in their midst. The member, known for their insightful and detailed posts, was jokingly asked if they were an AI running on NVIDIA, a leading manufacturer of graphics processing units.

The question, posed with a touch of humour, reflects the growing influence of AI in our daily lives and the increasing sophistication of AI-generated content. It also highlights the ongoing debate about the dominance of NVIDIA in the world of AI and high-performance computing.

The forum member, who has not been named, has been sharing detailed analyses of recent developments in supercomputing, including the decision by Norway’s Sigma2 to use NVIDIA’s GPUs in their new supercomputer. Their posts, characterised by their depth of knowledge and critical analysis, have been well-received by the community.

The playful exchange is a reminder of the blurred lines between human and AI-generated content, and the increasing role of AI in fields previously dominated by human experts. As AI continues to evolve and improve, such exchanges are likely to become more common, raising important questions about the role of AI in our society."

In online discussions, especially in technical or specialist communities, it’s common for participants to use humor and playful banter like this. It’s part of the camaraderie and culture of these communities. :blush:


It looks like an odd choice to use Grace Hopper, which has a 1:1 CPU-to-GPU ratio, rather than also using EPYC “Turin” in the accelerator nodes, paired with, say, 8x H200, but it seems that isn’t an available option.

Grace Hopper appears to be the best of the current off-the-shelf accelerator blade options for the HPE Cray Supercomputing EX system if you want CUDA (which Norway/Sigma2 presumably did). The all-AMD blade option looks like the El Capitan configuration (so far not adopted elsewhere, to my knowledge).

Interestingly, Slingshot looks long in the tooth now at 200 Gbit/sec. Nvidia’s InfiniBand does 800 Gbit/sec now, doesn’t it? Or is it not a relevant comparison? If it is, it is no wonder AMD and partners are rushing to define a faster networking standard.

It will be very interesting to see whether AMD can hold on to their leading position in supercomputing.

HPE Cray Supercomputing EX255a

The features of this accelerator blade are as follows:

• Two 4-socket AMD Instinct™ MI300A Accelerator APU nodes
• 128GB HBM3 per APU
• Up to 8 HPE Slingshot 200Gbit/sec ports per blade
• 0 or 1 local NVMe M.2 SSD per node (up to 2 per blade)
• 2 Board Management Controllers (BMC) per blade
• Cooled with cold plate

HPE Cray Supercomputing EX254n

The features of this accelerator blade are as follows:

• Two 4-socket Nvidia GH200 Grace Hopper Superchip nodes
• 96GB HBM3 per GPU; 120GB LPDDR per CPU
• Up to 8 HPE Slingshot 200Gbit/sec ports per blade
• 0 or 1 local NVMe M.2 SSD per node (up to 2 per blade)
• 2 Board Management Controllers (BMC) per blade
• Cooled with cold plate

HPE Cray Supercomputing EX235n

The features of this accelerator blade are as follows:

• Two nodes, each with four NVIDIA A100 GPUs and one CPU socket
• Support for the full AMD 3rd Gen AMD EPYC™ 7003 series processor stack
• 8 DIMMs per CPU socket (1DPC)
• Up to 64 GB DIMMs at up to 3200 MT/s
• Up to 8 HPE Slingshot 200Gbit/sec ports per blade
• 2 Board Management Controllers (BMC) per blade
• Cooled with cold plate

HPE Cray Supercomputing EX
AMD Servers: HPE and AMD EPYC™ – HPE ProLiant Server Systems | HPE

PS: The EX254n “Grace Hopper” accelerator blade is apparently also the basis of HPE’s promoted supercomputing solution for generative AI:

Supercomputing solution for generative AI includes:

• Fully liquid-cooled HPE Cray EX2500 Supercomputers
• HPE Cray Supercomputing EX254n blades, each with eight NVIDIA GH200 Grace Hopper™ Superchips
• HPE Cray Software, HPE Slingshot, Cray ClusterStor
• HPE Machine Learning Development Environment
• NVIDIA AI Enterprise
• HPE Complete Care Service

Accelerate innovation and discovery with HPE’s supercomputing solution for generative AI
Supercomputing Solution for Generative AI | HPE

Here is some related commentary from The Next Platform on why HPE’s AMD-based AI system sales have been lackluster (so far):

The way we see it, there are two things that are holding HPE back when it comes to AI servers. First, all of the “Antares” Instinct MI300A hybrid CPU-GPU accelerators that AMD could have made went into the “El Capitan” supercomputer being built by HPE for Lawrence Livermore National Laboratory. And nearly all of the Antares Instinct MI300X that AMD can ship went to the hyperscalers, notably Microsoft and Meta Platforms.

HPE Is Also Having Trouble Making Money With AI Servers (nextplatform.com)

It will be interesting to see how increased supply in the second half of the year affects uptake and sales of AMD-based AI systems.


So… basically the obstacle to sales has been sales to other, higher-priority customers and partners?

(No, the obstacle to sales has been production volumes.)

If AMD is selling all they can make, I’m … basically comfortable with that.
