Memory Bandwidth is the real future challenge

“When we look at performance today on our machines, the data movement is the thing that’s the killer,” Dongarra explained. “We’re looking at the floating point execution rate divided by the data movement rate, and we’re looking at different processors. In the old days, we had processors that basically had a match of one flop per one data movement – that’s how they were balanced. And if you guys remember the old Cray-1s, you could do two floating point operations and three data movements all simultaneously. So this is trying to get a handle on that. But over time, the processors have changed the balance. What has happened over the course of the next twenty years from that starting point is that an order of magnitude was lost. That is, we can now do ten floating point operations for every data movement that we make. And more recently, we’ve seen that number grow to 100 floating point operations for every data movement. And even some machines today are in the 200 range. That says there’s a tremendous imbalance between the floating point and data movement. So we have tremendous floating point capability – we are overprovisioned for floating point – but we don’t have the mechanism for moving data very effectively around in our system.”
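The ratio Dongarra is describing – peak floating point rate divided by peak data movement rate – can be sketched in a few lines. The function below is our own illustration, not his methodology, and the per-cycle rates are taken straight from the ratios in the quote rather than from any vendor datasheet:

```python
# Sketch: machine balance, i.e. flops retired per word of memory traffic.
# The input rates below are illustrative, taken from the ratios Dongarra
# cites in the quote, not from vendor specifications.

def balance(flops_per_cycle: float, words_per_cycle: float) -> float:
    """Flops the machine can execute per word moved to or from memory."""
    return flops_per_cycle / words_per_cycle

# Cray-1 era: two floating point operations and three data movements
# per cycle, simultaneously -- more bandwidth than compute.
print(balance(2, 3))    # ~0.67 flops per word

# A modern accelerator in the "200 flops per data movement" regime
# that Dongarra mentions -- heavily overprovisioned for floating point.
print(balance(200, 1))
```

The larger this number, the more arithmetic a kernel must perform per loaded operand to keep the floating point units busy, which is exactly the imbalance the quote is complaining about.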

The chart shows how, generation after generation, it has gotten worse and worse. And moving to HBM2E and even HBM3, HBM4, and HBM5 memory is only a start, we think. And CXL memory can only partially address the issue. Inasmuch as CXL memory is faster than flash, we love it as a tool for system architects. But there are only so many PCI-Express lanes in the system to do CXL memory capacity and memory bandwidth expansion inside of a node. And while shared memory is interesting and possibly quite useful for HPC simulation and modeling as well as AI training workloads – again, because it will be higher performing than flash storage – that doesn’t mean any of this will be affordable.
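A quick back-of-envelope calculation shows why the PCI-Express lane budget caps what CXL can do for bandwidth. The per-lane and per-stack figures below are nominal, rounded numbers (PCI-Express 5.0 runs at 32 GT/s, roughly 4 GB/s per lane per direction; one HBM2E stack delivers on the order of 460 GB/s), and real links lose a bit more to protocol overhead:

```python
# Back-of-envelope: aggregate CXL bandwidth over PCIe 5.0 lanes versus
# a single HBM2E stack. Nominal, rounded figures; real links lose a few
# percent to encoding and protocol overhead.

PCIE5_GBPS_PER_LANE = 4.0   # PCIe 5.0: 32 GT/s -> ~4 GB/s per lane, per direction
HBM2E_STACK_GBPS = 460.0    # one HBM2E stack, nominal

def cxl_bandwidth(lanes: int) -> float:
    """Aggregate CXL bandwidth (GB/s) across the given PCIe 5.0 lanes."""
    return lanes * PCIE5_GBPS_PER_LANE

# Even a generous 64 lanes dedicated to CXL memory expansion...
print(cxl_bandwidth(64))                      # 256.0 GB/s
# ...is only a bit more than half of one HBM2E stack:
print(cxl_bandwidth(64) / HBM2E_STACK_GBPS)
```

So CXL is a fine capacity play and a useful middle tier above flash, but no realistic lane count closes the bandwidth gap to stacked memory on the package.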

Having won the Turing Award gives Dongarra a chance to lecture the industry a bit, and once encouraged to do so, he thankfully did. And we quote him at length because when Dongarra speaks, people should listen.

“I have harped on the imbalance of the machines,” Dongarra said. “So today, we build our machines based on commodity off the shelf processors from AMD or Intel, commodity off the shelf accelerators, commodity off the shelf interconnects – those are commodity stuff. We’re not designing our hardware to the specifics of the applications that are going to be used to drive them. So perhaps we should step back and have a closer look at how the architecture should interact with the applications, with the software – co-design, something we talk about, but the reality is very little co-design takes place today with our hardware. And you can see from those numbers, there’s very little that goes on. And perhaps a good – better – indicator is what’s happening in Japan, where they have much closer interactions with the architects, with the hardware people, to design machines that have a better balance. So if I was going to look at forward-looking research projects, I would say maybe we should spin up projects that look at architecture and have the architecture better reflected in the applications. But I would say that we should have a better balance between the hardware and the applications and the software – really engage in co-design. Have spin-off projects, which look at hardware. You know, in the old days, when I was going to school, we had universities that were developing architectures, that would put together machines. Illinois was a good example of that – Stanford, MIT, CMU. Other places spun up and had hardware projects that were investigating architectures. We don’t see that as much today. Maybe we should think about investing there, putting some research money – perhaps from the Department of Energy – into that mechanism for doing that kind of work.”


This ties in nicely with the previous article you posted about Samsung embedding a vector engine inside their HBM. Many creative (and expensive) solutions are cropping up. There are several INTC customers that are buying Xeon Sapphire Rapids with HBM to help solve this problem. I suspect Genoa-X will also be very popular with the bandwidth-starved crowd.


Yes, indeed it does tie in with the Samsung article. It also brings to mind HP’s “The Machine” that was to be based on memristor technology.

> Speaking at Discover 2015, Sarah Anthony, systems research project manager at HP, addressed the Machine’s flattened memory architecture as she pointed to the mechanical mockup. “Here in this one node volume, we have terabytes of memory and we have hundreds of gigabits per second of bandwidth off the node, and that’s really important because we’ve changed what I/O is. It’s not I/O, it’s a memory pipe,” she said.
>
> “It’s going to provide a great foundation for ultra-scale analytics, but it has a significant impact on the system software. If you think about it, the essential characteristics of the Machine are that you have this massive capacity in terms of memory, tremendous bandwidth and very low latency. This is going to cause us to make modifications in the operating system and the software system on top of that,” continued Rich Friedrich, director of Systems Software for the Machine at HP.

Maybe it will resurface. Here was a prototype that didn’t get very far:

It’s not close to what the company envisioned with The Machine when it was first announced in 2014 but follows the same principle of pushing computing into memory subsystems. The system breaks the limitations tied to conventional PC and server architecture in which memory is a bottleneck.


The standout feature in the mega server is the 160TB of memory capacity. No single server today can boast that memory capacity. It has more than three times the memory capacity of HPE’s Superdome X.

The Machine runs 1,280 Cavium ARM CPU cores. The memory and 40 32-core ARM chips – broken up into four Apollo 6000 enclosures – are linked via a super-fast fabric interconnect. The interconnect is like a data superhighway into which multiple co-processors can be plugged.

The connections are designed in a mesh network so memory and processor nodes can easily communicate with each other. FPGAs provide the controller logic for the interconnect fabric.

Computers will deal with huge amounts of information in the future and The Machine will be prepared for that influx, Bresniker said.

In a way, The Machine prepares computers for when Moore’s Law runs out of steam, he said. It’s becoming tougher to cram more transistors and features into chips, and The Machine is a distributed system that breaks up processing among multiple resources.

The Machine is also ready for futuristic technologies. Slots in The Machine allow the addition of photonics connectors, which will connect to the new fabric linking up storage, memory, and processors. The interconnect itself is an early implementation of the Gen-Z interconnect, which is backed by major hardware, chip, storage, and memory makers.
