“When we look at performance today on our machines, the data movement is the thing that’s the killer,” Dongarra explained. “We’re looking at the floating point execution rate divided by the data movement rate, and we’re looking at different processors. In the old days, we had processors that basically had a match of one flop per one data movement – that’s how they were balanced. And if you guys remember the old Cray-1s, you could do two floating point operations and three data movements all simultaneously. So this is trying to get a handle on that. But over time, the processors have changed the balance. What has happened over the course of the next twenty years from that starting point is that an order of magnitude was lost. That is, we can now do ten floating point operations for every data movement that we make. And more recently, we’ve seen that number grow to 100 floating point operations for every data movement. And even some machines today are in the 200 range. That says there’s a tremendous imbalance between the floating point and the data movement. So we have tremendous floating point capability – we are overprovisioned for floating point – but we don’t have the mechanism for moving data around very effectively in our systems.”
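To make that arithmetic concrete, here is a minimal sketch of the balance calculation Dongarra describes – peak floating point rate divided by the rate at which 64-bit operands can be moved from memory. The machine figures in it are illustrative assumptions chosen to land near the 1:1, 100:1, and 200:1 ratios he cites, not vendor specifications:

```python
# Back-of-the-envelope machine balance: peak flop rate divided by the
# rate at which 64-bit operands can be moved from memory. All numbers
# below are illustrative assumptions, not vendor specifications.

BYTES_PER_WORD = 8  # one double-precision operand

machines = {
    # name: (peak Gflop/s, peak memory bandwidth in GB/s)
    "balanced 1:1 machine (old days)": (1.0, 8.0),
    "commodity CPU (illustrative)": (2_000.0, 160.0),
    "GPU accelerator (illustrative)": (60_000.0, 2_400.0),
}

for name, (gflops, gbytes_per_s) in machines.items():
    gwords_per_s = gbytes_per_s / BYTES_PER_WORD  # data movements per second
    balance = gflops / gwords_per_s               # flops per data movement
    print(f"{name}: ~{balance:,.0f} flops per data movement")
```

Run it and the three machines print balances of roughly 1, 100, and 200 flops per data movement – the drift Dongarra is charting.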
The chart shows how, generation by generation, it has gotten worse and worse. And moving to HBM2e and even HBM3 or HBM4 and HBM5 memory is only a start, we think. And CXL memory can only partially address the issue. Inasmuch as CXL memory is faster than flash, we love it as a tool for system architects. But there are only so many PCI-Express lanes in a system for CXL memory capacity and memory bandwidth expansion inside of a node. And while shared memory is interesting and possibly quite useful for HPC simulation and modeling as well as AI training workloads – again, because it will be higher performing than flash storage – that doesn’t mean any of this will be affordable.
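As a quick sanity check on the PCI-Express lane argument, here is a back-of-the-envelope sketch using approximate public link ratings; the lane and stack counts are illustrative assumptions of our own choosing, not a description of any shipping system:

```python
# Rough, rounded arithmetic on why CXL-over-PCIe expansion can only
# partially close the bandwidth gap. Link and stack ratings are
# approximate; lane and stack counts are illustrative assumptions.

PCIE5_X16_GBS = 64    # ~64 GB/s per direction for one PCIe 5.0 x16 link
HBM3_STACK_GBS = 800  # ~800 GB/s per HBM3 stack (~6.4 Gb/s/pin x 1,024 pins)

cxl_links = 4    # assume four x16 links carved out for CXL memory
hbm_stacks = 6   # assume a six-stack accelerator package

cxl_bw = cxl_links * PCIE5_X16_GBS    # 256 GB/s of added bandwidth
hbm_bw = hbm_stacks * HBM3_STACK_GBS  # 4,800 GB/s on package

print(f"CXL expansion: ~{cxl_bw} GB/s vs HBM: ~{hbm_bw} GB/s "
      f"({hbm_bw / cxl_bw:.0f}x gap)")
```

Under those assumptions, CXL adds capacity at roughly a nineteenth of the on-package HBM bandwidth – useful, but not a fix for the balance problem.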
…
Having won the Turing Award gives Dongarra a chance to lecture the industry a bit, and once encouraged to do so, he thankfully did. And we quote him at length because when Dongarra speaks, people should listen.
“I have harped on the imbalance of the machines,” Dongarra said. “So today, we build our machines based on commodity off-the-shelf processors from AMD or Intel, commodity off-the-shelf accelerators, commodity off-the-shelf interconnects – those are commodity stuff. We’re not designing our hardware to the specifics of the applications that are going to be used to drive them. So perhaps we should step back and have a closer look at how the architecture should interact with the applications, with the software – co-design, something we talk about, but the reality is very little co-design takes place today with our hardware. And you can see from those numbers, there’s very little that goes on. And perhaps a good – better – indicator is what’s happening in Japan, where they have much closer interactions with the architects, with the hardware people, to design machines that have a better balance. So if I was going to look at forward-looking research projects, I would say maybe we should spin up projects that look at architecture and have the architecture better reflected in the applications. But I would say that we should have a better balance between the hardware and the applications and the software – really engage in co-design. Have spin-off projects that look at hardware. You know, in the old days, when I was going to school, we had universities that were developing architectures, that would put together machines. Illinois was a good example of that – Stanford, MIT, CMU. Other places spun up and had hardware projects that were investigating architectures. We don’t see that as much today. Maybe we should think about investing there, putting some research money – perhaps from the Department of Energy – into that mechanism for doing that kind of work.”