Some very slick tricks for pushing processing all the way down into the memory modules to reduce data movement, and thus get more performance for less power. If a software layer can place the right parts of the processing at the right layers, you can probably achieve a lot. More on Samsung’s processing-in-memory chip here: Samsung testing memory with built-in processing for AI-centric servers • The Register, where they had an implementation of this capability with Xilinx AI accelerators. (What ever happened to Xilinx, anyway? hmm?)
Samsung has built a claimed first-of-its-kind supercomputer containing AMD datacenter GPUs affixed with its processing-in-memory chips, which the company said can significantly improve the performance and energy efficiency of training large AI models.
The supercomputer, disclosed Tuesday at an industry event in South Korea, includes 96 AMD Instinct MI100 GPUs, each of which is loaded with a processing-in-memory (PIM) chip, a new kind of memory technology that reduces the amount of data that needs to move between the CPU and DRAM.
Choi Chang-kyu, the head of the AI Research Center at Samsung Electronics Advanced Institute of Technology, reportedly said the cluster was able to train the Text-to-Text Transfer Transformer (T5) language model developed by Google 2.5 times faster while using 2.7 times less power compared to the same cluster configuration that didn’t use the PIM chips.
“It is the only one of its kind in the world,” Choi said.
…
One big reason why the PIM-powered supercomputer has so much horsepower is that each PIM chip uses high-bandwidth memory (HBM), which the industry is increasingly turning to for handling high-performance computing and AI workloads. …
What makes Samsung HBM-PIM chips different from HBM implementations by other companies is that each memory bank on the PIM chip includes a processing unit inside. This, according to the South Korean electronics giant, reduces bottlenecks associated with moving data between the CPU and memory by shifting some of the computation inside the memory itself.
Samsung hopes to spur adoption of its PIM chips in the industry by creating software that will allow organizations to use the tech in an integrated software environment. To do this, it’s relying on SYCL, a royalty-free, cross-architecture programming abstraction layer that happens to underpin Intel’s implementation of C++ for its oneAPI parallel programming model.
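To make the data-movement argument concrete, here is a toy Python model of the idea described above, where each memory bank has its own processing unit. It is only a sketch under stated assumptions: the `Bank` class, the byte sizes, and both functions are hypothetical names invented for illustration, and none of this reflects Samsung's actual hardware interface or its SYCL-based software stack. It simply counts how many bytes would cross the memory bus when a sum is computed on the host versus inside each bank.

```python
from dataclasses import dataclass

# Assumptions for illustration only (not taken from the article):
BYTES_PER_VALUE = 2   # e.g. 16-bit values stored in memory
BYTES_PER_RESULT = 4  # size of one partial sum sent back to the host


@dataclass
class Bank:
    """One hypothetical memory bank holding a slice of a larger vector."""
    values: list


def host_side_sum(banks):
    """Conventional path: every stored value crosses the bus to the host."""
    total = 0
    bytes_moved = 0
    for bank in banks:
        bytes_moved += len(bank.values) * BYTES_PER_VALUE
        total += sum(bank.values)  # reduction happens on the host
    return total, bytes_moved


def in_bank_sum(banks):
    """PIM-style path: each bank reduces locally and sends one partial sum."""
    total = 0
    bytes_moved = 0
    for bank in banks:
        partial = sum(bank.values)       # modeled as computed inside the bank
        bytes_moved += BYTES_PER_RESULT  # only the partial result crosses the bus
        total += partial
    return total, bytes_moved


def make_banks(num_banks, values_per_bank):
    """Build identical banks filled with ones, so the totals are easy to check."""
    return [Bank(values=[1] * values_per_bank) for _ in range(num_banks)]


if __name__ == "__main__":
    banks = make_banks(num_banks=16, values_per_bank=1024)
    print("host-side:", host_side_sum(banks))
    print("in-bank:  ", in_bank_sum(banks))
```

Both paths return the same total; only the modeled bus traffic differs. Real workloads will not shrink this neatly, since only some operations can be pushed into the banks, which is why the software layer that decides what runs where matters so much.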