New AVX-512 instructions

…these are going to be added to Sapphire Rapids. They seem to be aimed at machine learning, but I can’t find a reasonable use case:

Intel is adding some new instructions to Sapphire Rapids, alongside AVX-512, that are aimed at AI training. They add a new set of eight tile registers, a way to configure those registers (and parts of registers) for single-precision accumulation, BF16, and various sizes of integers. Then one instruction (well, four integer variants and one floating-point) will multiply two matrices, assuming they fit in the tiles. If they don’t and you have a large matrix to multiply, you can set things up to multiply parts of matrices, then crunch away.
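The large-matrix case described above is ordinary blocked (tiled) matrix multiplication: split the matrices into tile-sized blocks and accumulate the partial products block by block, which is the pattern you would use to chain the hardware tile ops. A minimal sketch in plain Python, where `TILE` and the function name are illustrative choices, not the hardware's actual tile limits:

```python
# Blocked (tiled) matrix multiply: C = A @ B computed one tile at a time.
# TILE is an arbitrary illustrative size; real tiles are configured in
# hardware and have fixed row/byte limits.
TILE = 4

def tiled_matmul(A, B):
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    # Loop over tile-sized blocks of C, A, and B; each innermost block
    # update is the part a single tile dot-product op would handle.
    for i0 in range(0, n, TILE):
        for j0 in range(0, m, TILE):
            for k0 in range(0, k, TILE):
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        acc = 0.0
                        for kk in range(k0, min(k0 + TILE, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] += acc
    return C
```

The point of the blocking is that each `TILE`-sized partial product only needs a small working set, so matrices far larger than the register file can still be fed through the tile instructions piecewise.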

I was asked if it made sense to build a BLAS library to use these new instructions. My first answer was, “Why?” If you need to do operations on small matrices, the code your compiler, BLAS, or LAPACK uses is fine. If you have lots of small matrices of INT8, INT16, or BF16 values? Your AI/machine-learning system already deals with them. But if you are spending days on a single ML task? Uh, how many GPUs are you using? I don’t expect doing any ML math on a CPU to win against high-end GPUs. And, yes, there are specialized systems sold for doing machine learning. I don’t expect anyone to replace them with a box of Sapphire Rapids chips.

So who is going to use them? I see no point in Intel adding these instructions without also providing a BLAS to go with them. Intel might, but they haven’t asked me to write it. :wink: If they do get one, and it supports integer and Boolean arrays along with floating-point, I’m sure they will publish benchmark results using it. Compared to what? A CPU without those instructions? A GPU?

It might be fun to write an emulation library that could compete with Sapphire Rapids. For some definitions of fun. Word to the wise: I would write it in Ada. Why? I could start out with benchmarks for the existing code in Ada, then deal with cases like INT8 times BF16, etc.

Does anyone else have a clue about using these instructions? I’m going to crosspost to the Intel group. I very seldom crosspost.

The Intel MKL that is part of oneAPI already contains BLAS routines optimized to use the new Intel AMX instructions in Sapphire Rapids. In the past, Intel has compared the performance of the new AMX instructions to CPUs using only the older AVX instructions and has reported somewhere around a 10x performance improvement in some AI training and inference algorithms.

If a machine is dedicated to training and inference, it will likely be built with GPU, tensor-core, or neural-net accelerators. The new AMX would be useful for general-purpose machines that occasionally need to do training or inference.
Alan