No more AVX-512 for Alder Lake

https://www.tomshardware.com/news/intel-reportedly-kills-avx…

Strange – seems like it works pretty well but Intel is dropping it just the same.

As reported by IgorsLab, Intel is killing off AVX-512 enablement on Alder Lake CPUs for good. To do this, the chipmaker will likely release a new microcode update for all Alder Lake motherboards that prevents any AVX-512 enablement workaround from succeeding. Notably, the company pushed hard to bring AVX-512 to mainstream consumers with Rocket Lake; that won’t be the case with Alder Lake.

AVX-512 has been a strange and confusing story for Intel’s 12th Gen Alder Lake platform. In our review of the architecture itself, Intel said the AVX-512 FMA unit is fused off entirely on Alder Lake’s Golden Cove performance cores. We believed the AVX-512 instruction set was physically disabled, with no way to re-enable it via the BIOS or other trickery.

However, it didn’t take long for Alder Lake users to realize that disabling the E-cores within the motherboard’s BIOS opened up the option to enable AVX-512 on the P-cores anyway. According to another report from IgorsLab, it seems this ability came about by accident: motherboard manufacturers could re-activate AVX-512 with hacks to the microcode in the motherboard UEFI/BIOS. Nearly every motherboard vendor has taken advantage of this, making it a mainstream option. Intel, however, was firm in its stance that Alder Lake doesn’t officially support AVX-512, and that enabling the instruction set could lead to errors.

What’s even more bizarre is that performance from the AVX-512 instruction set, in specific workloads, is very effective and efficient. For instance, a month ago we covered a story about RPCS3, an open-source PlayStation 3 emulator, whose developer discovered significant improvements to emulation performance after enabling AVX-512 on a Core i9-12900K.

IgorsLab also tested the strengths of AVX-512 on Alder Lake and found them to be excellent. The German publication noted that, surprisingly, AVX-512 was even more power-efficient than AVX2. It’s a far cry from Rocket Lake’s implementation, where AVX-512 was more of a power hog than anything else.

Just keep in mind that not all workloads benefit from AVX-512, so it’s best not to assume that the E-cores are underwhelming and that disabling them for the sake of AVX-512 is always worth it. There are still plenty of workloads that benefit from having the P-cores and E-cores working in conjunction.

Why why why? hmm. Some other way of delivering the same value that fits better with their overall strategy? like pushing GPU capabilities instead? Or… something else? Strange.

1 Like

Why why why? hmm. Some other way of delivering the same value that fits better with their overall strategy? like pushing GPU capabilities instead? Or… something else? Strange.

Four possibilities: 1) Intel wants to make AVX-512 a Xeon-only extension. 2) Intel has figured out that the complexity (time and silicon area) cost of AVX-512 isn’t worth it. That’s what AMD says, but it may be an effect of cache sizes and data-pathway widths. 3) Intel has a new set of SIMD extensions it wants to field, which are incompatible with AVX-512. 4) It actually doesn’t work, but Intel doesn’t want to do a massive recall.

The first seems possible. The second? Whether AVX-512 as such is a benefit is a close call. Yes, there are other instructions within AVX-512, but the headline operation is multiply and accumulate (MAC or FMAC): using multiple registers to do A*Y+Z, where A and Y are two or four wide times the SIMD factor, so up to 16 rows of your actual (double-precision) arguments. (Complex-number arrays are something I won’t go into here.) If you are doing trigonometric or exponential computations, fine. But for simple arithmetic, you jam up against the bandwidth from L1D to the registers. AMD just upped it from two reads and one write to three reads and two writes, though only with 256-bit-wide values. I expect AMD to eliminate that restriction with Zen 4. Intel could be planning to increase it even further than in Alder Lake.

Option three is unlikely. Not the new-instructions part, but the incompatibility: that can always be finessed with a feature bit that, when set, disables AVX-512. Option four is probably the winner. Imagine that the last bits of the second argument occasionally get trashed. It could happen only when an interrupt occurs, or with denormalized numbers, or just for certain bit patterns. Since most floating-point multiplies round away any effect of the last bytes of either argument, it would take work to find, and it would not affect most computations.

If that is the case, Intel had two choices. One was a huge recall, with possibly some SKUs that couldn’t easily be replaced. The other? Say Alder Lake doesn’t support AVX-512 and make it clear that it takes unsupported BIOS revisions or whatever to activate it.

That last is my bet. If anyone has access to an Alder Lake CPU in a system with AVX-512 enabled, I’d be happy to tell you how to test it.

4 Likes

I agree with your assessment. I suspect the answer may be related to your #4, in that AVX-512 is very power hungry and Alder Lake is already bumping into a speed/power wall without it.

The Golden Cove core in Alder Lake has an extension on top of AVX-512 called AMX, or Advanced Matrix Extensions. Here is a brief summary of it:
https://www.nextplatform.com/2021/08/19/with-amx-intel-adds-…

Generally speaking, AMX is designed to accelerate AI/ML workloads. It is also likely they are reserving AVX-512/AMX for the pricier Xeon line.
Alan

2 Likes

Alan said: Generally speaking, AMX is designed to accelerate AI/ML workloads.

Nice, although I reported two typos, one in what is probably an Intel-sourced slide. :wink:

I don’t expect every, or even any, chip to complete a (vector) multiply and add for 1K bits in one clock cycle, but who cares? (Well, one effective clock cycle of latency.) Ten registers, if I can fire off one MAC every clock cycle for three different MACs? Great. Now we need the bandwidth to get data from the caches to the T registers. The simplest way to do matrix multiplication is to compute and store each C(x,z) only once (in A*B+C → C). With three sets of registers, each keeps its own version of C(x,z), and you add them together after every y loop. Or you can have three y loops running interwoven, for different values of B. Reading A becomes your O(n^3) memory access.

Hmmm. Just realized during review: three streams fit nicely for the complex multiplications needed in FFTs (Fast Fourier Transforms); see Karatsuba multiplication for details.

Antonio,

Why why why? hmm. Some other way of delivering the same value that fits better with their overall strategy? like pushing GPU capabilities instead? Or… something else?

Or perhaps forcing people who need the feature(s) now disabled to buy a new system?

But if I were the customer who needed that feature, that would be motivation to replace the affected system with a competitor’s product.

Norm.

4) It actually doesn’t work, but Intel doesn’t want to do a massive recall. […] Option four is probably the winner.

I agree. However, they may not have any known errata yet. It could just be that, as an unsupported feature, it has not been validated to the normal high standards, and the lawyers within Intel may foresee trouble in leaving a poorly validated feature enabled, even if unsupported. Stating that it is unsupported may not be enough cover, should an issue arise and customer complaints be filed.

Even if the feature has been validated to the normal high standards, the lawyers may be arguing the risk is still too high to leave an unsupported feature enabled. Supposedly fully validated features may still reveal unknown errata after release (FDIV, Bulldozer, etc.).

Or perhaps forcing people who need the feature(s) now disabled to buy a new system?
You appear to be under the misimpression that AVX-512 was enabled in the first place for Alder Lake.
It wasn’t. Intel has not changed its stance that it is not a supported feature.

So if someone needed that feature, they would not have bought an Alder Lake system to start with.

The new report here is only that Intel is apparently making it more difficult (possibly impossible) to use these instructions that they’ve already said shouldn’t be used.

I wouldn’t be surprised if there was a bug that Intel is trying to prevent people from being affected by. Eachus seems to think that he could easily find the bug if he had access to such a machine. I’m sure that if he wanted to, he could reach out to the sources mentioned in the article. But IMO it’s unlikely that he would be able to find a bug in a reasonable amount of time.

If it were just operating on a single 64-bit piece of data, at 1 operation per clock at 5 GHz, exhaustive testing would take (2^64 / 5,000,000,000) seconds, which is about 116 years. But these instructions operate on much larger amounts of data, and each additional bit doubles the time needed. So with two 64-bit inputs, you don’t need just 232 years; you would need on the order of 10^21 years. (To put that in perspective, the sun is ~10^9 years old.)

Even if you claim that you don’t need to do exhaustive testing, that you can select a subset of data likely to hit the corner cases where bugs hide, it’s still a huge data space you’d need to test. And it seems the sources in the article have done a fair amount of testing already and weren’t able to identify any issues. So it is not anything obviously apparent; it could be significantly rarer than the Pentium FDIV bug. Or it could be completely unrelated to the data being operated on: some unexpected or undesired interaction with other instructions or operations. Or it could be, as others have mentioned, that it wasn’t validated sufficiently, so it isn’t known whether it has bugs, and they’re limiting their liability in case it does. Or it could be something else entirely.

3 Likes

I wouldn’t be surprised if there was a bug that Intel is trying to prevent people from being affected by. Eachus seems to think that he could easily find the bug if he had access to such a machine.

It is not difficult to code as long as you don’t rely on the suspect instructions in the checking code. The potential problem case to test for is when low-order bits from the 105-bit multiply are made visible by subtracting a number almost the same as the first or second operand as the addition part of the MAC (multiply and accumulate). In other words, you are computing (1.0 + k) * x - 1.0 * x, where k is 1 to 255* times the lsb (least significant bit). The result should be k*x. Divide that by k (which uses completely different logic) and see if you get x. Sounds simple; the tricky part is that you need to build a sequence of 16 multiplies where what I described is the last multiply. So you stuff the first 14 multiplies with a sequence that sums to zero and doesn’t have any non-zero bits in the low-order 32 bits of any operand. The penultimate multiply, of course, is -1.0 times x. What values to use for x? There are about a dozen obvious edge cases, like 1.0-lsb (all mantissa bits set). Then test random x values for a couple of weeks. (Actually, see below: both Honeywell and Stratus were running this test, or one like it, as a low-priority background task on several computers in the development labs.)

If you want to do the job that I would if I were doing the testing for a CPU or compiler manufacturer, go ahead. You may want that program to run for the next few years.

Stuff you probably want to ignore.

Um, I wrote substantially similar code at Honeywell for their DPS6 line of small machines. When I arrived at Stratus, they were already trying to fix the divide routine for their 600 line of products. But I wrote basically this code as test code for the Ada and PL/I compilers. Getting FP64 code right is not that easy. I wasn’t on that particular standards committee (it is/was IEEE, not ISO), but I attended several meetings as liaison from ISO/IEC JTC1/SC22 subgroups. When I retired, AFAIK I was the last expert on how to implement fixed-point arithmetic right. (Addition and subtraction just use standard integer arithmetic, as do multiplying or dividing by an integer; it is the type conversions that are tricky. PL/I and Ada both support/require fixed-point support. Ada doesn’t require support for fixed-point code with non-binary values for 'Small unless you support one of the annexes. Cobol of course requires decimal fixed-point support, :wink: so the Ada Cobol support annex F requires that, along with some additional details in annexes B and G.)

End of may not want to know stuff.

I’d code it up right now, but I’m off to a doctor’s appointment. The only tricky part in the coding is that you probably need to do the AVX-512 MAC as a code insert.

  • I’d do all those cases, but I suspect that if there is an arithmetic error you will see it in most or all of those cases.

Bob,

It is not difficult to code as long as you don’t rely on the suspect instructions in the checking code. The potential problem case to test for is when low-order bits from the 105-bit multiply are made visible by subtracting a number almost the same as the first or second operand as the addition part of the MAC (multiply and accumulate). […]

Yup, programming a computer to compute f(x)-1 or 1-f(x) for small values of x, where f is any continuous function for which f(0)=1, is a very bad idea – and the same is true of programming a computer to compute f(1+x) or f(1-x) for small values of x if f(1)=0. There’s sometimes a good mathematical alternative, with the trigonometric identity

1 - cos x = 2 sin^2(x/2)

probably being the most widely known example – the right side of this identity computes very well, thank you, while the left side is highly susceptible to Loss of Precision (LoP) errors for small values of x. But if there’s no mathematical equivalent, one must resort to adding the significant terms, omitting the constant, of a Maclaurin series expansion of the function. Here, the calculation

f(x) = (1 + x)^a - 1

that arises in Khoury’s formula for probability of radar detection of a target turns out to be a double whammy. The mathematical rearrangement

(1 + x)^a - 1 = exp(a ln (1 + x)) - 1

is straightforward, but ln (1 + x) also is susceptible to LoP errors for small values of x. Thus, in this example, one has to resort to adding the significant terms of the Maclaurin series for both ln(1 + x) and exp(y) to dodge the LoP problem.

Of course, one can write the program to check the value of x before resorting to adding terms of a Maclaurin series and use the actual formula whenever x is sufficiently large to be out of the LoP danger zone.

Norm.

Yup, the programming a computer to compute f(x)-1 or 1-f(x) for small values of x where f is any continuous function for which f(0)=1 is a very bad idea – and the same is true of programming a computer to compute f(1+x) or f(1-x) for small values of x if f(1)=0. There’s sometimes a good mathematical alternative, with the trigonometric identity…

Yup, but here the intent is to find out whether all of the bits of a canonical expression required to be correct are, in fact, correct. For this type of test, doing (1+x)(1+y)-1 for small x and y is just what you want – in particular, the case where x and y are some small multiple of the lsb (least significant bit of the mantissa). It seems odd that the 64-bit integer value 1 (one) represents 1.0+lsb, but it makes generating and reading (in binary) the values you need to work with easy. :wink:

Hmm. A bit more. The killer case for testing floating-point hardware is (1.0+lsb)(1.0-lsb)-1.0. I didn’t even propose it above. The expected answer is -lsb^2. The Cray 1 was notorious for not getting this case (and others) right.

Bob,

The killer case for testing floating-point hardware is (1.0+lsb)(1.0-lsb)-1.0. I didn’t even propose it above. The expected answer is -lsb^2. The Cray 1 was notorious for not getting this case (and others) right.

I would not expect very many computers to get that calculation correct. The problem is that

(1.0+lsb)(1.0-lsb)=1.0-lsb^2

mathematically, but the computation typically rounds off to either 1.0 or 1.0-lsb before the subtraction of 1.0, resulting in a value of either zero or -lsb, respectively. This is a case where you have to do the math manually and program the computer to compute -lsb^2 to get the correct answer. But the bonus is that -lsb^2 is a lot less computation than (1.0+lsb)(1.0-lsb)-1.0, so the numerically stable computation is clearly more efficient than the raw formula.

Norm.

I would not expect very many computers to get that calculation correct. The problem is that

(1.0+lsb)(1.0-lsb)=1.0-lsb^2

mathematically, but the computation typically rounds off to either 1.0 or 1.0-lsb before the subtraction of 1.0, resulting in a value of either zero or -lsb, respectively.

(What follows reads as a bit rude. I’m not intending to be rude, but it is too late at night to consider re-writing it.)

Read the documentation. IEEE 754-2008 is very clear on the subject. If you do this as two instructions, you do get that intermediate rounding. But the various (three- and four-operand) MAC instructions are not supposed to do intermediate rounding. (I think the AMD four-operand MACs were part of SIMD 4.x for some value of x, and are no longer supported.)