Sapphire Rapids delays: full of errata

https://www.tomshardware.com/news/intel-sapphire-rapids-had-…

According to Igor’s Lab, Sapphire Rapids had about 500 bugs, and the company needed 12 steppings to fix them.

Intel’s fourth Gen Xeon Scalable ‘Sapphire Rapids’ processor will not only increase the core count to up to 60, but will also bring in numerous new features, including Advanced Matrix Extensions (AMX), Data Streaming Accelerator (DSA), the CXL 1.1 protocol, DDR5 and HBM2E memory support, a PCIe Gen 5 interface, and many more. But the host of additional features increases the probability of hardware bugs, so Intel had to fix almost 500 of them, Igor’s Lab reports.

So far, Intel has released A0, A1, B0, C0, C1, C2, D0, E0, E2, E3, E4 and E5 steppings of the Sapphire Rapids processor to fix nearly 500 bugs. Given that modern processors integrate tens of billions of transistors, it is inevitable that they have a certain number of bugs. These are called errata and are mitigated with microcode or even software updates. But 500 errata seem overwhelming, as do 12 respins, considering that a respin costs tens of millions of dollars.

Although it is expensive to build new respins, the more pressing issue is that Intel has to delay the release of its next-generation datacenter CPUs. Right now, Intel targets a launch window of calendar weeks 6 to 9 of 2023 (Feb. 6, 2023 to March 3, 2023) for high-volume Sapphire Rapids processors. Meanwhile, some SPR products may launch in calendar week 42 and calendar week 45 of 2022.

Excellent leak giving us some insight into what happened.
I think this is emblematic of many of the problems with the earlier Intel administration; they allowed the engineers to run amok and put too many high-risk features into processes and products.

As a reminder on how to decode the steppings: when the letter changes, it is an all-layer change and a start from scratch. When only the number at the end changes, it is just a few-layer change, and most of the material in the production line can still be used for the new stepping. It takes at least a year to bring up a full stepping, while a few-layer stepping can be done in something like 6 months or potentially even less.
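For anyone who wants to apply that decoding rule to the list in the article, here is a minimal Python sketch. It assumes every stepping name is one letter followed by a number; the helper names `classify` and `summarize` are made up for illustration, and the labels simply restate the rule of thumb above rather than anything Intel has published.

```python
import re

# Steppings exactly as listed in the quoted article.
STEPPINGS = ["A0", "A1", "B0", "C0", "C1", "C2",
             "D0", "E0", "E2", "E3", "E4", "E5"]

# Assumed naming pattern: one capital letter, then a number.
PATTERN = re.compile(r"([A-Z])(\d+)")


def classify(prev, curr):
    """Label the change between two consecutive steppings (hypothetical helper)."""
    m_prev = PATTERN.fullmatch(prev)
    m_curr = PATTERN.fullmatch(curr)
    if m_prev is None or m_curr is None:
        raise ValueError(f"unexpected stepping name: {prev!r} or {curr!r}")
    # Letter changed: treated as an all-layer respin.
    if m_prev.group(1) != m_curr.group(1):
        return "all-layer"
    # Same letter, different number: treated as a few-layer change.
    return "few-layer"


def summarize(steppings):
    """Count both kinds of change across a list of steppings."""
    counts = {"all-layer": 0, "few-layer": 0}
    for prev, curr in zip(steppings, steppings[1:]):
        counts[classify(prev, curr)] += 1
    return counts


if __name__ == "__main__":
    for prev, curr in zip(STEPPINGS, STEPPINGS[1:]):
        print(f"{prev} -> {curr}: {classify(prev, curr)}")
    print(summarize(STEPPINGS))
```

By this rule, the 11 transitions between the 12 listed steppings split into 4 all-layer changes (to B0, C0, D0 and E0) and 7 few-layer changes.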

Note that they are not really talking about errata in the final shipping product, but all the bugs they had along the way.

Based on this leak, two-socket Sapphire Rapids starts shipping in the middle of October. Four- and eight-socket designs look like early November. They don’t broadly release until early 2023.
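The calendar-week numbers in the article can be turned into dates with a few lines of Python. This is only a sketch: it assumes the leak uses ISO 8601 week numbering (weeks start on Monday), which may not match Intel's internal work-week calendar, and `iso_week_span` is a made-up helper name.

```python
from datetime import date


def iso_week_span(year, first_week, last_week=None):
    """Return (Monday of first_week, Friday of last_week), ISO 8601 weeks.

    Hypothetical helper; assumes ISO week numbering applies to the leak.
    """
    if last_week is None:
        last_week = first_week
    start = date.fromisocalendar(year, first_week, 1)  # 1 = Monday
    end = date.fromisocalendar(year, last_week, 5)     # 5 = Friday
    return start, end


if __name__ == "__main__":
    # Launch windows mentioned in the article.
    print("2022 week 42:", iso_week_span(2022, 42))
    print("2022 week 45:", iso_week_span(2022, 45))
    print("2023 weeks 6-9:", iso_week_span(2023, 6, 9))
```

Under that assumption, week 42 of 2022 starts on Monday, October 17, week 45 starts on Monday, November 7, and weeks 6 to 9 of 2023 run from February 6 to March 3, which matches the dates given in the article.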

There is still plenty of time for more problems to be found, creating additional delays.
Alan

According to Igor’s Lab, Sapphire Rapids had about 500 bugs, and the company needed 12 steppings to fix them.

Good grief, Charlie Brown!

So far, Intel has released A0, A1, B0, C0, C1, C2, D0, E0, E2, E3, E4 and E5 steppings of the Sapphire Rapids processor to fix nearly 500 bugs. Given that modern processors integrate tens of billions of transistors, it is inevitable that they have a certain number of bugs. These are called errata and are mitigated with microcode or even software updates. But 500 errata seem overwhelming, as do 12 respins, considering that a respin costs tens of millions of dollars.

I assume that most (if not all!) of these errata did not require new transistor layouts. Transistors not wired in take no power, so it is customary (or at least it was) to leave hundreds of extra gates for fixing errata. I wonder if Intel ran out of extra transistors and needed full spins. I look at that D0, E0 and wonder if Intel failed to “go big or go home” when adding “extra” gates for future errata.

The first AMD Duron was shipped on A0 silicon. It had the advantage of being derived from the Athlon in the same generation. But I took a (definitely not complete) look at the AMD Revision Guides for Zen processors. I didn’t see any revisions other than A0, A1, B1 and B2. Also, AMD seems to have a single sequence of errata numbers for all Zen processors. Even though the numbers are into the 1300s, many were deleted before being logged.* There seem to be a dozen remaining errata per product, most marked “No fix planned.”

* Creating lots of errata early on is good. During the design phase, before you bend any metal (er, silicon), fixing the documentation is good; adding a missing detail is even better. These errata are usually deleted, or rather, never included in the errata documentation. Sometimes the documentation is correct, and the detailed design has a mistake. Finally, there are errata of the “You silly git!” flavor. Of course, there are no stupid gits involved (at least among those who read, write, or obey the documentation). For example, take an erratum that says the time between the OS doing X and Y must be at least 50 nanoseconds. This will be marked as “No fix planned.” Why keep it around? When the next version of the chip, or even a faster version of the current chip, shows up, this needs to be rechecked. The silly gits are those who try to optimize the OS without checking the errata, and overclockers who expect everything to work perfectly. I remember, many years ago, showing a group that was overclocking math coprocessors that when correctly clocked, the chips gave consistent (and correct) results, and that when overclocked, several different wrong values showed up in no particular order.

So if you want to try for overclocking records, remember to read the errata and adjust the BIOS, OS, and test programs accordingly.