Big interview unpacked

https://wccftech.com/amd-ryzen-7000-desktop-cpus-am5-platfor…

During the latest episode of PCWorld’s ‘The Full Nerd’ webcast series, guests Robert Hallock (AMD Director of Technical Marketing) and Frank Azor (Chief Architect of Gaming Solutions) answered a series of questions and further detailed the Ryzen 7000 Desktop CPUs and the features we will be seeing on the AM5 platform.

AMD Details Ryzen 7000 Desktop CPUs & AM5 Motherboard Platform Features
A range of questions was asked, and both Robert and Frank did a splendid job answering them for the folks over at PCWorld. We have slowly been getting more details on the AMD Ryzen 7000 CPUs and the AM5 platform since the Computex 2022 unveil, so let’s start off with the latest information.

Some of the things that were confirmed in the interview include:

Ryzen 7000 is 125W TDP / 170W Package Power
Ryzen 7000 5.5 GHz Demo Was In Stock-Spec (No Overclock)
Ryzen 7000 Doubled L2 Cache Is An IPC Benefit
Ryzen 7000 CPUs Have 28 PCIe Gen 5 lanes (24 Usable)
1:1 Infinity Fabric Clock (No Frequencies Mentioned)
B650 Motherboards will support overclocking (like B550 series)
Integrated RDNA 2 GPU supports both video encode and decode
Integrated RDNA 2 GPU For Commercial/Diagnostic Purposes
Smart Access Storage Details (Requirements Highlighted)

"So what we want to clarify is that it’s a 170 Watt socket power, which with AMD, that spec is PPT (Package Power) for us. That doesn’t mean that every CPU is going to go up to 170 Watts, but it’s 30 (Watts) higher than the socket AM4 power cap, which was 142 (Watts). And we did this mainly to improve multi-thread performance, as many of the high core-count chips were actually held back in overall compute performance by relatively modest socket power.

“The other point that I want to make is that by raising the minimum required socket power, or minimum spec, you also raise the power delivery with every motherboard built to that spec, so you get more robust power characteristics on all the boards, which we are pretty excited about as well. It should be good for people who want to experiment with overclocking, people who appreciate premium board designs.”

–Robert Hallock (AMD Director of Technical Marketing)

As for the 5.5 GHz gaming demo, Robert assured viewers that the frequencies were entirely stock-spec. The motherboard used was a reference X670 design, and the cooling was a standard Asetek 280mm AIO cooler. It is also apparent that no overclocking was involved, since the clocks varied between 5.1 and 5.5 GHz.


AMD showcased some impressively high frequencies, with the same Ryzen 7000 CPU sample hitting up to 5.52 GHz, though we did see the clock speeds vary, starting at 5.1 GHz and going up to the 5.52 GHz peak that everyone is talking about. Interestingly, Robert states that in the respective game demo they saw most of the threads clocking up to 5.5 GHz (that’s 32 threads for the prototype that was used). The 16-core Ryzen 7000 prototype was produced around late April or early May, so AMD could still squeeze more headroom out of this chip if they want to - or just let overclockers do the job.

2 Likes

Very interesting. There has been a lot of speculation as to whether AMD was sandbagging on Zen4’s performance or if Zen4 is weaker than expected. Some have even gone so far as to call it DOA.

Many of these comments have been driven by the fact that the IPC uplift from the 5nm process is 15% per TSMC, while AMD was saying only that single-threaded performance gains were >15%. So was AMD sandbagging, or was it simply that the Zen4 architecture IPC gain was not much above the 15% gain from the node?

Ryzen 7000 Doubled L2 Cache Is An IPC Benefit

We also do not know which Ryzen 7000 processor was shown out-rendering the Intel Core i9-12900K by 31%. It may not have been the top SKU. AMD has done that before.

I am starting to lean towards the opinion that AMD was sandbagging.

Not sure whether I am ignorant or not, but to me a 7nm-to-5nm shrink should give something like a 7^2/5^2 increase in performance at the same clock speeds and power - in other words, almost a doubling. A 15% performance increase seems trivial for a shrink. 12nm to 7nm was a pretty good performance increase. I don’t know where I am wrong, but if not, sandbagging would fit my understanding.

Not sure whether I am ignorant or not, but to me a 7nm-to-5nm shrink should give something like a 7^2/5^2 increase in performance at the same clock speeds and power - in other words, almost a doubling

Well, since you’re not sure, let me clear it up for you - you are ignorant.

7nm to 5nm allows double the number of transistors in the same space (approximately)
Double the transistors doesn’t mean double the performance (with the very small outlier exception of massively parallel things like GPUs ~20 years ago)

For a CPU like Zen4, they have to look at what they will use the extra transistors for. They could (probably) just double the number of cores and be done. For very parallel workloads that would basically double performance. But it would be a 0% improvement for single-threaded performance. And for 99% of what these CPUs are used for it would be little or no benefit - and wouldn’t be worth the premium price. Instead they look at what gets them smaller increases in performance across a much wider range of use cases. Maybe it’s adding some extra cache. Maybe it’s changing the floating-point pipeline or the integer pipeline. Maybe it’s adding new instructions that speed up only a few applications, but provide a lot of benefit to those use cases. Throw everything together, and 15% from one generation to the next is a realistic result.
Since you brought up 12nm to 7nm - what was the improvement from 12nm (Zen+) to 7nm (Zen2)?
Was it 3X like you’d calculate from 12^2/7^2?
No, it was 15% (https://www.techspot.com/article/1876-4ghz-ryzen-3rd-gen-vs-… )
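The node-name arithmetic behind this comparison can be sketched in a few lines. The function name is mine, and the "nm" figures are marketing labels rather than physical dimensions, so treat the ideal ratios as illustrative only:

```python
def ideal_density_scaling(old_nm: float, new_nm: float) -> float:
    """Transistor-density ratio if features really shrank linearly in both dimensions."""
    return (old_nm / new_nm) ** 2

# Zen+ (12nm) -> Zen2 (7nm): ~2.9x ideal density, yet ~15% measured per-clock gain
print(f"12nm -> 7nm: {ideal_density_scaling(12, 7):.1f}x ideal density")
# Zen3 (7nm) -> Zen4 (5nm): ~2.0x ideal density
print(f"7nm -> 5nm: {ideal_density_scaling(7, 5):.2f}x ideal density")
```

The gap between the ~3x "ideal" figure and the measured 15% is exactly the point being made here: density does not translate into performance.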

1 Like

Thanks for that foo1bar. I hope we are agreed that a 2:1 shrink potentially increases the transistor density by a factor of four. So why is only a 15% increase seen in the chips? Something else must not be scaling the same way - clock speed? Voltage? Power?

It would aid my understanding as an investor what to expect from a shrink. Possibly one for the hardware gurus here.

For a CPU like Zen4, they have to look at what they will use the extra transistors for. They could (probably) just double the number of cores and be done. For very parallel workloads that would basically double performance. But it would be a 0% improvement for single-threaded performance.

Wait, so there would be no speed increase at all from the process shrink alone? I thought that could at least allow faster clock speeds. Plus, independent of the clock speed, is there no speed increase from the transistors being closer together and so the electrons would have less far to travel?

https://www.tsmc.com/english/dedicatedFoundry/technology/log…
N5 technology provides about 20% faster speed than N7 technology or about 40% power reduction (for equivalent designs).

========

AMD, for each design, decides what combination of speed increase and power reduction they want. With the socket power going up, it appears Zen 4 will focus more on faster speeds.

Plus, independent of the clock speed, is there no speed increase from the transistors being closer together and so the electrons would have less far to travel?

There is no speed increase independent of the clock speed.

Transistors being smaller and closer together might allow faster clock speeds.
It’s more complex than the electrons have less far to travel.
It’s more about resistance and capacitance of the circuits, the switching time of the transistors, and the voltages and currents in the circuits.

Back to why there is no speed increase independent of the clock speed:
Everything in the chip is done to the cadence of the clock. In some parts of the chip, very complex things (like partial results of a multiplication) are computed between one set of flip-flops (clocked elements) and the next. In other parts of the chip there are only buffers and wires carrying a value from one flip-flop to the next.
Everything is to the cadence of the clock - so if the clock is 3GHz (a 333 ps period), then the circuit has up to 333ps to get its results ready at the input of the next flip-flop. Whether those results are ready in 250ps or in 300ps doesn’t matter - so long as they’re ready by 333ps. If they’re not ready by 333ps, you get bad results. So why does shrinking the transistors help? It helps because now you can (hopefully) see every circuit compute its results a little faster, and if everything is ready in 322ps then you can run the clock at 3.1GHz. Or maybe you lower the voltage a little, so the circuits aren’t ready in 322ps but are still ready in 333ps. You keep the clock at 3GHz, but you are able to reduce the power by 10% (or maybe more).
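That timing budget is simple enough to check in Python; the 333 ps and 322 ps figures are the ones from the example above:

```python
def period_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a clock frequency in GHz."""
    return 1000.0 / freq_ghz

def max_freq_ghz(critical_path_ps: float) -> float:
    """Highest clock at which the slowest stage still settles within one period."""
    return 1000.0 / critical_path_ps

print(f"{period_ps(3.0):.0f} ps budget per stage at 3 GHz")               # ~333 ps
print(f"{max_freq_ghz(322.0):.2f} GHz if every stage is ready in 322 ps")  # ~3.11 GHz
```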

Now there might be a performance increase independent of the clock speed.
It is possible to make changes such that what was done spread across 6 clocks is now done spread across 5 clocks. Or to make changes so that there is less time that the circuits aren’t doing useful work (ex. make it so you’re more likely to find the data/instructions that are needed in the cache, rather than having to wait to get it from an L2 cache - or from main memory (which results in much larger amounts of time that the circuits are idle)). If I can get something done in 5 clocks instead of 6, that’s roughly a 17% improvement for that item - so maybe an FP multiply gets ~17% faster, and overall FP performance improves by 12% (because there are things other than just the multiply that didn’t get faster). And maybe that change helps total performance on various benchmarks by anywhere from 2% to 12%, depending on what the benchmark is doing.
Now - I don’t know what changes they made - so this is a purely theoretical example. I went with FP as an example because I think it’s something people can understand as being part of the overall performance.
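Weighting a local gain by how often it matters is just Amdahl’s law. A minimal sketch - the 6-to-5-clock figure comes from the example above, while the workload fractions are made up for illustration and are not AMD data:

```python
def overall_speedup(fraction_affected: float, local_speedup: float) -> float:
    """Amdahl's law: only the affected fraction of the work gets faster."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / local_speedup)

local = 6 / 5  # an operation that took 6 clocks now takes 5: 1.2x locally

# If that operation is half of the workload, the overall gain is ~9%;
# if it is only a tenth of the workload, the overall gain is ~2%.
print(f"{overall_speedup(0.5, local):.3f}x")
print(f"{overall_speedup(0.1, local):.3f}x")
```

This is why the same microarchitectural change can show anywhere from a 2% to a 12% benchmark gain, as described above.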

1 Like

AMD, for each design, decides what combination of speed increase and power reduction they want. With the socket power going up, it appears Zen 4 will focus more on faster speeds.

A long time ago in semiconductor years - say, before the 130 nm node - a die shrink got you more parts per wafer, faster speeds from the faster switching times of smaller transistors, and faster overall speeds from the shorter paths the signals had to take. RADM Grace Murray Hopper used to hand out foot-long wires as a visualization of “one nanosecond is one foot” - or, if you prefer, 30 cm. Actually, you are doing well to get eight inches (20 cm).

When I was young, and Grace was preaching to the choir, computers had cycle times in (tens of) kilohertz.* Transistors, and various logic protocols and transistor types starting with RTL (resistor-transistor logic), pushed clock cycles above one MHz. Then CMOS became the technology to beat, and it has stayed that way for 40 years.

Right now I can look down and see, occasionally, 4950 MHz CPU clocks. Translated to silicon reality, you have six centimeters, or just over two inches, to play with per clock cycle. Shave it down to four centimeters (just over one and a half inches), and you see why a lot of the performance progress has come not from adding or multiplying two numbers in registers faster, but from using those “extra” transistors from the die shrink to bring uncore into the CPU chip (or chiplets), avoiding the timing cost of going from one chip to another.
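The foot-per-nanosecond arithmetic checks out in a couple of lines; the 20 cm/ns figure is the post’s practical rule of thumb, not a measured value:

```python
VACUUM_CM_PER_NS = 30.0  # light in vacuum: ~30 cm (one foot) per nanosecond
TRACE_CM_PER_NS = 20.0   # practical signal speed on real traces (rule of thumb)

def cm_per_cycle(freq_mhz: float, speed_cm_per_ns: float) -> float:
    """How far a signal can travel during one clock cycle."""
    period_ns = 1000.0 / freq_mhz
    return speed_cm_per_ns * period_ns

print(f"{cm_per_cycle(4950, VACUUM_CM_PER_NS):.1f} cm per cycle in vacuum")    # ~6.1 cm
print(f"{cm_per_cycle(4950, TRACE_CM_PER_NS):.1f} cm per cycle in practice")   # ~4.0 cm
```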

That’s the (relatively) good story. Now for the bad news. Signals don’t go through wires. They travel in the environment around the (copper) conductor. If you have two parallel wires, you get crosstalk. That limits how close together the wires can be. In addition, you get parasitic capacitance. That slows the signals down. And then there is the elephant in the room. As these wires get smaller and smaller, their resistance increases. That also slows things down and, just to make the hardware engineers’ life more miserable, heats up the traces, and the die as a whole. How do you deal with this? In part with fancy coolers when the chips are assembled into a system. Much more important is to decrease the amount of current through the traces. To do that, and keep the speed fixed, requires reducing the capacitance in both the traces and the transistors.

FinFETs are nice, not just because they allow you to switch a transistor - at the same clock speed - with a smaller current. The three-dimensional nature of the FinFET helps reduce parasitic capacitance and crosstalk. One of the “tricks” TSMC is using in its 5 nm process family is to make the fin taller than in the 7 nm family. This means either less power required to flip the transistor or faster switching if you use more current. (I guess I’m relying on everyone to know the two fundamental electric equations: E=IR, voltage equals current times resistance, and P=IE, power in watts equals current times voltage. These formulas and lots of variations are known as Ohm’s Law.)

What do they mean to a chip designer? They can choose a voltage and design the chip to have a particular speed at that voltage, or they can pick a clock speed and voltage and spend more transistors where necessary to meet the design goals. I’ll just throw in here that the problem they are trying to solve is NP-hard. If your goals are set too high, there may be a solution, but you don’t have millions of years to find it. (This is what happened to Intel at 10 nm. They chose “stretch” goals which may have been possible, but there was no chance of finding a solution in billions of man-years.)

Back to the specific case of AMD’s Zen 4, they chose to design to power, voltage, or clock speed, but not all three. My read is that they designed to a given clock speed and voltage while designing the new motherboards to provide more power if needed. They came out with slack, and at the top end, they are spending that to push clock speed. Notice that the rule of thumb for overclocking is that power required goes up as the cube of the voltage, while clock speeds go up linearly with voltage. In the HPC realm, performance per watt is starting to become the main performance measure. If you have a building that can provide 10 megawatts to your new system, you don’t really care about other limits.
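The overclocking rule of thumb quoted above - clock roughly linear in voltage, power roughly cubic (dynamic power scales like C·V²·f, with f itself roughly proportional to V) - works out like this; the numbers are illustrative only:

```python
def relative_power(voltage_ratio: float) -> float:
    """Relative power draw under the V^3 rule of thumb (P ~ C * V^2 * f, f ~ V)."""
    return voltage_ratio ** 3

# A 10% voltage bump buys roughly 10% more clock but costs ~33% more power.
print(f"{relative_power(1.10):.3f}x power for 1.10x voltage")  # ~1.331x
```

That cubic cost is why performance-per-watt, not peak clock, is becoming the metric that matters in HPC.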

Bottom line: is Zen 4 going to be wonderful or just average? Expect most motherboards to have four DDR5 DIMM slots, one per channel. Four memory channels, not two, and DDR5 will soon start to outperform DDR4 significantly. What memory speeds will X670E mobos support? I dunno, but I think the key support is in the I/O die on the CPU, and that will be supporting reasonable-speed 12-channel DDR5 memory.** Don’t expect the L3 size to increase (other than a 3d version), but AMD has confirmed that the L2 cache will be doubled. Let’s look at one other jaw-dropper. I expect Raphael to top out at 16 cores, but AMD also has Bergamo, another Zen 4 design, that might show up in 24- or 32-core Zen 4 desktop products. My thinking is that Bergamo will be optimized for one thread per core, but 16 cores per chiplet. The magic here would involve twice the L1D and twice the L2 (2 Meg) per core. There is just too much new coming in Zen 4 to expect a mediocre chip, and a desktop version of Bergamo would give AMD a do-over early next year. But I don’t expect them to need it.

* Yes, I programmed and operated computers with drum memory. Not as storage but as working memory. They had 2+1 instructions: add A to B, and take the next instruction from C. This allowed clever programming to get 7 or more instructions per 16 2/3 millisecond drum revolution.

** I expect there will be motherboards with two DIMMs per channel, eight total, but expect those to be for workstation-class systems.

3 Likes

Thanks, good explanation.

Now there might be a performance increase independent of the clock speed.
It is possible to make changes such that what was done spread across 6 clocks is now done spread across 5 clocks. Or to make changes so that there is less time that the circuits aren’t doing useful work. (ex. make it so you’re more likely to find the data/instructions that are needed in the cache, rather than having to wait to get it from an L2 cache - or from main memory (which results in much larger amounts of time that the circuits are idle).

So, these changes would at least require a chip redesign, right? Or were you thinking more of compiler changes?

So, these changes would at least require a chip redesign, right?

Yes.
Since we’re talking about changing from one process node to the next, you’d be doing a chip redesign no matter what.
But there are differences in how much change goes into a new chip. Changing a pipeline to do things in 5 stages instead of 6 is a significant change to the logic/microarchitecture - so it would be a major redesign. Increasing an L1 data cache from 32KB to 64KB might be a less significant logic/microarchitecture change, but would likely still result in significant changes to the floorplan for the core/chip. Increasing an L3 cache would probably be less disruptive (as it’s typically on the outside edge of the chip). And even if there are no logic changes (which I think basically never happens when changing process nodes for a CPU family), there would still be changes to the more analog circuitry. For example, clock structures like PLLs are likely to need to be redone in the new process, as are I/O buffers.

Changing a pipeline to do things in 5 stages instead of 6 is a significant change to the logic/microarchitecture - so would be a major redesign. Increasing an L1 data cache from 32KB to 64KB might be a less significant logic/microarchitecture change, but likely still result in significant changes to the floorplan for the core/chip. Increasing an L3 cache would probably be less disruptive (as it’s typically on the outside edge of the chip).

One of the biggest changes for Zen 4 Raphael and Genoa is doubling the per-core L2 cache from 512K to 1 Meg. The L2 is private to each core (unlike the shared L3), so an eight-core CCD/chiplet now carries 8 Meg of L2 in total (6 Meg on six-core parts). I do not expect L1D or L1I to change; as pointed out above, changing the L3 size is a lot easier. (AMD is expected to have 3d versions of Genoa with substantially larger L3. There may be a desktop halo part as well.)
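As a quick sanity check of the totals - assuming the doubled 1 MB (1024 KB) per-core L2, and remembering that on Zen the L2 is private per core, so the marketed total is just core count times per-core size:

```python
def total_l2_mb(cores: int, per_core_kb: int) -> float:
    """Total L2 across a CCD when each core has its own private L2."""
    return cores * per_core_kb / 1024.0

print(f"{total_l2_mb(8, 1024):.0f} MB total L2 on an 8-core CCD")  # 8 MB
print(f"{total_l2_mb(6, 1024):.0f} MB total L2 on a 6-core part")  # 6 MB
```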

Remember also that there are two Zen 4 chiplet designs. I can guess at the cache sizes for Bergamo and any desktop and laptop versions, but at this point, they are definitely guesses. I would not be surprised to find out that while I am typing this, AMD has a supercomputer or two running register level simulations on different loads to make those decisions. Or they may have been done months ago. (AMD may have chosen the designs to simulate for Genoa to include the possible Bergamo cache sizes.) The important point to remember is that the major use for Bergamo is expected to be large database instances. I’m expecting a smaller L3, no changes to L1 sizes, and look at the results from those simulation runs to decide on the L2 size. :wink:

A more optimistic prediction for Zen 4 performance:
A subsequent post on the tech blog Chips and Cheese suggests the overall performance gain could be as much as 40%, while IPC (instructions per clock) could increase by 25%. The article goes on to say that early samples of AMD’s next EPYC processors show a 29% speed improvement over the current generation, despite having the same number of cores and clocks.

https://www.techadvisor.com/news/pc-components/amd-zen-4-380…

1 Like

A more optimistic prediction for Zen 4 performance…

Realize that right now AMD is trying to do two things: make sure enough AM5 motherboards are available when Raphael launches, and not Osborne their current products. So at this point in time, AMD needs to soft-pedal Zen 3 comparisons to Zen 4 and keep the launch date fluid. Once 2Q2022 closes, AMD can start hyping Zen 4. Well… if sales won’t have a chance to ramp up before September 30th, AMD might continue dragging its feet.

My guess for a shipping date is mid-to-late August. Clean the shelves of Zen 3 and Zen 3+ machines with sales starting into the back-to-school season, with Raphael catching the end of it. Of course, there is an elephant in this room: AMD just started shipping Rembrandt (Zen 3+ laptops) and expects to see more Rembrandt unit sales in Q2 and Q3 than AM4 and AM5 desktop chips combined. That argues for a Raphael launch after Labor Day.

1 Like

Transistor density is valuable. But IIRC you can see what may happen by shooting for too much density, as likely happened to some Intel products. If you don’t put the transistors close together, you don’t gain the benefits of shorter distances between signals. But if you put the transistors close together and pack in a lot more of them, you can get hot spots that actually worsen performance. It’s a balancing act, and part of why it can take so long from tape-out to shipping, as they test the right densities to improve overall speed without causing heat-based problems.

15% has been considered a decent speed jump for most of this century, IIRC. If you get 15% IPC as the main gain from a die shrink, and stick in an extra core or two, you can boost both single-threaded and multithreaded performance while avoiding hot spots. If there is a specialized need for more of either, and enough demand to justify it, I’m sure AMD would redesign to either further boost IPC or bump up the number of cores without requiring exotic cooling on thermally constrained platforms such as laptops. I think the core boosts are why AMD is doing so well with servers. The fewer SKUs you have, the more benefit you get from minor tweaks to slightly improve your entire product line.

Unless Intel clearly passes AMD in servers (where the margins are best) I’m not going to worry. AMD is at least one node behind Intel and still producing equal or superior CPUs. That all speaks to the superiority of the chiplet designs AMD is using. AMD is taking that extra money and using it for synergistic acquisitions (Xilinx, Pensando) that are steadily increasing the demand for AMD’s chips. Unlike every prior AMD CEO, I really can’t blame Dr. Su for any significant mistakes. If Intel does pass AMD, AMD can start ordering wafers on smaller SOTA nodes, pay more for them, charge more for the CPUs, and pass Intel. These are known improvements AMD has in hand when they need them. I suspect the new power levels are based in large part on what AMD will have to deliver to stay ahead of Intel at 5nm or smaller production nodes.

1 Like

If true, this is great news (I am struggling to find an interpretation where this is bad news):

>15% gain in single-threaded work, >35% overall performance gain (multi-threaded workloads), >25% performance-per-watt gains

https://www.tomshardware.com/news/amd-ryzen-7000-leak-releas…