AMD GPU chiplet design

AMD superstars and Corporate Fellows Sam Naffziger and Michael Mantor, with colleagues Mark Fowler and Mark Leather, have just had their patent on a new GPU chiplet design published. It seems they have cracked the scaling problem beyond reticle size.

“A graphics processing unit (GPU) of a processing system is partitioned into multiple dies (referred to as GPU chiplets) that are configurable to collectively function and interface with an application as a single GPU in a first mode and as multiple GPUs in a second mode. By dividing the GPU into multiple GPU chiplets, the processing system flexibly and cost-effectively configures an amount of active GPU physical resources based on an operating mode. In addition, a configurable number of GPU chiplets are assembled into a single GPU, such that multiple different GPUs having different numbers of GPU chiplets can be assembled using a small number of tape-outs and a multiple-die GPU can be constructed out of GPU chiplets that implement varying generations of technology.”

CONFIGURABLE MULTIPLE-DIE GRAPHICS PROCESSING UNIT - ADVANCED MICRO DEVICES, INC. (freepatentsonline.com)

It will be interesting to see what they can bring to market.

The MI300 maybe?

Mike

More than just the reticle size problem – the idea of having chiplets that can flip between being independent GPUs and pieces of one large GPU based on workload is pretty interesting, to say the least. Or am I reading too much into this, believing that this can be done dynamically at runtime?

That isn’t where the problem lies. The issue is memory bandwidth, or more to the point, how to manage caching so that you don’t throw away more performance than you gain. Let me put up a strawman, then show how you can improve it. You have two chips, both of which read the same data from memory. For balance, the chips alternate priority in requesting cache lines. If a chip skips its turn, the other chip can put in a request. The timing on this is easy. Reading a cache line will probably take, say, two memory bus clock cycles, which will be dozens of chip clocks. Two clocks is optimistic, with a 256- or 512-bit-wide memory bus (from graphics RAM). Remember that the data coming back will take multiple clocks to arrive, since you can’t stall the bus waiting for it. It will come back when it is ready: quickly if from the CPU caches, slower if from main memory. So you need the incoming data to be tagged with its address.
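
To make the strawman concrete, here is a minimal Python sketch of the alternating-priority idea. The class and method names are invented for illustration, and each call to `grant()` stands for one request slot rather than real bus timing. The granted address doubles as the tag that would let returning data be matched to its request.

```python
from collections import deque


class AlternatingArbiter:
    """Toy model: two chips share one request stream, taking turns on priority."""

    def __init__(self):
        self.queues = (deque(), deque())  # pending cache-line addresses per chip
        self.priority = 0                 # chip with first refusal in the next slot

    def request(self, chip, address):
        """Queue a cache-line request from chip 0 or chip 1."""
        self.queues[chip].append(address)

    def grant(self):
        """Return (chip, address) for this slot, or None if both chips are idle."""
        first, second = self.priority, 1 - self.priority
        self.priority = second  # priority alternates every slot, used or not
        for chip in (first, second):
            if self.queues[chip]:
                return chip, self.queues[chip].popleft()
        return None


arb = AlternatingArbiter()
for addr in (0x100, 0x140, 0x180):
    arb.request(0, addr)
arb.request(1, 0x200)
grants = [arb.grant() for _ in range(5)]
# Chips are granted in the order 0, 1, 0, 0, followed by one idle slot:
# chip 1 has nothing queued on its second turn, so chip 0 takes that slot.
```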

Anyway, we now have a merged stream of requests out, and a merged stream of data in. Is it better to have data go only to the requesting chip, or to have both chips accept the data? Probably tag the data with a thread or process ID, and if both chips are running the same process, the data will go to both. Think of a game running alongside an operating system display, or a desktop. The game will run on both chips, the other processes (that have screens) on just one chip.
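
The routing rule can be sketched in a few lines of Python. This is only an illustration under assumed names: `route_line`, the process IDs, and the per-chip sets are hypothetical, and a real design would do this in hardware on the tag bits.

```python
def route_line(line_pid, chip_pids):
    """Return the chips that should accept a returned cache line.

    line_pid  -- process ID the returned data is tagged with
    chip_pids -- one set per chip, holding the process IDs that chip is running
    """
    return [chip for chip, pids in enumerate(chip_pids) if line_pid in pids]


GAME, DESKTOP = 1, 2                  # hypothetical process IDs
chip_pids = [{GAME, DESKTOP}, {GAME}]  # game on both chips, desktop on chip 0 only

game_targets = route_line(GAME, chip_pids)        # both chips take the line
desktop_targets = route_line(DESKTOP, chip_pids)  # only chip 0 takes it
```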

If all the data is read-only, we are done. A lot of game data will be read-only, like textures. But we need to be able to write, sometimes to main memory, but most importantly to one or more display buffers. (Yes, triple buffering is a thing, and if your refresh rate is high enough, very nice, no rips or tears, etc.) A win here is to make display data write-only. You can put the screen buffer on one chip, but that means a lot of data streaming from one chip to the other. Or you can get fancy and have each chip responsible for half the screen. Not always a win, but splitting the screen into tiles is something that nVidia pioneered recently. If you do that, you can have the screen buffers in main memory. (Or in what the chips think of as main memory; the data will go to the CPU and get stored wherever the memory management puts it.)
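
One way to picture the tile split is a checkerboard, where neighbouring tiles belong to different chips. The Python sketch below is an assumption for illustration only: the 32-pixel tile size, the checkerboard rule, and the function names are not taken from any vendor's scheme.

```python
def tile_owner(x, y, tile_size=32, num_chips=2):
    """Return the chip that owns the tile containing pixel (x, y)."""
    tx, ty = x // tile_size, y // tile_size
    return (tx + ty) % num_chips  # checkerboard: adjacent tiles alternate chips


def tiles_per_chip(width, height, tile_size=32, num_chips=2):
    """Count how many tiles each chip owns, including partial edge tiles."""
    cols = -(-width // tile_size)   # ceiling division
    rows = -(-height // tile_size)
    counts = [0] * num_chips
    for ty in range(rows):
        for tx in range(cols):
            counts[(tx + ty) % num_chips] += 1
    return counts


split = tiles_per_chip(1920, 1080)  # an even split between the two chips
```

Each chip writes only its own tiles, so no display data has to stream between the chips. Whether the load is actually balanced depends on what is drawn in each tile, which this sketch ignores.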

I’m not going to add up what requires extra silicon and what requires I/O pins/connections. The extra silicon required isn’t too bad. It is those nasty I/Os. You will need a hundred or more pins just for the connection between the two chips. It should be possible to put in a switch to turn the extra stuff off, but it is probably easier to use wired logic (instead of transistors) to turn the extra bits off. That may take a few pins if you want to do it off chip, or some laser cuts to make a single-chip variant. Laser cuts are probably the way to go, as other laser cuts will be used to scavenge defective chips as lower-performance versions.

Good to see you here again.