What seems clear is that we are going to have chiplet components within a socket or across sockets with some kind of interconnect tying it all together. With AMD and Xilinx, it will be Infinity Fabric. Many, many generations of it, and maybe supporting the CCIX or CXL protocol on top of it, which should be possible if Infinity Fabric is indeed a superset of PCI-Express with AMD HyperTransport features woven into it. Don’t get hung up on that. There are good latency reasons for wanting to package up many things into a hybrid compute engine and make a big socket. But maybe the best answer, in the post-Moore’s Law era, is to stop wasting so much silicon on functions that are not fully used.
So, what we would like to see AMD do is this. Create a high performance Zen4 core with all of the vector engine guts ripped out of it, and put either more cores on the die or fatter, faster cores on the die. We opt for the latter because on this CPU, we want screaming serial performance. We want HBM3 memory on this thing, and we want at least 256 GB of capacity, which should be possible. And a ton of Infinity Fabric links coming off the single socket. Cap it at 500 watts, we don’t care.

Now, right next to that on the left of the system board we want a killer “Aldebaran” Instinct GPU, and half of an MI200 might be enough – the Instinct MI200 has two logical GPUs in a single package – or a full MI300, due next year with four Aldebaran engines, might be needed. It will depend on the customer. Put lots of HBM3 memory around the GPU, too. To the right of the CPU, we want a Versal FPGA hybrid with even more Infinity Fabric links coming off of it, the Arm cores ripped out, the DSP engines and AI engines left in, and all of the hard-block interconnect stuff also left in. This is an integrated programmable logic engine that can function like a DPU when needed. Infinity Fabric lanes can come off here to create a cluster, or directly off the GPUs and CPUs, but we like the idea of implementing an Infinity Fabric switch right at the DPU.
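To make the three stripped-down building blocks concrete, here is a minimal spec-sheet sketch of what we are describing. All names and numbers here are illustrative assumptions drawn from the text (256 GB of HBM3, a 500 watt cap, two logical GPUs per MI200-class package), not AMD roadmap data, and the link counts are pure guesses.

```python
# Hypothetical spec sheet for the three compute blocks described above.
# Every figure is an assumption for illustration, not a real product spec.
BLOCKS = {
    "cpu": {  # fat, fast serial cores, vector engines ripped out
        "memory": "HBM3",
        "capacity_gb": 256,   # "at least 256 GB" from the text
        "power_cap_w": 500,   # "cap it at 500 watts"
        "if_links": 8,        # assumed Infinity Fabric link count
    },
    "gpu": {  # Aldebaran-class Instinct engine
        "memory": "HBM3",
        "logical_gpus": 2,    # MI200-style: two logical GPUs per package
        "if_links": 8,        # assumed
    },
    "dpu": {  # Versal-derived: Arm cores out, DSP/AI engines and hard blocks in
        "hosts_if_switch": True,  # the Infinity Fabric switch lives here
        "if_links": 16,           # assumed: extra links for the switch
    },
}

# Quick sanity check on the shape of the catalog
for name, spec in BLOCKS.items():
    print(name, spec["if_links"])
```

The point of modeling the blocks as data rather than silicon is the same as the argument in the text: each engine is a lean, single-purpose part, and the system is composed from ratios of them rather than from one monolithic socket.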
Now, take these compute engine blocks and allow customers to configure the ratios they need on system boards, within a rack, and across rows. Maybe one customer needs four GPUs for every CPU and two DPUs for every complex with a single Infinity Fabric switch. In another scenario, maybe the GPUs are closer to the DPUs for latency reasons (think a modern supercomputer) and the CPUs hang off to the side of the GPUs. Or maybe CPUs and GPUs all spoke out from the DPU hub. Or maybe the CPUs are in a ring topology and the GPUs are in a fat tree within the rack. Make it all Infinity Fabric and make the topology changeable across Infinity Fabric switches. (Different workloads need different topologies.) Each component is highly tuned, stripped down, with no fat at all on it, with the hardware absolutely co-designed with the software. Create Infinity Fabric storage links out to persistent memory, pick your technology, and run CXL on top of it to make it easy.
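The configurable-ratio idea above can be sketched as a small model: a compute complex is just a tuple of device counts plus a topology label, and the fabric switch's port budget falls out of the arithmetic. The `FabricComplex` type, the topology names, and the eight-links-per-device figure are all hypothetical, invented here for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FabricComplex:
    """One customer-configurable compute complex hung off an
    Infinity Fabric switch. Hypothetical model, not a real API."""
    cpus: int
    gpus: int
    dpus: int
    topology: str  # e.g. "switch-hub", "ring", "fat-tree" (labels assumed)

def switch_ports_needed(c: FabricComplex, links_per_device: int = 8) -> int:
    """Illustrative port budgeting: assume every device burns a fixed
    number of Infinity Fabric links into the switch (the 8 is a guess)."""
    return (c.cpus + c.gpus + c.dpus) * links_per_device

# The "four GPUs for every CPU and two DPUs per complex" example from the text:
hpc = FabricComplex(cpus=1, gpus=4, dpus=2, topology="switch-hub")
print(switch_ports_needed(hpc))  # 7 devices * 8 links = 56 switch ports
```

Because the topology is just a parameter of the complex rather than a property baked into the parts, the same inventory of CPUs, GPUs, and DPUs can be rewired per workload, which is exactly the flexibility the Infinity Fabric switch is meant to buy.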
There is no InfiniBand or Ethernet in this future AMD system except on head nodes into the cluster, which are just Epyc CPU-only servers.
If we were AMD, that’s what we would do.