https://www.eetimes.com/amd-and-untether-take-on-nvidia-in-mlperf-benchmarks/

For the first time, the latest round of MLPerf inference benchmarks includes results for AMD’s flagship MI300X GPU. The challenger posted results comparable to market leader Nvidia’s current-generation H100/H200 hardware; Nvidia still won overall, but only by a tight margin.

Nvidia was also challenged by startup Untether, which posted its first MLPerf benchmarks: its SpeedAI accelerator beat various Nvidia chips on power efficiency for ResNet-50 workloads. Google also submitted results for Trillium, its sixth-generation TPU, and Intel showcased its Granite Rapids CPU for the first time.

AMD MI300X

AMD submitted its first results for its Nvidia-challenging data center GPU, the MI300X, showing its performance in single- and 8-chip systems for Llama2-70B inference. A single MI300X achieves 2,520.27 tokens/s in server mode or 3,062.72 tokens/s in offline mode, while 8x MI300Xs manage 21,028.20 tokens/s in server mode and 23,514.80 tokens/s in offline mode. The figures show fairly linear scalability between system sizes. (As a reminder, the offline scenario allows batching to maximize throughput, while the more difficult server scenario simulates real-time queries with latency limits to meet.)
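As a quick sanity check on that scaling claim, the back-of-the-envelope Python below compares the 8-chip figures against eight times the single-chip figures (the numbers are taken from the results quoted above):

```python
# Rough scaling check using the MI300X Llama2-70B throughput figures quoted above.
single = {"server": 2520.27, "offline": 3062.72}    # tokens/s, 1x MI300X
eight  = {"server": 21028.20, "offline": 23514.80}  # tokens/s, 8x MI300X

for scenario in ("server", "offline"):
    ideal = 8 * single[scenario]                    # perfect linear scaling
    efficiency = eight[scenario] / ideal
    print(f"{scenario}: {eight[scenario]:.0f} tok/s vs {ideal:.0f} ideal "
          f"-> {efficiency:.1%} scaling efficiency")

# server: 21028 tok/s vs 20162 ideal -> 104.3% scaling efficiency
# offline: 23515 tok/s vs 24502 ideal -> 96.0% scaling efficiency
```

The server-mode figure landing above 100% most likely reflects the single-GPU result being held back by the scenario’s latency limits rather than genuinely super-linear scaling.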

These results are very similar (within 3-4%) to Nvidia’s results for the H100-80GB on the same workload in 8-chip systems. Compared to Nvidia’s H200-141GB, which is effectively the H100 with more and faster memory, AMD is more like 30-40% behind.

AMD has positioned its 12-chiplet MI300X GPU directly against Nvidia’s H100, and the part is widely seen as one of the most promising commercial offerings to challenge team green’s hold on the market. The MI300X has more HBM capacity and bandwidth than Nvidia’s H100 and H200 (192 GB at 5.2 TB/s versus the H200’s 141 GB at 4.8 TB/s), which should be evident in the results for inference of large language models (LLMs). AMD said 192 GB is large enough to hold the whole Llama2-70B model plus the KV cache (an intermediate result) on one chip, avoiding any networking overhead from splitting models across multiple GPUs. The MI300X also has slightly more FLOPS than the H100/H200. Parity with the H100 and lagging the H200 may leave AMD fans a little disappointed, but these initial scores will no doubt improve in the next round with further software optimizations.
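To see why the capacity argument matters, the rough estimate below works out a Llama2-70B serving footprint. The layer count, grouped-query-attention head count and head dimension are Llama2-70B’s published architecture, but the FP8 precisions, batch size and sequence length are illustrative assumptions, not details of AMD’s submission.

```python
# Back-of-the-envelope memory estimate for Llama2-70B inference on one GPU.
# Assumptions (not from AMD's submission): FP8 weights and FP8 KV cache.
GiB = 1024**3

params          = 70e9        # Llama2-70B parameter count
bytes_per_param = 1           # FP8 weights (assumed)
weights_gib     = params * bytes_per_param / GiB

# Llama2-70B architecture: 80 layers, 8 KV heads (GQA), head dimension 128.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_kv    = 1           # FP8 KV cache (assumed)
kv_per_token    = 2 * layers * kv_heads * head_dim * bytes_per_kv  # K and V

# Illustrative serving load: 128 concurrent sequences of 4096 tokens each.
tokens_in_flight = 128 * 4096
kv_cache_gib     = tokens_in_flight * kv_per_token / GiB

print(f"weights  ~{weights_gib:.0f} GiB")                      # ~65 GiB
print(f"KV cache ~{kv_cache_gib:.0f} GiB")                     # ~80 GiB
print(f"total    ~{weights_gib + kv_cache_gib:.0f} GiB "       # ~145 GiB
      f"vs 192 GB (~179 GiB) of HBM")
```

Under these assumptions the weights and a generously sized KV cache fit comfortably on a single MI300X, which is exactly the point AMD is making about avoiding multi-GPU model splits.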

Software-wise, AMD said it made extensive use of its Composable Kernel (CK) library to write performance-critical kernels for things like prefill attention, FP8 decode paged attention and various fused kernels. It also improved its scheduler for faster decode scheduling and better prefill batching.
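For context, “paged” attention refers to storing the KV cache in fixed-size blocks rather than one large contiguous buffer per sequence, so memory is allocated on demand as sequences grow. The sketch below illustrates that bookkeeping generically; it is not AMD’s CK implementation, and the class name and block size are made up for illustration.

```python
# Minimal illustration of paged KV-cache bookkeeping (generic, not AMD's CK code).
# The KV cache is carved into fixed-size blocks; each sequence keeps a block table
# mapping its logical token positions to physical blocks allocated on demand.
BLOCK_TOKENS = 16

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # physical block IDs
        self.block_tables = {}                      # seq_id -> list of block IDs
        self.lengths = {}                           # seq_id -> tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve space for one new token; return (physical block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_TOKENS == 0:              # current block full (or none yet)
            table.append(self.free_blocks.pop())    # grab a fresh block
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_TOKENS

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(20):                                 # a 20-token sequence spans 2 blocks
    block, offset = cache.append_token(seq_id=0)
print(cache.block_tables[0], cache.lengths[0])      # -> [1023, 1022] 20
```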

AMD also previewed its next-gen Epyc Turin CPUs in combination with the MI300X; the improvement was fairly marginal at 4.7% in server mode or 2.5% in offline mode versus the same system with a Genoa CPU, but it was enough to nudge the Turin-based system slightly ahead of the DGX-H100. AMD’s Turin CPUs are not on the market yet.
