Chinese Arm takes integer performance crown?

https://www.tomshardware.com/news/china-designed-128-core-cp…

When Alibaba’s T-Head subsidiary introduced its 128-core Yitian 710 processor, built from 60 billion transistors and produced on TSMC’s N5 node, it made quite a splash in the world of CPUs. This week the company published official performance results for the chip at SPEC.org, the industry benchmark organization, revealing that the chip is actually the world’s fastest processor in SPEC CPU2017 integer workloads, as noticed by ServeTheHome.

Alibaba’s T-Head Yitian 710 datacenter system-on-chip integrates 128 Armv9 cores operating at up to 3.20 GHz, with 1MB of L2 cache per core and 128MB of L3 cache per chip. The SoC packs eight DDR5-4800 memory channels that can provide up to 307.2 GBps of bandwidth, as well as 96 PCIe 5.0 lanes to attach high-performance solid-state storage, network cards, and other devices. The chip is used exclusively by Alibaba Cloud, which developed its proprietary Panjiu servers specifically for the Yitian 710 SoC. Panjiu can be used both for general-purpose and accelerated AI workloads, but to test the CPU in the SPEC CPU2017 benchmark, Panjiu was used purely as a number-crunching machine.

The tested Alibaba Cloud Panjiu server was based on a 128-core Yitian 710 operating at 2.75 GHz and mated with 512GB of DDR5-4800 (using eight 64GB modules). The machine ran Anolis OS release 8.6 installed on a 240GB SATA SSD.

The machine’s baseline SPEC CPU2017 integer rate reached 510 (3.984 per core); no peak score was submitted. Even so, 510 is about 16% higher than the 440 baseline result (6.875 per core) scored by AMD’s 64-core EPYC 7773X processor. The highest baseline result achieved by Intel’s 36-core Xeon Platinum 8351N processor is 266 (7.39 per core), whereas the best rate achieved by an Ampere Altra 80-core machine is 301 (3.7625 per core).
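
For reference, the per-core figures are just the baseline rate divided by the core count; a quick check (hypothetical script, using only the numbers quoted above):

```python
# Per-core rate = SPEC CPU2017 integer rate (baseline) / core count.
results = {
    "Alibaba Yitian 710 (128 cores)": (510, 128),
    "AMD EPYC 7773X (64 cores)": (440, 64),
    "Intel Xeon Platinum 8351N (36 cores)": (266, 36),
    "Ampere Altra (80 cores)": (301, 80),
}
for name, (baseline, cores) in results.items():
    print(f"{name}: {baseline / cores:.4f} per core")
```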

While Yitian’s and Ampere’s per-core results look much less impressive than AMD’s and Intel’s, the massive number of cores and the overall integer throughput speak for themselves. If one needs an extreme integer rate, then the 128-core monster from Alibaba’s T-Head looks to be the processor of choice.

Now, while the integer rate of the Yitian 710 in SPEC CPU2017 is nothing short of spectacular, for some reason Alibaba Cloud did not submit any floating point results for its platform. Perhaps the floating point unit of the processor is not as impressive as its integer unit, or maybe the software and/or CPU microcode still need polishing. In any case, at present CPU2017 floating point rates are dominated by AMD’s EPYC 7773X-based machines.


Those single core scores are pretty pathetic for an N5 product, but if you put enough cores on one die you can get pretty good performance for a highly parallel application.
Alan


Those single core scores are pretty pathetic for an N5 product, but if you put enough cores on one die you can get pretty good performance for a highly parallel application.

But you always have to deal with Amdahl’s Law. The longest (sequential) single thread puts a floor under the running time, no matter how many (parallel) CPU cores or GPU shaders you can throw at everything else. On supercomputers, distributing the data, starting the threads, and collecting the results tends to be that long sequential thread. This is why many HPC programs use only a part of the system. Using more threads/nodes can make overall execution times slower.
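
For anyone who wants to plug numbers in, Amdahl’s Law boils down to speedup = 1 / ((1 - p) + p / n), where p is the fraction of work that can run in parallel and n is the number of workers. A minimal sketch:

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Overall speedup when a fraction of the work is spread across n workers.

    The (1 - parallel_fraction) term is the sequential floor that no number
    of extra CPU cores or GPU shaders can remove.
    """
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

# Even with 95% of the work parallelised, 128 cores only buy about a 17x speedup...
print(amdahl_speedup(0.95, 128))   # ~17.4
# ...and no number of workers can ever beat 1 / (1 - 0.95) = 20x.
```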

It also means that faster CPU cores can improve overall performance in these systems–the GPUs are used to run the applications, and the CPUs to manage the data. With good application design, you can start the GPUs up well before all the data needed by a node is present. I tend to use eight chunks of work on each GPU. Why 8? I can use an unsigned subtract on the various numbers instead of integer or floating-point divides. :wink: I learned that trick when many computers didn’t have divide instructions or used a micro-code routine.
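
A sketch of the general power-of-two idea (the unsigned-subtract trick above is a variation on the same theme; the names and numbers here are just for illustration):

```python
N_CHUNKS = 8                          # power of two, so log2 is exact
SHIFT = N_CHUNKS.bit_length() - 1     # 3

def chunk_bounds(total_items):
    """Start/end index of each of the 8 chunks, using a shift instead of a divide."""
    per_chunk = (total_items + N_CHUNKS - 1) >> SHIFT   # ceil(total / 8) via shift
    return [(c * per_chunk, min((c + 1) * per_chunk, total_items))
            for c in range(N_CHUNKS)]

print(chunk_bounds(1000))   # [(0, 125), (125, 250), ..., (875, 1000)]
```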

I think Amdahl’s law is most appropriate for a single process: if a routine that takes, say, 50% of the time is improved by a factor of, say, 2, the overall running time only drops by 25%. The law doesn’t work as well for a PC network where the longest sequential thread can be distributed to all available threads.
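
Spelling that arithmetic out:

```python
original = 1.0                    # total running time
improved = 0.5 + 0.5 / 2          # unchanged half + the half that now runs 2x faster
print(1 - improved / original)    # 0.25 -> running time drops by 25%
print(original / improved)        # 1.33... -> a 1.33x overall speedup
```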

We have been upgrading the networks here to run multi-gigabit Ethernet from a control machine to the server and from the server to the switches, and have also added a lot more machines. We are up to 2000+ threads at one time now (when they are all working). The control computer doesn’t have to do much except provide a filename to be processed and then add around 50 numbers to running totals when the results for that filename come back. I suspect the model would still work well with much higher thread counts. The control machine and the server through which all requests and results are passed are effectively idle.
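
A minimal sketch of that dispatch pattern, with hypothetical names (process_file, run_iteration) and an in-process pool standing in for the LAN of worker machines:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_file(filename):
    """Stand-in for the per-file work done on a worker; assumed to return
    a fixed-length list of numbers (around 50 in the setup described here)."""
    return [0.0] * 50   # placeholder result

def run_iteration(filenames, n_workers):
    totals = [0.0] * 50
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(process_file, name) for name in filenames]
        # The controller's only real job: fold each result into the running totals.
        for future in as_completed(futures):
            for i, value in enumerate(future.result()):
                totals[i] += value
    return totals
```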

Instead we have a different problem: at its largest, the dataset to be processed is 180,000 files. We need the results from all of the files back before we can move on to the next iteration. This means that for the last (2000-1) files the PC network is not fully utilised. A particularly difficult file can hold up the next iteration for minutes, even though we order the sequence of files by speed, slowest (in normal use) first. The problem there is that each iteration is testing a non-normal use case.

Idle times on the largest dataset are not too bad, perhaps 5%, but they can be particularly bad on small subsets. An iteration’s time to complete is limited by the processing time of the slowest file.
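
That limit can be written as a simple lower bound on an iteration’s wall time (illustrative numbers, apart from the 2000-worker count above):

```python
def iteration_lower_bound(file_secs, n_workers):
    """Wall time is at least the total work spread evenly across all workers,
    or the single slowest file, whichever is larger."""
    return max(sum(file_secs) / n_workers, max(file_secs))

# Small subset: 3,000 files averaging 5 s on 2,000 workers, with one 300 s outlier.
times = [5.0] * 2_999 + [300.0]
print(iteration_lower_bound(times, 2000))   # 300.0 -> the slowest file dominates
```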

We have been upgrading the networks here to run multi-gigabit Ethernet from a control machine to the server and from the server to the switches, and have also added a lot more machines.

Just curious: is moving to AWS or similar infeasible, or not cost-effective?

The last time I looked (around 10+ years ago), it was Azure, or whatever the Microsoft offering was called then, that I was being pushed towards; I am not up to date with current pricing. The prices did seem very high compared to what I was then using. They charged on CPU usage, which might be fine (cheaper) for occasional light usage such as an office, but not for flat-out CPU work running hundreds of CPUs on all cores/HW threads at around 4 GHz.
There were some technical issues that I wasn’t confident they could fix, such as handling large volumes of data. On my LAN, processing files from each PC’s local disk takes around 130 secs; from the server it took 25418 secs. So even if something is going wrong/not optimal with the server test, I would need the data to be held locally. Not sure the vendors would do that.

Prices might be different now but even if they were now attractive I would want to utilise my existing kit until it was no longer cost effective.

Prices might be different now but even if they were now attractive I would want to utilise my existing kit until it was no longer cost effective.

I suggest you have another look. And specifically at AWS, and the ARM-based Graviton instances. Crazy fast interconnects, crazy cheap compute etc…

Your 100% utilization workload and existing kit may still keep you on prem for a while, but the hyperscalers are really running off with the entire on-prem datacenter business at this point. Global 2000 is moving…

The law doesn’t work as well for a PC network where the longest sequential thread can be distributed to all available threads.

Then it is no longer the highest pole in the tent. Not a complaint; I used to do a lot of lowering tent poles, though at the time I was most concerned with reducing system loads on our local Multics server (BCO). Many times, cutting down a particular thread ended up reducing the total CPU time in addition to finishing faster.

I think we are about to see AI workloads dealt with by adding hardware instructions. A lot of x86/x64 SIMD instructions have been added to deal with matrix multiplication, and neural net instructions are now starting to show up. The alternative is to use FPGAs, but since neural net processing tends to be a client application, I think both AMD and Intel see it being added to desktop, laptop, and smartphone chips.

Sorry, but I made a mistake with a decimal point (it was late here when replying and I only have one good eye for reading): processing time on server data took 2542 secs, so around 20x worse than running from a local disk. Even so, that’s very bad. I have seen it do better than that, at 2x worse. But even at 2x worse, one would be paying good money just to move data from the server to the PC.

I suggest you have another look. And specifically at AWS, and the ARM-based Graviton instances. Crazy fast interconnects, crazy cheap compute etc…

Your 100% utilization workload and existing kit may still keep you on prem for a while, but the hyperscalers are really running off with the entire on-prem datacenter business at this point. Global 2000 is moving…

I would be interested to know about pricing. It’s difficult to imagine how any company that makes a decent margin could offer something cheaper and as good. I don’t need regular backups, security or even air conditioning. The service machines only have exactly what’s needed and sufficient for the job.

I think we are about to see AI workloads dealt with by adding hardware instructions. A lot of x86/x64 SIMD instructions have been added to deal with matrix multiplication, and neural net instructions are now starting to show up. The alternative is to use FPGAs, but since neural net processing tends to be a client application, I think both AMD and Intel see it being added to desktop, laptop, and smartphone chips.

I welcome these developments. Adapting the code to use them is the difficulty, unless the compiler is good at finding the situations where they can be used effectively. I imagine a compiler is included for FPGA work.


I would be interested to know about pricing. It’s difficult to imagine how any company that makes a decent margin could offer something cheaper and as good. I don’t need regular backups, security or even air conditioning. The service machines only have exactly what’s needed and sufficient for the job.

When you’re AWS you can subsidize some workloads using the profit on others to achieve, e.g., adoption goals. I’m assuming that the Graviton instances are subsidized to some extent; although performance-wise they turned out to be “just fine” for the workloads I’ve been involved with, the overall cost savings vs. x86 for a past employer was >40%. (Fairly heavily loaded clusters running Presto, or now Trino, big-data SQL engines, on terabytes or petabytes of data.)