What the devil is this thing? 66 threads per core, for some definition of “thread” and “core?”
Looking through it, there appears to be DARPA funding for “hyper-sparse data” sources. If the CPU spends long periods waiting for data, it would make sense to greatly increase the thread count per core. Certainly just a science project, though.
Well, there are definitely real use cases for this, though. Graph analytics at scale is… increasingly important. And it’s really never a good fit for most HW you might run it on. The market for such tech would be small but deep-pocketed.
DARPA certainly fits your definition of a customer. With all the new technologies on this chip, though, it seems more like a testbed for trying them all out. Perhaps it does become a product for the government at some point…
I guess I look at something like this from a different viewpoint. Between compiler and programming language work, and work on small fast machine code sequences, my first question is, “How do I use this thing efficiently?” Second, of course, is “What is this chip good for?”
What’s the problem? Intel is trying to solve the same problem that VLIW (very long instruction word) machines addressed, and it seems to me that even if you generate code that uses the chip efficiently, you run into the same problems. Huh? Let’s say I try to divide processing of a large array across all those threads. If locality of reference can keep the data in L2 cache, great. But this is where the large number of threads kills you: no matter how you slice the data, you have too much of it trying to crowd into the caches. Thrashing is a very real risk. So the first “solution” is to idle some of those threads, to ensure that data, once loaded into cache, will still be there when it is needed.
None of this is rocket science, and unlike rockets, testing your inner-loop code is quick and cheap. The trick is knowing how to figure out what is going wrong without the diagnostics changing the inner-loop code you are working on. Modern processors have excellent support for performance tracing, but what you really want is a dump of the caches every few hundred cycles. That’s not available, and I have found that minimal diagnostics combined with lots of thinking is much more effective than wearing your eyes out trying to make sense of a stack of trace printouts. Also, I have problems dealing with more data than I can fit on part of a display. Will Intel’s compiler people generate some incredibly fast code for this? Sure: for things like FFTs and some simulations, it should be possible to get all those threads playing nice with each other and with the cache. But working with graphs with millions of nodes? Not quite the right product.
Just at a guess: a second generation version of this could perhaps package a large chunk of memory with it, like the Instinct MI300 does. Tens or hundreds of gigabytes would unblock things in a way that wasn’t feasible previously.
I’m trusting that DARPA is working with Intel on fitting the product to the use case. (Will there be commercial derivatives of this? I have no idea. But the demand is there.)
“Just at a guess: a second-generation version of this could perhaps package a large chunk of memory with it, like the Instinct MI300 does. Tens or hundreds of gigabytes would unblock things in a way that wasn’t feasible previously.”
It is not the amount of memory, or even the latency. You can work around the latency by using instructions that prefetch the data you need.* The problem is that shared caches are just that: shared. If almost all the needed data fits into per-thread caches, fine. But what happens when all 66 threads want data from the same cache? A performance disaster, unless you have a way to send the same cache line to all 66 threads at the same time, and code which can take advantage of this feature. Even if you need to dedicate one thread to managing this cache magic, you can now replace 65 sequential cache accesses with one. I’d have to spend some time on it, but I think this feature would allow efficient code for the DARPA problem.
* I used to say that my favorite x86 instruction was PREFETCHNTA, but I may have switched my favorite to POPCNT, which counts the number of bits set in its argument. At least when doing graph theory. Counting the number of bits set in an arbitrary-length string used to drive me crazy. Not that I couldn’t do it, but given four different approaches, none was best; it depended on how long the bit-string was, and where it was found. POPCNT has thrown out almost all of that. There are still some cases where a table lookup wins: each byte indexes into an array of bit counts (0,1,1,2,1,2,2,3,1,…). There is a (distributed) cost to get the array into cache, but the (CPU clock-cycle) time is often less than the cost of loading the bit-vector from main memory. Which approach is faster depends on a lot of factors, but on huge vectors the two often come out identical, even when using PREFETCHNTA to keep from polluting the cache.