Nvidia Competition Update

https://www.investors.com/news/technology/nvidia-seen-fendin…

An IBD article on Nvidia. Sigh, it is now ranked #8 in IBD. More’s the shame. The article names some ASICs that want to compete for Nvidia’s customers in certain aspects of AI, primarily deep learning (which is the largest segment at present, to my understanding). The article goes on to say, basically, from this analyst, that no other option is likely to unseat NVDA given its whole product, product road map, software engineers and lead, and CUDA.

That the AI market is a multi-decade product cycle.

What the analyst does not mention, or at least what does not make it into the article, is that the GPU is always a moving target. So the competing technology not only has to be better than what Nvidia offers (otherwise, why leave the industry standard), it has to stay materially better to be worth the trouble of adopting a new standard.

Something we have talked about many times: the place to look for competition is a function or market that Nvidia is underserving, that a new ASIC can serve better, and then build out from there.

To date the only such chip is Google’s tensor flow. And Google is providing it as a service in its cloud, and not selling the chips. Google seems to be doing a good job of creating the infrastructure around it as well.

However, as we discussed earlier (probably on NPI, as it went technically in-depth; I was in one of those moods), Google was playing with its performance figures and was not comparing apples to apples with Nvidia. Google’s processor was at least 30% more expensive per unit of processing (whichever unit they used).

Thus Nvidia’s latest top line AI GPU is superior to the latest Google GPU, at least for most things. It gets quite technical as to when a tensor unit becomes superior (and Nvidia includes tensor cores on its latest chip), and it is not for all applications.

Anyways, good discussion and update on Nvidia.

Tinker

14 Likes

Tinker…just wanted to correct some terminology so people don’t get (too) confused

To date the only such chip is Google’s tensor flow.

Google’s hardware for AI is called a TPU (Tensor Processing Unit).
TensorFlow is the name of Google’s open source software product, an AI “framework” that is dominating the market for AI development. Of course it is free, like all the other frameworks. There are a dozen other popular frameworks, but TF is dominant at this point. And, of course, it works on every hardware platform, including Nvidia GPUs. It is mostly for research and product development – i.e. the “training” of AI.

Thus Nvidia’s latest top line AI GPU is superior to the latest Google GPU, at least for most things.

Google doesn’t have a GPU.

But be aware that Nvidia’s GPU with tensor cores (V100) and Google’s TPU are apples and oranges. Nvidia’s tensor cores do 16-bit floating point (with 32-bit accumulation) whereas Google’s TPU does 8-bit math. So the TPU is really only good for inference and not training. Different products, different use cases, different markets. You need more precision when training than during the inference after training is complete.
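If you want to see why the accumulation width matters, here is a toy sketch in plain NumPy (my own illustration, nothing to do with how either chip is actually wired):

import numpy as np

# Toy sketch: sum 10,000 copies of 0.001 -- the true answer is about 10.0.
x = np.full(10_000, 1e-3, dtype=np.float16)   # 16-bit inputs in both cases

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)               # running total kept in 16 bits
    acc32 = np.float32(acc32 + np.float32(v))   # 16-bit inputs, 32-bit accumulation

print(acc16)   # stalls around 4.0 -- float16 can no longer "see" the small additions
print(acc32)   # ~10.0 -- which is why training hardware accumulates in 32 bits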

Mike

20 Likes

mschmit…Some comments from an old retired parallel processing neural network guy.

Google doesn’t have a GPU

True, they don’t have a Graphics Processing Unit as in Nvidia’s systems, which have evolved from video game image generation into more general purpose, higher precision devices.

But Google’s second generation TPUv2 chip is fully capable of training. Internally it has 16 GB of HBM, 600 GB/s of memory bandwidth, and scalar/vector units with 32-bit float accumulation (but reduced precision for the multipliers), giving 45 TFLOPS of capability. They are not designed to go in a gaming PC but rather to be connected together into larger configurations in data centers.
http://learningsys.org/nips17/assets/slides/dean-nips17.pdf

The vast majority of AI, and especially deep learning applications, don’t require more than 8 bits. In fact, one question being asked by Google is: “Will very-low precision training (1-4 bit weights, 1-4 bit activations) work in general across all problems we care about?” They are achieving very high throughput and solving many diverse problems with their approach.
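For the curious, this is roughly what squeezing weights down to a few bits means (a generic NumPy toy of symmetric quantization, my own sketch rather than anything Google has published):

import numpy as np

def quantize(w, bits):
    # Symmetric uniform quantization: map float weights onto 2**bits - 1 levels.
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = np.abs(w).max() / levels
    q = np.round(w / scale)               # integers in [-levels, levels]
    return q * scale                      # de-quantized back to float for comparison

w = np.random.randn(1000).astype(np.float32)   # pretend these are trained weights
for bits in (8, 4, 2):
    err = np.abs(w - quantize(w, bits)).mean()
    print(f"{bits}-bit weights, mean absolute error: {err:.4f}")

The question Google is asking is whether networks can tolerate (or be trained right through) that rounding error.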

Nvidia is by far the leader in AI processors currently, but Google looks like it has the potential to take a bigger portion of the cloud based AI business. Another portion of the future market will be specialized AI chips for individual user applications: face recognition, stand-alone language translators, and things we haven’t even thought of. I’m skeptical that the current Nvidia high precision GPU will be an economical component in car autopilots.

Even with all those nice comments about Google’s TPU, I think Nvidia’s product still has a bright future, and I personally own some of their stock.

RAM

6 Likes

Frankly, Google obfuscated the truth by defining a Cloud TPU as a four-chip board and then comparing that board to an NVIDIA Maxwell GPU accelerator that was two generations old at the time. There was no need for all the fancy footwork. The Cloud TPU chip itself is very fast, at 45 Trillion Operations Per Second (TOPS)—more than twice the performance of NVIDIA’s Pascal GPU Accelerator. At 125 TOPS though, NVIDIA’s Volta GPU is even faster—thanks in part to its TensorCore feature. So, from a raw performance standpoint, chip to chip, NVIDIA Volta V100 is almost 3 times faster (125 vs. 45 TOPS). The big disclaimer is that this applies if (and only if) your model can take advantage of TensorCores, which perform a 4x4 matrix multiply in a single clock cycle.

https://www.forbes.com/sites/moorinsights/2018/02/13/google-…

The article concludes that Nvidia’s Volta is 33% less expensive per unit of performance than is Google’s latest TPU.

Interesting article. Technical, but written in a manner that the average reader can understand.

Point being, Google’s latest generation of ASIC is an alternative, and is not necessarily superior. Depends on the task and context.

Further, the article discusses limitations of tensors. “if (and only if) your model can take advantage of TensorCores, which perform a 4x4 matrix multiply in a single clock cycle.”
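To make that 4x4 point concrete, here is a toy software sketch (plain NumPy, purely illustrative; it is not how the silicon works): a big matrix multiply only benefits from TensorCores if it can be carved into these little tiles.

import numpy as np

def tiled_matmul(A, B, tile=4):
    # Multiply two matrices as a grid of tile x tile blocks, the way tensor-core
    # hardware consumes work. Software toy only: dimensions must divide evenly.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % tile == 0 and k % tile == 0 and m % tile == 0
    C = np.zeros((n, m), dtype=np.float32)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                a = A[i:i+tile, p:p+tile].astype(np.float32)  # fp16 tile, widened
                b = B[p:p+tile, j:j+tile].astype(np.float32)
                C[i:i+tile, j:j+tile] += a @ b                # one "tensor core" style op
    return C

A = np.random.randn(16, 16).astype(np.float16)
B = np.random.randn(16, 16).astype(np.float16)
print(np.allclose(tiled_matmul(A, B), A.astype(np.float32) @ B.astype(np.float32), atol=1e-2))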

Also, again, Google is not producing these chips for sale. No one I know of is producing these chips for sale. Google is selling usage of these chips on its cloud platform. Google, along with Amazon, Microsoft, and every other cloud, offers Nvidia. Nvidia is the one common denominator.

I admit though, as RAM states, ASICs for specific usages may be built to replace what Nvidia offers for many specific functions in the future. Musk is certainly trying this (although he has nothing yet) with Tesla’s autonomous driving roadmap, which is currently powered by Nvidia. In the future…we shall see. It is always a moving target, and as the Google example shows, it is not easy to beat what Nvidia offers. That does not mean it is impossible, and companies will continue to try.

It also illustrates that it takes a company like Google to put together the entire whole product. More than just a faster chip is needed. There is an entire ecosystem around the chip that makes it work and building the ecosystem to scale is not an easy thing to do.

Tinker

5 Likes

<<<But the real disappointment comes from the pricing strategy: why would anyone pay more to get the same job done? Yes, a 4 die Cloud TPU can get the training done faster, but it is >2X slower than a 4 GPU instance on AWS. I can only surmise that the pricing is in part due to the high costs of the extra HBM2 memory chips, widely believed to be over $300 per TPU die, (or $900 more than the 16GB needed for Volta). Keep in mind, though, that Google is getting the ASIC at manufacturing costs, so even the HBM2 delta does not fully explain the higher pricing. Hopefully, these prices will come down after Google irons out whatever wrinkles are limiting the quantities.>>>

Headlines being what they are, this is an ASIC vs. a GPU, and it cannot be said that the TPU is superior to the GPU. Certainly the price of the chip will come down with greater volume, and it looks like the HBM2 memory is part of the high cost, but that said, no one will have higher volume than Nvidia with its GPUs and their incorporated tensor cores, as in the Volta.

Makes one want to invest in something like Square instead, where there is no need to assess the technology like this. Then again, over the last two years, no stock I follow has outperformed Nvidia. In fact the ranking goes Nvidia, SHOP, ANET, and then Square among the stocks I follow. Nutanix not so well, as it is still recovering from its pummeling after the post-IPO rally.

So there is something to be said for following technology details as well.

Tinker

4 Likes

It also illustrates that it takes a company like Google to put together the entire whole product. More than just a faster chip is needed. There is an entire ecosystem around the chip that makes it work and building the ecosystem to scale is not an easy thing to do.

Tinker:

Ecosystem…yes, that is a good word for what AI needs and will continue to need. How can there possibly be “explosive” growth in AI without some ecosystem…such as what CUDA brings, courtesy of NVDA?

I may be misinterpreting the general market sentiment, but it would truly shock me if the world allowed the “evil one”…Google…to control AI. I rather expect each of these huge behemoths (GOOG, FB, AMZN, etc.) to build chips/processes that are unique to the needs of their environments.

From an industry perspective, universities teach CUDA…shall they be destined to teach a GOOG language, AMZN, FB, etc.?

This massive paradigm shift simply needs a neutral party (and NVDA seems to fit that bill) to provide the tools, software, and hardware for any company to compete. NVDA may not have the databases that the GOOGs, FBs, and AMZNs may have, but that is OK…it at least puts most companies on an equal footing in capabilities.

So generically, NVDA just makes sense…but for niche sectors…we should expect GOOG to do what it has done, and no doubt what other large companies may try to do as well.

But again, NVDA has a singular focus…not so for these massive conglomerates.

4 Likes

Duma,

A final snippet from the article that is a truism: “Long story short, advanced silicon is hard to do—even if you are Google!”

Tinker

2 Likes

The vast majority of AI, and especially deep learning applications, don’t require more than 8 bits. In fact, one question being asked by Google is: “Will very-low precision training (1-4 bit weights, 1-4 bit activations) work in general across all problems we care about?” They are achieving very high throughput and solving many diverse problems with their approach.

This is true. And… (way too technical for this board)
In the future, as more ~novices get into deep learning, they will generally use a network model that has been pre-trained (on big hardware) and use transfer learning from that model to their specific use case, doing a light-weight retraining (maybe with 16-bit, 8-bit, or less).
Truly mission critical applications (autonomous cars, medical) will probably not do this, but many other applications will.

Mike

1 Like

I do have a question about this pre-training. Obviously you can train the network. But does each new autonomous computer system, say in a car, or drone, or robot, on the edge, unconnected to the network, need to be individually trained, or can the training for one autonomous edge object be transferred to all others? And if so, how?

Mike if you have any information on this or anyone else I would be interested.

Tinker

From an industry perspective, universities teach CUDA…shall they be destined to teach a GOOG language, AMZN, FB, etc.?

Yes, there is widespread teaching of CUDA. But people doing machine learning mostly want to be abstracted away from that low level. To do this they use a framework (Google’s TensorFlow, Caffe (Berkeley), Caffe2 (FB), PyTorch, PaddlePaddle (Baidu), etc.). The frameworks work on just a laptop CPU, but also on big hardware with GPUs…so you can debug your code on your laptop with a small dataset…then run big training jobs on more powerful HW. And you never see a line of CUDA code, since it sits in a low-level library written by an expert in that.
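For example, a minimal TensorFlow/Keras training script looks something like this (a generic sketch with made-up data, not any particular application); notice there is not a single line of device-specific code in it:

import numpy as np
import tensorflow as tf

# A toy classifier: the same code runs on a laptop CPU or a GPU box --
# the CUDA calls live inside the framework's libraries, not in your script.
x = np.random.randn(1000, 20).astype(np.float32)   # fake training data
y = np.random.randint(0, 2, size=1000)             # fake binary labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=2, batch_size=32)           # framework picks the fastest available device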

Mike

2 Likes

I do have a question about this pre-training. Obviously you can train the network. But does each new autonomous computer system, say in a car, or drone, or robot, on the edge, unconnected to the network, need to be individually trained, or can the training for one autonomous edge object be transferred to all others? And if so, how?

Good question.

In a car, for example, Tesla would train the neural network once and replicate the software on each car, identically. This is about the same as regular software delivery today. (If a car were to observe a new situation the data would be recorded, sent back to the factory for retraining, more retesting, then a wide update to all cars.)
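In code terms it is nothing exotic. A hypothetical Keras sketch (a production car would use an optimized inference runtime, but the idea is the same):

import numpy as np
import tensorflow as tf

# "Train once at the factory": a tiny stand-in network, trained on made-up data.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")
model.fit(np.random.randn(100, 4), np.random.randn(100, 1), epochs=1, verbose=0)

# The trained network is just a file: architecture plus weights.
model.save("trained_model.h5")

# "Each car" (or drone, or robot) loads an identical copy and only runs inference.
deployed = tf.keras.models.load_model("trained_model.h5")
print(deployed.predict(np.random.randn(1, 4)))     # no on-device training needed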

“Transfer learning” is where someone else, for example, has extensively built and trained a neural network to recognize hundreds of classes of objects (animals, plants, household items, etc.) Now let’s say you want to build a neural network to do something not done before …recognize hundreds of different species of birds. You start with the original network, cut off the last layer that does the final classification, put your own classifier in, then retrain with your data. You may have just saved 95% of the effort compared to starting from scratch. This is because the original neural network has already learned how to find the edges, features and fine details needed in your task. This will only work well if the model you start with has sufficient depth to be able to learn the fine details your application requires.
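A sketch of what that looks like in Keras (MobileNetV2 here just stands in for “someone else’s extensively trained network,” and the 200-species bird classifier is a made-up example):

import tensorflow as tf

# Load a network someone else trained on millions of general images...
base = tf.keras.applications.MobileNetV2(
    weights="imagenet",
    include_top=False,          # ..."cut off the last layer that does the final classification"...
    input_shape=(160, 160, 3),
    pooling="avg",
)
base.trainable = False          # keep the learned edge/feature detectors as-is

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(200, activation="softmax"),   # ..."put your own classifier in"
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(bird_images, bird_labels, epochs=5)   # retrain only the new head with your own data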

This is a bit analogous to learning one sport with a ball. Then when you learn a second sport with a different ball, you still know how to play a game with some rules on a court or field, and you still know how to run, jump, throw, etc., even though some of the details are different. You didn’t start over as a 2-year-old who has just learned to walk.

Mike

20 Likes