I have seen TAM estimates for inferencing ranging from roughly the size of the data center training market to many multiples of it, so it’s obviously very important. Here are some interesting tidbits from Jensen.
“We start – we’ve been sampling our Tesla P4, which is our data center inference processor. And we’re seeing just really existing[exciting] response. And this quarter, we started shipping, we’re looking outwards. My sense is that the inference market is probably about as large in the data centers as training, and the wonderful thing is everything that you train [on] our processor will inference wonderfully in our processors as well. And the data centers are really awakening to the observation that the more GPUs they buy for uploading[offloading] inference and training, the more money they save.”
“The video Tensor RT[NVIDIA TensorRT] is really the only optimizing inference compiler in the world today and it targets all our platforms. And we do inference in the data center that I mentioned earlier. In the embedded world, the first embedded platform we’re targeting is self-driving cars. In order to drive the cars, you are basically inference[inferencing] or try[ing] to predict or perceive what’s around you all the time and that’s a very complicated inference matter.”
“And then for Jetson, we have a platform called Metropolis and Metropolis is used for very large scale smart cities where cameras are deployed all over to keep [the] city safe. We’ve been very successful in smart cities: just about every major smart city provider and what’s called intelligent video analysis company, almost all over the world, [is] using NVIDIA’s video platform to do inference at the Edge, AI at the edge”
“And then we’ve announced recently success with FANUC, the largest manufacturing robotics company in the world, [and] Komatsu, one of the largest construction equipment companies in the world, to apply AI at the Edge for autonomous machines. Drones: we’ve [got] several industrial drones that are inspecting pipelines and expecting[inspecting] power lines, flying over large spans of farms to figure out where to spay[spray] insecticides more accurately. There are all kinds of applications. So, you’re absolutely right that inference at the Edge or AI at the edge is a very large market opportunity for us and that’s exactly why TensorRT was created”
Will ASICs, FPGAs, or whatever “disrupt” Nvidia? Maybe, but if I were involved in those businesses I’d be concerned about getting disrupted by Nvidia before I’d even figured out what they can do.
<<<and the wonderful thing is everything that you train [on] our processor will inference wonderfully in our processors as well. And the data centers are really awakening to the observation that the more GPUs they buy for uploading[offloading] inference and training, the more money they save.”>>>
Nvidia is the volume leader. In semiconductors, the more volume of chip production, the more efficient it becomes, and NVDA has sky-high margins. It seems NVDA is giving volume discounts, and since most customers who need inferencing already have training in their data centers, they already qualify for large volume discounts on the GPUs. The ASICs and FPGAs will need to provide more value per dollar than the volume-discounted GPUs NVDA is offering here. And by definition no one will have more volume already in the market to enable such discounts.
That said, I do not know how much value ASICs can provide even against volume-discounted GPUs. I cannot imagine that FPGAs are going to be “cheap,” so the total value had better be much greater than what GPUs provide if FPGAs are going to disrupt much of what NVDA’s chips are currently doing.
For those interested in this topic, Nvidia has an interesting study on training and inference here. It’s a little lengthy but easy to understand, and it shows why the Volta architecture and Nvidia’s accelerator software sent such a shockwave through the industry.
They also address FPGAs and Google’s TPU. After reading this I’m convinced that FPGAs, while they may have a place in AI, do not present too great a threat to Nvidia’s business or future business. They are way behind in performance, and the requirement to reprogram the hardware every time a user makes a change to the network, in this emerging, highly dynamic field, makes their rapid rise downright impractical on an end-to-end efficiency front.

ASICs are impractical for the same reason, but to a greater extent. Every network that is trained would have to have its own ASIC developed, specific to that network. Hundreds, maybe thousands, of networks are developed every day. There may be specific instances where an ASIC is both the most viable and most efficient processor, but for the mass of neural networks being developed, the ASIC path won’t be.

Say, for instance, a company develops an autonomous vehicle platform. They spend months or years and tens of millions of dollars to develop an ASIC to run their system (it would not be able to run someone else’s system). After a period of time, new sensors, new techniques, or new software become available, or it becomes necessary to correct some flaw in the platform. For anything of significance, you may have to repeat the ASIC development process and update models going forward, or recall and replace ALL previous processors in addition to adding or updating the individual component. A potentially devastating cost. In a GPU-processed system, all that might additionally be required is a software update, probably over the air.
The Google TPU might be a different bear. It is significantly behind Volta in performance for training and inference. But I wouldn’t put Google in a corner. For sure they are developing a follow-up. And they are now a Volta user at scale. In all of the early Volta announcements, Google was curiously the only cloud titan not mentioned, including on the previous earnings call. But on the most recent call, Google is indeed lapping up Volta. Maybe when Amazon was the first to launch Volta and subsequently dropped pricing for Pascal training while keeping Volta at a premium, Google was forced to join the party. Or maybe it was planned and just not announced. What is the potential in tensor-only processing, and how versatile will it be? Will the software be able to mix with EVERY deep learning framework in use the way Nvidia’s does? Would Google want it to? Would they market it and make it available to competitors in the compute-for-hire space? Important questions to keep an eye on.
Of other interest, Nvidia’s inference-targeted chips, the P4, P40, and Jetson, are Pascal based. I’d bet their engineers are working on bringing the Volta architecture to these platforms just as they did for Drive with Xavier. Look for more Volta announcements at GTC in March.
The Google TPU might be a different bear.
What is the potential in tensor only processing and how versatile will it be? Will the software be able to mix with EVERY deep learning framework in use the way Nvidia does? Would Google want it to?
First, some terminology corrections and definitions.
“Tensor” is a mathematical term being used by Google (TPU, TensorFlow) and Nvidia (TensorRT) as a marketing term. Very simply, a tensor is a vector or an array of numbers. The current machine learning (mostly deserved) hype cycle kicked off with “deep” neural nets and convolutional neural nets. “Deep” because the programmers are using many layers, and the layers are mostly made up of 2D and/or 3D convolutions. These 2D and 3D data structures are arrays, or “tensors,” that are being processed. Thus, the technical reason for picking the marketing term “tensor.”
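To make that concrete, here is a minimal pure-Python sketch of the core operation a convolutional layer performs: sliding a small kernel over a 2D tensor (here just a nested list) and taking a product-sum at each position. The image and kernel values are made up for illustration.

```python
# A "tensor" here is just an n-dimensional array of numbers.
# 2D tensor: a tiny 4x4 "image" patch (values are illustrative).
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]

# 3x3 kernel, like one filter in a CNN layer (a Laplacian-style edge detector).
kernel = [
    [0,  1, 0],
    [1, -4, 1],
    [0,  1, 0],
]

def conv2d(img, k):
    """Valid (no-padding) 2D convolution: slide the kernel over the
    image and take the elementwise product-sum at each position."""
    kh, kw = len(k), len(k[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(img[i + di][j + dj] * k[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

print(conv2d(image, kernel))  # → [[0, 0], [8, 0]]
```

A real deep net stacks hundreds of these product-sums per layer, which is why the hardware question comes down to who can do massive amounts of exactly this arithmetic cheapest.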
To answer your question…YES, every deep learning network (or model) is going to be using convolutions. The most compute-intensive layers, and probably about 50–75% of all layers for most applications, will be convolutional layers.
Next you need to consider what bit precision needs to be used. Historically, GPUs mostly processed 32-bit floating point (as required by 3D graphics). Recent Nvidia and AMD GPUs also handle 16-bit floats: you lose some accuracy but can store and process twice as much data using the same amount of memory and bandwidth. Researchers originally used 32-bit floats to get the most accuracy, and because that was what was available and easy to use. But they then found this is probably overkill most of the time: with fewer bits you might be 99% as good, and with 8-bit integers you might be 98% as good (percentages are application dependent).
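A sketch of what the 8-bit trade-off looks like in practice, assuming a simple symmetric quantization scheme (the weight values are made up, and this is not any particular library’s implementation): each float32 weight is mapped to an integer in [-127, 127] via one scale factor, cutting memory and bandwidth 4x at the cost of a small rounding error.

```python
# Hypothetical float32 weights from a trained layer (illustrative values).
weights = [0.52, -1.31, 0.07, 2.48, -0.93]

# Symmetric 8-bit quantization: one scale factor maps the largest
# magnitude onto 127, everything else scales proportionally.
scale = max(abs(w) for w in weights) / 127.0

quantized = [round(w / scale) for w in weights]   # stored as int8 (1 byte each)
dequantized = [q * scale for q in quantized]      # what compute actually sees

# The cost of the shortcut: per-weight rounding error, bounded by scale/2.
errors = [abs(w - d) for w, d in zip(weights, dequantized)]
print(quantized)
print(max(errors))
```

Each weight now occupies 1 byte instead of 4, which is the “~4x” memory/bandwidth (and potential throughput) win discussed below; whether the rounding error is tolerable is exactly the application-dependent part.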
So Google’s TPU uses 8-bit integers, and the Nvidia Volta adds 8-bit integer support for convolutions.
So not ALL applications can use these 8-bit shortcuts, but most probably can. It’s maybe no big deal if a voice assistant isn’t quite as accurate but is ~4 times as fast (~4x fewer servers needed). But reading an MRI for early cancer detection will probably still use 32-bit floats everywhere to get the best accuracy possible.