- In particular, an open source project from Meta (PyTorch) and another from OpenAI (Triton) have apparently cracked open the NVIDIA/CUDA armor.
In Nvidia's official conferences they talk about the benefits of Triton and have multiple presentations on how to set up Triton to work with Nvidia chips. I'm still new to Nvidia, but I had just assumed Triton was a Nvidia-owned product because of how glowingly they talk about it.
Either I'm confused about what's going on here, or the author mistakenly thinks Triton is purely competitive and doesn't offer some benefit to Nvidia too?
Excerpt from the December 14 Nvidia special call:
So we'll look at the key inference software from the NVIDIA inference platform before we dive into the actual use cases in financial services. The first one is the Triton Inference Server. Triton Inference Server is inference serving software for fast, scalable, and simplified inference serving. The way it achieves all of that is by doing the things you see here in this chart, starting with support for any framework.
So regardless of whether it's a machine learning or deep learning model, it supports all the popular frameworks, like TensorFlow, PyTorch, and XGBoost; intermediate formats like ONNX; inference frameworks like TensorRT; even basic Python and more. By doing this, it allows data scientists to choose whatever framework they need to develop and train their models, and then helps in production by streamlining model execution across these frameworks. It also supports multi-GPU, multi-node inference of large language models.
The second benefit of Triton is that it can handle different types of model processing, whether real-time or offline batch; it accepts video or audio as streaming input. It also supports pipelines. Because today, if you look at any actual AI application, it's not a single model at work: there is preprocessing, there is postprocessing, and there are many models that work in sequence, or some in parallel, for a specific inference. Triton supports that whole pipeline.
The third benefit is that Triton can run models on any platform. It supports CPUs and GPUs, runs on various operating systems, and of course on the cloud, on-prem, at the edge, and embedded. So essentially, it provides a standardized way to deploy, run, and scale AI models.
And it works with many DevOps and MLOps tools, like Kubernetes and KServe, and with MLOps platforms on the cloud and on-prem. This is how it's able to scale models based on demand.
It's able to offer all of these benefits without leaving any performance on the table. It provides the best performance on GPUs and CPUs, and it has unique capabilities like dynamic batching and concurrent model execution. Thereby it not only provides very high throughput with low latency, it also increases the utilization of the GPU, essentially maximizing the investment and the ROI from those compute resources.
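(For concreteness, and not from the call itself: the dynamic batching and concurrent execution the speaker mentions are configured per model in Triton Inference Server through a `config.pbtxt` file in the model repository. A minimal sketch, with illustrative model and tensor names that are my own assumptions, might look like this:)

```
name: "resnet50"              # hypothetical model name
platform: "tensorrt_plan"     # serve a TensorRT engine
max_batch_size: 32

input [
  {
    name: "input"             # illustrative tensor name
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Dynamic batching: the server groups individual requests into
# larger batches on the fly to raise GPU throughput.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}

# Concurrent execution: run two instances of the model on the GPU
# so requests can be processed in parallel.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```

(This is the per-model knob behind the throughput and utilization claims quoted above; the exact sizes and counts here are placeholders, not values from the call.)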