Great discussion:
• 70% of all AI workloads run on Nvidia chips, and about 28% on Google's TPUs (thanks to Google Search and Google Ads, two of the largest money-making AI apps today, along with TikTok and Meta). But Google's TPU workloads are internal and never sold, so of the AI compute people actually purchase, Nvidia has roughly 70 of the remaining 72 points - about 98%.
• Google still buys Nvidia chips for Google Cloud - to rent GPU compute time to customers, probably because those customers want CUDA.
• Patel says Nvidia is dominant because of a "three-headed dragon":
- Software: “Every semiconductor company in the world sucks at software - except for Nvidia.”
- Hardware: Nvidia gets to the newest technologies first.
- Networking
As Brad says, multiple competitive moats.
Patel goes on to point out that the Blackwell racks are huge - 3 tons each - and only Nvidia can build the whole thing in-house.
“Building a chip is one thing. But building many chips that connect together, cooling them, networking them…is a whole host of things that other semiconductor companies don’t have the engineers for.”
• Blackwell’s performance per TCO (total cost of ownership) is 5X Hopper’s - rough math sketched after the next quote.
• “The cost for delivering LLMs is tanking, which is going to induce demand.”
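A minimal sketch of why a 5X performance-per-TCO jump tanks delivery cost. Only the 5X ratio comes from the discussion; the starting dollar figure is a hypothetical placeholder:

```python
# Hypothetical cost-per-token math. Only the 5x performance-per-TCO
# ratio comes from the episode; the starting cost is made up.
hopper_cost_per_1m_tokens = 10.00      # hypothetical serving cost on Hopper ($)
perf_per_tco_multiplier = 5            # Patel's Blackwell-vs-Hopper claim

blackwell_cost = hopper_cost_per_1m_tokens / perf_per_tco_multiplier
print(f"Blackwell cost per 1M tokens: ${blackwell_cost:.2f}")       # $2.00
print(f"Cost reduction: {1 - 1 / perf_per_tco_multiplier:.0%}")     # 80%
```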
• Nvidia’s software stack goes well beyond CUDA, but CUDA is essential for training. Training is the development stage: engineers are constantly trying new things, and it’s not worth their time hand-optimizing kernels themselves. They rely on CUDA and Nvidia’s tools being fast and good enough out of the box.
But on the inference side (deployment), customers like Microsoft can justify hiring engineers to tune models to run on cheaper hardware, since a deployed app will run for ~6 months - much longer than a single training run. A break-even sketch follows.
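That trade-off with entirely hypothetical numbers - headcount, salaries, and serving bills are assumptions; only the ~6-month deployment horizon comes from the discussion:

```python
# Break-even math for tuning inference onto cheaper hardware.
# All figures are illustrative assumptions, not from the episode.
engineers = 5
engineer_cost_per_year = 500_000          # fully loaded cost per engineer ($)
deployment_months = 6                     # how long the tuned app runs

nvidia_bill_per_month = 2_000_000         # hypothetical serving cost on Nvidia ($)
cheaper_hw_bill_per_month = 1_200_000     # hypothetical cost on alternative hw ($)

tuning_cost = engineers * engineer_cost_per_year * deployment_months / 12
savings = (nvidia_bill_per_month - cheaper_hw_bill_per_month) * deployment_months

print(f"Tuning cost:      ${tuning_cost:,.0f}")    # $1,250,000
print(f"Hardware savings: ${savings:,.0f}")        # $4,800,000
print("Worth tuning" if savings > tuning_cost else "Not worth it")
```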
• Patel believes companies are upgrading their non-AI data centers to gain power for new GPU installations. The new CPUs deliver more performance per watt and per rack, so a refresh consolidates the existing workload and frees up rack space and power for new AI racks and workloads (sketch below).
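A sketch of that consolidation argument. Only the mechanism comes from the episode; the rack counts, power draws, and 3x efficiency gain are assumptions:

```python
# "CPU refresh frees power for AI" with hypothetical numbers: newer CPUs
# do the same work at ~3x the performance per watt, so the old fleet's
# workload fits in far fewer racks and kilowatts.
import math

old_cpu_racks = 500
old_rack_kw = 12                 # power draw per legacy CPU rack (kW)
new_rack_kw = 15                 # denser modern rack (kW)
perf_per_watt_gain = 3.0         # assumed new-vs-old CPU efficiency gain

# Racks needed to match the old fleet's total throughput:
new_cpu_racks = math.ceil(old_cpu_racks * old_rack_kw / (new_rack_kw * perf_per_watt_gain))

freed_kw = old_cpu_racks * old_rack_kw - new_cpu_racks * new_rack_kw
print(f"New CPU racks needed: {new_cpu_racks}")        # 134
print(f"Power freed for AI racks: {freed_kw:,} kW")    # 3,990 kW
```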
• Synthetic data generation is just getting underway and will improve training results beyond what training on today’s entire internet achieves.
• “When you look at The Street’s estimates for capex, they’re all far too low…This whole ‘scale is over’ narrative falls on its face when you look at what the people who know the best are spending on.”
• Nvidia’s source of capital is very different from Cisco’s back in the day. The private-market contribution today is much smaller (even accounting for inflation) than it was during the Dot Com Boom. Today the money comes from the cash flows of the most profitable companies in the world.
• GPT-4 cost millions of dollars to train, but it’s generating billions of dollars in revenue.
• Consumers are paying 50X more per query now, but they’re getting value out of it because they’re getting things they couldn’t get before at any cost. Code development is his example: spending more on the model is still cheaper than human coders. He gives examples of making $300k/year programmers 20% more efficient, or doing the same work with 75 or 50 developers instead of 100 - those cases are “so worth using the most expensive model.” (Arithmetic sketched after the quote below.)
“The cost for intelligence is so high in society”
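Patel’s productivity arithmetic, sketched with the figures he cites (the $300k salary and 20% uplift); the per-seat model cost is a hypothetical assumption:

```python
# Developer-ROI math. Salary and 20% uplift are Patel's figures;
# the per-seat model cost is a hypothetical assumption.
developers = 100
salary = 300_000                   # fully loaded cost per developer ($/yr)
efficiency_gain = 0.20             # "20% more efficient"
model_cost_per_seat = 10_000       # assumed yearly spend on the best model ($)

value_created = developers * salary * efficiency_gain   # $6,000,000
model_spend = developers * model_cost_per_seat          # $1,000,000
print(f"Productivity value: ${value_created:,.0f}")
print(f"Model spend:        ${model_spend:,.0f}")

# Equivalent framing: do the same work with 75 developers instead of 100.
print(f"Headcount savings at 75 devs: ${25 * salary:,}/yr")   # $7,500,000
```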
• Memory content is growing faster than GPU compute. Nvidia’s single biggest input cost is HBM memory, not TSMC’s wafers.
• The only reason people buy AMD GPUs is that they pack more memory per package. Patel: “Maybe we can’t design as well as Nvidia, but we can put more memory on it… The software isn’t nearly as good, the compute elements aren’t nearly as good, but by golly they’ve got more memory bandwidth per dollar.”
• AMD is missing software; they won’t spend the money to build a GPU cluster of their own to develop software on. “Which is insane.” Meta and Microsoft are helping them somewhat. But AMD’s share of total AI revenue will decline next year even as its revenue grows.
— More after I eat dinner —