I think these bottlenecks are currently preventing $NVDA from fully realizing the technical and sales potential of its ecosystem:
Lack of liquid-cooled facilities to move its GPUs into
Supply constraints
Networking bottlenecks (InfiniBand, Ethernet)
PCIe bottlenecks
IMO, Huang dodged an analyst’s question about GPUs at scale overwhelming air-cooling systems. He could have taken the question as an opportunity to map out how, over the next few years, $NVDA is going to keep GPU supply moving into datacenters that don’t currently have adequate cooling. He could have provided estimates of when liquid-cooled infrastructure will be available at scale and how they will manage in the meantime. He didn’t. Instead he dodged the question by saying GPU demand looks good in the short term.
I’m long $NVDA; it’s a double for me so far, and I currently don’t intend to sell a single share. I think it’ll probably be in my portfolio for a long time.
However, I think there is a risk to the investment thesis worth monitoring: the bottlenecks listed above could buy time for competitors, who are no doubt working madly, around the clock, on solutions that generate less heat, use less power, take up less physical space, and require fewer physical components to make.
If someone else out there comes up with a chip/ecosystem that reduces these things by orders of magnitude, they just may have a fighting chance thanks to cover provided by the bottlenecks that currently prevent $NVDA from achieving its full potential to dominate.
I realize speculation about it is probably not very productive and that the better thing to monitor is numbers. I’m just proposing this list of bottlenecks as an additional way of monitoring the investment thesis.
I think this is a reasonable consideration. If I may offer a counterpoint, however, I would argue that $NVDA is acutely aware of these issues and is likely applying its vast human and monetary capital to solve them, whether internally or through collaborations, acquisitions, etc. Jensen did mention a few times just how many companies are part of their supply chain, and they are consistently investigating other avenues to optimize their process.
To that end, they even mentioned during the call that they will take a hit to gross margins in the next few quarters to ensure a more seamless customer experience/transition.
In short, I believe it’s reasonable and healthy to consider these bottlenecks as potential hindrances, but there is some possibility that $NVDA continues to innovate and pave the way, smoothing these processes while solidifying their market dominance.
The whole industry shares these bottlenecks, so I don’t see how these allow competitors to “catch up”. Could they sell more right now without bottlenecks? Sure. So, is that your point?
@MFChips well, it’s basically a land-grab right now, with multiple providers of AI functionality competing to get developers/organizations permanently locked into their ecosystems via their software stacks.
My point is that these bottlenecks may hinder or slow $NVDA’s efforts to achieve vendor lock-in (e.g. CUDA).
If the bottlenecks turn out to be severe enough, they could buy other vendors time to catch up with competing solutions.
Also, if ultimately there are not enough facilities available that can handle the heat that $NVDA’s ecosystem generates, we could theoretically see an over-supply of $NVDA GPUs until sufficient water-cooled floorspace is available.
The problem is unique to $NVDA, because nobody else is stressing the limits of the entire (planet-wide!) ecosystem the way $NVDA is. Granted, a nice problem to have. But I wonder if it could be the basis of a legitimate bear thesis at some point.
I also thought he dodged the question which was clearly about heating. He gave a long answer about the supply chain and demand for Blackwell.
I am not sure how Supermicro’s accounting saga may end up impacting Nvidia, as Supermicro was the biggest champion of liquid cooling, which would address this issue.
Do you think Nvidia is going to be able to maintain this monopoly on the market indefinitely?
Looking at other recent cases where a company had an enormous lead on competitors, such as AWS or ChatGPT, the competitors eventually got close to producing an equivalent product.
My biggest concern is related to the law of large numbers. They added 5B of net new revenue this quarter and are guiding to 2.5B of net new revenue for the upcoming quarter. If we look back since the AI boom started, the net new revenue added per quarter is:
6.5B → 4.5B → 4B → 4B → 4B → 5B → (guided 2.5B)
As Nvidia laps the growth quarters coming up, simply adding 4-5B of net new revenue per quarter is not going to move the needle as much. However, a counterpoint is that profitability is growing faster than revenue: this past quarter, EPS grew 110% on 94% revenue growth.
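To make the lapping effect concrete, here is a rough back-of-the-envelope sketch in Python. The ~7B starting quarterly base is an illustrative assumption, not a reported figure; the net-new deltas are the ones quoted above.

```python
# Back-of-the-envelope: same ~4-5B dollar adds, shrinking YoY percentage growth.
# The 7B starting base is an illustrative assumption; deltas are from the list above ($B).
deltas = [6.5, 4.5, 4.0, 4.0, 4.0, 5.0, 2.5]  # last value is the guided quarter

revenue = [7.0]  # assumed pre-boom quarterly revenue, $B (illustrative)
for d in deltas:
    revenue.append(revenue[-1] + d)

# Year-over-year growth compares each quarter to the one four quarters earlier.
for i in range(4, len(revenue)):
    yoy = (revenue[i] / revenue[i - 4] - 1) * 100
    print(f"quarter {i}: ~{revenue[i]:.1f}B revenue, ~{yoy:.0f}% YoY growth")
```

Under those assumptions, YoY growth fades from roughly 270% down toward 70% even though the dollar adds stay in the same 4-5B range, which is exactly the law-of-large-numbers point.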
I am still holding a position in Nvidia, but it is smaller than it used to be. I think it would be wise to at least consider some of these drawbacks.
Seems to me that these bottlenecks are due mostly to physics, i.e., they are issues which competitors will face as they get to similar levels of performance and thus will be bottlenecks for them as well.
As for their monopoly, at this time it is in their silicon chips.
But they are increasingly moving up the food chain by providing the whole solution: a matrix of chips on a board, and the interconnect between them.
The reason Foxconn is mentioned? They are a manufacturing arm of NVDA, not a customer. NVDA is innovating at what we refer to as “top down”. I don’t view them as a chip company; they are a solutions company. Any vacuum left by SMCI is filled by a solution provided by NVDA and a partner company.
If SMCI had liquid cooling and the others did not, that is a hiccup, not a deal-killer. Dell, HPE, or any good box vendor can add liquid cooling and is probably already doing so. I just saw a liquid-cooled system up close. It is not rocket science.
Liquid cooling adds cost to the overall package, but it is not particularly high-tech or difficult to implement. Perhaps the board or system designers were taken by surprise that their air-cooled heat sink was not enough.
Air-cooling: The chip generates lots of heat. The air-cooled solution is a slab of metal called a “heat sink”. The heat sink is secured on top of the chip that is making the heat. The slab of metal has fins like the fins on the barrels of a motorcycle engine. Air flows past the fins and cools the fins which in turn cools the chip. The box has fans to blow the air past the fins of the heat sink.
Liquid-cooling: The heat sink is encased in a liquid-tight metal package. Using small hoses, the liquid is pushed through the metal package past the heat sink to cool it, which in turn cools the chip. A liquid like water or anti-freeze transfers heat much better than air does. The hose leads out to a radiator with a fan to blow external air through the radiator. In bigger systems, the liquid could go outdoors to a cooling tower. The extra cost is the pump, one or more radiators like your car has, fans on the radiators, the hoses and metal fittings, plus the more expensive package that lets the liquid flow through it.
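To put a rough number on “transfers heat much better than air”, here is a minimal sketch using the basic heat-transfer relation Q = mdot × c_p × dT. The 1,000 W chip load and 10 K coolant temperature rise are illustrative assumptions, not Blackwell specs.

```python
# Minimal sketch: coolant flow needed to carry away a fixed heat load,
# using Q = mdot * c_p * dT. All figures are illustrative assumptions.
Q = 1000.0   # heat to remove, watts (roughly one hot accelerator)
dT = 10.0    # allowed coolant temperature rise, kelvin

cp_air, rho_air = 1005.0, 1.2        # J/(kg*K), kg/m^3, near room temperature
cp_water, rho_water = 4186.0, 1000.0

mdot_air = Q / (cp_air * dT)         # kg/s of air required
mdot_water = Q / (cp_water * dT)     # kg/s of water required

flow_air = mdot_air / rho_air        # m^3/s of air
flow_water = mdot_water / rho_water  # m^3/s of water

print(f"Air:   ~{flow_air * 1000:.0f} L/s blown past the fins")
print(f"Water: ~{flow_water * 60000:.1f} L/min pumped through the cold plate")
```

Under those assumptions, air needs on the order of 80 L/s of flow where water needs about a litre and a half per minute, which is why the densest racks are moving to liquid.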
So far as technology goes, this would be easy to build, but very annoying if you just built out an acre of server farm expecting air-cooling to be enough. Retrofitting the air-cooled solution with liquid cooling would be expensive. Customers will be angry at whoever sold them the system, but they will reduce clock speed or voltage by a small amount to fit in their “thermal window” and make a note to buy a liquid-cooled system on the next order.
Since NVDA sells the top-performing solution, they will stay with NVDA and order liquid next time.
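For a sense of how much headroom a small derate buys, here is a quick sketch using the standard CMOS dynamic-power relation (P roughly proportional to V² × f). The 5% figures are illustrative assumptions, not vendor guidance.

```python
# Quick sketch of the "reduce clock or voltage to fit the thermal window" idea.
# Dynamic power scales roughly with V^2 * f; the 5% derates are illustrative.
def relative_power(voltage_scale: float, clock_scale: float) -> float:
    """Dynamic power relative to stock settings (P proportional to V^2 * f)."""
    return voltage_scale ** 2 * clock_scale

print(f"5% undervolt + 5% underclock: {relative_power(0.95, 0.95):.0%} of stock power")
```

So giving up roughly 5% of clock may shave around 14% of the heat, which could be enough to stay inside an air-cooled thermal budget while the liquid-cooled order ships.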
Over the last year and a half they have given guidance that is 1.88B to 2.5B above the most recent reported quarter. In the most recent quarter they beat guidance by the most in dollar terms, 2.58B, with the previous best being 2.51B. So they are consistently guiding about 2-2.5B above their current revenue.
I’m reading that management does not see any major slow-down in revenue growth in the coming quarter.
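Putting that guide-then-beat pattern into numbers (a sketch, assuming the pattern repeats; the reported-quarter figure is just a placeholder):

```python
# Sketch of the guide-then-beat pattern described above. Treating the recent
# guide (~2.5B above last quarter) and beat (~2.5B over guidance) as repeatable
# is an assumption, not a forecast; last_quarter is an illustrative placeholder.
last_quarter = 35.0      # $B, reported quarterly revenue (placeholder)
guide_above_last = 2.5   # $B, guidance above the prior quarter
typical_beat = 2.5       # $B, recent beats over guidance

guided = last_quarter + guide_above_last
if_beat_repeats = guided + typical_beat

print(f"Guided next quarter: ~{guided:.1f}B (net new ~{guide_above_last:.1f}B)")
print(f"If the beat repeats: ~{if_beat_repeats:.1f}B (net new ~{guide_above_last + typical_beat:.1f}B)")
```

In other words, the guided 2.5B of net new revenue would translate to roughly 5B of actual net new if the beat pattern holds, in line with the recent quarters listed earlier.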
Maintain the monopoly indefinitely? Not in the sense of “indefinitely” being decades. But it will be maintained for at least the next couple years… at least (being repetitive).
And I would categorize AWS and ChatGPT completely differently. One is a hybrid/infrastructure operation… that takes anyone years to duplicate. By contrast, ChatGPT is software. Theoretically, a better competitor could arrive later today.
This is an interesting (probably a marketing pitch) video about Nvidia’s scaling challenges compared with those of Cerebras.
While very impressive, I don’t know how far Cerebras is from ‘taking over’ Nvidia’s market, or whether it is even in the same league when it comes to offering solutions at a higher level.
@Fooledbydesign SUPER interesting and relevant video, thanks!
Like you say, we’ll have to monitor Cerebras to see if/how they can deliver integrated solutions.
They certainly seem to have a good handle on challenges and solutions, at the engineering level, for scaling GPU/memory capacity and inter-connectivity.
If what Cerebras is saying is true, we can add a fifth item:
5. A die-level architecture that cannot feasibly scale past two GPUs per die, whereas Cerebras already has a die architecture that scales to FIFTY GPUs per die, and that also addresses #3 and #4 in $NVDA’s list of bottlenecks.
That is an interesting take, because most people I’ve seen say Nvidia’s insurmountable competitive advantage is its CUDA software. A lot of hardware developers are already familiar with their CUDA system.
Plenty of companies on the board are pure software plays but still have a competitive advantage. Reddit’s source code is open source, and AppLovin says they could open source their code. ChatGPT is still about six months to a year ahead of the competition, and they benefit from the scale of all those users feeding feedback and new data into the system.
Since Nvidia’s hardware designs are public, any competitor can look at the heat issues coming up and design more effectively around them. Nvidia is locked into their designs already. They still have a huge lead on competitors, a lot of it due to their software, but I am doubting they will be 90%+ of the market a few years from now.
This contradicts what SuperMicro has said about requiring 2+ years and thousands of engineers to build their liquid-cooled system. HPE didn’t have any liquid-cooled systems at all the last time I looked, and Dell’s solutions are behind SuperMicro’s.
There have been liquid-cooled systems for home PCs for a while, but this isn’t the type of system being discussed here. The liquid cooling for a rack of Nvidia Blackwells is a state-of-the-art solution, and not all liquid cooling is the same.
Liquid cooling is not particularly difficult to accomplish. Liquid cooling, in one form or another, has been around server farms and mainframes for decades. It does, however, require design differences at every level of the server system: plumbing for the circuit card, plumbing on the rack, fittings between the card and the rack, plumbing across the racks, and so on. I described in my earlier post liquid lines to the heat-generating components. Other solutions immerse the whole PC board in liquid. Dell, HPE, Asus, and lots of others offer solutions. You can use a search engine with the phrase “liquid cooled server” or similar.
Here is a system similar to what I described in my earlier post:
I was in a friend’s server room that looks just like that 6 weeks ago.
Microsoft uses a two-phase system on their own server farms.
Perhaps SMCI pioneered some newer method? But if an SMCI spokesperson said “thousands of engineers” to develop a liquid-cooled server rack/farm, that seems like an exaggeration.
I agree, two years and thousands of engineers (or maybe they said engineering hours, which is a very different metric) seems like more than it should require. If that information was in a press release, I didn’t see it.
The only thing different about the SMCI solution for water cooling is that they needed to integrate it with their “building block” design. I have absolutely no knowledge of how much additional time this might take, but the advantage is that it gives SMCI customers a high degree of customization. That probably gives SMCI a marketing edge, but I don’t think it makes their implementation functionally much different from any other vendor’s.
All the same, I do believe that SMCI is ahead of their competitors with respect to providing data-center-scale liquid cooling. I don’t know to what extent their competitors have comparable offerings at this time, but in a few months that will probably be moot.
However, I admit I’m speculating here. I suppose it could take longer to provide enough servers to build a data center, let alone multiple data centers. Time from design to delivery can be quite long, as all the suppliers in the chain that need to respond to the design changes must adapt their manufacturing and scale up second-tier suppliers. If those suppliers are responding to multiple server manufacturers at the same time, delivery may take longer than one might imagine.
“These data center capabilities represent an important step forward with increased energy efficiency and flexible support for emerging workload,” said Prasad Kalyanaraman, vice president of Infrastructure Services at AWS. “But what is even more exciting is that they are designed to be modular, so that we are able to retrofit our existing infrastructure for liquid cooling and energy efficiency to power generative AI applications and lower our carbon footprint.”
I wonder what they’ve done to be able to retrofit their existing infrastructure for liquid cooling?