Nvidia, Ethernet, SMCI, xAI (Grok3), even Tesla

A technical article from October on how xAI scaled up to the largest Nvidia AI cluster in the world:

First, the cluster is liquid-cooled (not water, btw) and is built with SuperMicro’s liquid cooling racks:

Here, xAI is using the Supermicro 4U Universal GPU system. These are the most advanced AI servers on the market right now, for a few reasons. One is the degree of liquid cooling. The other is how serviceable they are. … One example of this is how the system is on trays that are serviceable without removing systems from the rack. The 1U rack manifold helps usher cool liquid in and warmed liquid out for each system. Quick disconnects make it fast to get the liquid cooling out of the way, and we showed last year how these can be removed and installed one-handed. Once these are removed, the trays can be pulled out for service.

There are a few photos of this, too.

Other AI servers in the industry are built first, and then liquid cooling is added to an air-cooled design. Supermicro’s design is liquid-cooled from the ground up, and all from one vendor.

Even for storage, xAI chose Supermicro:

In AI clusters, you generally see large storage arrays. Here, we had storage software from different vendors running, but almost every storage server we saw was Supermicro as well. That should not be a surprise. Supermicro is the OEM for many storage vendors.

And then for networking:

Here, xAI is using NVIDIA BlueField-3 SuperNICs and Spectrum-X networking. NVIDIA has some special sauce in their network stack that helps ensure the right data gets to the right place navigating around bottlenecks in the cluster.
That is a big deal. Many supercomputer networks use InfiniBand or other technologies, but this is Ethernet. Ethernet means it can scale. Everyone reading this on STH will have the page delivered over an Ethernet network at some point. Ethernet is the backbone of the Internet. As a result, it is a technology that is immensely scalable. These enormous AI clusters are scaling past the point that some of the more exotic technologies have ever reached. This is a really bold move by the xAI team.

So, even though xAI uses Ethernet instead of InfiniBand, they still chose Nvidia’s networking hardware and firmware.

If your computer uses an Ethernet cable, that is the same base technology as the networking here, except that this is 400GbE, or 400 times faster per optical connection than the common 1GbE networking we see elsewhere. There are also nine of these links per system, which means that we have about 3.6Tbps of bandwidth per GPU compute server.
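The quoted numbers work out; here is the back-of-the-envelope arithmetic as a quick Python sketch (the figures come straight from the article, the variable names are mine):

```python
# Bandwidth math for one of xAI's GPU compute servers, per the article.
common_ethernet_gbps = 1   # typical 1GbE link most of us use
link_speed_gbps = 400      # each optical 400GbE connection
links_per_server = 9       # links per GPU compute server

speedup_per_link = link_speed_gbps / common_ethernet_gbps
total_tbps = link_speed_gbps * links_per_server / 1000

print(speedup_per_link)  # 400.0 -> times faster than common 1GbE
print(total_tbps)        # 3.6   -> Tbps per GPU compute server
```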

Of course, power is an important aspect of building a large data center. And here Tesla Megapacks come into play:

Outside of the facility, we saw containers with Tesla Megapacks. This is one of the really neat learning points that the teams had building this giant cluster. AI servers do not run at 100% rated power consumption 24×7. Instead, they have many peaks and valleys in power consumption. With so many GPUs on site, the power consumption fluctuates as the workload moves to the GPUs, and then results are collated, and new jobs are dispatched. The team found that the millisecond spikes and drops in power were stressful enough that putting the Tesla Megapacks in the middle to help buffer those spikes in power helped make the entire installation more reliable.
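As a rough illustration of the buffering idea described in that excerpt (my own toy model with made-up numbers, not xAI's actual control scheme), the grid supplies a steady average draw while the battery absorbs the spikes and valleys:

```python
# Toy model of battery buffering between the grid and spiky GPU load.
# Positive battery power = discharging into the racks, negative = charging.
def battery_power(load_profile_mw, grid_supply_mw):
    """Battery makes up the difference between load and steady grid draw."""
    return [load - grid_supply_mw for load in load_profile_mw]

# Hypothetical spiky GPU load (MW per millisecond-scale timestep)
load = [60, 100, 70, 95, 75]
grid = sum(load) / len(load)         # steady 80 MW draw the utility sees
battery = battery_power(load, grid)  # spikes the Megapacks would absorb

print(grid)     # 80.0
print(battery)  # [-20.0, 20.0, -10.0, 15.0, -5.0]
```

The design point the team found: the utility sees one flat draw instead of millisecond swings, which is what made the installation more reliable.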

If you saw Brad Gerstner’s interview with Nvidia CEO Jensen Huang (I posted it a few weeks ago), this was the cluster he was talking about. What he didn’t say, but what Gavin Baker recently said on another podcast, is that many in the industry thought a cluster of this size was impossible. Follow that link for an interesting explanation and discussion.

xAI built this cluster to train Grok 3. We’ll see if Gavin’s prediction that Grok 3 will outperform everything else comes true or not.

Anyway, this is a lot to digest. While the fallout from SMCI’s accounting issues has hurt the stock, it was clear that their products were world class. What that means moving forward is anyone’s guess. The easy take-away is that companies such as xAI, Meta, etc., want more AI compute power, and so Nvidia is still a lock there.

The question of Nvidia being cyclical is interesting. Using the obvious internet cycle as an analogy, the money moved from picks and shovels (Cisco and Juniper) to infrastructure (Yahoo, AOL) to applications (Google search, Amazon eCommerce, etc.), to new uses like Cloud Computing, and then new products/services to support those apps like security, databases, etc. How this plays out for AI is still a bit uncertain, but we’ve seen APP and PLTR take off (UPST was an early player, but fizzled - maybe too soon?), and it looks like Meta and xAI and Tesla may be the next wave.

What this means for Nvidia is that while it’ll continue to sell tons of AI chips, the money is going to move downstream (upstream? I can never figure that out). Huang knows this and has already started AI-as-a-service (AIAAS), software, robotics, and other downstream AI plays within Nvidia. But pivoting a $100B-a-year business to new revenue streams isn’t easy. Amazon got lucky that AWS came naturally out of its eCommerce support efforts (scale up for the Christmas rush, sell the unused compute the other 11 months of the year), but in general it’s hard for companies to pivot to all-new businesses that are large enough to matter. Can Nvidia do it? I give it about a year, maybe two, to figure it out.

72 Likes

Hi Smorg,
I so appreciate your contributions here.

When you wrote here, “ The easy take-away is that companies such as xAI, Meta, etc., want more AI compute power and so Nvidia is still a lock there.”

And on another thread recently, “ When you look at the fundamentals right now, the only thing stopping Nvidia from doubling today is TSMC’s production capability. There was a report recently that AMD cancelled some production at TSMC and Nvidia took it all.”

I agree with your reasoning for these assertions. And I’ll add here that all the talk about LLM scaling not continuing has been completely debunked by the fact that Musk and his best-of-the-best engineers have achieved “full coherence” throughout “the largest super cluster in the world”.

No one had achieved coherence with anything greater than 30-40k nodes, so there was reason for skepticism until now. We should see a large leap in capability in January or February, when Grok 3 is out of this training you mentioned.

If this happens, Nvidia growth is off to the races, IMO.

18 Likes

That’s not what happened with AWS. Amazon noticed an internal problem they had with building and expanding their computer programs. They didn’t want all of their engineers solving the same problems over and over, so they created building blocks for their developers. Once that was done, they noticed other developers had the same issues and decided they could offer this as a service. It then took three years from deciding it was a business opportunity until they launched the product. Amazon saw a problem, decided to put their resources into solving it, and it was highly profitable.

It’s not a big pivot if you are solving a problem you encountered in your main business. No one understands the limitations of the current software/hardware of Nvidia chips better than Nvidia. So this would give them a jumpstart on solving those problems before anyone else.

Drew

19 Likes

I know that’s what the interwebs say today, but having lived through it at the time and looked at the SLAs the initial service provided, that is indeed what happened. The SLA (Service Level Agreement) promised different levels of service from Thanksgiving on than for other times of the year. Unfortunately, I don’t have a copy of that SLA today.

Business-wise, it was a very big pivot, and it was very controversial on Wall Street, especially with the continued investment required that made it appear Amazon wasn’t actually making money. We even had discussions about it here in 2016.

The Innovator’s Dilemma book discusses how companies struggle to pivot their main business, including examples from the disk drive industry where even pivoting to a new size drive was problematic, as well as pivoting in adopting new technologies (like excavators from steam to hydraulic).

IBM has pivoted twice - once from mainframes to PCs, and then again to consulting services. The latter hasn’t gone as well as they hoped.

Tesla is in the middle of a couple of pivots right now - from selling cars to robotaxi services, and a whole new business with Optimus robots. Their prior pivot to add solar and battery power solutions hasn’t been a failure, but it hasn’t had the success some of us investors had hoped for.

For Nvidia to pivot from mainly hardware design (GPUs, CPUs, boards, networking, servers) to software will be a big deal, and there’s no guarantee they will be able to pull it off successfully. Sure, Nvidia already has a bunch of software talent on board, but they haven’t made a splash with any applications. They’ve been working on self-driving for years, but their efforts seem more about getting the actual autonomy players to choose Nvidia hardware than about actually getting Nvidia software into products.

Jensen Huang is super-smart and I’m sure understands all this. However, his company is an aircraft carrier in a sea of F50 foiling catamarans.

23 Likes

Thanks, what I’m still trying to figure out is what role Astera Labs will play in future deployments. The long article I linked goes into a lot of detail on the different vendor products that xAI used, but Astera Labs was not listed. Now, that could be because their products aren’t a box you’d see on a data center tour; rather, their products sit inside those boxes.

As we’ve discussed, Nvidia is going to do well, and NVDA is going to do fine, but the stock isn’t going to 5X from here, at least in my lifetime. So, my strategy is to continue looking at two areas:

  1. Nvidia adjacent companies. Smaller companies whose products are used in AI data centers and that could see volumes ramp up several times over what is used today.

  2. The next step in AI businesses. It seems everyone, including me, is looking at the development of the Internet to understand the rise and fall of businesses involved with the development of AI. And so we make the analogy that Nvidia is the Cisco of AI. I can’t figure out who the Juniper is, lol (maybe that’s Nvidia too?). But those companies peaked as the Internet portal business took off (Yahoo, AOL, etc.). The portals in turn peaked as Search replaced them. And then came applications like eCommerce and cloud computing - and then all the adjacent services needed to support those, from security to databases to online storage, etc.

But even there, while today we’d look back and say Google was an obvious winner, at the time there were AltaVista and AskJeeves, and many people thought the already successful portals would simply add search to their offerings. And while we all know about the Pets.com failure, how did we know that an online bookseller wouldn’t suffer the same fate?

So, for AI, who are the new Portals? Are they Microsoft, with its slew of Copilot integrations in Office 365, and Meta? Or will they be all-new companies like OpenAI with ChatGPT, or even companies not on anyone’s radar? And what will be the industry-defining application of AI? Will that be a consumer/retail application, or a business application the scale of Salesforce? Or is Musk the genius who will re-invent Tesla as an AI company, with self-driving cars and autonomous robots for households and factories?

It’s always easy in hindsight to say “oh yeah, it makes sense that company dominated,” but it’s a lot harder to do that contemporaneously.

38 Likes

It doesn’t happen often, but imo every now and then there ARE clear signals to be had, if one is lucky enough to find them.

For instance, in its early days as a public company, Google had an insurmountable lead that imo was actually pretty obvious, even at the time.

I had read through an open-source college course on indexing text and was impressed with the audaciousness and complexity of indexing the entire Internet.

It then struck me that Google had scaled their indexing operations months before anyone else had. They were WAY ahead of everyone else, they were more efficient at it and at the time I thought it was clear nobody could possibly catch up to them; they were going to be the first company to index the entire Internet.

Too bad that’s before I started picking stocks :confused:

8 Likes

My recollection is that Google’s success wasn’t so much the indexing as it was the concept of “page rank,” to determine the order of the search results presented to users. That is, Google looked at how many other pages referenced a particular page (and the “quality” of those pages) and used that to set the order of presentation.

Of course, the algorithm is much more sophisticated now, being customized by the searcher’s history and probably using lots of AI.
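The original idea described above fits in a few lines. Here is a minimal power-iteration sketch over a made-up three-page web (an illustration of the published PageRank concept, not Google's production algorithm):

```python
# Minimal PageRank: a page's score is the chance a random surfer lands on
# it, following links with probability `damping` and jumping to a random
# page otherwise. Pages referenced by more (and better-ranked) pages win.
def pagerank(links, damping=0.85, iters=50):
    n = len(links)
    rank = {page: 1 / n for page in links}
    for _ in range(iters):
        new = {page: (1 - damping) / n for page in links}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

# Hypothetical web: A and C both link to B, so B ends up ranked highest.
web = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # prints B
```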

There, I’ve now reset us back to being on topic: AI. :slight_smile:

9 Likes

I’m trying to figure out how much of a threat quantum computing is to AI, since the math behind both GPUs and quantum computing is linear algebra. If/when we get universal-gate quantum-superior computers, would it be more efficient to do the linear-algebra-based computing on a QC rather than a GPU? If “Yes” then how much more efficient…efficient enough to put a dent in the demand for GPUs?

$MSFT and $GOOG have both made some substantial claims about their QC efforts recently.

5 Likes

To your first point:
I accept your written assertion that, on a risk/benefit basis, any small niche player in the data center space could not measure up to Nvidia. But, without getting ‘too much into the weeds’, I am open to possibilities.

To your second point:
I keep in the front of my mind the saying, the source of which I do not know, that ‘being early is as good as being wrong’ when it comes to customer adoption.

I believe Jensen Huang, CEO of Nvidia, when he says ‘only 5% of Datacenters have been accelerated so far’.

I’ve noticed a lot of engagement from Jensen recently with sovereign states about AI. I won’t venture a guess at how quickly governments will move into building their own clusters.

It’s frustrating to try to predict customer adoption outside of large governments, much more so internationally🤯. But I’m not going to wait to follow the numbers on an emerging niche player when Nvidia’s numbers are what they are.

I guess this post was in place of my portfolio summary this month.

Tesla 33%
Nvidia 27%
Service Now 18%
Pure Storage 11%
Zscaler 6%
Cash 5%

27 Likes

Here’s an example: Say there’s a company making a thingamajig™ that connects to an Nvidia server rack and does something useful/better than anything else. Say that company was only selling one thingamajig for every 100 Nvidia racks. Now, maybe because of Blackwell, maybe because of Ultra Ethernet, maybe because of HBM4 - whatever - this company will sell 50x more thingamajigs than before. This company will easily 10X its revenue stream if it can ramp production and scale support. That’s the kind of “Nvidia adjacent” company I’m looking for.
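To put rough numbers on that hypothetical (all figures invented for illustration): even if thingamajigs were only ~18% of the company's revenue before the ramp, a 50x unit ramp would roughly 10X total revenue:

```python
# Toy revenue math for the hypothetical thingamajig company above.
thingamajig_rev = 18.0  # $M from thingamajigs before the ramp (made up)
other_rev = 82.0        # $M from everything else (made up)
unit_ramp = 50          # 1 per 100 racks -> 50 per 100 racks

old_total = thingamajig_rev + other_rev            # 100.0
new_total = thingamajig_rev * unit_ramp + other_rev  # 982.0

print(new_total / old_total)  # 9.82 -> roughly a 10X revenue stream
```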

Maybe Astera Labs is one such company?

33 Likes

Bugger me this …
I believe that the nature of the largest platform shift in human history is the New Industrial Revolution that is generative AI. Instead of consumable things, like what was cranked out of factories in the past, we are now cranking out tokens of intelligence. Because programming is now intention-driven, this is a shift to a platform that simply makes great companies more profitable.

Once an upstart develops ‘a better way’ of doing something, even if it’s completely disruptive, the goal is now public, and AI will generate the path for others to create it too.

In the case of AEHR, I believe Tesla was able to pretty much take what AEHR did and make it better/unnecessary.

Instead of looking for niche players that will undoubtedly be disintermediated by these resource-rich ‘great companies’, which now have ‘intention-driven programming’ available to them, I’m simply looking for a few great companies that themselves have unlimited and/or proven ability to scale with AI.

I do enjoy the hunt; however, we may be entering a time of unlimited abundance.

Again, timing is unknowable.

Jason

20 Likes

@WillO2028, I shared a link a few hours ago to an interview with Adam Foroughi, CEO of AppLovin. I urge you to read it. IMO, AppLovin is a strong candidate for one of those “great companies” which you are hunting.

16 Likes

Let’s not build the next Maginot Line, if you can forgive the mixing of the metaphors. It’s looking to me like the advent of AI is NOT like the advent of the Internet. At least, at the stage where we are now.

I grant you that the first stage – building out the infrastructure – looks very similar. I made great money at the time on Cisco, Cabletron, and others – and I’ve made great money on NVIDIA and good money on AMD as well (note: sold AMD back in May due to competition from chips from hyperscalers).

This next stage, as I see it, is all about software making use of AI. However, instead of looking for the next Google riding the AI wave, I’m seeing a bit of a different pattern taking shape. Yes, there will be new entrants like OpenAI and Anthropic that will probably do quite well – but they’re not publicly traded and probably won’t IPO for some time still.

The Internet and the corresponding wave of innovation brought the price of launching a new company and getting its first products to market down by 1-2 orders of magnitude. In contrast, it seems like the AI wave is making it 1-2 orders of magnitude more expensive to launch a new company and get first products to market. Anyone else noticing this as well? AI is damn expensive!

That means that in any given market, it’s the existing dominant players who have the best shot right now at capturing the value AI can bring. Not only can they afford the cost of the AI, they already have the data and they already have the customers.

Here are companies I like from this perspective:

  • Intuit (INTU): holds a dominant position in small business accounting with Quickbooks, and can readily charge 5-10X more than they charge today if they can deliver the goods for their customers;
  • Apple, Microsoft, Google: if agentic AI comes to pass, these are the companies that are positioned to win big;
  • Salesforce: they have so much customer data already, and it has incredible potential for their customers if the company can unlock it for them.

Note that these are all companies that have proven time and again that they know how to innovate successfully. A company can be well positioned in terms of capital, customers, and data, but still unable to capture the opportunity.

Because most of these companies already have large businesses, if they hit paydirt with AI, investors may not see the kinds of returns that this board looks for (although I do think it’s possible with Intuit).

I would also point out that, as quickly as it now seems Internet businesses took off, I brought my first Internet eCommerce product to market in 1996. Google didn’t even IPO until close to 10 years later, in 2004. Facebook? 2012.

It’s still early days.

29 Likes

That’s probably just typical early stage costs, which will drop over time.

Nvidia just announced the Jetson Orin Nano, an edge AI compute device with 6 ARM cores and 1024 CUDA cores - for $250.

Runs Llama and CUDA and should be great for robotics, drones, etc.

We agree the next step is software, but I think you’re overlooking the need for platforms on which AI apps run and development takes place. That could end up being open source, but maybe not.

And tech moves faster today than it did 20 years ago.

26 Likes

@Smorgasbord1 Sure, cost per unit of GPU compute will continue to fall precipitously from one generation to the next, but at the same time, demand will be rising for years to come (IMHO). There are certainly new entrants, nevertheless – but the cost of innovation in AI is much higher than it was for a classic Internet startup, by multiple orders of magnitude.

You raise the question of platforms – I’m wondering how you feel about Micron (MU). HBM is now > 50% of their revenue and they forecast the HBM market expanding six-fold by 2030. They are currently the leader in performance, and it’s not a market that invites new entrants.

2 Likes

I’m talking software platforms. Today nobody knows/cares what hardware is in the non-AI data center, all they care about is what APIs AWS or Azure are providing. How’s that going to sort out for AI applications, especially those provided AAS (As A Service) - AIAAS?

5 Likes