AI Neocloud may not be as simple as we think

Roman Chernin, co-founder of Nebius, just shared a technical read of TractoAI, the end-to-end platform built by Nebius for AI and big data workloads.

While this article may be too technical for non-tech readers, I think it is a good illustration of how complicated a system it takes to support end-to-end, large-scale AI workloads.

The talent and technical experience Nebius inherited from Yandex are indeed a moat.

Long Nebius.

Luffy

18 Likes

Thanks for posting this. I’m far from a techie, but the very reason I first bought into Nebius earlier this year was because I was so impressed by Arkady and everything he said about his expert team of software engineers that came with him out of Yandex. It seemed to me that we were getting a very experienced team in the guise of a start-up, with low prices to match.

I’ve also felt, without wanting to hype this in any way, that the whole AI build-out is a once-in-a-lifetime opportunity. It happens once, and we’re in it now. And it’s very exciting to have a stake in it.

I was pretty demoralised with my investments after the SaaS blowout of ‘22, but when I first realised how big AI was going to be I just piled right in. And I still think it is far from over.

Best,

Jonathan

16 Likes

I skimmed through the paper, which is mostly a techie-oriented advertising blurb for the product, TractoAI, that Nebius is trying to sell. It throws around a lot of terms it expects readers to already know (MVP: Minimum Viable Product, OLTP: Online Transaction Processing, etc.).

One ironic aspect is that this is a software layer, so their touting of “bare metal” strikes me as odd. Offering the software package itself is an acknowledgment that many “users” don’t want bare metal, but want help with software infrastructure. Note that this locks customers into Nebius, so widespread adoption, at least at scale, might not happen unless it somehow proves itself to have huge advantages.

This is a few months old now, but I think this provides a decent overview of both Nebius and CoreWeave, showing both their similarities and differences. It is, of course, out of date given the recent Microsoft announcement for Nebius, but I still think it’s worth a read.

12 Likes

Another note of potential concern is that Nebius brags about how TractoAI is built upon the “open source” YTsaurus code base.

That brings up a couple of issues:
• Before it was open sourced, the code was leaked, along with a bunch of other Yandex code, in 2023. Note that Yandex was Nebius’ prior name, changed in 2024.

• The leaked code was embarrassing for Yandex:

I couldn’t find non-Russian/non-Yandex posts/blogs/comments about anyone using YTsaurus. I don’t know who is contributing to the project besides Yandex/Nebius.

As for the Microsoft deal, this explains it well I think:

Microsoft is being forced to rent AI capacity since it can’t build it out fast enough itself: “Faster than going solo.” MS has contracted with CoreWeave as well.

I’ll be surprised if Microsoft uses TractoAI rather than just the hardware (“bare metal”), which raises the question of what the value of TractoAI really is. Having lots of software developers at Nebius feels like a mismatch as the company becomes a bare-metal provider instead. And now, as the YouTube short states, financing for Nebius is not just “creative,” but “risky.”

6 Likes

Thanks so much for the feedback. I believe you are right that TractoAI, as well as most of the other software solutions developed by Nebius, won’t be used by Microsoft. But Microsoft and the other tech giants won’t be the only customers of AI neoclouds. I see the AI neoclouds providing software solutions as similar to MongoDB offering Atlas or Confluent providing managed Kafka in the cloud. While tech-native companies typically do not use these offerings, AI-native start-ups, medium-sized companies, or non-tech enterprises could see benefits from adopting these end-to-end solutions.

Although big tech (Meta, Microsoft, etc.) dominates AI computing demand right now, I believe we are still at a very early stage of the AI revolution, and this dynamic will change as AI adoption grows. Nebius’s CEO foresees that, by 2030, the primary AI computing demand will shift from model training to model inference as inference traffic grows 10X. I believe in this prediction. And, as this develops, an end-to-end software platform will become more and more valuable.

Luffy

7 Likes

Neither MongoDB nor Confluent is an end-to-end solution; they’re specialty products (MongoDB is a database and Kafka is a data stream processor), and both eventually offered those products as managed cloud services. The major cloud providers like AWS provide those services as well. So the question here is why a neo-cloud provider would be chosen over a traditional cloud provider that also offers managed AI services.

Similarly, TractoAI is not an end-to-end solution. It’s a platform on which AI/ML workflows can be implemented, by developers. You need technical expertise to use all of these. FWIW, an example of an end-to-end solution would be Samsara (IoT), which tracks your devices, like trucks.

I agree with the prediction that inference will require more compute than training, but that doesn’t mean the neo-clouds are going to win here. Amazon, Google, Microsoft, and even Oracle are not standing still. And the fact that Microsoft looks to the two public neo-clouds (CoreWeave and Nebius) for their bare metal, not their software, is, I believe, telling us that the demand is not long-lived.

3 Likes

From my past experience, I will lend some support to Luffy’s notion of who will and who most likely won’t use the software layer offered by the neo-cloud (I’m not familiar with the term) providers.

I worked at The Boeing Company in IT for 30 years. Most of my career was in software development. When I hired in (early ’80s), for the most part Boeing designed and built virtually all of its own mainstream engineering, manufacturing, and business applications.

It was a Big Blue (IBM) shop. Boeing had IBM mainframes (large centralized computers). Along with the mainframes and peripherals, Boeing had IBM middleware: operating system, job scheduling, teleprocessing, database, etc.

Boeing’s commercial airplane business grew out of its military business during WW2. During the war, Boeing delivered a plane an hour, 24 hours a day. Every plane was exactly like the plane in front of it and the plane behind it on the assembly line. The engineering description of the product was based on a drawing tree rather than a part tree (common practice for most manufacturing companies). The accounting system was based on 30-plane batches rather than individual airplanes. It all made sense at the time: they had one product and one customer. But once they entered the commercial airplane business, it made progressively less sense.

In the ’90s, Boeing management rightly decided that they were not an IT company (though at one time they did have an excursion into commercial IT product/service offerings). Management decided they didn’t need to employ nearly as many expensive IT heads as they did. There was a time when Boeing needed to develop its own applications. The business was (and still is) very complicated. They had a lot of unique requirements.

But it became clear that there were several companies that specialized in building and supporting large-scale engineering, manufacturing, and business applications. Commercial off-the-shelf (COTS) software had become available that addressed much (but not all) of Boeing’s requirements. And in some cases, Boeing management decided that it was easier to change the business process to fit the software than to continue their unique manner of doing business. Though the transition costs were very high, eventually COTS replaced almost all of Boeing’s internally developed applications. Even a lot of the software embedded in the airplanes is now developed by other companies and purchased by Boeing.

The point of this story is that few companies whose primary business isn’t software development actually develop their own software. The most likely scenario is that when Microsoft, Meta, Alphabet, etc., use neo-cloud hardware, they will undoubtedly use their own software. But other companies, even big companies like Boeing, will for the most part use the software layer that the neo-cloud vendors provide.

8 Likes

Nebius is the same as MongoDB or Confluent here in that it provides a specialized cloud product. By “end-to-end” solution, I mean it is end-to-end within the ML context. So I don’t mean that TractoAI will replace AWS or Azure; its position is to provide all the software needed specifically by AI workloads.

I’ll give some more concrete examples of what kind of infrastructure or tools a machine learning engineer may need, based on my experience. An AI system typically has two aspects:

  1. Model training:
    When training a model, first we’ll need to prepare the training data, which often requires data processing tasks: for example, collecting streaming data (user action logs), joining feature data (a.k.a. model inputs) with label data (a.k.a. the answers for the predicted target in supervised learning), or doing data cleaning to filter out invalid records. This step usually needs a big data processing engine (e.g. Spark for batch data, Flink for real-time data), as well as a scheduling system that manages task execution while respecting the dependencies between tasks.
    Second, we need a runtime environment with common ML frameworks, e.g. TensorFlow or PyTorch, so that developers can write and execute model code.
    Third, model training requires doing matrix calculations over all the model parameters and then updating those parameters; what if the model is too big to fit into a single machine’s memory? We’ll need a distributed system to support training large models.
    Fourth, we need to be smart about managing GPU resources and may even need to optimize our model code so that it utilizes the GPUs efficiently. This step may require looking into both the hardware and the model code, but a built-in software layer from the GPU clouds could solve these challenges for us, possibly for free.
    Lastly, after training is complete, we need tooling to visualize a model’s metrics and evaluate its performance offline. We’ll also need tools to compare new models with a baseline model. We only bring a model online to run A/B tests after we see metric gains from offline comparisons.

  2. Model serving (inference):
    Inference basically means giving the model input data (features) and letting it give us predicted results.
    Here, first, we need somewhere to deploy the model and bring it online. For some use cases, where the model is continuously trained on new data, we’ll want to automate the process of syncing model updates online.
    Second, the inference service must be able to scale horizontally to support large traffic, depending on the use case. The service must have very high reliability and availability if it supports real-world products.
    Third, the model inference engine needs to be optimized to use CPUs and GPUs efficiently. GPUs are good at matrix calculations, but there is also other compute-intensive work during inference that is better done on CPUs.
    Fourth, we’ll need monitoring systems to keep track of the health of the model and the inference service.
    Lastly, if, due to compliance requirements, user data cannot be sent to centralized servers, we’ll sometimes have to deploy the model on edge servers, so the system will also need to support serverless architectures.
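The two halves above (train, then serve) can be sketched in a few lines of plain Python. To be clear, this is a toy stand-in of my own, not anything TractoAI actually provides: a hypothetical linear model fitted by gradient descent, with the data-cleaning, training, and inference steps reduced to one function each.

```python
# Toy stand-in for the train-then-serve split described above. A real pipeline
# would use Spark/Flink for data prep, PyTorch/TensorFlow for training, and a
# dedicated serving stack with autoscaling and monitoring; all names below are
# illustrative.

def clean(rows):
    """Data prep: filter out invalid records (the 'data cleaning' step)."""
    return [(x, y) for x, y in rows if y is not None]

def train(rows, lr=0.05, epochs=200):
    """Fit y ~ w*x + b by stochastic gradient descent (the 'training' half):
    a forward pass, an error, and a parameter update on every example."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in rows:
            err = (w * x + b) - y   # prediction error on this example
            w -= lr * err * x       # update the parameters --
            b -= lr * err           # training is exactly this loop, at scale
    return w, b

def predict(params, x):
    """The 'serving' half: a forward pass only, no parameter updates."""
    w, b = params
    return w * x + b

# Raw "logs" with one invalid record; the underlying relation is y = 2x + 1.
raw = [(1.0, 3.0), (2.0, 5.0), (3.0, None), (3.0, 7.0)]
params = train(clean(raw))
print(predict(params, 4.0))  # should land close to 9.0
```

The platform discussion above is about everything this toy hides: running `train` across many GPUs when the parameters don’t fit on one machine, scheduling the data prep jobs with their dependencies, and keeping `predict` fast and highly available behind an autoscaled, monitored service.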

So, everything here is what I think can be offered by AI neoclouds. Sure, I understand that companies can build everything on their own if they have the developers, but that would not be an efficient use of engineering resources for many companies. And all of these features can certainly be provided by a general cloud as well, but there are two reasons I think neoclouds could win market share: first, neoclouds are more specialized and may be able to offer superior software in their areas of focus (similar to Confluent and MongoDB); second, vendor consolidation. If a company has to use the GPUs from a neocloud anyway, why would it put up with the operational overhead of using a scheduler or resource management system from another vendor?

Luffy

7 Likes

Sorry, but I disagree.

First, none of these products is an end-to-end solution, nor do they “provide every software needed.” Mongo is a database. A database itself is not a solution; it’s a component of a solution. Kafka is a data streaming platform, which again is not a solution by itself, but a potential component. TractoAI is also not a solution in and of itself. It is a platform on which a solution could be built. For instance, TractoAI is one way to run an LLM. Nebius touts it as providing “support for all major libraries like PyTorch, Hugging Face, Nanogpt.” Those libraries are used by other code, and that code is the actual application solution.

Second, Confluent is based on a true open-source standard, Apache Kafka. Kafka isn’t the easiest to set up and use, and that’s how Confluent got started. They’ve since added additional features and capabilities. MongoDB turned itself into an open-source project, with the savvy caveat that it was an older version of its NoSQL database that was open sourced. Amazon hosts this open-source version for those that want it. Nebius’s TractoAI is not open source. It claims to be based on an open-source project called YTsaurus, but since nobody else is using YTsaurus, the open-source nature of that is questionable. IOW, as if anyone else cares. And solutions built on TractoAI are locked into Nebius.

Who’s using TractoAI today?

And that’s my point. The neo-clouds are competing with the big boys of cloud computing, who aren’t dumb, aren’t sitting still, and aren’t overlooking the opportunity. A company using MongoDB or Confluent is using true open source software and so could potentially move to other cloud platforms with only a reasonable amount of work. However, a company using TractoAI, which is not a standard, will be locked in to that platform, or have to do a lot of rewriting.

I don’t know why a company would have to use the GPUs from a neo-cloud. There are other options available. Right now the biggest advantage of the neo-clouds is price, but these companies are losing money at that pricing. There’s also the fact that the neo-clouds tend to have higher-grade servers, but it’s unclear whether the companies needing that horsepower are actually neo-cloud customers, as they tend to build their own data centers.

The Microsoft deal moves Nebius’ main business to being a data center provider. It’s literally a different company now than it was a month ago. Microsoft is where the majority of their revenue will be coming from for the next few years. In this new business, Nebius’ software probably isn’t used and isn’t relevant. How does Nebius adapt?

12 Likes

Here’s the answer from Grok AI:

TractoAI, an AI data lakehouse platform built on Nebius AI Cloud for data preparation, distributed training, and custom AI workflows, has secured several high-profile clients early in its development. These include:

  • JetBrains: Uses TractoAI as a primary data lake for AI-related workloads.

  • Mellum: Leverages the platform for big data and AI challenges, including model training.

  • PleIAs.fr (Pleias): Relied on TractoAI to train a 1B-parameter open-source language model, benefiting from its distributed training capabilities.

  • SynthLabs: A startup focused on AI post-training and reasoning models; they used TractoAI’s serverless platform for large-scale evaluations involving hundreds of thousands of GPU inference calls, finding it superior to alternatives like Azure and Lambda Labs.

  • Unrealme.ai: Employs TractoAI for fine-tuning text-to-image models in their generative AI video app, integrating it directly into their backend for speed and flexibility.

These customers span software development, AI research, and generative media sectors, highlighting TractoAI’s role in enabling scalable AI innovation. As a newer initiative within Nebius Group, its client base is growing, with emphasis on LLM pre-training and custom model building.

In addition, the following are the notable customers of the other software solutions from Nebius:

Nebius’s AI software solutions, beyond TractoAI, include Nebius AI Studio (Inference-as-a-Service for open-source models), Managed MLflow (for ML lifecycle tracking), Managed Apache Spark (for big data processing), and Nebius Life Science (AI tools for healthcare and biotech). As a rapidly growing AI infrastructure provider, Nebius has a developing ecosystem of customers, primarily startups, research labs, and enterprises in AI, healthcare, and data processing. Publicly documented notable customers are drawn from case studies, announcements, and partnerships as of October 2025. Specific product attributions are noted where available; some span multiple tools.

Nebius AI Studio

This platform powers low-latency inference for models like Llama and Mistral, with per-token pricing. Notable users include:

  • Chatfuel: Leading AI-powered customer engagement platform; uses a cascade of Llama-405B models via AI Studio for chatbot agents, achieving better response quality and speed.

  • vLLM: Open-source LLM inference framework; tests and optimizes inference on AI Studio for high-performance, low-cost model serving in production.

  • TrialHub: Clinical trial analytics firm; leverages RAG-optimized LLMs and semantic search on AI Studio to build a 250-million vector database in days for insights from 80,000+ medical sources.

Managed MLflow

A managed service for experiment tracking and model management, often integrated into broader ML pipelines. It’s newer (GA in March 2025) and primarily used by in-house teams and early adopters, with limited public case studies. No standalone notable external customers are publicly detailed, but it’s adopted by clients building MLOps workflows alongside other Nebius tools (e.g., AI Studio users like TrialHub for tracking fine-tuning runs).

Managed Apache Spark

Serverless data processing for ETL and feature engineering in AI workflows. Launched in late 2024, it’s integrated into data-heavy AI projects but lacks specific public customer stories. It’s used by clients processing large datasets for training, such as those in life sciences (e.g., Converge Bio’s single-cell analysis pipelines), though not explicitly attributed.

Nebius Life Science

Domain-specific AI platform for bioinformatics, drug discovery, and precision medicine, including OpenBioLLM access via AI Studio. Notable customers and awardees include:

  • Converge Bio: Biotech startup; trains full-transcriptome foundation models (Converge-SC) on Nebius infrastructure for single-cell RNA sequencing and patient-level insights in drug discovery.

  • CRISPR-GPT (Stanford, Princeton, Google DeepMind collaboration): LLM agent system for automating gene-editing experiments; uses Nebius for CRISPR selection, RNA design, and data analysis.

  • Ataraxis AI: Cancer prediction platform; awarded $100K GPU credits; achieves 30% higher accuracy than genomic tests in trials.

  • Aikium: Protein targeting for “undruggable” diseases like cancer/Alzheimer’s; awarded $100K credits for Yotta-ML² platform.

  • Transcripta Bio: Transcriptomic mapping and disease-gene associations; partnered with Microsoft Research using Nebius credits.

And that’s my point. The neo-clouds are competing with the big boys of cloud computing, who aren’t dumb, aren’t sitting still, and aren’t overlooking the opportunity. A company using MongoDB or Confluent is using true open source software and so could potentially move to other cloud platforms with only a reasonable amount of work. However, a company using TractoAI, which is not a standard, will be locked in to that platform, or have to do a lot of rewriting.

I’m not sure how big the migration cost is for AI workloads across platforms, though I don’t think it’s anywhere close to migrating an entire web application from self-operated data centers to AWS. If the switching cost is indeed huge, the same risk holds if a company locks in with a general cloud as well. I don’t have concrete evidence, but I don’t think it’s uncommon for a specialized cloud player to beat the general clouds in its area of focus. It’s too early to assume that neoclouds won’t beat the general clouds in the AI infrastructure field.

As Nebius disclosed in its Q2 earnings call, Shopify and Cloudflare (two familiar names on this board) both became Nebius customers. And they both use parts of Nebius’s software layer offering rather than just the bare metal.

According to Gemini:
Products used by Shopify

Shopify uses Nebius to power AI features for its e-commerce platform and to diversify its GPU capacity.

  • AI Infrastructure: Nebius provides scalable, high-performance GPU clusters that Shopify uses to run large, multi-node AI jobs.

  • AI Feature Integration: Shopify is implementing AI-powered features for merchants, such as enhanced product recommendations and other merchant tools, using Nebius’s cloud platform.

  • Data Labeling (Toloka): By using data from Nebius’s data labeling subsidiary, Toloka, Shopify can refine its machine learning models to improve the entire merchant and customer experience.

  • Multicloud Strategy: Shopify uses Nebius alongside other cloud providers, like Google Cloud, and the tool SkyPilot to manage its AI infrastructure across multiple clouds.

Products used by Cloudflare

Cloudflare primarily utilizes Nebius for high-speed AI inference at the edge of its global network.

  • Edge Inference: Nebius helps Cloudflare deploy powerful AI inference, which is the process of using a trained model to make a prediction. Cloudflare runs this inference at the edge of its network to improve the speed and performance of its offerings for customers.

  • AI Integration: The partnership allows Cloudflare to integrate AI capabilities across its product portfolio, enhancing security and speed.

  • Data Labeling (Toloka): Cloudflare leverages Toloka to train and refine the AI models that it deploys at the edge of its network.

The Microsoft deal moves Nebius’ main business to being a data center provider. It’s literally a different company now than it was a month ago. Microsoft is where the majority of their revenue will be coming from for the next few years.

This is probably a legit assumption, but it’s only based on what the company has done, not on what the company is positioned to achieve in the future. Since you also agree that inference demand will grow bigger than training demand, do you think that inference demand will mostly come from Microsoft and the other big tech companies? I believe AI demand will surge in every industry and every company. So I’m certain that bare metal won’t be everything for AI neoclouds; there will be huge demand for the software layer as well. The adoption by Shopify and Cloudflare is just an early example.

So the question comes back to which companies, if any, will choose Nebius, another neo-cloud, or a general cloud for their end-to-end AI workloads. While I don’t think the answer is clear yet, I’m optimistic about Nebius based on their current execution.

Luffy

15 Likes

I had a long answer to your post, but decided that the main relevant points are:

• Nebius’ business is completely changing. Microsoft is their 800-lb gorilla customer. No other customer matters if Nebius isn’t successful with Microsoft’s needs.

• Chances are Microsoft is not using, nor providing to its customers, any of Nebius’ software. Rather, all reports indicate Microsoft is simply renting bare-metal data center capacity from Nebius, just as OpenAI is doing with CoreWeave. Do you expect Microsoft to be offering TractoAI via Azure, or under any Microsoft-branded AI cloud umbrella?
I don’t - not at all.

• The Microsoft-Nebius contract shifts the financial risk of building and managing the data centers to Nebius, allowing Microsoft to avoid massive upfront capital expenditure. Since Nebius doesn’t have the capital to build this data center (in Vineland, NJ), it’s going to have to raise capital to build it. The financial risk for Nebius in doing this was described in the short YT video I linked above.

• While this is a contract, and so guaranteed as long as Nebius holds up its end, there is still customer concentration risk here. Nebius has to deliver more than 100K fully operational GB300s on time or Microsoft can terminate the deal. Microsoft is playing the field here:

• This deal was larger than Nebius’ entire market cap at the time it was announced. It’s the only thing that matters to Nebius’ management team, and to investors. If things don’t go quite as smoothly as planned and Nebius looks to cut costs as it focuses on the data center build, where do you think they’d cut? I’d say the team writing and supporting the software that only a few tiny private startups are using.

The bottom line for me is that Nebius is now a data center builder, not an AI software provider. Management probably thinks it can do both but when push comes to shove, it’s obvious to me what gets pushed out.

12 Likes

I 100% agree with this.

It also makes me wonder how much of NBIS’s projected capacity is needed simply to fulfill its MSFT commitment rather than service new or additional customers. It’s certainly not a small percentage.

Regardless, NBIS will be worth much more in the future if it can pull this off. And its success in the short to medium-term will be much more heavily tied to its hardware and capacity efforts than anything to do with software. This is also why any revenues from NBIS’s data development or Avride robot services will pale in comparison to its data center dollars. NBIS may want to present itself as a multi-trick pony, but only one trick really matters for the foreseeable future.

8 Likes

I’m far from a techie, but the very reason I first bought into Nebius earlier this year was because I was so impressed by Arkady and everything he said about his expert team of software engineers that came with him out of Yandex. It seemed to me that we were getting a very experienced team in the guise of a start-up, with low prices to match.

I’m legitimately trying to understand what it is that Nebius engineers are even working on.

Yes, they inherited a bunch of software engineers from Yandex, but competitors such as IREN are saying customers are demanding bare metal. Nebius seems to be building a software layer around the machines, but does Microsoft even need that?

The New Jersey data center build-out for Nebius is being done by a company called DataOne. Here’s what we know:

  • The company is already two months behind schedule on the build-out, with no guarantee it will finish in time even with the two-month cushion
  • At first they deleted the story on LinkedIn about the delay, but it turns out it’s true
  • The CEO at DataOne is posting about how much money you can make with Nebius stock, “He just had to invest in Nebius and he would have tripled his money”


Lastly, I will ask: why are we on this board giving a free pass to a Russian company that rebranded as a European company just in 2024? Historically this board would not even consider names like this. We know this adds a massive amount of accounting risk to the overall investment. Here is Saul’s commentary on that from the Knowledge Base,

9 Likes

Seeing the deals Microsoft has done not just with Nebius, but also with CoreWeave, Nscale, Lambda, and even Oracle, makes me think that these neo-clouds are sacrificing their future for present dollars. It’s not unusual, but can they recover from it?

Going to another business, in its early days Tesla sold electric drivetrains to Toyota and Mercedes. As part of the deal, those OEMs invested in Tesla directly and got stock (which they’ve since sold at a profit). But, Tesla didn’t want to become just an electric drivetrain provider and worked hard on its Model S vehicle to differentiate it from all other offerings, including the Toyota and Mercedes vehicles with Tesla drivetrains. And Tesla stopped doing those deals as soon as it could.

So my question is whether Nebius and CoreWeave can do both. Can they sell the guts of their systems to the legacy providers and still have enough capability and wherewithal to develop their own superior AI cloud offerings? The Nebius and CoreWeave deals have them literally competing against their own capacity, resold by companies with huge clout, market share, brand awareness, etc.

I’m afraid the big players are simply getting bigger on the backs of smaller companies hungry for dollars and near-term profitability. It raises the question of at what point they can pivot, and whether customers will care enough to switch, or even to adopt these companies for new workflows. In the meantime, the legacy cloud providers are getting more capacity at less risk and with less capital outlay.

Right now the neoclouds have the pricing of high-end server capacity as one of their customer attractions. But they are literally selling that advantage to the legacy cloud providers for up-front money. What happens in a few years when the legacy providers have all this high-end capacity and can simply out-price the neoclouds themselves? What advantages will the neoclouds have left?

Sure, there’s a potential that the neoclouds will provide better software infrastructure. But being distracted by building out server capacity for their literal competition, not to mention the dollars it requires, is going to hurt their chances of developing that better software infrastructure. It’s going to be really hard to pull off.

12 Likes

Nebius is a European company based in the Netherlands. Yes, it came out of Russia, as did Arkady and their engineers. Had it not been for the war in Ukraine, Nebius would never have been founded as it now is. But it has no links to Russia today. It operates out of Europe (which is a plus for European AI companies who may prefer to go with Nebius rather than the US-based CoreWeave). It operates in the US and is listed on NASDAQ.

I think the quote from Saul’s Knowledge Base is about not investing in China due to potential accounting issues. Nebius does not, in my opinion, fall into the same category as a “dodgy” foreign stock.

Best,

Jonathan

22 Likes