He who has the most data wins

I listened to a podcast this week with Cathie Wood of ARK Invest, and something she said resonated with me. She was talking about Tesla specifically and said that the reason she thinks they will win the self-driving car race is that they have (and will continue to have) significantly more data than any competitor. Even though their cars are not fully autonomous yet, Tesla is still getting tons of data from them, which it analyzes, learns from, and uses to improve. No other company has anywhere close to that much data, and no one will for a long time. She gave a similar argument for NVTA in the area of genomic sequencing.

It made me think about the companies discussed here and which ones have a similar “data advantage”.

CRWD is one that comes to mind. I’m not an expert, but from what I understand, since they have such a large lead in the type of cloud security they offer, anyone who tries to copy the idea and compete against them will struggle to deliver the same quality of product: starting from a much smaller market share means having significantly less data to work with. It’s the data that enables the product to be better.

Other similar examples are LVGO/TDOC and GH. GH especially has far more data than any competitor. Part of their moat is that they have over 100k blood samples from previous studies that they can mine for additional data as they move into additional tests. This can vastly accelerate their current and future research. The company refers to it as a “Force Multiplier”.


This idea has been popular in Silicon Valley for some time now. However, the counterpoint is that competitors are finding ways to develop synthetic data. While it may not be as good as the real thing, it can erode the competitive advantage that large-scale data access provides. I don’t know if you have seen it, but some self-driving car competitors put their cars on a conveyor belt and project simulated road footage onto a screen in front of the car. It’s a good example of synthetic data creation.

With $TSLA FSD specifically, I think the custom chips and software design capabilities will prove to be the real differentiators. Not access to data. It’s similar to Apple. Hardware design is very hard. There are only a handful of companies that can do it in the entire world.


I think the key is learning from behavior. Tesla’s systems already decide what they would do about everything they can detect happening around each vehicle. Every time a driver makes a decision that differs from what the system would have done, the system logs the event, and Tesla evaluates how successful the human’s choice was. If the humans are successful, the autonomous system adapts its behavior to include what humans do.

The scary part is from 1984’s Starman, with Jeff Bridges as an alien:
( https://www.imdb.com/title/tt0088172/characters/nm0000313 )
[Starman is driving the car and speeds through a light that has just turned red, causing crashes among the other motorists]

Starman : Okay?
Jenny Hayden : Okay? Are you crazy? You almost got us killed! You said you watched me, you said you knew the rules!
Starman : I do know the rules.
Jenny Hayden : Oh, for your information pal, that was a yellow light back there!
Starman : I watched you very carefully. Red light stop, green light go, yellow light go very fast.


Sorry - I forgot that TSLA is OT - so, please no more replies in relation to TSLA. Thanks

In regard to “simulated data” - thanks for sharing. I had not previously heard of that. I am not so sure that would apply to companies like CRWD, LVGO, or GH. I think it would be hard to simulate the real world data they are all getting.


I believe NVDA is big into this type of simulated data fwiw.



These are all techniques used to develop machine learning models.

Machine learning is all about having a huge set of training data. In the beginning you feed in data based on human assumptions about what is needed, but then you can continue to iterate: put the trained model through scenarios, correct its flaws, and thereby train it further over time.
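To make that loop concrete, here is a toy sketch with entirely made-up data and a deliberately simple nearest-neighbour “model”: train, run it through scenarios, fold every mistake back into the training set as a correction, and re-run.

```python
# Toy sketch of iterative model correction (hypothetical data and labels).
# A 1-nearest-neighbour "model" is run through scenarios, and every
# mistake is added back to the training set as a correction.

def predict(training, x):
    """Label x with the label of its nearest training example."""
    nearest = min(training, key=lambda ex: abs(ex[0] - x))
    return nearest[1]

def run_scenarios(training, scenarios):
    """Return (score, corrections) for a batch of labelled scenarios."""
    corrections = [(x, label) for x, label in scenarios
                   if predict(training, x) != label]
    score = 1 - len(corrections) / len(scenarios)
    return score, corrections

# Initial data based on a human assumption: small values are "safe".
training = [(1.0, "safe"), (9.0, "risky")]
scenarios = [(2.0, "safe"), (4.0, "risky"), (5.0, "risky"), (8.0, "risky")]

score, corrections = run_scenarios(training, scenarios)
training.extend(corrections)            # fold the flaws back into training
score_after, _ = run_scenarios(training, scenarios)
print(score, score_after)               # 0.5 then 1.0
```

The first pass gets half the scenarios wrong; after the corrections are folded in, the same model scores perfectly on them, which is the iterate-and-correct cycle in miniature.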

To train a model you need a lot of training data. That data can then go through all sorts of manipulations to adapt it to what is being trained. A simple example would be taking an image, converting it to black-and-white, and boosting the contrast to make the edges of shapes easier to recognize.
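As a toy illustration of that exact example (nothing to do with any real production pipeline), here is one row of a made-up RGB image converted to greyscale, contrast-boosted to pure black and white, then scanned for edges:

```python
# Toy sketch of the preprocessing step described above: greyscale,
# then hard-threshold contrast, then mark edges where brightness jumps
# between neighbouring pixels. One image row, made-up pixel values.

def to_grey(rgb_row):
    # Standard luminance weights for R, G, B.
    return [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in rgb_row]

def contrast(grey_row, threshold=128):
    # Hard-threshold contrast: every pixel becomes pure black or white.
    return [255 if p >= threshold else 0 for p in grey_row]

def edges(binary_row):
    # An edge exists wherever two adjacent pixels differ.
    return [int(binary_row[i] != binary_row[i + 1])
            for i in range(len(binary_row) - 1)]

row = [(200, 200, 200), (210, 205, 200), (30, 30, 30), (20, 25, 30)]
binary = contrast(to_grey(row))
print(edges(binary))   # [0, 1, 0] -- one edge, between light and dark
```

After the contrast step the subtle colour differences vanish, and the single boundary between the light half and the dark half stands out, which is the point of the manipulation.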

The data can also go through multiple models, where one trains another. The simulated data you guys are talking about is an option here. You could train a model that knows how to generate an endless variety of video of roads and highways, then use that first model to train the self-driving model. I’m oversimplifying, but the point is you can start with a subset of training data and use machine learning to extrapolate and generate even more training data.
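Here is a deliberately tiny sketch of that two-model idea, with made-up numbers: a “generator” fitted on a handful of real measurements produces unlimited synthetic samples, which then train a second, even simpler model.

```python
# Sketch of one model generating training data for another.
# All numbers are hypothetical. A "generator" fit on a few real lane-width
# measurements produces endless synthetic samples, which then train a
# simple plausibility-range "detector".
import random
import statistics

random.seed(0)  # deterministic for the example

# Step 1: fit a generator on a small set of real measurements (metres).
real_lane_widths = [3.5, 3.6, 3.4, 3.7, 3.5]
mu = statistics.mean(real_lane_widths)
sigma = statistics.stdev(real_lane_widths)

def generate(n):
    """Endless synthetic samples drawn from the fitted distribution."""
    return [random.gauss(mu, sigma) for _ in range(n)]

# Step 2: use the synthetic data to train a second model -- here, just
# learning a plausible range for "does this width look like a lane?".
synthetic = generate(10_000)
low, high = min(synthetic), max(synthetic)

def looks_like_lane(width):
    return low <= width <= high

print(looks_like_lane(3.5), looks_like_lane(9.0))
```

Five real measurements became ten thousand training samples. Real generators are far more sophisticated (video, not single numbers), but the extrapolation principle is the same.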

There are diminishing returns depending on the activity being analyzed. At some point, getting a little more data doesn’t change the success of the model enough to matter. Looking at Crowdstrike as an example, and keeping in mind that I’m assuming a lot of this since I don’t work there, they train their system to recognize attack vectors as huge amounts of data flow by. Anything detected, or any new attack found, strengthens the system. However, in the end it is the results that matter. The question becomes: is new data additive, or is it churn? At some point it may be more about the data staying current than about amassing more of it.
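The diminishing-returns point can be sketched with back-of-the-envelope math. For many simple estimation tasks, error shrinks like 1/sqrt(n) in the number of samples (this is the standard error of a mean estimate, used here purely as an illustration), so each extra order of magnitude of data buys less and less:

```python
# Back-of-the-envelope sketch of diminishing returns on data.
# The "model" is a mean estimate from n samples of unit variance,
# whose expected error is known to shrink like 1/sqrt(n).
import math

def expected_error(n):
    # Standard error of a mean estimate from n unit-variance samples.
    return 1 / math.sqrt(n)

for n in [100, 1_000, 10_000, 100_000]:
    gain = expected_error(n) - expected_error(n * 10)
    print(f"{n:>7} samples: error {expected_error(n):.4f}, "
          f"gain from 10x more data {gain:.4f}")
```

Going from 100 to 1,000 samples cuts the error by far more than going from 100,000 to 1,000,000 does, even though the second jump costs 900,000 additional samples. Past some scale, keeping the data current plausibly matters more than piling on more of it.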