Synthetic Data for the Win

For years, I was a believer that Autonomy needed billions of miles of data for AI training. I no longer believe that to be the case.

Here’s a video from a channel on movie effects:

If you don’t know about green screens, you should watch it from the beginning, but essentially what Hollywood often does is shoot actors in front of a green screen, then use computers to replace the green with a computer-generated background. The problem is at the edges of the subjects: with things like hair and motion blur, individual pixels will be neither entirely green nor entirely subject, and the computer’s “replace color” algorithm doesn’t know what to do with somewhat-green pixels. Are they part of the subject (tree green, not screen green), or part subject and part background (which would mean partial transparency)?
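To make the edge-pixel problem concrete, here’s a tiny sketch of a naive “replace color” keyer (the colors and tolerance are invented for illustration):

```python
import numpy as np

def naive_chroma_key(pixel, screen_green, tol=60.0):
    """Classify a pixel as background if its RGB value is close enough
    to the screen green, otherwise keep it as subject.
    This all-or-nothing decision is exactly what fails at edges."""
    dist = np.linalg.norm(np.asarray(pixel, float) - np.asarray(screen_green, float))
    return "background" if dist < tol else "subject"

screen = (0, 200, 0)                 # pure screen green
hair = (120, 90, 40)                 # brown hair, fully subject
# An edge pixel that is half hair, half screen (think a single strand):
edge = tuple(0.5 * np.array(hair) + 0.5 * np.array(screen))

print(naive_chroma_key(screen, screen))  # background
print(naive_chroma_key(hair, screen))    # subject
print(naive_chroma_key(edge, screen))    # subject -- but it's really 50% of each
```

The mixed pixel gets forced wholly into one bucket, which is why hair and motion blur look wrong with the traditional approach.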

What they decided to do (and the head of Avatar’s special effects company didn’t think it would work) was to use a commonly available rendering engine to render realistic scenes against a green screen. Since they know exactly what is subject and what is background screen, they were able to train an AI accordingly.

Well, not that many 18-hour training sessions later (and remember this channel doesn’t have millions of dollars or access to thousands of Nvidia chips), they were successful. It’s open source, so hopefully more people in the industry will pick this up and run with it.

My point in posting here is that they didn’t need billions of hours of actual footage shot in front of a green screen, they didn’t even need millions of hours - and I infer that they didn’t even have tens of thousands of hours. Because they knew where the problem areas were (hair, swords, glasses of water, veils, etc.) they were able to concentrate their synthetic data creation accordingly and be efficient in training.
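Because the scenes are rendered, every training example comes with perfect labels for free. A hypothetical sketch of what one synthetic training pair might look like (names, shapes, and the random “subject” are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)
GREEN = np.array([0.0, 0.8, 0.0])        # the virtual screen color

def make_training_pair(h=8, w=8):
    """Fabricate a 'hard' foreground (soft alpha coverage, like hair or
    motion blur), composite it over a green screen, and keep the
    known-perfect targets alongside the camera-style input."""
    fg = rng.random((h, w, 3))           # true subject colors
    alpha = rng.random((h, w, 1))        # true coverage, 0..1 (the hard part)
    shot = alpha * fg + (1 - alpha) * GREEN   # what a camera would record
    return shot, fg, alpha               # network input + exact labels

shot, fg, alpha = make_training_pair()
# Since we did the compositing ourselves, the labels need no hand tweaking,
# and we can bias generation toward hair, veils, glasses of water, etc.
```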

I think this is probably what Nvidia is doing for their AV stack, going into Mercedes, Nuro, and even Lucid in the future. They don’t need 10 Billion miles like Elon claimed; they just need the right set of, say, ten thousand synthetically generated miles, which Nvidia can generate and train on in probably 10 minutes given their resources.

Tesla pioneered E2E autonomy back in late 2022 (according to Walter Isaacson’s book). Mobileye’s CEO wrote a blog post saying it was garbage. Waymo adopted some E2E and so did Nvidia (with less fanfare). And now we see Nvidia’s demo during CES just a couple months ago doing really well, especially considering how late they started.

I do believe Tesla is getting there, and is probably ahead in terms of driving capabilities and ability to expand, but I don’t believe Waymo will fall further behind and I believe Nvidia might be catching up. Waymo and Nvidia have the problem that they’re not automotive manufacturers and so physically scaling will be harder for them, but Waymo has Hyundai building cars, and OEMs can adopt Nvidia’s system relatively easily (and cheaply, it’s open source) now.

Nvidia may be making things more difficult for themselves by adding LiDAR, but I suspect OEM pressure requires them to do so even though it’s not necessary and does little more than slow the creation of synthetic data down while increasing training effort and time.

But, my main point here is that the only issue with synthetic data (at least for cameras) is knowing what to simulate. And while there are literally an infinite number of situations that can crop up, I think that coming up with a complete-enough set of situations to render for training is easier/faster than just brute-force capturing 10 Billion miles.

In short, I believe Elon’s wrong, but that doesn’t mean Tesla won’t have other advantages with autonomy.


On a separate note, there was an open NHTSA meeting yesterday:
https://www.nhtsa.gov/events/av-public-meeting-2026

Zoox and Waymo presented; I think Tesla was in attendance.

They discussed changes in FMVSS for AVs. News article here:

It’s now open for public comment. The goal is to let these companies have vehicles on public roads without manual controls. Zoox already has a waiver for around 2500 vehicles (something NHTSA has done in the past for developmental vehicles), but all 3 companies want to put many more vehicles than that on the roads.

I know someone who attended the meeting and spoke with a Tesla attendee, who told him that the Cybercab should directly conform to the updated FMVSS.


The problem with your argument is that it ignores timelines. When Elon made that statement, synthetic data was not yet available, and the only way to get edge cases was to accumulate mostly useless additional data; Elon said as much some time ago. Once simulation technology became available (Nvidia’s photorealistic image generation), that problem was solved, and Tesla did start using simulation.

The Captain

No, he made that statement only 2 months ago:

AFTER Nvidia/Mercedes showed their system working in San Francisco.

For those detail-oriented, here’s Nvidia’s page on how you get started:

310k data clips overall, 163k of them with coordinated LiDAR and radar simulations, too.
One example graphic:

To be clear, I don’t think those are enough, but they can get you going and they give you the simulation system code so you can create your own clips for training, which shouldn’t be a big deal to an OEM serious about autonomy.


I missed that. Elon is no dummy. Of course he is right if one ignores synthetic data but the statement does not mean Tesla is not using simulation.

The Captain

We know Tesla is using simulation - we’ve seen demos back in the Karpathy days and even just recently from Ashok.

The question is why Elon is insisting that 10 billion real miles of training data are needed. Because if it’s not true, Nvidia is catching up quick:

and

That last tweet is quite telling: Elon changes the claim from competitors being behind on the AI software/training part to their being behind on the building-the-AI-hardware-into-vehicles part.

His “several years” is going to be shown as false, IMO. Apparently, Elon’s just as bad with competing companies’ timelines as he is with his own, except for his bias, lol.

Isn’t one of the limitations of simulated data that one has to know about the problem to simulate it? I.e., rare events may be missing from the training data because one didn’t think to simulate them.


Maybe what’s needed now is not the video data but the performance data of the vehicles that are doing the observing and reacting? Concocting visual inputs of a developing situation is really not everything required to simulate, right? Surfaces and weather that affect vehicle performance aren’t constant over the trip. Plus, some events are equipment failures that greatly compromise the vehicle, like tire punctures. [edited to remove redundant “failures”]

It depends on the meaning of “needed.” Without synthetic data it is really, physically needed. With synthetic data it is nominally needed, the synthetic data being the shortcut. In other words, don’t take it so literally.

Acknowledging Musk’s ability to produce cash flow I doubt he would spend cash needlessly.

Just my take. You could ask on “X.”

The Captain

This. “Edge cases” are, by definition, those things that happen infrequently, sometimes super-infrequently. If a programmer doesn’t know it ever happened - or can’t imagine it happening - then you can’t use synthetic data to include expectations of it ever happening.

And the “green screen” analogy fails, I think, because it’s two dimensions versus three, it’s static versus dynamic (on the road in real time), and it’s relatively slow versus how a car must be piloted with quick reaction times. As an all-too-obvious point of reference, it can take even the best CGI processors (banks of them) hundreds of hours to complete a single frame of Avatar, including 10-50 hours just to set up the parameters before the hundreds more hours of processing. (No, that’s not a direct parallel, and that’s because there is no perfect parallel because this stuff is being invented on the fly.) But while “defining pixels” is important - especially for a system which has only pixels, like Tesla’s - it’s pretty deep in the weeds before this translates to “actionable programming” that turns up on the roadways, I think.

Sure, and that was my view previously, but no longer. I’ve seen many examples recently of AIs doing things pretty outside of what they’ve been trained to do. I don’t, for instance, think you need to show an Autonomy AI more than a few dozen overturned vehicles for it to recognize different types of vehicles being overturned at different angles. Think of the B-52s’ dog named “Quiche Lorraine” - I bet even dyed dark green most AIs would still recognize it as a dog.

That’s a good point - I wonder what any AV does with a tire blow-out, etc.

Those are EXACTLY THE SAME!!!

Both are real-world 3D situations captured by 2D cameras, and the images have to be interpreted back in terms of their original 3D counterparts. With autonomy, you’ve got to recognize the 3D scene, and with the movies, you want to recognize the subject separate from the background. In both cases the 3D world has been flattened into 2D, which is EXACTLY what makes the problem hard.

And both are dynamic - the car is moving, and with movies, the subject is often moving. The subject’s motion creates all sorts of difficulties, from motion blur to what’s obscuring what in an ever-changing fashion, since static movies aren’t very interesting to watch.
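A tiny sketch of why the flattening matters (this is just standard pinhole-camera math, not specific to either system):

```python
def project(point_3d, f=1.0):
    """Pinhole projection: a 3D point (x, y, z) maps to the 2D image
    point (f*x/z, f*y/z). The depth z is divided out, i.e. lost."""
    x, y, z = point_3d
    return (f * x / z, f * y / z)

# Two different 3D points along the same viewing ray land on the same pixel,
# so recovering the 3D scene from pixels alone is inherently ambiguous:
print(project((1.0, 2.0, 4.0)))   # (0.25, 0.5)
print(project((2.0, 4.0, 8.0)))   # (0.25, 0.5)
```

Both problems amount to undoing that information loss in one way or another.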

It appears you’re confusing scene rendering with the matting of moving subjects on backgrounds. These are completely different effects, using completely different techniques.

One eye sees in 2D. Two eyes see in 2+D. How about 8 cameras?

The Captain

Sure, but you don’t train on ONLY simulated data. But look at it another way… you know of a given edge case, but after slogging through billions of miles of real data you never see any instances of it. For example, I had a bike on a roof rack fall off just in the lane next to me. How many instances of that do you think they’ve collected? A simulator can produce dozens (or thousands) of similar cases, and produce instances of them at all hours of the day, under all sorts of weather conditions, etc.
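That kind of sweep is cheap to express. A hypothetical sketch (the event name and condition lists are made up):

```python
import itertools

# One known edge case, varied across conditions a simulator could render
# on demand -- every combination becomes a fresh training scenario.
times = ["dawn", "noon", "dusk", "night"]
weathers = ["clear", "rain", "fog", "snow"]
lanes = ["left", "same", "right"]         # where the bike drops, relative to ego
speeds = [25, 45, 65]                     # ego vehicle speed, mph

scenarios = [
    {"event": "bike_falls_from_roof_rack",
     "time": t, "weather": w, "drop_lane": l, "ego_mph": s}
    for t, w, l, s in itertools.product(times, weathers, lanes, speeds)
]
print(len(scenarios))   # 144 variants of one rare event, from four short lists
```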

Mike

For autonomy, you’re right. But, that’s because the output is driving control: steering angle, accelerator pedal angle and brake pedal angle (basically). And so you can have human-supplied outputs for the reinforcement learning. When Tesla had to modify FSD to satisfy NHTSA for stop signs, they found that only 1 out of about 200 stop sign encounters they had captured was absolutely correct, since almost no one actually does a full and complete stop, and does so behind the painted line, creeping forward afterwards for visibility.

In these cases, it might be simpler to train on synthetic data where you have the car behave exactly correctly. But, you can still curate real data to extract perfect driving and train on those.
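That curation step is mechanical once you define “correct.” A hypothetical sketch over captured telemetry (the field names and toy clips are invented):

```python
def is_perfect_stop(clip):
    """Keep a clip only if the car reached a full stop (speed 0) while
    still behind the painted stop line (positive distance to it)."""
    stopped = [s for s in clip["samples"] if s["speed_mph"] == 0.0]
    return any(s["dist_to_line_ft"] > 0.0 for s in stopped)

clips = [
    # Rolling stop: slows to 2 mph and drifts past the line -- rejected.
    {"samples": [{"speed_mph": 3.0, "dist_to_line_ft": 2.0},
                 {"speed_mph": 2.0, "dist_to_line_ft": -1.0}]},
    # Full stop behind the line, then proceeds -- kept for training.
    {"samples": [{"speed_mph": 0.0, "dist_to_line_ft": 1.5},
                 {"speed_mph": 4.0, "dist_to_line_ft": -3.0}]},
]
perfect = [c for c in clips if is_perfect_stop(c)]
print(len(perfect), "of", len(clips))   # 1 of 2
```

In real data the kept fraction was more like 1 in 200, which is exactly why synthetic clips, where the behavior is correct by construction, are attractive.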

However, for the green screen case, the outputs aren’t so simple. The output is a series of image pairs: one containing the subject image with all traces of green screen removed, the other is an alpha channel matte. The programs used to generate these pairs today have some tools to help automate the process, but as the video demonstrates, they fail at some common situations.

One could, I suppose, hand tweak a green screen image sequence to be perfect and train on that and others, but that hand tweaking is really hard at edges and motion as demonstrated. It’s actually much faster and even better to use synthetic data exclusively.

Remember, since the screen is green, the true color of subject edge pixels in the resulting subject-only image will have to be modified to have the partial green subtracted out, which is almost never done properly since it requires subject interpretation. This is why you see too-hard edges on some green screen effects, which they fix by shooting at higher resolutions and then scaling down.

Let’s take a simple example: a single static image of a brown wood desk in front of a green screen. If you look closely at the edges of the desk, you’ll see pixels that aren’t just the desk brown but also not just the screen green. This is easier to see with low-res cameras that produce large pixels, but it’s there in all digital camera outputs. What you really want is for that pixel to be rendered as the full subject color at that pixel, and then have the corresponding pixel in the matte carry some partial transparency value.

But in the real world, that pixel is some combination of brown and green depending on where the desk edge falls within that pixel. It might be mostly desk or mostly screen or perhaps half and half. You need to remove the green tint to restore the desk color at that pixel by subtracting out the green screen’s percentage contribution, but in the real world you don’t know that percentage, so what you’ve done is make an approximate guess. And that’s why humans have to tweak the outputs.

But, with synthetic data, you get to feed the AI the exactly correct subject and matte images. And you can generate these way faster than you can hand tweak a real world image, and the real world image result by hand will never be 100% correct.
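The desk example is just the standard compositing equation, camera = alpha * subject + (1 - alpha) * screen, run in both directions. A small sketch with made-up colors and coverage:

```python
import numpy as np

GREEN = np.array([0.0, 0.8, 0.0])       # screen color
DESK = np.array([0.45, 0.30, 0.15])     # true desk brown at the edge pixel

alpha = 0.7                             # true coverage: 70% desk, 30% screen
camera = alpha * DESK + (1 - alpha) * GREEN   # the mixed pixel a camera records

# Synthetic data: alpha is known exactly, so the subject color recovers exactly.
recovered = (camera - (1 - alpha) * GREEN) / alpha
print(np.allclose(recovered, DESK))     # True

# Real footage: alpha is unknown; any guessed alpha gives the wrong color.
guess = 0.9
approx = (camera - (1 - guess) * GREEN) / guess
print(np.allclose(approx, DESK))        # False
```

With synthetic renders, every term of that equation is known, which is the whole advantage.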

For autonomy, I suspect some combination of real and simulated data is optimal for speed and coverage. But since you have to curate the real data to get only the “good driving,” you’ll end up classifying situations, and from those you can then generate synthetic data where the behavior is exactly what you want the AV to do. That might be easier than going through billions of miles to find the perfect stop sign, the perfect curve at high speed, the perfect drive around a double-parked car, etc.

Nvidia’s Alpamayo software includes hundreds of thousands of synthetically generated clips they used for training. I suspect that’s not nearly enough clips for a true AV and that OEMs using the Nvidia system, like Mercedes, are supplementing with not just additional clips but also real-world driving capture. But, I don’t believe Elon that they need 10 Billion miles of real world data to get all the edge case clips they need.

If Mercedes keeps to its promises (unlikely in the autonomy world), we’ll know more by the end of this year.


That is simple to answer. Tesla is the only company with 10 billion real miles of training data, so making the claim supports that only Tesla can be close.
