AI in Tesla's FSD/Robotaxi and Sutton's "The Bitter Lesson"

A thread dedicated to Tesla’s AI implementation.

Egged on by competitors, media outlets repeatedly raise two technical attributes of Tesla’s autonomy efforts as concerns they believe will, or most likely will, prevent Tesla from succeeding: 1) eschewing LiDAR and 2) employing an end-to-end AI architecture. This thread is about the latter.

One prominent and vocal critic of Tesla’s approach has been Amnon Shashua, CEO of Mobileye. In 2023, he co-authored a blog post, saying:

In summary, we argue that an end-to-end approach is neither necessary nor sufficient for self-driving systems. There is no argument that data-driven methods including convolutional networks and transformers are crucial elements of self-driving systems, however, they must be carefully embedded within a well-engineered architecture.

That “well-engineered architecture,” they argue, is what they call “CAIS,” for Compound AI System, which “deliberately puts architectural restrictions on the self-driving system for the sake of reducing the generalization error.”

About a year later, however, Shashua’s view evolved, and he co-authored a follow-up blog post on the subject. A year after that (now 2025), Mobileye is touting its own end-to-end AI use:

our compound AI system that blends end-to-end perception software with other key breakthroughs

What Mobileye has done is keep their modular architecture (not to be confused with an “AI model”) but rewrite some of the modules to employ end-to-end AI. So, it would be wrong to characterize their system as E2E AI, but specific tasks within that system are encapsulated into modules that use E2E AI. This still allows them to use their “glue code” and insert restrictions, etc. into the pipeline.
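To make the distinction concrete, here’s a rough sketch of what a modular pipeline with an E2E perception module inside it might look like. All of the names and numbers are invented for illustration - this is not Mobileye’s (or anyone’s) actual code:

```python
# Hypothetical sketch of a modular pipeline: the perception module is itself an
# end-to-end network, while hand-written "glue code" moves a structured world
# model between modules and can impose restrictions along the way.
from dataclasses import dataclass

@dataclass
class DetectedObject:
    kind: str           # e.g. "car", "pedestrian", "stop_sign"
    distance_m: float
    confidence: float   # 0.0 .. 1.0

def perception_e2e(camera_frames):
    """Placeholder for an end-to-end perception network: pixels in, objects out."""
    return [DetectedObject("pedestrian", 4.2, 0.91)]   # dummy output for the sketch

def driving_policy(world_model):
    """Placeholder for a planning module that consumes the structured world model."""
    return {"throttle": 0.2, "brake": 0.0, "steer": 0.0}

def drive_one_tick(camera_frames):
    world_model = perception_e2e(camera_frames)        # module 1 (E2E AI inside)
    # Glue code: a human-written restriction inserted between the modules.
    if any(o.kind == "pedestrian" and o.distance_m < 5.0 for o in world_model):
        return {"throttle": 0.0, "brake": 1.0, "steer": 0.0}
    return driving_policy(world_model)                 # module 2

print(drive_one_tick(camera_frames=None))
```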

Mobileye points out that both they and Waymo use CAIS, while Musk claims Tesla is using a singular E2E AI implementation. We do know Tesla used to separate perception from driving policy, using AI for perception but using human-programmed algorithmic logic for driving policy. They claim to have thrown away 300k lines of human-programmed code in favor of a monolithic E2E AI architecture.

In both blog entries I’ve linked above, Shashua attempts to level a number of criticisms against using E2E AI. For instance:

For controllability, end-to-end approaches are an engineering nightmare. Evidence shows that the performance of GPT-4 over time deteriorates as a result of attempts to keep improving the system. This can be attributed to phenomena like catastrophic forgetfulness and other artifacts of RLHF. Moreover, there is no way to guarantee “no lapse of judgement” for a fully neuronal system.

and

while it may be possible that with massive amounts of data and compute an end-to-end approach will converge to a sufficiently high MTBF, the current evidence does not look promising. Even the most advanced LLMs make embarrassing mistakes quite often. Will we trust them for making safety critical decisions?

Shashua’s argument here is that E2E AI is useful only when mistakes are tolerable. I would like to ask him where, in self-driving, mistakes are tolerable. Apparently, given Mobileye’s use of E2E for their perception module, they can tolerate mistakes in identifying objects in the car’s environment.

That seems unlikely - more likely is that Shashua’s/Mobileye’s views on E2E have evolved in the past few years.

In Shashua’s second (2024) blog, he points out that ChatGPT itself employs a CAIS architecture:

When asked to compute “what is 3456 * 3678?,” the system first translates the question into a short Python script to perform the calculation, and then formats the output of the script into a coherent natural language text. This demonstrates that ChatGPT does not rely on a single, unified process. Instead, it integrates multiple subsystems—including a robust deep learning model (GPT LLM) and separately coded modules. Each subsystem has its defined role, interfaces, and development strategies, all engineered by humans. Additionally, ‘glue code’ is employed to facilitate communication between these subsystems. This architecture is referred to as “Compound AI Systems” (CAIS).
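To see what that kind of “compound” wiring looks like in miniature, here’s a toy sketch of the arithmetic example - a router, a separately coded calculator tool, and glue code to format the result. It’s purely illustrative and is not how ChatGPT is actually built:

```python
# Toy illustration of the CAIS idea described above -- not OpenAI's actual design.
# A "router" piece of glue code decides whether to answer with the language model
# or hand the question to a separately coded tool, then formats the result.
import re

def llm_answer(question: str) -> str:
    """Placeholder for a large language model call."""
    return f"(LLM free-text answer to: {question!r})"

def calculator_tool(expression: str) -> str:
    """Separately coded, deterministic module for arithmetic."""
    a, b = map(int, re.findall(r"\d+", expression)[:2])
    return str(a * b)

def compound_system(question: str) -> str:
    # Glue code: route based on what the question looks like.
    if re.search(r"\d+\s*\*\s*\d+", question):
        result = calculator_tool(question)
        return f"The product is {result}."       # glue code formats the output
    return llm_answer(question)

print(compound_system("what is 3456 * 3678?"))   # -> "The product is 12711168."
```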

This all leads me back to Sutton’s Bitter Lesson. If the architecture of modules in a CAIS is designed to conform to human understanding and knowledge, especially when some or most of those modules are coded to encapsulate human knowledge, then The Bitter Lesson argues that these are at best interim architectures and that the final/best architecture will be one that fully utilizes reinforcement learning on large amounts of data. Sutton gives earlier examples; here’s one:

In computer vision…early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

Daniel Jeffries expands on Sutton’s

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

by writing:

Humans still have to write the algorithms. They’re not irrelevant. They’re just writing the wrong algorithms much of the time, instead of focusing on the ones that will work best, general purpose learning algorithms and search algos that get better when you hurl more compute at them.

Jeffries also expands on Sutton’s chess AI history/example:

AlphaGo Zero was the successor to AlphaGo, the platform that beat Lee Sedol. AlphaGo included a great deal of human knowledge baked into it too. It learned from human games to build its policy network. It had a ton of glue code to write around problems. Of course, it also heavily leveraged search and compute with Monte Carlo and deep learning training on lots of data.

But AlphaGo Zero used no domain knowledge, just pure generalizable learning.

Zero knew nothing about chess except the basic rules. It had no domain knowledge baked in at all. It wasn’t told to control the center of the board or penalize pawns like Stockfish. It learned by playing itself again and again via RL. It got punished for losing and rewarded for winning.

It did the same for Go. No domain knowledge, just the basic rules. Play. Evolve. Reward. Punishment.

AlphaGo Zero beat AlphaGo 100-0.

One would think that if the E2E AlphaGo Zero had made a mistake, AlphaGo would have taken advantage and at least gotten a draw, if not a win. But that didn’t happen.

Now, of course, the obvious “well but” is that the realm of chess is much smaller than the task of driving in the real world. The real world has so many more objects than chess pieces, and they’re all moving around simultaneously versus one at a time in chess. And yes, autonomy is a much more complicated problem to solve.

But, The Bitter Lesson tells us that reverting to human knowledge encoding won’t solve it any better than it has in chess, go, speech recognition, etc. Methods that are general and use scalable approaches like reinforcement learning are the ones that look most likely to carry the day. In the end, the CAIS systems are but early approaches that will eventually be discarded.

4 Likes

Since the last post was so long, I saved this tidbit for a follow-up. This is Cruise founder Kyle Vogt, talking about past mistakes and the future, when asked about self-driving:

What I see is really Tesla, as a company who pioneered the end-to-end neural network approach to self-driving, which I think is the right technical bet long-term, but they put some constraints on it.

They said, ‘Hey, engineers, you can’t have the best sensors,’ like LIDARs and radars, ‘and the sensors have to look good when we put them on the car. Oh, and by the way, they have to cost one-tenth as much as the guys down the street who are doing this.’

So the right technical vector, but like really being held back by the weight of all these constraints that were put on the system.

But all of their technical approach from day one seems to have been pointed in the right long-term direction. So that’s good.

He does also say:

The other big, perhaps false dichotomy that people create is LIDAR versus cameras.

And with regards to Waymo (edited down):

With Waymo… they built this highly-validated, robust system that’s now on public roads, and it’s great, but they know that it’s the wrong technical approach, and they need to move more in the direction of Tesla, of more neural networks…

Because it is just intractable to maintain a 3D map of every square inch of the planet and update it in real time, and then expect that every time you go somewhere the map is still accurate, on one hand. And it’s also probably unrealistic to assume that every car built in the future is going to have these giant spinning KFC buckets on the roof.

But please, let’s not turn this thread into lidar vs cameras, let’s keep it on software and especially AI architectures.

1 Like

While programming I often thought about how the brain works. My very first program was on an IBM 650. I could not make the code small enough to fit the limited memory. So I told my boss several times and he always replied, “It fits.” One night I woke up at 4 AM with the solution. I could not wait for IBM to open so I could test the idea. It worked and the program fit. My brain figured it out while I was sleeping. This was not the typical boolean way of thinking, this was subconscious working. The solution the subconscious came up with was, “Hey buddy, you don’t need that complicated algorithm. Use the table lookup function to determine if the data is valid or not, that’s all you need to know.”

Part of the brain is a pattern matching machine that has stored just about everything one has experienced since the day one is born. It then matches the new input against that vast storage. The output is not exact, it’s a probability. Richard Feynman said, “You start with a guess. Be it a theory or a hypothesis, all it is is an educated guess.” What the Scientific Method does is test the guess, true or false, and that’s boolean, not pattern matching.

Why is the current version of AI not better? Because size matters. The human brain has billions of neurons and trillions of synapses. The brains of insects, lizards, and other species are smaller than the human brain, and they solve simpler problems. On a biological scale, where are present-day AI data centers? Cells, insects, lizards, birds, mammals?

Maybe Amnon Shashua, CEO of Mobileye, is overthinking the issue?

The Captain

1 Like

Thanks for the interesting and extended post, Smorg. I haven’t yet had a chance to review the linked docs - just your summaries of them. But just from reading what you wrote, I think there’s at least one really easy answer that’s a likely possibility.

Mobileye probably believes that there’s adequate opportunity for catching errors - or at least dealing with them - in their “glue code,” as you call it. One of the key differences between modular E2E and the “pure” E2E that Tesla has now switched to is that “glue code” - the hundreds of thousands of lines of code that aren’t E2E that Tesla decided to jettison.

If the E2E systems have an error rate of X (whatever X is and whatever units), you can try to reduce the error rate by pushing more and more data into the E2E training to try to get it to fall. But if that doesn’t work, you can use the stuff that’s more directly being programmed into the system to add some of that “human” architecture back in. That way if you run into a situation where (for example) your perception module has borked and is returning a “no output” response, you can tell your separate driving behavior module to do a particular thing in response. If it’s E2E, you don’t even have those separate modules, so the thing just disengages instead.

Maybe that’s inconsistent with what’s in the actual blog posts, but I suspect that’s probably where Waymo (and Mobileye) have put in the ‘safety protocols’ that give the AI a way of driving to a safe stop when things have gone wrong, rather than just disengaging back to a driver. Certainly using a modular system with “glue” holding the modules together gives you more opportunities to do that than would a single “black box” E2E that goes straight from photons to driving controls with no opportunities for human code to tweak the results.
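As a purely hypothetical sketch of what that kind of glue might look like (not Waymo’s or Mobileye’s actual code), the code between the modules can inspect the perception output and substitute a pull-over behavior instead of disengaging:

```python
# Hypothetical glue-code sketch, not any vendor's real implementation. Because the
# modules are separate, human-written code between them can inspect the perception
# output and substitute a "pull over and stop" behavior instead of disengaging.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PerceptionOutput:
    confidence: float       # overall confidence in the world model, 0.0 .. 1.0
    world_model: dict

MIN_CONFIDENCE = 0.5        # illustrative threshold

def driving_module(perception: PerceptionOutput) -> str:
    return "normal driving plan"               # placeholder for the real planner

def safe_stop_module() -> str:
    return "pull over to shoulder and stop"    # placeholder fallback behavior

def glue_plan(perception: Optional[PerceptionOutput]) -> str:
    if perception is None or perception.confidence < MIN_CONFIDENCE:
        # Perception is unusable: trigger the fallback instead of disengaging.
        return safe_stop_module()
    return driving_module(perception)

print(glue_plan(None))                                      # -> fallback
print(glue_plan(PerceptionOutput(0.9, {"objects": []})))    # -> normal plan
```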

I’m not sure no output is really a possibility, but even assuming that it is possible for some object in the scene, I would think that was the least common and least problematic of the potential problems in perception. I would think the most common and most problematic is misidentification … identifying a tree as a person or vice versa.

First, “glue code” is a common description, pulled directly from Shashua’s blog, where he uses the phrase 7 times. Here’s one instance:

In an engineered approach … the system starts with a built-in bias due to the abstraction of the sensing state and driving policy while (further) reducing variance through data fed into separate subsystems with a high-level fusion glue-code.

And this describes what glue code is - it’s the code that “glues” separate modules, or subsystems, together. It’s the piping in the pipeline, not the actual modules that do the heavy lifting. In Tesla’s case, it wasn’t 300k lines of glue code that were removed; it was 300k lines of mostly separate modules, plus some glue code.

You should really read the links I provided. In that same blog post, Shashua characterizes errors as either “approximation” or “generalization,” but then talks about “bias” and “variance.” He talks about overall system architecture, claiming that modularization increases bias while decreasing variance, and that E2E does the opposite. You can read and comment on that if you want - it wasn’t compelling to me.

Mobileye has a more recent (just last month) blog entry on CAIS that might help people understand their approach better:

Mobileye…breaks autonomy into clearly defined components such as sensing, planning, and acting, each corresponding with a dedicated AI model (or models).

That’s not how it works. The perception module is always, many times a second, returning objects - identifying them, providing distance and movement, etc. It really can’t return “no output” - it always has some output, unless you’re in the middle of a desert with no pavement, lanes, trees, humans, or desert rats, I guess.

I know you complained about me characterizing your description of a system as containing competing modules - and while that may not be how you think of the architecture, it is how Mobileye and others approach the perception problem:

Mobileye integrates multiple sensing modalities (camera, radar, lidar), REM crowd-sourced driving intelligence inputs, diverse AI methods, and overlapping algorithmic layers. These independent paths reinforce one another and provide resilience, not only in standard driving conditions, but also in edge cases and complex scenarios.

These are independent perception modules that compete to build the world model used as input into (eventually) the driving policy module.

In “Mobileye True Redundancy | Realistic Path for AVs at Scale,” Mobileye describes it further:

At Mobileye, we task both … camera and radar-lidar with sensing all elements of the environment and each building a full model.

Here’s their picture:

The two modules - camera and radar/lidar - each build a world model, and then some other module combines them. So, in essence, they’re competing.

Now, rather than “no output” what is more likely happening is that the perception modules return a confidence weighting (might be a percentage) for each object, describing how confident that module is that there is a thing there and what it is and what it is doing (might be multiple confidence levels, one for each aspect).

If the modules disagree whether a thing at a particular location is a tree or a person (to use @Tamhas's example), the code might look at the confidence values returned by each module and decide based on which was higher. More likely, there’s a complex decision-making tree that knows what each module is best and worst at and weighs the confidence levels accordingly. At any rate, it is a fair characterization on my part to say these modules are competing.
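Here’s a minimal sketch of that kind of arbitration. The per-modality weights and the example detections are invented; Mobileye’s real fusion logic is surely far more elaborate:

```python
# Illustrative arbitration between two "competing" perception paths.
# The per-modality weights are invented for this sketch.

# How much each modality is trusted for each kind of judgment (hand-tuned or learned).
MODALITY_WEIGHT = {
    ("camera",      "classification"): 0.8,   # cameras better at "what is it"
    ("radar_lidar", "classification"): 0.4,
    ("camera",      "range"):          0.5,
    ("radar_lidar", "range"):          0.9,   # lidar/radar better at "how far"
}

def fuse_classification(camera_det, lidar_det):
    """Each detection is (label, confidence). Pick the label with the higher weighted score."""
    cam_score   = camera_det[1] * MODALITY_WEIGHT[("camera", "classification")]
    lidar_score = lidar_det[1]  * MODALITY_WEIGHT[("radar_lidar", "classification")]
    return camera_det[0] if cam_score >= lidar_score else lidar_det[0]

# Camera path thinks "person" (0.6); radar/lidar path thinks "tree" (0.7).
print(fuse_classification(("person", 0.6), ("tree", 0.7)))   # -> "person" (0.48 vs 0.28)
```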

And while it’s tempting to say, as Mobileye does, that a CAIS architecture is valuable because connecting modules provides opportunities to massage AI outputs, that is really just another application of human knowledge to the problem. Which ties back to The Bitter Lesson, in that such applications fail in the long run. Which is why AlphaGo, with its knowledge modules coded in, failed to beat the E2E AI AlphaGo Zero.

Vogt’s observation that Waymo has a good system, and so will be hesitant to replace it with a new architecture, rings true to me. Waymo’s co-CEO was interviewed not that long ago, and when asked about LiDAR, she admitted that LiDAR is what got them there and they’re “not yet” ready to remove it from their system. This actually strikes me as an Innovator’s Dilemma type problem. Waymo would have to build a new system, train it, test it, and iterate for probably years, and they can’t go backwards in reliability on what they provide to customers, so they’d have to do this as a side project running in parallel to their main/current line. That means splitting resources between improving the existing system and building the new one. As Christensen describes, that can be a hard thing to pull off in practice.

3 Likes

My instinct is to think that it could be unfortunate to build a camera world model separate from a LIDAR world model rather than using the raw information from each to build the model.

1 Like

Yeah, but interestingly that’s exactly the opposite of what Mobileye claims is best, saying:

“With sensor fusion that is done before creation of the environmental model, each software update to the AV would require hundreds of millions of hours of data for validation.”

I see nothing to prove this and conceptually don’t believe it, at least for the edge cases we’re all concerned about, where one set of sensors is superior to the other but it’s hard to know, at any given moment, which one that is.

Shashua also makes a demonstrably false claim:

in the case of a failure of one of the independent systems, the vehicle can continue operating safely in contrast to a vehicle with a low level fused system that needs to cease driving immediately.

In Mobileye’s world, if the camera system fails, the vehicle cannot operate safely, as the cameras are the only way to see traffic lights, lane markers, and wording on signs.

In actuality, redundancy is built into such systems at all levels. There are multiple cameras with overlapping fields of vision, so failure of any one camera could be compensated for by the two cameras on either side - at least at a good enough level to perform a fallback maneuver. Tesla has two processors as well. And I’m sure there are other safeguards throughout, even if I don’t have the technical details at my disposal. And I’m sure Mobileye does this as well, as does Waymo.

I don’t know why Mobileye tries to claim:

Sensor redundancy is meant to ensure that sensors serve as back-ups for one another. But we often see complementary, not redundant, sensors – where cameras and radar or lidar each sense certain elements of the environment, which are then combined to build a single world model.

when it makes little difference to redundancy at what level the information is combined. Since cameras see lights and paint and radar & lidar cannot, the world models built by these two sets of sensors will usually be different - hence “complementary, not redundant” in both architectures.

I wonder if in Tesla’s system the individual cameras feed into the fused model or whether there is a subsystem between the cameras and the fused model as there clearly is with Mobileye. With a subsystem, there would seem to be another possible point of failure.

Given sufficient data and the compute resources to process that data, a general model should be able to converge to a very good (arbitrarily good) prediction.

However, it requires sufficient data, which for AI driving could be really large to sufficiently cover all of the driving scenarios that are needed to meet “better than human” safety.

Second however, it then also requires sufficient compute to handle the large, general model and all of the data. The scaling laws show diminishing returns in error improvement for each order of magnitude increase in compute, so grinding out the last few 9s of driving reliability could be a real slog.

Third however, a more specialized model with sufficient understanding of the driving process (maybe along the lines of how we can understand the physics of a missile or a tropical storm and employ that knowledge in models to predict the behavior of those systems), could have prediction as good or better than the more general model with less data and less compute.

So, how much data and compute does the general model need for AI driving?

Tesla is learning this right now, or at least we expect that they are.

And second, do we have sufficient knowledge of the driving process to reap predictive power in AI driving models?

Waymo and Mobileye believe yes (as best I understand their approaches).
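To put rough numbers on the scaling-law point above, here’s a tiny illustration assuming error falls as a power law in compute. The exponent is made up purely for the sketch - real scaling exponents vary by task, model, and dataset:

```python
import math

# Illustrative only: assume error falls as a power law in compute, error = C**(-alpha).
# alpha = 0.1 is invented for this sketch.
alpha = 0.1

def orders_of_magnitude_of_compute(target_error):
    """Solve target_error = C**(-alpha) for C and report log10(C)."""
    return -math.log10(target_error) / alpha

for target in [0.1, 0.01, 0.001, 0.0001]:   # each step is one more "nine" of reliability
    print(f"error {target}: ~10^{orders_of_magnitude_of_compute(target):.0f} units of compute")

# With this (invented) exponent, every extra "nine" costs 10 more orders of
# magnitude of compute -- which is the "real slog" referred to above.
```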

Reminds me of an old Spanish joke:

S1: The Americans spent eight billion dollars to send a man to the moon!

S2: How much is that in pesetas?

S1: All of them! All of them!

Cars kill people yet we drive cars. Airplanes kill people yet we fly. Trains… Well you get the point

Once AI based driving is safer than human driving it’s time to use it and perfect it by using it.

The Captain

1 Like

Just to be clear (“this…, but that…” phrasing suggested to me a difference in concepts), these are different pairs of labels for the same error concepts.

Shashua introduces bias error and variance error first, and then explains them using the ideas of approximation error and generalization error, respectively.

I found it compelling because it applies a foundational concept in machine learning to AI driving.

Shashua is focused on the bias-variance tradeoff, per the title of the post, because the bias-variance tradeoff is one of the foundational theoretical results in machine learning - and model selection in particular.

How well models like Tesla’s and Waymo’s AI driving models perform will depend directly on their bias and variance error.

Total model prediction error = (model bias)² + model variance (+ irreducible noise)

Model bias means that, under hypothetical repeated estimations (trainings) on data from the same underlying process, the mean model prediction will differ from reality by some fixed amount. The “bias” measures that deviation between the mean model prediction and reality.

The model variance is, under those same hypothetical repeated estimations (trainings), the mean squared deviation between each model prediction and the mean model prediction. The variance measures random uncertainty in model predictions: noise/scatter around the mean model prediction.

If you believe that E2E (end-to-end neural network) is the more general model, then, based on the theory of the bias-variance tradeoff, this claim has a sound theoretical justification.

Why would the general model have lower bias, higher variance and the more specialized model higher bias, lower variance?

The general model is more flexible, it can fit a more diverse set of underlying processes, so it has lower bias.

But flexibility comes at a cost, the flexibility to fit a variety of scenarios also means the model has more inherent variability, so it has higher variance.

The more specialized model is the opposite. Less flexible to fit different processes, so higher bias, but also lower variance because it is a more constrained model with less inherent flexibility.

If you know the underlying system process very well, then the specialized model is the way to go: the embedded human knowledge is doing a lot of the work, you don’t need as much data to “discover” how it works because you already know how the system works. You won’t need as much compute because the specialized model will be smaller than a general black box E2E model.

If you do not know the underlying system well, it’s a black box, then the general model is the way to go. But you’ll need a lot more data to tell the model how the system works and then compute to go with the big model and the big data.
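For anyone who wants to see the tradeoff concretely, here’s a small simulation along these lines. The true function, noise level, and model choices are arbitrary; the point is only that, refit on repeated noisy samples, the constrained model shows higher bias and lower variance than the flexible one:

```python
# Small simulation of the bias-variance tradeoff; the true function, noise level,
# and model degrees are arbitrary choices made only to illustrate the decomposition.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)   # the "underlying process"
x_test = 0.3                               # evaluate bias and variance at one input point

def bias_variance(degree, n=30, trials=2000):
    """Refit a degree-`degree` polynomial on `trials` independent noisy samples."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = true_f(x) + rng.normal(0, 0.3, n)    # noisy observations of the process
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x_test)) ** 2   # squared bias at x_test
    variance = preds.var()                           # scatter around the mean prediction
    return bias_sq, variance

for degree, label in [(1, "constrained (linear)"), (6, "flexible (degree-6)")]:
    b2, v = bias_variance(degree)
    print(f"{label:22s} bias^2={b2:.4f}  variance={v:.4f}  total={b2 + v:.4f}")
```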

Regarding Tesla and Waymo, we could say that Tesla has the more general E2E model while Waymo has a more structured CAIS (compound AI system) model.

Then the question is, where does each approach live on the bias-variance tradeoff curve?

Which model has lower total prediction error currently and in the future?

For Tesla, how big of a model and thus how much data and compute will they need?

For Waymo, they still will need plenty of data, but if they are including some knowledge of how the system works (eg, objects like stop signs have special meaning, let’s make sure we always perceive them), or they think specialization is helpful (eg, a submodel that specializes on perception), is that knowledge really helpful in improving prediction?

Currently Waymo is much further along in unsupervised miles, with publicly reported safety outcomes on a 50+ million mile data set and accumulating 4 million unsupervised miles per month and accelerating.

Tesla reported only 7k miles in one month in Austin for supervised driving. They’ll need 10s of millions of miles of unsupervised driving (or some equivalent data set) to estimate accident rate if it is near human levels.

7k miles is such a small amount of data, especially for a general model, it’s not clear to me at all what value this data has. It’s small for training and it’s small for safety validation.

Tesla also announced a new model with 10x more parameters, suggesting they still need a bigger model (and with each bigger model, more data and more compute).

“Autonomous decisions: The bias-variance tradeoff in self-driving technology” (Mobileye Blog)

For Tesla, I don’t believe having enough data is a problem, but rather having enough good data. The data needs to be evaluated / characterized / curated to get value from it. For instance, not coming to a complete stop at stop signs is something that a lot of people do, and we don’t want the system to learn that behavior. (Side note: in an earlier FSD version that didn’t yet use AI for driving policy, Tesla programmed the car to stop for a millisecond and then proceed. Cops reported that as rolling the stop sign because humans can’t perceive a millisecond of non-movement. Tesla had to artificially extend the stopped duration, lol.) Another example, and one of my pet peeves, is drivers who over-steer when taking curves at speed. Whether they do this instinctively as “cutting the corner” to save time, or as pre-compensation for a possible centrifugal-force slide outwards, I don’t know, but we’ve all seen cars cross the inside lines of lanes or even the center line at speed on curves. This is not behavior to be learned.

Most autonomy companies today, Tesla and Waymo included, have indicated that they also generate synthetic data on which to train their systems. Presumably, this synthetic data has vehicles obeying exactly as they wish the trained system to operate.

This is always the hope of such expert systems, but this is exactly the kind of thing that Sutton argues against.

I agree we don’t know how much compute is needed. Luckily, this appears to be mostly on the training side, so it’s done once on the servers (modulo subsequent tweaks once it’s good enough). That said, it is expected that larger models with more parameters will have additional inference requirements in the vehicles, although Tesla has shown some ability to reduce that effect and has been upgrading its on-board compute regularly.

It’ll be interesting to see the improvement with this system, if any.

E2E self-driving systems can still be programmed with the rules of the road. For instance, Daniel Jeffries describes AlphaGo Zero as being programmed with “the basic rules.” The kind of programming differences we’re talking about are, for instance in chess, whether to delay activating your queen, or trying to control the center of the board, or getting rooks behind passed pawns, etc. In autonomy, this might be how soon to get into the right lane before a right turn needs to be taken. But, rules of the road, such as obeying speed limit signs and stopping for stop signs are probably programmed in. Recognizing a stop sign in the field, however, is done through general AI and not through edge-recognition or 8-sided polygon type programming techniques.
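As a purely hypothetical illustration of “programming in the rules of the road” around a learned policy (no claim that Tesla’s system works this way), hard constraints can be applied to the network’s proposed controls after the fact:

```python
# Purely hypothetical sketch of layering hard rules of the road on top of a
# learned end-to-end policy's output; not any real vendor's implementation.

def e2e_policy(sensor_frames):
    """Placeholder for the learned network: sensors in, proposed controls out."""
    return {"speed_mps": 18.0, "steer": 0.02}

def apply_road_rules(controls, posted_limit_mps, stop_sign_detected, distance_to_stop_m):
    """Hard constraints applied to the proposed controls, regardless of what the net says."""
    controls = dict(controls)
    controls["speed_mps"] = min(controls["speed_mps"], posted_limit_mps)
    if stop_sign_detected and distance_to_stop_m < 3.0:
        controls["speed_mps"] = 0.0          # a full stop is non-negotiable
    return controls

proposed = e2e_policy(sensor_frames=None)
print(apply_road_rules(proposed, posted_limit_mps=15.0,
                       stop_sign_detected=True, distance_to_stop_m=2.0))
# -> speed clamped to the limit, then forced to 0 for the stop sign
```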

1 Like

None, of course, but it is rather like noting that a runner has passed the 10’ mark at the start of a mile race.

2 Likes

AI-zero was expert systems and it failed miserably.

The Captain

Is he actually arguing against them? Or just pointing out that they won’t be as powerful as systems that don’t have them?

I read your blog posts, and I don’t understand why you drew from them the conclusion that Mobileye has made any change in its position. The most recent blog post doesn’t say they’re moving into an E2E system like Tesla has. It just says they use E2E for the visioning portion.

AIUI (and I might be misreading it), the main difference between the modular systems and Tesla’s “pure” E2E system is that in the former, the humans do force a little human way of looking at the world into the process through the modular architecture. Take Mobileye. They use a vision/perception module that forms a model of the world - an E2E process that identifies things like “other cars” and “the road” and “pedestrians” and everything else the car might need. And that then gets put into the driving/behavior module. You increase bias (because you are forcing the module to look at the world the way humans do), but you decrease variance (you take out sensitivity to the input data).

Tesla’s model is more “pure” E2E. Photons in, control directions out. The AI has absolute freedom to ‘learn’ what to do in response to any particular collection of photons without having to generate any specific type of internal model, which lets it “fit” more broadly to any set of inputs (lower bias). But that increases the variance error.

You can see, though, why the modular system has some advantages for safety protocols. With a modular system that is forced to solve the problem by building a representation of the physical world, you can give it a safety subroutine. You can tell the AI that if it ever reaches a state where it has no idea what it’s supposed to do next, it should try to pull the car over to the side of the road and come to a stop without hitting any objects (the actual new prompt to the AI driver would probably be more elaborate, but that’s the point).

If you try to tell an E2E driver to pull the car over to the side of the road without hitting objects, the AI would ask you “what’s a road?” and “what’s an ‘object’?” I mean, not literally - it can’t ask questions like that. But because we didn’t force it into the human way of solving the driving problem (form a mental conception of space and objects and that the car is an object passing through that space among objects, etc.), there’s no way to interject a safety prompt that relies on the car having thought about the driving problem that way. You can’t give the car an alternate prompt like that. If the car doesn’t know what to do next, it can hand over control to a human and stop driving itself. But since it doesn’t “think” about the world the way humans do, it doesn’t have built into the model the same sort of stuff that the Mobileye architecture would.

2 Likes

Yes, he literally said E2E was not necessary:

In summary, we argue that an end-to-end approach is neither necessary nor sufficient for self-driving systems.

In later posts, he tamped that down, and eventually adopted it for their own use. That’s a change in position.

I’ll also add that claiming a module is E2E is just wrong, too. Imagine saying “we use an end-to-end encryption module” to protect your data. Well, the whole point of E2E is that it is End to End. Putting encryption (or AI) in the middle, not from the beginning (one end) to the termination (the other end) is NOT E2E. By definition.

In essence, Shashua is co-opting buzzwords to make his company sound good. It’s actually laughable to us in the technical world.

BTW, you’re wrong on not being able to tell an E2E autonomy system to pull over. Just as the E2E AlphaGo Zero program was told what the chess pieces were and what the rules were, one can tell an E2E autonomous system what objects in the world are and what to do related to them. And Tesla’s system shows you on screen objects it’s identified, too.

I wasn’t asking about the Mobileye CEO and E2E systems. I was asking about Sutton and expert systems. He pointed out that unfettered E2E systems would have more power than expert systems… but was he actually arguing against having expert systems altogether? There would certainly be applications where the more powerful system is always better, but that doesn’t mean that every application calls for the most powerful system over, say, other attributes that an expert system can have and a full E2E system does not.

I don’t think that’s right. You can’t just “tell” one of these systems what objects are and what the rules of the road are the way you can define chess pieces in AlphaGo. It’s way too complex. They “learn” what things are objects and what they look like by using vast amounts of data - the training system. You can force them to do that by separating the process into modules. You have a visioning/perception module that learns how to identify objects, and then you have a driving module that takes the identified objects and learns how to generate driving instructions with that as an input.

Tesla’s driving AI no longer is required to do this, and my understanding is that it doesn’t do this any more. It’s no longer required to do the intermediate step of forming an internal model of the world as a space with objects. The stuff on the display is just cosmetic - it’s a holdover from the pre-V.11 system that used to run that way, and they never bothered to take it out. Probably because it looks cool. But the current AI process is pure E2E - photons in, driving controls out, and that process does not generate any object identification like the old system did.

You get a more powerful system, because now the AI is free to identify connections and pathways and heuristics for knowing what controls to issue that have nothing to do with object identification. But you get a more opaque and overfitted system, and one that generally either generates outputs or shuts off and hands control back to the human. I’m not sure that the “pure” E2E allows a bailout safety subroutine the way that a modular system does, because the AI driver isn’t doing the same things that a modular system does. That may help explain why Tesla hasn’t implemented anything like what Waymo or Mobileye do, where there are circumstances where the car will initiate a safety “pull over” response in lieu of handing control back to the driver. At least, I’ve never seen anyone report that the car does this, except when the car is still completely driving but chooses to end the trip and pull over because the driver hasn’t demonstrated attentiveness.

1 Like

It seems to me that one needs to handle at least two types of problem … one is where the perception module can’t resolve what is out there and the other is where the action module can’t resolve what to do. I suppose for a vehicle with cameras and LIDAR, there is also not being able to resolve the differences between the modules. Pulling to the side of the road may not be the universal best action.

1 Like

Sorry that I quoted the wrong line from your earlier post and conflated my answer. You DID say:

To which my reply was intended, but during my editing got mixed up. Anyway:

  1. Mobileye did say E2E wasn’t necessary and then subsequently adopted and now touts E2E for its own use. That’s a big change.

  2. As for Sutton, he most certainly is arguing against expert systems:

The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.

If you think he’s leaving the door open for expert systems, please quote and explain.

If that were true, then there would be mismatches between what the “pre-V.11 system” is displaying on the screen and how the car is reacting. Plus, Tesla is improving its visualization output:

Which doesn’t seem like something they’d do on an old perception engine.

As for Tesla’s E2E code being monolithic, you have a good observation on vehicles stopping and asking for help, but I don’t see why that couldn’t eventually be programmed into the system.

We don’t have many details, but apparently monolithic code structure isn’t required for end-to-end:

Scroll down to “Transitioning to FSD v12 and the End-To-End Architecture”

Now consider an end-to-end setup with the same Blocks A and B.

  1. You have a single objective function that considers both recognizing the objects in the image (Block A’s task) and predicting the trajectory (Block B’s task).

  2. You train both Block A and Block B together to minimize this joint loss.

Key point: Information (and gradients during backpropagation) flows from the final output all the way back to the initial input. Block A’s learning is directly influenced by how well Block B performs its task, and vice versa. They are jointly optimized for a single, unified objective.

I don’t know how authoritative this is, but it does allow for blocks, and so the output of Block A could be directed to a visualization engine as well. At the very least, nothing says an E2E system can’t have two sets of outputs (one for visualization and one for vehicle control).
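Here’s a minimal sketch of that “two blocks, one joint loss” idea, with a side head tapping Block A’s intermediate output for visualization. The framework, layer sizes, and the existence of such a head in Tesla’s stack are all assumptions on my part:

```python
# Illustrative only: two blocks trained end to end under one joint loss, plus a side
# "visualization" head reading Block A's intermediate output. The framework (PyTorch),
# layer sizes, and the existence of such a head in Tesla's stack are assumptions.
import torch
import torch.nn as nn

class BlockA(nn.Module):
    """Stand-in 'perception' block: raw input -> intermediate features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
    def forward(self, x):
        return self.net(x)

class BlockB(nn.Module):
    """Stand-in 'planning' block: intermediate features -> trajectory."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))
    def forward(self, feats):
        return self.net(feats)

block_a, block_b = BlockA(), BlockB()
viz_head = nn.Linear(32, 8)   # hypothetical side output for on-screen visualization

opt = torch.optim.Adam(list(block_a.parameters()) + list(block_b.parameters()), lr=1e-3)

x = torch.randn(4, 128)            # stand-in for sensor input
target_traj = torch.randn(4, 2)    # stand-in for the trajectory label

feats = block_a(x)                                  # Block A
traj = block_b(feats)                               # Block B
loss = nn.functional.mse_loss(traj, target_traj)    # single joint objective
loss.backward()     # gradients flow back through Block B *and* Block A together
opt.step()

with torch.no_grad():
    display = viz_head(feats)  # visualization taps Block A's output; it plays no
                               # part in the driving loss above
```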