A thread dedicated to Tesla’s AI implementation.
Egged on by competitors, media outlets repeatedly raise two technical attributes of Tesla’s autonomy effort as concerns they believe will, or most likely will, prevent Tesla from succeeding: 1) eschewing LiDAR and 2) employing an end-to-end AI architecture. This thread is about the latter.
One prominent and vocal critic of Tesla’s approach has been Amnon Shashua, CEO of Mobileye. In 2023, he co-authored a blog post, saying:
In summary, we argue that an end-to-end approach is neither necessary nor sufficient for self-driving systems. There is no argument that data-driven methods including convolutional networks and transformers are crucial elements of self-driving systems, however, they must be carefully embedded within a well-engineered architecture.
That “well-engineered architecture,” they argue, is what they call “CAIS,” for Compound AI System, which “deliberately puts architectural restrictions on the self-driving system for the sake of reducing the generalization error.”
About a year later, however, Shashua’s view evolved, and he co-authored a subsequent blog post on the subject. Yet another year on (now 2025), Mobileye is touting their own end-to-end AI use:
our compound AI system that blends end-to-end perception software with other key breakthroughs
What Mobileye has done is keep their modular architecture (not to be confused with an “AI model”) but rewrite some of the modules to employ end-to-end AI. So it would be wrong to characterize their system as E2E AI; rather, specific tasks within that system are encapsulated in modules that use E2E AI. This still allows them to use their “glue code” and insert restrictions, etc. into the pipeline.
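To make that distinction concrete, here is a minimal sketch of that kind of pipeline. The module and function names are my own illustration, not Mobileye’s actual code: a learned end-to-end perception module feeds hand-written glue code that can impose explicit restrictions before a separately engineered planner runs.

```python
from dataclasses import dataclass

@dataclass
class SceneEstimate:
    objects: list           # detected objects with positions/velocities
    drivable_area: object   # lane/road geometry

def perception_e2e(camera_frames):
    """Learned end-to-end module: raw pixels in, scene estimate out."""
    # In a real system this would be a neural-network forward pass.
    return SceneEstimate(objects=[], drivable_area=None)

def planner(scene: SceneEstimate):
    """Separately engineered module producing a candidate plan."""
    return {"trajectory": [], "max_speed": 30.0}

def apply_restrictions(plan, scene):
    """Glue code: human-authored checks inserted between modules."""
    if any(getattr(o, "is_pedestrian", False) for o in scene.objects):
        plan["max_speed"] = min(plan["max_speed"], 10.0)  # hard-coded rule
    return plan

def drive_step(camera_frames):
    scene = perception_e2e(camera_frames)       # learned
    plan = planner(scene)                       # engineered
    return apply_restrictions(plan, scene)      # glue + restrictions
```

The point is that the human-defined interfaces (the scene estimate, the restriction checks) survive even though one of the modules is itself end-to-end learned.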
Mobileye points out that both they and Waymo use CAIS, while Musk claims Tesla is using a singular E2E AI implementation. We do know Tesla used to separate perception from driving policy, applying AI to perception but relying on human-programmed algorithmic logic for driving policy. Tesla claims to have thrown away 300k lines of that human-programmed code in favor of a monolithic E2E AI architecture.
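A monolithic E2E architecture, by contrast, collapses those interfaces into a single learned mapping from sensor input to control output. The toy below (using PyTorch purely for illustration; it is not Tesla’s network) shows the shape of the idea: there is no hand-written policy code left to put glue or restrictions into.

```python
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    """Toy monolithic network: camera pixels in, control commands out."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(                    # stand-in backbone
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 2)                      # steering, acceleration

    def forward(self, camera_frames):
        # No intermediate human-defined objects, lanes, or rules.
        return self.head(self.backbone(camera_frames))

controls = EndToEndDriver()(torch.randn(1, 3, 480, 640))  # [steer, accel]
```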
In both blog entries I’ve linked above, Shashua attempts to level a number of criticisms against using E2E AI. For instance:
For controllability, end-to-end approaches are an engineering nightmare. Evidence shows that the performance of GPT-4 over time deteriorates as a result of attempts to keep improving the system. This can be attributed to phenomena like catastrophic forgetfulness and other artifacts of RLHF. Moreover, there is no way to guarantee “no lapse of judgement” for a fully neuronal system.
and
while it may be possible that with massive amounts of data and compute an end-to-end approach will converge to a sufficiently high MTBF, the current evidence does not look promising. Even the most advanced LLMs make embarrassing mistakes quite often. Will we trust them for making safety critical decisions?
Shashua’s argument here is that E2E AI is useful only when mistakes are tolerable. I would like to ask him where that threshold lies for self-driving. Apparently, given Mobileye’s use of E2E for their perception module, they can tolerate mistakes in identifying objects in the car’s environment.
That seems unlikely; more likely, Shashua’s and Mobileye’s views on E2E have simply evolved over the past few years.
In Shashua’s second (2024) blog, he points out that ChatGPT itself employs a CAIS architecture:
When asked to compute “what is 3456 * 3678?,” the system first translates the question into a short Python script to perform the calculation, and then formats the output of the script into a coherent natural language text. This demonstrates that ChatGPT does not rely on a single, unified process. Instead, it integrates multiple subsystems—including a robust deep learning model (GPT LLM) and separately coded modules. Each subsystem has its defined role, interfaces, and development strategies, all engineered by humans. Additionally, ‘glue code’ is employed to facilitate communication between these subsystems. This architecture is referred to as “Compound AI Systems” (CAIS).
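The pattern that quote describes is easy to sketch. The following mock-up is my own illustration of the compound/tool-use idea, not OpenAI’s internals; the model call is faked so the example stays self-contained:

```python
def mock_llm_generate_code(question: str) -> str:
    """Stand-in for the LLM subsystem: turn a math question into code."""
    # A real system would prompt the model; here the example is hard-coded.
    return "result = 3456 * 3678"

def run_tool(snippet: str):
    """Glue code: execute the generated snippet in an isolated namespace."""
    namespace = {}
    exec(snippet, {}, namespace)   # sandboxing omitted in this toy example
    return namespace["result"]

def answer(question: str) -> str:
    code = mock_llm_generate_code(question)   # subsystem 1: the LLM
    value = run_tool(code)                    # subsystem 2: the interpreter
    return f"3456 * 3678 = {value}"           # glue: format the final text

print(answer("what is 3456 * 3678?"))  # 3456 * 3678 = 12711168
```

Each piece has a defined role and interface, which is exactly the “glue code” structure Shashua is describing.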
This all leads me back to Sutton’s Bitter Lesson. If the architecture of modules in a CAIS is designed to conform to human understanding and knowledge, especially when some or most of those modules are coded to encapsulate that knowledge, then The Bitter Lesson argues that these are at best interim architectures, and that the final/best architecture will be one that fully utilizes reinforcement learning on large amounts of data. Sutton gives earlier examples; here’s one:
In computer vision…early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
Daniel Jeffries expands on Sutton’s
One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
by writing:
Humans still have to write the algorithms. They’re not irrelevant. They’re just writing the wrong algorithms much of the time, instead of focusing on the ones that will work best, general purpose learning algorithms and search algos that get better when you hurl more compute at them.
Jeffries then expands on Sutton’s chess AI history/example:
AlphaGo Zero was the successor to AlphaGo, the platform that beat Lee Sedol. AlphaGo included a great deal of human knowledge baked into it too. It learned from human games to build its policy network. It had a ton of glue code to write around problems. Of course, it also heavily leveraged search and compute with Monte Carlo and deep learning training on lots of data.
But AlphaGo Zero used no domain knowledge, just pure generalizable learning.
Zero knew nothing about chess except the basic rules. It had no domain knowledge baked in at all. It wasn’t told to control the center of the board or penalize pawns like Stockfish. It learned by playing itself again and again via RL. It got punished for losing and rewarded for winning.
It did the same for Go. No domain knowledge, just the basic rules. Play. Evolve. Reward. Punishment.
AlphaGo Zero beat AlphaGo 100-0.
One would think that if the E2E AlphaGo Zero had made a mistake, AlphaGo would have taken advantage and at least gotten a draw, if not a win. But that didn’t happen.
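The recipe in that quote, basic rules plus self-play with reward and punishment, fits in a few lines once the game is small enough. The sketch below is my own toy (tabular self-play value learning on a tiny Nim variant), not AlphaGo Zero’s actual algorithm, which pairs deep networks with Monte Carlo tree search; but it shows a strategy emerging with zero strategic knowledge baked in:

```python
import random
from collections import defaultdict

ACTIONS = (1, 2)         # the only "rules" given: take 1 or 2 stones; last stone wins
Q = defaultdict(float)   # Q[(stones_left, action)], shared by both self-play sides

def legal(state):
    return [a for a in ACTIONS if a <= state]

def best_value(state):
    return max(Q[(state, a)] for a in legal(state))

def train(episodes=20000, epsilon=0.2, lr=0.5):
    for _ in range(episodes):
        state = 7                               # start each game with 7 stones
        while state > 0:
            moves = legal(state)
            if random.random() < epsilon:       # explore
                a = random.choice(moves)
            else:                               # exploit current knowledge
                a = max(moves, key=lambda m: Q[(state, m)])
            nxt = state - a
            # Reward for winning (taking the last stone); otherwise the value
            # is the negative of the best the opponent can do next (zero-sum).
            target = 1.0 if nxt == 0 else -best_value(nxt)
            Q[(state, a)] += lr * (target - Q[(state, a)])
            state = nxt                         # the other "self" now moves

train()
# With no strategy coded in, the learned values recover the known result:
# positions that are multiples of 3 are losses for the player to move.
for s in range(1, 8):
    print(s, round(best_value(s), 2))
```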
Now, of course, the obvious “well but” is that the realm of chess is much smaller than the task of driving in the real world. The real world has so many more objects than chess pieces, and they’re all moving around simultaneously versus one at a time in chess. And yes, autonomy is a much more complicated problem to solve.
But The Bitter Lesson tells us that reverting to human knowledge encoding won’t solve it any better than it has in chess, Go, speech recognition, etc. Methods that are general and use scalable approaches like reinforcement learning are the ones that look most likely to carry the day. In the end, the CAIS systems are but early approaches that will eventually be discarded.

