You should stay away from the analogies … they are leading you to false conclusions.
The base problem here is that you have the idea that L2 and L4 systems are totally different things, whereas we are trying to get you to understand that the brains that make them work are the same type of thing, distinguished mostly by the action taken when the brain gets confused … a distinction driven not by some fundamental design difference but by the frequency of confusion.
Why not? Most of the sensors are seeing perfectly normal things. Who cares what is dancing on the bonnet unless one is a driver whose vision is obscured … and that's a situation for which some training has probably occurred.
I’m not saying they’re totally different things, but I’m trying to get you to realize that these differences are not just a function of how frequently the “brain” gets confused. They’re the result of real design differences in the software stack. The software is programmed to react differently to confusion scenarios (I know, “is programmed” isn’t quite the right framing for an end-to-end AI system). Even if they have exactly the same rate of confusion scenarios, the software in an L4 system does not have the option of dumping to a human driver in real time. It’s not part of the software’s decision-handling algorithm. Other systems do have that option, and it’s part of their design to use it as their response to confusion scenarios.
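To make that concrete, here’s a minimal sketch in Python - not any vendor’s real code, and every name in it is invented - of the same “confusion” event handled by the two different fallback policies I’m describing:

```python
# A minimal sketch, not any vendor's real code: the same "confusion" event,
# handled by two different fallback policies. All names here are invented.
from enum import Enum, auto

class Fallback(Enum):
    HAND_TO_HUMAN = auto()   # L2-style: alert the supervising driver and disengage
    MINIMAL_RISK = auto()    # L4-style: the car has to resolve the situation itself

def handle_confusion(policy: Fallback) -> str:
    """What the stack does when the driving model can't produce a confident plan."""
    if policy is Fallback.HAND_TO_HUMAN:
        # Only a valid response if a human is physically present and attentive.
        return "chime, display 'take over immediately', release control to the driver"
    # An L4 stack has no human to dump to, so the fallback is itself a driving task.
    return "minimal-risk maneuver: slow down, pull over, stop, ping remote assistance"

print(handle_confusion(Fallback.HAND_TO_HUMAN))
print(handle_confusion(Fallback.MINIMAL_RISK))
```

The point of the toy example is that the difference lives in the decision-handling path itself, not in how often the confusion branch gets taken.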
Obviously, avoiding injury and accidents is the main point. However, things are a little confusing because with Tesla we don’t have a clear breakdown of:
The software got confused and said “help human”.
The driver got impatient and took over.
The driver changed his/her mind.
The driver perceived a valid safety issue and intervened.
The driver thought he/she perceived a safety issue and intervened, but the car would have handled it just fine.
If we had a believable breakdown, it would tell us a lot. Some of those don’t apply to L4.
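Just to show how simple the tally itself would be if honest labels existed, here’s a sketch using category names of my own and a placeholder event log (no real Tesla data is involved, because none is published):

```python
# Sketch only: the categories mirror the breakdown above; the event log is a
# made-up placeholder standing in for data Tesla does not publish.
from collections import Counter
from enum import Enum, auto

class Disengagement(Enum):
    SOFTWARE_ASKED_FOR_HELP = auto()   # the one category an L4 car must handle with no human
    DRIVER_IMPATIENT = auto()
    DRIVER_CHANGED_MIND = auto()
    DRIVER_SAW_REAL_HAZARD = auto()
    DRIVER_SAW_FALSE_HAZARD = auto()   # the car would have handled it just fine

events = [
    Disengagement.DRIVER_IMPATIENT,
    Disengagement.SOFTWARE_ASKED_FOR_HELP,
    Disengagement.DRIVER_SAW_REAL_HAZARD,
]

print(Counter(events))   # the kind of breakdown that would actually tell us something
```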
Because that’s not always how these systems work. Most of them don’t “think” like a human does. They don’t “know” that there’s something dancing on the bonnet, and they don’t have a “mental model” of the physical world. They don’t know what a bonnet is, and they don’t know what it means for something to be “on” the bonnet. These are token predictors - they take the data from the inputs and predict, based on billions of miles of training data, what the output should be. If the input data is far enough away from any of their training data, they don’t know what to do.
That’s why the “Chinese Room” analogy of this is so useful. If you’re in that room and all you see are QR codes, and you can generate outputs based on similarities to past QR codes, you can’t process a QR code that has no similarities to prior QR codes. You don’t “know” what any of the QR codes actually correspond to, so you can’t intuit what to do.
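For a mechanical picture of what “too far from the training data” means, here’s a deliberately dumbed-down sketch. A nearest-neighbour lookup stands in for the learned model; the numbers and labels are invented, and real stacks don’t literally work this way:

```python
# Toy illustration of out-of-distribution inputs, not a real driving model.
import math

TRAINING = {                      # scene embedding -> learned output
    (0.9, 0.1): "continue straight",
    (0.2, 0.8): "brake for pedestrian",
    (0.5, 0.5): "yield to merging car",
}

def predict(embedding, max_distance=0.3):
    """Return the output for the most similar known scene, or None if nothing is close."""
    best_action, best_dist = None, float("inf")
    for known, action in TRAINING.items():
        d = math.dist(embedding, known)
        if d < best_dist:
            best_action, best_dist = action, d
    # A scene unlike anything in training produces no usable output at all.
    return best_action if best_dist <= max_distance else None

print(predict((0.88, 0.12)))   # close to a known scene -> a confident action
print(predict((0.05, 0.05)))   # "penguins on the bonnet" -> None: the system is confused
```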
My understanding - but here we get to the edges of lay understanding - is that the reason Waymo is able to operate at L4 is that their system isn’t entirely set up using these processes. Waymo’s software stack is a modular system. Part of it is based on these kinds of neural network “input leads to output” training systems. But the rest of it is developed using human-written code - lots of different modules and subroutines and whatnot that were shaped by human hands (metaphorically!).
Tesla’s approach used to be that way, but about two years ago they scrapped it in favor of a pure end-to-end system. No more humans writing the code for the software (more or less); all of it is developed as an end-to-end system constrained by training compute resources, not by software engineers. That has helped Tesla reduce the incidence of confusion/disengagement moments relative to where they were before they abandoned the old structure … but it leaves them with an entirely different software structure from Waymo’s for dealing with what happens if the token prediction process stops.
I think you’re overlooking the degree to which these operational attributes arise from fundamental differences in the software architecture. It’s not “Waymo fails less often, but Tesla’s getting better at that.”
I agree 100%. Assessment of Tesla’s capabilities is hampered by the fact that they do not provide public information about the rates or circumstances of disengagement. They don’t just decline to provide that information - they appear to be taking steps to avoid any regulatory processes that require it to be disclosed.
The critical scenario is the first one. “Software got confused and said, ‘help human.’” If the software incorporates “help human” into the real-time decision-making algorithm, then you have potentially horrible consequences if there isn’t a human in the car to help.
For this discussion, it is. It is a human in the car who is performing a safety-critical driving function. It is a “driver” for autonomy purposes - and for all practical purposes as well. If you have to have a human physically present in the car for it to operate safely, it does not matter whether that human is doing 0.1% of the physical controlling of the car or all of it. If you move the safety driver to the passenger seat, they’re still a safety driver.
We don’t have it, but we can make a pretty strong inference about it. If Tesla had a rate of disengagement that was small enough for them to be “close” to being able to pull the safety drivers from the robotaxis (for any practical definition of close), they wouldn’t be acting the way they are. They wouldn’t still be declining to move ahead with their CA driverless testing. They wouldn’t be hiding that information, either - they’re very open with the “miles between accidents” data that puts a positive spin on their progress, but they never disclose a comparable “miles between disengagements” figure. And now that they’ve entered the robotaxi business, there’s no plausible explanation for that - unless the figures are bad.
We don’t know the figures, so we don’t know they’re bad - but that’s the most likely explanation.
Who said anything specific to limited access highways? Many cities will have a highway to the airport.
I don’t know if this is still true, but at one point a Waymo could only go so fast before it would outrun its LIDAR’s effective range. I suspect they might be past this, but some residual limit might remain.
And I am pointing out there is no sensor for what is dancing on the bonnet. Virtually all sensors will be seeing normal things and the penguins won’t interfere with anything.
A reasonable “midway goal” is “I am thinking about investing in the future self-driving business, including robotaxis and other potential offshoots. Based on what we know today, which system is likely to have early success?”
Waymo says they do in some limited circumstances. They are testing in several cities and do use highways in LA for customers now, according to their press office.
Sorry - just force of habit from my legal practice, which deals with transportation planning issues more than you’d think. Many ordinary low-speed surface roads are “highways” (including many that Waymo travels on) in some legal and technical contexts. So I tend to use “limited access” as shorthand jargon when I want to make it clear I’m referring to the high-speed motorways, like the ones Waymo excludes from use by public customers (Waymo cars go on freeways for testing, of course, and they also carry Waymo employees as passengers there). Since I practice in a very urban area (Miami), I almost never deal with high-speed roads that aren’t limited access highways.
Well, I had said pelicans. The point is that Tesla’s system - like all AI systems - will sometimes encounter a particular set of inputs that does not lead to an output. I chose pelicans mating on the bonnet as a colorful (and obviously nonsensical) way of describing one such possible scenario, but it can be anything that constitutes a set of inputs that is so far outside the training data that the AI driver can’t generate any outputs.
Their different approach to sensors gets a lot of attention, because it’s really obvious. But they also take very different approaches to designing the software stack. And those different software approaches contribute to some of the operational differences we’ve been talking about here. Waymo uses a modular design, not just one end-to-end system - so even in circumstances where their “foundation” module gets confounded by a really weird circumstance, the other modules are still independently building up an internal model of the environment and nearby objects that can be used for decision-making. Tesla’s model used to be that way, but they scrapped it two years ago, and now they have just a single large model going from inputs to controls.
This is the heart of the matter: the lack of sufficiently detailed performance and safety data disclosed by Tesla.
If such data were disclosed, a much stronger evaluation of their product’s capability could be done.
In the meantime, we get the drip feed of unverifiable claims, silly predictions and singleton demos.
And we can only infer their AI driver capability based on the commercial product they actually release.
Another tidbit: by not releasing an unsupervised AI driver, Tesla does not have to incur liability for their AI’s driving decisions (they do assume liability for their small human-supervised taxi fleet).
Maybe this liability is too great, given the (unknown to us) capability and safety data?
When do the unsupervised customer vehicle deliveries begin at scale, from plant to customer?
That’s not how it works. It’s not redundant modules competing, it’s modules feeding into one another. As the article states:
Waymo’s foundation model is a modular transformer-based neural network, meaning different models contribute to perception, prediction, decision-making, and control, with one model’s outputs acting as inputs to the next.
I didn’t say they were competing. The various models are processing different aspects of the input to form different assessments of the environment, and those outputs are themselves used as input for the overall foundational model.
But my layman’s understanding is that’s what enables the system to function even if the overall foundation model is unable to resolve to a prediction. If the overall foundation model is unable to predict what it should do next, the system can use the inputs from the component models to fashion a safety response. IOW, because the system remains modular, it’s still going through the exercise of, say, identifying specific objects in the environment (“there is a car over there”). Normally it does that for the purpose of feeding into the overall process, but it’s then available to use as input “in case of emergency” so that the car can navigate a safety response to avoid a crash.
Tesla used to have their system set up like this, but switched to a pure end-to-end model. So you don’t have “chunks” in the processing. Video leads to output - photons straight to the gas pedal/steering wheel moving. That’s enabled them to take more advantage of their huge training data set. But one consequence (again, AIUI as a layman) is that the system is much more of a “black box” - inputs generate outputs, but there are no readily accessible partial solutions or intermediate steps.
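Here’s roughly how I picture that structural difference, as a layman’s toy sketch. The stage names and stub logic are my own inventions, not Waymo’s or Tesla’s actual modules; the point is only that the modular pipeline leaves intermediate products (like the object list) lying around for a safety routine to use, while the end-to-end mapping doesn’t.

```python
# Purely illustrative sketch of "modular pipeline" vs. "end-to-end"; every name
# and threshold below is invented for the example.

def perceive(frame):
    # Stub perception: pretend the scene contains one tracked object.
    return [{"kind": "unknown_object", "distance_m": 2.0 if "weird" in frame else 30.0}]

def plan_from_objects(objects):
    # Stub planner: gives up (returns None) if anything is confusingly close.
    if any(obj["distance_m"] < 5.0 for obj in objects):
        return None
    return "proceed at 40 km/h"

def modular_stack(frame):
    objects = perceive(frame)            # intermediate product, visible to other code
    plan = plan_from_objects(objects)
    if plan is None:
        # Planning failed, but the object list still exists, so a separate
        # safety routine can use it to slow down and pull over.
        return f"minimal-risk stop, steering around {len(objects)} tracked object(s)"
    return plan

def end_to_end_stack(frame, model):
    # One learned mapping from raw input to controls: if it returns nonsense,
    # there is no intermediate object list for anything else to consult.
    return model(frame)

print(modular_stack("ordinary camera+lidar frame"))
print(modular_stack("weird scene: something on the bonnet"))
print(end_to_end_stack("camera frame", model=lambda f: "steer 2 deg, throttle 10%"))
```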
The two systems now have different architectures. They have different characteristics and different capabilities. It is very possible that those differences will mean that one system might work better for some uses and not others.
To return to your house example, one company might be in a position to paint 90% of every one of the house’s 1000 rooms, while the other might be in a position to paint 100% of only 15% of the rooms. Both are incomplete, but they’re incomplete in different ways. The former has made more progress toward painting the house overall, but the latter has actually delivered usable rooms much earlier than the other.