A brief history of the Tesla humanoid robot
The Captain
Yes, but to again use Denny's terminology, that's a "control by numbers" situation. If you tell the robot what an object is made of and the various characteristics of the material, then it has some additional information about what to do. But that's entirely different from a machine "learning" to look at an object and figure out how to manipulate it in the real world without someone having programmed in what that object is made of.
Again, it's very different from what we were able to do with language. Just like in CAD, we've always had the ability to "hand program" in all the various rules of language, definitions, grammar requirements, and the like into a chat simulator - and many of the pre-ChatGPT chatbots had a fair amount of capability based on the CAD-equivalent of language. But it wasn't until we started using the Big Data around language that we saw a real step change in what the AI was capable of.
We don't have a source of Big Data for real world physical interactions, which is going to keep Optimus from being much of anything worthwhile for quite a long time.
Of course, but how do people decide what material it is or most likely is made of? Vision, mostly, but not exclusively. The 2012 ImageNet database even has a few dozen(?) materials/fabrics etc. that machines can easily identify, even though this dataset wasn't directly intended for this purpose.
In 10 seconds I found the MIT material database from 2014 and there are many others.
https://people.csail.mit.edu/lavanya/fmd.html
It can be difficult distinguishing a shiny aluminum sheet from a steel sheet (given no rust or other hints), so just like a human you'd have to use something beyond vision alone, but the robot would be able to instantly come up with a list of possible materials and a range of weights/densities easily.
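To make that concrete, here's a rough sketch of what that kind of "vision gives you a material prior" pipeline could look like - a small network fine-tuned on the ten FMD material categories, with the predicted label mapped to a ballpark density range. The checkpoint file and the density numbers are my own illustrative assumptions, not anything from the MIT dataset or Tesla:

```python
# Minimal sketch: classify the material in an image, then look up a rough
# density range. Assumes a ResNet-18 fine-tuned on the 10 FMD categories;
# the checkpoint path and the density table are illustrative, not real data.
import torch
from torchvision import models, transforms
from PIL import Image

FMD_CLASSES = ["fabric", "foliage", "glass", "leather", "metal",
               "paper", "plastic", "stone", "water", "wood"]

# Very rough density ranges in g/cm^3 - ballpark numbers just to show the
# idea of turning a visual label into a physical prior.
DENSITY_RANGES = {"metal": (2.7, 8.0), "stone": (2.3, 3.0), "glass": (2.4, 2.8),
                  "wood": (0.4, 0.9), "plastic": (0.9, 1.4), "fabric": (0.1, 0.6),
                  "paper": (0.7, 1.2), "leather": (0.8, 1.1),
                  "foliage": (0.1, 0.5), "water": (1.0, 1.0)}

model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, len(FMD_CLASSES))
model.load_state_dict(torch.load("fmd_resnet18.pt"))  # hypothetical fine-tuned checkpoint
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("object.jpg")).unsqueeze(0)
with torch.no_grad():
    probs = torch.softmax(model(img), dim=1)[0]
label = FMD_CLASSES[int(probs.argmax())]
print(label, DENSITY_RANGES[label])   # e.g. "metal", (2.7, 8.0)
```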
Of course, anyone intentionally trying to fool a robot would have an easy time just like Hollywood prop makers create foam rocks and rubber hammers.
My problem with the Tesla bot isn't with things like this…I just don't think that for the price there will be enough market demand for a mobile robot.
Mike (mentoring high school FIRST robotics teams for almost 20 years)
Perhaps, though it's not merely a "foam rock" problem. Many things in the real world are painted, wrapped, coated, enclosed in another object, or any of a hundred other things that make it somewhat difficult to figure out from vision alone what it is and what it's likely to be. And that's only a super-tiny portion of the problem. It's not merely about recognizing the substance something is made of, but knowing how to manipulate it in the physical world - how much force to exert to carry something in one "hand" without breaking the object, without it slipping, or without overbalancing the robot.
You're right that some of that can be programmed in. But the reason why the ASIMOs and Atlases of the world never progressed very far is because "control by numbers" can't get you very far. Until we see Tesla doing something that actually moves beyond what robotics has been able to do for the last several years (rather than just duplicating it in their in-house shop), there's no reason to think that Optimus will be anything beyond an R&D project. The advances in AI have come with massive datasets, and Dojo is designed to process those massive datasets - but without a massive dataset, what is Tesla actually bringing to the table that's new or different for Optimus?
I don't know. But like I said, I don't think there is a big enough market for a humanoid robot at the price. But in the case of practical industrial uses you don't need some giant dataset for vision; it is pretty much a solved problem, or easily solvable, for any practical use already. In last year's FIRST robotics competition, high school kids were using the YOLO AI model to detect and pick up game pieces within only a few weeks of knowing what the game was.
(yes, a very narrow case that was programmed into their robots…but high school kids and a 10-year-old embedded CPU)
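For the curious, that kind of off-the-shelf detection pipeline is only a few lines these days. This is a hedged sketch - the weights file and class name are hypothetical; a team would fine-tune a stock YOLO model on a few hundred labeled photos of that season's game piece:

```python
# Minimal sketch of off-the-shelf object detection with a YOLO model,
# roughly the kind of pipeline an FRC team could run on a coprocessor.
from ultralytics import YOLO

model = YOLO("game_piece_yolov8n.pt")      # hypothetical fine-tuned weights
results = model("camera_frame.jpg", conf=0.5)

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # pixel corners of the detection
    print(f"{cls_name}: ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}), conf={float(box.conf):.2f}")
```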
Mike
Sure - and I think that's one of the main reasons why people have been so optimistic about autonomous cars. The only sense input you really need to use to drive a car is sight, and maybe a little bit of hearing. And since vision is basically solved, it should be possible for an AI to drive a car. Because driving almost never involves manipulating any objects in the environment. There's no touching. In fact, it's almost all about not touching. For any given object, the AI never needs to know anything about its weight or durability or composition or anything apart from what it looks like.
But robots, even industrial robots, are different. Lifting, moving, manipulating, installing, carrying…these are all things that require the device to interact with objects in the environment. Which generally requires knowing more about how those objects will react when you do things to them, more than just what vision can reveal. How hard do you have to press on a door to open it? How much force do you use to open a closed drawer, or pull the sheets off of a bed for washing?
You can let a robot AI watch a million hours of poultry plant workers breaking down a chicken. But since that video doesn't contain information about how much force the workers' hands and fingers are applying to the knife, to the bones, to the chicken meat, etc., it can't "learn" from that how to break down a chicken the way that an LLM AI bot can "learn" how to compose texts by being fed all of human writing.
But you can then give it chickens to cut up and let it learn from the experience.
Could you? I mean, the really amazing strides in AI have resulted from those brains having access to massive numbers of "examples" of things (images, words, video, etc.) in "computer time" (being able to process millions and millions of instances). I'm not sure we've developed the tech where a computer brain could learn how to cut up a chicken without being programmed to, and instead using the "trial and error" approach used to train neural nets in LLMs and other environments. How many chickens, and how much time, would you need?
Good question.
Are you familiar with reinforcement learning or reinforcement learning with human feedback?
The way I think it could work, very simplified, is something like this:
All the parts to do this already exist, maybe with the exception of the last step, which is still a work-in-progress AFAIK. And there are probably other good ways to solve this. You just need a company with enough resources to put it all together.
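For anyone who hasn't seen it, "reinforcement learning" in its simplest form is just a trial-and-error loop over states, actions, and rewards. Here's a toy illustration - tabular Q-learning on a stock Gymnasium environment. It has nothing to do with Tesla's actual stack; it's just the shape of the idea:

```python
# Toy illustration of the reinforcement-learning loop: tabular Q-learning on
# Gymnasium's FrozenLake. A real manipulation policy would be a deep network
# trained in simulation, but the learn-by-trial-and-error structure is the same.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection: mostly exploit, sometimes explore
        if np.random.rand() < eps:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update: nudge toward reward + discounted best next value
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print("greedy policy:", np.argmax(Q, axis=1))
```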
Mike
Learning never ends.
The Captain
Isn't that one of the main problems - building that database? With most of the really exciting AI advancements these days, they're trained on huge databases that we spent millions of man hours creating for other reasons. We digitized virtually every instance of written communication that existed prior to the internet, and almost every recent instance of written communication is generated electronically. So too with images - we created a dataset of literally billions and billions of labelled images.
This is ostensibly Tesla's advantage in building an AV AI - that Tesla has the big dataset of billions of car-miles to train its AI in Dojo, and competitors don't. But no such dataset exists for the manifold tasks an Autonomous Humanoid Robot would potentially be asked to handle - and no similarly practical way of building one.
Isn't that the other main problem here?
We've automated a massive amount of jobs using conventional "control by numbers" programming. In assembly/manufacturing, many of the jobs that haven't been automated are the ones where we just can't program a machine to do the task for a variety of reasons. One of the common reasons, AIUI, is if the thing the machine needs to act on has certain characteristics that make it all-but-impossible to model internally with any precision.
Things that are floppy, squishy, irregular in size and shape and rigidity. Intensely variable situations, where it's unknown in advance what the machine will encounter. These are things that we can't just program the machine to handle. Humans are amazing in our ability to manipulate those kinds of objects and situations, but machines have a lot of difficulty in handling them.
That's why Cargill still has tons of poultry processing plant employees, instead of the lines of machines like many other - simpler - food production processes. It's not a solvable problem today.
Not really. Just pointing cameras at humans doing each of the steps at a processing facility for just 100 iterations would be enough to capture all the correct views. Since the humans aren't making many errors, you'd need them to fake some mistakes so it doesn't take hours and hours - things like, for step 1, holding the knife incorrectly or not even picking it up, so you could score those examples low.
That is needed for encompassing a vast number of edge cases that could occur at any time, thus all needed within a single DNN model.
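Roughly, the scoring scheme I'm describing could be wired up like this. Everything here is a stand-in - the step count, the clip features, and the little scoring head are placeholders, not a real poultry-plant pipeline:

```python
# Sketch of the labeling scheme: short clips of each step, with deliberately
# botched takes scored low, all feeding one model. Dataset, feature extractor,
# and network are hypothetical stand-ins.
import torch
from torch import nn

NUM_STEPS = 12                 # hypothetical number of steps on the line
FEATURE_DIM = 512              # e.g. pooled frame embeddings from a video backbone

# Toy stand-in for labeled clips: (clip_features, step_id, score 0..1)
clips = [(torch.randn(FEATURE_DIM), torch.randint(0, NUM_STEPS, (1,)).item(),
          float(torch.rand(1) > 0.2)) for _ in range(200)]

# One small scoring head shared across steps: input = features + one-hot step id.
model = nn.Sequential(nn.Linear(FEATURE_DIM + NUM_STEPS, 128), nn.ReLU(),
                      nn.Linear(128, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(10):
    for feats, step, score in clips:
        step_onehot = torch.zeros(NUM_STEPS)
        step_onehot[step] = 1.0
        pred = model(torch.cat([feats, step_onehot]))   # "was this step done right?"
        loss = loss_fn(pred, torch.tensor([score]))
        opt.zero_grad(); loss.backward(); opt.step()
```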
Mike
How does this solve the aforementioned closed box problem? A human approaches the box, touches it, sees how it moves, determines how heavy it might be, determines how unbalanced it might be, all in a few milliseconds … and none of that thought process appears on the video. Heck, it's hardly even conscious in a human after a few years of living.
Optimus will begin with doing some basic repetitive tasks that require humans (wiring?).
It does NOT need to solve more complex jobs with v1.
It doesn't. I was addressing just the idea of the robot learning the incremental step-by-step skill of cutting up chickens in a food processing facility.
Mike
23 chickens, only a few days. Finding a robot that could successfully peel a hard boiled egg without taking half the egg with it, that will take longer.
Probably not. I think you're overstating the degree to which general image recognition, which can happen at a very high level of generality with only a few hundred iterations, overlaps with the type of hyperspecific categorization necessary for more complicated tasks. This is from only a few months ago:
The manufacturing sector, including automotive plants, has relied on robotics for several years, but putting a bolt in a car is very different than working with meat, since no two animals are the same.
One of the challenges is replicating the human eye and touch. So far, robotic butchers aren't able to make precise cuts and can also struggle to accurately tell the difference between skin, fat, bone and meat in chicken and turkey facilities.
This is the Catch-22 of a new, general purpose robot. The jobs that are easy to replace with machines have already been replaced with machines. The ones that remain to be automated are things that machines have difficulty doing - things that are hard for computer vision to handle, manipulating objects that are hard for machines to handle. So if even a bespoke machine - which can be loaded up with as many high-end cameras and as many actuators and sensors as we want - can't handle these jobs, it's going to be very difficult to advance a general purpose robot with only two "eyes" and basic-model "touch" and motor control to the point where it can do it.
Take it in stages. I wonder if there is a way to tell the robot/AI "This is a pile of chickens. Look at them, pick them up, move them around, study them, learn everything you can about them." Then, as the next step, "These are chickens after they are cut up. Study them." After that, "Watch these people use knives to cut up chickens." And finally, "Here is a knife. Use it to practice cutting up chickens into pieces as you have been shown."
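If you squint, those stages are just a training curriculum - something like the skeleton below, where each phase hands its weights to the next. The phase names and trainers are placeholders for the idea, not a claim about how anyone actually builds this:

```python
# Sketch of the staged idea as a training curriculum: each phase reuses the
# weights from the previous one. Structure only - the trainers are stubs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Phase:
    name: str
    train: Callable[[dict], dict]   # takes current weights, returns updated weights

def explore_objects(weights):      # self-supervised: look at / poke the chickens
    return weights

def study_targets(weights):        # learn what "correctly cut up" looks like
    return weights

def imitate_humans(weights):       # behavior cloning from demonstration video
    return weights

def practice_with_knife(weights):  # reinforcement learning on the real task
    return weights

curriculum = [Phase("explore", explore_objects), Phase("targets", study_targets),
              Phase("imitate", imitate_humans), Phase("practice", practice_with_knife)]

weights: dict = {}
for phase in curriculum:
    print(f"running phase: {phase.name}")
    weights = phase.train(weights)
```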
How do very young kids learn? By stages. It's important not to commingle industrial robots - numerically controlled robots - with intelligent robots. Knowledge is assimilated in stages.
Knives? Careful, don't cut yourself!
Electric sockets? Careful, don't electrocute yourself!
The good thing about AI is that what one "robot" learns teaches all robots via over-the-air software updates. Don't ask me for details but surely the smart folks at Tesla are finding out. The robots are humanoid in more ways than one. Conceptualizing humanoid robots after industrial robots is a big mistake.
The Captain
Yeah…and that type of unbelievable advancement in AI would be something to behold. But AIUI, that's not what current technology - including Tesla's current technology - can do. These machines don't "learn" or "think" the same way that young kids do. We use those terms as anthropomorphizing metaphors to describe the process of developing pattern-recognition algorithms when trained on massive datasets, not because they actually describe what the AI "black box" is doing. Even the most cutting-edge AIs today can't generalize or reason.
So, no, the robots aren't humanoid in more ways than one. Even the state of the art AI isn't humanoid yet. Tesla's robots are only humanoid in form factor.
One other point:
I'm pretty sure this is false - or at least inconsistent with how AI programs are developed and applied. Your Tesla car doesn't "learn" anything when it encounters new data - the software it runs is downloaded from Tesla, and it just runs it. The software is developed on Tesla's supercomputers - those supercomputers (now Dojo) are the AI machine that "learns" (metaphorically) based on the training data it is provided.
The robot, like your car, isn't going to have a supercomputer in it - not for $20K, certainly. The robot can't "learn," at least not any differently than a "control by numbers" robot can today. It can (at best) upload its observations OTA to another machine that has enough supercomputing power to be trained on the data.
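In code terms, the split I'm describing looks something like this: inference on downloaded weights at the edge, actual training back on the cluster, and the OTA update is just a new weights blob. The class names and the toy network are made up for illustration; none of this is Tesla's actual API:

```python
# Sketch of the split: the robot/car only runs inference on downloaded weights
# and queues observations for upload; learning happens on the training cluster,
# which then ships new weights over the air. All names are illustrative.
import io
import torch
from torch import nn

def make_policy() -> nn.Module:
    # Tiny stand-in for the real perception/control network.
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

class TrainingCluster:
    """Where the 'learning' actually happens, on aggregated fleet data."""
    def __init__(self):
        self.policy = make_policy()

    def train_on(self, fleet_observations, targets):
        opt = torch.optim.SGD(self.policy.parameters(), lr=1e-2)
        loss = nn.functional.mse_loss(self.policy(fleet_observations), targets)
        opt.zero_grad(); loss.backward(); opt.step()

    def publish_weights(self) -> bytes:
        buf = io.BytesIO()
        torch.save(self.policy.state_dict(), buf)   # the OTA payload
        return buf.getvalue()

class EdgeRobot:
    """Runs a frozen policy; logs observations but never trains on-board."""
    def __init__(self, ota_payload: bytes):
        self.policy = make_policy()
        self.policy.load_state_dict(torch.load(io.BytesIO(ota_payload)))
        self.policy.eval()
        self.upload_queue = []                       # observations to send back

    def act(self, observation: torch.Tensor) -> torch.Tensor:
        self.upload_queue.append(observation)
        with torch.no_grad():
            return self.policy(observation)

# One round trip: train centrally, deploy OTA, run inference at the edge.
cluster = TrainingCluster()
cluster.train_on(torch.randn(64, 16), torch.randn(64, 4))
robot = EdgeRobot(cluster.publish_weights())
print(robot.act(torch.randn(16)))
```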