Tesla Optimus Gen 2

A brief history of the Tesla humanoid robot

The Captain

1 Like

Yes, but to again use Denny’s terminology, that’s a “control by numbers” situation. If you tell the robot what an object is made of and the various characteristics of the material, then it has some additional information about what to do. But that’s entirely different from a machine “learning” to look at an object and figure out how to manipulate it in the real world without someone having programmed in what that object is made of.

Again, it’s very different from what we were able to do with language. Just like in CAD, we’ve always had the ability to “hand program” in all the various rules of language, definitions, grammar requirements, and the like into a chat simulator - and many of the pre-ChatGPT chatbots had a fair amount of capability based on the CAD-equivalent of language. But it wasn’t until we started using the Big Data around language that we saw a real step change in what the AI was capable of.

We don’t have a source of Big Data for real world physical interactions, which is going to keep Optimus from being much of anything worthwhile for quite a long time.

Of course, but how do people decide what material something is, or most likely is, made of? Vision, mostly, but not exclusively. The 2012 ImageNet database even has a few dozen(?) materials/fabrics, etc., that machines can easily identify, even though this dataset wasn’t directly intended for this purpose.
In 10 seconds I found the MIT material database from 2014 and there are many others.

https://people.csail.mit.edu/lavanya/fmd.html

It can be difficult distinguishing a shiny aluminum sheet from a steel sheet (given no rust or other hints), so just like a human you’d have to use something beyond vision alone, but the robot would be able to instantly come up with a list of possible materials and a range of weights/densities.
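
For what it’s worth, the “guess the material from an image” part is fairly standard supervised learning these days. Here’s a minimal sketch, assuming images sorted into one folder per material (the paths, class count, and hyperparameters are all just illustrative), of fine-tuning a pretrained CNN on something like the FMD linked above:

```python
# Toy sketch: fine-tune a pretrained ResNet on a material-classification dataset.
# Folder layout ("materials/train/metal/...", "materials/train/fabric/...") is assumed.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("materials/train", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # e.g. 10 FMD categories

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # only train the new classifier head
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```

The output would only ever be a ranked list of likely materials with a density range, exactly as described above - vision gives you a prior, not certainty.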

Of course, anyone intentionally trying to fool a robot would have an easy time just like Hollywood prop makers create foam rocks and rubber hammers.

My problem with the Tesla bot isn’t with things like this…I just don’t think that for the price there will be enough market demand for a mobile robot.

Mike (mentoring high school FIRST robotics teams for almost 20 years)

2 Likes

Perhaps, though it’s not merely a “foam rock” problem. Many things in the real world are painted, wrapped, coated, enclosed in another object, or any of a hundred other things that make it somewhat difficult to figure out from vision alone what it is and what it’s likely to be. And that’s only a super-tiny portion of the problem. It’s not merely about recognizing the substance something is made of, but knowing how to manipulate it in the physical world - how much force to exert to carry something in one “hand” without breaking the object, without it slipping, or without overbalancing the robot.

You’re right that some of that can be programmed in. But the reason the Asimos and Atlases of the world never progressed very far is that “control by numbers” only gets you so far. Until we see Tesla doing something that actually moves beyond what robotics has been able to do for the last several years (rather than just duplicating it in their in-house shop), there’s no reason to think that Optimus will be anything beyond an R&D project. The advances in AI have come with massive datasets, and Dojo is designed to process those massive datasets - but without a massive dataset, what is Tesla actually bringing to the table that’s new or different for Optimus?

1 Like

I don’t know. But like I said, I don’t think there is a big enough market for a humanoid robot for the price. But in the case of practical industrial uses, you don’t need some giant dataset for vision; it is pretty much a solved problem, or easily solvable, for any practical use already. In last year’s FIRST robotics competition, high school kids were using the YOLO AI model to detect and pick up game pieces within only a few weeks of knowing what the game was.
(yes, a very narrow case that was programmed into their robots…but these were high school kids and a 10-year-old embedded CPU)
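
To give a flavor of how little code that takes nowadays, here’s roughly what such a pipeline looks like with the ultralytics package (the model file, dataset YAML, and class names below are illustrative guesses, not what any particular team actually ran):

```python
# Illustrative sketch: fine-tune a small pretrained YOLO detector on the
# season's game pieces, then run it on camera frames. Names are made up.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                        # small pretrained detector
model.train(data="game_pieces.yaml", epochs=50)   # custom labels, e.g. "cone", "cube"

results = model("camera_frame.jpg")               # detect pieces in one frame
for box in results[0].boxes:
    name = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()         # pixel bounding box to aim the grabber at
    print(name, round(float(box.conf), 2), (x1, y1, x2, y2))
```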

Mike

1 Like

Sure - and I think that’s one of the main reasons why people have been so optimistic about autonomous cars. The only sense input you really need to use to drive a car is sight, and maybe a little bit of hearing. And since vision is basically solved, it should be possible for an AI to drive a car. Because driving almost never involves manipulating any objects in the environment. There’s no touching. In fact, it’s almost all about not touching. For any given object, the AI never needs to know anything about its weight or durability or composition or anything apart from what it looks like.

But robots, even industrial robots, are different. Lifting, moving, manipulating, installing, carrying…these are all things that require the device to interact with objects in the environment. Which generally requires knowing more about how those objects will react when you do things to them, more than just what vision can reveal. How hard do you have to press on a door to open it? How much force do you use to open a closed drawer, or pull the sheets off of a bed for washing?

You can let a robot AI watch a million hours of poultry plant workers breaking down a chicken. But since that video doesn’t contain information about how much force the workers’ hands and fingers are applying to the knife, to the bones, to the chicken meat, etc., it can’t “learn” from that how to break down a chicken the way that an LLM bot can “learn” how to compose texts by being fed all of human writing.

1 Like

But you can then give it chickens to cut up and let it learn from the experience.

1 Like

Could you? I mean, the really amazing strides in AI have resulted from those brains having access to massive numbers of “examples” of things (images, words, video, etc.) in “computer time” (being able to process millions and millions of instances). I’m not sure we’ve developed the tech where a computer brain could learn how to cut up a chicken without being programmed to, and instead using the ‘trial and error’ approach used to train neural nets in LLMs and other environments. How many chickens, and how much time, would you need?

Good question.
Are you familiar with reinforcement learning or reinforcement learning with human feedback?

The way I think it could work, very simplified, is something like this (there’s a rough code sketch after the list):

  • A robot is given a very simple goal (maybe step 1 of 10 towards the end goal) and a human rates the result. Rinse and repeat for each step.
  • You build a database of good and bad results and train an evaluator AI model that replaces the humans – most of the time.
  • Next you place the evaluator model into the robot so it can judge its own results.
    (There is no “conflict of interest” since it can be the identical model running)
  • At this point you can increase the task complexity to chain together all the steps, requiring that each step be completed to a high degree of accuracy before the next one is started.
  • Now you build a physics-accurate model of the robot…easily done, since you used advanced CAD software to design the robot to start with. You pair this with a 3D game-like simulator and you can train away in the cloud at computer speeds. You will probably need a rough physics model of chickens as well, or whatever else you want the robots to cut up or handle.
  • Finally, you pair all this with AI models that watch videos and create the steps to accomplish other tasks
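
To make the first three bullets a bit more concrete, here’s a toy PyTorch sketch (every name, feature size, and threshold is a placeholder): humans rate attempts, a small network learns to reproduce those ratings, and that same network then scores the robot’s own attempts.

```python
# Toy sketch of bullets 1-3: train an "evaluator" on human-rated attempts,
# then let it stand in for the human rater. All data here is a random stand-in.
import torch
import torch.nn as nn

class Evaluator(nn.Module):
    """Scores a feature vector describing one attempt at a sub-task, 0 (bad) to 1 (good)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

# 1) Database of attempts: features from the robot's perception stack plus human ratings.
attempt_features = torch.randn(1000, 128)               # stand-in for logged attempts
human_ratings = torch.randint(0, 2, (1000, 1)).float()  # 1 = good attempt, 0 = bad

# 2) Train the evaluator to reproduce the human judgments.
evaluator = Evaluator()
opt = torch.optim.Adam(evaluator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(evaluator(attempt_features), human_ratings)
    loss.backward()
    opt.step()

# 3) On the robot (or in the simulator), the same network scores new attempts;
#    only steps scoring above a threshold unlock the next sub-task in the chain.
with torch.no_grad():
    score = evaluator(torch.randn(1, 128)).item()
if score > 0.9:
    print("step accepted - move on to the next sub-task")
```

The “evaluator in the robot” bullet is just this same trained network copied onto the robot’s own computer.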

All the parts to do this already exist, maybe with the exception of the last step, which is still a work in progress AFAIK. And there are probably other good ways to solve this. You just need a company with enough resources to put it all together.

Mike

2 Likes

Learning never ends.

The Captain

2 Likes

Isn’t that one of the main problems - building that database? Most of the really exciting AI advancements these days are trained on huge databases that we spent millions of man-hours creating for other reasons. We digitized virtually every instance of written communication that existed prior to the internet, and almost every recent instance of written communication is generated electronically. So too with images - we created a dataset of literally billions and billions of labelled images.

This is ostensibly Tesla’s advantage in building an AV AI - that Tesla has the big dataset of billions of car-miles to train its AI in Dojo, and competitors don’t. But no such dataset exists for the manifold tasks an Autonomous Humanoid Robot would potentially be asked to handle - and no similarly practical way of building one.

Isn’t that the other main problem here?

We’ve automated a massive number of jobs using conventional “control by numbers” programming. In assembly/manufacturing, many of the jobs that haven’t been automated are the ones where we just can’t program a machine to do the task, for a variety of reasons. One of the common reasons, AIUI, is that the thing the machine needs to act on has characteristics that make it all but impossible to model internally with any precision.

Things that are floppy, squishy, irregular in size and shape and rigidity. Intensely variable situations, where it’s unknown in advance what the machine will encounter. These are things that we can’t just program the machine to handle. Humans are amazing at our ability to manipulate those kinds of objects and situations, but machines have a lot of difficulty in handling them.

That’s why Cargill still has tons of poultry processing plant employees, instead of the lines of machines you see in many other, simpler, food production processes. It’s not a solvable problem today.

1 Like

Not really. Just pointing cameras at humans doing each of the steps at a processing facility for just 100 iterations would be enough to capture all the correct views. Since the humans aren’t making many errors, you’d need them to fake some mistakes so it doesn’t take hours and hours - things like, for step 1, holding the knife incorrectly or not even picking it up, so you could score those attempts low.

That kind of giant dataset is only needed when you have to encompass a vast number of edge cases that could occur at any time, all of which have to be handled within a single DNN model.
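
As a toy example of what building that good/bad database could look like (the folder names and step names are invented), the labeling pass itself is almost trivial - the expensive part is filming the humans, including the staged mistakes:

```python
# Toy labeling pass: clips of correct execution get label 1,
# deliberately staged mistakes get label 0. Paths are placeholders.
import csv
import pathlib

rows = [("clip", "step", "label")]
for clip in pathlib.Path("recordings").glob("step_*/*.mp4"):
    step = clip.parent.name                      # e.g. "step_01_pick_up_knife"
    label = 0 if "mistake" in clip.name else 1   # staged errors filmed and named separately
    rows.append((str(clip), step, label))

with open("step_labels.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```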

Mike

1 Like

How does this solve the aforementioned closed box problem? A human approaches the box, touches it, sees how it moves, determines how heavy it might be, determines how unbalanced it might be, all in a few milliseconds … and none of that thought process appears on the video. Heck, it’s hardly even conscious in a human after a few years of living.

2 Likes

Optimus will begin with doing some basic repetitive tasks that require humans (wiring?).

It does NOT need to solve more complex jobs with v1.

1 Like

It doesn’t. I was addressing just the idea of the robot learning the incremental step-by-step skill of cutting up chickens in a food processing facility.

Mike

1 Like

23 chickens, only a few days. Finding a robot that could successfully peel a hard boiled egg without taking half the egg with it, that will take longer.

1 Like

Probably not. I think you’re overstating the degree to which general image recognition, which can happen at a very high level of generality with only a few hundred iterations, overlaps with the type of hyperspecific categorization necessary for more complicated tasks. This is from only a few months ago:

The manufacturing sector, including automotive plants, has relied on robotics for several years, but putting a bolt in a car is very different than working with meat, since no two animals are the same.

One of the challenges is replicating the human eye and touch. So far, robotic butchers aren’t able to make precise cuts and can also struggle to accurately tell the difference between skin, fat, bone and meat in chicken and turkey facilities.

This is the Catch-22 of a new, general purpose robot. The jobs that are easy to replace with machines have already been replaced with machines. The ones that remain to be automated are things that machines have difficulty doing - things that are hard for computer vision to handle, manipulating objects that are hard for machines to handle. So if even a bespoke machine - which can be loaded up with as many high-end cameras and as many actuators and sensors as we want - can’t handle these jobs, it’s going to be very difficult to advance a general purpose robot with only two “eyes” and basic-model “touch” and motor control to the point where it can do it.

Take it in stages. I wonder if there is a way to tell the robot/AI “This is a pile of chickens. Look at them, pick them up, move them around, study them, learn everything you can about them.” Then, as the next step, “These are chickens after they are cut up. Study them.” After that, “Watch these people use knives to cut up chickens.” And finally, “Here is a knife. Use it to practice cutting up chickens into pieces as you have been shown.”

1 Like

How do very young kids learn? By stages. It’s important not to commingle industrial robots, numeric control robots, with intelligent robots. Knowledge is assimilated in stages.

Knives? Careful, don’t cut yourself!
Electric sockets? Careful, don’t electrocute yourself!

The good thing about AI is that what one ‘robot’ learns teaches all robots via over-the-air software updates. Don’t ask me for details but surely the smart folks at Tesla are finding out. The robots are humanoid in more ways than one. Conceptualizing humanoid robots after industrial robots is a big mistake.

The Captain

2 Likes

Yeah…and that type of unbelievable advancement in AI would be something to behold. But AIUI, that’s not what current technology - including Tesla’s current technology - can do. These machines don’t “learn” or “think” the same way that young kids do. We use those terms as anthropomorphizing metaphors to describe the process of developing pattern-recognition algorithms when trained on massive datasets, not because they actually describe what the AI “black box” is doing. Even the most cutting-edge AIs today can’t generalize or reason.

So, no, the robots aren’t humanoid in more ways than one. Even the state of the art AI isn’t humanoid yet. Tesla’s robots are only humanoid in form factor.

One other point:

I’m pretty sure this is false - or at least inconsistent with how AI programs are developed and applied. Your Tesla car doesn’t “learn” anything when it encounters new data - the software it runs is downloaded from Tesla, and the car just executes it. The software is developed on Tesla’s supercomputers - those supercomputers (now Dojo) are the AI machine that “learns” (metaphorically) based on the training data it is provided.

The robot, like your car, isn’t going to have a supercomputer in it - not for $20K, certainly. The robot can’t “learn,” at least not any differently than a “control by numbers” robot can today. It can (at best) upload its observations OTA to another machine that has enough supercomputing power to be trained on the data.
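
In code terms, the split being described looks roughly like this (a sketch, not Tesla’s actual stack - the model shape and filenames are illustrative): the “learning” happens off the robot, and the robot only runs the frozen network it was shipped.

```python
# Sketch of train-centrally / run-locally. Model shape and filenames are illustrative.
import torch

# --- on the training cluster ---
policy = torch.nn.Sequential(
    torch.nn.Linear(256, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16)
)
# ... training loop over the fleet's uploaded observations would go here ...
torch.jit.script(policy).save("policy_v42.pt")   # pushed to robots over the air

# --- on the robot ---
frozen = torch.jit.load("policy_v42.pt")
observation = torch.randn(1, 256)                # stand-in for the robot's sensor features
with torch.no_grad():
    action = frozen(observation)                 # pure inference - no on-device learning
# New observations get logged and uploaded for the next centralized training run.
```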