This is awesome. How much effort does it take to go from this to a generalist robot: “Go to the kitchen and get me a beer. If there isn’t any I’ll take a seltzer”.
It seems like the pieces are there: the ability to “reason” that the kitchen is a room in the house, that to get to another room the agent has to go through a door, that to get through a door it has to turn and pull the handle, etc. Is the limiting factor robotic control?
Notice where the funding is coming from on this though. Seems like the initial use case is more killer robots than robot butlers: situational awareness and target identification, under the guise of "common sense for robots."
If a killer robot doesn't have a practical military application, it could be used as a chef in the kitchen, fetching vegetables and meats and cutting them up to serve, though it would likely be used in commercial kitchens before it saw service in every home kitchen. Also, it would be good to hire a kitchen robot chef after its term of service is up, to reintegrate it back into society and boost the local economy. Strange that Infantry is a different MOS than Culinary Specialist.
Oh, actually if you ask ChatGPT to pretend to be a Military Killbot AI, it gets censored while planning to take out the enemy. But if you ask it to pretend to be Mr. Gutsy...
I think the limiting factor is the interface between ML models and robotics. We can't really train ML models end to end, since to train the interaction the model needs to interact, which limits the amount of data the model gets trained on. And simulations are not good enough for robust handling of the world. But I think we are getting closer.
TBH we're reaching a point where it's no longer about training a single model end-to-end. We now have computer vision models that can solve well-scoped vision tasks, robots that can carry out higher-level commands (going into rooms, opening doors, interacting with devices, etc.), and LLMs that can take a very high-level prompt and decompose it into the "code" that needs to run.
This all thus becomes an orchestration problem. It's just gluing together APIs, admittedly at a higher level. And then you need to think about compute and latency (power consumption for these ML models is significant).
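Roughly, that glue layer might look like this. A minimal sketch; every name here is a hypothetical stand-in for the actual planner, vision model, and robot SDK:

    # An LLM produces a plan; each step is dispatched to whichever
    # model or API handles it. Everything here is a stub for illustration.

    def plan_with_llm(goal: str) -> list[str]:
        # In practice this is an LLM call; hard-coded here so it runs.
        return ["navigate:kitchen", "detect:beer", "grasp:beer", "navigate:living_room"]

    def dispatch(step: str) -> None:
        action, _, target = step.partition(":")
        if action == "navigate":
            print(f"robot.go_to({target!r})")    # nav stack / robot SDK
        elif action == "detect":
            print(f"vision.find({target!r})")    # well-scoped vision model
        elif action == "grasp":
            print(f"robot.grasp({target!r})")    # manipulation controller

    for step in plan_with_llm("get me a beer"):
        dispatch(step)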
I suspect if an LLM were used to control a robot, it would do so through a high-level API it's given access to: things like stepForward(distance) or graspObject(matchId)
The API's implementation may use AI tech too, but that fact would be abstracted.
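Something like this, as a sketch (the method names just mirror the examples above; a real robot SDK would look different):

    class RobotAPI:
        """The surface the LLM is allowed to call. How each method is
        implemented (classical control, learned policy, ...) stays hidden."""

        def stepForward(self, distance: float) -> bool:
            # Could be an inverse-kinematics walk controller or a learned gait.
            raise NotImplementedError

        def graspObject(self, matchId: str) -> bool:
            # Could be a learned grasp policy behind the scenes.
            raise NotImplementedError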
End-to-end training on robots is often done via simulations. Physics simulations at the scale of robots we think of are quite accurate and can be played forward orders of magnitude faster than moving a physical robot in space.
I'd expect to find some end to end reinforcement learning papers and projects that use a combination of simulated experience with physical experience.
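The shape of that is usually something like the toy sketch below (no real RL library, everything stubbed so it runs):

    import random

    def rollout(env_step, n):
        # env_step() -> (observation, action, reward) tuple
        return [env_step() for _ in range(n)]

    def sim_step():  return ("sim_obs", "action", 0.0)    # cheap, fast, approximate
    def real_step(): return ("real_obs", "action", 0.0)   # expensive, slow, accurate

    replay_buffer = []
    for epoch in range(10):
        replay_buffer += rollout(sim_step, 1000)   # bulk of the data from simulation
        replay_buffer += rollout(real_step, 10)    # a trickle of real-world experience
        batch = random.sample(replay_buffer, 64)
        # update_policy(batch)  # gradient step on the mixed batch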
Yes, the problem is when trying to take the system out of the sim. Usually it doesn't survive contact with reality.
At least if we're talking simulators like Gazebo or Webots, they all use game-tier physics engines (e.g. Bullet/PhysX), which are barely passable for that purpose. If you want to simulate at a higher rate you'll need to either sacrifice accuracy or throw an absurd amount of resources at it. Likely both for sufficient speed.
But yes, overall I agree with your last point: it'll get the models into the ballpark, but they'll need lots and lots of extra tuning on real-life data to work at all. Unfortunately that data changes if you change the robot or its dynamics, so you're always starting from zero in that sense.
But are we starting from zero? Changing, e.g., a pivot point of a robot seems like it could be amenable to transfer learning. (Model-based RL in particular should build up a representation of its environment.) I haven't worked with robots in a long time, though … I may be over-enthusiastic?
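The transfer-learning hope, sketched (purely illustrative; it assumes the policy splits into a world-model part worth keeping and a control head worth retraining):

    import torch
    import torch.nn as nn

    class Policy(nn.Module):
        def __init__(self):
            super().__init__()
            self.world_model = nn.Sequential(nn.Linear(64, 128), nn.ReLU())  # env representation
            self.control_head = nn.Linear(128, 8)                            # joint commands

        def forward(self, obs):
            return self.control_head(self.world_model(obs))

    policy = Policy()
    # policy.load_state_dict(torch.load("old_robot_checkpoint.pt"))  # hypothetical checkpoint
    for p in policy.world_model.parameters():
        p.requires_grad = False  # keep what it learned about the environment
    # ...then fine-tune only control_head on data from the modified robot.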
GPT-5 figures out that if it picks up the knife instead of the bag of chips, it can prevent the human with the stick from interfering with carrying out its instructions.
And ViperGPT will take said knife and make the muffin division fair when there are an odd number of muffins, by slicing either a muffin or a boy in half.
The Boston Dynamics dog can open doors and things like that. It should be capable of performing all of the actions necessary to go get a beer. So I think it would be plausible to pull it all together, if you had enough money. It might take a bunch of setup first to program routes from room to room and things like that.
Might look something like this: determine the current room from a 360-cam image, select a path from the current room to the target room, and tell it to execute that path. Then take another 360-cam image and find the fridge. Tell it to move closer to the fridge, open the fridge, and take an image of the fridge contents with the arm camera. Use that to find a beer or seltzer, grab it, and then determine the return route and come back with the drink.
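As a rough control loop (every function below is a made-up stand-in for "ask the vision model" or "call the robot", stubbed so the sketch actually runs):

    # Hypothetical stubs; in reality each is a vision-model call or robot command.
    def capture_360():          return "360_image"
    def capture_arm_camera():   return "arm_image"
    def identify_room(img):     return "living_room"
    def plan_path(a, b):        return [a, b]
    def follow_path(path):      print("walking:", path)
    def find_object(img, name): return name if name != "beer" else None  # pretend we're out of beer
    def approach(obj):          print("approaching:", obj)
    def open_fridge():          print("opening fridge")
    def grasp(obj):             print("grasping:", obj)

    def fetch_drink(preferences=("beer", "seltzer")):
        room = identify_room(capture_360())          # which room are we in?
        follow_path(plan_path(room, "kitchen"))      # pre-programmed route
        approach(find_object(capture_360(), "fridge"))
        open_fridge()
        contents = capture_arm_camera()
        for drink in preferences:                    # beer first, fall back to seltzer
            if find_object(contents, drink):
                grasp(drink)
                follow_path(plan_path("kitchen", room))
                return drink
        return None

    print(fetch_drink())  # -> "seltzer" in this stubbed run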
But, not so sure I would want to have it controlling 35+ kg of robot without an extreme amount of testing. And then there are things like: Go to the kitchen and get me a knife. Maybe not the best idea.
The point is to avoid the need to "program routes" or "determine current room". The LLM is supposed to have the world-understanding that removes the need to manually specify what to do.
Determining the current room is a step GPT-4 could take care of by looking at the surroundings. The one thing I wasn't sure it could do was figure out the layout of the house and determine a route from that. And I would rather provide it with some routes than have it wander around the house for an hour. I didn't figure real-time video is what it was going to be best at. But it can certainly say the robot is in the living room and needs to go down the hall to the kitchen, and if the robot already knows how to get there, it just tells the robot to go. I'm sure there is another model out there that could be slotted in, but as far as just the robot plus GPT-4 goes, it might not quite be there. Just guessing at how they could fit together right now.
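Concretely, "provide it with some routes" could be as simple as this sketch (prompt wording and names are made up): the model only has to name the current room and pick a known route, not invent one.

    ROUTES = {
        ("living_room", "kitchen"): ["hall", "kitchen_door"],
        ("kitchen", "living_room"): ["kitchen_door", "hall"],
    }

    def room_prompt(image_description: str) -> str:
        rooms = sorted({room for pair in ROUTES for room in pair})
        return (
            "The robot's camera sees: " + image_description + "\n"
            "Which room is it in? Answer with one of: " + ", ".join(rooms)
        )

    # current_room = ask_gpt4(room_prompt(describe(camera_image)))  # hypothetical call
    current_room = "living_room"                                    # stubbed answer
    print("execute route:", ROUTES[(current_room, "kitchen")])      # robot already knows these waypoints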
I think we’re pretty much there. Like the other comment pointed out, PaLM-E is a glimpse of it. Eventually I think this kind of thing will work its way into autonomous cars and a lot of other mundane stuff (like Roombas) as it becomes easier to do this kind of reasoning at the edge.
I think that even when these systems are extremely accurate, the mistakes they make are very un-human. A human might forget something, or misunderstand, but those errors are relatable and understandable. Automated systems might have the same success rate as a human, but their errors can be very counterintuitive, like a Tesla coming to a stop on a freeway in the middle of traffic. There are things that humans would almost never do in certain situations.
So yeah, I think that's the future, but I think the user experience will be wonky at times.