The real value of this is the algorithm of "one-shot imitation learning" (the paper is here: https://arxiv.org/pdf/1703.07326.pdf ). The title about robots is only to catch the attention of the media. The domain is simple to show the idea, but it can be applied to more complex domains once you know how to define them (which is usually really complex). The blocksworld domain is used because it has been used in automated planning for decades, and it is well known in research. It feels trivial to use only 6 blocks, but when you want to automatically plan the steps to reach the final position, it is not that simple for computers.
+1 that the meta-learning spin on this approach is the really interesting part. The normal approach would be as follows:
"you want to stack 6 blocks on one another? great, let me collect 1,000 examples of doing that in VR, and I'll train my policy on this and see how that works"
instead, we change the question:
"you want to stack 6 blocks on one another? great, that's one possible thing out of thousands you might want to do. so lets create a dataset of 1,000 examples of tuples: one 'query' demonstration, and a second demonstration as the target behavior to train the network on, when it sees the query. The training data is now 1,000 tuples of (query_demo, target_demo)), trained again with supervised learning."
Once this is trained, we can (in theory) sub in any arbitrary desired demonstration, and the network will learn how to "extract" what is intended, using the demonstration as a crutch to be imitated. It's a bit of a change of mindset, but a very powerful, much more general, and much more exciting one.
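To make that concrete, here's a minimal sketch of that supervised setup. This is not the paper's actual architecture; ImitationNet, the dimensions, and the MSE loss are placeholder assumptions. The network reads the whole query demo, gets one observation from the target demo, and is trained to predict the demonstrator's action at that observation:

    # Minimal sketch of the (query_demo, target_demo) training setup described above.
    # Not the paper's architecture; ImitationNet, DEMO_LEN, OBS_DIM, ACT_DIM are illustrative.
    import torch
    import torch.nn as nn

    DEMO_LEN, OBS_DIM, ACT_DIM = 50, 10, 4

    class ImitationNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Encode the entire query demonstration into a task embedding.
            self.demo_encoder = nn.GRU(OBS_DIM + ACT_DIM, 64, batch_first=True)
            # Predict an action from (task embedding, current observation).
            self.policy = nn.Sequential(nn.Linear(64 + OBS_DIM, 64), nn.ReLU(),
                                        nn.Linear(64, ACT_DIM))

        def forward(self, query_demo, obs):
            _, h = self.demo_encoder(query_demo)   # (1, B, 64)
            task_embedding = h.squeeze(0)          # (B, 64)
            return self.policy(torch.cat([task_embedding, obs], dim=-1))

    net = ImitationNet()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    # One training step on a batch of (query_demo, target_obs, target_action) tuples
    # (random tensors stand in for real data here).
    query_demo = torch.randn(32, DEMO_LEN, OBS_DIM + ACT_DIM)  # full first demo
    target_obs = torch.randn(32, OBS_DIM)                      # one obs from second demo
    target_action = torch.randn(32, ACT_DIM)                   # what the demonstrator did

    loss = nn.functional.mse_loss(net(query_demo, target_obs), target_action)
    opt.zero_grad(); loss.backward(); opt.step()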
karpathy, I see you are at Stanford for Deep learning and NLP... I'm working on a project for audio/sound classification and have been sniffing around for some folks who may have encountered a similar set of feature points for audio data in deep learning. Would you be open to connecting? If so, let me know an email or other way to contact you and I'll reach out.
For some constructive feedback, this is a really awkward way of getting in touch with someone. Assume they're busy and professional, you're sending them a message to ask them to send you a message to tell you how to send them a message...
If you want to contact someone, check their public profile on their website and see if they've said there's a preferred way (some people want everything to a certain email address, call them directly, never call them, flat out tell you not to contact them or more commonly just say email them and get to the point). Follow whatever they suggest.
Write something simple and clear, and be upfront about what you're asking for. Make it as easy as possible for the person to help you (this applies to both reading and answering the question). I'm far more likely to reply if I can open an email, type a sentence or two, and then move on.
With your message, I don't know if you're just after datasets, help with a particular problem, a mentor, business partner or what. I also don't know what area of audio/sound classification so if I was actually in that area then I'd not know right now if I could help or not (whereas if you'd said human voices, bird chirps, etc. I'd have a better idea).
Essentially, assume most people are pleasant and helpful but also extremely busy.
The algorithm has a major limitation: it's supervised learning, which means it needs big data sets for learning. And to create big training sets, you first need to solve the problem that you are teaching the network to solve. In this instance, the authors first needed to solve the problem of block stacking so that they could generate a training set, which they then used to learn how to stack the blocks. This issue is general to this method: let's say you want to create a maid robot; you would first have to solve all the problems relating to its duties (opening a door, cleaning the table, grasping a bottle ...) before you could train this network to solve them.
To resolve this limitation, we either need to create deep networks that do not need big data sets, which does not seem possible right now, or we try to factorize the problem differently: for example, we separate learning the control of the robot (e.g. how to move something from point A to point B) from learning how to abstractly solve the problem (in this case, learning that you need to stack the blocks in a certain order).
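As a toy illustration of that second option, here's a hedged sketch in which a symbolic planner decides what to stack on what, and a separate controller stub is responsible for actually moving a block. plan_stacking and move_block are made-up names for the example, not anything from the paper:

    # Sketch of the factorization suggested above: the "planner" decides *what* to do
    # (the stacking order), the "controller" decides *how* to move.

    def plan_stacking(goal_order):
        """Abstract level: return the (block, destination) moves to build the tower bottom-up."""
        return [(upper, lower) for lower, upper in zip(goal_order, goal_order[1:])]

    def move_block(block, destination):
        """Control level: in a real system this would be a learned or scripted
        motion policy (grasp, lift, place); here it just reports the action."""
        print(f"picking up {block} and placing it on {destination}")

    if __name__ == "__main__":
        for block, dest in plan_stacking(["A", "B", "C"]):
            move_block(block, dest)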
You can describe any collection of algorithms as a single, more complicated, one. And it seems reasonable to say that humans are quite advanced given we can't yet replicate some of their reasoning capabilities.
Yes, but to get to that point, Herbert needs to know the physics of the world, because the mail may be under the kid's football, which would drop on the floor if the mail is pulled, bounce around, and break some things. And the robot also needs to know about making the right decision: is it better to let the ball break the china vase, or to make a move that would hit the dog in the face, even if not very hard?
In order to get there, you need a robot that knows a lot of stuff.
I think for the example with the china you would just have it do its best to avoid things, and if it breaks stuff then it breaks stuff. Unless your robot breaks a ton of stuff in normal operation, I don't see how being unable to choose between different things to break is a large issue.
Having to anticipate how the robot will screw up and reorganize your life based around that is a big issue. It's a big enough reason for people to decide they need to give up a dog they love, and people love dogs a lot more than they love robots.
But the benefits of having dogs are mostly psychological, and so dog owners are a minority. Since robots will have economic benefits, people will use them even if it makes their lives worse.
This is placing a much higher standard on robots than humans are held to: to demonstrate that robots have to make hard decisions you presented a scenario where a human making either decision would have socially acceptable justifications available. But that just may be the key: maybe robots don't have to have superhuman decision making as long as they can produce humanly acceptable justifications for their decisions. That might still be hard, but at least you don't have to do that in a split second.
The central problem with this idea is noise and creepiness. Nobody wants a noisy robot, and roboticists tend to produce creepy ones. Both problems might be solvable, but I think robotics tends to focus on the technical problems at the exclusion of the aesthetics. It's definitely a startup idea.
What you're describing are separate concerns that I'm sure will be addressed before they're brought to market. Currently they're focussed on getting the robot to be able to learn from example. Spending a lot of time on aesthetics at the expense of getting the learning to work doesn't make sense at this point. You're exactly right that they're focussing on the technical problems: that's what needs to be solved. Once it works as expected, then they can focus on improving the aesthetics. A startup that focussed on making robots that look good but aren't effective at accomplishing tasks would likely have a very limited market.
Yeah, fair enough. I was thinking that BigDog was an example of this that works today, but the military passed on it because it was too noisy. But it didn't really do anything but move around.
People with central air handlers find the creaking and snapping of radiator heating very creepy, whereas radiator people find the whoooooosh of air handlers creepy. Kitchen appliances randomly and insistently beeping would likely have totally freaked out an ancestor just 200 years ago. The idea of voices and pictures of humans coming out of boxes was initially kinda disturbing. To say nothing of the sound of car vs horse.
Has anyone asked WHY robots NEED to have humanoid bodies?
I can see the case for familiarity, especially for robotic pets and attendants for children, but the human body and its limbs are the result of millions of years of evolution, in conditions that don't exist or apply anymore.
It had to adapt to being born naked and having to grow up, possibly on its own, and feeding itself and fending for itself...
I mean, consider wheels vs. legs. You don't see the former anywhere in nature, but it's the most efficient mode of locomotion in human society.
Why aren't we exploring more efficient robotic bodies that would also be easier to program?
Say, a rotating trunk-like body with multiple octopus-like tentacles with suckers instead of fingers. Just an idea. Or a robot that is actually a swarm of ant-like machines.
There will be useful cases for robots of all body types. BUT we humans have created a humanoid world where the human form makes sense. As such we'll need robots that mimic that for certain functions.
For example wheels: they're easier to do than legs (by far), but they get stuck on cables, carpets, and stairs.
Have you seen them working in real life? I can't say about the others, but the first one is, maybe not creepy, but it looks fake and cheap. It's usually standing in one place, with stuttering movement, etc. I don't want to say it is worthless. You have to start somewhere. But we are far away from smooth and immediate reactions.
I wouldn't mind a creepy, noisy robot if it did the chores for me while I was at work and away from the house. I would be more concerned about safety with small children. Mine often find unpredictable ways to injure themselves with inert small objects, so I would imagine the possibilities with a larger, moving machine would be greater.
Do you mean acoustic noise? Both noise and aesthetics are far simpler problems to solve than autonomous learning. I don't think there's much of a market for realistic, quiet robots until the central problem has been solved.
Several of the robots around my home (dishwasher, washing machine, etc.) make noise, and I already accept them. If another robot adds value, even with noise or creepiness, I cannot imagine rejecting it over those concerns.
> "Herbert: grab my car keys." "Herbert: set the dinner table." "Herbert: put the mail in the mailbox."
Doing that requires far more than simply being able to pick things up and put things down and walk around. To do that requires nearly human level AI that can "understand".
Not really. You just need some basic natural language processing to recognize what task needs to be done. There's some stuff like this already: https://www.youtube.com/watch?v=54HpAmzaIbs
No, really. Natural language processing isn't going to allow it to see and recognize the thing, nor navigate to the thing; it'll take vastly more complication than the OP implied to do something as simple as pick up something when instructed.
This transition from end-to-end differentiable 'black box' systems to multiple networks dedicated to certain tasks working in conjunction is interesting and very probably the idea that will keep this field going. We might not fully understand end to end systems in as much detail as we'd like to, but this abstraction layer enables us to at least know what part is doing what, empirically.
Is the vision network learning continuously, or has it been trained with many configurations of the blocks and gives a continuous output?
The post says that the imitation network takes the input from the vision network and processes it to infer the intent of the task. Isn't the "intent" always "to stack"? Or can the imitation also be just re-arranging blocks in another configuration?
This part is interesting, if I understood it well:
> "But how does the imitation network know how to generalize?
> The network learns this from the distribution of training examples. It is trained on dozens of different tasks with thousands of demonstrations for each task. Each training example is a pair of demonstrations that perform the same task. The network is given the entirety of the first demonstration and a single observation from the second demonstration. We then use supervised learning to predict what action the demonstrator took at that observation. In order to predict the action effectively, the robot must learn how to infer the relevant portion of the task from the first demonstration."
Does this mean that the imitation network has been trained on stacking, un-stacking, throwing...and other such tasks, and then it identifies that "stacking" is what is being done in order to imitate it?
Is there an ELI5 for what the 2 NNs are actually learning?
The vision network is trained beforehand on lots of different configurations in simulation and then used to infer the block locations in the image from the camera. So it's not learning continuously. The imitation network takes the block locations predicted by the vision network, together with the demonstration trajectory in VR, and imitates the task shown in the demonstration. So, it learns to look through the demonstration to decide what action to take next given the current state (i.e. location of blocks and gripper). To keep the setup simple, we only trained the imitation network on stacking tasks (so no unstacking, throwing, etc). In future work, we want to make the setup and tasks much more general.
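So, roughly, the run-time pipeline looks something like the sketch below. All of the function names, shapes, and the control loop itself are illustrative assumptions, not the released code; the stubs just show which network consumes what:

    # Rough sketch of the inference pipeline described above (illustrative only).
    import numpy as np

    def vision_net(camera_frame):
        """Trained beforehand on simulated images; not learning online.
        Returns an (n_blocks, 3) array of inferred block positions."""
        return np.zeros((6, 3))  # placeholder output

    def imitation_net(demo_trajectory, block_positions, gripper_state):
        """Conditions on the full VR demonstration plus the current state
        and outputs the next action (e.g. a target gripper pose)."""
        return np.zeros(4)       # placeholder action

    # Control loop: re-run perception, then ask the imitation network what to do next.
    demo_trajectory = np.zeros((100, 10))        # one recorded VR demonstration
    for _ in range(200):
        frame = np.zeros((128, 128, 3))          # current camera image
        blocks = vision_net(frame)
        action = imitation_net(demo_trajectory, blocks, gripper_state=np.zeros(3))
        # send `action` to the robot / simulator here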
The fact that NNs can generalize from simulated data to real data is very interesting. We can generate tons of simulated data of stuff like driving in simulators or even video games like GTA, and do experiments that would be way too dangerous or costly to perform in real life. It's not good when the first iteration of a reinforcement learner crashes the car at 60 miles per hour!
You can then add tons of randomization to the simulation to make sure it doesn't overfit to the particulars of the simulated data: random filters on the input, moving the camera around and vibrating it, making cars and pedestrians behave unrealistically erratically, having sensors fail, etc. If it can learn to handle these extreme situations in the simulators, hopefully it would generalize even to rare scenarios that occur in real driving.
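As a concrete (and purely illustrative) example of that kind of randomization, something like this could be applied to each simulated camera frame. The specific perturbations and ranges are assumptions for the sketch, not taken from any particular paper:

    # Sketch of domain randomization on a simulated camera frame (HxWx3 float array).
    import numpy as np

    rng = np.random.default_rng()

    def randomize_frame(frame):
        # Random color/brightness "filter" on the input.
        frame = frame * rng.uniform(0.7, 1.3, size=(1, 1, 3))
        # Simulate camera jitter/vibration by shifting the image a few pixels.
        dx, dy = rng.integers(-3, 4, size=2)
        frame = np.roll(frame, shift=(dy, dx), axis=(0, 1))
        # Occasionally "fail" the sensor by dropping the frame entirely.
        if rng.random() < 0.05:
            frame = np.zeros_like(frame)
        return np.clip(frame, 0.0, 1.0)

    frame = rng.random((128, 128, 3))   # fake simulated camera frame
    augmented = randomize_frame(frame)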
This, IMHO, is the real revolution of the one-two punch of AI/VR. Training robots in VR and then using them to produce in a real time environment that maps well to the VR env.
Humans interact in the VR env (with oculus/vive/etc) to train / configure the robots and assembly line.
'we've taught this robot to move small uniform blocks and we're going to make it perform arbitrary complex tasks on a variety of objects' sounds a lot like 'i'm trying to draw the mona lisa and I got her eyebrow down really good'
Depends on your goals and budget. I've seen a lot of cheap hobby servo based arms. Precision is awful, payload is miniscule. OTOH, you will learn why precision is important :) and you will learn the software of arm motion planning.
The next step up in cheap arms is to build around Bioloid/Dynamixel style servos, which will increase the budget significantly, but you will end up with useable but coarse precision. Meaningful payloads will still cause servos to overheat but at least those brands will shut down rather than smoke -- usually.
If you want to do serious research you will need a serious budget. Arms are hard. This is not to say you can't have a lot of fun and learn a lot with something less. A friend built a smart task lamp around Bioloids -- the goal was a lamp that would autonomously aim a work light where he was soldering, by following the soldering iron tip using CV and a cheap web cam. This is totally within the payload and precision limits of hobby servos, and the software can run on a RasPi.
The first question is what you want to do with the arm, and why a mechanically simpler contraption can't be used: something like a gantry if it's fixed, or something like a dump truck or flatbed truck if it's mobile.
Now - granted, that's a lot of money for a small robot arm (actually, it isn't - price out a TeachMover or Rhino arm), but take a look at it - it's not just a robot arm.
Probably its best additional use is as a 3D printer - so if you have given thought to buying one of those, too - well, here you can have both.
You can also purchase one on Amazon, if you prefer that method. It was also reviewed in March in Servo Magazine:
Now granted, this arm isn't a kit - but usually, if you want a kit, you'll sacrifice precision, and if you want precision, you'll get it (usually) pre-built.
Note that the lower the mass on the end of the arm, the more precisely it will move (and the more it will be able to lift). If the servos or motors are mostly near the base of the arm, that's going to be the best placement; the popular "pallet stacking" style parallel arm kits you see sold out there typically have this arrangement, with only a servo on the end for the gripper (one of these actually isn't bad, if you pair it up with quality metal-gear dual-ball-bearing servos).
Another option would be to 3D print a robot arm; there are plenty of examples and files out there for that, if you already have a 3D printer or access to one.
You might look around on Ebay or similar for an older "vintage" desktop robot arm from the 1980s (an old TeachMover or similar arm is ideal), and try to repair or refurbish it (there are a few sellers that do this as well, and their work can be stellar - but you'll pay the price for it, too). For instance, I got an old Rhino XR-1 arm from Ebay for a couple hundred dollars - it's controlled using a simple serial port (RS-232) protocol. Most of those early arms work the same way.
Another option might be to try to replicate one of those early robot arms, or build your own arm from parts - Pringles cans can make a great starting point!
Finally - and only do this if you are really serious - you can find on Ebay used industrial robot arms for sale; look for one that is "lab grade", as they'll usually be smaller in size and cleaner (most of the arms on there - while a great buy considering what they cost new - are so big you'll need a truck and pallet jack to move them, plus 440v to run 'em - not to mention all the hydraulic mess). The downside will probably be on the interfacing end; you might have to do something custom there (ie - rip out the old controllers and hack your own in place).
We actually use both. The data used to train the neural networks is all generated through a scripted policy simulation. We then use a single human demonstration in VR to show the robot how to carry out a particular task. VR lets a human do a demonstration quite naturally, and would also be an effective way to collect training data for more complex tasks, where we're not able to create a scripted policy.
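For illustration, generating that kind of dataset with a scripted policy might look roughly like the sketch below. The environment API (reset/step, state.is_on) and the policy itself are made up for the example, not the actual setup:

    # Hedged sketch: roll out a hand-written stacking policy in simulation and
    # record (observation, action) pairs as training demonstrations.

    def scripted_stacking_policy(state, goal_order):
        """Hand-written policy: move the next out-of-place block onto the stack."""
        for lower, upper in zip(goal_order, goal_order[1:]):
            if not state.is_on(upper, lower):
                return ("place", upper, lower)
        return ("noop",)

    def collect_demo(env, goal_order, max_steps=100):
        """Roll out the scripted policy and record (observation, action) pairs."""
        demo, obs = [], env.reset()
        for _ in range(max_steps):
            action = scripted_stacking_policy(env.state, goal_order)
            demo.append((obs, action))
            if action[0] == "noop":
                break
            obs = env.step(action)
        return demo

    # dataset = [collect_demo(env, random_goal()) for _ in range(1000)]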
It's demonstrating one-shot learning. If you want the robot to do something that you can do, just doing it while the robot watches is a lot easier than building simulations.
It's not easy to get a million training examples, if you can learn something useful from just one example, that's great.
Even if you do have a model.
Imagine you have the following tasks:
1) Type this sentence.
2) Specify the movements required to type that sentence in a robotics simulator.
If it can learn from 1, _you_ save a lot of effort, and the required expertise is lower.
You can generate lots and lots of simulated data. DL needs huge amounts of training data, and CG is basically a solved problem.
Actually, for me the most interesting thing is that training on synthetic images transfers so well to real camera frames. That means that I might actually be able to do some cutting edge stuff without a huge curated dataset.
Vs. simulation, it lets you get trajectories from the way people actually move without having to make insanely complicated models. Vs. precollected training data, maybe the recording apparatus is less expensive/complex?
Imagine if you went back 50 years and told Terry Winograd or Marvin Minsky that in 50 years we'd still be trying to figure out how to get robots to stack blocks on top of each other. They'd think you were nuts.
It's called Moravec's paradox. Basically, the AI things that we thought would be easy ("computer vision, pfft, give it a few months") turned out to be really, really hard. And the AI/computer things that we thought would be hard were actually not that hard. (Some scenes of Star Trek: The Next Generation have Data pausing for a moment to return the result of a query. Google has faster response times today.)
Maybe the pause is for adding human-like "flavor" and not really necessary?
To take a Steve Jobs'ism, once tech advances sufficiently you have to introduce qualities that may have nothing to do with tech, to make things more pleasing to us humans.
I'm asking why it's so impressive that it learns by watching a person. Specifically, why we think Terry Winograd would be impressed that 50 years of AI research led to this.
Didn't Deep Blue do the same thing as part of its chess training in the 1990s? Social/observational/modeling learning has been studied in psychological research for over 50 years: https://en.wikipedia.org/wiki/Social_learning_theory Isn't basically every recommender system, all of the vision systems, and the machine translation we all use... isn't all of that just observational learning?
I'm not saying it's not interesting. It's a cool toy. But Elon Musk gets an idea for a neat machine learning toy, using an idea that predates AI as a field, and Terry Winograd is supposed to be impressed by this?
The point is, AI has only been narrowly successful. Narrow success is rad. I'm not hating. But none of the promises of broad intelligence have really progressed. Siri is not a particularly meaningful advance beyond SHRDLU. Instead of just stacking blocks, she stacks text on Messages.app and pushes buttons in Weather.app.
What's interesting about this particular paper is that it can learn from a single demonstration, instead of needing thousands of examples before it can do anything. Even cooler, it learns how to learn. They didn't hand-code how to do this; it figured it out through training. On top of that, it uses machine vision to learn to see all on its own, without having to be hand-coded with information about where things are.
More generally, it uses a deep neural network. Which is very different than the older approaches you mention, and can learn much more sophisticated functions. And has enabled a lot of results that would have been unimaginable previously.
As for the early AI researchers? They were insanely over-optimistic about how easy AI would be. It didn't seem like a hard problem. They thought they could just assign a bunch of graduate students to solve machine vision over the summer. It seems pretty simple. We can see great without even thinking about it; how hard can it be?
But after sinking their teeth into it a bit, I'm sure they would appreciate our modern methods. They might not be as elegant as they hoped. They wanted to find "top down" solutions to AI: simple algorithms based on symbolic manipulation and logic and reasoning. Such an algorithm probably doesn't exist, and an enormous amount of the history of AI was wasted searching for it.
And even if they did discover our modern methods much earlier, they wouldn't have been able to use them. It's only recently that computers have gotten fast enough to do anything interesting. It's like they were trying to go to the moon with nothing but hand tools. Sometimes you just need to wait for the tech tree to unlock the prerequisite branches first.
Pretty interesting. I suppose the point here, though, is that the machine learned the movement through general observation, as opposed to someone writing a very specific, complex algorithm by hand.
I think you would be surprised by how many things they imagined and even prototyped in some form. The Society of Mind videos have some very interesting historic tidbits (and more):
I don't think so. I'm sure they imagined the internet in some abstract way, and smartphones really are quite a step backward if you want to cite those. Search is also quite imaginable.
We have had hundreds of thousands of software engineers working in advertising companies or making hardware that gets used to sell products and keep the attention of the youth perpetually captured, hindering their growth as humans. I say we've regressed in a big way and they would think the same.
Check out Alan Kay's talks from this week if you still feel good about today's software industry.
(I'm trying to watch all of his recent talks. Despite [or because of?] being critical of the industry, they are some of the most inspiring videos I've seen in years.)
That doesn't change that I think there are a bunch of things they didn't imagine, and I think they would both agree. (Minsky unfortunately can't anymore.)
And sure, they can probably see things I can't; it's not a competition.
To be honest I think automated sandwich/burger making has been possible for a long time already. But fast food level labour is just so cheap that it never caught on.