I find the video super interesting to watch. It's not like previous Mario bots I've seen that were programmed with path-finding and a go-right goal which race through the level perfectly. Instead, this makes me think of a very young kid playing the game who isn't entirely sure what they should do, and is more interested in figuring out what the controls do and usually goes right because that's where more stuff is, not because they know they're "supposed" to do that.
The idea of curiosity (seeking out states that lead to unpredictable stuff) being a good scoring function for an AI seems really compelling to me. I'm really curious about what other places this idea can be applied to.
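For anyone wondering what that scoring function actually looks like, here's a minimal sketch of the prediction-error flavour of curiosity: reward the agent in proportion to how badly a learned forward model predicts the next state. The toy linear "model", the feature sizes, and the function names are all mine for illustration, not the paper's architecture.

    # Curiosity bonus as forward-model prediction error (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "forward model": predicts next-state features from current features
    # and the action taken. Here it's just a fixed random linear map + tanh.
    W_s = rng.normal(size=(16, 16))
    W_a = rng.normal(size=(16, 4))

    def forward_model(state_feats, action_onehot):
        return np.tanh(W_s @ state_feats + W_a @ action_onehot)

    def curiosity_reward(state_feats, action_onehot, next_state_feats):
        """Intrinsic reward = squared error between predicted and actual
        next state. Transitions the model can't yet predict pay out more,
        so the agent is pushed toward them."""
        predicted = forward_model(state_feats, action_onehot)
        return float(np.mean((predicted - next_state_feats) ** 2))

    # One made-up transition, just to show the call.
    s, a, s_next = rng.normal(size=16), np.eye(4)[2], rng.normal(size=16)
    print(curiosity_reward(s, a, s_next))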
At first glance it reminds me of an old paper I read on causal entropic forces [1], which is kind of a thermodynamic approach to understanding the emergence of complex behaviors. I've always had that in the back of my mind as an idea I'd never have the time or resources to investigate further.
I really enjoyed Level 2-2 when it repeatedly tries to swim up onto the land tiles - it feels like you should be able to do that and there might be a big reward for succeeding, so it's very natural to keep trying for some time.
I remember Schmidhuber showing off "artificial curiosity" stuff a while back (e.g. http://people.idsia.ch/~juergen/interest.html ). In particular, ideas like "compression progress" have been influential on my own research about how to measure what's "interesting", and I've implemented a rudimentary version of PowerPlay (and a slight alternative) at http://chriswarbo.net/projects/powerplay
(He's since applied these ideas to art, humour, etc. which I think is a nice thought, but not worth taking particularly seriously)
For longer levels, I think training on the later parts of the level tends to change the policy so it no longer does as well on the earlier parts. I suspect it would do fine on the linear levels if the number of agents and the batch size were increased.
There were labs at my uni working in non-deep RL and evolutionary robotics that researched curiosity rewards a few years back, but I hadn't seen anything like this until this year.
Is it just scale and more powerful NNs that are making curiosity work well now? The Montezuma's Revenge results and this Mario video look like a serious breakthrough to me (non-expert).
A naive brute force algorithm would take a lot longer to complete some of these levels.
Let's say it takes 45 seconds to complete a level. That's 45 * (60 / 12) = 225 moves. The size of the action space is 14, so you'd be looking at 14 ^ 225 (give or take a few orders of magnitude) trajectories before finding the solution.
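Quick back-of-the-envelope check of those numbers (assuming the 60 / 12 means 60 fps with each chosen action held for 12 frames):

    seconds = 45
    actions_per_second = 60 / 12            # = 5
    moves = int(seconds * actions_per_second)  # = 225
    action_space = 14
    trajectories = action_space ** moves
    print(moves)                  # 225
    print(len(str(trajectories))) # 258 digits, i.e. roughly 10^257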
To brute force in a reasonable time, you would have to look at the environment and have the algorithm iterate through trajectories in a clever way. For instance, you might choose to only try trajectories that constantly move right. This strategy may find a solution to 1-1 fairly quickly, but it does not generalize to other levels, especially ones that require backtracking or waiting.
You'd have to design a pretty gnarly algorithm for it to beat 1-1, 1-4 and 2-2. This gets even more complicated if you bring in other environments: the original paper also trained on Montezuma's Revenge, Private Eye, Venture, Freeway and Gravitar.
They have 350 hours of playing per level. They could brute force with these parameters:
1. Detect moving obstacles to avoid
2. Detect stationary obstacles to avoid (gaps)
3. Player moves (jump (left, right)(up, far, farthest), walk)
4. Success conditions (hitting the flag, reaching the princess)
Once the above information is available from the screen, it becomes easily brute-forceable. I was hoping for genetic algorithms, i.e. they could have taken a successful path from one part of a level, crossed it with another, and tried the result across levels. But there won't be a generic strategy or real learning anyway, since there's a fair bit of randomness. So this does seem like brute forcing for a successful path.
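The crossover idea would look roughly like this. Everything here is hypothetical (population sizes, sequence length, the fitness placeholder); a real version would replay each action sequence in the emulator and score distance travelled or level completion.

    # Hypothetical GA over Mario action sequences: splice two sequences and
    # keep children that score better under some fitness function.
    import random

    ACTIONS = list(range(14))  # 14 button combinations, as discussed above

    def random_sequence(length=225):
        return [random.choice(ACTIONS) for _ in range(length)]

    def crossover(seq_a, seq_b):
        cut = random.randrange(1, len(seq_a))
        return seq_a[:cut] + seq_b[cut:]

    def evaluate(seq):
        # Placeholder: in practice, run `seq` in the emulator and return
        # how far right the player got (or whether the flag was reached).
        return random.random()

    population = [random_sequence() for _ in range(50)]
    for generation in range(100):
        population.sort(key=evaluate, reverse=True)
        parents = population[:10]
        children = [crossover(random.choice(parents), random.choice(parents))
                    for _ in range(40)]
        population = parents + children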
The whole point is that the algorithm doesn't have obstacles or success baked in as concepts. Likewise, this is pretty early-stage research, meant to inform and promote further work.
In other words, this isn't meant to be super useful by itself. It seems tailor-made (as many of these things do) to play super-simple '80s video games and literally nothing else, but it's an interesting proof of concept. I'd also be interested in different iterations on this general pattern - for instance, something that didn't translate directly from screen + buttons -> prediction, and instead had some interstitial systems: translating from screen -> entities, then predicting the state of those entities given button presses. It'd also be interesting to see how this performs with ML algorithms designed to learn on the fly instead of through training on a static set of data (at least, this looked like it learned through backpropagation - I skimmed).
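A rough sketch of what that interstitial split might look like; the class names and fields are made up for illustration, and the first stage is stubbed out where a vision model would go.

    # "screen -> entities -> entity-state prediction" instead of raw pixels.
    from dataclasses import dataclass

    @dataclass
    class Entity:
        kind: str      # e.g. "player", "goomba", "pipe"
        x: float
        y: float
        vx: float
        vy: float

    def detect_entities(frame):
        """Stage 1: turn raw pixels into discrete entities (stubbed here)."""
        return [Entity("player", 40.0, 160.0, 1.5, 0.0)]

    def predict_entities(entities, action):
        """Stage 2: predict each entity's next state given the button press,
        rather than predicting the next frame pixel-by-pixel."""
        return [Entity(e.kind, e.x + e.vx, e.y + e.vy, e.vx, e.vy)
                for e in entities]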
But I can see broader practical applications for this in, for instance, recommender systems trying to break users out of the closed feedback loop that people tend to end up in when going down certain rabbit holes (e.g. watch one Flat Earther conspiracy video and suddenly that's all you see for a week, because the recommender system knows that people who look at one will look at more). The point being: the real test comes when this strategy is exposed to more diverse problem spaces; it's just that those are harder to model, and we need to weed out the pointless stuff first.
A couple of years ago there was a paper reported somewhere (it may have been here) that dealt with unsupervised learning using entropy as the only fitness function. Regardless of the task or any other factors, the researchers used maximizing entropy as the only goal, and this immediately led to the development of complex, interesting, and desirable behavior. When used for a system balancing a pole, it would learn to balance the pole upright. When given a ball in an environment with a hoop, it would automatically navigate the ball through the hoop. I tried to reach out to the author to get a full copy of the paper (I could only find a paywalled abstract online) but never got a response. It seemed like a very interesting approach, and this sounds like basically the same thing: favor moving to whichever state keeps the most likely future states open. Increase entropy, and 'intelligence' emerges.
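If I understand the idea, it amounts to something like this toy version, where the only objective is the entropy of where random rollouts end up. The 1-D world, horizon, and rollout count are arbitrary choices of mine, just to show the shape of the computation.

    # Toy "maximize entropy of reachable futures" policy.
    import random
    from collections import Counter
    from math import log

    def step(state, action):
        # Toy dynamics: a point on a bounded 1-D line [0, 20].
        return max(0, min(20, state + action))

    def future_state_entropy(state, horizon=5, rollouts=200):
        """Estimate the entropy of where random rollouts from `state` end up."""
        ends = Counter()
        for _ in range(rollouts):
            s = state
            for _ in range(horizon):
                s = step(s, random.choice([-1, 0, 1]))
            ends[s] += 1
        return -sum((c / rollouts) * log(c / rollouts) for c in ends.values())

    def entropic_policy(state):
        """Pick the action whose successor state has the most diverse futures."""
        return max([-1, 0, 1], key=lambda a: future_state_entropy(step(state, a)))

    # Starting against the wall at 0, this typically picks +1: moving toward
    # open space keeps more future states reachable.
    print(entropic_policy(0))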