I find the video super interesting to watch. It's not like previous Mario bots I've seen that were programmed with path-finding and a go-right goal which race through the level perfectly. Instead, this makes me think of a very young kid playing the game who isn't entirely sure what they should do, and is more interested in figuring out what the controls do and usually goes right because that's where more stuff is, not because they know they're "supposed" to do that.
The idea of curiosity (seeking out states that lead to unpredictable stuff) being a good scoring function for an AI seems really compelling to me. I'm really curious about what other places this idea can be applied to.
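For anyone wondering what that scoring function actually looks like, here's a minimal sketch of the prediction-error flavour of curiosity: reward the agent in proportion to how badly a learned forward model predicts the next state. The toy linear "model", the feature sizes, and the function names are all mine for illustration, not the paper's architecture.

    # Curiosity bonus as forward-model prediction error (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "forward model": predicts next-state features from current features
    # and the action taken. Here it's just a fixed random linear map + tanh.
    W_s = rng.normal(size=(16, 16))
    W_a = rng.normal(size=(16, 4))

    def forward_model(state_feats, action_onehot):
        return np.tanh(W_s @ state_feats + W_a @ action_onehot)

    def curiosity_reward(state_feats, action_onehot, next_state_feats):
        """Intrinsic reward = squared error between predicted and actual
        next state. Transitions the model can't yet predict pay out more,
        so the agent is pushed toward them."""
        predicted = forward_model(state_feats, action_onehot)
        return float(np.mean((predicted - next_state_feats) ** 2))

    # One made-up transition, just to show the call.
    s, a, s_next = rng.normal(size=16), np.eye(4)[2], rng.normal(size=16)
    print(curiosity_reward(s, a, s_next))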
At first glance it reminds me of an old paper I read on causal entropic forces [1], which is kind of a thermodynamic approach to understanding the emergence of complex behaviors. I've always had that in the back of my mind as an idea I'd never have the time or resources to investigate further.
I really enjoyed Level 2-2 when it repeatedly tries to swim up onto the land tiles - it feels like you should be able to do that and there might be a big reward for succeeding, so it's very natural to keep trying for some time.
I remember Schmidhuber showing off "artificial curiosity" stuff a while back (e.g. http://people.idsia.ch/~juergen/interest.html ). In particular, ideas like "compression progress" have been influential on my own research about how to measure what's "interesting", and I've implemented a rudimentary version of PowerPlay (and a slight alternative) at http://chriswarbo.net/projects/powerplay
(He's since applied these ideas to art, humour, etc. which I think is a nice thought, but not worth taking particularly seriously)
For longer levels, I think training on the later parts of the level tends to change the policy so it no longer does as well on the earlier parts. I suspect it would do fine on the linear levels if the number of agents and the batch size were increased.
There were labs at my uni working in non-deep RL and evolutionary robotics that researched curiosity rewards a few years back, but I hadn't seen anything like this until this year.
Is it just scale and more powerful NNs that are making curiosity work well now? The Montezuma's Revenge results and this Mario video look like a serious breakthrough to me (non-expert).
A naive brute force algorithm would take a lot longer to complete some of these levels.
Let's say it takes 45 seconds to complete a level. That's 45 * (60 / 12) = 225 moves. The size of the action space is 14, so you'd be looking at 14 ^ 225 (give or take a few orders of magnitude) trajectories before finding the solution.
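Quick back-of-the-envelope check of those numbers (assuming the 60 / 12 means 60 fps with each chosen action held for 12 frames):

    seconds = 45
    actions_per_second = 60 / 12            # = 5
    moves = int(seconds * actions_per_second)  # = 225
    action_space = 14
    trajectories = action_space ** moves
    print(moves)                  # 225
    print(len(str(trajectories))) # 258 digits, i.e. roughly 10^257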
To brute force in a reasonable time, you would have to look at the environment and have the algorithm iterate through trajectories in a clever way. For instance, you might choose to only try trajectories that constantly move right. This strategy may find a solution to 1-1 fairly quickly, but it does not generalize to other levels, especially ones that require backtracking or waiting.
You'd have to design a pretty gnarly algorithm for it to beat 1-1, 1-4 and 2-2. This gets even more complicated if you bring in other environments: the original paper also trained on Montezuma's Revenge, Private Eye, Venture, Freeway and Gravitar.
They have 350 hours of playing per level. They could brute force with these parameters:
1. Detect moving obstacles to avoid
2. Detect stationary obstacles to avoid (gaps)
3. Player moves (jump (left, right)(up, far, farthest), walk)
4. Success conditions (hitting the flag, reaching the princess)
Once the above information is available from the screen, it becomes easily brute-forceable. I was hoping for genetic algorithms, i.e. they could have taken a successful path from one part of a level, crossed it with another, and tried the result across levels. But there won't be a generic strategy or real learning anyway, since there's a fair bit of randomness. So this does seem like brute forcing for a successful path.
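The crossover idea would look roughly like this. Everything here is hypothetical (population sizes, sequence length, the fitness placeholder); a real version would replay each action sequence in the emulator and score distance travelled or level completion.

    # Hypothetical GA over Mario action sequences: splice two sequences and
    # keep children that score better under some fitness function.
    import random

    ACTIONS = list(range(14))  # 14 button combinations, as discussed above

    def random_sequence(length=225):
        return [random.choice(ACTIONS) for _ in range(length)]

    def crossover(seq_a, seq_b):
        cut = random.randrange(1, len(seq_a))
        return seq_a[:cut] + seq_b[cut:]

    def evaluate(seq):
        # Placeholder: in practice, run `seq` in the emulator and return
        # how far right the player got (or whether the flag was reached).
        return random.random()

    population = [random_sequence() for _ in range(50)]
    for generation in range(100):
        population.sort(key=evaluate, reverse=True)
        parents = population[:10]
        children = [crossover(random.choice(parents), random.choice(parents))
                    for _ in range(40)]
        population = parents + children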
The whole point is that the algorithm doesn't have obstacles or success baked in as concepts. Likewise, this is pretty early-stage research, meant to inform and promote further work.
In other words, this isn't meant to be super useful by itself. It seems tailor-made (as many of these things do) to play super-simple '80s video games and literally nothing else, but it's an interesting proof of concept. I'd also be interested in different iterations on this general pattern - for instance, something that didn't translate directly from screen + buttons -> prediction, and instead had some interstitial systems: translating from screen -> entities, then predicting the state of those entities given button presses. It'd also be interesting to see how this performs with ML algorithms designed to learn on the fly instead of through training on a static set of data (at least, this looked like it learned through backpropagation - I skimmed).
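A rough sketch of what that interstitial split might look like; the class names and fields are made up for illustration, and the first stage is stubbed out where a vision model would go.

    # "screen -> entities -> entity-state prediction" instead of raw pixels.
    from dataclasses import dataclass

    @dataclass
    class Entity:
        kind: str      # e.g. "player", "goomba", "pipe"
        x: float
        y: float
        vx: float
        vy: float

    def detect_entities(frame):
        """Stage 1: turn raw pixels into discrete entities (stubbed here)."""
        return [Entity("player", 40.0, 160.0, 1.5, 0.0)]

    def predict_entities(entities, action):
        """Stage 2: predict each entity's next state given the button press,
        rather than predicting the next frame pixel-by-pixel."""
        return [Entity(e.kind, e.x + e.vx, e.y + e.vy, e.vx, e.vy)
                for e in entities]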
But I can see broader practical applications for this in, for instance, recommender systems trying to break users out of the closed feedback loop that people tend to end up in when going down certain rabbit holes (e.g. watch one Flat Earther conspiracy video and suddenly that's all you see for a week, because the recommender system knows that people who look at one will look at more). The point being: the real test comes when this strategy is exposed to more diverse problem spaces; it's just that those are harder to model, and we need to weed out the pointless stuff first.
A couple of years ago there was a paper reported somewhere (it may have been here) that dealt with unsupervised learning using entropy as the only fitness function. Regardless of the task or any other factors, the researchers used maximizing entropy as the only goal, and this immediately led to the development of complex, interesting, and desirable behavior. When used for a system balancing a pole, it would learn to balance the pole upright. When given a ball in an environment with a hoop, it would automatically navigate the ball through the hoop. I tried to reach out to the author to get a full copy of the paper (I could only find a paywalled abstract online) but never got a response. It seemed like a very interesting approach, and this sounds like basically the same thing: favor moving to whichever state keeps the most likely future states open. Increase entropy, and 'intelligence' emerges.
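If I understand the idea, it amounts to something like this toy version, where the only objective is the entropy of where random rollouts end up. The 1-D world, horizon, and rollout count are arbitrary choices of mine, just to show the shape of the computation.

    # Toy "maximize entropy of reachable futures" policy.
    import random
    from collections import Counter
    from math import log

    def step(state, action):
        # Toy dynamics: a point on a bounded 1-D line [0, 20].
        return max(0, min(20, state + action))

    def future_state_entropy(state, horizon=5, rollouts=200):
        """Estimate the entropy of where random rollouts from `state` end up."""
        ends = Counter()
        for _ in range(rollouts):
            s = state
            for _ in range(horizon):
                s = step(s, random.choice([-1, 0, 1]))
            ends[s] += 1
        return -sum((c / rollouts) * log(c / rollouts) for c in ends.values())

    def entropic_policy(state):
        """Pick the action whose successor state has the most diverse futures."""
        return max([-1, 0, 1], key=lambda a: future_state_entropy(step(state, a)))

    # Starting against the wall at 0, this typically picks +1: moving toward
    # open space keeps more future states reachable.
    print(entropic_policy(0))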