Deep Q-Learning: Space Invaders (maciejjaskowski.github.io)
129 points by stared on March 14, 2016 | 19 comments



I wasn't able to play the video for some reason, but I located it on YouTube:

https://m.youtube.com/watch?v=ZisFfiEdQ_E

For comparison, DeepMind:

https://m.youtube.com/watch?v=ePv0Fs9cGgU

Note that 550 is a very low score in Space Invaders.


Not OP, but I believe the low score is due to insufficient training time and incorrect parameters such as the frameskip. Space Invaders was mentioned as one of the few games for which they needed to lower the frameskip (from 4 to 3 or 2?) because of the flashing lasers. I'm assuming OP left the parameters as-is from Nathan Sprague's implementation, which has the frameskip at 4, and trained for only a few epochs.


Changing the frameskip is not needed any more, since the implementation takes the max of the last two frames before processing (the same as DeepMind does).

EDIT: I'm talking about Sprague's implementation, btw, not necessarily OP's.


Actually, I did both frame skipping and maxing over two frames, as reported in the Nature letter:

https://storage.googleapis.com/deepmind-data/assets/papers/D...
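For anyone curious what that looks like in practice, here is a rough sketch in Python (my own toy version, not the author's actual code; env_step is an assumed emulator interface): the agent repeats its chosen action for k emulator frames, and the observation it learns from is the element-wise max of the last two raw frames, which keeps sprites that flicker on alternate frames (like the lasers) visible.

    import numpy as np

    def act_with_frameskip(env_step, action, k=4):
        # env_step(action) is assumed to return (frame, reward, done);
        # the real emulator interface may differ.
        frames, total_reward, done = [], 0.0, False
        for _ in range(k):
            frame, reward, done = env_step(action)
            frames.append(frame)
            total_reward += reward
            if done:
                break
        # Max over the last two frames so sprites drawn only on alternate
        # frames (e.g. the flashing lasers) never vanish from the observation.
        if len(frames) > 1:
            observation = np.maximum(frames[-1], frames[-2])
        else:
            observation = frames[-1]
        return observation, total_reward, done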


I noticed that the video does not work in Safari.

Interesting. I am not sure how this DeepMind game was played. Note that a typical game with epsilon = 0.1 achieves results of around 550 points. In the Nature paper they were running the game with epsilon = 0.05.

If you spot any other differences I'd be happy to learn about them.
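For readers unfamiliar with the epsilon values being compared: during evaluation the agent follows an epsilon-greedy policy, acting randomly with probability epsilon and greedily otherwise. A minimal sketch (mine, not from the article):

    import random

    def epsilon_greedy(q_values, epsilon=0.05):
        # q_values: one predicted Q-value per action
        if random.random() < epsilon:
            return random.randrange(len(q_values))                    # explore
        return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit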


Isn't the score based only on the first level?


I learned of Q-learning in the Berkeley AI course (the Pacman course), so I sort of get that, but the course didn't touch neural networks.

What's the difference between Q-learning with and Q-learning without neural networks? Or rather, in the process of doing Q-learning, where does the neural network slot in, and what does it replace if there is no NN?


Note that a neural network is just a very complex function.

You usually think of Q as a function (S, A) -> (Expected accumulated future reward)

which is equivalent to S -> A -> (Expected accumulated future reward)

The neural network is S -> (A -> (Expected accumulated future reward)), or, if you wish, the output layer of the neural network consists of |A| neurons, each indicating the expected accumulated future reward for the corresponding action given the current experience.
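A minimal sketch of that idea (my own toy network, assuming a flattened state vector rather than the article's convolutional architecture): one forward pass maps a state to |A| numbers, one per action.

    import numpy as np

    STATE_DIM = 84 * 84   # e.g. a flattened preprocessed frame (assumption)
    NUM_ACTIONS = 6       # |A|: number of legal actions (assumption)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(scale=0.01, size=(STATE_DIM, 256)), np.zeros(256)
    W2, b2 = rng.normal(scale=0.01, size=(256, NUM_ACTIONS)), np.zeros(NUM_ACTIONS)

    def q_values(state):
        # state -> vector of expected accumulated future rewards, one per action
        hidden = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
        return hidden @ W2 + b2                     # |A| output neurons

    state = rng.random(STATE_DIM)
    print(q_values(state))                   # one Q-value per action
    print(int(np.argmax(q_values(state))))   # greedy action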


Thanks!

So what we are saying is that a neural network can be used as the implementation of the Q-function? I.e., a Q-function is by definition only a mapping of (S, A) pairs to an expected future reward. We can do this using a traditional approach like value iteration, or we can use a neural network trained with back-propagation? And it's just a matter of implementation?


Yes, we try to approximate the Q function with a neural network, which is basically an enhanced version of gradient-descent Sarsa.

The main trick to notice is that you can't provide consecutive frames as mini-batches as these would be highly correlated and would derail stochastic gradient descent.

So we keep many frames (and all other necessary information) in memory and draw these experiences uniformly to form a minibatch that becomes the input to the neural network.
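A rough sketch of that replay-memory idea (names and sizes are my own, not the article's): store transitions as they happen, then sample them uniformly at random so a minibatch is not made of consecutive, correlated frames.

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)   # oldest experiences fall off

        def add(self, state, action, reward, next_state, terminal):
            self.buffer.append((state, action, reward, next_state, terminal))

        def sample(self, batch_size=32):
            # Uniform sampling breaks the temporal correlation between frames.
            idx = random.sample(range(len(self.buffer)), batch_size)
            return [self.buffer[i] for i in idx]

After every environment step you would add the transition, and every few steps draw a minibatch from it to update the network.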


Stronger than that - you can think of neural networks as universal function approximators. So this is just a particular function to approximate.

See the suggestively named "Universal approximation theorem" for details.


This has been one of my favourite topics recently. ML and AI are amazing, and a very nice skill to have. Good article, too.


As an amateur, I've always wondered if reinforcement learning could work with games where there are some probabilities in place (e.g. poker). What happens when the action taken is a good one but the outcome is negative due to bad luck?


Absolutely. Q-learning has this capability: the Q-value is an estimate of the expected future reward, so an occasional unlucky outcome after a good action gets averaged out over many experiences. A shallow neural network was already used back in 1992 to play backgammon, which has a lot of stochasticity. See https://en.wikipedia.org/wiki/TD-Gammon
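A toy illustration of that averaging (my own example, not from the link): the Q-learning update converges toward the expected return, so one unlucky outcome after a good action only nudges the estimate.

    import random

    q = 0.0          # Q-estimate for a single (state, action) pair
    alpha = 0.05     # learning rate
    for _ in range(10000):
        # The action is good on average: +1 usually, -1 with 30% "bad luck".
        reward = -1.0 if random.random() < 0.3 else 1.0
        q += alpha * (reward - q)   # tabular update (future term omitted for brevity)
    print(q)   # hovers around 0.4, the expected reward of this action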


I would like to learn more about the techniques used here. Can anyone recommend some books, or online materials (though I generally find those worse)? I have a moderately strong math background (undergraduate degree with a double major in CS/Math).


Someone posted this comment in another thread [1]. (Also read the two parent comments of that comment.)

Essentially, read and do the exercises in ISLR (An Introduction to Statistical Learning, with Applications in R). It will both give you a strong base and increase your job prospects (according to the comment).

[1] https://news.ycombinator.com/item?id=11286980

P.S.: my personal opinion - for every R exercise in that book, try to do a similar exercise in Python. If you don't know Python, learn it (you'll thank me later).


"I omit certain details for the sake of simplicity and I encourage you to read the original paper."

This link to the original paper appears to point to Theano unit-test documentation. Does anyone know which "original paper" it refers to?


Let me fix that. There are actually two papers: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

and a more recent, more detailed one: https://storage.googleapis.com/deepmind-data/assets/papers/D...


Now, train the network jointly over the whole game sequence. Or, even better, when given a chance to take an action, roll out each action and learn jointly on the rest of that gameplay.

Reinforcement learning is very hard, especially when you create meaningful games and then don't use the fact that a whole game is one long chain of events, and instead force learning on windowed sequences.

The neural network has enough parameters to remember many of these windows and will clearly perform well, but the training lasts too long given that no structured information is used.



