Nice, another large contribution to the field from Google, just a day after OpenAI's paper on better language models and their implications. This is in addition to other nice recent public contributions from Uber, Facebook, Microsoft, etc.
I think I understand these huge tech companies' "generosity": these public contributions to the field probably help recruiting efforts, like salary and fringe benefits do. The field is moving and growing so fast that it is difficult to hire talent right now (I manage a machine learning team at a very large company, and at least that is my experience).
This paper is claiming a 5000x increase in performance over previous state-of-the-art techniques. Huge.
It's not as "huge" as they make it look.
The goal of the technique is to increase data efficiency (the number of real-world tries the agent needs to learn). So instead of using only real trajectories, it simulates trajectories (that's planning) and learns from those.
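To make that concrete, here is a minimal sketch of the loop (toy stand-ins of my own, not the paper's actual code): a learned model is unrolled in place of the real environment, so the agent collects "dreamed" transitions without spending real interaction steps.

```python
import numpy as np

def imagine_rollout(model, policy, start_state, horizon=15):
    """Unroll a learned dynamics model instead of the real environment."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = model(state, action)  # a prediction, not a real env step
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory

# Toy stand-ins so the sketch runs: the "model" is a noisy linear map and the
# policy is random. In the paper these are a learned latent dynamics model and
# a planner, respectively.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4)) * 0.1, rng.normal(size=(4, 2)) * 0.1
model = lambda s, a: (A @ s + B @ a, float(s.sum()))
policy = lambda s: rng.normal(size=2)

dreamed = imagine_rollout(model, policy, start_state=np.zeros(4))
print(len(dreamed), "imagined transitions collected without touching the env")
```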
This line of ideas is not new. The main problems associated with it are that it is almost always more computationally expensive (you learn from both real and dreamed trajectories) and that it is harder to learn, as it is susceptible to a kind of exposure bias: once you have built a model like "the earth is flat", you will simulate/dream trajectories according to it, diluting the weak evidence from real data telling you that the earth is round, and so you get stuck with a wrong model.
The performance gain you refer to is relative to a naive way of doing things, i.e. working directly in pixel space.
Don't get me wrong, I'm a big fan of the model-based approach, and every small step in this direction is good, as it helps with explainability. This paper is one of these nice small steps, but it doesn't compare to the gains from previous techniques like experience replay or hindsight experience replay.
Author here. First of all, I'd like to clarify that the data efficiency gain over D4PG is 5000% or 50x.
Regarding computational efficiency, we match D4PG, a top model-free agent that uses experience replay among other techniques (actor critic, distributional loss, n-step returns, prioritized replay, distributed experience collection).
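(For anyone not familiar with one of those ingredients: an n-step return sums n discounted rewards and then bootstraps from the critic. A generic sketch, not D4PG's actual implementation:)

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of n rewards plus a discounted value bootstrap.

    rewards: the n rewards r_t, ..., r_{t+n-1}
    bootstrap_value: the critic's estimate V(s_{t+n})
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 3-step return: 1.0 + 0.99*0.0 + 0.99**2 * 1.0 + 0.99**3 * 5.0
print(n_step_return([1.0, 0.0, 1.0], bootstrap_value=5.0))  # ≈ 6.8316
```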
Your point about exposure bias is interesting, and applies equally to agents that do not learn a model. Personally, I think we need reliable uncertainty estimates in neural networks to make progress on this research question, so the agent can know what it doesn't know.
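(A common proxy for "knowing what it doesn't know" is the disagreement of an ensemble of models; far from the training data the members diverge. A generic illustration of the idea, not something from the paper:)

```python
import numpy as np

def ensemble_uncertainty(models, x):
    """Mean and spread of an ensemble's predictions at input x."""
    preds = np.array([m(x) for m in models])
    return preds.mean(), preds.std()

# Toy ensemble: members share a parameter drawn near 1, so they agree for
# small x and diverge for large x, where the std flags high uncertainty.
rng = np.random.default_rng(0)
models = [lambda x, w=rng.normal(1.0, 0.1): np.sin(w * x) for _ in range(5)]
for x in [0.1, 3.0]:
    mean, std = ensemble_uncertainty(models, x)
    print(f"x={x}: prediction {mean:.2f} +/- {std:.2f}")
```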
Hindsight experience replay doesn't apply to tasks where the inputs are images because it requires knowledge of a meaningful goal space with a distance function (e.g. 2D coordinates of goal positions).
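(The core of HER is relabeling failed transitions with goals that were actually achieved later in the episode, which is exactly what requires a goal space and a distance/reward on goals. A rough sketch with hypothetical interfaces, not a full implementation:)

```python
import numpy as np

def her_relabel(episode, achieved_goal_fn, reward_fn):
    """Relabel each transition with the goal the episode actually achieved."""
    final_goal = achieved_goal_fn(episode[-1]["next_state"])
    return [
        {**t,
         "goal": final_goal,  # pretend this was the intended goal all along
         "reward": reward_fn(achieved_goal_fn(t["next_state"]), final_goal)}
        for t in episode
    ]

# For 2D goal positions these two functions are trivial...
achieved_goal_fn = lambda state: state[:2]  # e.g. the (x, y) of the object
reward_fn = lambda g, g_star: float(np.linalg.norm(g - g_star) < 0.05)

episode = [
    {"state": np.zeros(4), "action": 0, "next_state": np.array([0.1, 0.0, 0.0, 0.0])},
    {"state": np.array([0.1, 0.0, 0.0, 0.0]), "action": 1, "next_state": np.array([0.3, 0.2, 0.0, 0.0])},
]
print(her_relabel(episode, achieved_goal_fn, reward_fn)[-1]["reward"])  # 1.0 by construction
# ...but with raw image inputs there is no obvious achieved_goal_fn or distance
# between goals, which is why HER doesn't directly apply there.
```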