Reinforcement learning is supervised learning on optimized data (bair.berkeley.edu)
241 points by jonbaer on Oct 14, 2020 | 18 comments



Although the "data problem" is already well known to all ML and RL engineers/researchers (to varying degrees, depending on what one works on), this is an innovative and concise post that really boils it down. It's particularly important for modeling, and when you have to generate all the data on your own.


This seems like a HUGE insight! As I understand it, they show that RL can effectively be recast as two sub-problems:

1. learning a policy that imitates your own behavior on prior experience, which is a trivial supervised learning problem

2. learning how to weight the importance of prior experiences (learning a data distribution), for which the authors have derived a lower bound

Given a pool of experience, this seems like a fantastic off-policy method to optimize arbitrary reward functions. The main shortcoming I see with this method is that it still does not lead to any significant insight into how to collect new data online, which is a major open problem in RL.
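
To make that two-step view concrete, here is a minimal, self-contained sketch on a toy bandit problem: exponentiated-return weights (subproblem 2) followed by weighted behavior cloning (subproblem 1). It illustrates the general reward-weighted-regression idea, not the authors' exact algorithm, and every name and number in it is made up for the example.

    # Sketch only: reward-weighted regression on a toy bandit, not the authors' method.
    import numpy as np

    rng = np.random.default_rng(0)
    n_actions = 5
    true_reward = rng.normal(size=n_actions)   # unknown to the agent
    logits = np.zeros(n_actions)               # softmax policy parameters

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for _ in range(200):
        # Collect a batch of experience with the current policy.
        probs = softmax(logits)
        actions = rng.choice(n_actions, size=64, p=probs)
        rewards = true_reward[actions] + rng.normal(scale=0.1, size=64)

        # Subproblem 2: turn returns into a data distribution (exponentiated advantages).
        w = np.exp((rewards - rewards.mean()) / 0.5)
        w /= w.sum()

        # Subproblem 1: weighted behavior cloning, i.e. one weighted
        # maximum-likelihood (cross-entropy) gradient step on that reweighted data.
        grad = np.zeros(n_actions)
        for a, wi in zip(actions, w):
            grad += wi * (np.eye(n_actions)[a] - probs)
        logits += 1.0 * grad

    print("learned policy:", np.round(softmax(logits), 2))
    print("best action   :", int(true_reward.argmax()))

Each iteration here is just supervised learning on a reweighted snapshot of the agent's own data; how to gather better data in the first place is the part left open.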


> I'm also wondering why the authors didn't publish any experiments to show that it works...

This is a blog post. It cites three of the authors' papers that each contain empirical results. The abstract of the first ends:

"We formally show that this iterated supervised learning procedure optimizes a bound on the RL objective, derive performance bounds of the learned policy, and empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks."


> empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms

It's interesting that their choice of current algorithms includes PPO but not e.g. DeepMind's Rainbow agent, which achieved state-of-the-art performance on many measures: https://arxiv.org/abs/1710.02298


They mention Rainbow in the related work section of the third paper listed there, Kumar, A., Peng, X. B., & Levine, S. (2019). Reward-Conditioned Policies, arXiv:1912.13465, as part of this remark: "they are also known to be notoriously challenging to use effectively, due to sensitivity to hyper parameters, high sample complexity, and a range of important and delicate implementation choices that have a large effect on performance [5, 6, 12, 15, 23, 24, 46]."


Actually, supervised learning is "learning a missing data dimension" by parameter tuning via associative learning rules.

Gradient descent is a special case of associative learning rules that assumes all data points have the same importance.

One type of associative learning rule is the Hebbian learning rule.

At the most fundamental level we only need associative learning. Of course, for practical applications we need diverse tools with different conceptual frameworks to choose from, depending on commercial considerations such as human resources or cost-performance.
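
For reference, here is a tiny sketch of the Hebbian rule being referred to (dw = eta * x * y): a purely associative, correlation-driven update with no loss gradient involved. The "dominant direction" setup and all the numbers are invented just for illustration.

    # Plain Hebbian update: strengthen weights between co-active units.
    import numpy as np

    rng = np.random.default_rng(0)
    direction = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)  # dominant input correlation
    w = rng.normal(scale=0.1, size=4)                         # small random weights
    eta = 0.01

    for _ in range(500):
        x = rng.normal() * direction + 0.1 * rng.normal(size=4)  # presynaptic input
        y = w @ x                                                 # postsynaptic response
        w += eta * y * x                                          # dw = eta * x * y

    # w aligns with the dominant correlation direction of the inputs; in practice
    # the unbounded growth is tamed with a normalizing variant such as Oja's rule.
    print(np.round(w / np.linalg.norm(w), 2))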


> Gradient descent is a special case of associative learning rules that assumes all data points have the same importance.

No. Gradient descent is an optimization method that has nothing to do with learning. It is used to optimize parameterized functions that are said to be "learning", but it's not the only approach. It is also trivially easy, and not uncommon, to use different weights for different data points.
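
As a small illustration of that last point, here is a toy weighted least-squares fit by plain gradient descent, where each data point carries its own importance weight (the data, weights, and constants are made up for the example):

    # Gradient descent on a per-example-weighted MSE loss.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
    sample_w = rng.uniform(0.1, 1.0, size=100)   # a different importance per data point

    theta = np.zeros(3)
    lr = 0.1
    for _ in range(500):
        residual = X @ theta - y
        grad = X.T @ (sample_w * residual) / sample_w.sum()   # weighted-MSE gradient
        theta -= lr * grad

    print(np.round(theta, 2))   # recovers roughly [1.0, -2.0, 0.5]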

Can you clarify what you mean by associative learning?


I'd like to see a resurgence in pattern mining and association rule mining. That stuff is so awesome and useful!


I think the stochastic programming community has had this approach for a very long time. Do take a look at the work of Princeton's Warren Powell.


I had a hard time understanding how Actor-Critic methods relate to this insight. Anyone care to explain?


It seems to me that they are basically describing a variational formulation of the "optimization perspective" of reinforcement learning, which is cool, but I am confused... where is the supervised learning? Like what is the input and what is the output?


The way I understand it, the two subproblems are supervised in the sense that they are trained using data sampled from a fixed distribution, instead of data sampled from a distribution that changes as you update your model, as is usually the case in RL. This makes the training more stable.


Thanks for clarifying that point.


It seems more as if the authors are abusing terms from Machine Learning like "Supervised Learning".


abusing how?


Nice definition of "school", which is the most efficient method of teaching/learning we have.


Efficient for those doing the teaching, but not for the one being taught.


throughput != latency



