Reinforcement learning is supervised learning on optimized data (bair.berkeley.edu)
241 points by jonbaer on Oct 14, 2020 | 18 comments



Although the "data problem" is already well known to all ML and RL engineers/researchers (to varying degrees, depending on what one works on), this is an innovative and concise post that really boils it down. It's particularly important for modeling, and when you have to generate all the data on your own.


This seems like a HUGE insight! As I understand it, they show that RL can effectively be recast as two sub-problems:

1. learning a policy that imitates your own behavior on prior experience, which is a trivial supervised learning problem

2. learning how to weight the importance of prior experiences (learning a data distribution), for which the authors have derived a lower bound

Given a pool of experience, this seems like a fantastic off-policy method to optimize arbitrary reward functions. The main shortcoming I see with this method is that it still does not lead to any significant insight into how to collect new data online, which is a major open problem in RL.
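
To make that two-step view concrete, here is a minimal, self-contained sketch on a toy bandit problem: exponentiated-return weights (subproblem 2) followed by weighted behavior cloning (subproblem 1). It illustrates the general reward-weighted-regression idea, not the authors' exact algorithm, and every name and number in it is made up for the example.

    # Sketch only: reward-weighted regression on a toy bandit, not the authors' method.
    import numpy as np

    rng = np.random.default_rng(0)
    n_actions = 5
    true_reward = rng.normal(size=n_actions)   # unknown to the agent
    logits = np.zeros(n_actions)               # softmax policy parameters

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for _ in range(200):
        # Collect a batch of experience with the current policy.
        probs = softmax(logits)
        actions = rng.choice(n_actions, size=64, p=probs)
        rewards = true_reward[actions] + rng.normal(scale=0.1, size=64)

        # Subproblem 2: turn returns into a data distribution (exponentiated advantages).
        w = np.exp((rewards - rewards.mean()) / 0.5)
        w /= w.sum()

        # Subproblem 1: weighted behavior cloning, i.e. one weighted
        # maximum-likelihood (cross-entropy) gradient step on that reweighted data.
        grad = np.zeros(n_actions)
        for a, wi in zip(actions, w):
            grad += wi * (np.eye(n_actions)[a] - probs)
        logits += 1.0 * grad

    print("learned policy:", np.round(softmax(logits), 2))
    print("best action   :", int(true_reward.argmax()))

Each iteration here is just supervised learning on a reweighted snapshot of the agent's own data; how to gather better data in the first place is the part left open.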


> I'm also wondering why the authors didn't publish any experiments to show that it works...

This is a blog post. It cites three of the authors' papers that each contain empirical results. The abstract of the first ends:

"We formally show that this iterated supervised learning procedure optimizes a bound on the RL objective, derive performance bounds of the learned policy, and empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks."


> empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms

It's interesting that their choice of current algorithms includes PPO but not e.g. DeepMind's Rainbow agent, which achieved state-of-the-art performance on many measures: https://arxiv.org/abs/1710.02298


They mention Rainbow in the related work section of the third paper listed there, Kumar, A., Peng, X. B., & Levine, S. (2019). Reward-Conditioned Policies, arXiv:1912.13465, as part of this remark: "they are also known to be notoriously challenging to use effectively, due to sensitivity to hyper parameters, high sample complexity, and a range of important and delicate implementation choices that have a large effect on performance [5, 6, 12, 15, 23, 24, 46]."


Actually, supervised learning is "learning a missing data dimension" by parameter tuning via associative learning rules.

Gradient descent is a special case of associative learning rules that assumes all data points have the same importance.

One type of associative learning rule is the Hebbian learning rule.

At the most fundamental level we only need associative learning. Of course, for practical applications we need diverse tools with different conceptual frameworks to choose from, depending on commercial considerations such as human resources or cost-performance.
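
For reference, here is a tiny sketch of the Hebbian rule being referred to (dw = eta * x * y): a purely associative, correlation-driven update with no loss gradient involved. The "dominant direction" setup and all the numbers are invented just for illustration.

    # Plain Hebbian update: strengthen weights between co-active units.
    import numpy as np

    rng = np.random.default_rng(0)
    direction = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)  # dominant input correlation
    w = rng.normal(scale=0.1, size=4)                         # small random weights
    eta = 0.01

    for _ in range(500):
        x = rng.normal() * direction + 0.1 * rng.normal(size=4)  # presynaptic input
        y = w @ x                                                 # postsynaptic response
        w += eta * y * x                                          # dw = eta * x * y

    # w aligns with the dominant correlation direction of the inputs; in practice
    # the unbounded growth is tamed with a normalizing variant such as Oja's rule.
    print(np.round(w / np.linalg.norm(w), 2))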


> Gradient descent is a special case of associative learning rules that assumes all data points have the same importance.

No. Gradient descent is an optimization method that has nothing to do with learning. It is used to optimize parameterized functions that are said to be "learning", but it's not the only approach. It is also trivially easy, and not uncommon, to use different weights for different data points.
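
As a small illustration of that last point, here is a toy weighted least-squares fit by plain gradient descent, where each data point carries its own importance weight (the data, weights, and constants are made up for the example):

    # Gradient descent on a per-example-weighted MSE loss.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
    sample_w = rng.uniform(0.1, 1.0, size=100)   # a different importance per data point

    theta = np.zeros(3)
    lr = 0.1
    for _ in range(500):
        residual = X @ theta - y
        grad = X.T @ (sample_w * residual) / sample_w.sum()   # weighted-MSE gradient
        theta -= lr * grad

    print(np.round(theta, 2))   # recovers roughly [1.0, -2.0, 0.5]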

Can you clarify what you mean by associative learning?


I'd like to see a resurgence in pattern mining and association rule mining. That stuff is so awesome and useful!


I think the stochastic programming community has had this approach for a very long time. Do take a look at the work of Princeton's Warren Powell.


I had a hard time understanding how Actor-Critic methods relate to this insight. Anyone care to explain?


It seems to me that they are basically describing a variational formulation of the "optimization perspective" of reinforcement learning, which is cool, but I am confused... where is the supervised learning? Like what is the input and what is the output?


The way I understand it, the two subproblems are supervised in the sense that they are trained using data sampled from a fixed distribution, instead of data sampled from a distribution that changes as you update your model, as is usually the case in RL. This makes the training more stable.


Thanks for clarifying that point.


It seems more as if the authors are abusing terms from Machine Learning like "Supervised Learning".


abusing how?


Nice definition of "school", which is the most efficient method of teaching/learning we have.


Efficient for those doing the teaching, but not for the one being taught.


throughput != latency



