Not sure if I understand "automatically generating labels to a SL network". I don't believe RL is used in this setting.
RL is about learning expected-reward-maximizing policies for environments that you get to interact with. Common benchmarks currently mostly include games (e.g. ATARI, AlphaGo, VizDoom), physics-based animation (e.g. https://www.cs.ubc.ca/~van/papers/2016-TOG-deepRL/index.html), or (simulated) robotics-like tasks (e.g. MuJoCo). But the core algorithms (such as policy gradients) can be used more generally, in settings that don't look like the usual environments at all but where you want to train a network with stochastic nodes, such as hard attention, etc.
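To make the "stochastic nodes" bit concrete, here is a rough numpy sketch (toy problem, all numbers made up): a node that samples one of three discrete choices, trained with a score-function / REINFORCE-style estimator because you cannot backprop through the sample:

```python
# Rough sketch (numpy only, toy problem): a "stochastic node" with 3 discrete
# choices, trained with the score-function (REINFORCE-style) estimator since
# the sampling step itself is not differentiable. All numbers are made up.
import numpy as np

np.random.seed(0)
logits = np.zeros(3)                     # parameters of the stochastic node
true_reward = np.array([0.1, 1.0, 0.3])  # hypothetical reward for each choice
lr = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    p = softmax(logits)
    a = np.random.choice(3, p=p)         # sample a choice (not differentiable)
    r = true_reward[a]                   # observe a reward for that sample
    grad_logp = -p                       # d log p(a) / d logits = one_hot(a) - p
    grad_logp[a] += 1.0
    logits += lr * r * grad_logp         # reward-weighted score-function update

print(softmax(logits))                   # mass should concentrate on choice 1
```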
RL is a funny area; a lot of AI researchers get excited about it (mostly motivated by its promise as the formalism that leads to AGI), and yet despite the hype it has so far had very little impact in industry (the Google data center application possibly being an exception, though it was more "RL" than RL, with quotes). It has some promise for robotics in the real world, but not applied directly and naively. The way that will play out is likely through behavior cloning on human demonstrations or on outputs of trajectory optimizers from simulation, or possibly RL fine-tuning in simulation transferred to the real world. But it's still quite early to tell.
On this topic, fun story: the most impressive robots I'm aware of right now are from Boston Dynamics, and as they mentioned at this year's NIPS, they use ZERO machine learning. Forget deep learning or even deep reinforcement learning. Zero Machine Learning.
I gave a talk last week about some of our RL experiments @ OpenAI and someone came to me after the talk, described their (straightforward) supervised learning problem and asked me how they can apply RL to it. This, to me, is an alarming sign of hype that's damaging to the community. You don't use RL for your SL problems. You can if you really want to (e.g. reward = 1.0 if you guess the correct label or -1.0 otherwise), but you really don't want to. You're lucky to have labels; use them, business as usual.
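Just to spell out that throwaway example (a toy sketch with made-up data, not something anyone should do in practice): the same tiny linear classifier trained with its labels via cross-entropy versus trained "as RL" with a +1/-1 reward on a sampled guess. Both see the same information; the RL version just receives it through a much noisier gradient:

```python
# Toy comparison (made-up data): a linear 3-way classifier trained (a) the
# normal supervised way with cross-entropy on its labels, versus (b) "as RL",
# sampling a guess and getting reward +1/-1. Same information, noisier gradient.
import numpy as np

np.random.seed(0)
X = np.random.randn(500, 5)
y = (X @ np.random.randn(5, 3)).argmax(axis=1)   # synthetic "true" labels

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train(use_rl, steps=2000, lr=0.1):
    W = np.zeros((5, 3))
    for _ in range(steps):
        i = np.random.randint(len(X))
        p = softmax(X[i:i + 1] @ W)[0]
        if use_rl:
            a = np.random.choice(3, p=p)         # sample a guess
            r = 1.0 if a == y[i] else -1.0       # reward instead of a label
            g = -p
            g[a] += 1.0                          # score-function gradient
            W += lr * r * np.outer(X[i], g)
        else:
            g = -p
            g[y[i]] += 1.0                       # exact cross-entropy gradient
            W += lr * np.outer(X[i], g)
    return (softmax(X @ W).argmax(axis=1) == y).mean()

print("supervised accuracy:", train(use_rl=False))
print("RL-as-SL accuracy:  ", train(use_rl=True))   # typically noisier / worse
```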
Thank you for the explanation. It somehow seems to me that most RL problems can be converted to some form of SL. For example, couldn't the Pong RL solution using PG (thank you so much for that article btw) be converted to SL by recording a human player's actions for a while and labeling them based on the rewards achieved?
Learning a human's actions in most cases is probably a supervised learning problem. But in that case, you actually don't even want to look at the rewards. You just want to know what a human did given a specific scenario.
However, any time you have a reward signal (like the score of the game) in a multi-step decision problem, like a game where you take actions sequentially (e.g. once per turn), you need RL machinery to make sense of the data. Maybe you take an action now, and you might only reap the reward of that action in the future. So how do you "label" the action right now? You label it with some measure that takes the future of the reward signal into account (e.g. the discounted sum of the rewards that follow).
So some human plays a game and gets a super high score. You only see that they got a high score at the end of the game. How do you go back and label the 150 actions that led you to the score? That's the part that is RL.
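Concretely, the usual trick (rough sketch, numbers invented) is to label each action with the discounted sum of the rewards that came after it:

```python
# Rough sketch (numbers invented): 150 actions, the only reward is a +1 at the
# very end. Each action gets "labeled" with the discounted sum of what followed.
import numpy as np

rewards = np.zeros(150)
rewards[-1] = 1.0            # the high score only shows up at the end
gamma = 0.99                 # discount factor

returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running     # credit decays the further the action is from the reward

print(returns[:3], returns[-3:])   # tiny credit early on, full credit at the end
```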
Honestly I'm still struggling to grasp what RL is. Can it roughly be conceptualised as: a neural network wrapped in a genetic algorithm that re-writes the structure/parameters of the network at the end of each 'pass' (according to some target you're optimising for, to determine model 'fitness')?
If this is roughly what it is, it sounds like a logical place to apply some kind of automated meta-programming. I recently stumbled across a Python ML framework that I think used Jinja2 templates for this purpose...
Focusing on the "neural network" part might be confusing you.
Classification/supervised learning is essentially about learning labels. We have some examples that have already been labeled: cats vs. dogs, suspicious transactions vs. legitimate ones, As vs. Bs vs. ... Zs. From those examples, we want to learn some way to assign new, unlabeled instances to one of those classes.
Reinforcement learning, in contrast, is fundamentally about learning how to behave. Agents learn by interacting with their environment: some combinations of states and actions eventually lead to a reward (which the agent "likes") and others do not. The reward might even be disconnected from the most recent state or action and instead depend on decisions made earlier. The goal is to learn a "policy" that describes what should be done in each state and balances learning more about the environment ("exploration", which may pay off by letting us collect more rewards later) and using what we know about the environment to maximize our current reward intake ("exploitation").

Games are a particularly good test-bed for reinforcement learning because they have fairly clear states (I have these cards in my hand, or that many lives, etc.), actions ("hit me!", "jump up") and rewards (winnings, scores, levels completed). There's also an obvious parallel with animal behavior, which is where the name originated.
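If it helps, here is a minimal sketch of that exploration/exploitation balance: epsilon-greedy action selection on a made-up three-armed bandit (all numbers invented):

```python
# Minimal sketch of exploration vs. exploitation: epsilon-greedy on a made-up
# three-armed bandit. The agent never sees true_mean_reward directly.
import numpy as np

np.random.seed(0)
true_mean_reward = np.array([0.2, 0.5, 0.8])   # unknown to the agent
estimates = np.zeros(3)                        # the agent's running estimates
counts = np.zeros(3)
epsilon = 0.1                                  # 10% of the time: explore

for step in range(5000):
    if np.random.rand() < epsilon:
        a = np.random.randint(3)               # explore: try something at random
    else:
        a = int(estimates.argmax())            # exploit: use what we know so far
    r = float(np.random.rand() < true_mean_reward[a])   # stochastic 0/1 reward
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]      # incremental average

print(estimates)   # should end up near [0.2, 0.5, 0.8], and mostly pick arm 2
```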
In both cases, neural networks are useful because they are universal function approximators. There's presumably some very complex function that maps data onto labels (e.g., pixels onto {"DOG", "CAT"}) for supervised learning, and states onto action sequences for reinforcement learning. We usually don't know what that function is and can't fit it directly, so we let neural networks learn it instead. That said, you can do both supervised learning and reinforcement learning without them (in fact, until recently, nearly everyone did).
The network also typically doesn't get "rewritten" on the fly. Instead, it does something like estimate the value of a state or state-action pair.
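For a concrete, if toy, picture of what "estimating the value of a state-action pair" looks like, here is a tabular Q-learning sketch on a made-up 5-cell corridor. Note that only the table of estimates changes, not the structure of the learner:

```python
# Toy sketch of "estimate the value of a state-action pair": tabular Q-learning
# on a made-up 5-cell corridor (actions: 0 = left, 1 = right, reward of 1 for
# reaching the rightmost cell). The table of value estimates is what changes;
# nothing gets structurally "rewritten".
import numpy as np

np.random.seed(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))       # value estimates for (state, action)
alpha, gamma, epsilon = 0.5, 0.9, 0.3

s = 0
for step in range(20000):
    # epsilon-greedy: mostly exploit the current estimates, sometimes explore
    a = np.random.randint(2) if np.random.rand() < epsilon else int(Q[s].argmax())
    s_next = max(0, s - 1) if a == 0 else s + 1
    r = 1.0 if s_next == n_states - 1 else 0.0
    # one-step update toward (reward + discounted best value of the next state)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == n_states - 1 else s_next   # reset once the goal is reached

print(Q)   # "right" should end up looking better than "left" in every state
```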
Thanks for taking the time to explain this to me, it's definitely helped my understanding. It sounds somewhat 'static', assuming I haven't misinterpreted. Where does the 'learning' part come in? Again, this is almost certainly due to my lack of knowledge, but it sounds like RL essentially brute-forces the optimal inputs for each statically defined 'action function'. That would mean the usefulness of the model depends entirely on how well you've initially specified it, and that the problem is really solved through straightforward analysis.
(I've obviously gone wrong somewhere here... Just walking you through my thought process)
The agent "learns" by stumbling around and interacting with its environment. At the beginning, its behavior is pretty random, but as it learns more and more, it refines its "policy" to collect more rewards more quickly.
Brute force is certainly possible for some situations. For example, suppose you're playing Blackjack. You can calculate the expected return from 'hitting' (taking another card) and 'standing' (keeping what you've got), based on the cards in your hand and the card the dealer shows.
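As a sketch of that "compute both expectations and pick the larger one" idea (heavily simplified on purpose: infinite deck, no aces, and "hit" means taking exactly one card, so don't read any real Blackjack strategy out of it):

```python
# Crude Monte Carlo sketch of "compute the expected return of hitting vs.
# standing". Deliberately simplified: infinite deck, card values 2-10 only
# (no aces), and "hit" means take exactly one card and then stand.
import random

random.seed(0)
CARDS = list(range(2, 11))

def dealer_total(up_card):
    total = up_card + random.choice(CARDS)     # dealer's hidden card
    while total < 17:                          # dealer hits below 17
        total += random.choice(CARDS)
    return total

def payoff(player_total, up_card):
    if player_total > 21:
        return -1.0                            # bust
    d = dealer_total(up_card)
    if d > 21 or player_total > d:
        return 1.0
    return 0.0 if player_total == d else -1.0

def expected_return(player_total, up_card, hit, n=100_000):
    total = 0.0
    for _ in range(n):
        p = player_total + random.choice(CARDS) if hit else player_total
        total += payoff(p, up_card)
    return total / n

# e.g. you hold 16 and the dealer shows a 10: compare the two expectations
print("stand:", expected_return(16, 10, hit=False))
print("hit:  ", expected_return(16, 10, hit=True))
```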
So... brute force works for simple tasks, but in a lot of situations, it's hard to enumerate all possible states (chess has something like 10^47 possible states) and state-action pairs. It's also difficult to "assign credit" -- you rarely lose a chess game just because of the last move. These issues make it difficult to brute-force a solution or find one via analysis. However, the biggest "win" for using RL is that it's applicable to "black box" scenarios where we don't necessarily know everything about the task. The programmer just needs to give the agent feedback (through the reward signal) when it does something good or bad.
Furthermore, depending on how you configure the RL agent, it can react to changes in the environment, even without being explicitly reset. For example, imagine a robot vacuum that gets "rewarded" for collecting dirt. It's possible that cleaning the room changes how people use it and thus, changes the distribution of dirt. With the right discounting setup, the vacuum will adjust its behavior accordingly.
Do you know about expected utility? Optimal behaviour (of any kind) can be framed as "At each step, pick the action that maximizes your expected utility." So, for instance, you might study hard tonight because it'll lead you to pass your exam tomorrow and get a high-paying job later. In that scenario, studying's utility is higher than going out for a beer.
Reinforcement learning's goal is either to estimate each action's expected utility (possibly using neural networks), or to directly learn what the best action to take is in any given situation, without bothering with utility estimation.
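In code, the study-vs-beer example above boils down to something like this (all probabilities and utilities invented):

```python
# Toy version of "pick the action with the highest expected utility", using the
# study-vs-beer example above. All probabilities and utilities are invented.
outcomes = {
    "study": [(0.8, 100.0),   # 80%: pass the exam, land the good job
              (0.2, 10.0)],   # 20%: fail anyway
    "beer":  [(1.0, 30.0)],   # fun tonight, but that's about it
}

def expected_utility(action):
    return sum(p * u for p, u in outcomes[action])

best = max(outcomes, key=expected_utility)
print({a: expected_utility(a) for a in outcomes}, "->", best)
```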
Robotic movement is maybe not exactly an exemplary domain, because biologically a good share of it is handled by the parasympathetic nervous system, i.e. reflexes. Maybe robotics needs to get that right first, with a strong focus on the mechatronics, before the software becomes relevant, just as bipedal movement requires a baby to first build up the muscles.
On the other hand, maybe the evolution of a nervous system is analogous to some form of neural learning, but on a huge timescale, and maybe that scale is proportional to the complexity of the search space.
edit: Well, reflexes are also propagated by neurons, but over a short circuit and thus readily modeled by PID control etc. Noticing the similar (?) reliance on differentials and coefficients, the advantage is that a subset of such systems has exact solutions.