Deep reinforcement learning doesn't work yet (alexirpan.com)
246 points by deepGem on Feb 15, 2018 | 50 comments



"A friend is training a simulated robot arm to reach towards a point above a table. It turns out the point was defined with respect to the table, and the table wasn’t anchored to anything. The policy learned to slam the table really hard, making the table fall over, which moved the target point too."

It seems the problem of learning unintended techniques may sometimes be better described as the model being too creative! Hitting the table is a really clever solution for the problem it was given. These examples show that the real challenge for researchers is constraining the model's enormous capacity for creativity without stifling its ability to learn.


> constraining the model's enormous capacity for creativity without stifling its ability to learn

If I recall my AI history correctly, the research of the 60s/70s focused on logic-based systems and inference, producing things like Prolog. It was thought that one simply needed to come up with an appropriate rule set to generate powerful AI. The issue, of course, is that writing enough rules to give powerful AI outside some very constrained problem domains (see https://en.wikipedia.org/wiki/SHRDLU) just isn't feasible.

The problem of current machine learning models being too clever for their own good and needing appropriate constraints feels similar. If only you had the correct constraints you could do all kinds of things. History repeating itself?


Can't wait to see how "creative" our future autonomous drone strikes will be.



Like if we program our AI overlords to minimise unavoidable deaths, and they learn that the best way to do that in the long term is to sterilise everyone and wipe out the human race!


Well, sure, 7 billion quick deaths means less death and suffering than even a measly one gruesome random accident per year for 8 billion years in the future.


Reminds me of this article, about how a single reward could potentially destroy humanity. It's just a thought experiment though.

https://www.salon.com/2014/08/17/our_weird_robot_apocalypse_...


I felt the same, and also thought that the article didn't really do enough to explain why "write better reward functions" isn't enough, especially because in gaming, if anyone set the same goals these researchers did, humans would eventually learn the same patterns.


There was a good talk at NIPS this year about the difficulty of reproducing results and benchmarking in RL. In supervised learning you can easily compare results on standard datasets. In RL, though, once the actions of two policies diverge, you're at the whim of the PRNG. Even with the same random seeds, gradient descent hyperparameters, etc., it is not possible to meaningfully compare two policies with a single training run. Even if you hold everything else constant, the random seed has a huge effect. Ideally papers would show aggregated data over 100 runs for each setup, but RL is too computationally expensive for that.

It is frustrating, but also exciting because the field has so many open problems.

On the other hand, it really sucks that those with 1000-machine clusters have such a huge advantage over smaller labs.
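
To make the seed-sensitivity point concrete, here is a minimal sketch of reporting an aggregate over several seeds rather than a single run. The toy problem (an epsilon-greedy agent on a two-armed bandit), the payoff probabilities, and the function names are all assumptions made up for illustration; nothing here comes from the NIPS talk.

```python
import random
import statistics

def train_bandit(seed, steps=200, epsilon=0.1):
    """Run epsilon-greedy on a toy 2-armed Bernoulli bandit; return mean reward."""
    rng = random.Random(seed)
    probs = [0.4, 0.6]      # hypothetical payoff probability of each arm
    counts = [0, 0]         # number of pulls per arm
    values = [0.0, 0.0]     # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(2)                      # explore
        else:
            arm = 0 if values[0] >= values[1] else 1    # exploit
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total / steps

def summarize(seeds):
    """Aggregate the final score over many seeds instead of trusting one run."""
    scores = [train_bandit(s) for s in seeds]
    return (statistics.mean(scores), statistics.stdev(scores),
            min(scores), max(scores))

mean, std, lo, hi = summarize(range(20))
print(f"mean={mean:.3f} std={std:.3f} min={lo:.3f} max={hi:.3f}")
```

The only thing varied between runs is the seed, so the spread between `min` and `max` is the noise that a single-run comparison would hide.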


>Whenever someone asks me if reinforcement learning can solve their problem, I tell them it can’t. I think this is right at least 70% of the time.

This seems to be a really strange calibration for "doesn't work". If you replace "reinforcement learning" with other well known technologies, and ask "of the instances where someone asks if X is a good solution, what percentage of the time is it actually a good solution?" I feel like 30% would be on the high end of the scale.

The rest of the article has a lot of interesting discussion about RL's limitations, but it seems weird to make the article's thesis that RL doesn't work, rather than just "RL still has a lot of limitations".


I think the point he makes is that RL doesn't work as the world believes it to. RL is seen as the holy grail for AGI. He is saying that isn't the case, at least as of yet since RL algorithms don't generalize. For all other tasks, where RL can be used, other existing alternatives fare much better.


That's a little unfair to RL. Reinforcement Learning is great even outside of autonomous control. [It's become quite important in NLP, for example.] It should be seen as a valuable set of approaches in a toolbox, not a silver bullet.


> [It's become quite important in NLP, for example.]

[citation needed]

Maybe it's the specific NLP tasks I've been paying attention to (goal oriented dialog), but most of the RL for NLP work I've seen has not been super impressive.


Williams, 1992.

http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92s...

This is even within dialogue generation. I’ve found plenty more recent citations, but it’s been a very important part of that subfield for decades.

I can’t speak for goal-oriented dialogue generation, however. Perhaps your area of work has benefited less or given it less attention.


There's plenty of work applying RL techniques, it's just not actually useful for building real systems.

It's basically in the same state as the rest of this blog post where it's so horribly sample inefficient you're usually better off with supervised learning if you're paying annotators. Or you're doing REINFORCE for some proxy metric or simulator which you're probably over-fitting and not actually improving your system with.

The Alexa prize is basically the only situation where you're getting anywhere near enough RL rewards to be meaningful, because everywhere else the feedback is rare enough to not be helpful due to the sample inefficiency.

Which is why I disagree with the characterization that it's important. There's a lot of it, and it can squeeze a little bit of performance on whatever dataset you're looking at, but it's had nowhere near the impact of, say, word vectors or (Bi)LSTMs.
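
For readers who haven't seen REINFORCE (the Williams 1992 estimator mentioned above), here is a minimal sketch on a two-armed bandit. The payoff probabilities, learning rate, and running-mean baseline are assumptions chosen for illustration; this is a toy, not how any dialogue system is trained.

```python
import math
import random

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(seed=0, steps=5000, lr=0.1):
    """REINFORCE with a running-mean baseline on a 2-armed Bernoulli bandit."""
    rng = random.Random(seed)
    payoff = [0.2, 0.8]     # hypothetical reward probability of each arm
    prefs = [0.0, 0.0]      # policy parameters (softmax preferences)
    baseline = 0.0          # running mean of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if rng.random() < payoff[action] else 0.0
        advantage = reward - baseline
        baseline += (reward - baseline) / t
        # d/d prefs[k] of log pi(action) is 1[k == action] - probs[k]
        for k in range(2):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += lr * advantage * grad_log_pi
    return softmax(prefs)

final_probs = reinforce_bandit()
print(final_probs)
```

Even in this trivial setting each update uses one sampled action and one scalar reward, which gives some intuition for why the estimator needs so many samples once rewards become sparse.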


Thanks for the feedback! I'm more familiar with NLP through my colleagues and coursework, so I appreciate the real world experience.


I find that any technology has limitations. We have the problem in industry that whenever there is a new buzzword there are "Thought leaders" in companies telling everybody to use it, because it's the new thing. Like when washing machines needed to have Fuzzy Logic in the 90s, just because it got kinda hyped. They didn't do anything better because of it, but marketing could print it on the front.

Now you have AI, Industry 4.0/IoT, Blockchain, Big Data and the Cloud. Most of these technologies are useless to most problems, while being useful in a select few. However this won't stop marketing and sales departments from selling it to clients because it sounds good.


Author here. My intent was to say, "When someone I know asks me if RL is a good idea, the answer is no >= 70% of the time." The important missing piece is that it's conditioned on that person asking me in particular. I don't have the largest profile, so the people who ask me about RL are generally well-informed about RL and have some practical ML experience too. Within that set, 70% felt like the right lower bound.


If you read the article, I think that is a reference to the later passage where he notes that even RL models that are considered "working" still fail 30% of the time because the models are so unstable that their success depends on the random seed. If that's the level required, then he is allowed to say with confidence that they don't work, because even if it sometimes would, that would be way under 30% of the cases.


You don’t understand that quote. It’s saying that 70% of the time when people think reinforcement learning is the solution, it’s not. It's not saying that 30% of all problems are solvable by reinforcement learning. Sometimes I can’t tell if people here, you included, are intentionally misreading things to further their agenda.


> Sometimes I can’t tell if people here, you included, are intentionally misreading things to further their agenda.

More likely, sometimes people are just tired, distracted, or after consumption of alcohol/drugs.

Personally, every other week I find myself writing a comment - sometimes long - and then deleting it a minute later, after re-reading the comment I was replying to and realizing I completely misunderstood it, and/or was arguing against a strawman of my own creation. And every other month I have a comment wrt. which I realize my mistake only a few hours later.


Did you misread my comment? I'm really confused how you could think that I was saying "of all problems", when I specifically said "of the instances where someone asks if X is a good solution".


I don't give a shit and I read it as 30% of the time it works, so I guess I'm just a dumbass? :(


It's a shame how data hungry DRL is (even when compared to DL), but the DRL framework encompasses all the standard classification/regression tasks and also includes decision making, planning and pretty much anything else you can think of.

Model generality and data efficiency are in an inverse relationship and a lot of research has been in moving up or down this hyperbola. At one extreme is tailoring models to specific use cases/datasets/environments; at the other is transferring learning across domains. DRL is stuck pretty high on the generality end. Some breakthroughs seem to have moved progress to a higher level curve: optimizers and training algorithms (Adam, DQN, TRPO) have gotten better, which helps everything in general, and core structures like CNNs or memory cells seem to be somewhat universal (or our best guess yet). But there's still something fundamental that seems to be missing. Or maybe this is all there is and we just need a computer with a richer/higher-resolution sense field and the flops to process it.


>It's a shame how data hungry DRL is (even when compared to DL), but the DRL framework encompasses all the standard classification/regression tasks and also includes decision making, planning and pretty much anything else you can think of.

Sure, and a Turing machine can express any program you care to write. RL seems like the Turing tar pit of machine learning: theoretically able to express everything, practically convenient mostly just for trivial examples.

>Model generality and data efficiency are in an inverse relationship and a lot of research has been in moving up or down this hyperbola.

Sure, basic Curse of Dimensionality, but the whole success of hierarchical modeling in the brain, hierarchical Bayesian methods, and deep neural networks has been that hierarchical modeling seems to ameliorate or even defeat the Curse of Dimensionality. The question is: well, why can't it do that in reinforcement learning?

In interesting terms: why does the teaching signal have more information in supervised learning than reinforcement learning, relative to the inherent uncertainty of the task?


DRL is too fragile to be realistically useful while supervised learning is not.


> but the DRL framework encompasses all the standard classification/regression tasks and also includes decision making

Except labelling data, correct? Which is a non-trivial effort.


> Except labelling data, correct? Which is a non-trivial effort.

Choosing what to label is an RL problem too.

If you're choosing environment actions to learn how the environment 'labels' them, that's the classic topic of 'how do we make DRL models explore well' (and arguably is the Achilles heel of the model-free DRL approaches OP is criticizing: the NNs can easily learn to optimize their actions, even tiny NNs are more than enough, but they just don't get fed the 'right' data, i.e. exploration is bad). Relevant papers: https://www.reddit.com/r/reinforcementlearning/search?q=flai...

If you're being very narrow and considering a classification problem, well, that's a RL problem too: you can optimize which datapoints you get labels for based on how informative a datapoint is (most datapoints are simply redundant) or how expensive it is to label. That's called 'active learning': https://www.reddit.com/r/reinforcementlearning/search?q=flai... It's particularly natural if you are doing large-scale image classification and have a service like Amazon Turk plugged in to get (or correct) labels.
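
A minimal sketch of the active-learning idea described above, using uncertainty sampling on a toy 1-D threshold classifier. The `oracle` function stands in for a human labeler, and the threshold 0.62, pool size, and labeling budget are made-up assumptions for illustration.

```python
import random

def oracle(x, threshold=0.62):
    """Stand-in for a human annotator (hypothetical ground truth)."""
    return x > threshold

def estimate_boundary(labeled):
    """Midpoint between the largest negative and smallest positive seen so far."""
    negatives = [x for x, y in labeled if not y]
    positives = [x for x, y in labeled if y]
    lo = max(negatives) if negatives else 0.0
    hi = min(positives) if positives else 1.0
    return (lo + hi) / 2

def active_learning(pool, budget):
    """Spend the labeling budget on the points nearest the current boundary."""
    labeled = []
    unlabeled = list(pool)
    for _ in range(budget):
        boundary = estimate_boundary(labeled)
        # Uncertainty sampling: the point closest to the boundary is the
        # one the current model is least sure about.
        x = min(unlabeled, key=lambda u: abs(u - boundary))
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))
    return estimate_boundary(labeled)

def random_labeling(pool, budget, rng):
    """Baseline: spend the same budget on uniformly random points."""
    labeled = [(x, oracle(x)) for x in rng.sample(pool, budget)]
    return estimate_boundary(labeled)

rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]
active_est = active_learning(pool, budget=10)
random_est = random_labeling(pool, budget=10, rng=rng)
print(f"active: {active_est:.3f}  random: {random_est:.3f}  (true 0.62)")
```

With the same ten labels, the active strategy behaves like a binary search over the pool, which is the sense in which most datapoints are redundant.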


In many cases people greatly overestimate the effort of labelling data. For the price of just a single engineer man-month you can generally label a lot of data, sufficient for many tasks; especially if you can do bootstrapping (the humans don't label all data, but correct mistakes of a previous insufficiently good automated solution) or some transfer learning from a somewhat similar task or an unsupervised one.

Like, if you need a solution for some niche of image classification, even a single afternoon of labeling data might be sufficient to adapt an ImageNet classifier to your particular labels and get reasonable accuracy.


I'm watching an AI agent right now learn to play Texas Hold'em and after 75,000 hands, starting with zero knowledge, it plays about as well as some of my friends.

What should I try to teach it next?


75K hands seems very low to me.

Can you give more details?

Is it HU? NL? Using just self-play from zero knowledge?

How many roll-outs do you perform for each game action?


Home-brewed AI, not a NN. Similar in some ways to AlphaGo Zero's MCTS.

Learns to play with 0 knowledge. No training data. No rollouts. It squeezes a lot of info from each hand, more than a NN, which is why it needs fewer trials.

Currently 100k hands and most respectable play.


Would love to chat and exchange some info

yazr2yazr@gmail.com


Finding the cure for cancer.


Wash my dishes and fold my laundry.


I do not have the experience to support or refute the author's claim, but he writes:

"The paper does not clarify what “worker” means, but I assume it means 1 CPU."

That seems way underpowered to me. It's DeepMind, so I would assume that 1 worker is 1 GPU/TPU node, meaning there are multiple GPUs for each worker. I could see how not having enough compute power could result in a poor solution.


DRL is different from regular DL in that it tends towards CPU-heavy, not GPU-heavy. It's hard to saturate a single GPU/TPU since you're using tiny little NNs and only once in a while updating them based on long episodes through the environment.

It might not be using GPUs/TPUs at all! If you look at the algorithm which that DM paper is based on, PPO, the original OpenAI paper & implementation (https://blog.openai.com/openai-baselines-ppo/) doesn't use GPUs, it's pure-CPU. (They have a second version which adds GPU support.)

Or in a DM vein, look at their latest IMPALA which you might've noticed on the front page a few days ago: https://arxiv.org/pdf/1802.01561.pdf Look at Table 1 pg5's computational resources for various agents: note how many of them have 0 GPUs whatsoever. Even the largest configuration, 500 CPUs, only saturates 1 Nvidia P100 GPU.

(So, 'worker' could hypothetically refer to a server with X cores and 1 GPU processing them locally, but this is almost certainly not the case since it would imply scaling up to thousands of CPUs which is actually highly difficult and requires careful engineering like with IMPALA.)


The author also works for Google Brain though, so I doubt that they failed to consider GPU/TPU. Really, "worker" is just too underspecified to say what it means.


> In this run, the initial random weights tended to output highly positive or highly negative action outputs. This makes most of the actions output the maximum or minimum acceleration possible. It’s really easy to spin super fast: just output high magnitude forces at every joint. Once the robot gets going, it’s hard to deviate from this policy in a meaningful way - to deviate, you have to take several exploration steps to stop the rampant spinning. It’s certainly possible, but in this run, it didn’t happen.

This is extremely human. Once you're deeply committed to something, it's hard to imagine alternatives, never mind embrace them.
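
The mechanism in the quoted passage can be illustrated with a small sketch: if a policy squashes its output into a bounded action range, large initial weights push most actions to the extremes. The tanh squashing, Gaussian initialization, observation size, and 0.99 cutoff are all assumptions made for illustration; the article does not say which architecture or initialization was used.

```python
import math
import random

def saturation_fraction(weight_scale, obs_dim=10, samples=2000, seed=0, cutoff=0.99):
    """Fraction of tanh-squashed actions with |action| > cutoff
    for randomly initialized linear policies on random observations."""
    rng = random.Random(seed)
    saturated = 0
    for _ in range(samples):
        weights = [rng.gauss(0.0, weight_scale) for _ in range(obs_dim)]
        obs = [rng.gauss(0.0, 1.0) for _ in range(obs_dim)]
        pre_activation = sum(w * o for w, o in zip(weights, obs))
        action = math.tanh(pre_activation)   # bounded to (-1, 1)
        if abs(action) > cutoff:
            saturated += 1
    return saturated / samples

small = saturation_fraction(0.1)   # small initial weights
large = saturation_fraction(3.0)   # large initial weights
print(f"saturated actions: small init {small:.2%}, large init {large:.2%}")
```

Under these assumptions the large-weight policy spends most of its time at maximum or minimum output, which is the starting condition the passage blames for the stuck run.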


This is an excellent write-up.


I might show this to that Uncle who talks about Kurzweil and the Singularity at Thanksgiving.


Even AI researchers fall for the Singularity fallacy (imo), including Schmidhuber:

http://people.idsia.ch/~juergen/history.html

Although at least there's some display of self-skepticism:

"Kurzweil (2005) plots exponential speedups in sequences of historic paradigm shifts identified by various historians, to back up the hypothesis that "the singularity is near." His historians are all contemporary though, presumably being subject to a similar bias. People of past ages might have held quite different views. For example, possibly some historians of the year 1525 felt inclined to predict a convergence of history around 1540, deriving this date from an exponential speedup of recent breakthroughs such as Western bookprint (around 1444), the re-discovery of America (48 years later), the Reformation (again 24 years later - see the pattern?), and other events they deemed important although today they are mostly forgotten."

Which is a little rare, if you know the curious character of Jürgen Schmidhuber :)


Well, in a way, they're right. You could consider the printing press, or the aeroplane, or electricity, or the telegraph, to be a mini-"singularity" event since it drastically changed the world in ways unpredictable beforehand. "Singularity" doesn't necessarily equate to "rapture of the nerds where AI gods make everything awesome (and/or kill us all)", it just means "point where things get weird and we can't predict what will happen next."


Well, that's IMO emptying the word of its original meaning. You're referring to a revolution, which is a well-established term, not a bona fide 'Singularity', which comes from mathematics as a point where the speed of change properly diverges -- as would be the case if we had a geometric time series of events with constant improvements.

This usage of the term really originated in the context of rampant intelligence growth (through a supposed explosive self-improvement), see the wikipedia article:

https://en.wikipedia.org/wiki/Technological_singularity
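
As a worked version of the "geometric time series of events" idea, take the historian example quoted earlier in the thread: a first event at 1444, the next 48 years later, the next 24 years after that. Assuming the gaps keep halving (the symbols t_0, \Delta_0 and r are introduced here only for illustration), the event dates form a convergent geometric series:

```latex
\[
t_n = t_0 + \sum_{k=0}^{n-1} \Delta_0 r^{k}
    = t_0 + \Delta_0 \frac{1 - r^{n}}{1 - r},
\qquad
\lim_{n \to \infty} t_n = t_0 + \frac{\Delta_0}{1 - r}
\quad (0 < r < 1).
\]
% With t_0 = 1444, \Delta_0 = 48, r = 1/2:
\[
\lim_{n \to \infty} t_n = 1444 + \frac{48}{1 - 1/2} = 1444 + 96 = 1540.
\]
```

Infinitely many events fall before a finite date, so the rate of events diverges there. That matches the "convergence of history around 1540" in the quote, and it is the mathematical sense of a singularity as opposed to a single large disruption.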


For the term 'singularity' to mean something, it has to involve a starting point from which change keeps accelerating. The singularity can't 'slow down', so to speak. Anything else is simply a 'paradigm change' or a significant disruption, of which we have had lots.

As the name suggests if the 'singularity' exists there's only going to be a single one.


I don't like the title of the article. I think it is sensationalist. Deep reinforcement learning has beaten a human at the game of Go. It does work. The models are not easy for most laymen to train, and a lot still has to be addressed to make it production-ready. But it does work. Great write-up overall. Thank you for the effort.


The current AlphaZero system has little in common with deep reinforcement learning; the NN-guided MCTS is effective but not something that can be applied to general reinforcement learning tasks. And the first AlphaGo system was highly reliant on standard supervised learning (training the value network and policy network on grandmaster games).

If anything, AlphaGo, AlphaGo Zero and AlphaZero are illustrations of how "pure" deep reinforcement learning is insufficient, that the non-RL parts have an enormous impact.


You might want to read the article...


Can autonomous vehicles work without deep reinforcement learning? I thought that things like negotiating entry into a crossroad required DRL.


Of course they can. DRL is a very very specific set of techniques to train decision-making over multiple timesteps.



