
> D-REX proposes a really clever trick to get around not having any reward labels at all, even when the demonstrator is suboptimal: Given a suboptimal policy... add variable amounts of noise to its actions. Assume that adding noise to a suboptimal policy makes it even more suboptimal... Train a ranking model to predict which of two trajectories has a higher return. The ranking model magically extrapolates to trajectories that are better

What strikes me about this is that the assumption (adding noise to a policy makes it worse) goes completely against evolutionary approaches to AI (where we look for improvements precisely by adding noise).
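
Concretely, the trick in the quote looks something like the sketch below. This is a toy mock-up under my own assumptions (made-up environment and names like toy_env_step and suboptimal_policy, numpy + torch, epsilon-style random-action noise, a pairwise ranking loss on summed predicted rewards), not the paper's actual code:

  # Minimal sketch of the noise-injection ranking idea (toy environment,
  # illustrative names; assumes numpy and torch are available).
  import numpy as np
  import torch
  import torch.nn as nn

  STATE_DIM, N_ACTIONS, HORIZON = 4, 3, 50
  rng = np.random.default_rng(0)

  def toy_env_step(state, action):
      # Stand-in dynamics: decay plus an action-dependent push.
      push = np.eye(N_ACTIONS)[action]
      return 0.9 * state + np.concatenate([push, [0.1]]) + 0.05 * rng.normal(size=STATE_DIM)

  def suboptimal_policy(state):
      # Stand-in for the suboptimal demonstrator policy.
      return int(np.argmax(state[:N_ACTIONS]))

  def rollout(eps):
      # Noise injection: with probability eps, replace the action with a random one.
      state, traj = rng.normal(size=STATE_DIM), []
      for _ in range(HORIZON):
          traj.append(state.copy())
          a = rng.integers(N_ACTIONS) if rng.random() < eps else suboptimal_policy(state)
          state = toy_env_step(state, a)
      return np.stack(traj)

  # Rollouts at several noise levels; the key assumption is more noise => lower return.
  noise_levels = [0.0, 0.25, 0.5, 0.75, 1.0]
  trajs = {eps: [rollout(eps) for _ in range(5)] for eps in noise_levels}

  reward_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
  opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

  def predicted_return(traj):
      return reward_net(torch.as_tensor(traj, dtype=torch.float32)).sum()

  for step in range(2000):
      lo, hi = sorted(rng.choice(noise_levels, size=2, replace=False))
      better = trajs[lo][rng.integers(5)]   # less noise, assumed higher return
      worse = trajs[hi][rng.integers(5)]
      # Pairwise (Bradley-Terry style) ranking loss on the predicted returns.
      logits = torch.stack([predicted_return(better), predicted_return(worse)]).unsqueeze(0)
      loss = nn.functional.cross_entropy(logits, torch.tensor([0]))
      opt.zero_grad(); loss.backward(); opt.step()

The extrapolation claim is then just that the learned reward keeps increasing past the least-noisy (i.e. original demonstrator) trajectories, so optimizing against it can do better than the demonstrator.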



The two ideas are mostly compatible (and neither assumption always holds):

(Evolutionary) If you generate enough perturbations, then some of them will be better.

(TFA) If you generate perturbations, then most of them will be worse.

In the evolutionary case you also explicitly design your model and algorithm to try to generate good perturbations, so the two ideas aren't necessarily directly comparable anyway.
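
For what it's worth, you can see both claims hold at once in a toy numeric check (made-up quadratic "return", nothing to do with the paper):

  # Perturb a decent-but-not-optimal parameter vector and count how often
  # a random perturbation helps (purely illustrative numbers).
  import numpy as np

  rng = np.random.default_rng(0)
  target = rng.normal(size=20)                 # pretend these are the "optimal" policy weights
  theta = target + 0.3 * rng.normal(size=20)   # a suboptimal but decent policy

  def score(params):
      return -np.sum((params - target) ** 2)   # higher is better

  perturbed = theta + 0.1 * rng.normal(size=(10_000, 20))
  better = np.array([score(p) > score(theta) for p in perturbed])

  print(f"worse after noise (TFA's assumption):        {1 - better.mean():.2f}")
  print(f"better after noise (what evolution selects): {better.mean():.2f}")

Most perturbations hurt, which is all the ranking trick needs on average, while evolution only needs the minority that help.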



