Off Belief Learning (facebook.com)
50 points by yamrzou on April 18, 2022 | 2 comments



> In conventional multiagent RL methods, the two agents would converge on using 1 to represent one color and 2 for the other. …

> … But that approach wouldn’t work well in playing with humans or with another independently trained agent. Since there’s no way to know in advance whether an agent uses 1 to represent a particular color or not, they wouldn’t know whether 1 represented blue or red [action, sic].

This last paragraph is not persuasive. With two representations (labels), it is easy for one agent to compensate if the other agent uses a different label. Why? An RL agent pays attention to the reward signal and can learn beneficial policies (ways of acting based on its stimuli).

Am I missing something?

Yes, it is harder as the number of labels increases. (If the communication is not grounded, an agent would need to learn the other agent’s mapping.) But that does not seem hard enough to worry about … neural networks are excellent at learning such mappings given enough training data … unless there is some other simultaneous complexity that makes data efficiency a problem.
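
To make that concrete, here is a toy sketch (my own example, nothing from the paper; the game, the convention, and all the names are made up for illustration): with repeated interaction and nothing but the reward signal, even a tabular learner recovers a partner’s arbitrary label-to-color mapping.

    import random

    COLORS = ["red", "blue"]
    partner_convention = {"red": 2, "blue": 1}   # unknown to the learner

    # Q[label][guess]: estimated reward for guessing `guess` after seeing `label`
    Q = {label: {c: 0.0 for c in COLORS} for label in (1, 2)}
    alpha, epsilon = 0.1, 0.1

    for step in range(5000):
        true_color = random.choice(COLORS)
        label = partner_convention[true_color]        # partner signals its label
        if random.random() < epsilon:                 # epsilon-greedy exploration
            guess = random.choice(COLORS)
        else:
            guess = max(Q[label], key=Q[label].get)
        reward = 1.0 if guess == true_color else 0.0  # the only feedback used
        Q[label][guess] += alpha * (reward - Q[label][guess])

    # The learner has recovered the partner's arbitrary convention:
    print({label: max(Q[label], key=Q[label].get) for label in (1, 2)})
    # -> {1: 'blue', 2: 'red'}

The catch, of course, is that this assumes you get thousands of interactions with that same partner, which is exactly what the paper’s setting rules out.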


The setting considered here is zero-shot coordination with unknown human partners. The agent does not get to play with that particular human player for multiple runs. It is a motivating example to show what a grounded policy is and why it may be desirable in real-world situations. In complex environments, RL methods may learn arbitrarily complex strategies/conventions when trained in simulation, making them hard to deploy in real time for human-AI collaboration. Even when multiple rounds of interaction are available, it is challenging to fine-tune a policy on the fly given the very limited amount of data.
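
A toy illustration of that failure mode (my own sketch, not OBL itself, with made-up conventions): two training runs that each learn a perfect but arbitrary convention score well in self-play and fail in zero-shot cross-play, with no data to adapt on.

    import random

    COLORS = ["red", "blue"]

    def avg_reward(speaker_conv, listener_conv, n=1000):
        """Average score when speaker and listener each use their own convention."""
        decode = {v: k for k, v in listener_conv.items()}
        hits = 0
        for _ in range(n):
            color = random.choice(COLORS)
            label = speaker_conv[color]      # speaker encodes with its convention
            guess = decode[label]            # listener decodes with its own
            hits += (guess == color)
        return hits / n

    conv_A = {"red": 1, "blue": 2}   # convention found by one training run
    conv_B = {"red": 2, "blue": 1}   # convention found by an independent run

    print("self-play  A+A:", avg_reward(conv_A, conv_A))   # ~1.0
    print("cross-play A+B:", avg_reward(conv_A, conv_B))   # ~0.0, conventions clash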



