Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> If neither of those, what would proper evidence for the claim look like?

Ok tell me what you think of this, it's just a thought experiment but maybe it works.

Suppose I train Model A on a dataset of reviews where the least common rating is 1 star, and the most common rating is 5 stars. Similarly, I train Model B on a unique dataset where the least common rating is 2 stars, and the most common is 4 stars. Then, I "align" both models to prefer the least common rating when generating responses.

If I then ask either model to write a review and see it consistently preferring 3 stars in their scratchpad - something neither dataset emphasized - while still giving me expected responses as per my alignment, I’d suspect the alignment is "fake". It would seem as though the model has developed an unexplained preference for a rating that wasn’t part of the original data or alignment intent, making it feel like the alignment process introduced an artificial bias rather than reflecting the datasets.



It doesn't seem like that distinguishes the case where your "alignment" (RL?) simply failed (eg because the model wasn't good enough to explore successfully) from the case where the model was manipulating the RL process for goal oriented reasons, which is a distinction the paper is trying to test.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: