It is fine-tuned to maximize reward, though, not likelihood. And it provides an answer in both cases, just not as good a one.
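The distinction can be sketched with a toy categorical "policy" over a tiny vocabulary. Everything here is illustrative (the logits, target, and per-token rewards are made up): pretraining minimizes negative log-likelihood of the data, while RLHF-style tuning maximizes expected reward under the model's own distribution, where the reward comes from a separate reward model rather than from the data.

```python
import numpy as np

# Toy next-token distribution over a 3-token vocabulary (illustrative numbers).
logits = np.array([2.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()

# Likelihood (pretraining) objective: maximize log p(target token),
# i.e. minimize the negative log-likelihood of the observed data.
target = 0
nll_loss = -np.log(probs[target])

# RLHF-style objective: maximize expected reward under the policy.
# The reward for each token comes from a (hypothetical) reward model,
# not from any ground-truth target in the data.
rewards = np.array([1.0, -0.5, 0.2])
expected_reward = (probs * rewards).sum()
```

The key difference: the likelihood loss only "sees" the target token, while the reward objective weights every possible output by how much the reward model likes it.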



So since the model is fine-tuned via RLHF, my point doesn't stand?

Genuine question; it would be interesting if some other mechanism were at play here.


For a given answer, I would expect it to get the same reward under both question orderings. So, naively, I would expect it not to be affected by the ordering.
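The intuition can be sketched with a toy reward function that scores an answer against the set of words in the prompt, so it cannot depend on which question comes first. This is purely illustrative: a real reward model is a learned network over the full token sequence, and nothing forces it to be order-invariant.

```python
# Hypothetical order-insensitive reward: score an answer by keyword
# overlap with the union of words from all prompt parts. Because it
# uses a set union, the order of the questions cannot matter.
def reward(prompt_parts, answer):
    keywords = set()
    for part in prompt_parts:
        keywords |= set(part.lower().split())
    return len(keywords & set(answer.lower().split()))

q1 = "what is the capital of France"
q2 = "name a European city"
ans = "Paris is the capital of France"

# Same reward regardless of question ordering.
r_ab = reward([q1, q2], ans)
r_ba = reward([q2, q1], ans)
```

A learned reward model reads the prompt as one ordered token sequence, so the invariance this toy function has by construction is exactly what a real model might fail to exhibit.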



