
Is there any intuition for why it even works? It seems very unexpected.


The intuition is that the relative frequency at which past tokens get attention from future tokens is a good proxy for their relative importance.

The model the authors use, in fact, maps attention scores to features in the frequency domain.
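
To make the "relative frequency" intuition concrete, here is a minimal sketch in plain Python. It is not the authors' method: the function names, the 0.1 threshold, and the toy attention matrix are all made up for illustration. It just counts how often each past token receives non-trivial attention from later tokens and keeps the most frequently attended ones.

```python
def attention_hit_frequency(attn, threshold=0.1):
    """Fraction of later queries that attend to each token.

    attn[q][k] is the weight query position q puts on key position k
    (causal, so only k <= q is meaningful). The threshold is an
    arbitrary illustrative cutoff, not a value from the paper.
    """
    n = len(attn)
    freqs = []
    for k in range(n):
        later = range(k + 1, n)  # future queries that can see token k
        if len(later) == 0:
            # The last token has no future queries yet; use 0.0 by convention.
            freqs.append(0.0)
            continue
        hits = sum(1 for q in later if attn[q][k] >= threshold)
        freqs.append(hits / len(later))
    return freqs


def keep_top(freqs, budget):
    """Indices of the `budget` most frequently attended tokens, in order."""
    ranked = sorted(range(len(freqs)), key=lambda k: (-freqs[k], k))
    return sorted(ranked[:budget])


# Toy causal attention matrix for 4 tokens (each row sums to 1).
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.05, 0.35, 0.0],
    [0.4, 0.05, 0.25, 0.3],
]

freqs = attention_hit_frequency(attn)
print(freqs)               # [1.0, 0.0, 1.0, 0.0]
print(keep_top(freqs, 2))  # [0, 2]
```

Token 1 never gets meaningful attention from later positions, so under this proxy it is the first candidate to drop. The most recent token always scores 0.0 here simply because nothing has had a chance to attend to it yet, so a real scheme would presumably treat recent tokens separately.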



