
Very clever, very meta, and it seems to work really well.

The two big takeaways for me are:

* It's possible to train a model to learn to summarize context from the attention matrix, based only on dot-product scores (k @ q.T * mask), regardless of how tokens are embedded.

* Once the model is trained, it will work with any attention matrix, even if it's the attention matrix of another model.
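A minimal NumPy sketch of the masked score matrix the first bullet describes (shapes, names, and the row-as-query convention are my own; it's the transpose of the `k @ q.T` form above):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                          # sequence length, head dimension

q = rng.standard_normal((T, d))      # queries, one row per token
k = rng.standard_normal((T, d))      # keys, one row per token

# Raw dot-product scores: entry (i, j) is how strongly token i
# attends to token j. The causal mask zeroes attention to the future.
mask = np.tril(np.ones((T, T)))      # lower-triangular causal mask
scores = (q @ k.T) * mask            # (T, T) masked score matrix

# A summarizer trained only on `scores` never sees the token
# embeddings themselves, which is the embedding-independence claim.
```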

I've added this to my ever-growing list of things to try.



Is there any intuition for why this even works? It seems very unexpected.


The intuition is that the relative frequency at which past tokens get attention from future tokens is a good proxy for their relative importance.

The model the authors use, in fact, maps attention scores to features in the frequency domain.
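A toy sketch of that intuition (my own code, not the authors' model): softmax the causally masked scores into attention weights, then use the attention mass each token receives as its importance proxy.

```python
import numpy as np

def token_importance(scores: np.ndarray) -> np.ndarray:
    """Proxy importance: total attention each token receives from
    itself and the tokens that follow it (column sums of the
    softmaxed, causally masked score matrix)."""
    T = scores.shape[0]
    mask = np.tril(np.ones((T, T)))
    # Causal softmax over keys, row by row; -inf entries become 0.
    masked = np.where(mask > 0, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Column j: attention mass flowing into token j.
    return weights.sum(axis=0)

imp = token_importance(np.random.default_rng(1).standard_normal((5, 5)))
```

Since each row of the weights sums to 1, the importance scores over a length-T sequence always sum to T; early tokens can accumulate mass from many later rows, which is exactly the "frequently attended past tokens matter more" heuristic.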



