Very clever, very meta, and it seems to work really well.
The two big takeaways for me are:
* It's possible to train a model to summarize context from the attention matrix, based only on the dot-product scores (k @ q.T * mask), regardless of how the tokens are embedded (rough sketch after this list).
* Once trained, the model works with any attention matrix, even one produced by a different model.
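
To make that concrete, here's a minimal sketch of how I picture it (the module name, the per-token features, and the masking convention are all my own guesses, not anything from the post): a small head that only ever sees the masked score matrix, never the token embeddings.

    import torch
    import torch.nn as nn

    def masked_scores(q, k):
        """Causal dot-product scores as in the comment: k @ q.T * mask.
        Entry (i, j) is k_i . q_j; key i is visible to query j only
        when i <= j, hence the upper-triangular mask."""
        seq = q.shape[0]
        mask = torch.triu(torch.ones(seq, seq))
        return (k @ q.T) * mask

    class ScoreSummarizer(nn.Module):
        """Hypothetical head (my naming): scores per-token importance
        from the attention score matrix alone, with no access to the
        embeddings, so nothing ties it to the model that produced q, k."""
        def __init__(self, hidden=32):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1)
            )

        def forward(self, scores):
            # Crude per-key features: mean and max attention each token
            # receives across queries (masked entries contribute zeros).
            feats = torch.stack(
                [scores.mean(dim=1), scores.amax(dim=1)], dim=-1
            )
            return self.mlp(feats).squeeze(-1)  # (seq,) importance logits

    seq, d = 16, 8
    q, k = torch.randn(seq, d), torch.randn(seq, d)
    importance = ScoreSummarizer()(masked_scores(q, k))
    keep = importance.topk(8).indices  # e.g. keep the 8 "most important" tokens

Because the head never touches the embeddings, nothing in it depends on which model produced q and k, which is presumably why the second point holds.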
I've added this to my ever-growing list of things to try.