Very clever, very meta, and it seems to work really well.
The two big takeaways for me are:
* It's possible to train a model to summarize context from the attention matrix, based only on the dot-product scores (k @ q.T * mask), regardless of how the tokens are embedded (rough sketch after this list).
* Once trained, the model works with any attention matrix, even one produced by a different model.
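
To make that concrete, here's a minimal sketch of how I picture it (the module name, the per-token features, and the masking convention are all my own guesses, not anything from the post): a small head that only ever sees the masked score matrix, never the token embeddings.

    import torch
    import torch.nn as nn

    def masked_scores(q, k):
        """Causal dot-product scores as in the comment: k @ q.T * mask.
        Entry (i, j) is k_i . q_j; key i is visible to query j only
        when i <= j, hence the upper-triangular mask."""
        seq = q.shape[0]
        mask = torch.triu(torch.ones(seq, seq))
        return (k @ q.T) * mask

    class ScoreSummarizer(nn.Module):
        """Hypothetical head (my naming): scores per-token importance
        from the attention score matrix alone, with no access to the
        embeddings, so nothing ties it to the model that produced q, k."""
        def __init__(self, hidden=32):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1)
            )

        def forward(self, scores):
            # Crude per-key features: mean and max attention each token
            # receives across queries (masked entries contribute zeros).
            feats = torch.stack(
                [scores.mean(dim=1), scores.amax(dim=1)], dim=-1
            )
            return self.mlp(feats).squeeze(-1)  # (seq,) importance logits

    seq, d = 16, 8
    q, k = torch.randn(seq, d), torch.randn(seq, d)
    importance = ScoreSummarizer()(masked_scores(q, k))
    keep = importance.topk(8).indices  # e.g. keep the 8 "most important" tokens

Because the head never touches the embeddings, nothing in it depends on which model produced q and k, which is presumably why the second point holds.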
I've added this to my ever-growing list of things to try.