
> The only thing which is larger is the self attention calculation which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self attention calculations.

The FFWD input is the self-attention output. And since the output of the self-attention layer is [context, d_model], the FFWD layer input grows with the context, so the FFWD layer compute cost should grow as well, no?

The cost of the FFWD layer according to my calculations is ~(4 + 2 * [1 if a gate matrix W3 is present, else 0]) * d_model * d_ff * n_layers * context_size, so the FFWD cost grows linearly wrt the context size.

So, unless I've misunderstood the transformer architecture, the larger the context, the larger the compute of both self-attention and the FFWD layers? A small sketch of what I mean is below.
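
Back-of-the-envelope sketch of that formula (ffwd_flops is my own helper name, the sizes are just illustrative, and "has_w3" stands in for the true(w3) term, i.e. whether a third gate matrix exists):

    def ffwd_flops(context, d_model, d_ff, n_layers, has_w3=True):
        per_matmul = 2 * context * d_model * d_ff   # [context, d_model] x [d_model, d_ff]
        n_matmuls = 3 if has_w3 else 2              # W1 and W2, plus the optional gate W3
        return n_matmuls * per_matmul * n_layers

    print(ffwd_flops(2048, 4096, 11008, 32))        # doubling context...
    print(ffwd_flops(4096, 4096, 11008, 32))        # ...doubles the FFWD FLOPs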



The FFWD layer is independent of context size; each processed token passes through the same weights.
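
A minimal sketch of what I mean (toy sizes, plain ReLU FFWD just for illustration): the weights are fixed, and each token's vector goes through them on its own, so the per-token cost doesn't depend on how long the context is.

    import numpy as np

    d_model, d_ff = 8, 32                       # toy sizes, illustrative only
    W1 = np.random.randn(d_model, d_ff)         # fixed weights, shared by every token
    W2 = np.random.randn(d_ff, d_model)

    def ffwd_one_token(x):                      # x: [d_model]
        return np.maximum(x @ W1, 0) @ W2       # same work regardless of context length

    for context in (10, 1000):
        X = np.random.randn(context, d_model)   # attention output: [context, d_model]
        Y = np.stack([ffwd_one_token(row) for row in X])
        print(Y.shape)                          # per-token cost is constant; total scales with rows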


So you're saying that if I have a sentence of 10 words, and I want the LLM to predict the 11th word, FFWD compute is going to be independent of the context size?

I don't understand how, since that very context is what determines whether the predicted next token is likely to be a good one or not?

More specifically, the FFWD layer is essentially the self-attention output, a [context, d_model] matrix, matmul'd with the W1, W2 and W3 weights?
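
Something like this is what I have in mind (assuming a LLaMA-style gated FFWD, with W1/W3 as the gate/up projections and W2 as the down projection; toy sizes only):

    import numpy as np

    def silu(x):
        return x / (1.0 + np.exp(-x))

    def ffwd(X, W1, W2, W3):
        # X is the self-attention output, shape [context, d_model]
        return (silu(X @ W1) * (X @ W3)) @ W2    # every matmul here is linear in context

    context, d_model, d_ff = 10, 16, 64          # toy sizes
    X = np.random.randn(context, d_model)
    W1 = np.random.randn(d_model, d_ff)
    W3 = np.random.randn(d_model, d_ff)
    W2 = np.random.randn(d_ff, d_model)
    print(ffwd(X, W1, W2, W3).shape)             # (10, 16)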



