In training we learn a.) the embeddings and b.) the KQ/MLP weights.
How well do Transformers perform given learned embeddings but only randomly initialized decoder weights? Do they produce a word soup of related concepts? Anything syntactically coherent?
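A minimal sketch of how one might probe this, assuming a Hugging Face GPT-2 checkpoint as the source of the "learned embeddings" (the model name, prompt, and sampling settings are illustrative, not prescribed by the notes above): transplant the pretrained token and position embeddings into an otherwise randomly initialized model, then sample from it.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

# Source of learned embeddings (assumption: GPT-2 small as a stand-in).
pretrained = GPT2LMHeadModel.from_pretrained("gpt2")
# Target model: every weight randomly initialized.
fresh = GPT2LMHeadModel(GPT2Config())

with torch.no_grad():
    # Transplant the learned token and position embeddings.
    fresh.transformer.wte.weight.copy_(pretrained.transformer.wte.weight)
    fresh.transformer.wpe.weight.copy_(pretrained.transformer.wpe.weight)
    # Note: GPT-2 ties lm_head to wte, so the output projection is also
    # "learned" now; only the KQ/MLP block weights remain random.

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
prompt = tokenizer("The cat sat on the", return_tensors="pt")
out = fresh.generate(
    **prompt,
    max_new_tokens=20,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0]))  # word soup? anything syntactically coherent?
```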
Once a well-trained, high-dimensional representation of tokens is established, can the model learn the KQ/MLP weights significantly faster?
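A hedged continuation of the sketch above for this second question: freeze the transplanted embeddings, train only the attention/MLP weights, and compare loss-vs-steps curves against a model trained entirely from scratch (the training loop itself is omitted; hyperparameters are placeholders).

```python
# Freeze the transplanted embeddings so only the KQ/MLP block weights train.
for p in fresh.transformer.wte.parameters():
    p.requires_grad = False
for p in fresh.transformer.wpe.parameters():
    p.requires_grad = False
# Because lm_head is tied to wte, the output projection is frozen as well.

optimizer = torch.optim.AdamW(
    (p for p in fresh.parameters() if p.requires_grad),
    lr=3e-4,  # placeholder learning rate
)
# ...run the usual LM training loop for `fresh` and for a fully random
# baseline, and compare how quickly the loss drops in each case.
```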