
In training we learn (a) the embeddings and (b) the KQ/MLP weights.

How well do Transformers perform given learned embeddings but only randomly initialized decoder weights? Do they produce a word soup of related concepts? Anything syntactically coherent?
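
One way to poke at this empirically, as a rough sketch only (assuming PyTorch and the Hugging Face transformers library, with GPT-2 as a convenient stand-in): transplant the trained token/position embeddings into a randomly initialized copy of the same architecture and sample from it.

    import torch
    from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    pretrained = GPT2LMHeadModel.from_pretrained("gpt2")

    # Same architecture, but the attention (QKV) and MLP weights are untrained.
    model = GPT2LMHeadModel(GPT2Config())

    # Transplant only the learned embeddings. GPT-2 ties the input embedding
    # to the output head, so the unembedding comes along trained as well.
    model.transformer.wte.weight.data.copy_(pretrained.transformer.wte.weight.data)
    model.transformer.wpe.weight.data.copy_(pretrained.transformer.wpe.weight.data)

    prompt = tokenizer("The capital of France is", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=30, do_sample=True, top_k=50)
    print(tokenizer.decode(out[0]))

Whether the output is word soup or something more structured would then be directly visible in the samples.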

Once a well-trained, high-dimensional representation of the tokens is established, can the KQ/MLP weights be learned significantly faster?
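
And for the second question, a sketch along the same hypothetical lines: freeze the transplanted embeddings, train only the remaining weights, and compare the loss curve against a model trained from a fully random init.

    import torch
    from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    pretrained = GPT2LMHeadModel.from_pretrained("gpt2")

    model = GPT2LMHeadModel(GPT2Config())
    model.transformer.wte.weight.data.copy_(pretrained.transformer.wte.weight.data)
    model.transformer.wpe.weight.data.copy_(pretrained.transformer.wpe.weight.data)

    # Freeze the embedding tables; attention, MLP and layer-norm weights
    # remain trainable.
    for name, p in model.named_parameters():
        if name.startswith("transformer.wte") or name.startswith("transformer.wpe"):
            p.requires_grad = False

    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=3e-4)

    # One illustrative step on a toy batch; a real comparison would track
    # loss vs. steps against the fully random baseline.
    batch = tokenizer(["The quick brown fox jumps over the lazy dog."],
                      return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(float(loss))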


