
The name "per-layer embeddings" is all we have to go on, and there are currently no published papers (that I'm aware of) using any similar mechanism, so, yes, it's a huge leap from a paper that doesn't mention per-layer anything.

It's fine to speculate based on the name, but don't pretend that it's a known technique when it clearly isn't.



Someone [1] inspected the dimensions of the model's embedding component, and it seems GP was on the right track. Assuming I understood [2] correctly, it does seem to be an embedding of the input tokens that is passed directly into each layer.

I have not looked at the model myself, but the embedding dimension of 256 seems quite small (for reference, according to [3] the old Gemma 1B had a 1152-dimensional input embedding), so I'm guessing this is not done _in lieu_ of the main input embedding to the first layer, but in addition to it.
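To make the speculation concrete: here is a minimal NumPy sketch of what such a mechanism could look like, assuming each layer has its own small (256-dim) token embedding table that is projected up and added to that layer's input, alongside the usual large input embedding. Every name and shape here is an assumption for illustration, not Gemma's actual internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: vocab size, main embedding dim (per [3]),
# small per-layer embedding dim (per [1]), layer count, sequence length.
vocab, d_model, d_ple, n_layers, seq = 1000, 1152, 256, 4, 8

# Usual large input embedding, fed only to the first layer.
main_embed = rng.normal(size=(vocab, d_model)) * 0.02
# One small embedding table per layer ("per-layer embeddings", speculatively).
ple_tables = rng.normal(size=(n_layers, vocab, d_ple)) * 0.02
# A per-layer projection from 256 dims up to the model width.
ple_proj = rng.normal(size=(n_layers, d_ple, d_model)) * 0.02

tokens = rng.integers(0, vocab, size=seq)

h = main_embed[tokens]  # (seq, d_model): the normal input embedding
for layer in range(n_layers):
    # Inject this layer's own token embedding directly into its input,
    # in addition to (not instead of) the residual stream.
    h = h + ple_tables[layer][tokens] @ ple_proj[layer]
    # ... the layer's attention / MLP blocks would go here ...

print(h.shape)  # (8, 1152)
```

The appeal of a scheme like this would be that the small per-layer tables can be kept off-accelerator and fetched per token, which fits the memory-saving framing in the announcement.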

[1] https://twitter.com/cccntu/status/1925043973170856393

[2] https://news.ycombinator.com/edit?id=44048662

[3] https://developers.googleblog.com/en/gemma-explained-whats-n...



