
The positional embedding can be thought of like this: just as you can hear two pieces of music overlaid on each other and still tell them apart, you can add the vocab embedding and the positional embedding together and the model is still able to pick them apart.

If you asked yourself to identify when someone's playing a high note or a low note (pos embedding) and whether they're playing Beethoven or Lady Gaga (vocab embedding), you could do both at once.

That’s why it’s additive and why it wouldn’t make much sense for it to be multiplicative.
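To make the additive part concrete, here's a minimal sketch assuming the sinusoidal encoding from the original Transformer paper; the random token embeddings and the sizes (5 tokens, 8 dims) are made up for illustration:

    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        # Sinusoidal positional encodings from "Attention Is All You Need".
        pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
        i = np.arange(d_model)[None, :]         # (1, d_model)
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        enc = np.zeros((seq_len, d_model))
        enc[:, 0::2] = np.sin(angles[:, 0::2])  # even dims: sine
        enc[:, 1::2] = np.cos(angles[:, 1::2])  # odd dims: cosine
        return enc

    rng = np.random.default_rng(0)
    token_emb = rng.normal(size=(5, 8))         # stand-in for a learned vocab embedding
    x = token_emb + sinusoidal_positions(5, 8)  # the two signals are simply summed

The point is the last line: the two embeddings are just summed, and downstream attention layers can still recover each signal, like the overlaid music in the analogy.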



The visualisation here may be helpful.

https://github.com/tensorflow/tensor2tensor/issues/1591


Thanks, that's a really useful intuition!



