
It seems he brushes over the positional encoding, which for me was the most puzzling part of transformers. The way I understood it, positional encoding is much like dates. Just like dates, there are repeating minutes, hours, days, months... etc. Each of these values has a shorter 'wavelength' than the next. The values are then used to identify the position of each token. Like, 'oh, I'm seeing January 5th tokens. I'm January 4th. This means this is after me'. Of course the real positional encoding is much smoother and doesn't have abrupt ends like dates/times, but I think this was the original motivation for positional encodings.
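The multi-wavelength idea maps directly onto the sinusoidal encoding from the original "Attention Is All You Need" paper: each pair of dimensions oscillates at a different frequency, from fast ("minutes") to slow ("months"). A minimal NumPy sketch (function name and shapes are my own choices):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: one row per position, one
    sinusoid pair per frequency. Wavelengths form a geometric
    progression from 2*pi up to 10000*2*pi."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2) even dims
    angles = pos / (10000 ** (i / d_model))  # lower i -> faster oscillation
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims: sine
    pe[:, 1::2] = np.cos(angles)             # odd dims: cosine
    return pe
```

Because every value stays in [-1, 1], the encoding can simply be added to the token embeddings without swamping them.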


That's one way to think about it.

It's a clever way to encode "position in sequence" as a smooth signal that can be added to each input vector. You might appreciate this detailed explanation: https://towardsdatascience.com/master-positional-encoding-pa...

Incidentally, you can encode dates (e.g. day of week) in a model as sin(day of week) and cos(day of week) to ensure that "day 7" is mathematically adjacent to "day 1".
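A quick sketch of that cyclic trick with the standard library (day numbering 1–7 is my assumption):

```python
import math

def encode_day_of_week(day):
    """Map day 1..7 onto the unit circle so the cycle wraps:
    day 7 lands right next to day 1."""
    angle = 2 * math.pi * day / 7
    return math.sin(angle), math.cos(angle)

def dist(a, b):
    """Euclidean distance between two (sin, cos) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])
```

With a plain integer feature, days 7 and 1 are six units apart; on the circle, day 7 is closer to day 1 than to mid-week days, which is usually what you want a model to learn.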



