This is cool. I highly recommend Jay Alammar’s Illustrated Transformer series to anyone who wants to understand the different types of transformers and how self-attention works.
The math behind self-attention is also cool and easy to extend to variants such as dual attention.
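For anyone curious what that math looks like in practice, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy (the function and variable names are my own illustration, not taken from any particular library or from Alammar's posts):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:          (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project inputs to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # pairwise similarities, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)          # each row sums to 1 over the sequence
    return weights @ V                          # weighted sum of value vectors

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Extensions like dual attention, as I understand them, mostly reuse this same building block: you run a second attention branch (e.g. over feature channels as well as positions) and combine the two outputs.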