If developing intuition is the goal, I really do think Jay Alammar's "Illustrated Transformer" [1] is at least a step function better for that purpose than the academic paper itself.
(I totally realize this is subjective, but that has been my experience with my own learning in the space over the last few years, as well as with some folks I've mentored.)
[1] http://jalammar.github.io/illustrated-transformer/