When I first tried to understand transformers, I superficially understood most material, but I always felt that I did not really get it on an "I am able to build it and I understand why I am doing it" level. I struggled to put my finger on what exactly I did not understand. I read the original paper, blog posts, and watched more videos than I care to admit.
https://karpathy.ai/zero-to-hero.html If you want a deeper understanding of transformers and how they fit into the whole picture of deep learning, this series is far and away the best resource I found. Karpathy gets to transformers by the sixth lecture; the earlier lectures give a lot more context on how deep learning works.
I agree that Karpathy's YouTube video is an excellent resource for understanding Transformers from scratch. It provides a hands-on experience that can be particularly helpful for those who want to implement the models themselves. Here's the link to the video titled "Let's build GPT: from scratch, in code, spelled out": https://youtu.be/kCc8FmEb1nY
Additionally, you may find these more comprehensive resources on Transformers useful:
I endorse all of this and will further endorse (probably as a follow-up once one has a basic grasp) "A Mathematical Framework for Transformer Circuits", which builds a lot of really useful ideas for understanding how and why transformers work, and for starting to treat them as something other than magical black boxes.
"This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (not results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models."
This hour-long MIT lecture is very good; it builds from the ground up to transformers. MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://youtube.com/watch?v=ySEx_Bqxvvo
Uploads of the 2023 MIT 6.S191 course from Alexander Amini (et al.) are ongoing, appearing periodically since mid-March. (They published the lecture on Reinforcement Learning yesterday.)
The original paper is very good, but I would argue it's not well optimized for pedagogy. Among other things, it targets a very specific application (translation) and in doing so adopts a more complicated architecture than most cutting-edge models actually use (encoder-decoder instead of just one or the other). The authors probably didn't realize they were writing a foundational document at the time. It's good for understanding how certain conventions developed and important historically - but as someone who did read it as an intro to transformers, in retrospect I would have gone with other resources first (e.g. "The Illustrated Transformer").
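To make the encoder-decoder vs. decoder-only distinction concrete, here's a minimal PyTorch sketch. It's my own illustration, not from the paper or any of the linked resources; the layer counts, dimensions, and use of built-in nn.Transformer modules are arbitrary choices for brevity:

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 7

# Encoder-decoder layout, as in the original "Attention Is All You Need" paper:
# the decoder self-attends over the target and cross-attends to the encoder's
# representation of the source sentence (natural for translation).
enc_dec = nn.Transformer(d_model=d_model, nhead=n_heads,
                         num_encoder_layers=2, num_decoder_layers=2,
                         batch_first=True)
src = torch.randn(1, 10, d_model)        # embedded source sentence
tgt = torch.randn(1, seq_len, d_model)   # embedded target tokens so far
out = enc_dec(src, tgt)                  # -> shape (1, 7, 64)

# Decoder-only layout (GPT-style): a single stack of masked self-attention
# blocks, with no separate encoder and no cross-attention. Implemented here
# with TransformerEncoder plus a causal mask, which is the same computation.
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
decoder_only = nn.TransformerEncoder(block, num_layers=2)
tokens = torch.randn(1, seq_len, d_model)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
out = decoder_only(tokens, mask=causal_mask)   # -> shape (1, 7, 64)
```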