When I first tried to understand transformers, I superficially understood most material, but I always felt that I did not really get it on an "I am able to build it and I understand why I am doing it" level. I struggled to put my finger on what exactly I did not understand. I read the original paper, blog posts, and watched more videos than I care to admit.
https://karpathy.ai/zero-to-hero.html If you want a deeper understanding of transformers and how they fit into the whole picture of deep learning, this series is far and away the best resource I found. Karpathy gets to transformers by the sixth lecture; the earlier lectures give a lot more context on how deep learning works.
I agree that Karpathy's YouTube video is an excellent resource for understanding Transformers from scratch. It provides a hands-on experience that can be particularly helpful for those who want to implement the models themselves. Here's the link to the video titled "Let's build GPT: from scratch, in code, spelled out": https://youtu.be/kCc8FmEb1nY
Additionally, you may find these more comprehensive resources on Transformers useful:
I endorse all of this and will further endorse (probably as a follow-up once one has a basic grasp) "A Mathematical Framework for Transformer Circuits", which builds a lot of really useful ideas for understanding how and why transformers work, and for starting to treat them as something other than magical black boxes.
"This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (not results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models."
This hour-long MIT lecture is very good; it builds from the ground up to transformers. MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://youtube.com/watch?v=ySEx_Bqxvvo
Uploads of the 2023 MIT 6.S191 course from Alexander Amini (et al.) are ongoing, appearing periodically since mid-March. (They published the lecture on Reinforcement Learning yesterday.)
The original paper is very good, but I would argue it's not well optimized for pedagogy. Among other things, it targets a very specific application (translation) and in doing so adopts a more complicated architecture than most cutting-edge models actually use (encoder-decoder instead of just one or the other). The authors probably didn't realize they were writing a foundational document at the time. It's good for understanding how certain conventions developed and important historically - but as someone who did read it as an intro to transformers, in retrospect I would have gone with other resources first (e.g. "The Illustrated Transformer").
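To make the encoder-decoder vs. decoder-only distinction concrete, here's a minimal PyTorch sketch. It's my own illustration, not from the paper or any of the linked resources; the layer counts, dimensions, and use of built-in nn.Transformer modules are arbitrary choices for brevity:

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 7

# Encoder-decoder layout, as in the original "Attention Is All You Need" paper:
# the decoder self-attends over the target and cross-attends to the encoder's
# representation of the source sentence (natural for translation).
enc_dec = nn.Transformer(d_model=d_model, nhead=n_heads,
                         num_encoder_layers=2, num_decoder_layers=2,
                         batch_first=True)
src = torch.randn(1, 10, d_model)        # embedded source sentence
tgt = torch.randn(1, seq_len, d_model)   # embedded target tokens so far
out = enc_dec(src, tgt)                  # -> shape (1, 7, 64)

# Decoder-only layout (GPT-style): a single stack of masked self-attention
# blocks, with no separate encoder and no cross-attention. Implemented here
# with TransformerEncoder plus a causal mask, which is the same computation.
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
decoder_only = nn.TransformerEncoder(block, num_layers=2)
tokens = torch.randn(1, seq_len, d_model)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
out = decoder_only(tokens, mask=causal_mask)   # -> shape (1, 7, 64)
```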