The way the article presents this is misleading. The attention mechanism builds a new vector as a linear combination of other vectors, but after the first layer these have also all been altered by passing through a transformer layer so it makes less sense to talk about "other tokens" in most cases (it becomes increasingly inaccurate the deeper into the model you go). It's also not really moving closer so much as adding, and what it's adding isn't the embedding-derived-vector but a transform of the embedding-derived-vector after it's been projected into a lower-dimensional-space for that attention head.
It would be more accurate to say that it's integrating information stored in other vectors-derived-from-token-embeddings-at-some-point (which can also entail erasing information)
It depends on the values of the vectors. (4, 4) + (3, 3) results in a new vector (7, 7) which is further away from both contributing vectors than either one was to each other originally. Additionally, negative coefficients are a thing.
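To make that arithmetic concrete, here is a small sketch in plain Python. The helper name `dist` and the choice of coefficients in the last line are only for illustration.

```python
import math

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

a, b = (4.0, 4.0), (3.0, 3.0)

# Plain addition of the two vectors.
summed = tuple(x + y for x, y in zip(a, b))  # (7.0, 7.0)

print(dist(a, b))       # distance between the two original vectors
print(dist(summed, a))  # distance from the sum to a
print(dist(summed, b))  # distance from the sum to b

# A linear combination with a negative coefficient: 1.0 * a + (-1.0) * b.
mixed = tuple(1.0 * x - 1.0 * y for x, y in zip(a, b))  # (1.0, 1.0)
print(mixed)
```

The sum ends up farther from both inputs than they were from each other, so "moving closer" is not what addition does in general.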
You still have one vector per token; that's what they meant. Also, the vector associated with each token will ultimately be used to predict the next token, which again shows that it makes sense to talk about other tokens even though they're being transformed inside the model.
Prediction happens at the very end (sometimes functionally earlier, but not always) - most of what happens in the model can be thought of as collecting information in vectors-derived-from-token-embeddings, performing operations on those vectors, and then repeating this process a bunch of times until at some point it results in a meaningful token prediction.
It's pedagogically unfortunate that the residual stream is in the same space as the token embeddings, because it obscures how the residual stream is used as a kind of general compressed-information conduit through the model, one that attention heads read different information from and write different information to in order to enable the eventual prediction task.
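A rough sketch of what "read from and write to the residual stream" can look like for a single causal attention head, in plain Python. Everything here is an assumption for illustration: the toy sizes, the random weights `W_q`, `W_k`, `W_v`, `W_o`, and the helper names. It leaves out layer norm, multiple heads and the feedforward block.

```python
import math
import random

random.seed(0)
d_model, d_head, n_tokens = 4, 2, 3  # toy sizes, chosen arbitrarily

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(x, W):
    """Row vector x (length = rows of W) times matrix W (rows x cols)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights for one head: project down to d_head, then back up.
W_q, W_k, W_v = (rand_matrix(d_model, d_head) for _ in range(3))
W_o = rand_matrix(d_head, d_model)

# Residual stream: one d_model-sized vector per token position.
residual = rand_matrix(n_tokens, d_model)

def attention_head(stream):
    q = [matvec(x, W_q) for x in stream]  # "read" projections
    k = [matvec(x, W_k) for x in stream]
    v = [matvec(x, W_v) for x in stream]
    out = []
    for t in range(len(stream)):
        # Causal: position t only looks at positions 0..t.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d_head)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Weighted sum of low-dimensional value vectors, not of the embeddings.
        mixed = [sum(w * v[s][j] for s, w in enumerate(weights))
                 for j in range(d_head)]
        out.append(matvec(mixed, W_o))  # "write" projection back to d_model
    return out

head_out = attention_head(residual)

# The head's output is added to the stream; it does not replace it.
new_residual = [[r + h for r, h in zip(res_t, out_t)]
                for res_t, out_t in zip(residual, head_out)]
```

Note that what gets added at each position is a projected mix of value vectors, which is why "adding" describes it better than "moving closer".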
Skimming it, there are a few things about this explanation that rub me just slightly the wrong way.
1. Calling the input token sequence a "command". It probably only makes sense to think of this as a "command" on a model that's been fine-tuned to treat it as such.
2. Skipping over BPE as part of tokenization - but almost every transformer explainer does this, I guess.
3. Describing transformers as using a "word embedding". I'm actually not aware of any transformers that use actual word embeddings, except the ones that incidentally fall out of other tokenization approaches sometimes.
4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
5. "what attention does is it moves the words in a sentence (or piece of text) closer in the word embedding" No, that's just incorrect.
6. You don't actually need a softmax layer at the end, since here they're just picking the top token and they can just do that pre-softmax since it won't change. It's also weird how they talked about this here when the most prominent use of softmax in transformers is actually in the attention component.
7. Really shortchanges the feedforward component. It may be simple, but it's really important to making the whole thing work.
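On point 6, a quick check in plain Python that picking the top token gives the same answer before and after softmax, since softmax preserves the ordering of the scores. The logits are made-up numbers and `argmax` is just a helper for this sketch.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, -1.0, 0.5, 3.2]  # made-up scores over a 4-token vocabulary
probs = softmax(logits)

# Same winner either way, so greedy decoding doesn't need the softmax.
print(argmax(logits), argmax(probs))
```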
> 4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
Worth noting that rotary position embeddings, used in many recent architectures (LLaMA, GPT-NeoX, ...), are very similar to the original sin/cos position embedding in the transformer paper but using complex multiplication instead of addition
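A toy illustration of the "complex multiplication" idea, in plain Python. This is a sketch of the principle on a single pair of features, not the code from any of those models: the frequency `theta` and the query/key values are invented, and a real implementation applies this to many feature pairs with different frequencies.

```python
import cmath

def rotate(pair, position, theta):
    """Rotate a 2-D feature pair (held as a complex number) by position * theta."""
    return pair * cmath.exp(1j * position * theta)

def score(q, k):
    """Dot product of two 2-D vectors represented as complex numbers."""
    return (q * k.conjugate()).real

theta = 0.1              # hypothetical frequency for this feature pair
q = complex(1.0, 2.0)    # made-up query features
k = complex(0.5, -1.0)   # made-up key features

# Same relative offset (3) at two different absolute positions.
s1 = score(rotate(q, 5, theta), rotate(k, 2, theta))
s2 = score(rotate(q, 13, theta), rotate(k, 10, theta))
print(s1, s2)  # these agree up to floating-point error
```

Because both vectors are rotated, the attention score ends up depending only on the difference between the two positions.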
One way to think of the positional embedding: in the same way you can hear two pieces of music overlaid on each other, you can add the vocab and pos embeddings together and the model is able to pick them apart.
If you asked yourself to identify when someone’s playing a high note or low note (pos embedding) and whether they’re playing Beethoven or Lady Gaga (vocab embedding) you could do it.
That’s why it’s additive and why it wouldn’t make much sense for it to be multiplicative.
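For reference, a minimal sketch of the additive scheme using the sin/cos formula from the original transformer paper, in plain Python. The tiny `d_model`, the `token_embedding` table and its values are invented for illustration; real models learn the token table and many learn the positions too.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sin/cos position vector: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

d_model = 4
# Hypothetical token embedding table with made-up values.
token_embedding = {
    "the": [0.1, -0.3, 0.7, 0.2],
    "cat": [0.5, 0.4, -0.6, 0.0],
}

tokens = ["the", "cat"]
# Input to the first layer: token vector plus position vector, elementwise.
inputs = [
    [t + p for t, p in zip(token_embedding[tok], sinusoidal_position(pos, d_model))]
    for pos, tok in enumerate(tokens)
]
print(inputs)
```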
> Transformer block: Guesses the next word. It is formed by an attention block and a feedforward block.
But the diagram shows transformer blocks chained in sequence. So the next transformer block in the sequence would only receive a single word as the input? Does not make sense.
Before going and digging into these, could you also explain what the necessary background is for this stuff to be meaningful?
In spite of having done a decent amount with neural networks, I'm a bit lost at how we suddenly got to what we're seeing now. It would be really helpful to understand the progression of things because I stepped away from this stuff for maybe 2 years and we seem to have crossed an ocean in the intervening time.
Selecting the likeliest token is only one of many sampling options, and it's extremely poor for most tasks, more so when you consider the relationships between multiple executions of the model. _Some_ (not necessarily softmax) probability renormalization trained into the model is essential for a lot of techniques.
To expand on this, one of the most common tricks is Nucleus sampling. Roughly, you zero out the lowest probabilities such that the remaining sum to just above some threshold you decide (often around 80%).
The idea is that this is more general than eg changing the temperature of the softmax, or using top-k where you just keep the k most probable outcomes.
Note that if you do Nucleus sampling (aka top-p) with the threshold p=0% you just pick the maximum likelihood estimate.
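A small sketch of the three knobs mentioned here (temperature, top-k, nucleus/top-p), in plain Python. The function names and the example logits are my own; libraries differ in details such as tie handling and whether the token that crosses the threshold is kept (it is kept here).

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out everything not in `keep`, then rescale so the rest sums to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores over a 4-token vocabulary
probs = softmax(logits)
rng = random.Random(0)

nucleus = top_p_filter(probs, 0.8)  # keeps the smallest set reaching 80%
greedy = top_p_filter(probs, 0.0)   # p = 0 keeps only the top token
print(nucleus, greedy, sample(nucleus, rng))
```

With p = 0 the loop stops after the first token, which is the maximum-likelihood pick described above.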
That's true, but they didn't go into any other applications in this explainer and were presenting it strictly as a next-word-predictor. If they are going to include final softmax, they should explain why it's useful. It would be improved by being simpler (skip softmax) or more comprehensive (present a use case for softmax), but complexity without reason is bad pedagogy.
When I first tried to understand transformers, I superficially understood most material, but I always felt that I did not really get it on a "I am able to build it and I understand why I am doing it" level. I struggled to get my fingers on what exactly I did not understand. I read the original paper, blog posts, and watched more videos than I care to admit.
https://karpathy.ai/zero-to-hero.html If you want a deeper understanding of transformers and how they fit in the whole picture of deep learning, this series is far and away the best resource I found. Karpathy goes into transformers by the sixth lecture, the previous lectures give a lot more context how deep learning works.
I agree that Karpathy's YouTube video is an excellent resource for understanding Transformers from scratch. It provides a hands-on experience that can be particularly helpful for those who want to implement the models themselves. Here's the link to the video titled "Let's build GPT: from scratch, in code, spelled out": https://youtu.be/kCc8FmEb1nY
Additionally, for more comprehensive resources on Transformers, you may find these resources useful:
I endorse all of this and will further endorse (probably as a follow-up once one has a basic grasp) "A Mathematical Framework for Transformer Circuits" which builds a lot of really useful ideas for understanding how and why transformers work and how to start getting a grasp on treating them as something other than magical black boxes.
"This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (not results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models."
This hour-long MIT lecture is very good, it builds from the ground up until transformers. MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://youtube.com/watch?v=ySEx_Bqxvvo
The uploads of the 2023 MIT 6.S191 course from Alexander Amini (et alii) are ongoing, appearing periodically since mid March. (They published the lesson about Reinforcement Learning yesterday.)
The original paper is very good but I would argue it's not well optimized for pedagogy. Among other things, it's targeting a very specific application (translation) and in doing so adopts a more complicated architecture than most cutting-edge models actually use (encoder-decoder instead of just one or the other). The writers of the paper probably didn't realize they were writing a foundational document at the time. It's good for understanding how certain conventions developed and important historically - but as someone who did read it as an intro to transformers, in retrospect I would have gone with other resources (e.g. "The Illustrated Transformer").
I know we don't have access to the details at OpenAI - but it does seem like there have been significant changes to the BPE token size over time. It seems there is a push towards much larger tokens than the previous ~3 char tokens (at least by behavior)
BPE is not set to a certain length, but a target vocabulary size. It starts with bytes (or characters) as the basic unit in which everything is split up and merges units iteratively (choosing the most frequent pairing) until the vocab size is reached. Even 'old' BPE models contain plenty of full tokens. E.g. RoBERTa:
(You have to scroll down a bit to get to the larger merges and imagine the lines without the spaces, which is what a string would look like after a merge.)
I recently did some statistics. Average number of pieces per token (sampled on fairly large data, these are all models that use BBPE):
RoBERTa base (English): 1.08
RobBERT (Dutch): 1.21
roberta-base-ca-v2 (Catalan): 1.12
ukr-models/xlm-roberta-base-uk (Ukrainian): 1.68
In all these cases, the median token length in pieces was 1.
(Note: I am not disputing that newer OpenAI models use a larger vocab. I just want to show that older BBPE models didn't use 3 char pieces. They were 1 piece per token for most tokens.)
As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algo, this Huggingface course does a nice job of explaining it [2]. Plus the original paper has a very readable Python example [3].
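To show how little there is to it, here is a stripped-down sketch of BPE training in plain Python. It is my own simplification, not the code from the paper or the course: it works on characters rather than bytes, has no end-of-word marker, and the word counts are a toy corpus.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged unit."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_counts, vocab_size):
    # Start with every word split into single characters.
    words = {tuple(word): count for word, count in word_counts.items()}
    vocab = {symbol for word in words for symbol in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        vocab.add(best[0] + best[1])
        words = {merge_pair(symbols, best): count for symbols, count in words.items()}
    return vocab, merges

# Toy corpus: word -> frequency (made-up counts).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab, merges = train_bpe(corpus, vocab_size=12)
print(merges)
```

The only stopping condition is the target vocab size; token length falls out of whichever pairs happen to be frequent in the data.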
I agree except for (6). A language model assigns probabilities to sequences. The model needs normalised distributions, eg using a softmax, so that’s the right way of thinking about it.
This is true in general but not in the use case they presented. If they had explained why a normalized distribution is useful it would have made sense - but they just describe this as pick-the-top-answer next-word predictor, which makes the softmax superfluous.
I think we're on the verge of few-shot believable voice impersonation. Between that, realtime deepfake videos, and AIs being more than good enough to solve CAPTCHAs, it seems like we're at most a few years from having no means of verifying a human on the other end of any given digital communication unless someone figures out and implements a new solution quickly.
There are (mostly) solutions but a lot of people won't like them. As with things today like notarized signatures or just transacting in person, they basically depend on some sort of in-person attestation by a reliable authority. Of course, that means adding a lot more friction to certain types of transactions.
I can see how that might destroy many business models. But from the top of my head I can't come up with any whose loss would have a dramatic negative effect on my wellbeing. Could someone elaborate why I should be worried?
Why would passwords, personal devices, policed platforms etc. fail as an authentication method between known counterparties? Between unknown counterparties the issue is much bigger than just about being a human or not.
It does make it kind of hard to verify someone's identity.
That said I think trying to verify someone's identity through online means only became viable a few years ago when everyone had a somewhat working camera and microphone available, and with any luck the risk of deepfakes will cause an early end to the scourge of people trying to film themselves holding a form of ID.
During COVID brokerages did start allowing online transactions for certain things they didn't used to. However, at least my brokerage has reverted to requiring either a direct or indirect in-person presence.
If a common-sense LLM is listening to grandma's calls (privacy alarms going off but hear me out), it can stop her from wiring her life savings to an untrustworthy destination, without having seen that particular scam before.
Once we can run our own personal, private LLMs it will definitely open up a world of possibilities.
Actually applications like this will probably be implemented on the cloud-based models, since 98% of the public does not care about privacy as much as people on this forum.
It will open up a whole new category of vulnerabilities of the 'how to fool the LLM while convincing granny' type as well. Then there is the liability question, i.e. if the LLM is lured by one of those and granny sends her money to Nigeria or the punks around the corner - take your pick - then is the LLM vendor liable for (part of the) loss? In this it may start to resemble the similar conundrum in self-driving vehicles where a nearly-perfect but sometimes easily fooled self-driving system will lull drivers into a false sense of security since the system has never failed - until it did not see that broken car standing in the middle of the road and slammed right into it. When granny comes to rely on the robot voice telling her what is suspect and what is not she may end up trusting the thing over her own better judgement just like the driver who dozed off behind the wheel of the crashed self-driving vehicle did.
It is not the fact that the poster writes «Done.» where actually and comically it's not "done" at all in said proposal,
nor the other possible point that Statistical Large Language Models are not problem solvers, as in fact are special in Machine Learning for optimizing for goals transversal to actual "solutions" (not to mention that it is already a Sam Altman, while proud of the results («They are not laughing now, are they»), the first to jump into alarm when people drool "So it's AGI!?" on him),
but it must be noted that those scams happen because people live lightheartedly their responsibilities (among which, realizing that they are not in the XVIII century anymore) - and the post has a tint of this dangerous laid back approach.
No prob on this side C., I just took the occasion of your input for substantiveness.
(I would suggest that you mark sarcasm, e.g. with '/S' or equivalent. Once upon a time we all thought that rhetoric is immediately recognizable, then we met people who would believe ideas beyond the boundary of "anything".)
When a dangerous exploit is discovered, best practice is to give good actors playing defense a heads up and time to prepare before publicly revealing the exploit, and preferably only revealing the exploit once critical systems have been hardened against it.
Cheap general AI is a potential exploit of everything everywhere all the time. No one is prepared for it.
A billion eyes are powerful. Giving every actor an eye-making machine is uncharted territory. And opening it up is a one-way gate - you can always delay opening it up another day, but you can't un-open it once you've let it out. So you should be very, very certain you're prepared to commit yourself, everyone you know, and the entire human race to living in that world before you make it so.
The Ghostwriter expansion pack was next fucking level. Clues hidden in documents to solve a mystery - one of my favorite early childhood software experiences.