Transformers Are Graph Neural Networks (thegradient.pub)
132 points by hughzhang on Sept 12, 2020 | 25 comments



Better explained by Chaitanya Joshi (while at the NTU Graph Deep Learning Lab)

here: https://threadreaderapp.com/thread/1233220586358181888.html

and here: https://graphdeeplearning.github.io/post/transformers-are-gn...


I appreciate the effort the authors put into this post, but this is like saying DNNs are stacked logistic regression: the connection is superficial, and doesn't lead to deep insights about how they really work.


> deep insights about how they really work

It's not about "how they really work", but what data they operate on and what problems they can be applied to. When I first heard the term "transformer" from a friend, I didn't have any association in my mind because it's a very opaque term, but once he explained it to me as Graph Neural Networks, it very quickly clicked.


I'm genuinely a bit surprised by that; that was always my high-level understanding of what the essence of neural networks was (at least vanilla feedforward ones). Would you care to elaborate?


I think this post means the same as this tweet:

> Transformers are a special case of Graph Neural Networks. This may be obvious to some.

https://twitter.com/OriolVinyalsML/status/123378359362695168...


It depends on what kind of understanding you want to achieve. It can be helpful to think of DNNs as approximating the corresponding infinitely-wide versions. Depending on how you deal with certain scaling, they then act like a linear filter of the error signal in function space, or, for single-hidden-layer networks at least, an interacting particle system. In both cases you can understand the convergence of gradient descent training using these analogies, although gaps from real-world practice exist.


To get the neural network to do something useful you need to formalize some way of training it too?


While they can be thought of as stacked regression, it's only logistic regression with one particular non-linearity. And for many non-linearities you'll have a hard time usefully interpreting them as a regression.


I think the common ones have statistical interpretations (that predate deep learning by a lot). Perhaps the one for the rectified linear unit is pretty obscure. But as I understand it, the statistics concept is called the "Tobit" model. Its meaning is not so obscure though: just a prediction that can only be non-negative, which is pretty common for quantities like mass or energy.


Same here, I’m not sure what he’s trying to say.


ML publication is a complete mess right now.

Anyone can claim anything as long as they do a write-up and include some equations and pretty plots.

It was hard enough 5 years ago to filter out a handful of good papers from the sea of bad research. Now it's getting near impossible.


Somebody should train a model to do it.


You mean like arxiv-sanity? As I understand it, it trains an SVM on papers you like to suggest papers that are on the same side of the hyperplane. It could be used as a quality classifier by only liking high-quality work.
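
Roughly, the idea would be something like this (a minimal sketch with assumed details -- tf-idf features over abstracts and scikit-learn; arxiv-sanity's actual code may differ):

```python
# Sketch of an arxiv-sanity-style ranker (assumed details): fit a linear SVM on
# tf-idf features of abstracts you've "liked" vs. everything else, then rank new
# papers by their signed distance from the hyperplane.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

abstracts = ["graph neural networks for ...", "we propose yet another ...", "..."]  # hypothetical corpus
liked = [1, 0, 0]  # 1 = papers you marked as high quality

vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
X = vec.fit_transform(abstracts)

clf = LinearSVC(C=0.1)
clf.fit(X, liked)

# Score unseen papers: larger decision value = same side of the hyperplane as "liked".
new_papers = ["a transformer is a gnn on a complete graph ..."]
scores = clf.decision_function(vec.transform(new_papers))
print(sorted(zip(scores, new_papers), reverse=True))
```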


It's interesting watching these attempts to understand Transformers. Are they graph networks? Are they Hopfield networks? Are they convolutions?

There was this research: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreti...

Turns out, GPT-like architectures appear to use the same representation throughout all the layers. So you can use the final head layer as a lens to see what words the network is thinking of as it ... "thinks". That's a bit contrary to what I imagined a GPT-like architecture was doing. I would have assumed that its embedding of ideas changed throughout the network, only reaching a sensible embedding near the end. At least from the perspective of the head layer, that doesn't appear to be the case.
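
In code, the lens amounts to something like this (a rough sketch using HuggingFace's GPT-2, not the linked post's exact code):

```python
# "Logit lens" idea: project each layer's hidden states through the *final*
# unembedding to see which tokens the model is "thinking of" mid-network.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):            # one entry per layer (plus the embeddings)
    logits = model.lm_head(model.transformer.ln_f(h))     # reuse the final head as a "lens"
    top = logits[0, -1].argmax().item()                   # most likely next token at this depth
    print(layer, tok.decode([top]))
```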

Facebook's research demonstrates that the MLP layer can be dropped in favor of more attention over learned knowledge vectors: https://ai.facebook.com/blog/making-transformer-networks-sim...

> Reading new Transformer papers makes me feel that training these models requires something akin to black magic when determining the best learning rate schedule, warmup strategy and decay settings.

I found GPT-*'s schedule straightforward. In fact, most training schedules these days are "boring": Adam with warmup and either a linear or cosine decay. CNNs have been doing that for a while, and the hyperparameters now are fairly robust. If you get within an order of magnitude you'll land within a few percent of optimal accuracy.
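
Concretely, the whole "boring" schedule fits in a few lines (a sketch; the step counts are made up):

```python
# Adam with linear warmup followed by cosine decay, as described above.
import math
import torch

model = torch.nn.Linear(10, 10)                    # stand-in for a real model
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 2000, 100_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                       # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))            # cosine decay to 0

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# In the training loop: opt.step(); sched.step()
```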

The OpenAI paper Scaling Laws for Neural Language Models does a good job of exploring the hyperparameters of GPT-like networks. It's a fascinating read.

That paper suggests another thing about Transformers that we don't understand. Beyond some minimums, the layer count, embedding size, and number of attention heads _aren't important_. The most important factor in the performance of a model is simply the number of parameters.

That's quite unusual, as the classic intuition is that adding more layers to a model improves performance. Yet for GPT-like architectures that isn't the case. You can get the same performance gains by just increasing the embedding size. 4 heads? 12 heads? Doesn't matter. Weird.
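
A quick back-of-the-envelope illustrates this, using the paper's approximate non-embedding parameter count N ≈ 12 · n_layer · d_model² (the shapes below are made up):

```python
# Very different shapes can have nearly the same parameter count N, and the
# scaling-laws result says N is what dominates the loss.
def approx_params(n_layer, d_model):
    return 12 * n_layer * d_model ** 2

shapes = [(12, 1600), (24, 1131), (48, 800)]     # (layers, embedding size), hypothetical
for n_layer, d_model in shapes:
    print(n_layer, d_model, f"{approx_params(n_layer, d_model)/1e6:.0f}M params")
# All three land around ~369M parameters, so the scaling-law fit predicts
# roughly similar performance despite the very different depths and widths.
```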

> Initially introduced for machine translation, Transformers have gradually replaced RNNs in mainstream NLP. The architecture takes a fresh approach to representation learning: Doing away with recurrence entirely

The more I study Transformers, the more I suspect that their success has more to do with our utter failure to train RNNs. In theory, RNNs have infinite attention. In practice, the only tool we have for optimizing models is backprop, and so to train an RNN we have to use BPTT. This de facto creates a learning horizon. There is absolutely no training signal that tells an RNN to remember something beyond the BPTT horizon. So why would it?

Our training loop for RNNs involves giving them, for example, 1024 tokens, running them for 1024 iterations, and computing loss on 1024 predictions.

A Transformer's training is nearly identical. They get 1024 tokens of context, make 1024 predictions, and compute loss.

The difference? Let's consider a scenario where it's the last prediction, and that prediction should be a copy of the first word in the sequence. Very simple, right? Yet for an RNN to learn how to do that, it needs to backprop the loss through 1024 iterations of its model. Even for a simple 1-layer RNN, that makes it look like a 1024-layer model. Vanishing gradients turned up to 11!!

For a Transformer this is dead simple. Its attention mechanism allows the last column of the Transformer to directly use knowledge from the entire sequence. Not only can it see the first token in the sequence, it can see all the computations it previously performed on the sequence, all at once. All with easy backprop. No vanishing gradients here.
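
A toy experiment makes the contrast concrete (sizes made up, not a benchmark): compare how much gradient from the last position reaches the first input for an unrolled RNN versus a single self-attention layer.

```python
# Gradient of the last position's output w.r.t. the first input, RNN vs. attention.
import torch

T, d = 1024, 64
x = torch.randn(1, T, d, requires_grad=True)

# RNN path: the signal must flow back through T recurrent steps.
rnn = torch.nn.RNN(d, d, batch_first=True)
out, _ = rnn(x)
out[0, -1].sum().backward()
print("RNN grad norm at t=0:      ", x.grad[0, 0].norm().item())

x.grad = None

# Attention path: the last position attends to the first one directly.
attn = torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)
y, _ = attn(x, x, x)
y[0, -1].sum().backward()
print("Attention grad norm at t=0:", x.grad[0, 0].norm().item())
```

With a vanilla tanh RNN at default initialization, the gradient at t=0 typically vanishes to essentially zero; the attention layer's does not, because the path length is one.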

So is it really any surprise that Transformers have dominated RNNs? Given the exact same compute, memory, parameters, etc., Transformers make much more efficient use of the resources (during training).

To put it another way, a Transformer is exactly like an RNN. It just uses attention to access history rather than recurrence. And we know that backprop and recurrence are incompatible. So Transformers win.

Of course, that's a huge problem. We have a temporary win. A HUGE win. But Transformers haven't solved what RNNs were meant to solve. By the end of a book, human brains can remember things from the beginning of the book. RNNs trained with BPTT cannot. Transformers cannot. Even Transformers with linear attention mechanisms cannot. There's another leap here left to do.

We need some kind of long term memory mechanism. A big Transformer model can probably approximate the active parts of a human brain. The image-gpt model and papers using Transformers in place of CNNs for classification tasks have shown that Transformers are a generic substrate. But the big missing puzzle piece is long term memory. Give a Transformer some method to query a bank of memory and I think we're going to see that next big leap. Plus, with access to a memory bank, a Transformer can free up all the resources it's using today to behave _like_ long term memory and instead use them for more "thinking".

Yet we have absolutely no mechanisms available to us today to build such a thing. To teach any kind of model to act as long term memory requires some way to show it an example from its distant past, see how well it remembers it, and then ... backprop that. But we can't backprop it, because we can't backprop across an entire book, let alone WikiText or WebText2.

We're really back at square one. We need some way to make the theory of RNNs a reality. Transformers haven't bought us that. They've only bought us a short-term cheat in performance.

EDIT: Just so it's clear, none of my comment is meant as a criticism of the linked article. I actually thought the OP was great and sources a lot of research in trying to understand the mechanisms of Transformers. Really I just springboarded off the article to dump some of my own musings.


With GPT3 you can give it a bit more long term memory by priming it with text that has "self commentary" written and repeated through each paragraph.

[This is a post to Hacker news and I'm making a point to explain a gimmick for giving GPT3 self generated longer term 'memory'.]

Most obvious forms of memory have the problem that they aren't differentiable, so you can't train with them in place. This idea works around the issue because English text contains things like running commentary at times, so a model trained on it already has some idea of how to use it.

[This is a post to Hacker news and I'm making a point to explain a gimmick for giving GPT3 self generated longer term 'memory' and the limitations of other approaches.]

I've had some success at getting this to help generate better text. I wonder though if it would be effective to generate a new training corpus this way. E.g. get GPT3 to generate annotations for arbitrary input text using some summarization prompt, then use that to augment the entire training corpus with the summaries injected inline like virtual thought bubbles, with beginning and ending symbols that don't occur in the training material. Then the network is retrained on this augmented data and can generate its own prompts.
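
In sketch form (all names hypothetical; summarize() stands in for a real summarization prompt to the model):

```python
# Inject model-generated summaries inline as bracketed "thought bubbles",
# delimited by tokens that never occur in the original training text.
BEGIN, END = "<|mem|>", "<|/mem|>"

def summarize(paragraph: str) -> str:
    # Stand-in for a real summarization call to GPT-3 or a local model.
    return paragraph[:60] + "..."

def augment(document: str) -> str:
    out = []
    for para in document.split("\n\n"):
        out.append(para)
        out.append(f"{BEGIN} {summarize(para)} {END}")
    return "\n\n".join(out)

# The augmented corpus would then be used to retrain the model, which at
# inference time can emit its own <|mem|> annotations as running commentary.
print(augment("First paragraph of some book.\n\nSecond paragraph of the book."))
```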

Bonus: the operator could be given access to the otherwise normally hidden "internal monolog" text, to increase control over the output or understand more about the state of the model.

You can't differentiate across the different executions, due to sampling -- but perhaps you don't need to: it doesn't do any gradient descent to perform one-shot learning either.

I am guessing that this must not work at scale, because it's an obvious enough idea, and a similar approach for database access (e.g. have it generate keywords from the text, then inject tokens encoding some text-search results for those keywords into the stream, and skip over them in training, keeping them only as context; thus training a model that can use a search to improve its results) must have been tried, but I've never heard anyone report it working.


Memory is easy. It was proposed 20 years ago, but nobody bothered to translate the paper to English, because 20 years ago AI was a toy.

Memory is just association. When Foo is at the input, Memory must bring up Bar, Baz, etc., which are associated with Foo, as separate input. It's better if the kind of association (before, after, inside, together, opposite, same, etc.) is stored and retrieved by Memory too. Not a hard task to do by today's standards.

However, Long Term Memory is orthogonal to AI training. It's a kind of "self-attention" mechanism, because LTM needs to watch the _training process_, and then note what, when, and how to put input into LTM, and how to associate it with other things which are already in LTM. In short, LTM requires meta-training: watching a lot of training sessions to learn that. It will be hard to define a proper loss function for LTM, so it may be better to implement LTM as a simple non-AI algorithm first. IMHO, for LTM, the rate of training convergence can be used as the loss function for meta-training LTM itself.

BTW, LTM also needs a way to translate between input encodings, or a single input encoding must be used for all training runs.

PS.

Also, when bringing up associations (memories) for Foo, LTM can also bring up associations for Bar, Baz, etc. For example, LTM can bring up 10 direct (tier 1) associations for Foo, then 3 main tier 2 associations for Bar, Baz, etc., then 1 tier 3 association for each tier 2 association, and so on, up to e.g. 7 tiers. Beware, it can lead to an "inner monologue" of the machine. :-)
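
For concreteness, a toy, non-neural sketch of the kind of tiered associative lookup I mean (all names made up):

```python
# Each item is stored with typed associations; retrieval fans out through a few
# tiers, bringing back fewer associations per tier.
from collections import defaultdict

memory = defaultdict(list)   # item -> list of (relation, other_item)

def associate(a, relation, b):
    memory[a].append((relation, b))
    memory[b].append((relation, a))

def recall(item, fanout=(10, 3, 1)):
    """Bring up associations for `item`, then associations of those, and so on."""
    results, frontier = [], [item]
    for k in fanout:                         # one entry per tier
        next_frontier = []
        for it in frontier:
            hits = memory[it][:k]
            results.extend(hits)
            next_frontier.extend(b for _, b in hits)
        frontier = next_frontier
    return results

associate("Foo", "together", "Bar")
associate("Foo", "opposite", "Baz")
associate("Bar", "before", "Qux")
print(recall("Foo"))
```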


Actually we do have a few mechanisms for long term memory, like the Neural Turing Machine, which has explicit memory cells that the neural network can read and write. I think the only thing holding back NTMs is that they are not as computationally efficient as a fixed-size-context Transformer.


What's holding back NTMs is that they are hard to train, even worse than RNNs. They are not much less efficient than a Transformer. The Transformer has all the advantages of the NTM but is much easier to train.

Actually, the way I see it, the Transformer is a direct descendant of memory-based architectures (NTM, MemNet, stack-based RNNs...) that is both expressive and easy to train.
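
For reference, the content-based read of an NTM boils down to something like this (a rough sketch with made-up sizes; no write heads or location addressing), which also shows how close it is to a single attention head over a memory matrix:

```python
# Content-based addressing: compare a key against every memory row, softmax the
# similarities, and read a weighted sum of rows.
import torch
import torch.nn.functional as F

N, M = 128, 64                 # number of memory slots, slot width
memory = torch.randn(N, M)     # the external memory the network reads from
key = torch.randn(M)           # emitted by the controller
beta = 5.0                     # key strength (sharpness of the focus)

w = F.softmax(beta * F.cosine_similarity(memory, key.unsqueeze(0)), dim=0)
read = w @ memory              # (M,) read vector returned to the controller
print(read.shape)
```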


When I realized that what transformers do is transform input into output that becomes input again, I was amazed, but it makes sense. It's exactly like a Markov chain. Think of a snake eating itself. What's important is that the output is basically a probability distribution. You can post-process that output to get a concrete value, but you really want to put it back in and turn the wheel again.
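
In toy form, that loop looks like this (the model here is a stand-in, not a real Transformer):

```python
# The "snake eating itself": sample from the output distribution and feed the
# result straight back in as the next input.
import random

vocab = ["the", "cat", "sat", "on", "mat", "."]

def model(tokens):
    # Stand-in: returns a probability distribution over the vocabulary
    # given the context so far.
    return [1.0 / len(vocab)] * len(vocab)

context = ["the"]
for _ in range(10):
    probs = model(context)                               # output: a distribution
    nxt = random.choices(vocab, weights=probs, k=1)[0]   # post-process (sample) ...
    context.append(nxt)                                  # ... and turn the wheel again
print(" ".join(context))
```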

But you are right, they are trained on next-word prediction so there's no long term memory. I imagine people are working on transformers with a memory bank. But RNNs seem to be the brute force solution here... what I am guessing is that you need to maintain some kind of index to decide where to backprop. If it hasn't been discovered yet, I bet it will be some kind of Bloom filter.


One idea in the Hopfield Networks Is All You Need paper was that the softmax-based attention mechanism is equivalent to a Hopfield energy update, in which the attention keys are the Hopfield "memories". But the keys are produced as a transformation of the input, so it seems to me the Transformer does not actually store keys as "memories" the way a Hopfield network stores memories (as energy minima). Is this correct, or am I missing something about the paper?
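
For concreteness, the correspondence I understand the paper to be drawing is roughly this (my reading, not their code):

```python
# The modern Hopfield update xi_new = X^T softmax(beta * X xi) has the same form
# as one row of softmax(Q K^T / sqrt(d)) V with Q = xi, K = V = X, beta = 1/sqrt(d).
import torch
import torch.nn.functional as F

d, n = 16, 8
X = torch.randn(n, d)          # stored patterns / keys (one per row)
xi = torch.randn(d)            # state / query
beta = 1.0 / d ** 0.5

hopfield_update = F.softmax(beta * (X @ xi), dim=0) @ X
attention_row   = F.softmax((xi @ X.T) * beta, dim=0) @ X   # Q = xi, K = V = X
print(torch.allclose(hopfield_update, attention_row))        # True
```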


> There is absolutely no training signal that tells an RNN to remember something beyond the BPTT horizon. So why would it?

Because they generalize. Char-RNNs learn to balance parentheses separated by a longer distance than the BPTT window because they've learned that counting parentheses is useful for prediction, based on parenthetical statements shorter than the BPTT window.


Very well thought-out! We need more of this. Doesn't mean that all of your ideas are correct or don't make unnecessary assumptions, etc. But it shows a very clear thinking path that is easy to understand. You should write and publish more if you can!


You might be interested in RTRL (real time recurrent learning).


Attention really reminds me of focusing in proof theory https://ncatlab.org/nlab/show/focusing


I thought Transformers were “Robots in disguise”.




