Hacker News

I found this to be a great read on what I think the parent is referring to: https://lilianweng.github.io/lil-log/2018/06/24/attention-at...



Self-attention uses 1x1 convolutions, and while the original Transformer is fully-connected-only (I think?) the latest & greatest uses convolutions (https://arxiv.org/abs/1901.11117), and even if you're using the original, it's very often being used in conjunction with convolutions elsewhere. So you could argue that the real title would have to be 'The Unreasonable Effectiveness of Attention', but given all the other stuff convolutions have done without attention like WaveNet, I think it's currently fairer to go with convolutions than attention. But we'll see how it goes over the next 4 years...


A 1x1 convolution is such an edge case of convolution that it's really not worth discussing its inclusion as related to the success of the Transformer. Calling the Transformer "convolutions with attention" demonstrates a near-complete misunderstanding of the architecture.

There's a reason the Transformer's original paper is entitled "Attention is All You Need": it threw out all the previous structures people assumed were necessary for solving these problems (recurrence from RNNs, local transformations from convolutions) and just threw multiple layers of large multi-headed attention at the problem, and got even better results.
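For concreteness, the core operation the paper's title refers to is scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. A minimal numpy sketch (not the paper's code, just an illustration of the equation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention from "Attention is All You Need":
    # every query attends to every key, so there is no locality
    # constraint and no recurrence.
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

Q = np.random.randn(5, 8)    # 5 query positions, dim 8
K = np.random.randn(7, 8)    # 7 key positions, dim 8
V = np.random.randn(7, 16)   # values carry dim 16
out, w = attention(Q, K, V)  # out has shape (5, 16)
```

Multi-head attention just runs several of these in parallel on learned projections of the input and concatenates the results.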


> A 1x1 convolution is such an edge case of convolution that it's really not worth discussing its inclusion

It is, nevertheless, still a convolution, and calls to convolution code are how self-attention is implemented. Look inside a SAGAN or something and you'll see the conv2d calls.
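(To be fair to both sides here: a 1x1 convolution has no spatial extent, so it's mathematically just the same linear projection applied independently at every position, which is why frameworks can implement attention's Q/K/V projections with kernel-size-1 conv calls. A quick numpy check of that equivalence:)

```python
import numpy as np

def conv1x1(x, W):
    # x: (length, in_channels); W: (out_channels, in_channels).
    # A kernel of size 1 sees only one position at a time, so the
    # "convolution" reduces to a matrix multiply per position.
    return x @ W.T

x = np.random.randn(10, 64)   # 10 positions, 64 channels
W = np.random.randn(32, 64)   # project 64 -> 32 channels
out = conv1x1(x, W)

# Identical to applying the same dense layer at each position.
dense = np.stack([W @ x[i] for i in range(x.shape[0])])
assert np.allclose(out, dense)
```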

> Calling the Transformer "convolutions with attention" demonstrates a near-complete misunderstanding of the architecture.

You're reading that in an overly narrow way and imputing to me something I didn't mean. And again, the original Transformer may not be used in conjunction with convolutions, but it often is, and the best current variant uses convolutions internally and so involves convolutions no matter how you want to slice it. Attention is a powerful construct, but convolutions are pretty darn powerful too, it turns out, even outside images.


>Look inside a SAGAN or something and you'll see the conv2d calls.

...Yes, because SAGANs operate on images, so the foundational operation is a convolution.

>You're reading that in an overly narrow way and imputing to me something I didn't mean.

You characterized the Transformer as "convolutions with attention". You then attributed the success of Transformer-based models to "the non-locality & easy optimization of convolutions". The "SOTA for most (all?) sequence-related tasks" applies to the regular Transformer variants, not the Evolved Transformer, which was published about 5 days ago.

No one is denying that convolutions are useful across many domains. But no one seriously working in the domain of NLP would consider convolutions to be anywhere near the most novel or notable parts of the Transformer.

(In case you do want to look it up, OpenAI's GPT also uses character-level convolutions for its word embeddings. However, BERT does not.)


Interesting conversation. I would add that papers by LeCun and others have been using character-based convolutions on pure text since 2015 with great success. VDCNN is still a very good way to go for classification, and is much faster to train than RNNs due to effective parallelization.

On a side note, it's sad to see these conversations about SOTA deep learning be so adversarial... a you're-wrong / you're-right kinda thing. It's mostly an empirical science at the moment: surf the gradient, be right and wrong at the same time!


And convolution-based models still find use in all sorts of cool applications in language, such as: https://arxiv.org/abs/1805.04833

With regard to adversarial discussions: it's one thing to argue about whether method A or method B gives better results in a largely empirical and experimental field. But giving a very misleading characterization of a model is actively detrimental, especially when it would leave casual readers with the impression that the Transformer is a "convolution-based" model, a description no one in the field would use.


But no part of that Transformer section makes any reference to convolutions.



