Single Headed Attention RNN (arxiv.org)
212 points by spatters on Nov 27, 2019 | 37 comments



The writing style is amusing. :)

Some notes from a first glance:

* In the experiments, I see that he actually also uses the Single Headed Attention model with 4 heads, which kind of contradicts the name, doesn't it?

* The main motivation is performance (mostly training speed). So some absolute numbers, e.g. training time, would be nice to have in the comparisons. He mentions, for example, that the Adaptive Transformer can also be trained on a single GPU within hours, and in the comparison the Adaptive Transformer gets much better BPC (enwik8) and even uses slightly fewer parameters. So isn't the Adaptive Transformer better in every aspect (speed and BPC)? Or how does it compare in speed? As far as I remember, the Sparse Transformer is also more efficient (as it has sparsity), so the speed comparison would be interesting there as well. Or is the argument about inference speed? But then inference speed should be compared, shouldn't it?


I don't think that was his motivation, I think his motivation was stated quite clearly in the abstract:

> The author's lone goal is to show that the entire field might have evolved a different direction if we had instead been obsessed with a slightly different acronym and slightly different result.


Honestly, I wish all research papers were written this way. Easy to understand, kept me entertained, and presented meaningful results with a way to reproduce them (on a single GPU).

I grant that not every deep learning paper can be reproduced on a single GPU in a reasonable time, but it should happen more often IMO. It seems lazy to just toss out a paper saying "we hit new benchmarks by increasing the parameters and throwing more compute at it". I'd like to see "we hit new benchmarks with a new design; the old ones had this issue", etc.

Anyway, great read, recommend. Also, happy for the author haha

"The author has also moved to a one bedroom apartment in San Francisco, removing themselves from proximity to the alley of questionable odors and unsavory noises."


Now imagine reading papers is your job and you try to skim through dozens of wannabe stand-up comedians each day.


As opposed to wannabe tax lawyers? I'll take the comedians thank you very much.


Is dozens of papers per day how academics work? Holy crap.


Notice the word skim. Yes, when I'm trying to figure something out I often have 10 tabs with papers open and I'm flipping between them.


Depends. If the paper is dense you're taking hours per paper, rereading many times and rederiving details. You can skim through the many you don't want to dive into deeply, though (if it's a topic you just need a feel for).


The author is very intelligent and is doing three things differently from a 'standard' paper:

1) Reducing the density of information per paragraph (vs. packing information in)

2) Clearly outlining motivation and context (vs. just referencing some other papers and assuming they've been read)

3) Deploying comedy (vs. professionalism)

The first two improve the paper, the third is a step backwards because the reader has to spend effort separating fact & fiction. The combination in this case works and is a lot of fun but it would have been a catastrophic and cringeworthy exercise if the execution of (1) and (2) hadn't worked out so well.

The real trick here is excellent writing, and the comedy simply draws attention to it. Much like how an army marching in a silly way draws attention to its discipline. The silly march itself is not a good idea.


I'm not sure reducing the density of information is a good thing. It probably makes the paper easier to read for someone who is not entirely familiar with the field, but it makes it slower for the people the paper is written for. (Papers that are too dense are really hard to read, but that's relatively rare.)


For those who aren't familiar with the author: he previously worked at MetaMind / Salesforce Research doing NLP and has published many successful NLP papers [0]. He opted to write an informal paper for this project (similar to YOLOv3 [1]), but the work itself should still be taken seriously.

[0] https://scholar.google.com/citations?user=AolIi4QAAAAJ

[1] https://pjreddie.com/media/files/papers/YOLOv3.pdf


You just have to love Stephen Merity.

His work on QRNNs saved me quite a bit of time and money when I was doing my undergrad dissertation on language models.

This SHA-RNN seems to have surfaced from a similar line of thinking that spawned the QRNN.


Are QRNNs still used much?


Check out MultiFiT [0] from fastai; it uses QRNN for speed.

[0] http://nlp.fast.ai/classification/2019/09/10/multifit.html


The paper raises a great point about tokenization affecting perplexity: that we can't compare perplexities across different tokenizers, say BPE vs. word tokenization, even when re-normalizing to take token counts into account. This example nails it: https://twitter.com/Smerity/status/1192252147598909441


I don't see his point. Doesn't renormalizing for token counts essentially eliminate the effect of tokenization? The perplexity we then get is essentially representative of how well a model compresses the test document. Isn't that the whole point? A better model compresses the document better; how does it matter whether you model each character, each word, bigrams, or even the bits directly?

The main disadvantage of word-level models is the large vocabulary size; however, the tweet completely ignores the advantage: the sequence length becomes shorter, so the model only has to look a few tokens back to find the reference to "Bob" and "Alice".

The same model at word level writes more sensible sentences than at character level. There's a tradeoff between a larger vocabulary and modelling longer dependencies. A model which can encode a text document more effectively is better; tokenization is just part of the modelling. You just need to take care of the "per word" part of "perplexity per word" (i.e. the number of words), and then you can directly compare their performances.

The author is wrong that entropy collapses after the "A" of "Alice" is given. Entropy will only collapse if the model has really "understood" the context and modelled that "Bob" and "Alice" are the only options here. The entropy won't collapse for a sentencepiece-based bi-gram model, for example.

In his example, it is not clear that the wordpiece model is at an advantage. Suppose both models "understand" that there are two options, "Bob" and "Alice". Then the word-level model only has to predict one token, which can be either of the names: probability 0.5. The sentencepiece model also has to choose between two tokens, "B" and "A"; the second token won't add to the perplexity since it'll be known. Again probability 0.5, so the same perplexity after normalization.
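To make the normalization concrete, here's a minimal Python sketch (the bits_per_char helper and the numbers are made up for illustration, not taken from the paper) of converting perplexity-per-token into bits-per-character so models with different tokenizers can be compared on the same text:

    import math

    def bits_per_char(ppl_per_token: float, n_tokens: int, n_chars: int) -> float:
        """Convert perplexity-per-token into bits-per-character.

        The total negative log-likelihood of the text (in bits) is
        n_tokens * log2(ppl_per_token); dividing by the character count
        gives a number that no longer depends on the tokenizer.
        """
        return n_tokens * math.log2(ppl_per_token) / n_chars

    # Made-up numbers for the same held-out text under two tokenizations.
    n_chars = 1_000_000
    print(bits_per_char(ppl_per_token=60.0, n_tokens=180_000, n_chars=n_chars))  # word-level model
    print(bits_per_char(ppl_per_token=12.0, n_tokens=320_000, n_chars=n_chars))  # BPE-level model

Whichever model yields fewer total bits for the same characters is the better compressor, regardless of how it chopped the text up.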


Good point; assuming some extent of collapse is crucial, and the question is whether different perplexities due to tokenization can happen in principle. You are right that in the "Alice" vs. "A|lice" example we get the same perplexity after re-normalization; I can't come up with an example where it would be different right now.


I agree. Perplexities (the probability of a text) can be compared across different tokenizations after normalization.


Perhaps I am missing the point of this article. The RNN approach seems to get similar performance, but it uses more parameters and misses the parallelization benefits that Transformers have over recurrent networks.

What is the benefit of the RNN here?


The parallelism in a transformer doesn't necessarily translate to less or faster compute. Each layer has to be computed serially after the previous layer, and the computation of each attention head is quadratic in the length of the input sequence. When used this way for language modeling, the transformer also has to be run step-by-step at inference time; the parallelism that was a boon during training is no longer available.

The author doesn't do much absolute wall-time comparison, but he does mention that only the Adaptive Transformer configuration trained in a similar amount of time on a single GPU.
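To put numbers behind that, here's a toy NumPy sketch (hypothetical rnn_decode / attention_decode functions with made-up dimensions, not the paper's actual SHA-RNN) contrasting step-by-step generation: the recurrent decoder does a constant amount of work per emitted token, while the attention decoder re-attends over the whole cached prefix at every step, so generating n tokens costs O(n) versus O(n^2):

    import numpy as np

    d = 64  # hidden size, arbitrary for this toy comparison

    def rnn_decode(n_steps: int) -> np.ndarray:
        """Recurrent decoding: fixed-size state, constant work per token (O(n) total)."""
        W = np.random.randn(d, d) / np.sqrt(d)
        h = np.zeros(d)
        for _ in range(n_steps):
            h = np.tanh(W @ h)  # one matrix-vector product per step
        return h

    def attention_decode(n_steps: int) -> np.ndarray:
        """Attention decoding: each new token attends over the whole cached prefix (O(n^2) total)."""
        keys, values = [], []
        x = np.zeros(d)
        for t in range(n_steps):
            keys.append(np.random.randn(d))    # stand-ins for projected prefix states
            values.append(np.random.randn(d))
            scores = np.array([x @ k for k in keys])  # t + 1 dot products at step t
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            x = sum(w * v for w, v in zip(weights, values))
        return x

In practice key/value caching keeps the per-step attention cost linear in the prefix rather than quadratic, but the total work still grows with sequence length, while the RNN's state stays fixed-size.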


Another work goes in the opposite direction, introducing gating into Transformer-XL: https://arxiv.org/abs/1910.06764


Hilarious paper, I'm about to drop a SHA-RNN on my GPU to make it sweat.


Did anyone else read "SHA-RNN" as "SHHHAAAAARRRROOOOONNN" in Ozzy's voice?


Now that was some refreshing reading.


ok


Now I really want pop music made by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin


A dissenting voice from the positive reception here on HN: I thought that this paper was a joke. Single author, no affiliation, snarky language. Why not be civil instead?


> Single author, no affiliation, snarky language.

I'd say that all of these are factors that neither add to nor detract from the value of the paper itself; it's a "hey, I tried this and it works OK despite not going in the obvious direction" kind of paper. So, limited experiments, but IMO competently done and with usable information.

It's a pity that all papers nowadays have a gazillion authors, from well-funded research labs, with as-dry-as-possible language that hides the real research behind a "we knew this all along rather than figuring it out along the way" facade. OTOH that's what you get in a large fairly mature research field, where most competent people get hired by research labs and then do lots of collaborative research that scales well and subsequently need to show publication counts to secure further funding.


It's a shame that professionalism and showing personality are so at odds all over the place, from papers to the workplace. For the most part, professional has come to align with formal. It's clear why, but still sad :(


Why is it sad? The whole point of professionalism is disaffective communication.


While informal, I do not think his tone lacked civility.

I strongly prefer papers written in this style. Not only are they more enjoyable to read, but they are often easier to understand and more genuine as well. Papers written in a formal style often obscure the real motivation and instead provide a fancy-sounding retroactive justification. It makes the authors feel smarter, and I guess some readers feel smarter as well, but it belies the reality of research.


Language can be debated, I agree with you, but

What’s wrong with single authorship and no affiliation?

At the end of the day, if the paper proposes some idea or method and delivers on its stated claims (with reproducible code), then I don't care who wrote it, how many authors there were, or who the authors work for.


If you don't know: he's a relatively successful author (if you count citations), previously (it seems) at Salesforce Research.

He has worked on YOLO (computer vision) and NLP-related problems.

https://scholar.google.com/citations?user=AolIi4QAAAAJ


He hasn't worked on YOLO, only NLP. YOLO is another example of a well-known, successful researcher (Joseph Redmon) writing an informal paper.


I'll second this. The style is clunky and reads as though the author were trying too hard to make every sentence entertaining, which mostly detracts from the work. I know Stephen Merity is a serious researcher and the content here is legit given his body of work. But the style/prose in this preprint reminded me a lot of some of the garbage Siraj Raval peddled. Again, to reiterate, I am not commenting on the substance, only the style.


Because it is funny


What makes this uncivil to you?



