Single Headed Attention RNN (arxiv.org)
212 points by spatters on Nov 27, 2019 | 37 comments



The writing style is amusing. :)

Some notes from a first glance:

* In the experiments, I see that he actually also uses the Single Headed Attention model with 4 heads, which kind of contradicts the name, doesn't it?

* The main motivation is performance (mostly training speed). So some absolute numbers, e.g. training time, would be nice to have in the comparisons. He mentions, for example, that the Adaptive Transformer can also be trained on a single GPU within hours, and in the comparison the Adaptive Transformer gets much better BPC (enwik8) and even uses slightly fewer parameters. So isn't the Adaptive Transformer better in every aspect (speed and BPC)? Or how does it compare in speed? As far as I remember, the Sparse Transformer is also more efficient (as it has sparsity), so the speed comparison would be interesting there as well. Or is the argument about inference speed? But then inference speed should be compared, shouldn't it?


I don't think that was his motivation, I think his motivation was stated quite clearly in the abstract:

> The author's lone goal is to show that the entire field might have evolved a different direction if we had instead been obsessed with a slightly different acronym and slightly different result.


Honestly, I wish all research papers were written this way. Easy to understand, kept me entertained, and presented meaningful results with a way to reproduce them (on a single GPU).

I grant that not every deep learning paper can be reproduced on a single GPU in a reasonable time, but it should happen more often IMO. It seems lazy to just toss out a paper saying "we hit new benchmarks by increasing the parameters and throwing more compute at it". I'd like to see "we hit new benchmarks with a new design; the old ones had this issue", etc.

Anyway, great read, recommend. Also, happy for the author haha

"The author has also moved to a one bedroom apartment in San Francisco, removing themselves from proximity to the alley of questionable odors and unsavory noises."


Now imagine reading papers is your job and you try to skim through dozens of wannabe stand-up comedians each day.


As opposed to wannabe tax lawyers? I'll take the comedians thank you very much.


Is dozens of papers per day how academics work? Holy crap.


Notice the word skim. Yes, when I'm trying to figure something out I often have 10 tabs with papers open and I'm flipping between them.


Depends. If the paper is dense you're taking hours per paper, rereading many times and rederiving details. You can skim through the many you don't want to dive into deeply, though (if it's a topic you just need a feel for).


The author is very intelligent and is doing three things differently from a 'standard' paper:

1) Reducing the density of information per paragraph (vs. packing information in)

2) Clearly outlining motivation and context (vs. just referencing some other papers and assuming they've been read)

3) Deploying comedy (vs. professionalism)

The first two improve the paper, the third is a step backwards because the reader has to spend effort separating fact & fiction. The combination in this case works and is a lot of fun but it would have been a catastrophic and cringeworthy exercise if the execution of (1) and (2) hadn't worked out so well.

The real trick here is excellent writing, and the comedy simply draws attention to it. Much like how an army marching in a silly way draws attention to its discipline. The silly march itself is not a good idea.


I'm not sure reducing the density of information is a good thing. It probably makes the paper easier to read for someone who is not entirely familiar with the field, but it makes it slower for the people the paper is written for. (Papers that are too dense are really hard to read, but that's relatively rare.)


For those who aren't familiar with the author: he previously worked at MetaMind / Salesforce Research doing NLP and has published many successful NLP papers [0]. He opted to write an informal paper for this project (similar to YOLOv3 [1]), but the work itself should still be taken seriously.

[0] https://scholar.google.com/citations?user=AolIi4QAAAAJ

[1] https://pjreddie.com/media/files/papers/YOLOv3.pdf


You just have to love Stephen Merity.

His work on QRNNs saved me quite a bit of time and money when I was doing my undergrad dissertation on language models.

This SHA-RNN seems to have surfaced from a similar line of thinking that spawned the QRNN.


Are QRNNs still used much?


Check out MultiFiT [0] from fastai; it uses QRNN for speed.

[0] http://nlp.fast.ai/classification/2019/09/10/multifit.html


The paper raises a great point about tokenization affecting perplexity: that we can't compare perplexities across different tokenizers, say BPE vs. word tokenization, even when re-normalizing to take token counts into account. This example nails it: https://twitter.com/Smerity/status/1192252147598909441


I don't see his point. Doesn't renormalizing for token counts essentially eliminate the effect of tokenization? The perplexity we then get is essentially representative of how well a model compresses the test document. Isn't that the whole point? A better model compresses the document better; how does it matter whether you model each character, each word, bigrams, or even the bits directly?

The main disadvantage of word-level models is the large vocabulary size; however, the tweet completely ignores the advantage: the sequence length becomes shorter, so the model only has to look a few tokens back to find the reference to "Bob" and "Alice".

The same model at word level writes more sensible sentences than at character level. There's a tradeoff between a larger vocabulary and modelling longer dependencies. A model which can encode a text document more effectively is better; tokenization is just part of the modelling. You just need to take care of the "per word" part of "perplexity per word" (i.e. the number of words), and then you can directly compare their performances.

The author is wrong that entropy collapses after the "A" of "Alice" is given. Entropy will only collapse if the model has really "understood" the context and modelled that "Bob" and "Alice" are the only options here. The entropy won't collapse for a sentencepiece-based bi-gram model, for example.

In his example, it is not clear that the wordpiece model is at an advantage. Suppose both models "understand" that there are two options, "Bob" and "Alice". Then the word-level model only has to predict one token, which can be either of the names: probability 0.5. The sentencepiece model also has to choose between two tokens, "B" and "A"; the second token won't add to the perplexity since it'll be known. Again probability 0.5, so the same perplexity after normalization.
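To make the normalization concrete, here's a minimal Python sketch (the bits_per_char helper and the numbers are made up for illustration, not taken from the paper) of converting perplexity-per-token into bits-per-character so models with different tokenizers can be compared on the same text:

    import math

    def bits_per_char(ppl_per_token: float, n_tokens: int, n_chars: int) -> float:
        """Convert perplexity-per-token into bits-per-character.

        The total negative log-likelihood of the text (in bits) is
        n_tokens * log2(ppl_per_token); dividing by the character count
        gives a number that no longer depends on the tokenizer.
        """
        return n_tokens * math.log2(ppl_per_token) / n_chars

    # Made-up numbers for the same held-out text under two tokenizations.
    n_chars = 1_000_000
    print(bits_per_char(ppl_per_token=60.0, n_tokens=180_000, n_chars=n_chars))  # word-level model
    print(bits_per_char(ppl_per_token=12.0, n_tokens=320_000, n_chars=n_chars))  # BPE-level model

Whichever model yields fewer total bits for the same characters is the better compressor, regardless of how it chopped the text up.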


Good point; assuming some extent of collapse is crucial, and the question is whether different perplexities due to tokenization can happen in principle. You are right that in the "Alice" vs. "A|lice" example we get the same perplexity after re-normalization; I can't come up with an example where it would be different right now.


I agree. Perplexities (the probability of a text) can be compared across different tokenizations after normalization.


Perhaps I am missing the point of this article. The RNN approach seems to get similar performance, but it uses more parameters and misses the parallelization benefits that Transformers have over recurrent networks.

What is the benefit of the RNN here?


The parallelism in a transformer doesn't necessarily translate to less or faster compute. Each layer has to be computed serially after the previous layer, and the computation of each attention head is quadratic in the length of the input sequence. When used this way for language modeling, the transformer also has to be run step-by-step at inference time; the parallelism that was a boon during training is no longer available.

The author doesn't do much absolute wall-time comparison, but he does mention that only the Adaptive Transformer configuration trained in a similar amount of time on a single GPU.
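To put numbers behind that, here's a toy NumPy sketch (hypothetical rnn_decode / attention_decode functions with made-up dimensions, not the paper's actual SHA-RNN) contrasting step-by-step generation: the recurrent decoder does a constant amount of work per emitted token, while the attention decoder re-attends over the whole cached prefix at every step, so generating n tokens costs O(n) versus O(n^2):

    import numpy as np

    d = 64  # hidden size, arbitrary for this toy comparison

    def rnn_decode(n_steps: int) -> np.ndarray:
        """Recurrent decoding: fixed-size state, constant work per token (O(n) total)."""
        W = np.random.randn(d, d) / np.sqrt(d)
        h = np.zeros(d)
        for _ in range(n_steps):
            h = np.tanh(W @ h)  # one matrix-vector product per step
        return h

    def attention_decode(n_steps: int) -> np.ndarray:
        """Attention decoding: each new token attends over the whole cached prefix (O(n^2) total)."""
        keys, values = [], []
        x = np.zeros(d)
        for t in range(n_steps):
            keys.append(np.random.randn(d))    # stand-ins for projected prefix states
            values.append(np.random.randn(d))
            scores = np.array([x @ k for k in keys])  # t + 1 dot products at step t
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            x = sum(w * v for w, v in zip(weights, values))
        return x

In practice key/value caching keeps the per-step attention cost linear in the prefix rather than quadratic, but the total work still grows with sequence length, while the RNN's state stays fixed-size.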


Another work goes in the opposite direction, introducing gating into Transformer-XL: https://arxiv.org/abs/1910.06764


Hilarious paper, I'm about to drop a SHA-RNN on my GPU to make it sweat.


Did anyone else read "SHA-RNN" as "SHHHAAAAARRRROOOOONNN" in Ozzy's voice?


Now that was some refreshing reading.


ok


Now I really want pop music made by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin


A dissenting voice from the positive reception here on HN: I thought that this paper was a joke. Single author, no affiliation, snarky language. Why not be civil instead?


> Single author, no affiliation, snarky language.

I'd say that all of these are factors that neither add to nor detract from the value of the paper itself; it's a "hey, I tried this and it works OK despite not going in the obvious direction" kind of paper. So, limited experiments, but IMO competently done and with usable information.

It's a pity that all papers nowadays have a gazillion authors, from well-funded research labs, with as-dry-as-possible language that hides the real research behind a "we knew this all along rather than figuring it out along the way" facade. OTOH that's what you get in a large fairly mature research field, where most competent people get hired by research labs and then do lots of collaborative research that scales well and subsequently need to show publication counts to secure further funding.


It's a shame that professionalism and showing personality are so at odds all over the place, from papers to the workplace. For the most part, professional has come to align with formal. It's clear why, but still sad :(


Why is it sad? The whole point of professionalism is disaffective communication.


While informal, I do not think his tone lacked civility.

I strongly prefer papers written in this style. Not only are they more enjoyable to read, but they are often easier to understand and more genuine as well. Papers written in a formal style often obscure the real motivation and instead provide a fancy-sounding retroactive justification. It makes the authors feel smarter, and I guess some readers feel smarter as well, but it belies the reality of research.


Language can be debated, I agree with you, but

What’s wrong with single authorship and no affiliation?

At the end of the day, if the paper proposes some idea or method and delivers on its stated claims (with reproducible code), then I don't care who wrote it, how many authors there were, or who the authors work for.


If you don't know: he's a relatively successful author (if you count citations), previously (it seems) at Salesforce Research.

He has worked on YOLO (computer vision) and NLP-related problems.

https://scholar.google.com/citations?user=AolIi4QAAAAJ


He hasn't worked on YOLO, only NLP. YOLO is another example of a well-known, successful researcher (Joseph Redmon) writing an informal paper.


I'll second this. The style is clunky and reads as though the author were trying too hard to make every sentence entertaining, which mostly detracts from the work. I know Stephen Merity is a serious researcher and the content here is legit given his body of work. But the style/prose in this preprint reminded me a lot of some of the garbage Siraj Raval peddled. Again, to reiterate, I am not commenting on the substance, only the style.


Because it is funny


What makes this uncivil to you?



