I'm skeptical that RNNs alone will outperform transformers. Perhaps some sort of transformer + rnn combo?
The issue with RNNs is that feedback signals decay over time, so the model will be biased towards more recent words.
Transformers on the other hand don't have this bias. A word 10,000 words ago could be just as important as a word 5 words ago. The tradeoff is that the context window for transformers is a hard cutoff point.
How it works: RWKV gathers information into a number of channels, each of which decays at a different speed as you move to the next token. It's very simple once you understand it.
RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
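A minimal numeric sketch of that idea (not the actual RWKV formulas; just the simplified recurrence h_t = w * h_{t-1} + x_t with a fixed per-channel decay w): because w doesn't depend on the data, the sequential recurrence unrolls into a weighted sum over past tokens, which can be computed for all positions in parallel.

```python
import numpy as np

def rnn_decay(xs, w):
    # Sequential form: h_t = w * h_{t-1} + x_t, one step at a time.
    h = 0.0
    out = []
    for x in xs:
        h = w * h + x
        out.append(h)
    return np.array(out)

def parallel_decay(xs, w):
    # Because w is data-independent, the same result is
    # h_t = sum_{k <= t} w^(t-k) * x_k, computable for all t at once.
    xs = np.asarray(xs, dtype=float)
    t = np.arange(len(xs))
    # Decay matrix: w^(t-k) for k <= t, zeroed above the diagonal.
    D = np.tril(w ** (t[:, None] - t[None, :]))
    return D @ xs

xs = [1.0, 2.0, 0.5, 3.0]
# Both forms agree for a 0.8-decay channel:
assert np.allclose(rnn_decay(xs, 0.8), parallel_decay(xs, 0.8))
```

A data-dependent gate (w changing per step, as in an LSTM) would break this unrolling, which is exactly why RWKV keeps the decay fixed per channel and instead routes information between channels with different decay rates.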
As far as I remember from the RNN era, the best models were RNNs with attention. Does this thing have any attention mechanism? If it does, then it has the same problem with the O(n^2) computation, where n is the window size. My understanding is that transformers are superior because they are much faster to train/evaluate than RNNs.