I'm skeptical that RNNs alone will outperform transformers. Perhaps some sort of transformer + rnn combo?
The issue with RNNs is that feedback signals decay over time, so the model will be biased towards more recent words.
Transformers on the other hand don't have this bias. A word 10,000 words ago could be just as important as a word 5 words ago. The tradeoff is that the context window for transformers is a hard cutoff point.
How it works: RWKV gathers information into a number of channels, each of which decays at a different speed as you move to the next token. It's very simple once you understand it.
RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
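A minimal numeric sketch of that idea (not the actual RWKV formulas; just the simplified recurrence h_t = w * h_{t-1} + x_t with a fixed per-channel decay w): because w doesn't depend on the data, the sequential recurrence unrolls into a weighted sum over past tokens, which can be computed for all positions in parallel.

```python
import numpy as np

def rnn_decay(xs, w):
    # Sequential form: h_t = w * h_{t-1} + x_t, one step at a time.
    h = 0.0
    out = []
    for x in xs:
        h = w * h + x
        out.append(h)
    return np.array(out)

def parallel_decay(xs, w):
    # Because w is data-independent, the same result is
    # h_t = sum_{k <= t} w^(t-k) * x_k, computable for all t at once.
    xs = np.asarray(xs, dtype=float)
    t = np.arange(len(xs))
    # Decay matrix: w^(t-k) for k <= t, zeroed above the diagonal.
    D = np.tril(w ** (t[:, None] - t[None, :]))
    return D @ xs

xs = [1.0, 2.0, 0.5, 3.0]
# Both forms agree for a 0.8-decay channel:
assert np.allclose(rnn_decay(xs, 0.8), parallel_decay(xs, 0.8))
```

A data-dependent gate (w changing per step, as in an LSTM) would break this unrolling, which is exactly why RWKV keeps the decay fixed per channel and instead routes information between channels with different decay rates.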
As far as I remember from the RNN era, the best models were RNNs with attention. Does this thing have any attention mechanism? If it does, then it has the same problem with the O(n^2) computation, where n is the window size. My understanding is that transformers are superior because they are much faster to train/evaluate than RNNs.