Hacker News

I'm skeptical that RNNs alone will outperform transformers. Perhaps some sort of transformer + RNN combo?

The issue with RNNs is that feedback signals decay over time, so the model will be biased towards more recent words.

Transformers on the other hand don't have this bias. A word 10,000 words ago could be just as important as a word 5 words ago. The tradeoff is that the context window for transformers is a hard cutoff point.
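A toy sketch of that decay (my own illustration, not any particular model): if each recurrence step multiplies the hidden state by a gate factor below 1, the contribution of an early word shrinks geometrically.

```python
# Toy illustration of RNN recency bias (made-up numbers, not a real model):
# a signal injected at step 0 shrinks by a gate factor < 1 at every step,
# so after 50 steps it is essentially gone.
gate = 0.8                      # assumed per-step decay factor
h = 1.0                         # hidden-state contribution of the first word
contributions = []
for t in range(50):
    h *= gate                   # each recurrence multiplies by the gate
    contributions.append(h)

print(contributions[4])         # after 5 steps: 0.8**5 ≈ 0.33
print(contributions[-1])        # after 50 steps: 0.8**50 ≈ 1.4e-5
```

A transformer, by contrast, pays no such geometric penalty for distance; it just stops dead at the context-window boundary.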



I think RWKV ameliorates this to some degree:

How it works: RWKV gathers information into a number of channels, which also decay at different speeds as you move to the next token. It's very simple once you understand it.

RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8 channel to a W-0.5 channel to achieve the same effect.
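Here's a minimal sketch of why a fixed, data-independent decay makes this parallelizable (my own toy code, not the real RWKV implementation): since each channel's decay w[c] never depends on the input, the recurrence h_t = w * h_{t-1} + x_t unrolls into a weighted sum over past tokens that can be computed for all positions at once.

```python
import numpy as np

# Toy demo: per-channel fixed decay lets you replace the sequential
# recurrence with one big masked weighted sum over past tokens.
rng = np.random.default_rng(0)
T, C = 6, 3                        # tokens, channels
x = rng.normal(size=(T, C))
w = np.array([0.9, 0.5, 0.1])      # per-channel decay (trainable in RWKV)

# Sequential (RNN-style) evaluation: h_t = w * h_{t-1} + x_t.
h = np.zeros(C)
seq = []
for t in range(T):
    h = w * h + x[t]
    seq.append(h.copy())
seq = np.array(seq)

# Parallel evaluation: h_t[c] = sum_{s<=t} w[c]**(t-s) * x_s[c] for all t.
exponents = np.arange(T)[:, None, None] - np.arange(T)[None, :, None]
powers = w[None, None, :] ** exponents          # shape (T, T, C)
mask = np.tril(np.ones((T, T)))[:, :, None]     # zero out future tokens
par = (powers * mask * x[None, :, :]).sum(axis=1)

assert np.allclose(seq, par)
```

If the gates were data-dependent (as in an LSTM), the exponents would vary with the input and this closed form would not exist; you'd be stuck evaluating token by token.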



I don't see why this can't be done with transformers. I'd guess somebody has already tried it.


As far as I remember from the RNN era, the best models were RNNs with attention. Does this thing have any attention mechanism? If it does, then it has the same O(n^2) computation problem, where n is the window size. My understanding is that transformers are superior because they are much faster to train/evaluate than RNNs.
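Just to make the O(n^2) point concrete (trivial arithmetic, nothing model-specific): the attention matrix has one entry per query-key pair, so doubling the window quadruples the work.

```python
# The attention matrix has n*n entries, so cost grows quadratically
# with the window size n.
def attn_entries(n):
    return n * n

print(attn_entries(1024))   # 1048576
print(attn_entries(2048))   # 4194304, i.e. 4x the 1024-token cost
```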


What does RNN stand for?

edit: recurrent neural network



