
I think RWKV ameliorates this to some degree:

How it works: RWKV gathers information into a number of channels, each of which decays at a different speed as you move to the next token. It's very simple once you understand it.
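
A minimal sketch of that idea (toy code, not the real RWKV kernel; the channel count and decay values are made up for illustration):

    import numpy as np

    num_channels = 4
    # Per-channel decay factors; in RWKV these are trainable parameters.
    w = np.array([0.9, 0.8, 0.5, 0.1])
    state = np.zeros(num_channels)

    tokens = np.random.randn(10, num_channels)  # toy per-token channel inputs
    for x in tokens:
        # Slow-decay channels (0.9) retain long-range context;
        # fast-decay channels (0.1) mostly see the recent tokens.
        state = w * state + x
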

RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (this is what "gates" do), while in RWKV you simply move the information from a W-0.8 channel to a W-0.5 channel to achieve the same effect.
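
To see why a fixed decay parallelizes, note that with data-independent w the recurrence h_t = w*h_{t-1} + x_t unrolls to h_t = sum_i w^(t-i) * x_i, so every h_t can be computed directly from the inputs instead of step by step. A rough sketch (illustrative only; a real kernel would not build the full T-by-T matrix):

    import numpy as np

    T = 10
    w = 0.8
    x = np.random.randn(T)

    # Sequential RNN-style recurrence.
    h_seq = np.zeros(T)
    h = 0.0
    for t in range(T):
        h = w * h + x[t]
        h_seq[t] = h

    # Parallel form: precompute the matrix of decay powers w**(t-i)
    # and apply it in one matmul. This only works because w does not
    # depend on the data; a gated RNN's per-step w would break it.
    powers = w ** (np.arange(T)[:, None] - np.arange(T)[None, :])
    mask = np.tril(np.ones((T, T)))  # only past tokens contribute
    h_par = (powers * mask) @ x

    assert np.allclose(h_seq, h_par)
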

I don't see why this couldn't be done with transformers. I guess somebody has already tried it.



