I think RWKV ameliorates this to some degree: How it works: RWKV gathers inform... | Hacker News


pizza on March 24, 2023 | on: RWKV RNN: Better than ChatGPT?

I think RWKV ameliorates this to some degree:

How it works: RWKV gathers information to a number of channels, which also decay at different speeds as you move to the next token. It's very simple once you understand it. RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
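To make the "data-independent decay" point concrete, here is a rough sketch in plain Python. It is a simplification and not the actual RWKV kernel: each channel is just a running sum that shrinks by a fixed factor per token, and the function names and input values are made up for illustration. Because the decay never depends on the input, the state at any position can be written as a closed-form weighted sum, which is what allows it to be computed without stepping through the sequence.

```python
# Simplified sketch, NOT the real RWKV formula: one scalar state per channel,
# decayed by a fixed (data-independent) factor at every token.
# Function names and input values are hypothetical.

def recurrent_states(inputs, decay):
    """Sequential (RNN-style) update: s_t = decay * s_{t-1} + x_t."""
    state = 0.0
    states = []
    for x in inputs:
        state = decay * state + x
        states.append(state)
    return states


def parallel_states(inputs, decay):
    """Closed form: s_t = sum over i <= t of decay**(t - i) * x_i.

    Each s_t depends only on the inputs and the fixed decay, so every
    position could be computed independently of the others.
    """
    return [
        sum(decay ** (t - i) * inputs[i] for i in range(t + 1))
        for t in range(len(inputs))
    ]


inputs = [1.0, 0.0, 2.0, 0.5]
channels = {"W-0.8": 0.8, "W-0.5": 0.5}  # two channels with different decay

for name, decay in channels.items():
    seq = recurrent_states(inputs, decay)
    par = parallel_states(inputs, decay)
    # Both formulations agree up to floating-point rounding.
    assert all(abs(a - b) < 1e-9 for a, b in zip(seq, par))
    print(name, [round(s, 4) for s in seq])
```

In a gated RNN the decay would be a function of the current input, so the closed form above would not apply. In this toy picture, "moving information from a W-0.8-channel to a W-0.5-channel" corresponds to writing the input into the channel whose fixed decay matches how long it should be remembered.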

gwern on March 24, 2023

https://twitter.com/arankomatsuzaki/status/16390003799784038...

solomatov on March 24, 2023

I don't see why this can't be done with transformers. I guess somebody has already tried it.
