
>> A user of an LLM might give the model some long text and then say "Translate this into German please". A Transformer can look back at its whole history.

Which isn't necessary. If you instead say "translate the following to German", all the model needs is to remember the task at hand and a much smaller window of recent input. Well, that and the ability to produce output in parallel with processing input.



It's necessary for arbitrary information processing if you can forget and have no way to "unforget".

A model can decide to forget something that later turns out to be important for some future prediction. A human can go back and re-read or re-listen; a transformer is effectively always re-reading; but an RNN can't, and is out of luck.


If these networks are ever to be a path toward something closer to general intelligence, they will need to be able to ask for context to be repeated, or to have separate storage where they can "choose" to replay it themselves. So this problem likely has to be solved another way regardless, both for transformers and for RNNs.


For a transformer, the context is already re-read on every token. It can fetch any piece of information whenever it becomes useful. I don't see what problem there is to solve here.


For a transformer, context is limited, so the same kind of problem applies after you exceed some size.


That's just because we twisted its arm. One could, for example, feed the reversed input afterwards, i.e. abc|cba where | is a special token. That would allow it to react to any part of the message.
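The input transformation described above is simple to state concretely. A minimal sketch (the function name and separator choice are my own, purely illustrative):

```python
def with_reversed_suffix(tokens, sep="|"):
    # Append a separator token followed by the reversed sequence,
    # so a recurrent model gets a second look at the input, in
    # reverse order, before it has to commit to an output.
    return tokens + [sep] + tokens[::-1]

# ["a", "b", "c"] -> ["a", "b", "c", "|", "c", "b", "a"]
```

Anything the RNN forgot on the forward pass has a second chance to be picked up on the reversed pass, at the cost of doubling the sequence length.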


I think this might be key, in addition to some landmark tokens to quickly backtrack to. The big question is how to train such a model.

There is a recent paper from Meta that proposes a way to train a model to backtrack its generation to improve alignment [0].

[0] https://arxiv.org/html/2409.14586v1


Also, a lightweight network could do a first pass to identify tasks, instructions, constraints, etc., and then a second pass could use the RNN.

Consider the flood-fill or union-find algorithms, which feel magical on first exposure.

https://en.wikipedia.org/wiki/Hoshen%E2%80%93Kopelman_algori...

Having 2 passes can enable so much more than a single pass.
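Union-find is a nice illustration of what two passes buy you. A minimal sketch (my own naming, using path halving; not taken from the linked article):

```python
class UnionFind:
    """Disjoint-set forest over labels 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Walk to the root, shortcutting pointers along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        # Merge the sets containing a and b.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
```

Hoshen-Kopelman has exactly this two-pass shape: one raster scan over the grid assigns provisional labels and unions touching occupied cells, then a second pass replaces each provisional label with its `find()` result.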

Another alternative: a first pass could take notes in a separate buffer while parsing the input. The bandwidth needed for taking and reading those notes can be much, much lower than that required for fetching the billions of parameters.


People did something similar to what you are describing 10 years ago: https://arxiv.org/abs/1409.0473

But it was trained on translation pairs, rather than the whole Internet.
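The core of that paper is additive ("Bahdanau") attention: the decoder scores every encoder state at each step, so nothing has to be squeezed through a single fixed vector. A rough numpy sketch; the weight names and shapes here are illustrative, not the paper's notation:

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, Wd, We, v):
    # Score each encoder state against the current decoder state:
    # tanh of a joint linear projection, dotted with a learned vector v.
    scores = np.tanh(encoder_states @ We.T + decoder_state @ Wd.T) @ v
    # Softmax the scores into attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Return the weighted average of encoder states (the context vector).
    return weights @ encoder_states
```

Because the weights are recomputed for every output token, the decoder can "look back" at any input position, which is the same property the transformer later made the centerpiece.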



