Feeding the last layer back as the input embedding has been done many times, e.g. Transformer-XL. The models are trained like this; it's not like they're taking a pre-trained Llama and just feeding it to itself. It's a simple, computationally cheap mechanism to add feedback.
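For anyone wondering what that looks like concretely, roughly this (a toy sketch, not the paper's code; the class name, sizes, and the HF-free model interface are made up, and I've left out the causal mask and the projection back to logits):

```python
import torch
import torch.nn as nn

class FeedbackDecoder(nn.Module):
    """Toy sketch: instead of embedding a sampled token, the next step's
    input is the previous step's final hidden state (last-layer output)."""

    def __init__(self, vocab=50257, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)

    def forward(self, prompt_ids, n_feedback_steps=4):
        h = self.embed(prompt_ids)              # (batch, seq, d_model)
        for _ in range(n_feedback_steps):
            out = self.backbone(h)              # last-layer hidden states
            last = out[:, -1:, :]               # final position's hidden state
            h = torch.cat([h, last], dim=1)     # fed back in place of a token embedding
        return h
```

The whole stack just runs again on its own output, so the extra "feedback" costs nothing architecturally beyond a few more forward positions.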
I read a paper not long ago that showed that deleting, duplicating and reordering layers doesn't actually seem to matter that much, and feeding back is just a kind of re-ordering.
Imo this kind of makes sense - LLMs without a feedback loop can learn to build one themselves by encoding information in the previously generated tokens.
From my understanding that is what they do; see the paper:
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments.
I agree the feedback is necessary, and the mechanism simple and cheap, but I don't think it is optimal.
Yes, they use a pre-trained model, but they do further training (please correct me if I misread; I also realize my comment above could be interpreted as saying they train a new model entirely from scratch).
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments. The learning rate is set to 1 × 10^-4 while the effective batch size is 128. Following Deng et al. (2024), we also reset the optimizer when the training stages switch.
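The "reset the optimizer when the training stages switch" part just means re-instantiating it per stage so the Adam moment estimates from the previous stage are discarded. A sketch of what I mean (assumptions: PyTorch/AdamW, a HuggingFace-style `model(**batch).loss` interface, and `stages` as a list of dataloaders; only the hyperparameters come from the quote above):

```python
import torch

def staged_train(model, stages, lr=1e-4):
    """Staged training with an optimizer reset at each stage boundary."""
    for stage_loader in stages:
        # fresh optimizer: momentum / second-moment state does not carry over
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        for batch in stage_loader:          # effective batch size 128 in the paper
            loss = model(**batch).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
```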