
Feeding the last layer back as the input embedding has been done many times, e.g. Transformer-XL. The models are trained like this; it's not like they're taking a pre-trained Llama and just feeding it to itself. It's a simple, computationally cheap mechanism to add feedback.
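A minimal sketch of the mechanism (toy PyTorch, trained end-to-end with the loop; the class name, sizes, and loop count are made up, and causal masking is omitted for brevity):

    import torch
    import torch.nn as nn

    class LoopedTransformer(nn.Module):
        def __init__(self, vocab_size=50257, d_model=768, n_layers=12, n_heads=12):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, input_ids, n_loops=2):
            h = self.embed(input_ids)
            for _ in range(n_loops):
                # The last layer's hidden state is re-fed as the next
                # pass's "input embedding" -- the cheap feedback path.
                h = self.blocks(h)
            return self.lm_head(h)

The only extra cost over a plain forward pass is running the stack n_loops times; there are no new parameters.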



I read a paper not long ago showing that deleting, duplicating, and reordering layers doesn't actually seem to matter that much, and feeding the last layer back is just a kind of reordering.
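If you want to poke at that claim yourself, shuffling the blocks of GPT-2 is a few lines (a sketch; model.transformer.h is the ModuleList of blocks in the Hugging Face implementation):

    import random
    import torch
    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Shuffle the stack of 12 transformer blocks in place.
    blocks = list(model.transformer.h)
    random.shuffle(blocks)
    model.transformer.h = torch.nn.ModuleList(blocks)

    # Compare perplexity before/after to see how much layer order matters.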


So you're saying that feeding the last layer back to the first makes the model layer-order independent, or kinda infinitely deep, if you squint? :).


Imo this kind of makes sense - LLMs without a feedback loop can learn to have one themselves by encoding information in the previously generated tokens.


They can't, because that would increase training loss. The training loss acts as a gatekeeper for reasoning.


From my understanding that is what they do; see the paper:

> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments.

I agree the feedback is necessary, and the mechanism simple and cheap, but I don't think it is optimal.


Yes, they use a pre-trained model, but they do further training on top of it (please correct me if I misread, and I realize my comment above could be read as saying they train a new model entirely from scratch).

> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments. The learning rate is set to 1 × 10−4 while the effective batch size is 128. Following Deng et al. (2024), we also reset the optimizer when the training stages switch.
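Something like this, as a sketch (stage_loaders and the batch format are stand-ins for the paper's training stages; the effective batch size of 128 would come from the dataloader and/or gradient accumulation):

    import torch
    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2")

    def train_stage(model, dataloader):
        # Fresh optimizer per stage: the "reset the optimizer when the
        # training stages switch" step from the quote.
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        model.train()
        for input_ids in dataloader:
            loss = model(input_ids=input_ids, labels=input_ids).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    # for loader in stage_loaders:  # one dataloader per stage (assumed)
    #     train_stage(model, loader)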



