Feeding the last layer back as the input embedding has been done many times, e.g. Transformer-XL. The models are trained like this; it's not like they're taking a pre-trained Llama and just feeding it to itself. It's a simple, computationally cheap mechanism to add feedback.
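For anyone wondering what that looks like concretely, roughly this (a toy sketch, not the paper's code; the class name, sizes, and the HF-free model interface are made up, and I've left out the causal mask and the projection back to logits):

```python
import torch
import torch.nn as nn

class FeedbackDecoder(nn.Module):
    """Toy sketch: instead of embedding a sampled token, the next step's
    input is the previous step's final hidden state (last-layer output)."""

    def __init__(self, vocab=50257, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)

    def forward(self, prompt_ids, n_feedback_steps=4):
        h = self.embed(prompt_ids)              # (batch, seq, d_model)
        for _ in range(n_feedback_steps):
            out = self.backbone(h)              # last-layer hidden states
            last = out[:, -1:, :]               # final position's hidden state
            h = torch.cat([h, last], dim=1)     # fed back in place of a token embedding
        return h
```

The whole stack just runs again on its own output, so the extra "feedback" costs nothing architecturally beyond a few more forward positions.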
I read a paper not long ago that showed that deleting, duplicating and reordering layers doesn't actually seem to matter that much, and feeding back is just a kind of re-ordering.
Imo this kind of makes sense - LLMs without a feedback loop can learn to build one themselves by encoding information in the previously generated tokens.
From my understanding that is what they do; see the paper:
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments.
I agree the feedback is necessary, and the mechanism simple and cheap, but I don't think it is optimal.
Yes, they use a pre-trained model, but they do further training (please correct me if I misread; I also realize my comment above could be interpreted as saying they train a new model entirely from scratch).
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments. The learning rate is set to 1 × 10^-4 while the effective batch size is 128. Following Deng et al. (2024), we also reset the optimizer when the training stages switch.
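The "reset the optimizer when the training stages switch" part just means re-instantiating it per stage so the Adam moment estimates from the previous stage are discarded. A sketch of what I mean (assumptions: PyTorch/AdamW, a HuggingFace-style `model(**batch).loss` interface, and `stages` as a list of dataloaders; only the hyperparameters come from the quote above):

```python
import torch

def staged_train(model, stages, lr=1e-4):
    """Staged training with an optimizer reset at each stage boundary."""
    for stage_loader in stages:
        # fresh optimizer: momentum / second-moment state does not carry over
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        for batch in stage_loader:          # effective batch size 128 in the paper
            loss = model(**batch).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
```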