Hey, I'm Arnaud, first author of the paper.
The answer is a bit mixed. We actually started looking into this because of a repetition problem that appeared in a low-data regime for a sequence generation task. Basically, the left-to-right GPT would get stuck repeating the same token once it had sampled it twice in a row during generation. To mitigate that, we tried generating the sequence in a random order, which seemed to help: we saw less of this repetition issue.
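To make "generating in a random order" concrete, here's a minimal sketch of the idea (not the paper's actual implementation): start from a fully masked sequence, visit the positions in a shuffled order, and fill in one position per step. `predict_token` is a hypothetical stand-in for the trained model.

```python
import random

MASK = "<mask>"

def predict_token(seq, pos):
    # Hypothetical stand-in for a model that can fill any masked
    # position given the partially generated sequence; here it
    # just returns a dummy token so the sketch runs.
    return f"tok{pos}"

def any_order_generate(length, rng=random):
    """Fill positions one at a time in a random order,
    instead of strictly left to right."""
    seq = [MASK] * length
    order = list(range(length))
    rng.shuffle(order)  # random generation order
    for pos in order:
        seq[pos] = predict_token(seq, pos)
    return seq
```

Because each step conditions on whatever positions are already filled, a repeated token no longer forces the model into the same left-to-right loop.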
We initially thought that, when we don't have enough data, shuffling would act like data augmentation and might actually help the model reach better performance.
But this is not what we found in the experiments: since learning in any order is a harder task, the model apparently memorises the data more.