
My intuition is that the harder a task is for an LLM during training, the more actual compression/learning gets encoded in its weights. With multi-token/diffusion objectives it becomes much easier to "reward/loss hack" your way through. This won't matter much during pretraining, but I assume a lot of "cheating" will happen in the finetune/RL phase.
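
A rough way to picture the multi-token objective being discussed (a toy sketch only, not any specific model's training code; the model, sizes, and names below are made-up assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, DIM, HEADS = 100, 32, 4  # HEADS = how many future tokens are predicted per step

    class ToyLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, DIM)
            self.rnn = nn.GRU(DIM, DIM, batch_first=True)
            # one output head per future position; HEADS == 1 is plain next-token prediction
            self.heads = nn.ModuleList(nn.Linear(DIM, VOCAB) for _ in range(HEADS))

        def forward(self, tokens):
            h, _ = self.rnn(self.embed(tokens))
            return [head(h) for head in self.heads]  # list of (batch, time, VOCAB) logits

    def multi_token_loss(logits_per_head, tokens):
        # Head k predicts token t+k+1 from position t.
        losses = []
        for k, logits in enumerate(logits_per_head):
            preds = logits[:, : tokens.size(1) - (k + 1)]
            targets = tokens[:, k + 1 :]
            losses.append(F.cross_entropy(preds.reshape(-1, VOCAB), targets.reshape(-1)))
        # Averaging over heads lets easy far-horizon terms partly offset a weak
        # next-token term, which is one place the "loss hacking" worry could show up.
        return torch.stack(losses).mean()

    model = ToyLM()
    batch = torch.randint(0, VOCAB, (2, 16))
    print(multi_token_loss(model(batch), batch))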

