> The explanation in the original paper turns out not to be true; you can get rid of most of their assumptions and it still works
I’ll admit it is amusing that some of the assumptions about why it works were incorrect. But the core idea, a Markov chain[0] in which each state change leads to higher likelihood, is bound to work even if the rest doesn’t.
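To make that concrete, here is a toy sketch of my own (not from the paper, and with a hand-picked 1-D Gaussian target) of a chain where each transition follows the score of the target density, so every step provably moves the state to a point of higher likelihood:

```python
# Toy illustration: a deterministic Markov chain whose transitions follow the
# score (gradient of the log-density) of an assumed target N(MU, SIGMA), so
# each step strictly increases the log-likelihood of the state.
import math
import random

MU, SIGMA = 0.0, 1.0  # assumed target distribution, chosen for the example

def log_likelihood(x):
    return -0.5 * ((x - MU) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2 * math.pi))

def denoise_step(x, step=0.1):
    score = -(x - MU) / SIGMA ** 2  # d/dx log N(x; MU, SIGMA)
    return x + step * score         # small move up the log-likelihood

x = random.gauss(0.0, 5.0)  # start from a very noisy state
for _ in range(50):
    x_next = denoise_step(x)
    assert log_likelihood(x_next) >= log_likelihood(x)  # each step helps
    x = x_next
```

Real diffusion models of course learn the score from data and add noise back in at each step; the point here is only that the monotone-improvement structure of the chain does the heavy lifting.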
In my mind, the Muse paper[1] gets closer to why it works: ultimately, the denoiser tries to match the latent space of an implicit encoder. Muse does this more directly and more effectively by using a cross-entropy loss on discrete latent tokens instead.
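The contrast in training objectives can be sketched with toy numbers (mine, not from either paper): a diffusion denoiser regresses continuous latents with an MSE-style loss, while a Muse-style model classifies discrete codebook tokens with cross-entropy.

```python
# Toy comparison of the two objectives on made-up values.
import math

# Continuous denoiser: MSE between predicted and true latent vectors.
pred_latent = [0.9, -0.2]
true_latent = [1.0, 0.0]
mse = sum((p - t) ** 2 for p, t in zip(pred_latent, true_latent)) / len(true_latent)

# Muse-style: cross-entropy over a (tiny, hypothetical) latent-token vocabulary.
logits = [2.0, 0.5, -1.0]   # model scores for 3 codebook tokens
target = 0                  # index of the true token
log_z = math.log(sum(math.exp(l) for l in logits))
xent = log_z - logits[target]  # -log softmax(logits)[target]

print(mse, xent)
```

The cross-entropy version turns latent prediction into a classification problem, which is exactly the shape of a language-modeling loss.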
In a way, the whole problem is no different from a language translation task. The only difference is that the output needs to be decoded into pixels instead of BPE tokens.
[0]: https://arxiv.org/abs/1503.03585
[1]: https://arxiv.org/abs/2301.00704