I was referring to the input image in the diagram: what is it, and how is the output image generated from it? Is it 256x256 noise that gets denoised into an image? I guess what I'm really asking is what guides the process toward the final image if it's not text-to-image?
The "input image" is just the noisy sample from the previous timestep, yes.
The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ.
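To make that concrete, here is a minimal DDPM-style sketch of class-conditional sampling. The names `model`, `class_embed`, and the beta schedule are generic placeholders, not the paper's actual code or formulation; it's just meant to show that sampling starts from pure noise, that each step's "input image" is the previous step's noisy sample, and that a small conditioning network is what guides it toward a particular class:

```python
import torch

# Hypothetical sketch of DDPM-style ancestral sampling with class conditioning.
# `model` and `class_embed` are placeholder callables, not the paper's networks.

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def sample(model, class_embed, label, shape=(1, 3, 256, 256)):
    cond = class_embed(torch.tensor([label]))    # small conditioning network
    x = torch.randn(shape)                       # start from pure Gaussian noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t)
        eps = model(x, t_batch, cond)            # predict the noise in x_t
        # posterior mean of x_{t-1} given x_t and the predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # becomes the "input image" of the next step
    return x
```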
Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that this shouldn't work with a few tweaks.
Both Latent Consistency Models and Adversarial Diffusion Distillation (the method behind SDXL Turbo) don't depend on any specific properties of the backbone. Since Hourglass Diffusion Transformers are just a new kind of backbone that can be used in place of the Diffusion U-Nets in Stable Diffusion (XL), these methods should be applicable to it as well.
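To illustrate why the backbone choice doesn't matter, here is a deliberately simplified output-matching distillation step (much simpler than the actual LCM or ADD objectives, and not their code). The only thing it needs from the networks is a generic denoiser call signature, so a Diffusion U-Net and an HDiT are interchangeable here:

```python
import torch

def distill_step(student, teacher, x_t, t, cond):
    """Toy distillation step: the student matches the frozen teacher's
    prediction at the same noisy input, timestep, and conditioning.
    `student` and `teacher` can be any denoiser with the signature
    f(x, t, cond) -> prediction, e.g. a diffusion U-Net or an HDiT."""
    with torch.no_grad():
        target = teacher(x_t, t, cond)       # frozen teacher prediction
    pred = student(x_t, t, cond)             # trainable student, any backbone
    return torch.mean((pred - target) ** 2)  # simple output-matching loss
```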
Nope, we don't do Imagen-style super-resolution; we go directly to high resolution with a single-stage model.