I was referring to the input image in the diagram: what is it, and how is the output image generated from it? Is it 256x256 noise that gets denoised into an image? I guess what I'm really asking is what guides the process toward the final image if it's not text-to-image?
The "input image" is just the noisy sample from the previous timestep, yes.
The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ.
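To make that concrete, here is a minimal DDPM-style sketch of class-conditional sampling. The names `model`, `class_embed`, and the beta schedule are generic placeholders, not the paper's actual code or formulation; it's just meant to show that sampling starts from pure noise, that each step's "input image" is the previous step's noisy sample, and that a small conditioning network is what guides it toward a particular class:

```python
import torch

# Hypothetical sketch of DDPM-style ancestral sampling with class conditioning.
# `model` and `class_embed` are placeholder callables, not the paper's networks.

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def sample(model, class_embed, label, shape=(1, 3, 256, 256)):
    cond = class_embed(torch.tensor([label]))    # small conditioning network
    x = torch.randn(shape)                       # start from pure Gaussian noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t)
        eps = model(x, t_batch, cond)            # predict the noise in x_t
        # posterior mean of x_{t-1} given x_t and the predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # becomes the "input image" of the next step
    return x
```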
Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that this shouldn't work with a few tweaks.
Both Latent Consistency Models and Adversarial Diffusion Distillation (the method behind SDXL Turbo) don't depend on any specific properties of the backbone. Since Hourglass Diffusion Transformers are just a new kind of backbone that can be used in place of the Diffusion U-Nets in Stable Diffusion (XL), these methods should be applicable to it as well.
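To illustrate why the backbone choice doesn't matter, here is a deliberately simplified output-matching distillation step (much simpler than the actual LCM or ADD objectives, and not their code). The only thing it needs from the networks is a generic denoiser call signature, so a Diffusion U-Net and an HDiT are interchangeable here:

```python
import torch

def distill_step(student, teacher, x_t, t, cond):
    """Toy distillation step: the student matches the frozen teacher's
    prediction at the same noisy input, timestep, and conditioning.
    `student` and `teacher` can be any denoiser with the signature
    f(x, t, cond) -> prediction, e.g. a diffusion U-Net or an HDiT."""
    with torch.no_grad():
        target = teacher(x_t, t, cond)       # frozen teacher prediction
    pred = student(x_t, t, cond)             # trainable student, any backbone
    return torch.mean((pred - target) ** 2)  # simple output-matching loss
```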
Nope, we don't do Imagen-style super-resolution; we go directly to high resolution with a single-stage model.