>i'd love an extension of this for the diffusion transformer
All you need to do is replace the U-Net with a transformer encoder (remove the token embeddings and instead project the image patches into vectors of size n_embd), and the diffusion process itself can remain the same.
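As a rough sketch of that patch projection (the patch size, channel count, and n_embd below are made-up values, and PatchEmbed is just an illustrative name):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch_size=2, in_chans=4, n_embd=768):
        super().__init__()
        # A strided conv both cuts the image into patches and projects
        # each patch to an n_embd-dimensional token in one step.
        self.proj = nn.Conv2d(in_chans, n_embd,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, n_embd, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, n_embd)
```

From there the tokens go through the transformer blocks exactly like text tokens would, plus positional embeddings for the patch grid.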
For the conditioning and t there are different possibilities. For example, the unpooled text embeddings (if the model is conditioned on text) usually go into cross-attention, while the pooled text embedding plus t is fed through adaLN blocks, similar to the style modulation in the original StyleGAN. But there are many other strategies.
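Here's roughly what an adaLN-conditioned block can look like (a sketch in the spirit of DiT's adaLN-Zero; the module names and shapes are illustrative, with c being the pooled text embedding plus the timestep embedding):

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        # LayerNorms without learnable affine params: scale/shift come from c.
        self.norm1 = nn.LayerNorm(n_embd, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.norm2 = nn.LayerNorm(n_embd, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd))
        # Map the conditioning vector to per-block shift/scale/gate parameters.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(n_embd, 6 * n_embd))

    def forward(self, x, c):  # x: (B, T, n_embd), c: (B, n_embd)
        s1, g1, a1, s2, g2, a2 = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + g1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + a1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + g2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + a2.unsqueeze(1) * self.mlp(h)
        return x
```

In the adaLN-Zero variant the final linear layer of self.ada is initialized to zero, so each block starts out as the identity, which helps training stability.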