
>i'd love an extension of this for the diffusion transformer

All you need to do is replace the U-Net with a transformer encoder (drop the token-embedding lookup and instead project the image patches into vectors of size n_embd), and the diffusion process can remain the same.
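
For concreteness, a minimal sketch of that in PyTorch; the class name PatchDiT, the hyperparameters, and the timestep-embedding interface are all my own assumptions, not taken from any particular implementation:

    import torch
    import torch.nn as nn

    class PatchDiT(nn.Module):
        """Patchify -> transformer encoder -> unpatchify; predicts the noise."""
        def __init__(self, img_size=32, patch=4, in_ch=3, n_embd=384,
                     n_head=6, n_layer=6):
            super().__init__()
            self.patch = patch
            n_tokens = (img_size // patch) ** 2
            # project each patch x patch pixel block into a vector of size n_embd
            self.proj_in = nn.Conv2d(in_ch, n_embd, kernel_size=patch, stride=patch)
            self.pos = nn.Parameter(torch.zeros(1, n_tokens, n_embd))
            layer = nn.TransformerEncoderLayer(n_embd, n_head, 4 * n_embd,
                                               batch_first=True, norm_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layer)
            # project tokens back to pixel patches (output has the input's shape)
            self.proj_out = nn.Linear(n_embd, patch * patch * in_ch)

        def forward(self, x, t_emb):  # x: (B, C, H, W), t_emb: (B, n_embd)
            B, C, H, W = x.shape
            tokens = self.proj_in(x).flatten(2).transpose(1, 2)  # (B, N, n_embd)
            tokens = self.encoder(tokens + self.pos + t_emb[:, None, :])
            patches = self.proj_out(tokens)                      # (B, N, p*p*C)
            hp = H // self.patch
            patches = patches.view(B, hp, hp, self.patch, self.patch, C)
            return patches.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

    model = PatchDiT()
    eps_hat = model(torch.randn(2, 3, 32, 32), torch.randn(2, 384))

Since the network still maps a noisy image (plus t) to a noise prediction of the same shape, the training loop from the U-Net version carries over unchanged.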



Seems too simple. Isn't there also a temporal dimension you need to encode?


It's not really a separate dimension: the timestep t just enters as conditioning. For the conditioning and t there are different possibilities. For example, the unpooled text embeddings (if the model is conditioned on text) usually go into cross-attention, while the pooled text embedding plus t is used in adaLN blocks, like the modulation in StyleGAN (the first one); but there are many other strategies.
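
A rough sketch of that combination (adaLN modulation from the pooled conditioning, cross-attention over the unpooled text tokens), loosely in the style of DiT's adaLN-Zero; AdaLNBlock and all the shapes here are illustrative assumptions:

    import torch
    import torch.nn as nn

    class AdaLNBlock(nn.Module):
        """Transformer block: adaLN shift/scale/gate from (pooled text emb + t emb),
        plus cross-attention over the unpooled text embeddings."""
        def __init__(self, n_embd=384, n_head=6):
            super().__init__()
            self.norm1 = nn.LayerNorm(n_embd, elementwise_affine=False)
            self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
            self.norm2 = nn.LayerNorm(n_embd, elementwise_affine=False)
            self.cross = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
            self.norm3 = nn.LayerNorm(n_embd, elementwise_affine=False)
            self.mlp = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                                     nn.Linear(4 * n_embd, n_embd))
            # cond -> per-block shift/scale/gate; zero-init so the block starts
            # as (nearly) the identity, as in adaLN-Zero
            self.ada = nn.Linear(n_embd, 6 * n_embd)
            nn.init.zeros_(self.ada.weight)
            nn.init.zeros_(self.ada.bias)

        def forward(self, x, cond, text_tokens):
            # cond: (B, n_embd) = pooled text embedding + timestep embedding
            s1, b1, g1, s2, b2, g2 = self.ada(cond)[:, None, :].chunk(6, dim=-1)
            h = self.norm1(x) * (1 + s1) + b1                  # adaLN modulation
            x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
            h = self.norm2(x)                                  # unpooled text -> cross-attn
            x = x + self.cross(h, text_tokens, text_tokens, need_weights=False)[0]
            h = self.norm3(x) * (1 + s2) + b2
            return x + g2 * self.mlp(h)

    block = AdaLNBlock()
    out = block(torch.randn(2, 64, 384),   # image patch tokens
                torch.randn(2, 384),       # pooled text emb + t emb
                torch.randn(2, 77, 384))   # unpooled text embeddings

Dropping the cross-attention and instead concatenating the text tokens into the patch sequence would be one of the "other strategies".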



