>i'd love an extension of this for the diffusion transformer
All you need to do is replace the U-Net with a transformer encoder (remove the token embeddings and instead project the image patches into vectors of size n_embd), and the diffusion process itself can remain the same.
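As a rough sketch of that patch projection (the patch size, channel count, and n_embd below are made-up values, and PatchEmbed is just an illustrative name):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch_size=2, in_chans=4, n_embd=768):
        super().__init__()
        # A strided conv both cuts the image into patches and projects
        # each patch to an n_embd-dimensional token in one step.
        self.proj = nn.Conv2d(in_chans, n_embd,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, n_embd, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, n_embd)
```

From there the tokens go through the transformer blocks exactly like text tokens would, plus positional embeddings for the patch grid.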
For the conditioning and t there are different possibilities. For example, the unpooled text embeddings (if the model is conditioned on text) usually go into cross-attention, while the pooled text embedding plus t is fed through adaLN blocks, similar to the style modulation in the original StyleGAN. But there are many other strategies.
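Here's roughly what an adaLN-conditioned block can look like (a sketch in the spirit of DiT's adaLN-Zero; the module names and shapes are illustrative, with c being the pooled text embedding plus the timestep embedding):

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        # LayerNorms without learnable affine params: scale/shift come from c.
        self.norm1 = nn.LayerNorm(n_embd, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.norm2 = nn.LayerNorm(n_embd, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd))
        # Map the conditioning vector to per-block shift/scale/gate parameters.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(n_embd, 6 * n_embd))

    def forward(self, x, c):  # x: (B, T, n_embd), c: (B, n_embd)
        s1, g1, a1, s2, g2, a2 = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + g1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + a1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + g2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + a2.unsqueeze(1) * self.mlp(h)
        return x
```

In the adaLN-Zero variant the final linear layer of self.ada is initialized to zero, so each block starts out as the identity, which helps training stability.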