The data is mel spectrograms. To clarify the conditional labels: I was trying to get it to come up with a vector-quantized code, so it's not conditionally labeled; rather, I was using an embedding layer with a VQ layer so it could learn its own codebook. This works well with VQGAN, so I was surprised that with diffusion it just keeps collapsing all the codes to the same value and ignoring them, but maybe I'm doing something wrong. Still working on it.
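For reference, this is roughly the kind of VQ bottleneck I mean (a minimal PyTorch sketch, not my actual code; the codebook size, dimensions, and loss weights are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: snaps encoder features to the nearest codebook
    entry (as in VQ-VAE/VQGAN). Codebook size, dim, and beta are placeholders."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e):  # z_e: (batch, n_tokens, code_dim) encoder output
        flat = z_e.reshape(-1, z_e.shape[-1])                        # (B*N, D)
        # Squared L2 distance from each feature vector to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))                # (B*N, K)
        idx = dist.argmin(dim=1).view(z_e.shape[:-1])                # (B, N)
        z_q = self.codebook(idx)                                     # (B, N, D)
        # Codebook loss + commitment loss, as in VQ-VAE
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator so gradients flow back to the encoder
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, vq_loss
```

The quantized vectors (or the code indices) would then be fed to the diffusion UNet as the conditioning signal, alongside the timestep embedding.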
I'm just expressing here that my expectation was that this method would be less finicky than a GAN because it uses an MSE loss, but unfortunately it seems to have its own difficulties. No silver bullet, I guess. The integration during sampling can be quite sensitive to imperfections and can diverge easily, at least in the early stages of training.
I decided to write this because it feels like the early days of GANs: there seem to be lots of these "explain diffusion from scratch" articles out there, but not yet many discussing common pitfalls and how to deal with them.
I'm doing my thesis on diffusion models (for audio) right now and have run into a lot of the same things you mention. One paper I found illuminating was this one: https://proceedings.mlr.press/v139/nichol21a.html
Particularly relevant to the noisy training you mentioned earlier is the alternative timestep sampling procedure they propose, which seems to reduce gradient noise significantly, judging from their experiments. Would love to hear or discuss any other design changes you've found that improve training or sample quality :)
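For anyone else reading along: the idea, as I understand it, is to importance-sample the timestep t in proportion to a running estimate of the per-timestep loss instead of drawing t uniformly, and to re-weight the loss terms so the objective stays unbiased. A rough sketch of that (placeholder names, details from memory, not the authors' code):

```python
import numpy as np

class LossAwareTimestepSampler:
    """Sketch of loss-aware timestep sampling (in the spirit of Nichol & Dhariwal 2021).
    Keeps a short history of recent losses per timestep and samples t with
    probability proportional to sqrt(E[loss_t^2]); each loss term is re-weighted
    by 1 / (T * p(t)) so the expected objective matches uniform sampling."""
    def __init__(self, num_timesteps, history=10):
        self.num_timesteps = num_timesteps
        self.history = history
        self.losses = [[] for _ in range(num_timesteps)]

    def weights(self):
        # Fall back to uniform until every timestep has a full loss history
        if any(len(l) < self.history for l in self.losses):
            return np.full(self.num_timesteps, 1.0 / self.num_timesteps)
        w = np.array([np.sqrt(np.mean(np.square(l))) for l in self.losses])
        return w / w.sum()

    def sample(self, batch_size, rng=np.random):
        p = self.weights()
        t = rng.choice(self.num_timesteps, size=batch_size, p=p)
        return t, 1.0 / (self.num_timesteps * p[t])   # timesteps and importance weights

    def update(self, t, loss):
        # Record the (detached, per-example) loss for each sampled timestep
        for ti, li in zip(t, loss):
            self.losses[ti].append(float(li))
            if len(self.losses[ti]) > self.history:
                self.losses[ti].pop(0)
```

During training you would then draw `t, w = sampler.sample(batch_size)`, multiply the per-example losses by `w`, and call `sampler.update(t, per_example_losses)` each step.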
Thanks for the feedback, glad to hear I'm not completely crazy ;). I think I saw that paper cited in my reading but haven't read it in full; I'll take a look, thanks!
Some of the results I've had have been from trying to apply it using 1D unets (also audio). I am getting slightly better results now using larger (and more standard) 2D unets but it's really taking a long time to train, especially given that I'm still experimenting with a subset of my data.
I'm beginning to suspect that because it's learning to predict very small signal residuals, improvement in output quality is very incremental, in a way that is not directly correlated to the size or nature of the dataset. Like, even if I just train it on sinusoids, it takes a really long time to improve (compared to a GAN approach). None of these conclusions are very formal, mind you; I'd love to hear them confirmed. The training dynamics just seem very different from what I'm used to with either an MSE or a discriminative loss.
I see. What types of sampling methods are you using? IIRC they are different approaches to solving the diffusion ODE and creating a sample, but I've only played around with them during inference.
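For concreteness, this is the sort of deterministic ODE-style sampler I have in mind (a DDIM-style update with eta = 0; it assumes an epsilon-prediction model and a precomputed alpha-bar schedule, and the names are placeholders rather than any particular library's API):

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, num_steps=50, device="cpu"):
    """Rough sketch of deterministic DDIM sampling (eta = 0). Assumes
    model(x, t) predicts the added noise eps, and alpha_bar is a 1-D tensor
    of cumulative products of (1 - beta_t) over the full training schedule."""
    T = len(alpha_bar)
    # Evenly spaced subset of the training timesteps, from T-1 down to 0
    timesteps = torch.linspace(T - 1, 0, num_steps, device=device).long()
    x = torch.randn(shape, device=device)  # start from pure noise
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0, device=device)
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = model(x, t_batch)  # predicted noise
        # Estimate x0 from the current noisy sample, then step toward the next timestep
        x0 = (x - torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(ab_t)
        x = torch.sqrt(ab_prev) * x0 + torch.sqrt(1.0 - ab_prev) * eps
    return x
```

The stochastic (ancestral) DDPM sampler is the same idea but re-injects fresh noise at every step instead of stepping deterministically.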