The data is mel spectrograms. To clarify the conditional labels: I was trying to get it to come up with a vector-quantized code, so it's not conditionally labeled; rather, I was using an embedding layer with a VQ layer so it could learn its own codebook. This works well with VQGAN, so I was surprised that with diffusion it just keeps collapsing all the codes to the same value and ignoring them, but maybe I'm doing something wrong. Still working on it.
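For reference, this is roughly the kind of VQ bottleneck I mean (a minimal PyTorch sketch, not my actual code; the codebook size, dimensions, and loss weights are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: snaps encoder features to the nearest codebook
    entry (as in VQ-VAE/VQGAN). Codebook size, dim, and beta are placeholders."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e):  # z_e: (batch, n_tokens, code_dim) encoder output
        flat = z_e.reshape(-1, z_e.shape[-1])                        # (B*N, D)
        # Squared L2 distance from each feature vector to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))                # (B*N, K)
        idx = dist.argmin(dim=1).view(z_e.shape[:-1])                # (B, N)
        z_q = self.codebook(idx)                                     # (B, N, D)
        # Codebook loss + commitment loss, as in VQ-VAE
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator so gradients flow back to the encoder
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, vq_loss
```

The quantized vectors (or the code indices) would then be fed to the diffusion UNet as the conditioning signal, alongside the timestep embedding.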
I'm just expressing here that my expectation was that this method would be less finicky than a GAN because it uses an MSE loss, but unfortunately it seems to have its own difficulties. No silver bullet, I guess. The integration during sampling can be quite sensitive to imperfections and can diverge easily, at least in the early stages of training.
I decided to write this because it feels like the early days of GANs: there seem to be lots of these "explain diffusion from scratch" articles out there, but not yet many discussing common pitfalls and how to deal with them.
I'm doing my thesis on diffusion models (for audio) right now and have run into a lot of the same things you mention. One paper I found illuminating was this one: https://proceedings.mlr.press/v139/nichol21a.html
Particularly relevant to the noisy training you mentioned earlier is the alternative timestep sampling procedure they propose, which seems to reduce gradient noise significantly, judging from their experiments. Would love to hear or discuss any other design changes you've found that improve training or sample quality :)
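For anyone else reading along: the idea, as I understand it, is to importance-sample the timestep t in proportion to a running estimate of the per-timestep loss instead of drawing t uniformly, and to re-weight the loss terms so the objective stays unbiased. A rough sketch of that (placeholder names, details from memory, not the authors' code):

```python
import numpy as np

class LossAwareTimestepSampler:
    """Sketch of loss-aware timestep sampling (in the spirit of Nichol & Dhariwal 2021).
    Keeps a short history of recent losses per timestep and samples t with
    probability proportional to sqrt(E[loss_t^2]); each loss term is re-weighted
    by 1 / (T * p(t)) so the expected objective matches uniform sampling."""
    def __init__(self, num_timesteps, history=10):
        self.num_timesteps = num_timesteps
        self.history = history
        self.losses = [[] for _ in range(num_timesteps)]

    def weights(self):
        # Fall back to uniform until every timestep has a full loss history
        if any(len(l) < self.history for l in self.losses):
            return np.full(self.num_timesteps, 1.0 / self.num_timesteps)
        w = np.array([np.sqrt(np.mean(np.square(l))) for l in self.losses])
        return w / w.sum()

    def sample(self, batch_size, rng=np.random):
        p = self.weights()
        t = rng.choice(self.num_timesteps, size=batch_size, p=p)
        return t, 1.0 / (self.num_timesteps * p[t])   # timesteps and importance weights

    def update(self, t, loss):
        # Record the (detached, per-example) loss for each sampled timestep
        for ti, li in zip(t, loss):
            self.losses[ti].append(float(li))
            if len(self.losses[ti]) > self.history:
                self.losses[ti].pop(0)
```

During training you would then draw `t, w = sampler.sample(batch_size)`, multiply the per-example losses by `w`, and call `sampler.update(t, per_example_losses)` each step.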
Thanks for the feedback, glad to hear I'm not completely crazy ;). I think I saw that paper cited in my reading but haven't read it in full; I'll take a look, thanks!
Some of the results I've had have been from trying to apply it using 1D unets (also audio). I am getting slightly better results now using larger (and more standard) 2D unets but it's really taking a long time to train, especially given that I'm still experimenting with a subset of my data.
I'm beginning to suspect that because it's learning to predict very small signal residuals, improvement in output quality is very incremental, in a way that is not directly correlated to the size or nature of the dataset. Like, even if I just train it on sinusoids, it takes a really long time to improve (compared to a GAN approach). None of these conclusions are very formal, mind you; I'd love to hear them confirmed. The training dynamics just seem very different from what I'm used to with either an MSE or a discriminative loss.
I see. What types of sampling methods are you using? IIRC they are different approaches to solving the diffusion ODE and creating a sample, but I've only played around with them during inference.
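For concreteness, this is the sort of deterministic ODE-style sampler I have in mind (a DDIM-style update with eta = 0; it assumes an epsilon-prediction model and a precomputed alpha-bar schedule, and the names are placeholders rather than any particular library's API):

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, num_steps=50, device="cpu"):
    """Rough sketch of deterministic DDIM sampling (eta = 0). Assumes
    model(x, t) predicts the added noise eps, and alpha_bar is a 1-D tensor
    of cumulative products of (1 - beta_t) over the full training schedule."""
    T = len(alpha_bar)
    # Evenly spaced subset of the training timesteps, from T-1 down to 0
    timesteps = torch.linspace(T - 1, 0, num_steps, device=device).long()
    x = torch.randn(shape, device=device)  # start from pure noise
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0, device=device)
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = model(x, t_batch)  # predicted noise
        # Estimate x0 from the current noisy sample, then step toward the next timestep
        x0 = (x - torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(ab_t)
        x = torch.sqrt(ab_prev) * x0 + torch.sqrt(1.0 - ab_prev) * eps
    return x
```

The stochastic (ancestral) DDPM sampler is the same idea but re-injects fresh noise at every step instead of stepping deterministically.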