Hacker News

Diffusion models are now SOTA in audio and image generation. Has anyone given them a shot on text?

Audio is more similar to language than images are, because of its stronger time dependency.

The paper says the critical step for making diffusion models work on audio was splitting the signal into frequency bands and applying diffusion to each band separately (the full-band model had limitations due to poor modeling of correlations between low-frequency and high-frequency features).
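A minimal sketch of the band-splitting idea, using a crude FFT brick-wall mask as a stand-in for whatever filter bank the actual model uses (the cutoff, sample rate, and toy signal here are all illustrative assumptions):

```python
import numpy as np

# Toy waveform: a low-frequency tone plus a high-frequency tone (hypothetical).
sr = 16_000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 4000 * t)

# Split at an assumed cutoff frequency by zeroing FFT bins on either side.
cutoff_hz = 1000
spec = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
low = np.fft.irfft(np.where(freqs <= cutoff_hz, spec, 0), n=len(signal))
high = np.fft.irfft(np.where(freqs > cutoff_hz, spec, 0), n=len(signal))

# The split is linear, so the bands sum back to the original signal;
# separate per-band diffusion models can be recombined by simple addition.
assert np.allclose(low + high, signal, atol=1e-8)
```

The point is only that a lossless split lets each band get its own noise schedule and model, with recombination being trivial.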

I think something could be done on the text side as well.



There are two problems with this. Diffusion models rest on a single rule of thumb: if you keep applying small, noisy Gaussian steps to a "nice" distribution many times, you end up with an isotropic Gaussian.
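That rule of thumb can be demonstrated numerically. A minimal sketch of the forward diffusion process, with an illustrative step count and per-step variance (both hypothetical choices, not taken from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Start from a decidedly non-Gaussian "nice" distribution: uniform samples.
x = rng.uniform(-1.0, 1.0, size=100_000)

beta = 0.02  # per-step noise variance (illustrative value)
for _ in range(500):
    # Each step shrinks the signal slightly and adds small Gaussian noise;
    # the scaling keeps the variance bounded near 1 in the limit.
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# After many steps the marginal is close to a standard Gaussian:
# mean near 0, standard deviation near 1.
```

The forward process is trivial for any continuous data; the hard part, as below, is saying what the analogous "small noisy step" even means for discrete tokens.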

So, for text: a) what is the equivalent of a small, noisy step? and b) what is the equivalent of an isotropic Gaussian in language space?

If you can solve a and b, you can make diffusion work for text, but there hasn't been any significant progress there afaik.



