
Could you or someone else kindly explain how it works for diffusion-based LLMs, in the style of the GP? What happens, for instance, if I give the prompt 'Describe Berlin' as input, or maybe the more fitting 'Berlin city on a sunny day'?


Many diffusion-based approaches have been tried for language. I like a diffusion style that goes a bit like this:

* empty string

* Berlin

* Berlin city

* Berlin is a city

* Berlin is a city in Germany

* Berlin is the capital of Germany

* Berlin is the capital of Germany, located in the North East.

* Berlin is the capital. It is located in the North East of Germany.

* Berlin is the populous capital city of Germany. It is in the North East of Germany, by the river Spree.

* Berlin is the capital city of Germany with 3.6 million inhabitants. It is located in the North East of Germany. It's centred on the river Spree.

...

At every step, the diffusion slightly rephrases the previous level's text, adding more information and detail by inserting, replacing, and deleting tokens.

Sampling from these diffusion-style models is usually more costly (unlike diffusion for images, where, unlike text, the output has a fixed size). But one might imagine this approach scaling up to generate texts of arbitrary length: just keep adding more and more detail to your text, and you end up with a book or a coherent novel.
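Very roughly, the sampling loop could look something like this in Python (the expand function here is a made-up stand-in for whatever model performs one rephrasing step, not a real API):

    def sample(expand, num_steps=10):
        # Start from the "fully noised" state: an empty string.
        draft = ""
        for _ in range(num_steps):
            # One denoising step: a slightly more detailed rewrite of the
            # draft, produced by inserting, replacing and deleting tokens.
            new_draft = expand(draft)
            if new_draft == draft:
                # The model considers the text finished.
                break
            draft = new_draft
        return draft

    # Example trajectory, as above:
    #   "" -> "Berlin" -> "Berlin is a city" -> "Berlin is the capital of Germany" -> ...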


I haven't studied diffusion methods in great detail, but my simplistic ELI5 understanding goes something like this: during training, you gradually add noise to an image, step by step, and the model learns how to remove the noise at each step. Adding noise to an image is, of course, very easy to do.
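For images, that training recipe can be sketched in a few lines of PyTorch; the model and noise schedule below are placeholders, loosely following the standard DDPM setup rather than any specific implementation:

    import torch
    import torch.nn.functional as F

    # Toy sketch of DDPM-style training: pick a random timestep t, add that
    # much Gaussian noise to a clean image, and train the model to predict
    # the noise that was added.
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    def training_step(model, x0):
        # x0: a batch of clean images, shape (B, C, H, W)
        t = torch.randint(0, T, (x0.shape[0],))
        noise = torch.randn_like(x0)                    # adding noise is the easy part
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        xt = a.sqrt() * x0 + (1.0 - a).sqrt() * noise   # noised image at step t
        pred = model(xt, t)                             # model learns to undo the noise
        return F.mse_loss(pred, noise)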

But for this approach, adding "noise" would be much less straightforward, wouldn't it? You'd need some way of taking "Berlin is the capital city of Germany with ..." down to an empty string that gradually removes detail. For some passages this could be rather difficult. I feel like gathering training data would be a huge challenge here.


Yes, the noising and denoising steps in this approach are summarizing and extending. The idea is that (unlike images) any piece of text lies somewhere on a length scale, like a fractal: you can take the whole text and treat it as a large-scale object, or take shorter sentences and treat them on the short scale. You're right that you need a dataset of things described twice, like abstracts and papers, or messages and their tl;drs.
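As a rough sketch, such paired data could be turned into training examples for the "extend" direction like this (the record fields are made up for illustration; any paired corpus would do):

    # Build (shorter, longer) pairs from text that exists at two levels of
    # detail, e.g. abstract vs. paper, or tl;dr vs. full message.
    def make_training_pairs(records):
        pairs = []
        for rec in records:
            # "Noised" input is the short version; target is the detailed one.
            pairs.append({"input": rec["abstract"], "target": rec["full_text"]})
            # The fully noised end of the chain: empty string -> short version.
            pairs.append({"input": "", "target": rec["abstract"]})
        return pairs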


Another approach I've seen suggested is to start with what's essentially a randomly initialized embedding vector, along with a way to turn any embedding vector into text. You then diffuse the embedding vector for a number of steps before turning it into text at the end. This is kinda similar to what Stable Diffusion does with images, but it's a bit trickier to get working well with text than with pixels.
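Roughly, generation in that scheme could look like this (denoise and decode_to_text are hypothetical components, not any particular library's API):

    import torch

    # Sketch of latent-space text diffusion: denoise a random embedding for a
    # number of steps, and only decode it into tokens at the very end.
    def generate(denoise, decode_to_text, dim=768, steps=50):
        z = torch.randn(dim)            # randomly initialized embedding vector
        for t in reversed(range(steps)):
            z = denoise(z, t)           # one diffusion step in embedding space
        return decode_to_text(z)        # turn the final embedding into text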


> but it's a bit trickier to get working well with text than with pixels

Interesting. Can't you just start with 4K randomly assigned tokens and tweak them until they're right (including the EOF tokens), like they do for images, where they start with noise and move towards an image?


Does anyone know why this isn't possible?


This is not my corner of DL research, but my understanding is that pixel noise is far more acceptable in, e.g., a 1084 x 1084 canvas than it is when dealing with a 36-word sentence.


I think you'd start out with gibberish, and at each step the output would become more refined: first you'd maybe see scrambled words emblematic of Berlin, which would then eventually synthesize into a sentence. I think you'd have a fixed length of text, with maybe some way for the model to mark part of it as truncated.
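Something like this, very loosely (model, vocab_size, and the padding/EOF handling are all placeholders, not a real implementation):

    import torch

    # Sketch of fixed-length discrete refinement: start from random token ids
    # ("gibberish") and let the model re-predict every position for a few
    # rounds. Padding/EOF tokens would mark the unused tail of the buffer.
    def generate(model, vocab_size, length=64, steps=10):
        tokens = torch.randint(0, vocab_size, (length,))
        for _ in range(steps):
            logits = model(tokens)          # shape (length, vocab_size)
            tokens = logits.argmax(dim=-1)  # refine every position at once
        return tokens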




