It's the nature of ML models: nobody is 100% sure what a model understands until they try something and get results.
It was given a lot of tagged data: 600 million captioned images from LAION-5B. So if you want to know what it might support, you could try any one of the captions from those 600 million images.
But why isn't the list of words from those captions available anywhere (at least as far as I can tell)? There may be 600 million captions, but the number of unique words would probably be 10 or 20 thousand at most, completely feasible to browse or grep.
SD isn't a person you can converse with. It's just a program trained on captions and can do no more than what's in them. It's like those old adventure games that would always complain "I don't know that word", except even worse, because SD will happily make a picture with words it doesn't know and not tell you.
I think anyone who has both played an IF game and has played with stable diffusion knows there is a world of difference between the two.
The main difference is that coming up with a non-contrived word SD doesn't know is really difficult. In an IF game, you are constantly guessing the correct word.
I haven't downloaded the database myself, but I imagine if you did it wouldn't be too hard to get that data. Looks like you can get the torrent here https://laion.ai/blog/laion-400-open-dataset/
I don't think the underlying model is word based, but character based. You could download the caption data for LAION and grep that, but it's not strictly 1:1 with what SD was trained against.
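For what it's worth, grepping the caption metadata is pretty mechanical once you have the parquet files on disk. Here's a rough sketch; the laion-meta/ directory and the "TEXT" column name are assumptions based on the LAION-400M metadata release, so adjust them to whatever you actually download:

```python
# Rough sketch: count and sample LAION captions containing a word.
# Assumes downloaded metadata parquet shards in ./laion-meta/ and a
# caption column named "TEXT" (as in the LAION-400M metadata files).
import glob
import pandas as pd

QUERY = "aardvark"  # word or phrase to look for
matches = 0

for path in glob.glob("laion-meta/*.parquet"):
    # Load only the caption column to keep memory use down.
    df = pd.read_parquet(path, columns=["TEXT"])
    hits = df["TEXT"].str.contains(QUERY, case=False, na=False)
    matches += int(hits.sum())
    # Show a few matching captions from this shard.
    for caption in df.loc[hits, "TEXT"].head(3):
        print(caption)

print(f"{matches} captions contain {QUERY!r}")
```

Keep in mind, as said above, that whatever you find in those captions is not strictly 1:1 with what SD was trained against.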
Huh, interesting, I had just ... assumed CLIP's tokenizer was character based, like GPT's. At least, I think GPT's is character based?
Is there any reason it couldn't be character based, besides the (presumably very large) increase in resources needed to train and run inference? This is all way out of my league, but it seems like you could get interesting results from this, since (by my caveman understanding) this hypothetical transformer could make some sense of words it had never seen before, like spelling variants or neologisms and such.
I started a proper reply but had to board a plane.
It's actually a byte-pair encoded list of tokens that includes whole words (BPE is better than character encoding but can still do the things you mentioned). You can find common English suffixes listed separately in it too.
Thanks for the responses, I really appreciate the help. My only background with ML is playing with LSTMs and simple sequence-to-sequence models back before transformers, and the last few days I've been trying to deep dive as much as I can into the "state-of-the-art". I dislike treating the technology as a magical black box...
GPT (and many other modern NLP models) use byte-pair encoding. Your summary of the benefits of this is correct - it can deal with novel words much better.
Byte-pair encoding (BPE) is better than character encoding because it can deal with unicode (and emojis).
CLIP uses a BPE encoding of the vocabulary: "The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size."
So strictly this vocabulary is NOT (just) words; it is common sequences of byte pairs. You can see this if you examine the vocabulary - you'll find things like "tive", which isn't a word but is a very common English suffix.
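If you want to poke at that vocabulary yourself, here's a minimal sketch using the Hugging Face transformers tokenizer (the checkpoint name is my assumption; any standard CLIP checkpoint ships the same ~49k BPE vocab, which is what SD's text encoder uses, but double-check against your setup):

```python
# Minimal sketch: inspect CLIP's BPE vocabulary for sub-word pieces.
# Assumes the Hugging Face `transformers` package; the checkpoint
# name below is an assumption -- any standard CLIP tokenizer should
# expose the same ~49k BPE vocabulary.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
vocab = tok.get_vocab()  # dict: token string -> integer id

print(len(vocab))             # roughly 49k entries
print(vocab.get("tive"))      # a bare suffix piece, not a word
print(vocab.get("tive</w>"))  # same piece in word-final position, if present
```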
Thank you. This is really helpful. Yes, you don't know exactly how SD will respond, but for example you can grep for celebrity names and know whether SD has any chance of drawing a picture with them in it, rather than just randomly guessing.
It's a word list, so as I'm sure you've already figured out, you have to grep first and last names separately. For example, "jennifer" as a first name is token 19786, while "garner</w>" is token 20340. If you want "james garner" instead, looks like that's tokens 6963 and 20340. Except, since it's a word list, there's still no guarantee that either celebrity is necessarily represented until you try.
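Rather than grepping the raw vocab file, you can also just ask the tokenizer how it splits a name: a word it knows comes back as a single "xxx</w>" token, while an unfamiliar name gets chopped into several sub-word pieces. A sketch using the Hugging Face CLIPTokenizer (checkpoint name assumed, as above):

```python
# Sketch: see how CLIP's BPE splits a prompt into tokens and ids.
# A name the vocab knows as a whole word comes back as one "xxx</w>"
# token; otherwise it's broken into multiple sub-word pieces.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

for name in ["jennifer garner", "james garner"]:
    ids = tok.encode(name, add_special_tokens=False)
    pieces = tok.convert_ids_to_tokens(ids)
    print(name, "->", list(zip(pieces, ids)))
```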
49,407 tokens, many of which are not useful. It's an arduous process to narrow down, so that link takes the opposite approach, working from zero up rather than 49,407 down.