It's the nature of ML models: nobody is 100% sure what a model understands until they try something and get results.
It was given a lot of tagged data: 600 million captioned images from LAION-5B. So if you want to know what it might support, you could try any one of the captions from those 600 million images.
But why isn't the list of words from those captions available anywhere (at least as far as I can tell)? There may be 600 million captions, but the number of unique words would probably be 10 or 20 thousand at most, completely feasible to browse or grep.
SD isn't a person you can converse with. It's just a program trained on captions and can do no more than what's in them. It's like those old adventure games that would always complain "I don't know that word", except even worse, because SD will happily make a picture with words it doesn't know and not tell you.
I think anyone who has both played an IF game and has played with stable diffusion knows there is a world of difference between the two.
The main difference is that coming up with a non-contrived word SD doesn't know is really difficult. In an IF game, you are constantly guessing the correct word.
I haven't downloaded the database myself, but I imagine if you did it wouldn't be too hard to get that data. Looks like you can get the torrent here https://laion.ai/blog/laion-400-open-dataset/
I don't think the underlying model is word based, but character based. You could download the caption data for LAION and grep that, but it's not strictly 1:1 with what SD was trained against.
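For what it's worth, grepping the caption metadata is pretty mechanical once you have the parquet files on disk. Here's a rough sketch; the laion-meta/ directory and the "TEXT" column name are assumptions based on the LAION-400M metadata release, so adjust them to whatever you actually download:

```python
# Rough sketch: count and sample LAION captions containing a word.
# Assumes downloaded metadata parquet shards in ./laion-meta/ and a
# caption column named "TEXT" (as in the LAION-400M metadata files).
import glob
import pandas as pd

QUERY = "aardvark"  # word or phrase to look for
matches = 0

for path in glob.glob("laion-meta/*.parquet"):
    # Load only the caption column to keep memory use down.
    df = pd.read_parquet(path, columns=["TEXT"])
    hits = df["TEXT"].str.contains(QUERY, case=False, na=False)
    matches += int(hits.sum())
    # Show a few matching captions from this shard.
    for caption in df.loc[hits, "TEXT"].head(3):
        print(caption)

print(f"{matches} captions contain {QUERY!r}")
```

Keep in mind, as said above, that whatever you find in those captions is not strictly 1:1 with what SD was trained against.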
Huh, interesting, I had just ... assumed CLIP's tokenizer was character based, like GPT's. At least, I think GPT's is character based?
Is there any reason it couldn't be character based, besides the (presumably very large) increase in resources needed to train and run inference? This is all way out of my league, but it seems like you could get interesting results from this, since (by my caveman understanding) this hypothetical transformer could make some sense of words it had never seen before, like spelling variants or neologisms and such.
I started a proper reply but had to board a plane.
It's actually a byte-pair encoded list of tokens that includes whole words (BPE is better than character encoding but can still do the things you mentioned). You can find common English suffixes listed separately in it too.
Thanks for the responses, I really appreciate the help. My only background with ML is playing with LSTMs and simple sequence-to-sequence models back before transformers, and the last few days I've been trying to deep dive as much as I can into the "state-of-the-art". I dislike treating the technology as a magical black box...
GPT (and many other modern NLP models) use byte-pair encoding. Your summary of the benefits of this is correct - it can deal with novel words much better.
Byte-pair encoding (BPE) is better than character encoding because it can deal with unicode (and emojis).
CLIP uses a BPE encoding of the vocabulary: "The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size."
So strictly this vocabulary is NOT (just) words; it is common sequences of byte pairs. You can see this if you examine the vocabulary - you'll find things like "tive", which isn't a word but is a very common English suffix.
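If you want to poke at that vocabulary yourself, here's a minimal sketch using the Hugging Face transformers tokenizer (the checkpoint name is my assumption; any standard CLIP checkpoint ships the same ~49k BPE vocab, which is what SD's text encoder uses, but double-check against your setup):

```python
# Minimal sketch: inspect CLIP's BPE vocabulary for sub-word pieces.
# Assumes the Hugging Face `transformers` package; the checkpoint
# name below is an assumption -- any standard CLIP tokenizer should
# expose the same ~49k BPE vocabulary.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
vocab = tok.get_vocab()  # dict: token string -> integer id

print(len(vocab))             # roughly 49k entries
print(vocab.get("tive"))      # a bare suffix piece, not a word
print(vocab.get("tive</w>"))  # same piece in word-final position, if present
```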
Thank you. This is really helpful. Yes, you don't know exactly how SD will respond, but for example you can grep for celebrity names and know whether SD has any chance of drawing a picture with them in it, rather than just randomly guessing.
It's a word list, so as I'm sure you've already figured out, you have to grep first and last names separately. For example, "jennifer" as a first name is token 19786, while "garner</w>" is token 20340. If you want "james garner" instead, looks like that's tokens 6963 and 20340. Except, since it's a word list, there's still no guarantee that either celebrity is necessarily represented until you try.
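Rather than grepping the raw vocab file, you can also just ask the tokenizer how it splits a name: a word it knows comes back as a single "xxx</w>" token, while an unfamiliar name gets chopped into several sub-word pieces. A sketch using the Hugging Face CLIPTokenizer (checkpoint name assumed, as above):

```python
# Sketch: see how CLIP's BPE splits a prompt into tokens and ids.
# A name the vocab knows as a whole word comes back as one "xxx</w>"
# token; otherwise it's broken into multiple sub-word pieces.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

for name in ["jennifer garner", "james garner"]:
    ids = tok.encode(name, add_special_tokens=False)
    pieces = tok.convert_ids_to_tokens(ids)
    print(name, "->", list(zip(pieces, ids)))
```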
49,407 tokens, many of which are not useful. It's an arduous process to narrow down, so that link takes the opposite approach, working from zero up rather than 49,407 down.