Thanks for the responses, I really appreciate the help. My only background in ML is playing with LSTMs and simple sequence-to-sequence models back before transformers, and for the last few days I've been trying to dive as deep as I can into the "state-of-the-art". I dislike treating the technology as a magical black box...
GPT (like many other modern NLP models) uses byte-pair encoding. Your summary of its benefits is right: it can deal with novel words much better.
Byte-pair encoding (BPE) also has an edge over character-level encoding: because it can fall back to raw bytes, it can represent any Unicode text (emojis included) without needing a dedicated token for every possible character.
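To make that concrete, here's a quick sketch, assuming the Hugging Face `transformers` library and the standard GPT-2 byte-level BPE tokenizer (the exact pieces you get back depend on that tokenizer's learned merge table, and the made-up word is just an illustration):

```python
# Sketch: how a byte-level BPE tokenizer handles a novel word and an emoji.
# Assumes: pip install transformers  (the GPT-2 tokenizer is byte-level BPE).
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# A made-up word isn't in the vocabulary, so it gets split into smaller
# subword pieces that the merge table does know about.
print(tokenizer.tokenize("glorbification"))

# An emoji usually has no single token of its own; byte-level BPE falls back
# to pieces covering its UTF-8 bytes, so nothing is ever "out of vocabulary".
print(tokenizer.tokenize("hello 🙂"))
```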
CLIP uses a BPE vocabulary as well. From the paper: "The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size."
So strictly speaking this vocabulary is NOT (just) words; it's made up of byte sequences that co-occur frequently, built by repeatedly merging the most common pairs. You can see this if you examine the vocabulary: you'll find entries like "tive", which isn't a word but is a very common English suffix.
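If you want to poke at this yourself, here's a rough sketch using Hugging Face's `CLIPTokenizer` with the `openai/clip-vit-base-patch32` checkpoint (an assumption on my part; the exact splits and vocab entries are whatever that checkpoint ships with, and CLIP's tokens typically carry a trailing `</w>` marker on word-final pieces):

```python
# Sketch: inspect CLIP's BPE vocabulary and see how text gets split.
# Assumes: pip install transformers
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Common words tend to stay whole; rarer words break into subword pieces.
print(tokenizer.tokenize("a photo of an informative diagram"))

# Scan the vocabulary itself for suffix-like entries containing "tive".
vocab = tokenizer.get_vocab()
print([tok for tok in vocab if "tive" in tok][:10])
```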