This is a good idea, though one problem is that Einsum notation (as realized in Numpy and Pytorch) doesn't support the notion of co-contravariance, and the site is based on their Einsum notation. I could potentially add the variances for the examples, though that would move away from how the site currently works (where the information about the reduction comes only from the einsum input).
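For example (my own illustration, not from the site), numpy's einsum spells a contraction the same way whether an index is "upper" or "lower"; the subscript string carries no variance information:

```python
import numpy as np

g = np.eye(3)                        # pretend this is a metric g_ij
v = np.array([1.0, 2.0, 3.0])        # and this is a vector v^j

# "Lowering an index" and an ordinary matrix-vector product look identical;
# nothing in the subscripts says which indices are covariant or contravariant.
lowered = np.einsum('ij,j->i', g, v)
print(lowered)
```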
o1 is an application of the Bitter Lesson. To quote Sutton: "The two methods that seem to scale arbitrarily in this way are *search* and learning." (emphasis mine -- in the original Sutton also emphasized learning).
OpenAI and others have previously pushed the learning side, while neglecting search. Now that gains from adding compute at training time have started to level off, they're adding compute at inference time.
Vision Transformers do a shocking amount of compression in the tokenizer. In the [Chameleon paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer "encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192". That's 256 pixels per token (512 * 512 / 1024). If we assume that a pixel is 24 bits (3 × 8-bit channels), this implies that they've compressed 256 * 24 = 6144 bits into log2(8192) = 13 bits per token. [An Image is Worth 32 Tokens for Reconstruction and Generation](https://yucornetto.github.io/projects/titok.html) pushes this even further. If these models work similarly, it's no wonder they struggle with some vision tasks.
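A back-of-envelope sketch of that arithmetic (the 3 × 8-bit-per-pixel assumption is mine; the image size, token count, and codebook size are from the paper):

```python
import math

pixels = 512 * 512        # image size from the Chameleon paper
tokens = 1024             # discrete tokens per image
codebook = 8192           # codebook size

pixels_per_token = pixels // tokens          # 256
bits_in = pixels_per_token * 3 * 8           # 6144 bits of raw RGB per token
bits_out = math.log2(codebook)               # 13 bits per discrete token
print(pixels_per_token, bits_in, bits_out, bits_in / bits_out)  # 256 6144 13.0 ~472.6
```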
It’s not as simple as that. If you ask GPT-4o to create a copy of these images, it generally creates one faithfully (e.g. an image with 5 squares will be produced), so it’s “seeing” things reasonably enough.
It doesn’t seem to have the logic to answer these questions, though.
GPT-4o is very good at some visual tasks like optical character recognition. So the selective blindness might just be what you say here -- all of its capacity is dedicated to minimizing loss on a few narrow tasks that had the most training data (like OCR). So it's not necessarily an inherent failure of the architecture to generalize; it could just be a capacity issue that will naturally be resolved with more scale.
for some reason I started thinking about trying to describe the taste of a fruit to someone who hasn't tried it, as a non-visual sensory analogue of this in humans
* the cl100k_base tokenizer has ~100k tokens -- previous tokenizers had ~50k. (enc.n_vocab gives 100277, but some token ids in that range aren't valid, starting at 100256)
* it has exactly 1110 tokens which are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1000 three-digit tokens! (none have preceding spaces). this is a big improvement over GPT-2's tokenizer, which handled numbers as a mess (a quick tiktoken check is sketched after this list)
* there are <|fim_prefix|>, <|fim_middle|>, and <|fim_suffix|> tokens (see Efficient Training of Language Models to Fill in the Middle)
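A minimal sketch of checking those digit-token counts, assuming the tiktoken package is installed:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # 100277, though ids from 100256 up aren't all ordinary tokens

# Count tokens whose text is purely ASCII digits. Ids 0..100255 are the
# ordinary (non-special) tokens in cl100k_base.
digit_tokens = 0
for i in range(100256):
    if enc.decode_single_token_bytes(i).isdigit():
        digit_tokens += 1
print(digit_tokens)  # 1110 = 10 + 100 + 1000
```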
The biggest news to me is the improved handling of numbers. This could explain some improved performance on arithmetic. One disappointment is that it tokenizes from the front, e.g. "1000000" -> 100|000|0, whereas grouping from the back (1|000|000) would align the tokens with place value. This is one of those "so close!" moments -- I would work for free to fix this.
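You can see the front-first split for yourself (again a minimal sketch assuming tiktoken):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("1000000")
print([enc.decode([i]) for i in ids])  # splits from the front: ['100', '000', '0']
```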
The link in the paper to their Java implementation is now broken: does anyone have a current link?