
I wrote an OCaml implementation of this paper a few years ago, which I've now extracted into its own [repo](https://github.com/joelburget/constructive-reals/blob/main/C...)

The link in the paper to their Java implementation is now broken: does anyone have a current link?


And more recently, [Language Models Use Trigonometry to Do Addition](https://arxiv.org/abs/2502.00873)


This is a good idea, though one problem is that einsum notation (as realized in NumPy and PyTorch) doesn't distinguish covariant from contravariant indices, and the site is built on that einsum notation. I could potentially annotate the examples with variances, though that would move away from how the site currently works (where all the information about the reduction comes from the einsum input alone).
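To illustrate what I mean, here's a minimal NumPy sketch (not part of the site): the subscript string only says which indices are contracted, not whether each index is covariant or contravariant, so raising/lowering has to be done explicitly with a metric and the notation doesn't record it.

    import numpy as np

    g = np.diag([1.0, -1.0, -1.0, -1.0])  # an example metric
    v = np.array([1.0, 2.0, 3.0, 4.0])    # contravariant components v^j

    # "ij,j->i" contracts g with v, lowering the index (v^j -> v_i),
    # but nothing in the subscript string expresses that change of variance.
    v_lower = np.einsum("ij,j->i", g, v)
    print(v_lower)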


A couple of these I'd like references on if anyone happens to have them.

1. "current science suggests that the actual health impact from consuming most types of plastic might well be essentially zero"

2. "the (weak) evidence we have now suggests running strengthens your knees"


o1 is an application of the Bitter Lesson. To quote Sutton: "The two methods that seem to scale arbitrarily in this way are *search* and learning." (emphasis mine -- in the original Sutton also emphasized learning).

OpenAI and others have previously pushed the learning side, while neglecting search. Now that gains from adding compute at training time have started to level off, they're adding compute at inference time.


I think the key part of the bitter lesson is that (scalable) ability to learn from data should be favored over built-in biases.

There are at least three major built-in biases in GPT-O1:

- specific reasoning heuristics hard coded in the RL decision making

- the architectural split between pre-trained LLM and what appears to be a symbolic agent calling it

- the reliance on one-time SGD driven learning (common to all these pre-trained transformers)

IMO search (reasoning) should be an emergent behavior of a predictive architecture capable of continual learning - chained what-if prediction.


Vision Transformers do a shocking amount of compression in the tokenizer. In the [Chameleon paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer "encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192". That's 256 pixels per token (512 * 512 / 1024). If we assume that a pixel is 24 bits (3x 8 bit channels), this implies that they've compressed 256 * 24 = 6144 bits into 13 bits (log2(8192)). [An Image is Worth 32 Tokens for Reconstruction and Generation](https://yucornetto.github.io/projects/titok.html) pushes this even further. If these models work similarly, it's no wonder they struggle with some vision tasks.
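Spelling out the arithmetic as a quick sanity check in Python:

    import math

    # Chameleon tokenizer figures from the paper:
    # a 512x512 image -> 1024 tokens from a codebook of size 8192
    pixels = 512 * 512
    tokens = 1024
    codebook = 8192

    pixels_per_token = pixels / tokens            # 256 pixels per token
    bits_in = pixels_per_token * 24               # 6144 bits of raw RGB per token
    bits_out = math.log2(codebook)                # 13 bits per discrete token
    print(pixels_per_token, bits_in, bits_out, bits_in / bits_out)
    # 256.0 6144.0 13.0 -> roughly 470x compression per token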


It’s not as simple as that. If you ask GPT-4o to create a copy of these images, it generally creates one faithfully (e.g. an image with 5 squares will be produced), so it’s “seeing” things reasonably enough.

It doesn’t seem to have the logic though to answer these questions.

The complete data set is here to play around with it yourself: https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...


GPT-4o is very good at some visual tasks like optical character recognition. So the selective blindness might just be what you say here -- all of its capacity is dedicated to minimizing loss on a few narrow tasks that had the most training data (like OCR). So it's not necessarily an inherent failure of the architecture to generalize, it could just be a capacity issue that will naturally be resolved with more scale.


Is that not just traditional OCR applied on top of LLM?


It's possible they have a software layer that does that. But I was assuming they don't, because the open source multimodal models don't.


No it’s not, it’s a multimodal transformer model.


For some reason this got me thinking about trying to describe the taste of a fruit to someone who has never tried it, as a non-visual human sensory analogue of this kind of gap.


Vision transformers should be our default guess as to how GPT-4o works, yet this article never mentions them.


It works on all human languages, just inefficiently. I ran it over a sample I found on wikipedia:

    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")  # assuming the cl100k_base encoding
    sample = "ฟองมันฟันหนู, ฟันหนูฟองมัน, ฝนทองฟองมัน"
    len(sample), len(enc.encode(sample))
This returns `39, 40`, so it's encoding roughly one token per character. It's probably like this for almost all non-English text.


Yeah, at least it does it with Russian


A few interesting findings:

* the cl100k_base tokenizer has ~100k tokens -- previous tokenizers had ~50k. (enc.n_vocab gives 100277 but some numbers in that range don't work, starting at 100256)

* it has exactly 1110 tokens which are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1000 three-digit tokens! (none have preceding spaces). This is a big improvement over GPT-2's tokenizer, which was a mess.

* there are <|fim_prefix|>, <|fim_middle|>, and <|fim_suffix|> tokens (see Efficient Training of Language Models to Fill in the Middle)

The biggest news to me is the improved handling of numbers, which could explain some of the improved performance on arithmetic. One disappointment is that it groups digits from the front, e.g. "1000000" -> 100|000|0, whereas grouping from the back (1|000|000) would align tokens with place value. This is one of those "so close!" moments -- I would work for free to fix this.
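A quick way to check the digit grouping yourself (a small sketch assuming the tiktoken package is installed):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("1000000")
    # Decoding each token separately shows the grouping; per the example
    # above, this splits from the front as 100|000|0.
    print([enc.decode([i]) for i in ids])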


For those looking to run this on a Mac, the following seems to have worked for me (M1, Big Sur 11.2.3):

```
brew tap gcenx/wine
brew install --cask --no-quarantine wine-crossover
brew install winetricks
winetricks corefonts vcrun6 vb5run native_oleaut32 vcrun2010 richtx32 comdlg32 comctl32ocx
wine BookStory_en.exe
```


Oh sweet - thanks for this. I will add it to the README

