I'm a bit surprised it gets this question wrong (ChatGPT gets it right, even on instant). All the pre-reasoning models failed this question, but it has seemed solved since o1, and Sonnet 4.5 got it right.
Yeah, earlier in the GPT days I felt like this was a good example of LLMs being "a blurry jpeg of the web": you could give them something very close to a puzzle that commonly exists on the web, and they'd regurgitate an answer from that training set. It was neat to see the question get solved consistently by the reasoning models (though often only after churning through a bunch of tokens trying and verifying, at times counting 888 + 88 + 8 + 8 + 8 as nine digits).
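For what it's worth, the sum above does work out: five terms totaling 1000, written with eight 8s (3 + 2 + 1 + 1 + 1 digits), not nine. A trivial check, assuming the standard "eight 8s make 1000" form of the puzzle:

```python
# Sanity check on the classic puzzle: make 1000 from eight 8s.
terms = [888, 88, 8, 8, 8]

total = sum(terms)
digit_count = sum(len(str(t)) for t in terms)

print(total, digit_count)  # → 1000 8
```

So the reasoning models' repeated re-verification was spent confirming a count that a one-liner settles.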
I wonder if it's a temperature thing, or if things are being throttled up or down by time of day. I was signed in to a paid Claude account when I ran the test.
https://claude.ai/share/876e160a-7483-4788-8112-0bb4490192af
This was Sonnet 4.6 with extended thinking.