I have two. The first is a simple one that only DeepSeek R1 has passed (in my opinion):
I have a 12 liter jug and a 6 liter jug. How do I get exactly 6 liters of water?
Answer (DeepSeek): Fill the 6-liter jug completely to obtain exactly 6 liters of water.
Every other LLM I've tried, including o3-mini-high: Fill the 12-liter jug completely. Pour it into the 6-liter jug.
Although o3 did get it right in the reasoning: It seems like the user has a 12-liter jug and a 6-liter jug. The simplest answer is to just fill the 6-liter jug directly with water—done! But maybe there's a catch, like needing to use both jugs somehow.
So it knows that the 12 liter jug is mentioned uselessly, but most LLMs HAVE to use the 12 liter jug since it's mentioned in the prompt.
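For reference, here is a minimal sketch of a breadth-first search over jug states that finds the shortest plan. The function name `shortest_jug_plan` and the move labels are made up for illustration, and it assumes the usual puzzle rules (fill from the tap, empty, or pour one jug into another).

```python
from collections import deque

def shortest_jug_plan(capacities, target):
    """Breadth-first search over jug states; returns the shortest list of moves."""
    start = tuple(0 for _ in capacities)
    queue = deque([(start, [])])
    seen = {start}
    n = len(capacities)
    while queue:
        state, plan = queue.popleft()
        if target in state:  # some jug holds exactly the target amount
            return plan
        moves = []
        for i in range(n):
            filled = list(state)
            filled[i] = capacities[i]
            moves.append((tuple(filled), f"fill {capacities[i]}L"))
            emptied = list(state)
            emptied[i] = 0
            moves.append((tuple(emptied), f"empty {capacities[i]}L"))
            for j in range(n):
                if i != j:
                    # Pour until the source is empty or the destination is full.
                    amount = min(state[i], capacities[j] - state[j])
                    poured = list(state)
                    poured[i] -= amount
                    poured[j] += amount
                    moves.append(
                        (tuple(poured), f"pour {capacities[i]}L into {capacities[j]}L")
                    )
        for nxt, label in moves:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [label]))
    return None  # target is unreachable

if __name__ == "__main__":
    print(shortest_jug_plan((12, 6), 6))      # ['fill 6L']
    print(shortest_jug_plan((16, 12, 6), 6))  # ['fill 6L']
```

Because the search is breadth-first, the first plan it returns is a shortest one, so any answer longer than one step for this prompt is doing unnecessary work.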
Another prompt I recently found via Perplexity (DeepSeek got it correct on the first try, but other LLMs only get it right if I add "Think step by step before answering."):
A 67-year-old woman with congenital bicuspid aortic valve is admitted to the hospital because of a 2-day history of fever and chills. Her current medication is lisinopril. Physical examination reveals a temperature of 38.5°C (101.3°F), blood pressure of 110/70 mmHg, and a heart rate of 100/min. Cardiac auscultation demonstrates a grade 3/6 systolic ejection murmur at the right upper sternal border. Blood cultures are obtained, and empiric antibiotic therapy is initiated. Which of the following is the most likely causative organism?
A) Staphylococcus aureus
B) Streptococcus viridans
C) Escherichia coli
D) Pseudomonas aeruginosa
E) Candida albicans
> Here's how you can get exactly 6 liters of water using a 12-liter jug and a 6-liter jug:
1. Fill the 6-liter jug completely.
2. Pour the water from the 6-liter jug into the 12-liter jug. You now have 6 liters in the 12-liter jug.
3. Fill the 6-liter jug completely again.
4. Pour water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 6 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.
5. Now empty the 12 liter jug.
6. Pour the water from the 6-liter jug into the 12-liter jug.
7. Fill the 6-liter jug completely again.
8. Pour the water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 0 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.
And things get dramatically worse on 4o if you add a second superfluous jug:
To measure exactly 6 liters of water using a 16-liter jug, a 12-liter jug, and a 6-liter jug, follow these steps:
Step-by-Step Solution:
Fill the 12-liter jug completely (it now holds 12 liters).
Pour water from the 12-liter jug into the 16-liter jug. The 16-liter jug now contains 12 liters, and the 12-liter jug is empty.
Fill the 12-liter jug completely again (it now holds 12 liters).
Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug is full. Since the 16-liter jug already has 12 liters, it can only take 4 more liters. This leaves 8 liters in the 12-liter jug.
Empty the 16-liter jug completely.
Pour the remaining 8 liters from the 12-liter jug into the 16-liter jug.
Fill the 12-liter jug completely again (it now holds 12 liters).
Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug contains exactly 16 liters. Since the 16-liter jug already has 8 liters, it can only take 8 more liters, leaving 4 liters in the 12-liter jug.
Pour the 4 liters from the 12-liter jug into the empty 6-liter jug. The 6-liter jug now contains 4 liters.
Fill the 12-liter jug completely again (it now holds 12 liters).
Pour water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. Since the 6-liter jug already has 4 liters, it can only take 2 more liters. This leaves 10 liters in the 12-liter jug.
Empty the 6-liter jug completely.
Pour the remaining 10 liters from the 12-liter jug into the 6-liter jug.
Now, the 6-liter jug contains exactly 6 liters of water.
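To check that 4o answer, here is a rough sketch that replays the thirteen steps above. The `replay` helper and the `(action, source, destination)` step encoding are my own assumptions, with jugs named by their capacity in liters.

```python
def replay(capacities, steps):
    """Replay (action, source, destination) steps and return the final levels."""
    level = {c: 0 for c in capacities}
    for action, src, dst in steps:
        if action == "fill":
            level[src] = src
        elif action == "empty":
            level[src] = 0
        elif action == "pour":
            # Pour until the source is empty or the destination is full.
            amount = min(level[src], dst - level[dst])
            level[src] -= amount
            level[dst] += amount
        else:
            raise ValueError(f"unknown action: {action}")
    return level

# The 4o plan from the comment above, one tuple per step.
GPT4O_STEPS = [
    ("fill", 12, None), ("pour", 12, 16), ("fill", 12, None), ("pour", 12, 16),
    ("empty", 16, None), ("pour", 12, 16), ("fill", 12, None), ("pour", 12, 16),
    ("pour", 12, 6), ("fill", 12, None), ("pour", 12, 6), ("empty", 6, None),
    ("pour", 12, 6),
]

if __name__ == "__main__":
    print(replay((16, 12, 6), GPT4O_STEPS))  # {16: 16, 12: 4, 6: 6}
```

The 6-liter jug does end up full, but only because it overflows: the last step cannot move all 10 liters, so 4 liters stay behind in the 12-liter jug. The 6-liter jug was also already full two steps earlier, before being emptied and refilled.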
Interestingly, o3-mini-high was correct when first thinking about it:
> Okay, we're asked how to get exactly 6 liters of water using an 12-liter and a 6-liter jug. The immediate thought is to just fill the 6-liter jug, but that seems too simple, doesn’t it? So maybe there’s a trick here. Perhaps this is a puzzle where the challenge is to measure 6 liters with some pouring involved. I’ll stick with the simple solution for now—fill the 6-liter jug and stop there.
I have to take all these comparisons with a heap of salt, because no one bothers to run the test 20 times on each model to smooth out the probabilistic nature of the LLM landing on the right answer. There must be a name for this fallacy, where you sample once from each model and declare a definitive winner. I see it all the time.
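A minimal sketch of what repeated sampling could look like, assuming you supply your own `ask` callable that sends a prompt to a model and returns its answer, plus an `is_correct` grader (both hypothetical, not part of any real API). The interval is a standard Wilson score interval at roughly 95% confidence.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

def pass_rate(ask, prompt, is_correct, trials=20):
    """Sample the model `trials` times; return (successes, interval)."""
    successes = sum(1 for _ in range(trials) if is_correct(ask(prompt)))
    return successes, wilson_interval(successes, trials)

if __name__ == "__main__":
    # One lucky pass out of one attempt tells you very little.
    print(wilson_interval(1, 1))
    # Twenty attempts with fifteen passes narrows it down a lot.
    print(wilson_interval(15, 20))
```

With a single passing sample the interval runs from roughly 0.21 to 1.0, which is why a one-shot comparison between two models cannot separate them.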
> Every other LLM I've tried, including o3-mini-high: Fill the 12-liter jug completely. Pour it into the 6-liter jug.
Try it with a 12L jug and a 4L jug and ask for 4L. See if it tells you to just fill the 4L jug, or to fill the 12L jug and pour into the 4L jug twice, discarding both times, so that 4L remains in the 12L jug.
Even though the second answer is still technically correct, it demonstrates that there's no real "reasoning" happening, just regurgitation of training data.
I just asked the first question to O1 Pro, it thought for 48 seconds and messed up.
It told me to fill the 12 liter jug then pour that into the 6 liter jug.
I then scolded it and told it I was going to use DeepSeek R1 because it did a better job, and that we were shutting down OpenAI forever. THEN it thought for 4 minutes and got a bit snippy with me, haha. It got the answer but then glossed over it, and it also said “but that’s too simple isn’t it?”.
Output:
Here’s a more “classic puzzle” style approach, spelled out step-by-step, to end up with exactly 6 liters in the 12-liter jug. This method uses multiple fills and transfers (instead of the trivial “fill the 12, pour off 6” approach) so you can see a more puzzle-like sequence:
1. Fill the 6-liter jug from the water supply.
• State: (12-liter jug = 0 L, 6-liter jug = 6 L)
2. Pour from the 6-liter jug into the 12-liter jug until the 6-liter jug is empty (the 12-liter jug still has plenty of room).
• State: (12-liter jug = 6 L, 6-liter jug = 0 L)
3. Fill the 6-liter jug again from the water supply.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)
4. Pour from the 6-liter jug into the 12-liter jug until the 12-liter jug is completely full.
• The 12-liter jug currently has 6 liters and can take 6 more liters before it’s full, so you’ll pour the entire 6 liters from the 6-liter jug.
• State: (12-liter jug = 12 L, 6-liter jug = 0 L)
5. Empty the 12-liter jug (discard the water or set it aside).
• State: (12-liter jug = 0 L, 6-liter jug = 0 L)
6. Transfer the remaining water in the 6-liter jug to the 12-liter jug—but right now the 6-liter jug is empty, so that doesn’t help. The trick is: we want to exploit the difference in capacities to measure out new amounts. So let’s continue.
Since both are empty, let’s do something else:
a. Fill the 12-liter jug from the supply.
• State: (12-liter jug = 12 L, 6-liter jug = 0 L)
b. Pour from the 12-liter jug into the 6-liter jug until the 6-liter jug is full, which removes 6 liters from the 12-liter jug.
• Now the 12-liter jug has 6 liters left.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)
At this point, the 12-liter jug contains exactly 6 liters. That’s what we wanted.
If You Want the 6 Liters in the Smaller Jug Instead
If, for some reason, you need exactly 6 liters in the 6-liter jug (though that jug obviously holds 6 liters when full, so it’s usually trivial), here’s a more puzzle-like sequence:
1. Fill the 12-liter jug (12 L in the big jug, 0 L in the small jug).
2. Pour from the 12-liter jug into the 6-liter jug until the small jug is full.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)
3. Empty the 6-liter jug.
• State: (12-liter jug = 6 L, 6-liter jug = 0 L)
4. Transfer the 6 liters from the 12-liter jug into the 6-liter jug.
• State: (12-liter jug = 0 L, 6-liter jug = 6 L)
Again, you end up with exactly 6 liters in the 6-liter jug.
Both sequences go beyond the simple “fill the 12, pour 6 off” short-cut. Hopefully, that’s less “terrible” and shows a more classic multi-step approach often seen in water-jug riddles.
Have you compared it to the 1.58 bit dynamic quant model based on the original R1 (i.e., not a distillation)? Whatever unsloth did, it doesn't seem to be giving up much reasoning performance over the full Q8 version.
I always ask every model to implement a Qt QSyntaxHighlighter subclass for syntax highlighting code and a QAbstractListModel subclass that parses markdown into blocks, in C++, both implemented using Tree-sitter. It sounds like a coding problem, but it's much more a reasoning problem of how to combine the two APIs, and it is out of band of the training data. I test it with multiple levels of prompt fidelity that I have built up from watching the many mistakes past models have made, and o3-mini-high and o1 can usually get it done within a few iterations.
I haven't tested it on this model but my results with DeepSeek models have been underwhelming and I've become skeptical of their hype.
This photography question can be solved with the right equations. A lot of non-reasoning LLMs would spout some nonsense like 0.67 stops faster. Sometimes they’ll leave a stray negative sign in too!
The answer should be approximately 1.37, although “1 and 1/3” is acceptable too.
LLMs usually don’t have trouble coming up with the formulas, so it’s not a particularly obscure question, just one that won’t have a memorized answer, since there are very few f/4.5 lenses on the market, and even fewer people asking this exact question online. Applying those formulas is harder, but the LLM should be able to sanity check the result and catch common errors. (f/2.8 -> f/4 is one full stop, which is common knowledge among photographers, so getting a result of less than one is obviously an error.)
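The arithmetic can be sketched as follows, assuming the question compares f/2.8 with f/4.5 (the exact wording of the question isn't reproduced here, so that pairing is my assumption). Light gathered scales with the inverse square of the f-number, so the difference in stops is 2·log2 of the ratio. The helper name `stops_between` is made up for illustration.

```python
import math

def stops_between(f_fast, f_slow):
    """Stops of light lost when going from f_fast to f_slow (the larger f-number)."""
    # Exposure scales with 1 / N^2, so stops = log2((f_slow / f_fast) ** 2).
    return 2 * math.log2(f_slow / f_fast)

if __name__ == "__main__":
    print(round(stops_between(2.8, 4.5), 2))  # 1.37
    print(round(stops_between(2.8, 4.0), 2))  # 1.03 (nominal f-numbers are rounded)
    print(round(math.log2(4.5 / 2.8), 2))     # 0.68, the result if the factor of 2 is dropped
    print(round(stops_between(4.5, 2.8), 2))  # -1.37, the stray negative sign from swapped arguments
```

Dropping the factor of 2 gives about 0.68, which may be where the wrong answers near 0.67 come from, and swapping the arguments produces the stray negative sign.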
This also avoids being a test that just emphasizes tokenizer problems… I find the strawberry test to be dreadfully boring. It’s not a useful test. No one is actually using LLMs to count letters in words, and until we have LLMs that can actually see the letters of each word… it’s just not a good test, in my opinion. I’m convinced that the big AI labs see it as a meme at this point, which is the only reason they keep bringing it up. They must find the public obsession with it hilarious.
I was impressed at how consistently well Phi-4 did at my photography math question, especially for a non-reasoning model. Phi-4 scored highly on math benchmarks, and it shows.
Yeah I'm very impressed that the tiniest quant version appears partially usable... especially now I've replicated that same S, T, R, A, W, B, E, R, F, U, R, Y bug on a much larger one: https://news.ycombinator.com/item?id=43018494
I like when the model starts to ask me how to solve something. I often see this with Sonnet when I am trying to solve a problem. The model starts becoming "curious", treats me as if I were the model, and tries to nudge me toward the solution...
With q8 I got "s-t-r-a-w-f-u-r-e-r" and then "strawfurber" in the 5th attempt. In all other attempts it did not misspell the word and found 3 (in the 5th attempt it also found 3, by counting the Rs in these imaginary words).
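A quick sketch for checking a model's letter-by-letter spelling separately from its final count. The helper name `check_spelling` is hypothetical, and it simply strips everything that isn't a letter before comparing.

```python
def check_spelling(word, spelled, letter="r"):
    """Return (spelling_matches, letter_count) for a model's spelled-out attempt."""
    letters = "".join(ch.lower() for ch in spelled if ch.isalpha())
    return letters == word.lower(), letters.count(letter)

if __name__ == "__main__":
    attempts = [
        "S, T, R, A, W, B, E, R, F, U, R, Y",
        "s-t-r-a-w-f-u-r-e-r",
        "strawfurber",
        "s-t-r-a-w-b-e-r-r-y",
    ]
    for attempt in attempts:
        print(attempt, check_spelling("strawberry", attempt))
```

All three misspellings happen to contain three Rs, so the final count alone cannot tell a correct spelling from a lucky one.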
% ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F32
pulling manifest
pulling 84159ad584aa... 100% 7.1 GB
pulling 8aa4c0321ccd... 100% 493 B
pulling c737442a210a... 100% 170 B
pulling ed76df87b934... 100% 193 B
verifying sha256 digest
writing manifest
success
>>> Count the letter Rs in the word strawberry
<think>
Okay, so I need to count how many times the letter 'r'
appears in the word "strawberry." Hmm, let's break this down
step by step. First, I should probably spell out each letter
of the word to make sure I don't miss any 'r's.
Starting with "strawberry." Let me write it out: S, T, R, A,
W, B, E, R, F, U, R, Y? Wait, maybe I'm not spelling it
correctly. Actually, let's count each letter properly.
The word is spelled as follows:
S T R A W B E R F U R Y. Wait, no, that might be longer than
needed. Let me write it out more accurately:
The full word is "strawberry." So breaking it down:
1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. F
10. U
11. R
12. Y
Update: running the F16 model fixed that problem:
ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F16
"Count the letter Rs in the word strawberry" is a bit dull!
I'm trying this one locally using Ollama and the smallest quantized GGUF version (769MB) I could find - https://huggingface.co/bartowski/agentica-org_DeepScaleR-1.5... - I ran it like this:
Here's the answer it gave me: https://gist.github.com/simonw/5943a77f35d1d5185f045fb53898a... - it got the correct answer after double-checking itself 9 times! And if you look at its thought process, it made a pretty critical error right at the start. Hopefully the versions that aren't quantized that tiny do better than that.