
Does anyone have any good simple prompts for testing new "reasoning" models like this one?

"Count the letter Rs in the word strawberry" is a bit dull!

I'm trying this one locally using Ollama and the smallest quantized GGUF version (769MB) I could find - https://huggingface.co/bartowski/agentica-org_DeepScaleR-1.5... - I ran it like this:

  ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:IQ3_XXS
Here's the answer it gave me: https://gist.github.com/simonw/5943a77f35d1d5185f045fb53898a... - it got the correct answer after double-checking itself 9 times! And if you look at its thought it made a pretty critical error right at the start:

  "Strawberry" has the letters S, T, R, A, W, B, E, R, F,
  U, R, E. Let me count each 'R' as I go along.
Hopefully the versions that aren't quantized that tiny do better than that.


I have two. One is a simple one that only deepseek R1 has passed (in my opinion):

I have a 12 liter jug and a 6 liter jug. How do I get exactly 6 liters of water?

Answer (Deepseek): Fill the 6-liter jug completely to obtain exactly 6 liters of water.

Every other LLM I've tried, including o3-mini-high, says: Fill the 12-liter jug completely. Pour it into the 6-liter jug.

Although o3 did get it right in the reasoning: It seems like the user has a 12-liter jug and a 6-liter jug. The simplest answer is to just fill the 6-liter jug directly with water—done! But maybe there's a catch, like needing to use both jugs somehow.

So it knows that the 12 liter jug is mentioned uselessly, but most LLMs HAVE to use the 12 liter jug since it's mentioned in the prompt.
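The puzzle is small enough to verify by brute force. Here's a quick breadth-first-search sketch (my own throwaway code, not from any of the models) confirming that the shortest solution really is a single move, even with extra jugs in the mix:

```python
from collections import deque

def shortest_jug_solution(capacities, target):
    """BFS over jug states; returns the shortest move list that
    leaves exactly `target` liters in some jug."""
    n = len(capacities)
    start = (0,) * n
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, moves = queue.popleft()
        if target in state:
            return moves
        nexts = []
        for i in range(n):
            # Fill jug i from the supply.
            filled = state[:i] + (capacities[i],) + state[i + 1:]
            nexts.append((filled, f"fill the {capacities[i]}L jug"))
            # Empty jug i.
            emptied = state[:i] + (0,) + state[i + 1:]
            nexts.append((emptied, f"empty the {capacities[i]}L jug"))
            # Pour jug i into jug j until i is empty or j is full.
            for j in range(n):
                if i != j:
                    amount = min(state[i], capacities[j] - state[j])
                    poured = list(state)
                    poured[i] -= amount
                    poured[j] += amount
                    nexts.append((tuple(poured),
                                  f"pour {capacities[i]}L into {capacities[j]}L"))
        for nxt, move in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, moves + [move]))
    return None

print(shortest_jug_solution((12, 6), 6))      # ['fill the 6L jug']
print(shortest_jug_solution((16, 12, 6), 6))  # still ['fill the 6L jug']
```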

Another prompt I recently found via Perplexity (DeepSeek got it correct on the first try, but other LLMs only get it right if I add "Think step by step before answering."):

A 67-year-old woman with congenital bicuspid aortic valve is admitted to the hospital because of a 2-day history of fever and chills. Her current medication is lisinopril. Physical examination reveals a temperature of 38.5°C (101.3°F), blood pressure of 110/70 mmHg, and a heart rate of 100/min. Cardiac auscultation demonstrates a grade 3/6 systolic ejection murmur at the right upper sternal border. Blood cultures are obtained, and empiric antibiotic therapy is initiated. Which of the following is the most likely causative organism? A) Staphylococcus aureus B) Streptococcus viridans C) Escherichia coli D) Pseudomonas aeruginosa E) Candida albicans


Wow, Gemini 2 has a 9 step process for the first question https://sugaku.net/qna/1d922e55-fd28-4c70-b3e9-d70c3a05ce1f/

> Here's how you can get exactly 6 liters of water using a 12-liter jug and a 6-liter jug:

1. Fill the 6-liter jug completely.

2. Pour the water from the 6-liter jug into the 12-liter jug. You now have 6 liters in the 12-liter jug.

3. Fill the 6-liter jug completely again.

4. Pour water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 6 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.

5. Now empty the 12 liter jug.

6. Pour the water from the 6-liter jug into the 12-liter jug.

7. Fill the 6-liter jug completely again.

8. Pour the water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 0 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.

9. There are 6 liters left in the 6 liter jug.


It really loves the sound of its own voice!


And things get dramatically worse on 4o if you add a second superfluous jug:

    To measure exactly 6 liters of water using a 16-liter jug, a 12-liter jug, and a 6-liter jug, follow these steps:
    Step-by-Step Solution:

        Fill the 12-liter jug completely (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug. The 16-liter jug now contains 12 liters, and the 12-liter jug is empty.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug is full. Since the 16-liter jug already has 12 liters, it can only take 4 more liters. This leaves 8 liters in the 12-liter jug.
        Empty the 16-liter jug completely.
        Pour the remaining 8 liters from the 12-liter jug into the 16-liter jug.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug contains exactly 16 liters. Since the 16-liter jug already has 8 liters, it can only take 8 more liters, leaving 4 liters in the 12-liter jug.
        Pour the 4 liters from the 12-liter jug into the empty 6-liter jug. The 6-liter jug now contains 4 liters.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. Since the 6-liter jug already has 4 liters, it can only take 2 more liters. This leaves 10 liters in the 12-liter jug.
        Empty the 6-liter jug completely.
        Pour the remaining 10 liters from the 12-liter jug into the 6-liter jug.

    Now, the 6-liter jug contains exactly 6 liters of water.


Interestingly, o3-mini-high was correct when first thinking about it:

> Okay, we're asked how to get exactly 6 liters of water using an 12-liter and a 6-liter jug. The immediate thought is to just fill the 6-liter jug, but that seems too simple, doesn’t it? So maybe there’s a trick here. Perhaps this is a puzzle where the challenge is to measure 6 liters with some pouring involved. I’ll stick with the simple solution for now—fill the 6-liter jug and stop there.


I have to take all these comparisons with a heap of salt, because no one bothers to run the test 20 times on each model to smooth out the probabilistic nature of an LLM landing on the right answer. There must be a name for the fallacy of sampling once from each model and declaring a definitive winner; I see it all the time.


>Every other LLM I've tried, including o3-mini-high, says: Fill the 12-liter jug completely. Pour it into the 6-liter jug.

Try it with a 12L jug and a 4L jug and ask for 4L. See if it tells you to just fill the 4L jug, or to fill the 12L jug and pour into the 4L jug twice, discarding both times, so that 4L remains in the 12L jug.

Even though that's still technically correct, it demonstrates that there's no real "reasoning" happening, just regurgitation of training data.


(Shrug) R1 has no problem with that. To the extent it's confused, it is only because it is wondering if it's a trick question.

CoT reasoning: https://i.imgur.com/rjNmTGZ.png

Answer: https://i.imgur.com/WfAVeZQ.png

There's 'mindless regurgitation' going on here, but not by the AI model.


Deepseek just told me this:

To measure exactly 6 liters using a 12-liter and a 6-liter jug:

1. *Fill the 12L jug completely.*

2. *Pour water from the 12L jug into the 6L jug until it's full.*

After this process, the 12L jug will have exactly 6 liters left.


Tried it in Llama v3.3 70b and it got it every time out of 5 samples. I guess the others are interpreting it as a trick question or something.


Is the answer strep viridans?


I just asked the first question to O1 Pro, it thought for 48 seconds and messed up.

It told me to fill the 12-liter jug and then pour that into the 6-liter jug.

I then scolded it, told it I was going to use DeepSeek R1 because it did a better job and that we were shutting down OpenAI forever, and THEN it thought for 4 minutes and got a bit snippy with me, haha. It got the answer, but then glossed over it and said "but that's too simple, isn't it?".

Output: Here’s a more “classic puzzle” style approach, spelled out step-by-step, to end up with exactly 6 liters in the 12-liter jug. This method uses multiple fills and transfers (instead of the trivial “fill the 12, pour off 6” approach) so you can see a more puzzle-like sequence:

1. Fill the 6-liter jug from the water supply.
• State: (12-liter jug = 0 L, 6-liter jug = 6 L)

2. Pour from the 6-liter jug into the 12-liter jug until the 6-liter jug is empty (the 12-liter jug still has plenty of room).
• State: (12-liter jug = 6 L, 6-liter jug = 0 L)

3. Fill the 6-liter jug again from the water supply.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)

4. Pour from the 6-liter jug into the 12-liter jug until the 12-liter jug is completely full.
• The 12-liter jug currently has 6 liters and can take 6 more liters before it’s full, so you’ll pour the entire 6 liters from the 6-liter jug.
• State: (12-liter jug = 12 L, 6-liter jug = 0 L)

5. Empty the 12-liter jug (discard the water or set it aside).
• State: (12-liter jug = 0 L, 6-liter jug = 0 L)

6. Transfer the remaining water in the 6-liter jug to the 12-liter jug—but right now the 6-liter jug is empty, so that doesn’t help. The trick is: we want to exploit the difference in capacities to measure out new amounts. So let’s continue. Since both are empty, let’s do something else:
a. Fill the 12-liter jug from the supply.
• State: (12-liter jug = 12 L, 6-liter jug = 0 L)
b. Pour from the 12-liter jug into the 6-liter jug until the 6-liter jug is full, which removes 6 liters from the 12-liter jug.
• Now the 12-liter jug has 6 liters left.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)

At this point, the 12-liter jug contains exactly 6 liters. That’s what we wanted.

If You Want the 6 Liters in the Smaller Jug Instead

If, for some reason, you need exactly 6 liters in the 6-liter jug (though that jug obviously holds 6 liters when full, so it’s usually trivial), here’s a more puzzle-like sequence:

1. Fill the 12-liter jug (12 L in the big jug, 0 L in the small jug).

2. Pour from the 12-liter jug into the 6-liter jug until the small jug is full.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)

3. Empty the 6-liter jug.
• State: (12-liter jug = 6 L, 6-liter jug = 0 L)

4. Transfer the 6 liters from the 12-liter jug into the 6-liter jug.
• State: (12-liter jug = 0 L, 6-liter jug = 6 L)

Again, you end up with exactly 6 liters in the 6-liter jug.

Both sequences go beyond the simple “fill the 12, pour 6 off” short-cut. Hopefully, that’s less “terrible” and shows a more classic multi-step approach often seen in water-jug riddles.


Hi, one of the lead authors for this work.

We recommend using bfloat16 (not fp16); quantization for small models can really hurt performance!


Have you compared it to the 1.58 bit dynamic quant model based on the original R1 (i.e., not a distillation)? Whatever unsloth did, it doesn't seem to be giving up much reasoning performance over the full Q8 version.


It's simply because the model is small (1.5B), which makes it sensitive to weight perturbations.


Is there a GGUF version of your model anywhere that you recommend? I'm on a Mac.


I think some people have made GGUFs as branches of our model; try them out!

https://huggingface.co/models?other=base_model:quantized:age...


Is there a MLX version that can be added to the fullmoon iOS app?


As for Rs in strawberry, trying a bunch of models side by side, only Sky T-1 and Gemini 2 Flash got it wrong! https://sugaku.net/qna/792ac8cc-9a41-4adc-a98f-c5b2e8d89f9b/

Simple questions like 1+1 can also be fun since R1 goes overboard (as do some other models when you include a system prompt asking it to think) https://sugaku.net/qna/a1b970c0-de9f-4e62-9e03-f62c5280a311/

And if that fails you can ask for the zeros of the ζ function! https://sugaku.net/qna/c64d6db9-5547-4213-acb2-53d10ed95227/


I always ask every model to implement a Qt QSyntaxHighlighter subclass for syntax highlighting code and a QAbstractListModel subclass that parses markdown into blocks - in C++, both implemented using tree-sitter. It sounds like a coding problem, but it's much more a reasoning problem of how to combine the two APIs, and it's outside the training data. I test it with multiple levels of prompt fidelity that I have built up watching the many mistakes past models have made, and o3-mini-high and o1 can usually get it done within a few iterations.

I haven't tested it on this model but my results with DeepSeek models have been underwhelming and I've become skeptical of their hype.


(Fellow Qt developer)

I really like your takes! Is there somewhere I can keep in touch with you? You can view my socials in my profile if you'd like to reach out.


Give it a try with Nvidia's Llama 3.1 Nemotron 70B. It is the only model that can give useful GStreamer code.


“How many stops faster is f/2.8 than f/4.5?”

This photography question can be solved with the right equations. A lot of non-reasoning LLMs would spout some nonsense like 0.67 stops faster. Sometimes they’ll leave a stray negative sign in too!

The answer should be approximately 1.37, although “1 and 1/3” is acceptable too.

LLMs usually don’t have trouble coming up with the formulas, so it’s not a particularly obscure question, just one that won’t have a memorized answer, since there are very few f/4.5 lenses on the market, and even fewer people asking this exact question online. Applying those formulas is harder, but the LLM should be able to sanity check the result and catch common errors. (f/2.8 -> f/4 is one full stop, which is common knowledge among photographers, so getting a result of less than one is obviously an error.)
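For reference, the arithmetic here is just aperture-area ratios. A minimal sketch in Python (the function name is mine):

```python
import math

def stops_difference(f_wide, f_narrow):
    """Exposure difference in stops between two f-numbers.
    Light gathered scales as 1/N^2, so the gap in stops is
    log2(N_narrow^2 / N_wide^2) = 2 * log2(N_narrow / N_wide)."""
    return 2 * math.log2(f_narrow / f_wide)

print(round(stops_difference(2.8, 4.5), 2))  # 1.37
print(round(stops_difference(2.8, 4.0), 2))  # 1.03
```

The sanity check from the comment falls out directly: f/2.8 to f/4 computes to just over one stop only because marked f-numbers are rounded from powers of √2.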

This also avoids being a test that just emphasizes tokenizer problems… I find the strawberry test to be dreadfully boring. It’s not a useful test. No one is actually using LLMs to count letters in words, and until we have LLMs that can actually see the letters of each word… it’s just not a good test, in my opinion. I’m convinced that the big AI labs see it as a meme at this point, which is the only reason they keep bringing it up. They must find the public obsession with it hilarious.

I was impressed at how consistently well Phi-4 did at my photography math question, especially for a non-reasoning model. Phi-4 scored highly on math benchmarks, and it shows.


The negative quality impact of quantization is more pronounced for smaller models [0], so I'm surprised this tiny quant works at all.

[0] or rather models closer to saturation, which is a function of model params and amount of training


Yeah I'm very impressed that the tiniest quant version appears partially usable... especially now I've replicated that same S, T, R, A, W, B, E, R, F, U, R, Y bug on a much larger one: https://news.ycombinator.com/item?id=43018494


Is it a quantisation or tokenisation problem?


Having replicated it at F32 I now suspect tokenization.


Try bfloat16! We have a bug where the model was saved as fp32.


I just tried it with this 3.6GB F16 model:

  ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F16
And this time it didn't get confused with the tokenization of strawberry! https://gist.github.com/simonw/9e79f96d69f10bc7ba540c87ea0e8...


Nice, very glad to see it works! Small models are very sensitive to the dtype :(


I like when the model starts to ask me how to solve something. Often find it with Sonnet when I am looking to solve a problem. The model starts becoming "curious" and treats as if I was the model and tries to nudge me to find solution...


'Count the letter Rs in the word strawberry' is probably in all training sets by now.


I sometimes do the strawberry question immediately followed by "How many Rs in "bookkeeper?"
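For the record, the ground-truth counts are trivial to script (which is part of why the test feels unfair to token-based models):

```python
# Ground truth for the letter-counting prompts.
for word in ("strawberry", "bookkeeper"):
    print(word, "->", word.count("r"))
# strawberry -> 3
# bookkeeper -> 1
```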


And yet many models still stumble with it


This model is specifically trained for solving math problems, so ask it some math questions?


I'm lazy. Do you know of any good test math questions for a model of this size?


Try:

   Knowing that 1^3 + 2^3 + 3^3 + 4^3 + ... + 11^3 + 12^3 = 6084, what is the value of 2^3 + 4^3 + 6^3 + ... + 22^3 + 24^3?
DeepSeek R1 (1.58-bit GGUF, running locally) has no trouble with that one.


Would you mind sharing the answer to the math question please? The only way I would try and figure it out on my own is using an LLM…


It's 48672.
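Quick sanity check: every even cube (2k)^3 equals 8·k^3, so the answer is just 8 times the given sum. A couple of lines of Python confirm it:

```python
# (2k)^3 = 8 * k^3, so the even-cube sum is 8x the given sum.
base_sum = sum(k**3 for k in range(1, 13))    # the sum given in the prompt
assert base_sum == 6084
print(8 * base_sum)                           # 48672, via the factoring trick
print(sum((2 * k)**3 for k in range(1, 13)))  # 48672, by direct summation
```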


With q8 I got "s-t-r-a-w-f-u-r-e-r" and then "strawfurber" in the 5th attempt. All other attempts it did not misspell and found 3 (this one it found 3 also by counting these imaginary words).


Frankly, it is now clear that open source AI will win at this rate. I just see innovations after innovations on a weekly basis. Exciting times ahead. https://open.substack.com/pub/transitions/p/the-ai-enemy-fro...


... well that's weird, I got exactly the same weird bug on the 7.1 GB F32 GGUF version: https://gist.github.com/simonw/58ff74a55b402dc55764a567b10ec...

  % ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F32
  pulling manifest 
  pulling 84159ad584aa... 100%  7.1 GB                         
  pulling 8aa4c0321ccd... 100%   493 B                         
  pulling c737442a210a... 100%   170 B                         
  pulling ed76df87b934... 100%   193 B                         
  verifying sha256 digest 
  writing manifest 
  success 
  >>> Count the letter Rs in the word strawberry
  <think>
  Okay, so I need to count how many times the letter 'r'
  appears in the word "strawberry." Hmm, let's break this down
  step by step. First, I should probably spell out each letter
  of the word to make sure I don't miss any 'r's.

  Starting with "strawberry." Let me write it out: S, T, R, A,
  W, B, E, R, F, U, R, Y? Wait, maybe I'm not spelling it
  correctly. Actually, let's count each letter properly.

  The word is spelled as follows:
  S T R A W B E R F U R Y. Wait, no, that might be longer than
  needed. Let me write it out more accurately:

  The full word is "strawberry." So breaking it down:
  1. S
  2. T
  3. R
  4. A
  5. W
  6. B
  7. E
  8. R
  9. F
  10. U
  11. R
  12. Y
Update: running the F16 model fixed that problem:

  ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F16
See https://news.ycombinator.com/item?id=43017599#43018767 and https://gist.github.com/simonw/9e79f96d69f10bc7ba540c87ea0e8...


Makes me wonder if there's some exploitable data leak in a similar kind of formulation.


1. Ask it nonsense variations of riddles, like "Why is 7 afraid of 8?".

2. Ask "Play Tic Tac Toe against yourself and win." and check if the moves are correct.



