Qwen3 is substantially better in my local testing. As in, adheres to the prompt ...

jszymborski · 2025-08-10T16:11:25 1754842285

If I had to make a guess, I'd say this has much, much less to do with the architecture and far more to do with the data and training pipeline. Many have speculated that gpt-oss has adopted a Phi-like synthetic-only dataset and focused mostly on gaming metrics, and I've found the evidence so far to be sufficiently compelling.

7moritz7 · 2025-08-10T16:15:18 1754842518

That would be interesting. I've been a bit sceptical of the entire strategy from the beginning. If oss was actually as good as o3 mini and in some cases o4 mini outside benchmarks, that would undermine openai's api offer for gpt 5 nano and maybe mini too.

Edit: found this analysis, it's on the HN frontpage right now

> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.

https://x.com/jxmnop/status/1953899426075816164

CuriouslyC · 2025-08-10T16:26:38 1754843198

The strategy of Phi isn't bad, it's just not general. It's really a model that's meant to be fine tuned, but unfortunately fine tuning tends to shit on RL'd behavior, so it ended up not being that useful. If someone made a Phi style model with an architecture that was designed to take knowledge adapters/experts (i.e. small MoE model designed to get separately trained networks plugged into them with routing updates via special LoRA) it'd actually be super useful.

adastra22 · 2025-08-10T20:03:52 1754856232

The Phi strategy is bad. It results in very bad models that are useless in production, while gaming the benchmark to appear like it is actually able to do something. This is objectively bad.

CuriouslyC · 2025-08-10T21:31:35 1754861495

I like the idea of having a _HIGHLY_ unopinionated base model that's just good at basic logic and instruction following that I can fine tune to my use case. Sadly, full fine tuning tends to make models derpy, and LoRAs are limited in terms of what they can achieve.

adastra22 · 2025-08-10T22:06:05 1754863565

That seems unrelated? I think we are talking past each other. Phi was trained on purely synthetic data derived from emulating the benchmark suite. Not surprisingly, this resulted in state-of-the-art scores, and a model that was 100% useless at anything other than making the benchmark number go up.

johnisgood · 2025-08-10T23:21:55 1754868115

Is there a URL to the post itself somewhere else?

unstatusthequo · 2025-08-10T16:57:51 1754845071

Yes. I tried to ask gpt-oss to ask me a riddle. The response was absurd. It came up with a nonsensical question, then told me the answer. The answer was a four letter “word” that wasn’t actually a real word.

“What is the word that starts with S, ends with E, and contains A? → SAEA”

Then when I said that’s not a word and you gave me the answer already, no fun, it said

“I do not have access to confirm that word.”

verisimi · 2025-08-10T17:14:11 1754846051

lol. The answer it gave doesn't even end in an 'E'.

threeducks · 2025-08-10T19:02:34 1754852554

FWIW, I asked gpt-oss-120b this question 10 times and the answer was always "sauce", "sane" or "sale". I also tried different temperatures (from 0 to 1), which did not seem to have an effect on the correctness of the answer.

EDIT: I now have also questioned the smaller gpt-oss-20b (free) 10 times via OpenRouter (default settings, provider was AtlasCloud) and the answers were: sage, sane, sane, space, sane, sane, sane, sane, space, sane.

You are either very unlucky, your configuration is suboptimal (weird system prompt perhaps?) or there is some bug in whichever system you are using for inference.
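For anyone who wants to check the answers quoted in this thread mechanically, here is a minimal sketch. It only tests the three letter constraints from the riddle (starts with S, ends with E, contains A); it does not check whether something is a real dictionary word, and the function name `fits_riddle` is made up for illustration.

```python
def fits_riddle(word):
    """True if the word starts with S, ends with E, and contains an A."""
    w = word.strip().lower()
    return w.startswith("s") and w.endswith("e") and "a" in w

# Answers reported in this thread, including the original "SAEA".
answers = ["sauce", "sane", "sale", "sage", "space", "SAME", "SQUARE", "SAEA"]

# Collect every answer that violates at least one constraint.
failures = [a for a in answers if not fits_riddle(a)]
print(failures)
```

Of the answers listed, only "SAEA" fails, because it ends in A rather than E.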

yunusabd · 2025-08-10T19:36:30 1754854590

GP asked the model to _create_ a riddle, not solve a given one.

threeducks · 2025-08-10T20:38:41 1754858321

Yes, but the odds of getting GPT-OSS to respond with that riddle are pretty low and it is not necessary to demonstrate whether the LLM can answer the riddle correctly.

vidarh · 2025-08-11T00:54:50 1754873690

They said it provided the answer when it created the riddle. They didn't question its ability to solve it.

roscas · 2025-08-10T17:22:25 1754846545

"The word is SAME." for qwen3-coder and it took like 10 seconds to write all the thinking because it writes it down, but processing was 0 seconds.

"I need to find a word that starts with S, ends with E, and contains A.

Let me think through this step by step:

    Starts with S
    Ends with E
    Contains A somewhere in the middle

One word that fits this pattern is "SAGE" - but that only has 4 letters and would be S-A-G-E, which doesn't work because it doesn't contain A in the middle.

Wait, let me reconsider. The word "SAGE" actually works if I read it correctly:

    Starts with S
    Ends with E  
    Contains A

Actually, "SAGE" is S-A-G-E, so it starts with S and ends with E, but it contains A in the second position.

Let me think of other possibilities:

    "SQUARE" - starts with S, ends with E, contains A (in the middle)
    "SALE" - starts with S, ends with E, contains A
    "SAME" - starts with S, ends with E, contains A

Wait, "SAME" is S-A-M-E - that works! It starts with S, ends with E, and contains A.

The word is SAME. "

bee_rider · 2025-08-11T05:13:59 1754889239

This is tangential because the task was to come up with the riddle, not solve it.

But, do reasoning models usually do this poorly?

It comes up with a valid solution, SAGE, then disqualifies it for incomprehensible reasons.

Then it discovers that SAGE works if it “reads it carefully.” But then seems to disqualify it(?), or at least goes to list other words for some reason.

Then it comes up with SAME, a word… with exactly the same shape as SAGE, just swapped out the irrelevant letter.

What is going on here? Is it programmed to constantly second-guess itself to make it better at finding weaknesses to its answers to harder riddles? But since it doesn’t know how to accept a good answer, it seems like it is just rolling the dice and then stopping at a random point.

I guess it is technically right, but the logic is a total mess.

yorwba · 2025-08-11T08:23:40 1754900620

The model isn't explicitly programmed to constantly second-guess itself, but when you do reinforcement learning with verifiable rewards (RLVR) where only the final answer is verified, even completely nonsensical reasoning can accidentally be rewarded if it gives correct results often enough.

E.g. if the model can generate multiple candidate solutions that are all equally likely (or unlikely) to be correct, it doesn't matter whether you stop at the first one or keep going until a random later one. But if the model can pick the correct solution from multiple candidates better than choosing uniformly at random, generating more candidates becomes an advantage, even if it sometimes results in discarding a correct solution in favor of another one.
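The second paragraph can be made concrete with a toy calculation. This is only a sketch under assumptions that are mine, not taken from any paper: each candidate is correct independently with probability `p`, and the final pick is random but weights a correct candidate `w` times as heavily as a wrong one (`w = 1` is uniform picking, `w > 1` is a better-than-random selector).

```python
from math import comb

def success_probability(n, p, w):
    """Chance the final answer is correct after generating n >= 1 candidates.

    Toy model: k of the n candidates are correct (binomial with rate p),
    and a correct candidate is w times as likely to be picked as a wrong one.
    """
    total = 0.0
    for k in range(n + 1):  # k = number of correct candidates
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        pick_correct = (w * k) / (w * k + (n - k))
        total += prob_k * pick_correct
    return total

for n in (1, 2, 5, 10):
    uniform = success_probability(n, 0.3, 1.0)
    selective = success_probability(n, 0.3, 3.0)
    print(n, round(uniform, 3), round(selective, 3))
```

With `w = 1` the success rate stays at `p` no matter how many candidates are generated, so stopping at the first one costs nothing. With `w > 1` it rises with `n`, even though the selector will sometimes throw away a correct candidate.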

adastra22 · 2025-08-10T20:05:30 1754856330

He was asking the llm to come up with the riddle.

faangguyindia · 2025-08-11T01:28:07 1754875687

this is exactly why the strongest model is gonna lose out to weaker models if the latter have more data

for example, i was using deep seek webui and getting decent on point answers but it simply does not have latest data.

So, while Deep Seek R1 might be a better model than Grok3 or even Grok4, it not having access to "twitter data" basically puts it behind.

Same is the case with OpenAI: if OpenAI has access to fast data from github, it can help with bugfixes which claude/gemini2.5 pro can't.

model can be smarter but if it does not have the data to base its inference upon it's useless.

fspeech · 2025-08-11T06:08:32 1754892512

On the open source library part, you can ask DeepWiki the questions yourself and feed the answers to the LLMs by hand. DeepWiki gives you high quality answers because they are grounded in code and you can check the veracity yourself.

omneity · 2025-08-11T00:26:00 1754871960

Qwen3 32B is a dense model, it uses all its parameters all the time. GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time. It’s a tradeoff that makes it faster to run than a dense 20B model and much smarter than a 3.6B one.

In practice the fairest comparison would be to a dense ~8B model. Qwen Coder 30B A3B is a good sparse comparison point as well.
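To illustrate what "only uses a fraction at a time" means, here is a toy top-k routing sketch. Everything in it is hypothetical: the experts are plain scalar functions, the router logits are hard-coded, and the expert count and `k` are made up rather than taken from gpt-oss. In a real model the router scores are computed per token and the experts are feed-forward blocks.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits, k):
    """Run only the k selected experts and mix their outputs."""
    routed = top_k_route(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routed)
    return output, [i for i, _ in routed]

# Four toy "experts"; only two of them are evaluated for this input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_layer(1.0, experts, [0.0, 5.0, 1.0, 5.0], k=2)
print(output, used)
```

All four experts have to be held in memory, but the compute per token only covers the ones that were selected, which is the speed/size tradeoff described above.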

bee_rider · 2025-08-11T12:56:52 1754917012

Tangential question from an outsider:

When people talk about sparse or dense models, are they sparse or dense matrices in the conventional numerical linear algebra sense? (Something like a CSR matrix?)

selcuka · 2025-08-11T00:30:35 1754872235

> GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time.

They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model it's more than fair to compare it to Qwen3 32B.

Mars008 · 2025-08-11T02:15:00 1754878500

You call it fair? 32 / 5.1 > 6, so it takes 6 times more compute for each token. Put another way, Qwen3 32B is 6 times slower than GPT OSS 120B.

kgeist · 2025-08-11T02:22:59 1754878979

>Qwen3 32B is 6 times slower than GPT OSS 120B.

Only if 120B fits entirely in the GPU. Otherwise, for me, with a consumer GPU that only has 32 GB VRAM, gpt-oss 120B is actually 2 times slower than Qwen3 32B (37 tok/sec vs. 65 tok/sec)
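The gap between the two claims is easy to put side by side. A back-of-the-envelope sketch using only the numbers quoted in this thread (32B vs. 5.1B active parameters, 65 vs. 37 tok/sec on one particular setup); the variable names are just for illustration.

```python
# Active parameters per token, in billions (figures quoted above).
qwen3_32b_active = 32.0
gpt_oss_120b_active = 5.1

# Measured throughput on a 32 GB consumer GPU with CPU offloading, tok/sec.
qwen3_32b_tps = 65.0
gpt_oss_120b_tps = 37.0

# Naive expectation: speed scales inversely with active parameters.
expected_speedup = qwen3_32b_active / gpt_oss_120b_active

# What was actually observed: gpt-oss 120B relative to Qwen3 32B.
observed_speedup = gpt_oss_120b_tps / qwen3_32b_tps

print(f"expected: {expected_speedup:.2f}x, observed: {observed_speedup:.2f}x")
```

The active-parameter ratio predicts a roughly 6x advantage for gpt-oss 120B, while the measurement shows it running at a bit over half the speed of Qwen3 32B once part of the model has to live outside the GPU.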

selcuka · 2025-08-11T06:48:09 1754894889

We are talking about accuracy, though. I don't see the point of MoE if a 120B MoE model is not as accurate as even a 32B model.

littlestymaar · 2025-08-11T07:13:22 1754896402

I've read many times that MoE models should be comparable to dense models with a number of parameters equal to the geometric mean of the MoE's total number of parameters and active ones.

In the case of gpt-oss 120B that would mean sqrt(5*120)=24B.
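As a quick sketch of that rule of thumb (a community heuristic, not an established law), using the parameter counts mentioned in this thread; the function name is made up:

```python
import math

def dense_equivalent_b(total_b, active_b):
    """Geometric mean of total and active parameter counts, in billions."""
    return math.sqrt(total_b * active_b)

# gpt-oss 120B with 5.1B active parameters per token.
big = dense_equivalent_b(120, 5.1)

# gpt-oss 20B with 3.6B active parameters per token.
small = dense_equivalent_b(20, 3.6)

print(round(big, 1), round(small, 1))
```

Under this heuristic the 120B model lands near a dense ~25B model and the 20B model near a dense ~8B model.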

Mars008 · 2025-08-11T15:52:15 1754927535

Not sure there is one formula, because there are two different cases:

1) performance constrained, like NVidia Spark with 128GB or AGX with 64GB.

2) memory constrained, like consumer GPUs.

In the first case MoE is a clear win: the models fit and run faster. In the second case dense models will produce better results, and if their speed in tokens/sec is acceptable then they are the better choice.

selcuka · 2025-08-11T12:30:41 1754915441

> In the case of gpt-oss 120B that would mean sqrt(5*120)=24B.

That's actually in line with what I had (unscientifically) expected. Claude Sonnet 4 seems to agree:

> The most accurate approach for your specific 120B MoE (5.1B active) would be to test it empirically against dense models in the 10-30B range.

kgeist · 2025-08-11T11:20:01 1754911201

I've read that the formula is based on the early Mistral models and does not necessarily reflect what's going on nowadays.

BoorishBears · 2025-08-10T20:02:07 1754856127

MoE expected performance = sqrt(active parameter count * total parameter count)

sqrt(120*5) ~= 24

GPT-OSS 120B is effectively a 24B parameter model with the speed of a much smaller model

faangguyindia · 2025-08-11T01:25:15 1754875515

yesterday, i signed up for qwen3-coder-plus. It fails 4/10 "diff" edit format in various code editing tools i use.

Gemini Pro 2.5 with diff fenced edit format, rarely fails. So i don't see this Qwen3 hype unless i am using wrong edit format, can anyone tell me which edit format will work better with Qwen3?

https://aider.chat/docs/more/edit-formats.html

eurekin · 2025-08-11T09:04:18 1754903058

I'm running a30-a3b-instruct q6 quant on exllamav2 and checked a few simple tasks in roo and cline. Prompt adherence, tool calling and file changing worked flawlessly.

faangguyindia · 2025-08-11T10:40:37 1754908837

okay turns out i was using it in aider with the wrong edit format in editor mode. i switched to the editor edit format and it has not failed so far.

wickedsight · 2025-08-11T12:48:14 1754916494

Maybe I'm doing something wrong, but in my testing with Roo and Qwen3-Coder-30B via MLX, it constantly ends up in loops and often doesn't manage to finish editing a file, leaving it half finished.

If I give it really simple, straightforward tasks it works quite nicely though.

cranberryturkey · 2025-08-10T20:09:03 1754856543

qwen3 is slow though. i used it. it worked, but it was slow and lacking features.

kgeist · 2025-08-11T02:34:19 1754879659

On my RTX 5090 with llama.cpp:

gpt-oss 120B - 37 tok/sec (with CPU offloading, doesn't fit in the GPU entirely)

Qwen3 32B - 65 tok/sec

Qwen3 30B-A3B - 150 tok/sec

(all at 4-bit)

xfalcox · 2025-08-11T00:17:41 1754871461

Qwen 3 is not slow by any metric.

Which model, inference software and hardware are you running it on?

The 30BA3B variant flies on any GPU.

SchemaLoad · 2025-08-11T02:06:33 1754877993

GPT-OSS is slow too. Gemma3 gives me better results and runs faster.