Qwen3 is substantially better in my local testing. As in, adheres to the prompt better (pretty much exactly for the 32B parameter variant, very impressive) and is more organic sounding.
In simplebench gpt-oss (120 bn) flopped hard so it doesn't appear particularly good at logical puzzles either.
So presumably, this comes down to...
- training technique or data
- dimension
- lower number of large experts vs higher number of small experts
If I had to make a guess, I'd say this has much, much less to do with the architecture and far more to do with the data and training pipeline. Many have speculated that gpt-oss has adopted a Phi-like synthetic-only dataset and focused mostly on gaming metrics, and I've found the evidence so far to be sufficiently compelling.
That would be interesting. I've been a bit sceptical of the entire strategy from the beginning. If oss was actually as good as o3 mini, and in some cases o4 mini, outside benchmarks, that would undermine OpenAI's API offering for GPT-5 nano and maybe mini too.
Edit: found this analysis, it's on the HN frontpage right now
> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.
The strategy of Phi isn't bad, it's just not general. It's really a model that's meant to be fine tuned, but unfortunately fine tuning tends to shit on RL'd behavior, so it ended up not being that useful. If someone made a Phi style model with an architecture that was designed to take knowledge adapters/experts (i.e. small MoE model designed to get separately trained networks plugged into them with routing updates via special LoRA) it'd actually be super useful.
The Phi strategy is bad. It results in very bad models that are useless in production, while gaming the benchmark to appear like it is actually able to do something. This is objectively bad.
I like the idea of having a _HIGHLY_ unopinionated base model that's just good at basic logic and instruction following that I can fine tune to my use case. Sadly, full fine tuning tends to make models derpy, and LoRAs are limited in terms of what they can achieve.
That seems unrelated? I think we are talking past each other. Phi was trained on purely synthetic data derived from emulating the benchmark suite. Not surprisingly, this resulted in state of the art scores. And a model that was 100% useless at anything other than making the benchmark number go up.
Yes. I tried to ask gpt-oss to ask me a riddle. The response was absurd. It came up with a nonsensical question, then told me the answer. The answer was a four-letter “word” that wasn’t actually a real word.
“What is the word that starts with S, ends with E, and contains A? → SAEA”
Then when I said that’s not a word and you gave me the answer already, no fun, it said
FWIW, I asked gpt-oss-120b this question 10 times and the answer was always "sauce", "sane" or "sale". I also tried different temperatures (from 0 to 1), which did not seem to have an effect on the correctness of the answer.
EDIT: I now have also questioned the smaller gpt-oss-20b (free) 10 times via OpenRouter (default settings, provider was AtlasCloud) and the answers were: sage, sane, sane, space, sane, sane, sane, sane, space, sane.
You are either very unlucky, your configuration is suboptimal (weird system prompt perhaps?) or there is some bug in whichever system you are using for inference.
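For anyone who wants to repeat this kind of tally, here is a minimal sketch of the check. It only tests the letter pattern from the riddle (starts with S, ends with E, contains A), not whether the string is a real dictionary word. The function name `fits_riddle` is made up for illustration, and the answer list is copied from the gpt-oss-20b run above.

```python
from collections import Counter


def fits_riddle(word: str) -> bool:
    """Return True if the word starts with S, ends with E, and contains A.

    This checks the letter pattern only; it does not verify that the
    string is an actual English word.
    """
    w = word.strip().lower()
    return w.isalpha() and w.startswith("s") and w.endswith("e") and "a" in w


# The ten gpt-oss-20b answers reported in the comment above.
answers_20b = ["sage", "sane", "sane", "space", "sane",
               "sane", "sane", "sane", "space", "sane"]

tally = Counter(answers_20b)
all_valid = all(fits_riddle(a) for a in answers_20b)

print(tally.most_common())
print("all answers fit the pattern:", all_valid)
print("SAEA fits the pattern:", fits_riddle("SAEA"))
```

Note that "SAEA" fails even this loose check, since it ends with A rather than E.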
Yes, but the odds of getting GPT-OSS to respond with that riddle are pretty low and it is not necessary to demonstrate whether the LLM can answer the riddle correctly.
qwen3-coder answered "The word is SAME." It took about 10 seconds to write out all of its thinking, because it writes everything down, but processing was 0 seconds.
"I need to find a word that starts with S, ends with E, and contains A.
Let me think through this step by step:
Starts with S
Ends with E
Contains A somewhere in the middle
One word that fits this pattern is "SAGE" - but that only has 4 letters and would be S-A-G-E, which doesn't work because it doesn't contain A in the middle.
Wait, let me reconsider. The word "SAGE" actually works if I read it correctly:
Starts with S
Ends with E
Contains A
Actually, "SAGE" is S-A-G-E, so it starts with S and ends with E, but it contains A in the second position.
Let me think of other possibilities:
"SQUARE" - starts with S, ends with E, contains A (in the middle)
"SALE" - starts with S, ends with E, contains A
"SAME" - starts with S, ends with E, contains A
Wait, "SAME" is S-A-M-E - that works! It starts with S, ends with E, and contains A.
This is tangential because the task was to come up with the riddle, not solve it.
But, do reasoning models usually do this poorly?
It comes up with a valid solution, SAGE, then disqualifies it for incomprehensible reasons.
Then it discovers that SAGE works if it “reads it carefully.” But then seems to disqualify it(?), or at least goes to list other words for some reason.
Then it comes up with SAME, a word… with exactly the same shape as SAGE, just swapped out the irrelevant letter.
What is going on here? Is it programmed to constantly second-guess itself to make it better at finding weaknesses to its answers to harder riddles? But since it doesn’t know how to accept a good answer, it seems like it is just rolling the dice and then stopping at a random point.
I guess it is technically right, but the logic is a total mess.
The model isn't explicitly programmed to constantly second-guess itself, but when you do reinforcement learning with verifiable rewards (RLVR) where only the final answer is verified, even completely nonsensical reasoning can accidentally be rewarded if it gives correct results often enough.
E.g. if the model can generate multiple candidate solutions that are all equally likely (or unlikely) to be correct, it doesn't matter whether you stop at the first one or keep going until a random later one. But if the model can pick the correct solution from multiple candidates better than choosing uniformly at random, generating more candidates becomes an advantage, even if it sometimes results in discarding a correct solution in favor of another one.
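The argument above can be sketched with a toy calculation. The model here is my own simplification, not a description of how any real RLVR setup works: each of `k` candidates is independently correct with probability `p`, and the final pick weights each correct candidate `w` times as heavily as an incorrect one (`w = 1` means picking uniformly at random). All names are hypothetical.

```python
from math import comb


def p_final_correct(k: int, p: float, w: float) -> float:
    """Probability that the finally selected candidate is correct.

    Toy assumptions: k candidates, each independently correct with
    probability p; the selector picks one candidate with weight w for
    each correct candidate and weight 1 for each incorrect one.
    """
    total = 0.0
    for m in range(1, k + 1):  # m = number of correct candidates
        prob_m = comb(k, m) * p**m * (1 - p) ** (k - m)
        pick_correct = w * m / (w * m + (k - m))
        total += prob_m * pick_correct
    return total


if __name__ == "__main__":
    p = 0.3
    for w in (1.0, 3.0):
        row = [round(p_final_correct(k, p, w), 3) for k in (1, 2, 5, 10)]
        print(f"w={w}: accuracy for k=1,2,5,10 -> {row}")
```

With `w = 1` the accuracy stays at `p` no matter how many candidates are generated, so stopping at the first candidate or a random later one makes no difference. With `w > 1` the accuracy rises with `k`, even though some runs discard a correct candidate along the way.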
On the open source library part, you can ask DeepWiki the questions yourself and feed the answers to the LLMs by hand. DeepWiki gives you high quality answers because they are grounded in code and you can check the veracity yourself.
Qwen3 32B is a dense model: it uses all its parameters all the time. GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time. It’s a tradeoff that makes it faster to run than a dense 20B model and much smarter than a 3.6B one.
In practice the fairest comparison would be to a dense ~8B model. Qwen Coder 30B A3B is a good sparse comparison point as well.
When people talk about sparse or dense models, are they sparse or dense matrices in the conventional numerical linear algebra sense? (Something like a CSR matrix?)
> GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time.
They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model it's more than fair to compare it to Qwen3 32B.
Only if 120B fits entirely in the GPU. Otherwise, for me, with a consumer GPU that only has 32 GB VRAM, gpt-oss 120B is actually 2 times slower than Qwen3 32B (37 tok/sec vs. 65 tok/sec)
I've read many times that MoE models should be comparable to dense models with a number of parameters equal to the geometric mean of the MoE's total number of parameters and active ones.
In the case of gpt-oss 120B that would mean sqrt(5*120) ≈ 24B.
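As a quick sanity check of that rule of thumb, here is a minimal sketch. The geometric mean is only the heuristic mentioned above, not an established law, and the function name is made up. The parameter counts are the nominal ones quoted in this thread (120B total / 5.1B active, 20B total / 3.6B active).

```python
from math import sqrt


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active parameters, in billions.

    This is the rule-of-thumb "dense-equivalent" size of an MoE model;
    treat it as a rough heuristic only.
    """
    return sqrt(total_b * active_b)


# Nominal sizes as quoted in the thread: (total, active) in billions.
models = {
    "gpt-oss-120b": (120.0, 5.1),
    "gpt-oss-20b": (20.0, 3.6),
}

for name, (total_b, active_b) in models.items():
    print(f"{name}: ~{dense_equivalent_b(total_b, active_b):.1f}B dense-equivalent")
```

Under this heuristic the 20B model comes out at roughly 8.5B, which lines up with the "dense ~8B" comparison suggested earlier in the thread.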
Not sure there is one formula, because there are two different cases:
1) Performance constrained, like NVidia Spark with 128GB or AGX with 64GB.
2) Memory constrained, like consumer GPUs.
In the first case MoE is a clear win: the models fit and run faster. In the second case dense models will produce better results, and if their performance in tokens/sec is acceptable, they are the better choice.
Yesterday I signed up for qwen3-coder-plus. It fails the "diff" edit format 4 times out of 10 in the various code editing tools I use.
Gemini Pro 2.5 with the diff-fenced edit format rarely fails. So I don't see this Qwen3 hype unless I am using the wrong edit format. Can anyone tell me which edit format will work better with Qwen3?
I'm running a30-a3b-instruct q6 quant on exllamav2 and checked a few simple tasks in roo and cline. Prompt adherence, tool calling and file changing worked flawlessly.
Maybe I'm doing something wrong, but in my testing with Roo and Qwen3-Coder-30B via MLX, it constantly ends up in loops and often doesn't manage to finish editing a file, leaving it half finished.
If I give it really simple, straightforward tasks it works quite nicely though.