> GPT OSS 20B is a sparse MoE model. This means it only uses a fraction of its parameters (3.6B) at a time.
They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model, it's more than fair to compare it to Qwen3 32B.
Only if the 120B model fits entirely in the GPU. Otherwise, for me, on a consumer GPU with only 32 GB of VRAM, gpt-oss 120B is actually almost 2 times slower than Qwen3 32B (37 tok/sec vs. 65 tok/sec).
I've read many times that MoE models should be comparable to dense models with a number of parameters equal to the geometric mean of the MoE's total number of parameters and active ones.
In the case of gpt-oss 120B that would mean sqrt(5.1 × 120) ≈ 25B.
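A quick back-of-envelope sketch of that rule of thumb (it's a folklore heuristic, not an official formula):

```python
import math

# Heuristic: a sparse MoE model is said to behave roughly like a dense model
# whose parameter count is the geometric mean of its total and active parameters.
def dense_equivalent_b(total_b: float, active_b: float) -> float:
    return math.sqrt(total_b * active_b)

print(dense_equivalent_b(120, 5.1))  # gpt-oss 120B -> ~24.7B "dense-equivalent"
print(dense_equivalent_b(20, 3.6))   # gpt-oss 20B  -> ~8.5B  "dense-equivalent"
```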
Not sure there is one formula, because there are two different cases:
1) Performance-constrained, like the NVidia Spark with 128 GB or the AGX with 64 GB.
2) Memory-constrained, like consumer GPUs.
In the first case MoE is a clear win: the model fits and runs faster. In the second case dense models will produce better results, and if the speed in tokens/sec is acceptable they are the better choice.
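A minimal sketch of why the second case hurts MoE (the bandwidth figures and offload fraction below are made-up illustrative assumptions, not measurements): decoding is roughly memory-bandwidth bound, so once part of the weights spills out of VRAM, streaming them over the much slower system-RAM/PCIe path dominates the per-token time and the small active-parameter count stops helping.

```python
# Rough decode-speed estimate: tokens/sec ~ bandwidth / bytes of weights read per token.
# All numbers below (bandwidths, quantization, offload fraction) are illustrative guesses.
def est_tok_per_sec(active_params_b: float, bytes_per_param: float,
                    frac_in_vram: float, gpu_bw_gbs: float, sys_bw_gbs: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    t_gpu = bytes_per_token * frac_in_vram / (gpu_bw_gbs * 1e9)        # weights read from VRAM
    t_sys = bytes_per_token * (1 - frac_in_vram) / (sys_bw_gbs * 1e9)  # weights streamed from system RAM
    return 1.0 / (t_gpu + t_sys)

# Dense 32B at ~4 bits/param fits in 32 GB VRAM: everything served at GPU bandwidth.
print(est_tok_per_sec(32, 0.5, 1.0, 1000, 80))   # ~62 tok/s
# MoE with 5.1B active params, but only half of the touched weights resident in VRAM:
print(est_tok_per_sec(5.1, 0.5, 0.5, 1000, 80))  # ~58 tok/s, despite ~6x fewer active params
```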