Hacker News

Qwen3.5 35B A3B is much, much faster and fits if you get a 3-bit version. How fast are you getting 27B to run?

On my M3 Air w/ 24GB of memory, 27B is 2 tok/s but 35B A3B is 14-22 tok/s, which is actually usable.
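For anyone wondering why both models are on the edge of fitting in 24GB, here's a rough back-of-envelope sketch (my own numbers, not from a specific quant file; real GGUF files are a bit larger because not all tensors are quantized to the nominal bit width, and KV cache is on top):

```python
# Back-of-envelope weight footprint: params * bits_per_weight / 8 bytes.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (ignores KV cache and activations)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"27B @ 4 bpw: ~{model_size_gb(27, 4):.1f} GB")  # ~13.5 GB
print(f"35B @ 3 bpw: ~{model_size_gb(35, 3):.1f} GB")  # ~13.1 GB
# Both nominally fit in 24 GB, but the OS, the KV cache at long
# context, and other apps eat into the remaining headroom.
```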



Using ik_llama.cpp to run a 27B 4bpw quant on an RTX 3090, I get 1312 tok/s PP and 40.7 tok/s TG at zero context, dropping to 1009 tok/s PP and 36.2 tok/s TG at 40960 context.

35B A3B is faster but didn't do too well in my limited testing.


With regular llama.cpp on a 3070 Ti I get 60 tok/s TG with the 9B model; it's quite impressive.


The 27B is rated slightly higher for SWE-bench.


Don't sleep on the 9B version either: I get much faster speeds and can't tell any difference in quality. On my 3070 Ti I get ~60 tok/s with it, and half that with the 35B-A3B.


27B needs less memory and does better on benchmarks, but 35B-A3B seems to run roughly twice as fast.
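The speed gap makes sense if you assume single-user token generation is mostly memory-bandwidth-bound: tok/s is roughly bandwidth divided by bytes read per token, and an MoE model like the A3B only reads its ~3B active parameters per token while a dense 27B reads all of them. A toy estimate (the 100 GB/s bandwidth figure is a made-up placeholder; the absolute numbers are upper bounds, but the ratio is the point):

```python
# Toy decode-speed model: batch-1 token generation is roughly
# memory-bandwidth-bound, so tok/s ≈ bandwidth / bytes_read_per_token.
def est_tok_per_s(active_params_b: float, bits_per_weight: float,
                  bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

BW = 100.0  # hypothetical memory bandwidth in GB/s, for illustration only
dense = est_tok_per_s(27, 4, BW)  # dense 27B: every weight touched per token
moe = est_tok_per_s(3, 4, BW)     # A3B MoE: only ~3B active weights per token
print(f"dense 27B: ~{dense:.1f} tok/s, 35B-A3B: ~{moe:.1f} tok/s")
```

By this crude model the MoE should decode roughly 9x faster at equal quantization; in practice routing overhead and the larger total weight footprint shrink that advantage toward the ~2x people report.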




