
An open-weight model matching Claude 4 is exciting! It's actually possible to run this locally since it's MoE


Where do you put the 480 GB to run it at any kind of speed? You have that much RAM?


You can get a used five-year-old Xeon Dell or Lenovo workstation with 8x64GB of ECC DDR4 RAM for about $1500-$2000.

Or you can rent a newer one for $300/mo in the cloud


Everyone keeps saying this, but it is not really useful. Without a dedicated GPU and VRAM, you are waiting overnight for a response... The MoE models are great, but they need a dedicated GPU and VRAM to work fast.


Well, yeah, you're supposed to put in a GPU. It's a MoE model: the common tensors should live on the GPU, which also does prompt processing.

The RAM is for the ~400 GB of experts.


It's 480B params, not 480 GB. The 4-bit version of this is 270 GB. I believe it's trained at bf16, so you need over a TB of memory to run the model at bf16.

No one should be trying to replace Claude with a quantized 8-bit or 4-bit model. It's simply not possible. Also, this model isn't going to be as versed as Claude in certain libraries and languages. I have something written entirely by Claude which uses the Fyne library extensively in Golang for UI. Claude knows it inside and out, as it's all vibe coded, but the 4-bit Qwen3 Coder just hallucinated functions and parameters that don't exist because it wasn't willing to admit it didn't know what it was doing. Definitely don't judge a model by its quant is all I'm saying.
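The params-vs-gigabytes confusion is easy to resolve with arithmetic: bytes = parameter count × bits per weight / 8. A quick sketch (pure weight storage only; KV cache, activations, and quant-format overhead push real usage higher, which is why the actual 4-bit file is 270 GB rather than 240 GB):

```python
# Rough weight-only memory footprint for a 480B-parameter model
# at common precisions. Ignores KV cache, activations, and the
# per-block overhead of real quant formats, so actual files are larger.
PARAMS = 480e9

def footprint_gb(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("bf16", 16), ("q8", 8), ("q4", 4)]:
    print(f"{name}: ~{footprint_gb(bits):.0f} GB")
# bf16: ~960 GB, q8: ~480 GB, q4: ~240 GB
```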


You rent an 8xA100 box or better and pay ~$10k a month, which works well if you have a whole team using it and you have the cash. I’ve seen people spending $200-500 per day on Claude Code. So if this model is comparable to Opus, then it’s worth it.


If you're running it for personal use, you don't need to put all of it into GPU VRAM. Cheap DDR5 RAM is fine. You just need a GPU in the system to do the prompt-processing compute and to hold the common tensors that run for every token.

For reference, an RTX 3090 has about 900GB/sec memory bandwidth, and a Mac Studio 512GB has 819GB/sec memory bandwidth.

So you just need a workstation with 8-channel DDR5 memory, all 8 slots populated, and an RTX 3090 stuck inside it. That should come in under $5000 for 512GB of DDR5-6400 running at 409GB/sec of memory bandwidth, plus the 3090.
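Those bandwidth numbers translate directly into a decode-speed ceiling: each generated token has to stream the active parameters through memory once, so tokens/s ≲ bandwidth ÷ bytes per token. A sketch, assuming ~35B active parameters per token (the "A35B" in the model name) at 4-bit; real throughput lands below these ceilings:

```python
# Upper bound on MoE decode speed, assuming generation is purely
# memory-bandwidth-bound: tokens/s <= bandwidth / bytes read per token.
# Assumes ~35B active params per token at 4-bit (an approximation).
ACTIVE_PARAMS = 35e9
BITS = 4
bytes_per_token = ACTIVE_PARAMS * BITS / 8  # 17.5 GB streamed per token

for name, bw_gbs in [("8-ch DDR5-6400", 409), ("Mac Studio 512GB", 819), ("RTX 3090", 900)]:
    print(f"{name}: ~{bw_gbs * 1e9 / bytes_per_token:.0f} tok/s ceiling")
```

The takeaway is that 409GB/sec of system RAM gives a ceiling around 23 tok/s on this model, which is why CPU RAM plus a GPU for the dense layers is a workable budget setup.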


> So if this model is comparable to Opus then it’s worth it.

Qwen says this is similar in coding performance to Sonnet 4, not Opus.


You don't strictly need 480GB of RAM, but if you want at least 3 tokens/s, that much is a must.

If you have 500GB of SSD, llama.cpp will run it straight off disk (weights are mmap'd and paged in as needed) -> it'll be slow though, less than 1 token/s
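The sub-1-token/s figure follows from the same bandwidth arithmetic, with the SSD as the bottleneck. A worst-case sketch, assuming ~35B active params at 4-bit must be re-read from disk each token and a typical ~3 GB/s NVMe drive (in practice, hot experts staying in the page cache pull real speeds up from this floor):

```python
# Worst-case disk-offload decode speed: every token must stream its
# active expert weights back in from the SSD (no page-cache hits).
ACTIVE_BYTES = 35e9 * 4 / 8  # ~17.5 GB of active weights per token at 4-bit
NVME_BW = 3e9                # ~3 GB/s sequential read, a typical NVMe SSD

print(f"~{NVME_BW / ACTIVE_BYTES:.2f} tok/s")  # well under 1 token/s
```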


> but if you want at least 3 tokens / s

3 t/s isn't going to be a lot of fun to use.


beg to differ, I'm living fine with 1.5tk/sec


Speculative decoding with a small draft model could help increase that by, say, 30 to 50%!


i'm not willing to trade any more quality for performance. no draft model, no quantized KV cache either. i'll take the performance cost; it just makes me think carefully about my prompt. i rarely ever need more than one prompt to get my answer. :D


Speculative decoding doesn't change output tokens.


Draft model doesn’t degrade quality!


I beg to differ, especially when it comes to code.


As far as inference costs go, 480GB of RAM is cheap.


Ye! Super excited for Coder!!



