
This may be obvious to people who do this regularly, but what kind of machine is required to run this? I downloaded & tried it on my Linux machine that has a 16GB GPU and 64GB of RAM. This machine can run SD easily. But Qwen-Image ran out of memory both when I tried it on the GPU and on the CPU, so that's obviously not enough. But am I off by a factor of two? An order of magnitude? Do I need some crazy hardware?


> This may be obvious to people who do this regularly

This is not that obvious. Calculating VRAM usage for VLMs/LLMs is something of an arcane art. There are about 10 calculators online you can use and none of them work. Quantization, KV caching, activation, layers, etc all play a role. It's annoying.

But anyway, for this model, you need 40+ GB of VRAM. System RAM isn't going to cut it unless it's unified RAM on Apple Silicon, and even then, memory bandwidth is shot, so inference is much much slower than GPU/TPU.
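As a back-of-envelope check (my own sketch, not an exact calculator): the weights alone take roughly parameter count times bytes per parameter, and activations, the VAE, and the text encoder add overhead on top.

```python
# Back-of-envelope VRAM estimate: weights = params x bytes per param.
# Illustrative only; real usage adds activations, KV cache, VAE, text encoder.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Qwen-Image is ~20B parameters:
print(weight_gb(20e9, "bf16"))  # 40.0 GB at 16-bit
print(weight_gb(20e9, "int4"))  # 10.0 GB at 4-bit
```

That lines up with the 40+ GB figure for the released bf16 weights.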


Also, I think you need a single 40GB "card", not just 40GB of VRAM in total. I wrote about this upthread: you're probably going to need one card; I'd be surprised if you could chain several GPUs together.


Oh right, I forgot some diffusion models can't offload / split layers. I don't use vision generation models much at all - was just going off LLM work. Apologies for the potential misinformation.


Not sure what you mean, or maybe you're new to LLMs, but two RTX 3090s will work for this, and even lower-end cards (RTX 3060) will once it's GGUF'd.


This isn't a transformer, it's a diffusion model. You can't split diffusion models across compute nodes.


do you mean https://github.com/pollockjj/ComfyUI-MultiGPU? One GPU would do the computation, but others could pool in for VRAM expansion, right? (I've not used this node)


Nah, that won’t gain you much (if anything?) over just doing the layer swaps from RAM. You can put the text encoder on the second card, but you can also just keep it in RAM without much downside.
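For what it's worth, diffusers can handle that swapping for you: `enable_model_cpu_offload()` keeps components (including the text encoder) in system RAM and moves each one to the GPU only for its forward pass. A sketch, assuming a CUDA machine and the `Qwen/Qwen-Image` repo id — I haven't benchmarked it:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
)
# Keep all components in system RAM; each is moved to the GPU only while
# it runs, then moved back. Slower than fully on-GPU, but fits less VRAM.
pipe.enable_model_cpu_offload()

image = pipe("a corgi reading a newspaper").images[0]
image.save("out.png")
```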


Will the new AMD AI CPUs work? Like an AI HX 395 or the slower 370? I'm stuck on an A2000 with 16GB of VRAM and wondering what's a worthwhile upgrade.


It may fit but image generation on anything but Nvidia is so slow it won’t be worth it.


I believe it's roughly the same size as the model files. If you look in the transformers folder you can see there are around nine 5GB files, so I would expect you need ~45GB of VRAM on your GPU. Usually quantized versions of models are eventually released/created that can run on much less VRAM, but with some quality loss.


Why doesn't huggingface list the aggregate model size?


I've been bugging them about this for a while. There are repos that contain multiple model weights in a single repo which means adding up the file sizes won't work universally, but I'd still find it useful to have a "repo size" indicator somewhere.

I ended up building my own tool for that: https://tools.simonwillison.net/huggingface-storage


HF does this for ggufs, and it’ll show you what quantizations will work on the GPU(s) you’ve selected. Hopefully that feature gets expanded to support more model types.


I've been wondering this for literally years now...


Huggingface is just a git hosting service, like GitHub. You can add up the sizes of all the files in the directory yourself.
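The adding-up can itself be scripted: the `huggingface_hub` client reports per-file sizes via `model_info(..., files_metadata=True)`. A sketch — the summing helper is mine, and the fetch part needs network access so it's left commented:

```python
# Sum per-file sizes from a Hub repo listing. The helper is a pure function
# so it works on any iterable of byte counts.
def total_gb(sizes_bytes):
    """Total size of a collection of file sizes, in gigabytes."""
    return sum(sizes_bytes) / 1e9

# With network access you could fetch real sizes like this (untested sketch):
#   from huggingface_hub import HfApi
#   info = HfApi().model_info("Qwen/Qwen-Image", files_metadata=True)
#   print(total_gb(f.size for f in info.siblings if f.size))

# Offline illustration: nine ~5 GB shards, as in the transformers folder.
print(total_gb([5e9] * 9))  # 45.0
```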


That’s what we have computers for though - to compute.


Model size in GB ≈ parameter count in billions at fp8 (one byte per parameter), so if this was released at fp16 then 40-ish, and quantized to fp4 then 10-ish.


You're probably going to have to wait a couple of days for 4 bit quantized versions to pop up. It's 20B parameters.


   import torch
   from diffusers import DiffusionPipeline
   from diffusers.quantizers import PipelineQuantizationConfig

   model_name = "Qwen/Qwen-Image"
   device = "cuda"

   # Configure NF4 quantization for the transformer and text encoder
   quant_config = PipelineQuantizationConfig(
       quant_backend="bitsandbytes_4bit",
       quant_kwargs={
           "load_in_4bit": True,
           "bnb_4bit_quant_type": "nf4",
           "bnb_4bit_compute_dtype": torch.bfloat16,
       },
       components_to_quantize=["transformer", "text_encoder"],
   )

   # Load the pipeline with NF4 quantization
   pipe = DiffusionPipeline.from_pretrained(
       model_name,
       quantization_config=quant_config,
       torch_dtype=torch.bfloat16,
       use_safetensors=True,
       low_cpu_mem_usage=True,
   ).to(device)

Seems to use 17GB of VRAM like this.

Update: doesn't work well. This approach seems to be recommended instead: https://github.com/QwenLM/Qwen-Image/pull/6/files


Qwen-Image requires at least 24GB VRAM for the full model, but you can run the 4-bit quantized version with ~8GB VRAM using libraries like AutoGPTQ.


16GiB RAM with 8-bit quantization.

This is a slightly scaled up SD3 Large model (38 layers -> 60 layers).


For prod inference, 1xH100 is working well.


Two P40 cards together will run this for under $300.


> I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

For PCs, I take it you need one with two PCIe 4.0 x16 (or more recent) slots? Quite a few consumer motherboards have those. You then put in two GPUs with 24 GB of VRAM each.

A friend runs a setup like this (I don't know if he's tried Qwen-Image yet): it's not an "out of this world" machine.


Maybe not "out of this world", but still not cheap: probably $4,000 with 3090s. Pretty big chunk of change for some AI pictures.


You can’t split diffusion models like that.



