
This may be obvious to people who do this regularly, but what kind of machine is required to run this? I downloaded & tried it on my Linux machine that has a 16GB GPU and 64GB of RAM. This machine can run SD easily. But Qwen-Image ran out of memory both when I tried it on the GPU and on the CPU, so that's obviously not enough. But am I off by a factor of two? An order of magnitude? Do I need some crazy hardware?


> This may be obvious to people who do this regularly

This is not that obvious. Calculating VRAM usage for VLMs/LLMs is something of an arcane art. There are about 10 calculators online you can use and none of them work. Quantization, KV caching, activation, layers, etc all play a role. It's annoying.

But anyway, for this model, you need 40+ GB of VRAM. System RAM isn't going to cut it unless it's unified RAM on Apple Silicon, and even then, memory bandwidth is shot, so inference is much much slower than GPU/TPU.
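As a back-of-envelope check (my own sketch, not an exact calculator): the weights alone take roughly parameter count times bytes per parameter, and activations, the VAE, and the text encoder add overhead on top.

```python
# Back-of-envelope VRAM estimate: weights = params x bytes per param.
# Illustrative only; real usage adds activations, KV cache, VAE, text encoder.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Qwen-Image is ~20B parameters:
print(weight_gb(20e9, "bf16"))  # 40.0 GB at 16-bit
print(weight_gb(20e9, "int4"))  # 10.0 GB at 4-bit
```

That lines up with the 40+ GB figure for the released bf16 weights.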


Also, I think you need a single 40GB "card", not just 40GB of VRAM in total. I wrote about this upthread: you're probably going to need one card; I'd be surprised if you could chain several GPUs together.


Oh right, I forgot some diffusion models can't offload / split layers. I don't use vision generation models much at all - was just going off LLM work. Apologies for the potential misinformation.


Not sure what you mean, or maybe you're new to LLMs, but two RTX 3090s will work for this, and even lower-end cards (RTX 3060) will once it's GGUF'd.


This isn't a transformer, it's a diffusion model. You can't split diffusion models across compute nodes.


do you mean https://github.com/pollockjj/ComfyUI-MultiGPU? One GPU would do the computation, but others could pool in for VRAM expansion, right? (I've not used this node)


Nah, that won’t gain you much (if anything?) over just doing the layer swaps from RAM. You can put the text encoder on the second card, but you can also just keep it in RAM without much downside.
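For what it's worth, diffusers can handle that swapping for you: `enable_model_cpu_offload()` keeps components (including the text encoder) in system RAM and moves each one to the GPU only for its forward pass. A sketch, assuming a CUDA machine and the `Qwen/Qwen-Image` repo id — I haven't benchmarked it:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
)
# Keep all components in system RAM; each is moved to the GPU only while
# it runs, then moved back. Slower than fully on-GPU, but fits less VRAM.
pipe.enable_model_cpu_offload()

image = pipe("a corgi reading a newspaper").images[0]
image.save("out.png")
```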


Will the new AMD AI CPUs work? Like an AI HX 395 or the slower 370? I'm stuck on an A2000 with 16GB of VRAM and wondering what's a worthwhile upgrade.


It may fit but image generation on anything but Nvidia is so slow it won’t be worth it.


I believe it's roughly the same size as the model files. If you look in the transformers folder you can see there are around nine 5GB files, so I would expect you need ~45GB of VRAM on your GPU. Usually quantized versions of models are eventually released/created that can run on much less VRAM, but with some quality loss.


Why doesn't huggingface list the aggregate model size?


I've been bugging them about this for a while. There are repos that contain multiple model weights in a single repo which means adding up the file sizes won't work universally, but I'd still find it useful to have a "repo size" indicator somewhere.

I ended up building my own tool for that: https://tools.simonwillison.net/huggingface-storage


HF does this for ggufs, and it’ll show you what quantizations will work on the GPU(s) you’ve selected. Hopefully that feature gets expanded to support more model types.


I've been wondering this for literally years now...


Huggingface is just a git hosting service, like GitHub. You can add up the sizes of all the files in the directory yourself.
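The adding-up can itself be scripted: the `huggingface_hub` client reports per-file sizes via `model_info(..., files_metadata=True)`. A sketch — the summing helper is mine, and the fetch part needs network access so it's left commented:

```python
# Sum per-file sizes from a Hub repo listing. The helper is a pure function
# so it works on any iterable of byte counts.
def total_gb(sizes_bytes):
    """Total size of a collection of file sizes, in gigabytes."""
    return sum(sizes_bytes) / 1e9

# With network access you could fetch real sizes like this (untested sketch):
#   from huggingface_hub import HfApi
#   info = HfApi().model_info("Qwen/Qwen-Image", files_metadata=True)
#   print(total_gb(f.size for f in info.siblings if f.size))

# Offline illustration: nine ~5 GB shards, as in the transformers folder.
print(total_gb([5e9] * 9))  # 45.0
```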


That’s what we have computers for though - to compute.


Model size in GB ≈ parameter count in billions at fp8 (one byte per parameter), so if this was released at fp16 then 40-ish, and quantized to fp4 then 10-ish.


You're probably going to have to wait a couple of days for 4 bit quantized versions to pop up. It's 20B parameters.


   import torch
   from diffusers import DiffusionPipeline
   from diffusers.quantizers import PipelineQuantizationConfig

   model_name = "Qwen/Qwen-Image"
   device = "cuda"

   # Configure NF4 quantization for the transformer and text encoder
   quant_config = PipelineQuantizationConfig(
       quant_backend="bitsandbytes_4bit",
       quant_kwargs={
           "load_in_4bit": True,
           "bnb_4bit_quant_type": "nf4",
           "bnb_4bit_compute_dtype": torch.bfloat16,
       },
       components_to_quantize=["transformer", "text_encoder"],
   )

   # Load the pipeline with NF4 quantization
   pipe = DiffusionPipeline.from_pretrained(
       model_name,
       quantization_config=quant_config,
       torch_dtype=torch.bfloat16,
       use_safetensors=True,
       low_cpu_mem_usage=True,
   ).to(device)

Seems to use 17GB of VRAM like this.

Update: doesn't work well. This approach seems to be recommended instead: https://github.com/QwenLM/Qwen-Image/pull/6/files


Qwen-Image requires at least 24GB VRAM for the full model, but you can run the 4-bit quantized version with ~8GB VRAM using libraries like AutoGPTQ.


16GiB RAM with 8-bit quantization.

This is a slightly scaled up SD3 Large model (38 layers -> 60 layers).


For prod inference, 1xH100 is working well.


Two P40 cards together will run this for under $300.


> I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

For PCs, I take it you need one with two PCIe 4.0 x16 (or more recent) slots? Quite a few consumer motherboards have those. You then put in two GPUs with 24 GB of VRAM each.

A friend runs a setup like this (I don't know if he's tried Qwen-Image yet): it's not an "out of this world" machine.


Maybe not "out of this world", but still not cheap: probably $4,000 with 3090s. Pretty big chunk of change for some AI pictures.


You can’t split diffusion models like that.



