
When one does quantization, it's done in blocks. Bitsandbytes uses a blocksize of 64 I think. The dequantization W * scale + zero_point needs a scale and a zero_point per block, so you need 2 numbers in fp16 for every group of 64 numbers. For BnB you get approx 4.5bit, since 64*4bit + 16bit + 16bit = 288 bits, and 288/64 = 4.5. So 4bit is actually 4.5bit.

For HQQ 1bit, each group of 8 needs its own scale and zero_point (you mentioned 8bit for the shift). So you need 8*1bit + 16bit + 8bit = 32 bits for each group of 8, or 4 bits per param.

I'm assuming the scale and zero_point could maybe both be moved to 8bit, so 8*1bit + 8bit + 8bit = 24 bits / 8 = 3 bits per param?
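
A quick sketch of that arithmetic in Python (using the numbers above; the actual storage formats may differ slightly):

    # Rough bits-per-parameter arithmetic for block-wise quantization:
    # each block of `group_size` weights stores its quantized values plus
    # a scale and a zero_point at some precision.
    def bits_per_param(weight_bits, group_size, scale_bits, zero_bits):
        return (group_size * weight_bits + scale_bits + zero_bits) / group_size

    # bitsandbytes-style 4bit, blocksize 64, fp16 scale + fp16 zero_point
    print(bits_per_param(4, 64, 16, 16))  # 4.5

    # HQQ-style 1bit, group size 8, fp16 scale + 8bit zero_point
    print(bits_per_param(1, 8, 16, 8))    # 4.0

    # if both scale and zero_point were 8bit
    print(bits_per_param(1, 8, 8, 8))     # 3.0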

"This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!




> "This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!

Doesn't it need all the weight metadata for a layer to use that layer?

* If yes, then can't any algorithm offload x% of its data as a balancing act between speed and RAM?

* If no, then what's it for and when does it get used?


Oh yes you need all the metadata, but because it's just 2 numbers per group (the scale and zero_point), I think moving such tiny scalars into GPU registers is super fast - cannot confirm though.

It's like in cuBLAS where you do alpha*A*B + beta*C, and alpha and beta are both scalars which can live on the CPU and be moved to the GPU in nanoseconds.

I'm unsure though since I haven't tested it


It still has to go through the entire memory system. It's hard for me to imagine that transferring a number from the CPU to the GPU is faster than transferring a byte, and if you have 2 CPU-resident numbers per GPU-resident byte that's a lot of transferring.


I don't disagree - fair point, there definitely is a transfer latency overhead. I would suspect one could prefetch it by calling `.to("cuda", non_blocking = True)` say 2 layers ahead, so you can in theory hide the movement.
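
Something along these lines, as a rough sketch (the `layers` / `metas_cpu` names and the `layer(x, meta)` call are hypothetical, not HQQ's actual API):

    import torch

    # Sketch: keep each layer's scale/zero_point in pinned CPU memory and
    # copy it to the GPU one layer ahead on a side stream, so the transfer
    # can overlap with the previous layer's compute.
    copy_stream = torch.cuda.Stream()

    def start_prefetch(meta_cpu):
        with torch.cuda.stream(copy_stream):
            meta_gpu = meta_cpu.to("cuda", non_blocking=True)
            event = torch.cuda.Event()
            event.record(copy_stream)
        return meta_gpu, event

    def forward_all(layers, metas_cpu, x):
        pending = start_prefetch(metas_cpu[0])
        for i, layer in enumerate(layers):
            meta_gpu, event = pending
            if i + 1 < len(layers):
                pending = start_prefetch(metas_cpu[i + 1])
            # make sure this layer's metadata has actually arrived
            torch.cuda.current_stream().wait_event(event)
            x = layer(x, meta_gpu)  # hypothetical: layer dequantizes with its metadata
        return x

The copies only run truly asynchronously if the CPU tensors were allocated with pin_memory(), otherwise they fall back to blocking transfers.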

I think the blog did mention somewhere that HQQ at 1bit is slower for now, maybe due to the transfer overhead, although I can't exactly remember where.


My point is more that if it's that many bytes flowing around on demand, you're basically swapping layers in and out of the GPU as you use them (or x% of each layer).

Which is fine, and it's a valid feature, but you don't need to split those bytes into "data" and "metadata" to make that happen.

Is there actually something they gain from this particular method of splitting?


I guess it's mainly to reduce VRAM usage. Assuming we don't do this, a 7B model at 1bit with group size 8 will use around 3GB of VRAM, whilst 4bit with a group size of 64 will use approx 4GB.

Assume we have a 100B model - with 4bit, VRAM is around 57GB or so. With 1bit, around 43GB of VRAM, but by moving the scales and zero_points to RAM, VRAM use is like 15GB or something, at the cost of like 28GB of RAM usage.
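
Roughly the back-of-the-envelope conversion I'm doing (these ignore embeddings, norms, LoRA adapters and activations, so treat them as lower bounds):

    # bits-per-param -> GB, matching the rough numbers above
    def model_gb(n_params, bits_per_param):
        return n_params * bits_per_param / 8 / 1e9

    print(model_gb(7e9, 4.5))    # ~3.9 GB : 7B at 4bit + group-64 fp16 metadata
    print(model_gb(7e9, 4.0))    # ~3.5 GB : 7B at 1bit + group-8 metadata
    print(model_gb(100e9, 4.5))  # ~56 GB  : 100B at 4bit
    print(model_gb(100e9, 1.0))  # ~12.5 GB: 100B if only the 1bit weights stay on GPU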

I guess a valid approach is to dynamically select which ones to move to RAM or VRAM, given your VRAM budget. Say you have a 40GB GPU - clearly move more of the scalars to the GPU. But if you have a weak GPU, then you need to use more RAM.


I still don't think I'm getting my point across.

What if you store 40% of the data and 40% of the metadata in CPU memory. Is that the same for performance?

Is there a reason we want particular parts of the layer to stay on the GPU? Or is it just the number of bytes?


Oh, very good question - tbh I'm not sure. Another closely related technique is layer offloading - if your network can't fit and has layers 1, 2, ..., 32, we offload layers 16 to 32 to RAM, then load them into GPU memory on the fly.

I'm gonna guess the performance hit is similar - although I have not benchmarked it myself to verify.
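
A bare-bones sketch of what I mean by layer offloading (hypothetical `layers` ModuleList; real implementations pin memory and prefetch rather than blocking like this):

    # Naive layer offloading: keep the second half of the layers in CPU RAM
    # and move each one onto the GPU just before it runs, then move it back.
    def offloaded_forward(layers, x, split=16):
        for i, layer in enumerate(layers):
            if i >= split:
                layer.to("cuda")   # bring the offloaded layer in on the fly
            x = layer(x)
            if i >= split:
                layer.to("cpu")    # push it back out to free VRAM
        return x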


Err, you are just restating what I’m saying, without addressing the concerns.

1 - Is it fair to use RAM in two places and report only one of them without any asterisk? (If you think this is fair - oh boy, wait till you hear about my 0GB-HBM-use inference algorithm.)

2 - I know how subchannel quantization works. Are they hitting those reported latency numbers with per-layer CPU ping-pong to rescale?


Oh sorry - did not mean to restate what you said, whoops - it was all train of thought!

You can see from https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_1bi... that the model disk space is 3GB + 100MB of LoRA weights. I also uploaded a 4bit one to https://huggingface.co/unsloth/llama-2-7b-bnb-4bit/tree/main which uses 3.87GB.

So because of the offloading trick, the GPU VRAM usage is lower, but in actuality 3GB is still needed in total.

Unsure on latency sadly


Hey Daniel! The VRAM is still the same as a pure n-bit model would take. Because we only need meta-data for a single nn.Linear at a time, you only need an additional (3GB-1.7GB)/224 = 5.8MB. If we compress the meta-data as well that would become much lower.
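
To picture it, a heavily simplified sketch of the idea (not the actual implementation; `dequantize` here is just a placeholder):

    import torch
    import torch.nn as nn

    # Mock-up: the packed low-bit weights stay on the GPU, while the
    # scale / zero_point live in (pinned) CPU memory and are copied over
    # only for the nn.Linear that is currently executing, so the extra
    # VRAM at any moment is just one layer's worth of metadata.
    def dequantize(w_q, scale, zero):
        # placeholder: assumes w_q is already an unpacked float tensor and
        # scale/zero broadcast over it; real code unpacks the 1-bit values
        # and uses group-wise metadata
        return w_q * scale + zero

    class OffloadedMetaLinear(nn.Module):
        def __init__(self, w_q, scale_cpu, zero_cpu):
            super().__init__()
            self.w_q = w_q              # quantized weights, kept on GPU
            self.scale_cpu = scale_cpu  # metadata kept in CPU RAM
            self.zero_cpu = zero_cpu

        def forward(self, x):
            # bring only this layer's metadata onto the GPU for the matmul
            scale = self.scale_cpu.to(x.device, non_blocking=True)
            zero = self.zero_cpu.to(x.device, non_blocking=True)
            w = dequantize(self.w_q, scale, zero)
            return x @ w.T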


Hey :)) Oh I love the idea of async movement from CPU to GPU ahead of time - ingenious! Prefetching a small amount of metadata seems reasonable and very smart!



