Any speed up that is 2x is definitely worth fixing. Especially since someone has...

menaerus · 2025-06-24T18:14:25 1750788865

So, what exactly is a batch inference workload, and how would someone running inference on a local setup benefit from it? Or how would I even benefit from it if I had a single machine hosting multiple users simultaneously?

I believe batching is a concept that is only useful during training or fine-tuning.

zargon · 2025-06-24T18:36:30 1750790190

Batch inference is just running multiple inferences simultaneously. If you have simultaneous requests, you’ll get incredible performance gains, since a single inference doesn’t leverage any meaningful fraction of a GPU’s compute capability.

For local hosting, a more likely scenario where you could use batching is if you had a lot of different data you wanted to process (lots of documents or whatever). You could batch them in sets of x and have it complete in 1/x the time.
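As a rough sketch of "batch them in sets of x", the grouping step could look like this in Python. The `chunked` helper and the `run_batch` name mentioned in the comment are hypothetical, not taken from any particular inference library. The 1/x figure is the ideal case: it holds only while the GPU has spare compute and enough VRAM for the whole batch.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, group
        yield batch

documents = [f"doc-{i}" for i in range(10)]
batches = list(chunked(documents, 4))
# Each group would then go to the model as one batched call,
# e.g. a hypothetical run_batch(group).
```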

A less likely scenario is having enough users that you can make the first user wait a few seconds while you wait to see if a second user submits a request. If you do get a second request, then you can batch them and the second user will get their result back much faster than if they had had to wait for the first user’s request to complete first.

Most people doing local hosting on consumer hardware won’t have the extra VRAM for the KV cache for multiple simultaneous inferences though.
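To put a rough number on that, here is a back-of-the-envelope KV cache estimate in Python. The formula (keys plus values, per layer, per KV head, per token) is the standard one for decoder transformers; the model shape used below (32 layers, 8 KV heads, head dimension 128, fp16) is only an illustrative assumption, not a statement about any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value, batch_size):
    """Memory for cached keys and values across a whole batch."""
    # Factor of 2: one tensor for keys, one for values.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Assumed, illustrative shape: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
one_request = kv_cache_bytes(32, 8, 128, seq_len=8192, bytes_per_value=2, batch_size=1)
four_requests = kv_cache_bytes(32, 8, 128, seq_len=8192, bytes_per_value=2, batch_size=4)

print(f"1 request:  {one_request / 2**30:.1f} GiB")
print(f"4 requests: {four_requests / 2**30:.1f} GiB")
```

The cache grows linearly with batch size and context length, and it comes on top of the weights themselves.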

menaerus · 2025-06-24T18:47:13 1750790833

Wouldn't batching the multiple inference requests from multiple different users with multiple different contexts simultaneously impact the inference results for each of those users?

pests · 2025-06-24T20:36:05 1750797365

The different prompts being batched do not mathematically affect each other. When running inference you have massive weights that need to get loaded and unloaded just to serve the current prompt and however long its context is (maybe just a few tokens). Batching lets you move the weights around less to serve the same amount of combined context.

menaerus · 2025-06-24T21:01:56 1750798916

Batching isn't about "moving weights around less". Where do you move the weights anyway once they are loaded into the GPU VRAM? Batching, as always in CS problems, is about maximizing the compute per round trip, which in this case is the DMA transfer of the context from CPU RAM to GPU VRAM.

The premise of self-attention is exactly that it isn't context-free, so it is also incorrect to say that batched requests do not mathematically affect each other. They do, and that's by design.

zargon · 2025-06-24T21:35:00 1750800900

> Where do you move the weights anyway once they are loaded into the GPU VRAM?

The GPU can’t do anything with weights while they are in VRAM. They have to be moved into the GPU itself first.

So it is about memory round trips, but not between RAM and VRAM. It’s the round trips between the VRAM and the registers in the GPU die. When batch processing, the calculations for all batched requests can be done while the model parameters are in the GPU registers. If they were done sequentially instead, you would multiply the number of trips between the VRAM and the GPU by the number of individual inferences.

Also, batched prompts and outputs are indeed mathematically independent from each other.
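One way to check the independence claim is a toy example. This is a minimal pure-Python sketch, assuming a single projection matrix followed by unmasked self-attention; all function names are made up for illustration. The batched path applies each weight row to every token of every request in one pass, but attention is still computed only within each request.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def self_attention(seq):
    """Each token attends only to tokens of its own sequence."""
    d = len(seq[0])
    out = []
    for q in seq:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in seq])
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out

def forward_single(W, seq):
    projected = [[dot(row, tok) for row in W] for tok in seq]
    return self_attention(projected)

def forward_batched(W, batch):
    # Flatten all requests so each weight row is visited once for the whole batch.
    flat = [tok for seq in batch for tok in seq]
    columns = [[dot(row, tok) for tok in flat] for row in W]
    projected = [[col[t] for col in columns] for t in range(len(flat))]
    # Split back per request before attention, so requests never mix.
    outputs, start = [], 0
    for seq in batch:
        outputs.append(self_attention(projected[start:start + len(seq)]))
        start += len(seq)
    return outputs

W = [[0.5, -1.0], [2.0, 0.25]]
prompt_a = [[1.0, 2.0], [0.0, -1.0]]
prompt_b = [[3.0, 3.0], [-2.0, 0.5], [1.0, 1.0]]

alone = forward_single(W, prompt_a)
batched = forward_batched(W, [prompt_a, prompt_b])
same = all(
    math.isclose(x, y)
    for row_b, row_a in zip(batched[0], alone)
    for x, y in zip(row_b, row_a)
)
print("prompt_a unchanged by batching:", same)
```

On real GPUs, floating-point rounding can differ slightly between batch sizes because kernels may sum in a different order, but that is numerical noise, not one prompt attending to another.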

menaerus · 2025-06-25T06:07:13 1750831633

Round trips between VRAM and GPU registers? That's what the cache hierarchies are for. I think you've confused quite a few concepts here.

Moving data to and from VRAM is ~100ns of latency. Moving data from RAM to VRAM through PCIe 5.0 is 1-10us of latency. So, ~1 to ~2 orders of magnitude of difference.

And this is the reason why batching is used - you don't want to pay the price of that latency for each and every CPU-to-GPU request but you want to push as much data as you can through a single round-trip.

hexaga · 2025-06-25T11:19:42 1750850382

Model weights are significantly larger than cache in almost all cases. Even an 8B parameter model is ~16G in half precision. The caches are not large enough to actually cache that.

Every weight has to be touched for every forward pass, meaning you have to wait for 16G to transfer from VRAM -> SRAM -> registers. That's not even close to 100ns: on a 4090 with ~1TB/s memory bandwidth that's 16 milliseconds. PCIe latency to launch kernels or move 20 integers or whatever is functionally irrelevant on this scale.

The real reason for batching is it lets you re-use that gigantic VRAM->SRAM transfer across the batch & sequence dimensions. Instead of paying a 16ms memory tax for each token, you pay it once for the whole batched forward pass.
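The arithmetic behind those numbers, as a small Python sketch. It uses the figures from this comment (8B parameters, 2 bytes each, ~1 TB/s) and assumes decoding is purely bound by reading the weights once per forward pass; it ignores KV cache reads and compute limits, so the throughput values are upper bounds, and the function names are just illustrative.

```python
def weight_read_seconds(params, bytes_per_param, bandwidth_bytes_per_s):
    """Time to stream every weight from VRAM once."""
    return params * bytes_per_param / bandwidth_bytes_per_s

def max_tokens_per_second(batch_size, read_seconds):
    """One token per request per forward pass, all sharing a single weight read."""
    return batch_size / read_seconds

# Figures from the comment: 8B params, fp16, ~1 TB/s memory bandwidth.
t = weight_read_seconds(8e9, 2, 1e12)
print(f"weight read per forward pass: {t * 1e3:.1f} ms")

for batch_size in (1, 8, 32):
    rate = max_tokens_per_second(batch_size, t)
    print(f"batch {batch_size:>2}: up to {rate:.1f} tokens/s in total")
```

Past some batch size the workload stops being memory-bound and becomes compute-bound, so the linear scaling does not continue indefinitely.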

menaerus · 2025-06-25T11:58:25 1750852705

You've made several incorrect assumptions and I am not bothered enough to try to correct them so I apologize for my ignorance. I'll just say that 16ms memory tax is wildly incorrect.

namibj · 2025-06-25T16:58:58 1750870738

You either have a massive misconception about GPT-like decoder transformers or about how GPU data paths are architected, or you are trolling. Go talk to a modern reasoning model to get yourself some knowledge; it's going to be much better than what you appear to have.

pests · 2025-06-25T08:14:35 1750839275

> That's what the cache hierarchies are for

That’s the core point though. If you do batches the cache and registers are already primed and ready. The model runs in steps/layers accessing different weights in VRAM along the way. When batching you take advantage of this.

I’m in agreement that RAM to VRAM is important too but I feel the key speed up for inference batching is my above point.

menaerus · 2025-06-25T11:46:34 1750851994

Not really. Registers are irrelevant. They are not the bottleneck.

pests · 2025-06-25T19:02:52 1750878172

Computation happens in the registers. If you’re not moving data to registers you aren’t doing any compute.