This is the real answer; I don't know what people above are even discussing when batching is the biggest reduction in costs. If it costs, say, $50k in hardware to serve one request, with batching it also costs $50k to serve 100 at the same time with minimal performance loss. I don't know the real number of users you can serve before you need to buy new hardware, but I know it's in the hundreds, so going from $50,000 to $500 in effective cost per user is a pretty big deal (assuming you have the users to saturate the hardware).
My simple explanation of how batching works: since the bottleneck in processing LLMs is loading the model's weights into the GPU cores to do the computing, instead of computing each request separately you can compute multiple requests at the same time, ergo batching.
Let's make a visual example: say you have a model with 3 sets of weights that can fit inside the GPU's cache (A, B, C) and you need to serve 2 requests (1, 2). A naive approach would be to serve them one at a time.
(Legend: LA = Load weight set A, CA1 = Compute weight set A for request 1)
LA->CA1->LB->CB1->LC->CC1->LA->CA2->LB->CB2->LC->CC2
But you could instead batch the compute parts together.
LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2
Now if you consider that loading is hundreds if not thousands of times slower than computing the same data, you'll see the big difference. Here's a "chart" visualizing the two approaches if loading were just 10 times slower. (Consider 1 letter a unit of time.)
Time spent using approach 1 (1 request at a time):
LLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLC
Time spent using approach 2 (batching):
LLLLLLLLLLCCLLLLLLLLLLCCLLLLLLLLLLCC
The difference is even more dramatic in the real world because, as I said, loading is many times slower than computing, so you'd have to serve many users before you see a serious difference in speeds. I believe the real restriction is actually that serving more users requires more memory to store each request's activation state (the KV cache), so you'll end up running out of memory and have to balance how many people per GPU cluster you want to serve at the same time.
TL;DR: It's pretty expensive to get enough hardware to serve an LLM, but once you do, you can serve hundreds of users at the same time with minimal performance loss.
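If you want to play with the numbers, here's the toy math above as code. It's only a sketch of the example, not of a real inference server; the 10x load/compute ratio, 3 weight sets, and request counts are the made-up parameters from the chart.

    # Toy model of the example above (all numbers are illustrative).
    LOAD_TIME = 10      # time units to load one weight set into cache
    COMPUTE_TIME = 1    # time units to compute one weight set for one request
    WEIGHT_SETS = 3     # A, B, C

    def serial_time(num_requests):
        # Approach 1: reload every weight set for every request.
        return num_requests * WEIGHT_SETS * (LOAD_TIME + COMPUTE_TIME)

    def batched_time(num_requests):
        # Approach 2: load each weight set once, compute it for every request.
        return WEIGHT_SETS * (LOAD_TIME + num_requests * COMPUTE_TIME)

    for n in (1, 2, 10, 100):
        print(n, serial_time(n), batched_time(n))

With the chart's numbers, 2 requests go from 66 to 36 time units, and at 100 requests batching is roughly 10x faster; with a realistic load/compute ratio the gap is far larger, which is where the "hundreds of users on the same hardware" claim comes from.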
Thanks for the helpful reply! As I still wasn't able to fully understand it, I pasted your reply into ChatGPT and asked it some follow-up questions, and here is what I understand from my interaction:
- Big models like GPT-4 are split across many GPUs (sharding).
- Each GPU holds some layers in VRAM.
- To process a request, weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math.
- Loading into cache is slow, the ops are fast though.
- Without batching: load layer > compute user1 > load again > compute user2.
- With batching: load layer once > compute for all users > hand off to GPU 2, etc. (roughly what the sketch after this list shows)
- This makes cost per user drop massively if you have enough simultaneous users.
- But bigger batches need more GPU memory for activations, so there's a max size.
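To make sure I'm picturing the "compute for all users at once" step correctly, here's a tiny sketch of it (plain NumPy with made-up sizes, not real serving code):

    import numpy as np

    # One "layer" of weights; loading this from memory is the slow part.
    hidden = 4096
    W = np.random.randn(hidden, hidden).astype(np.float32)

    def one_at_a_time(requests):
        # naive: walk over W separately for each request
        return [x @ W for x in requests]

    def batched(requests):
        # batched: stack the requests and walk over W once
        X = np.stack(requests)      # shape (num_users, hidden)
        return X @ W                # one big matrix multiply

    requests = [np.random.randn(hidden).astype(np.float32) for _ in range(8)]
    diff = np.abs(np.stack(one_at_a_time(requests)) - batched(requests)).max()
    print(diff)                     # same results up to float rounding

Either way every byte of W has to be read; the batched version just amortizes that read over the whole batch, which (if I've understood you) is the whole trick.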
This does make sense to me, but does it sound accurate to you?
Would love to know if I'm still missing something important.
This seems a bit complicated to me. They don't serve very many models. My assumption is they just dedicate GPUs to specific models, so the model is always in VRAM. No loading per request - it takes a while to load a model in anyway.
The limiting factor compared to local is dedicated VRAM: if you dedicate 80GB of VRAM locally 24 hours/day so response times are fast, it sits wasted most of the time when you're not querying.
Loading here refers to loading from VRAM into the GPU cores' cache. Loading from VRAM is so slow in terms of GPU time that the cores end up idle most of the time, just waiting for more data to come in.
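To put rough numbers on it (both figures below are assumptions for illustration, roughly a current datacenter GPU and an ~8B-parameter model in FP16, not measurements of any specific deployment):

    # Rough ceiling on single-request decoding speed when memory-bound:
    # every generated token has to stream all the weights past the cores.
    hbm_bandwidth_gb_s = 3000   # assumed ~3 TB/s of VRAM bandwidth
    model_size_gb = 16          # assumed ~8B params in FP16

    print(hbm_bandwidth_gb_s / model_size_gb)   # ~190 tokens/s ceiling for ONE request

    # A batch of 100 reuses the same weight traffic for 100 requests, so
    # throughput can approach ~100x that, until compute or the memory needed
    # for activations/KV cache becomes the new limit.

The cores can do far more math per byte loaded than a single request gives them, which is exactly the idle time being described.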
But you still have to load the data for each request. And in an LLM doesn't this mean the WHOLE KV cache, because the KV cache changes after every token? So why isn't THIS the bottleneck? Gemini is talking about a context window of a million tokens - how big would the KV cache for this get?
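For a sense of scale, the usual back-of-envelope for KV cache size goes like this; the layer count, KV-head count, and head dimension below are assumptions for a hypothetical large model, since Gemini's actual architecture isn't public:

    # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per value
    layers = 80          # assumed
    kv_heads = 8         # assumed (grouped-query attention keeps this small)
    head_dim = 128       # assumed
    bytes_per_value = 2  # FP16/BF16

    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    context = 1_000_000
    print(per_token, "bytes/token ->", per_token * context / 1e9, "GB at 1M tokens")

So yes, under these assumptions the per-request KV cache at very long contexts is hundreds of GB, and reading it becomes a big part of the memory traffic, which is why so much work goes into grouped-query attention, KV quantization, paging, and similar tricks to shrink it.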
Back in the GPT-3 days people said that prompt engineering was going to be dead due to prompt tuning. And here we are, 2 major versions later, and I've yet to see it in production. I thought it would be useful not only to prevent leaks like these, but also to produce more reliable results, no?
If you don't know what prompt tuning is, it's when you freeze the whole model except a certain number of embeddings at the beginning of the prompt and train only those embeddings. It works like fine-tuning, but you can swap them in and out, since they behave just like normal text tokens; they just have vectors that don't map directly to discrete tokens. If you know what textual inversion is in image models, it's the same concept.
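In code the idea is tiny. A minimal sketch assuming a HuggingFace-style causal LM (get_input_embeddings and the inputs_embeds/labels arguments are the usual transformers interface; everything else here is illustrative):

    import torch
    import torch.nn as nn

    class SoftPrompt(nn.Module):
        """Freeze the whole model; train only a handful of 'virtual token' embeddings."""
        def __init__(self, model, n_virtual_tokens=20):
            super().__init__()
            self.model = model
            for p in self.model.parameters():
                p.requires_grad = False                    # the model itself stays frozen
            dim = model.get_input_embeddings().embedding_dim
            # The only trainable parameters: vectors in embedding space that
            # don't correspond to any real token in the vocabulary.
            self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, dim) * 0.02)

        def forward(self, input_ids, labels=None):
            tok = self.model.get_input_embeddings()(input_ids)      # (B, T, D)
            prefix = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
            embeds = torch.cat([prefix, tok], dim=1)                # prepend the soft prompt
            if labels is not None:
                ignore = torch.full((tok.size(0), self.prompt.size(0)), -100,
                                    dtype=labels.dtype, device=labels.device)
                labels = torch.cat([ignore, labels], dim=1)         # no loss on the prefix
            return self.model(inputs_embeds=embeds, labels=labels)

    # Hypothetical usage: wrap any causal LM and optimize only wrapped.prompt, e.g.
    # wrapped = SoftPrompt(AutoModelForCausalLM.from_pretrained("..."))
    # optimizer = torch.optim.AdamW([wrapped.prompt], lr=1e-3)

Training touches only n_virtual_tokens x hidden_size numbers, so you end up with one tiny tensor per task that you can swap in and out, the same way textual inversion embeddings are swapped in image models.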
I think prompt tuning might be worth doing for specific tasks in agentic workflows. For general prompts, using words instead of fine-tuned input vectors might be good enough. It's also easier to update.
The fact that the model leaks some wordy prompt doesn't mean its actual prompt isn't finetuned embeddings. It wouldn't have a way to leak those using just output tokens, and since you start finetuning from a text prompt, it would most likely return that text or something close.
> As things have shifted more towards mass consumption of model weights it's become less and less common to see.
Not the real reason. The real reason is that training has moved to FP/BF16 over the years as NVIDIA made that more efficient in their hardware; it's the same reason you're starting to see some models released in 8-bit formats (DeepSeek).
Of course people can always quantize the weights to smaller sizes, but the master version of the weights is usually 16-bit.
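For anyone unsure what "quantize the weights to smaller sizes" means mechanically, here's a toy symmetric int8 round-trip (pure NumPy, per-tensor scale; real schemes use per-channel or per-group scales and cleverer rounding):

    import numpy as np

    w = np.random.randn(4096, 4096).astype(np.float16)    # "master" FP16 weights

    # Symmetric per-tensor int8 quantization: store int8 values plus one scale.
    scale = float(np.abs(w).max()) / 127.0
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

    # Dequantize (done on the fly at inference time).
    w_deq = w_int8.astype(np.float16) * np.float16(scale)

    print(w.nbytes // 2**20, "MiB as FP16 ->", w_int8.nbytes // 2**20, "MiB as int8")
    print("max abs error:", float(np.abs(w.astype(np.float32) - w_deq.astype(np.float32)).max()))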
The wear on the parity drive is the same regardless of which RAID technology you choose; Unraid just lets you have mismatched data drives. In fact, you could argue that Unraid is healthier for the drives, since a write doesn't trigger a write on all drives, just 2. The situation you described is true for any RAID system.
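The "just 2 drives" part follows from how XOR parity updates work; you never need to touch (or even spin up) the other data drives to keep parity correct. A toy sketch with single blocks as integers:

    # Dedicated-parity layout (RAID4 / Unraid style): parity = XOR of all data blocks.
    data = [0b1010, 0b0110, 0b1111]     # blocks at the same offset on 3 data drives
    parity = data[0] ^ data[1] ^ data[2]

    # Writing a new block to drive 1 only touches drive 1 and the parity drive:
    old, new = data[1], 0b0001
    parity ^= old ^ new                 # remove the old contribution, add the new one
    data[1] = new
    assert parity == data[0] ^ data[1] ^ data[2]

    # Rebuilding a failed drive (say drive 2) from parity plus the survivors:
    assert (parity ^ data[0] ^ data[1]) == data[2]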
If what you said were true then they would ban all porn and not just rape/incest/bestiality porn. They're banning specific genres of porn which makes it an obvious morality issue.
I can't back this up with facts but the chargeback myth smells of an old astroturfing campaign to justify the moral policing on porn in general. But nowadays porn is more commonly accepted so they're shifting to more specific genres.
The new myth seems to be that payment processors can be held legally liable for facilitating illegal transactions, but the only lawsuits against payment processors I can find are about child pornography, which has always been banned on Steam.
When you add that an advocacy group sent an open letter to payment processors a week ago about this same exact issue[1], the chargeback excuse has zero merit.
So yeah, it's 100% a moral crusade. Which side of the crusade you sit on is up to you.
> I think the bulk of this is kind of a silly semantic argument
> "After all, I didn't click 'Rent now'. I paid 60 bucks and clicked a button that said 'Buy'"
> With all the Terms of Service, you rarely actually "own" a piece of software. If all the "Buy" buttons were replaced with "Lease License" buttons - would everyone suddenly stop complain? I doubt it
That's not what the campaign is pushing for; the campaign is pushing for not allowing such things in the terms of service in the first place, within reason. The simplest way to explain it: if you're paying for a subscription (e.g. MMOs), then the current behavior is fine. If you're paying a one-time fee for a game that depends on a server somewhere to even be playable, then the publisher would be required to make an end-of-life plan so the server isn't required to continue playing, even if in a limited state. Think of all the single-player games that require online connections because publishers are pushing for single-player microtransactions now.
The campaign aims to stop publishers from adding unnecessary dependencies that can shut down games. This would stop, for example, publishers from killing a game when releasing its sequel, forcing users to repurchase what is essentially the same game. The Crew is a good example of this (and what started the campaign in the first place); other examples are sports games and series like CoD that get yearly releases while services for older versions get killed arbitrarily.
> They have the right to distribute it the way they see fit.
The law and society decide that. Game publishers of course want to control things end-to-end and rent-seek instead of sell, because having an eternal source of revenue is much more profitable for them, but the rest of society has established such practices to be predatory. No consumer should be happy that Adobe does subscriptions only; no consumer should be happy that Apple controls its ecosystem end-to-end to the point that it gets a cut of all monetary transactions in its system. Gamers already see similar behavior from Sony and Nintendo on the hardware side, and EA, Warner, Ubisoft, Activision, etc. are pushing things further on the software side. This campaign is meant to be a push-back against such behavior. There used to be a time when charging money for horse armor was a scandal; we'll never go back to those times, but most gamers agree that gaming has gone too far in nickel-and-diming the consumer, so this is a pushback to at least not have publishers kill games out of greed.
The law already restricts software sales in many ways; there are many cases of mass refunds because software was sold with deceptive practices, or because of anti-competitive shit like forced bundling or market-dominance abuse.
I've found that the best self-hosted Excalidraw experience is actually running it inside Nextcloud; it's called Whiteboard over there, but it's actually Excalidraw. Setup is a bit finicky but workable if you understand how reverse proxies work.
Nextcloud gives you an actual file-based workflow, and collaboration works out of the box, so if you give someone the URL they can see what you're doing, and you can let them make edits as well.
The title is just fearmongering: it's removing the driver from being automatically installed through Windows Update, not preventing it from being installed altogether. They're also not revoking the signatures, so downloading and installing directly from the vendor's site still works (and is still the recommended way to do it).
The equivalent in the linux world would be removing a driver from the main repo, requiring the user to either install the rpm/deb manually or use a third party repo.
Hence the "not bothered". These are a tiny little part of their delivery. They are not doing this to save a few megs in their next multi-gig update cycle. They are doing this to, again, make running older hardware more difficult.
This is insanely fast. My guess is that the tradeoff here is that the GPUs will always be working at max capacity, so there will be minimal compute savings from batching, which I realize now is not really a tradeoff.
My only worry is that the diffusion objective will be worse than AR in terms of model capabilities. If that's the case, hopefully multi-token AR models will perform as well as diffusion, or we can use this as a draft model for speculative decoding.
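For context on the speculative decoding idea: a cheap draft model proposes a few tokens and the expensive target model verifies them in one pass, so you keep the target's output while paying mostly the draft's latency. Below is a greedy toy version, with both "models" as stand-in functions so the sketch runs as-is; real speculative sampling verifies with probability ratios rather than exact greedy matches.

    from typing import Callable, List

    def generate(target: Callable[[List[int]], int],
                 draft: Callable[[List[int]], int],
                 prompt: List[int], max_new: int, k: int = 4) -> List[int]:
        out = list(prompt)
        while len(out) - len(prompt) < max_new:
            # 1. The draft model cheaply proposes k tokens.
            ctx, proposal = list(out), []
            for _ in range(k):
                proposal.append(draft(ctx))
                ctx.append(proposal[-1])
            # 2. The target model checks each proposed position
            #    (in a real system this is one batched forward pass).
            accepted = 0
            for i in range(k):
                if target(out + proposal[:i]) == proposal[i]:
                    accepted += 1
                else:
                    break
            out += proposal[:accepted]
            # 3. The target always supplies the next token itself, so progress is guaranteed.
            out.append(target(out))
        return out[:len(prompt) + max_new]

    # Stand-ins: the "target" counts upward; the "draft" occasionally disagrees.
    target = lambda toks: toks[-1] + 1
    draft = lambda toks: toks[-1] + 1 if toks[-1] % 6 else toks[-1]
    print(generate(target, draft, [0], max_new=10))   # [0, 1, 2, ..., 10]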
Why do you suspect dLLMs should not match (or surpass) arLLMs in quality? The general idea is that it is easier to treat the output as a structured whole (idea, points, concepts, words - in a tree) that is refined iteratively - that should go in the direction of "proper" quality.
Another intuition is simply that whenever the causal relationships in the training data are sequential, you have a lower probability of getting the correct token at a given position, because you have less of the causal information leading up to that position than you would with AR, and thus during training you almost always end up with a worse model (think of the words within a function of source code, even if the functions themselves are unsorted and thus form a tree at the high level). Imagine you somehow already have N tokens in a sequence: is it easier to predict token N+1 next, or token N+15? I do like the performance tradeoff for some use cases though, and I hope we see more models soon. For image tokens my argument does not hold, because causality is not as clear as for text, math, code, or time series.
My intuition is that the harder it is for an LLM to do something during training, the more actual compression/learning gets encoded in its weights. With multi-token/diffusion it becomes much easier to "reward/loss hack" your way through; this won't matter much during pretraining, but I assume a lot of "cheating" will happen in the finetune/RL phase.