This isn't a memory-reduction technique that's somehow magical, though it does manage memory with some clever scheduling. The core idea is that you can schedule inference across edge nodes in a memory- and bandwidth-optimized way that's a bit different from just splitting layers.
They propose that right now computation and latency dominate the costs for multi-node inference, and pick a network topology (star) that is savvy to that.
That said, it's 26-29 seconds per token for Llama 2 70B with their 8 edge devices, each using 4 GB of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge with current hardware.
I think the paper makes the case that you could probably recruit, say, your 30 graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.
Upshot: interesting paper with smart ideas. Large frontier models still need very exotic hardware and high-bandwidth interconnects, but this may point a way forward on lowering the interconnect-bandwidth part of the story.
I think the main advantage here is that you COULD run it, even if it takes a while. That is a step up from current model limitations, which require enough RAM or VRAM to hold the model.
I think this lays some groundwork for running a 400B model on a 3090/4090 or an even smaller GPU. If you can get a huge model like that running on a single GPU, even if the mean time per token is measured in seconds, that's acceptable for many use cases.
If this same technique can be used to extend context windows in addition to token autocomplete, that would be great in its own right.
Hopefully work like this continues; throwing a ton of VRAM at a model should be regarded as a performance optimization, not necessarily a requirement.
> That is a step up from current model limitations which require ram or vram to hold the model.
Current? Apple recently published a neat paper on how they optimise for both inference (cpu/gpu) and memory use:
> Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, “windowing” strategically reduces data transfer by reusing previously activated neurons, and second, “row-column bundling”, tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with up to 4x and 20x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively.
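If I'm reading the "windowing" technique right, the core bookkeeping is something like the sketch below. All names here are illustrative, and the real method operates on sparse FFN activations rather than toy sets of ids:

```python
# Hedged sketch of the "windowing" idea (names illustrative): keep the
# neurons activated for the last few tokens resident in DRAM, and only
# fetch from flash the ones that are newly active for the current token.
from collections import deque

class NeuronWindow:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # active set per recent token
        self.cached = set()                      # neurons resident in DRAM

    def step(self, active):
        """Return the neuron ids that must be loaded from flash this token."""
        to_load = active - self.cached
        self.window.append(active)
        self.cached = set().union(*self.window)  # union over the window
        return to_load

w = NeuronWindow(window_size=2)
print(w.step({1, 2, 3}))  # cold cache: everything loads
print(w.step({2, 3, 4}))  # only neuron 4 is new
print(w.step({1, 5}))     # 1 is still within the window; only 5 loads
```

Because consecutive tokens activate heavily overlapping neuron sets, most steps load very little, which is presumably where the claimed flash-transfer savings come from.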
It's already technically possible to run huge models locally when you don't have the RAM/VRAM needed - llama.cpp can 'mmap' the model from disk.
Of course, an Nvidia 4090 has a memory bandwidth of about 1,000 GB/s; a CPU like the i7-13700K has a memory bandwidth of 90 GB/s; and a high-end NVMe SSD might only have a read bandwidth of 10 GB/s.
So in approximate terms, an LLM and quantisation level that can produce 10 tokens per second on a 4090 will produce 1 token per second from RAM and a token every 10 seconds from SSD.
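A quick sanity check of that scaling, assuming decoding is purely bandwidth-bound and every weight is read once per generated token (the 40 GB figure is an illustrative stand-in for a ~70B model at roughly 4-bit quantization):

```python
# Back-of-envelope decode speed: tokens/s ~= bandwidth / bytes read per token.
# Assumes bandwidth-bound decoding where all weights are read once per token;
# 40e9 bytes is an illustrative stand-in for a ~70B model at ~4-bit quant.

MODEL_BYTES = 40e9

def tokens_per_second(bandwidth_bytes_per_s):
    return bandwidth_bytes_per_s / MODEL_BYTES

for name, bw in [("4090 VRAM", 1000e9), ("CPU RAM", 90e9), ("NVMe SSD", 10e9)]:
    print(f"{name}: {tokens_per_second(bw):.2f} tokens/s")
```

The absolute numbers shift with model size and quantization, but the roughly 100:9:1 ratio between VRAM, RAM, and SSD falls out of the bandwidth figures alone.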
> I think the main advantage here is you COULD run it, even it it takes a while.
I mean, you COULD run it before as well, even if you don't have enough RAM or VRAM, by using something like `zram`. It'd probably be even slower (and borderline usable, depending on the use case), but it's not impossible to get things to run.
Do you think this allows distributed inference only, or does it open the door to distributed training as well? Democratization of models is partly hampered by the total compute a single person or small group can make use of, but if a Folding@home-style project for training large models is possible, it could change the game somewhat.
> I think the paper makes the case that you could probably recruit say your 30 graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.
Could be a big deal if it allows clusters of smaller GPUs to compete with a single large-VRAM GPU.
Unfortunately I’m a few months out of date, which is an eternity in LLM inference techniques, so I’m not sure what the current state of distributed inference looks like.
Yeah, I think these methods could be baked into llama.cpp or some Python library higher up the toolchain, or what have you. They shard out each layer (ish?) to the edge nodes and recombine that layer's inference at the master node, while the edge nodes load up their next bit if they need to; I would guess the devil is in the details for all the possible tensor types and architectures (for instance, how shall we implement skip layers?).
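My rough mental model of that star scheme, as a toy sketch. Everything here is hypothetical: the real system moves activation tensors over the network, handles quantization, and so on, while this just splits a layer list into contiguous shards and streams a value through them via the master:

```python
# Toy star-topology layer sharding: the master assigns each edge node a
# contiguous slice of layers, then streams activations through the nodes
# one hop at a time, so each node only needs its shard's weights resident.

def shard_layers(n_layers, n_nodes):
    """Split layer indices into n_nodes contiguous shards."""
    base, extra = divmod(n_layers, n_nodes)
    shards, start = [], 0
    for i in range(n_nodes):
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards

def run_on_node(node_id, node_layers, x):
    """Stand-in for an RPC to an edge node holding only its shard."""
    for layer in node_layers:
        x = layer(x)
    return x

def forward(x, layers, shards):
    """Master loop: one star hop per shard, recombining at the master."""
    for node_id, shard in enumerate(shards):
        x = run_on_node(node_id, [layers[i] for i in shard], x)
    return x

# 80 dummy "layers" that each add 1, sharded over 8 nodes: result is 80.
print(forward(0, [lambda v: v + 1] * 80, shard_layers(80, 8)))
```

The awkward parts the sketch glosses over are exactly the ones mentioned above: anything that isn't a plain sequential stack (skip layers, shared embeddings) breaks the neat contiguous-slice assumption.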
While I do think there's going to be a huge market for cloud-based LLM serving, the fact that consumer hardware can run close-to-SOTA models fairly easily (e.g. a high-RAM MBP config) suggests to me that the provider market won't be as big as investors are betting on.
Most of the rewards will be reaped by consumers rather than providers.
We're also in an age where the current levels of RAM in consumer devices were almost entirely optimized for prior to the existence of LLMs. I find it highly likely vendors will optimize for higher RAM capacity over other priorities in future hardware.
How long until a 256GB RAM laptop (shared with GPU) is reasonably cheap/available? I give it a few years at most.
It's possible that models grow orders of magnitude larger, but I find it more likely that the size of models will grow along the curve of cost of training decreasing/hardware cost improvements. There will be a sweet spot where it's economical to train larger models, and private companies won't push much beyond that.
Enterprises use LLMs too, and quite often there wouldn't be any client you could reasonably run the model on. (You wouldn't want to e.g. have an LLM summarize and categorize a user request on their device, since that would require you shipping your model and/or internal knowledge base to the client).
Yes, but if you can run a sufficient LLM on a $2,000 laptop, then the cost to serve it from the cloud will be similarly cheap. (e.g. reserve an appropriately sized EC2 instance for pennies on the dollar)
It's a highly competitive market. Companies aren't going to pay $100k/year to run a model on something that can run on a $2k consumer-grade device.
128GB of gpu accessible/fast RAM can be had for $5000 on a macbook pro today. What will it be 3-4 years from now on linux/windows machines?
And we still haven't seen any SoC providers try to optimize for RAM capacity over compute yet.
Oh yes, I could definitely see the privacy-preserving consumer use case creating sufficient demand for efficiency that also bleeds over into the enterprise market.
That's what's happened with power efficiency and ARM CPUs, after all!
This is where I want highly sensitive healthcare uses of LLMs to end up: note summarization, suggested diagnoses (with the provider always in control), and other augmented abilities for clinical staff, without the risk of healthcare data being sent outside the device or the very local network.
Even if bandwidth weren't an issue and all users had compatible hardware: You'd still be offloading a (semi-)trusted computation to user hardware, which is usually completely untrusted.
It would be nice for the inference time to be paired with a measure of output quality. I'm not well versed in how the architecture works, but I have a hard time believing a 90% reduction in peak memory footprint comes cost-free.
It's not cost-free. It comes at the cost of greatly increased latency. 29.9 seconds per token with Llama 3.1-70B. This is from Table 1 (pg 8) of the paper.
That is s/token and not token/s. The cost is high.
The actual goal of the paper is to highlight that we can optimise overall speed by decreasing link latency. Link latency, because it's not one machine but several low-end devices that are used together to serve the 70B LLM.
That looked like an analogy to me. Back in the days of a mechanical arm moving over magnetic platters in our PCs, you could have the illusion of infinite RAM as long as you were OK with microsecond operations taking two million times longer. This is akin to that.
Is there any predictability/patterns for neuron/layer activation? If so, would it be reasonable to have a second tiny model that specifically tries to predict activation and preemptively swap those into memory?
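Something like that seems plausible. Here's a toy of what I mean, using plain co-occurrence counts in place of a learned predictor; everything here is illustrative, and a real system might use a small MLP over the current layer's activations instead:

```python
# Toy activation predictor: learn which next-layer neurons tend to co-fire
# after the current layer's active set, then prefetch the top-k predictions
# into memory before they're needed. Co-occurrence counts stand in for a
# trained model; all names and the interface are hypothetical.
from collections import Counter, defaultdict

class ActivationPredictor:
    def __init__(self, top_k=2):
        self.top_k = top_k
        self.counts = defaultdict(Counter)  # neuron -> next-layer tallies

    def observe(self, cur_active, next_active):
        """Record which next-layer neurons fired after this active set."""
        for c in cur_active:
            self.counts[c].update(next_active)

    def prefetch(self, cur_active):
        """Predict (and would preemptively load) the likely next neurons."""
        votes = Counter()
        for c in cur_active:
            votes.update(self.counts[c])
        return {n for n, _ in votes.most_common(self.top_k)}

p = ActivationPredictor(top_k=2)
p.observe({1, 2}, {10, 11})
p.observe({1, 2}, {10, 12})
print(p.prefetch({1, 2}))  # neuron 10 is always among the predictions
```

The interesting question is whether the predictor itself stays small enough that its memory and compute cost doesn't eat the savings.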
From what I get skimming through the article the main cost is speed of token generation (token latency). You can always run a large model by reading directly from the disk and not care much about ram; but it is very slow. They try to improve that aspect doing some optimisations, but it is still definitely slower than using ram or vram.
I want to add that their chart shows s/token per device (edit: as per the heading on Table 1; it could also just be confusing grammar), so it sounds like you are getting 4x the listed s/token on their 4-laptop cluster. Their laptops are not even hardwired; they are connecting over WiFi.
This comes at a very interesting time for me. I have an ancient dual-Xeon workstation with 64 GB of memory that I was researching how to convert to run an LLM. To start, I can just run four instances on that same machine and see how it goes, without purchasing a better GPU. It sounds like this will allow you to run very large models with minimal quants on Craigslist-quality devices.
If it does what they say it does (and it seems to do), it will be an absolute game changer for most users.
I've only read the abstract but they don't mention quantizing the weights or otherwise trying to shrink the model in any way.
They're claiming to be able to efficiently run larger models without loading the entire thing into GPU memory. If they're using the same weights, the same architecture and just using tensor parallel operations to perform the forward pass that would imply no loss in quality.
I'm sure there are trade-offs but they're not clear by just looking at the abstract.
I read it like this too - no drop in weights or model quality just optimizing the lower boundaries of performance when you are splitting from vram to ram to disk (or network).
Exo is for partitioning over network across devices (implementing some bandwidth-reducing partitions) but still requires a minimum ram/vram requirement to load a model. This could, in theory, be combined to allow larger models to run on exo clusters with less gpu/ram than is required by the underlying model (at the cost of some performance no doubt, but still).
This looks like potentially some promising research that I'm looking into reproducing now. We want to lower the barrier to running large models as much as possible so if this works, it would be a potential addition to the exo offering.
Man, I need to test the Q8 version with llamafile's optimizations. It would be so nice to host it locally with the new Ryzens; it could maybe fit in my 96 GB of RAM.
Realistically, you probably want to wait until Vulkan support trickles out. That way, you aren't at the whim of the various evil hardware drivers (every vendor's suck), and the AI can give you a disappointingly confused answer much faster than running the LLM on a CPU can.
I'm not aware of any Debian family distro that packages it, but NixOS has at least ollama and llama-cpp in its repos. Honestly even if the more stable distributions did have these things packaged, I would hesitate to use the packaged versions because all of this stuff is still so quickly moving that you'd be on an old version and it would hurt.