Serving 70B-scale LLMs efficiently on low-resource edge devices [pdf] (arxiv.org)
200 points by simonpure 10 hours ago | 50 comments





This is not some magical memory-reduction technique, though it does manage memory with some clever scheduling. The core idea is that you can schedule inference across edge nodes in a memory- and bandwidth-optimized way that's a bit different from just splitting layers.

They propose that, right now, computation and latency dominate the costs of multi-node inference, and they pick a network topology (star) that is savvy to that.

That said, it's 26-29 seconds per token for Llama 2 70B with their 8 edge devices, each using 4 GB of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge with current hardware.

I think the paper makes the case that you could probably recruit, say, 30 of your graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.

Upshot: interesting paper with smart ideas. Large frontier models still need very exotic hardware and high-bandwidth interconnects; this may point a way forward on lowering the interconnect-bandwidth part of the story.


I think the main advantage here is you COULD run it, even if it takes a while. That is a step up from current model limitations, which require RAM or VRAM to hold the model.

I think this lays some groundwork for running a 400B model on a 3090/4090 or an even smaller GPU. If you can get a huge model like that running on a single GPU, even if the mean time per token is in the seconds, that's acceptable for many use cases.

If this same technique can be used to extend context windows in addition to token autocomplete, that would be great in its own right.

Hopefully work like this continues, as throwing a ton of VRAM at a model should be regarded as a performance optimization, not necessarily a requirement.


> That is a step up from current model limitations, which require RAM or VRAM to hold the model.

Current? Apple recently published a neat paper on how they optimise for both inference speed (CPU/GPU) and memory use:

  Our method involves constructing an inference cost model that takes into account
  the characteristics of flash memory, guiding us to optimize in two critical areas:
  reducing the volume of data transferred from flash and reading data in larger, more contiguous
  chunks. Within this hardware-informed framework, we introduce two principal techniques.
  First, “windowing” strategically reduces data transfer by reusing previously activated neurons,
  and second, “row-column bundling”, tailored to the sequential data access strengths
  of flash memory, increases the size of data chunks read from flash memory. These methods
  collectively enable running models up to twice the size of the available DRAM, with
  up to 4x and 20x increase in inference speed compared to naive loading approaches in CPU
  and GPU, respectively.
https://news.ycombinator.com/item?id=38704982
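
If I'm reading the "row-column bundling" idea right, it's roughly this layout trick: for each FFN neuron, keep its up-projection row and down-projection column adjacent on flash, so one contiguous read fetches both. A toy numpy sketch of that layout (my own simplification and names, not Apple's code):

  import numpy as np

  d_model, d_ff = 8, 32                     # toy sizes
  W_up = np.random.randn(d_ff, d_model)     # FFN up projection
  W_down = np.random.randn(d_model, d_ff)   # FFN down projection

  # Bundle: neuron i's up-projection row and down-projection column stored together,
  # so a single contiguous read pulls in everything that neuron needs.
  bundles = np.concatenate([W_up, W_down.T], axis=1)   # shape (d_ff, 2 * d_model)

  predicted_active = [3, 7, 19]             # neurons the "windowing" step says we need
  chunk = bundles[predicted_active]         # one larger read instead of two scattered ones
  W_up_active = chunk[:, :d_model]
  W_down_active = chunk[:, d_model:].T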

It's already technically possible to run huge models locally when you don't have the RAM/VRAM needed - llama.cpp can 'mmap' the model from disk.

Of course, an Nvidia 4090 has a memory bandwidth of about 1,000 GB per second; a CPU like the i7-13700K has a memory bandwidth of about 90 GB per second; and a high-end NVMe SSD might only have a read bandwidth of 10 GB per second.

So in approximate terms, an LLM and quantisation level that can produce 10 tokens per second on a 4090 will produce 1 token per second from RAM and a token every 10 seconds from SSD.
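
Back-of-the-envelope, since token generation is roughly memory-bandwidth bound: tokens/s is about bandwidth divided by the bytes you have to stream per token. A quick sanity check in Python with made-up but plausible numbers (e.g. ~40 GB for a heavily quantised 70B model):

  # Crude decode-speed estimate: every generated token streams the whole
  # (quantised) model through whatever link feeds the compute unit.
  def tokens_per_second(model_bytes, bandwidth_bytes_per_s):
      return bandwidth_bytes_per_s / model_bytes

  model_bytes = 40e9  # very roughly, a 70B model at ~4.5 bits per weight

  for name, bw in [("4090 VRAM", 1000e9), ("DDR5 system RAM", 90e9), ("NVMe SSD", 10e9)]:
      print(f"{name:>16}: ~{tokens_per_second(model_bytes, bw):.2f} tok/s")

  # Prints roughly 25, 2.25 and 0.25 tok/s -- the same ~10x steps as above.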


> I think the main advantage here is you COULD run it, even if it takes a while.

I mean, you COULD run it before as well, even if you don't have enough RAM or VRAM, by using something like `zram`. It'd probably be even slower (and borderline usable, depending on the use case), but it's not impossible to get things to run.


Do you think this only allows distributed inference, or does it open the door to distributed training as well? Democratization of models is partly hampered by the total compute a single person or small group can make use of, but if something like Folding@home, only for training large models, is possible, it could change the game somewhat.

> I think the paper makes the case that you could probably recruit say your 30 graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.

Could be a big deal if it allows a cluster of smaller GPUs to compete with a single large-VRAM GPU.

Unfortunately I'm a few months out of date - which is an eternity in LLM inference techniques - so I'm not sure what the current state of distributed inference looks like.


llama.cpp already supports splitting work across multiple nodes on a network.

It essentially just copies a chunk of the model to each one, which works well for situations where each machine has limited VRAM.
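
Conceptually that's a pipeline split: each machine holds a contiguous slice of the layers and only the small activation vector crosses the network. A toy sketch of the idea (not llama.cpp's actual code, just the shape of it):

  import numpy as np

  # Toy pipeline split: each "node" owns a contiguous chunk of the layers; only
  # the activation vector travels between machines.
  n_layers, d = 8, 16
  layers = [np.random.randn(d, d) for _ in range(n_layers)]

  class Node:
      def __init__(self, layer_chunk):
          self.layers = layer_chunk          # this machine's chunk of the model

      def forward(self, x):
          for w in self.layers:
              x = np.tanh(w @ x)             # stand-in for a full transformer block
          return x                           # sent over the network to the next node

  nodes = [Node(layers[:4]), Node(layers[4:])]   # 2 machines, 4 layers each

  x = np.random.randn(d)
  for node in nodes:                         # activations hop from machine to machine
      x = node.forward(x)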


Any pointers to RTFM / the llama.cpp repo for that? I could not find anything on a cursory look. Thanks in advance!


Yeah, I think these methods could be baked into llama.cpp or some other library higher up the toolchain (Python or what have you). They shard out each layer (ish?) to the edges and recombine that layer's inference at the master node, while the outside edges load up their next bit if they need to; I would guess the devil is in the details for all the possible tensor types and architectures (for instance, how shall we implement skip layers?).
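
Something like this toy sketch of the star pattern, as I understand it (my own simplification, not the paper's code): each edge node holds a shard of every layer, the master broadcasts the activation and stitches the partial outputs back together, and in the real system the edges would be prefetching their next shard from disk while the master is busy recombining.

  import numpy as np

  d, n_layers, n_workers = 16, 4, 4
  # Each edge node holds a row-shard (d / n_workers output rows) of every layer.
  worker_shards = [[np.random.randn(d // n_workers, d) for _ in range(n_layers)]
                   for _ in range(n_workers)]

  x = np.random.randn(d)
  for layer in range(n_layers):
      # Master broadcasts x; each edge computes its slice of the layer output.
      partials = [shards[layer] @ x for shards in worker_shards]
      # Master recombines; edges could be loading layer + 1 from disk right now.
      x = np.tanh(np.concatenate(partials))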

While I do think there's going to be a huge market for cloud-based LLM serving, the fact that consumer hardware can run close-to-SOTA models fairly easily (e.g. a high-RAM MBP config) suggests to me that the provider market won't be as big as investors are betting on.

Most of the rewards will be reaped by consumers rather than providers.

We're also in an age where the current levels of RAM in consumer devices were settled on almost entirely before LLMs existed. I find it highly likely vendors will optimize for higher RAM capacity over other priorities in future hardware.

How long until a 256GB RAM laptop (shared with GPU) is reasonably cheap/available? I give it a few years at most.

It's possible that models will grow orders of magnitude larger, but I find it more likely that model sizes will grow along the curve of decreasing training costs and hardware improvements. There will be a sweet spot where it's economical to train larger models, and private companies won't push much beyond it.


Enterprises use LLMs too, and quite often there wouldn't be any client you could reasonably run the model on. (You wouldn't want to, e.g., have an LLM summarize and categorize a user request on their device, since that would require shipping your model and/or internal knowledge base to the client.)

Yes, but if you can run a sufficient LLM on a $2,000 laptop, then the cost to serve it from the cloud will be similarly cheap. (e.g. reserve an appropriately sized EC2 instance for pennies on the dollar)

It's a highly competitive market. Companies aren't going to pay $100k/year to run a model on something that can run on a $2k consumer-grade device.

128GB of GPU-accessible, fast RAM can be had for $5,000 in a MacBook Pro today. What will it be 3-4 years from now on Linux/Windows machines?

And we haven't seen any SoC providers try to optimize for RAM capacity over compute yet.


Oh yes, I could definitely see the privacy-preserving consumer use case creating sufficient demand for efficiency that also bleeds over into the enterprise market.

That's what's happened with power efficiency and ARM CPUs, after all!


Not sure what you mean: https://aws.amazon.com/ec2/graviton/

Not to speak of managed cloud services that run on ARM under-the-hood/behind the scenes.

Of course, ARM isn't inherently cheaper; AMD and Intel could cut prices/margins significantly and probably be competitive on $/perf.


This is where I want highly sensitive healthcare users of LLMs to end up: note summarization, suggested diagnoses (with the provider always in control), and other augmented abilities for clinical staff, without the risk of healthcare data being sent outside the device, or the very local network.

Depends, shipping part of it (just an encoder or decoder) could still work.

Even if bandwidth weren't an issue and all users had compatible hardware: You'd still be offloading a (semi-)trusted computation to user hardware, which is usually completely untrusted.

It would be nice for the inference time to be paired with a measure of output quality. I'm not well versed in how the architecture works, but I have a hard time believing a 90% reduction in peak memory footprint comes cost-free.

It's not cost-free. It comes at the cost of greatly increased latency: 29.9 seconds per token with Llama 3.1 70B. This is from Table 1 (p. 8) of the paper.

That is s/token, not tokens/s. The cost is high.

The actual goal of the paper is to highlight that we can optimise the overall speed by decreasing link latency. Yeah, link latency, because it's not one machine but several low-resource devices used together to serve the 70B LLM.


Am I just misunderstanding, or is the paper using "latency" when what they really mean is "throughput"?

In other words, if I want 100 tokens of output, do I have to wait 2990 seconds? If so, the terminology seems unnecessarily confusing.


Ah the disk swap method

It's not disk swap. It's a multi-device LLM.

That looked like an analogy. Back in the days of a mechanical arm moving magnetic fields around in our PCs, you could have the illusion of infinite RAM as long as you were OK with microsecond operations now taking two million times longer. This is akin to that.

I think the point is that it has the same sort of latency tradeoff that disk swap did: it's awful, but sometimes better than nothing.

Is there any predictability/pattern to neuron/layer activation? If so, would it be reasonable to have a second, tiny model that specifically tries to predict activation and preemptively swap those weights into memory?

This isn't how neural networks work.

For vanilla models, you always use all the weights. That isn't true for mixture-of-experts, though, and in that setting, your approach has merit.
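
To make the MoE point concrete, here's a toy top-k router (not any particular model's actual routing code): the router is cheap to evaluate and tells you up front which experts' weights you'll need, so in principle those could be fetched just in time.

  import numpy as np

  d, n_experts, k = 16, 8, 2
  router = np.random.randn(n_experts, d)                       # small, always resident
  experts = [np.random.randn(d, d) for _ in range(n_experts)]  # big, could live on disk

  x = np.random.randn(d)
  scores = router @ x
  top_k = np.argsort(scores)[-k:]            # known before the heavy work starts,
                                             # so only these experts need loading
  gate = np.exp(scores[top_k]) / np.exp(scores[top_k]).sum()
  y = sum(g * (experts[i] @ x) for g, i in zip(gate, top_k))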


Depends on the architecture, but generally you just move through the layers linearly. Simple iteration.

The number of layers, and the amount of time spent in each of them, make me think any benefit from pre-loading the next layer is negligible.

You really need the entire model on device to consider it performant.


From what I get from skimming the article, the main cost is the speed of token generation (token latency). You can always run a large model by reading directly from disk and not care much about RAM, but it is very slow. They try to improve that aspect with some optimisations, but it is still definitely slower than using RAM or VRAM.

Table 3 directly refutes this* and claims 0 tradeoffs.**

Below that, they indicate that a key part of the implementation is loading weights from disk before they're needed using a separate thread.***

* maybe I'm missing something though, someone please triple check :)

** ttft (time to first token) and s/token (seconds per token) are both lower than any alternative in all cases.

*** "daemon thread asynchronously preloads the weights"


I want to add that their chart shows s/token per device (edit: as per the heading on Table 1 - it could also just be confusing grammar), so it sounds like you are getting 4x the listed s/token on their 4-laptop cluster. Their laptops are not even hardwired - they are connecting over WiFi.

This comes at a very interesting time for me. I have an ancient dual-Xeon workstation with 64GB of memory that I was researching how to convert to run an LLM. To start, I can just run 4 instances on the same machine and see how it goes, without purchasing a better GPU. It sounds like this will allow you to run very large models with minimal quantization on craigslist-quality devices.

If it does what they say it does (and it seems to), it will be an absolute game changer for most users.


I've only read the abstract but they don't mention quantizing the weights or otherwise trying to shrink the model in any way.

They're claiming to be able to efficiently run larger models without loading the entire thing into GPU memory. If they're using the same weights and the same architecture, and just using tensor-parallel operations to perform the forward pass, that would imply no loss in quality.

I'm sure there are trade-offs but they're not clear by just looking at the abstract.
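
That's the nice property of tensor parallelism: splitting a matmul across devices and recombining the partial results gives the same answer as the unsplit matmul, up to floating-point rounding, so the parallelism itself shouldn't cost any quality. A quick toy check (illustration only, not the paper's code):

  import numpy as np

  d, n_devices = 64, 4
  W, x = np.random.randn(d, d), np.random.randn(d)

  # Column-parallel: each device holds a column slice of W and the matching slice of x.
  partials = [W_shard @ x_shard
              for W_shard, x_shard in zip(np.split(W, n_devices, axis=1),
                                          np.split(x, n_devices))]

  assert np.allclose(sum(partials), W @ x)   # identical result, just computed in pieces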


I read it like this too - no drop in weights or model quality, just optimizing the lower boundaries of performance when you're splitting from VRAM to RAM to disk (or the network).

Nothing is free in this world.

Is this different from (or related to) the work being done by the exo project?

https://github.com/exo-explore/exo


Exo is for partitioning a model over the network across devices (implementing some bandwidth-reducing partitions), but it still requires a minimum amount of RAM/VRAM to load a model. This could, in theory, be combined with it to allow larger models to run on exo clusters with less GPU/RAM than the underlying model requires (at the cost of some performance, no doubt, but still).

exo maintainer here. tgtweak is correct.

This looks like potentially promising research that I'm looking into reproducing now. We want to lower the barrier to running large models as much as possible, so if this works, it would be a potential addition to the exo offering.


Is there a CUDA implementation of this... asking for a friend

So when will I be able to "sudo apt-get install llm" ?

You can already do it with llamafile - check out the project; it lets you convert a .gguf model into a portable executable.


And everything runs way faster on cpu like that

I could run Qwen 2.5 72B Q4_K_M at 1 token/s on my i7-8750H + a lot of RAM XD

With this model:

https://huggingface.co/bartowski/Qwen2.5-72B-Instruct-GGUF

Man, I need to test the Q8 version with llamafile's optimizations. It would be so nice to host it locally on the new Ryzens; it could maybe fit in my 96GB of RAM.


Realistically, you probably want to wait until Vulkan support trickles out. That way, you aren't at the whim of the various evil hardware drivers (everybody's suck), and the AI can give you a disappointingly confused answer much faster than running the LLM on a CPU can.

I'm not aware of any Debian-family distro that packages it, but NixOS has at least ollama and llama-cpp in its repos. Honestly, even if the more stable distributions did have these things packaged, I would hesitate to use the packaged versions, because all of this stuff is still moving so quickly that you'd be on an old version and it would hurt.

Edit: Arch has ollama in official repos too. OpenSUSE has https://software.opensuse.org/package/ollama .


Ollama is close...

You already can with ollama


