I'm all for Rust and WASM, but if you look at the code it's just 150 lines of a basic Rust command-line script. All the heavy lifting is done by a single line passing the model to the WASI-NN backend, which in this case is provided by the WasmEdge runtime, which incidentally is C++, not Rust.
Rust is bringing zero advantage here really, the backend could be called from Python or anything else.
Seems like the advantage it is bringing is in bundling: shipping Python and PyTorch into something an end user can double-click and run is currently a complete mess.
Of course the actual high-powered code is C++ in both cases, but shipping 2+ GB and tens of thousands of files just to send some instructions to that C++ could benefit from being one 2MB executable instead.
I am not familiar enough with llama.cpp, but from what I see they have mostly copy-pasted it into WasmEdge for the WASI-NN implementation.
Surely a simple compiled binary of llama.cpp is better than Rust compiled to WASM plus the WasmEdge runtime binary wrapping the same llama.cpp.
It wouldn't be more portable either; all the heterogeneous hardware acceleration support is part of llama.cpp, not WasmEdge.
I guess theoretically, if the WASI-NN proposal is standardized, other WASM runtimes could implement their own backends. It is a decent abstraction for cleanly expanding portability and for optimizing for specific infrastructure.
But at this point it doesn't have much to do with Rust or WASM. It's just the same old concept of portability via bytecode runtimes like the JVM or, indeed, the Python interpreter with native extensions (libraries).
Whoa! Great work. To other folks checking it out, it still requires downloading the weights, which are pretty large. But they essentially made a fully portable, no-dependency llama.cpp, in 2MB.
If you're an app developer this might be the easiest way to package an inference engine in a distributable file (the weights are already portable and can be downloaded on-demand — the inference engine is really the part you want to lock down).
The `main` file that llama.cpp builds is 1.2MB on my machine. The 2MB size isn't anything particularly impressive. Targeting wasm makes it more portable; otherwise there isn't some special extra compactness here.
I appreciate the work that went into slimming this binary down, but it's a negligible amount of work compared to llama.cpp itself.
HN is inundated with posts doing xyz on top of the x.cpp community. Whilst I appreciate that it is exciting, I wish more people would explore the low level themselves! We can be much more creative in this new playground.
The wasi-nn that this relies on (https://github.com/WebAssembly/wasi-nn) is a proposal that boils down to sending arbitrary chunks to some vendor implementation. The API is literally just set input, compute, get output.
…and that is totally non portable.
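To be concrete, the entire surface you program against is roughly this (a sketch from memory of the wasi-nn Rust bindings as used in the WasmEdge GGML examples; treat the exact names and signatures as approximate):

```rust
// Rough sketch of the wasi-nn call sequence; names/signatures are approximate
// and based on the wasi-nn Rust crate as used in the WasmEdge GGML examples.
use wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn run(prompt: &str) -> String {
    // "load": hand the (pre-registered) GGUF model to whatever native backend
    // the runtime happens to ship -- this is where all the real work lives.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("llama-2-7b-chat") // model alias is illustrative
        .expect("runtime has no ggml plugin that can take this model");
    let mut ctx = graph.init_execution_context().unwrap();

    // "set input": the prompt goes in as an opaque byte tensor.
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes()).unwrap();

    // "compute": the vendored llama.cpp/ggml plugin does the actual inference.
    ctx.compute().unwrap();

    // "get output": read the generated text back out of a byte buffer.
    let mut out = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut out).unwrap();
    String::from_utf8_lossy(&out[..n]).into_owned()
}
```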
The reason this works is that it's relying on the abstraction already implemented in llama.cpp that allows it to take a GGUF model and map it to multiple hardware targets, which you can see has been lifted as-is into WasmEdge here: https://github.com/WasmEdge/WasmEdge/tree/master/plugins/was...
So..
> Developers can refer to this project to write their machine learning application in a high-level language using the bindings, compile it to WebAssembly, and run it with a WebAssembly runtime that supports the wasi-nn proposal, such as WasmEdge.
Is total rubbish; no, you can’t.
This isn’t portable.
It’s not sandboxed.
It’s not a HAL.
If you have a wasm binary you might be able to run it if the version of the runtime you’re using happens to implement the specific ggml backend you need, which it probably doesn’t… because there’s literally no requirement for it to do so.
…and if you do, you’re just calling the llama.cpp ggml code, so it’s as safe as that library is…
There’s a lot of “so portable” and “such rust” talk in this article which really seems misplaced; this doesn’t seem to have the benefits of either of those two things.
Let’s imagine you have some new hardware with a WASI runtime on it, can you run your model on it? Does it have GPU support?
Well, turns out the answer is "go and see if llama.cpp compiles on that platform with GPU support, and if the runtime you're using happens to have a ggml plugin in it and happens to have a copy of that version of ggml vendored in it, and if not, then no".
..at which point, wtf are you even using WASI for?
Cross platform GPU support is hard, but this… I dunno. It seems absolutely ridiculous.
Imagine if webGPU was just “post some binary chunk to the GPU and maybe it’ll draw something or whatever if it’s the right binary chunk for the current hardware.”
The llama.cpp author thinks security is "very low priority and almost unnecessary". https://github.com/ggerganov/llama.cpp/pull/651#pullrequestr... So I'm not sure why a sandbox would bundle llama.cpp and claim to be secure. They would need more evidence than this to make such a claim.
You can run it on a variety of Linux, Mac and Windows based devices, including the Raspberry Pi and most laptops / servers you might have. But you still need a few GBs of memory in order to fit the model itself.
I have a successful-ish commercial iOS app[0] for that. I'd originally built it using ggml, and then subsequently ported it to be based on mlc-llm when I found it.
No specific reason, but SwiftUI improved tremendously between macOS 12 and 13, and I use a couple of the newer SwiftUI features. Also, if I could go back, I’d rather not support Intel Macs. I’d built the original version of the app on an Intel Mac 6 months ago, but the performance difference between Intel Macs and Apple Silicon Macs for LLM inference with Metal is night and day. Apple won’t let me drop support for Intel Macs now, so I’ll begrudgingly support it.
The way things are going, we'll see more efficient and faster methods to run the transformer architecture on the edge, but I'm afraid we're approaching the limit, because you can't just Rust your way out of the VRAM requirements, which are the main bottleneck in loading large-enough models. One might say "small models are getting better, look at Mistral vs. llama 2", but small models are also approaching their capacity (there's only so much you can put in 7b parameters).
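For rough numbers: 7B parameters at fp16 is about 14GB of weights, still roughly 4GB at 4-bit quantization (before the KV cache), and a 70B model at 4 bits is around 35GB. No wrapper language changes that arithmetic.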
I don't know man, this approach to AI doesn't "feel" like it'll lead to AGI—it's too inefficient.
I hate this kind of clickbait marketing suggesting the project delivers 1/100 of the size or 100x-35000x the speed of other solutions because it uses a different language for a wrapper around the core library, while completely neglecting the tooling and community expertise built around other solutions.
First of all, the project is based on llama.cpp[1], which does the heavy work of loading and running multi-GB model files on GPU/CPU, and the inference speed is not limited by the wrapper choice (there are other wrappers in Go, Python, Node, Rust, etc., or one can use llama.cpp directly). The size of the binary is also not that important when common quantized model files are often in the range of 5GB-40GB and require a beefy GPU or a machine with 16-64GB of RAM.
If a large part of the size is essentially the trained weights of a model, how can one reduce the size by orders of magnitude (without losing any accuracy)?
I don't think you can reduce size without losing accuracy (though I think quantized GGUFs are great). But the 2 MB size here is a reference to the program size not including a model. It looks like it's a way to run llama.cpp with wasm + a rust server that runs llama.cpp.
I like the tiny llama.cpp/examples/server and embed it in FreeChat, but always happy for more tooling options.
Edit: Just checked, the arm64/x86 executable I embed is currently 4.2 MB. FreeChat is 12.1 MB but the default model is ~3 GB so I'm not really losing sleep over 2 MB.
Hello, you might be talking about reducing the size of the model itself (i.e., the trained weights) by orders of magnitude without losing accuracy; that's indeed a different challenge. But the article discusses reducing the inference app size by 100x.
I am not trying to troll. I genuinely don’t see why a few MB on some binary matter when the models are multiple GB large. This is why I fundamentally misunderstood the article, my brain was looking for the other number going down as that’s genuinely a barrier for edge devices.
llama.cpp typically needs to be compiled separately for each operating system and architecture (Windows, macOS, Linux, etc.), which is less portable.
Also, the article mentions the use of hardware acceleration on devices with heterogeneous hardware accelerators. This implies that the Wasm-compiled program can efficiently utilize different hardware resources (like GPUs and specialized AI chips) across various devices. A direct C++ implementation might require specific optimizations or versions for each type of hardware to achieve similar performance.
> Wasm-compiled program can efficiently utilize different hardware resources (like GPUs and specialized AI chips) across various devices
I do not buy it, but maybe I am ignorant of progress being made there.
> A direct C++ implementation might require specific optimizations or versions for each type of hardware to achieve similar performance.
Because I do not buy the previous claim, I also do not buy that similar performance can be achieved painlessly (without extra developer time) there, or that a wasm runtime is capable of achieving it.
So the magic (or sleight of hand, if you prefer) seems to be in
> You just need to install the WasmEdge with the GGML plugin.
And it turns out that all these plugins are native & specific to the acceleration environment as well. But this has to happen after it lands in its environment so your "portable" application is now only portable in the sense that once it starts running it will bootstrap itself by downloading and installing native platform-specific code from the internet. Whether that is a reasonable thing for an "edge" application to do I am not sure.
Where have I seen this WORA before, including for C and C++?
WASM does not provide access to hardware acceleration on devices with heterogeneous hardware accelerators, even its SIMD bytecodes are a subset of what most CPUs are capable of.
> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.
TL;DR a 2MB executable that reads stdin and calls WASI-NN
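Which is roughly this shape (a hypothetical sketch, not the project's actual source; `infer` stands in for the WASI-NN set-input/compute/get-output sequence):

```rust
// Hypothetical sketch of a small stdin chat wrapper; not the project's
// actual source. `infer` stands in for the WASI-NN calls.
use std::io::{self, Write};

fn infer(prompt: &str) -> String {
    // Placeholder for wasi-nn: set_input(prompt) -> compute() -> get_output().
    format!("[model reply to a {}-byte prompt]", prompt.len())
}

fn main() {
    let system = "You are a helpful assistant.";
    let mut transcript = String::new(); // conversation history in llama2 format

    loop {
        print!("[You]: ");
        io::stdout().flush().unwrap();
        let mut line = String::new();
        if io::stdin().read_line(&mut line).unwrap() == 0 {
            break; // EOF
        }
        let user = line.trim();

        // llama2 chat template: the system block only wraps the first turn;
        // later turns are appended as <s>[INST] ... [/INST] blocks.
        if transcript.is_empty() {
            transcript = format!(
                "<s>[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]",
                system, user
            );
        } else {
            transcript.push_str(&format!("<s>[INST] {} [/INST]", user));
        }

        let reply = infer(&transcript);
        println!("[Bot]: {}", reply);

        // Keep the model's answer in the history so later turns can see it.
        transcript.push_str(&format!(" {} </s>", reply));
    }
}
```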
It looks like this is Rust for the application wrapped around a WASM port of llama.cpp that in turn uses an implementation of WASI-NN for the actual NN compute. It would be interesting to see how this compares to TFLite, the new stuff in the PyTorch ecosystem, etc.
Using llama.cpp and mlc-llm, both on my 2-year-old mobile Ryzen APU with 64GB of RAM. The first does not use the GPU at all (I tried plenty of options, nothing worked), but llama 34B runs: painfully slow, but it does work. The second runs on top of Vulkan; I didn't take any precise measurements, but its limit looks like 32GB of RAM (so no llama 34B). It does offload the CPU, but unfortunately performance seems similar to the CPU (that is my perception; I didn't take measurements here either).
So ... will I get any benefits from switching to the Rust/WebAssembly version???
Very cool, but unless I missed it, could someone please explain why not just compile a native Rust application? Is the Wasm part needed for the GPU acceleration (whatever the user's GPU is)?
The binary size is not really important in this case; llama.cpp should not be that far from this. What matters, as we all know, is how much GPU memory we need.
Right, but if the port achieves performance gains over GGML, which is already highly performant, that's (a) wild and (b) a signal to move further GGML development into Rust, no?
As far as I understand, only the "driver" code is in Rust. Everything else is just C++ compiled to WASM. Maybe it's slightly better to have the driver code be in Rust than Python or Scheme or whatever, but I imagine C++ would be basically equivalent (and you wouldn't have to go through the trouble of compiling to WASM, which likely loses significant performance).
That's what I find weird here. The bit of the code written in Rust is almost comically tiny, and the rest is just C++ compiled to WASM which someone else already wrote. I think comparing this to a Python wrapper for the same code would show very minimal difference in performance, because the majority of the work happens in the inference code, and formatting the prompt string really isn't that complex of a task. I just don't see what advantage Rust brings here other than being a language you can compile to WASM so that you have one binary.
ML has extremely predictable and heavily optimized routines. Languages that can target hardware ISA all tend to have comparable perf and there’s no reason to think Rust would offer much.
No, it's not. This does nothing to minimize the size of the models that inference is being run on. It's cool for edge applications, kind of. And Rust is already a go-to tool for the edge.
> this is a “holy shit” moment for Rust in AI applications
Yeah because I realized the 2MB is just a wrapper that reads stdin and offloads everything to wasi-nn API.
> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.
You can do the same using Python with fewer lines of code and maybe smaller executable size.
Confused about the title rewrite from “Fast and Portable Llama2 Inference on the Heterogeneous Edge” which more clearly communicates what this article is about - a wasm version of llama.cpp.
I feel like editorializing to highlight the fact that it's 2MB and runs on a Mac misses some of the core aspects of the project and write-up.
The article is complete gibberish to someone outside tech, so if the role of the title is to describe the article to its intended audience yours is a lot better.
Of course if you intend to communicate to non-tech people that you write relevant cutting-edge articles, then choosing a title like "Fast and Portable Llama2 Inference on the Heterogeneous Edge" does the job much better. Maybe even add the words sustainable and IoT somewhere.
So 'portable' in the article refers to the software's ability to run across various operating systems or environments, rather than to its hardware dependencies? This means that while the software can be installed and run on different OSs, certain hardware-specific optimizations (like those for Nvidia GPUs using CUDA) are still necessary to achieve the best performance.