I'm all for Rust and WASM, but if you look at the code it's just 150 lines of a basic Rust command-line script. All the heavy lifting is done by a single line passing the model to the WASI-NN backend, which in this case is provided by the WasmEdge runtime, which incidentally is C++, not Rust.
Rust is bringing zero advantage here really, the backend could be called from Python or anything else.
Seems like the advantage it is bringing is in bundling: shipping Python and PyTorch into something an end user can double-click and run is currently a complete mess.
Of course the actual high-powered code is C++ in both cases, but shipping 2+ GB and tens of thousands of files just to send some instructions to that C++ could benefit from being one 2MB executable instead.
I am not familiar enough with llama.cpp, but from what I see they have mostly copy-pasted it into WasmEdge for the WASI-NN implementation.
Surely a simple compiled binary of llama.cpp is better than Rust compiled to WASM plus the WasmEdge runtime binary wrapping the same llama.cpp.
It wouldn't be more portable either; all the heterogeneous hardware acceleration support is part of llama.cpp, not WasmEdge.
I guess theoretically, if the WASI-NN proposal is standardized, other WASM runtimes could implement their own backends. It is a decent abstraction for cleanly expanding portability and for optimizing for specific infrastructure.
But at this point it doesn't have much to do with Rust or WASM. It's just the same old concept of portability via bytecode runtimes like the JVM or, indeed, the Python interpreter with native extensions (libraries).
Whoa! Great work. To other folks checking it out, it still requires downloading the weights, which are pretty large. But they essentially made a fully portable, no-dependency llama.cpp, in 2MB.
If you're an app developer this might be the easiest way to package an inference engine in a distributable file (the weights are already portable and can be downloaded on-demand — the inference engine is really the part you want to lock down).
The `main` file that llama.cpp builds is 1.2MB on my machine. The 2MB size isn't anything particularly impressive. Targeting wasm makes it more portable; otherwise there isn't some special extra compactness here.
I appreciate the work that went into slimming this binary down, but it's a negligible amount of work compared to llama.cpp itself.
HN is inundated with posts doing xyz on top of the x.cpp community. Whilst I appreciate that it is exciting, I wish more people would explore the low level themselves! We can be much more creative in this new playground.
The wasi-nn that this relies on (https://github.com/WebAssembly/wasi-nn) is a proposal that boils down to sending arbitrary chunks to some vendor implementation. The API is literally just set input, compute, get output.
…and that is totally non portable.
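To be concrete, the entire surface you program against is roughly this (a sketch from memory of the wasi-nn Rust bindings as used in the WasmEdge GGML examples; treat the exact names and signatures as approximate):

```rust
// Rough sketch of the wasi-nn call sequence; names/signatures are approximate
// and based on the wasi-nn Rust crate as used in the WasmEdge GGML examples.
use wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn run(prompt: &str) -> String {
    // "load": hand the (pre-registered) GGUF model to whatever native backend
    // the runtime happens to ship -- this is where all the real work lives.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("llama-2-7b-chat") // model alias is illustrative
        .expect("runtime has no ggml plugin that can take this model");
    let mut ctx = graph.init_execution_context().unwrap();

    // "set input": the prompt goes in as an opaque byte tensor.
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes()).unwrap();

    // "compute": the vendored llama.cpp/ggml plugin does the actual inference.
    ctx.compute().unwrap();

    // "get output": read the generated text back out of a byte buffer.
    let mut out = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut out).unwrap();
    String::from_utf8_lossy(&out[..n]).into_owned()
}
```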
The reason this works is that it's relying on the abstraction already implemented in llama.cpp that allows it to take a GGUF model and map it to multiple hardware targets, which you can see has been lifted as-is into WasmEdge here: https://github.com/WasmEdge/WasmEdge/tree/master/plugins/was...
So..
> Developers can refer to this project to write their machine learning application in a high-level language using the bindings, compile it to WebAssembly, and run it with a WebAssembly runtime that supports the wasi-nn proposal, such as WasmEdge.
Is total rubbish; no, you can’t.
This isn’t portable.
It’s not sandboxed.
It’s not a HAL.
If you have a wasm binary you might be able to run it if the version of the runtime you’re using happens to implement the specific ggml backend you need, which it probably doesn’t… because there’s literally no requirement for it to do so.
…and if you do, you’re just calling the llama.cpp ggml code, so it’s as safe as that library is…
There’s a lot of “so portable” and “such rust” talk in this article which really seems misplaced; this doesn’t seem to have the benefits of either of those two things.
Let’s imagine you have some new hardware with a WASI runtime on it, can you run your model on it? Does it have GPU support?
Well, turns out the answer is "go and see if llama.cpp compiles on that platform with GPU support, and if the runtime you're using happens to have a ggml plugin in it and happens to have a copy of that version of ggml vendored in it, and if not, then no".
..at which point, wtf are you even using WASI for?
Cross platform GPU support is hard, but this… I dunno. It seems absolutely ridiculous.
Imagine if webGPU was just “post some binary chunk to the GPU and maybe it’ll draw something or whatever if it’s the right binary chunk for the current hardware.”
The llama.cpp author thinks security is "very low priority and almost unnecessary". https://github.com/ggerganov/llama.cpp/pull/651#pullrequestr... So I'm not sure why a sandbox would bundle llama.cpp and claim to be secure. They would need more evidence than this to make such a claim.
You can run it on a variety of Linux, Mac and Windows based devices, including the Raspberry Pi and most laptops / servers you might have. But you still need a few GBs of memory in order to fit the model itself.
I have a successful-ish commercial iOS app[0] for that. I'd originally built it using ggml, and then subsequently ported it to be based on mlc-llm when I found it.
No specific reason, but SwiftUI improved tremendously between macOS 12 and 13, and I use a couple of the newer SwiftUI features. Also, if I could go back, I’d rather not support Intel Macs. I’d built the original version of the app on an Intel Mac 6 months ago, but the performance difference between Intel Macs and Apple Silicon Macs for LLM inference with Metal is night and day. Apple won’t let me drop support for Intel Macs now, so I’ll begrudgingly support it.
The way things are going, we'll see more efficient and faster methods to run the transformer architecture on the edge, but I'm afraid we're approaching the limit, because you can't just Rust your way out of the VRAM requirements, which are the main bottleneck in loading large-enough models. One might say "small models are getting better, look at Mistral vs. llama 2", but small models are also approaching their capacity (there's only so much you can put in 7b parameters).
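For rough numbers: 7B parameters at fp16 is about 14GB of weights, still roughly 4GB at 4-bit quantization (before the KV cache), and a 70B model at 4 bits is around 35GB. No wrapper language changes that arithmetic.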
I don't know man, this approach to AI doesn't "feel" like it'll lead to AGI—it's too inefficient.
I hate this kind of clickbait marketing suggesting the project delivers 1/100 of the size or 100x-35000x the speed of other solutions because it uses a different language for a wrapper around the core library, while completely neglecting the tooling and community expertise built around other solutions.
First of all, the project is based on llama.cpp[1], which does the heavy work of loading and running multi-GB model files on GPU/CPU, and the inference speed is not limited by the wrapper choice (there are other wrappers in Go, Python, Node, Rust, etc., or one can use llama.cpp directly). The size of the binary is also not that important when common quantized model files are often in the range of 5GB-40GB and require a beefy GPU or a machine with 16-64GB of RAM.
If a large part of the size is essentially the trained weights of a model, how can one reduce the size by orders of magnitude (without losing any accuracy)?
I don't think you can reduce size without losing accuracy (though I think quantized GGUFs are great). But the 2 MB size here is a reference to the program size not including a model. It looks like it's a way to run llama.cpp with wasm + a rust server that runs llama.cpp.
I like the tiny llama.cpp/examples/server and embed it in FreeChat, but always happy for more tooling options.
Edit: Just checked, the arm64/x86 executable I embed is currently 4.2 MB. FreeChat is 12.1 MB but the default model is ~3 GB so I'm not really losing sleep over 2 MB.
Hello, you might be talking about reducing the size of the model itself (i.e., the trained weights) by orders of magnitude without losing accuracy; that's indeed a different challenge. But the article discusses reducing the inference app size by 100x.
I am not trying to troll. I genuinely don’t see why a few MB on some binary matter when the models are multiple GB large. This is why I fundamentally misunderstood the article, my brain was looking for the other number going down as that’s genuinely a barrier for edge devices.
llama.cpp typically needs to be compiled separately for each operating system and architecture (Windows, macOS, Linux, etc.), which is less portable.
Also, the article mentions the use of hardware acceleration on devices with heterogeneous hardware accelerators. This implies that the Wasm-compiled program can efficiently utilize different hardware resources (like GPUs and specialized AI chips) across various devices. A direct C++ implementation might require specific optimizations or versions for each type of hardware to achieve similar performance.
> Wasm-compiled program can efficiently utilize different hardware resources (like GPUs and specialized AI chips) across various devices
I do not buy it, but maybe I am ignorant of progress being made there.
> A direct C++ implementation might require specific optimizations or versions for each type of hardware to achieve similar performance.
Because I do not buy the previous claim, I also do not buy that similar performance can be achieved painlessly (without extra developer time) there, or that a wasm runtime is capable of achieving it.
So the magic (or sleight of hand, if you prefer) seems to be in
> You just need to install the WasmEdge with the GGML plugin.
And it turns out that all these plugins are native & specific to the acceleration environment as well. But this has to happen after it lands in its environment so your "portable" application is now only portable in the sense that once it starts running it will bootstrap itself by downloading and installing native platform-specific code from the internet. Whether that is a reasonable thing for an "edge" application to do I am not sure.
Where have I seen this WORA before, including for C and C++?
WASM does not provide access to hardware acceleration on devices with heterogeneous hardware accelerators, even its SIMD bytecodes are a subset of what most CPUs are capable of.
> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.
TL;DR a 2MB executable that reads stdin and calls WASI-NN
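Which is roughly this shape (a hypothetical sketch, not the project's actual source; `infer` stands in for the WASI-NN set-input/compute/get-output sequence):

```rust
// Hypothetical sketch of a small stdin chat wrapper; not the project's
// actual source. `infer` stands in for the WASI-NN calls.
use std::io::{self, Write};

fn infer(prompt: &str) -> String {
    // Placeholder for wasi-nn: set_input(prompt) -> compute() -> get_output().
    format!("[model reply to a {}-byte prompt]", prompt.len())
}

fn main() {
    let system = "You are a helpful assistant.";
    let mut transcript = String::new(); // conversation history in llama2 format

    loop {
        print!("[You]: ");
        io::stdout().flush().unwrap();
        let mut line = String::new();
        if io::stdin().read_line(&mut line).unwrap() == 0 {
            break; // EOF
        }
        let user = line.trim();

        // llama2 chat template: the system block only wraps the first turn;
        // later turns are appended as <s>[INST] ... [/INST] blocks.
        if transcript.is_empty() {
            transcript = format!(
                "<s>[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]",
                system, user
            );
        } else {
            transcript.push_str(&format!("<s>[INST] {} [/INST]", user));
        }

        let reply = infer(&transcript);
        println!("[Bot]: {}", reply);

        // Keep the model's answer in the history so later turns can see it.
        transcript.push_str(&format!(" {} </s>", reply));
    }
}
```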
It looks like this is Rust for the application wrapped around a WASM port of llama.cpp that in turn uses an implementation of WASI-NN for the actual NN compute. It would be interesting to see how this compares to TFLite, the new stuff in the PyTorch ecosystem, etc.
Using llama.cpp and mlc-llm, both on my 2-year-old mobile Ryzen APU with 64GB of RAM. The first does not use the GPU at all (I tried plenty of options, nothing worked), but llama 34B runs: painfully slow, but it does work. The second runs on top of Vulkan; I didn't take any precise measurements, but its limit looks like 32GB of RAM (so no llama 34B). It does offload the CPU, but unfortunately performance seems similar to the CPU (that is my perception; I didn't take measurements here either).
So ... will I get any benefits from switching to the Rust/WebAssembly version???
Very cool, but unless I missed it, could someone please explain why not just compile a native Rust application? Is the Wasm part needed for the GPU acceleration (whatever the user's GPU is)?
The binary size is not really important in this case; llama.cpp should not be that far from this. What matters, as we all know, is how much GPU memory we need.
Right, but if the port achieves performance gains over GGML, which is already highly performant, that's (a) wild and (b) a signal to move further GGML development into Rust, no?
As far as I understand, only the "driver" code is in Rust. Everything else is just C++ compiled to WASM. Maybe it's slightly better to have the driver code be in Rust than Python or Scheme or whatever, but I imagine C++ would be basically equivalent (and you wouldn't have to go through the trouble of compiling to WASM, which likely loses significant performance).
That's what I find weird here. The bit of the code written in Rust is almost comically tiny, and the rest is just C++ compiled to WASM which someone else already wrote. I think comparing this to a Python wrapper for the same code would show very minimal difference in performance, because the majority of the work happens in the inference code, and formatting the prompt string really isn't that complex of a task. I just don't see what advantage Rust brings here other than being a language you can compile to WASM so that you have one binary.
ML has extremely predictable and heavily optimized routines. Languages that can target hardware ISA all tend to have comparable perf and there’s no reason to think Rust would offer much.
No, it's not. This does nothing to minimize the size of the models that inference is being run on. It's cool for edge applications, kind of. And Rust is already a go-to tool for the edge.
> this is a “holy shit” moment for Rust in AI applications
Yeah because I realized the 2MB is just a wrapper that reads stdin and offloads everything to wasi-nn API.
> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.
You can do the same using Python with fewer lines of code and maybe smaller executable size.
Confused about the title rewrite from “Fast and Portable Llama2 Inference on the Heterogeneous Edge” which more clearly communicates what this article is about - a wasm version of llama.cpp.
I feel like editorializing to highlight the fact that it's 2MB and runs on a Mac misses some of the core aspects of the project and write-up.
The article is complete gibberish to someone outside tech, so if the role of the title is to describe the article to its intended audience yours is a lot better.
Of course if you intend to communicate to non-tech people that you write relevant cutting-edge articles, then choosing a title like "Fast and Portable Llama2 Inference on the Heterogeneous Edge" does the job much better. Maybe even add the words sustainable and IoT somewhere.
So 'portable' in the article refers to the software's ability to run across various operating systems or environments, rather than to its hardware dependencies? This means that while the software can be installed and run on different OSs, certain hardware-specific optimizations (like those for Nvidia GPUs using CUDA) are still necessary to achieve the best performance.