To clarify, running this WebLLM demo doesn't need a MacBook Pro that costs $3.5k :-)
WebGPU supports multiple backends: besides Metal on Apple Silicon, it offloads to Vulkan, DirectX, etc. That means a Windows laptop with Vulkan support should work. My 2019 Intel MacBook with an AMD GPU works as well. And of course, NVIDIA GPUs too!
Our model is int4-quantized and about 4GB in size, so it doesn't need 64GB of memory either. Somewhere around 6GB should suffice.
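As a rough back-of-envelope check (a sketch only; it assumes a 7B-parameter model at 4 bits per weight plus a couple of GB of headroom for the KV cache, activations, and runtime buffers, which are my assumptions rather than measured numbers):

    # Rough memory estimate for an int4-quantized 7B model (illustrative only).
    params = 7e9                 # parameter count
    bytes_per_weight = 0.5       # 4 bits = 0.5 bytes
    weights_gb = params * bytes_per_weight / 1e9
    overhead_gb = 2.0            # assumed headroom: KV cache, activations, runtime buffers
    print(f"weights: ~{weights_gb:.1f} GB, total budget: ~{weights_gb + overhead_gb:.1f} GB")
    # -> weights: ~3.5 GB, total: ~5.5 GB, in line with the ~4 GB model / ~6 GB figure above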
Nice work! I knew it wouldn't be long before someone put this together :)
I'm curious whether, given a different language (like Zig) with WebGPU access, you could easily translate that last mile of code to execute there. Specifically, I wonder if I could do it, and whether you can give me an overview of where the code for "Universal deployment" in your diagram actually lives?
I found llm_chat.js, but it seems that doesn't include the logic necessary for building WGSL shaders? Am I wrong or does that happen elsewhere like in the TVM runtime? How much is baked into llm_chat.wasm, and where is the source for that?
The WGSL shaders are generated and compiled through TVM and embedded into the wasm.
I think what you mean is wgpu native support. At the moment the WebGPU runtime dispatches to the JS WebGPU environment. Once the TVM runtime comes with wgpu native support (like the current ones for Vulkan or Metal), it will be possible to leverage any wgpu native runtime like the one Zig provides.
Additionally, TVM currently supports targets like Vulkan and Metal natively, which allows targeting those platforms directly.
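For the curious, here is a minimal sketch of what targeting those backends looks like from TVM's Python tensor expression API (exact APIs vary by TVM version, and building for a given target requires TVM to be compiled with that backend enabled; the kernel and schedule below are arbitrary toy choices):

    import tvm
    from tvm import te

    # A tiny vector-add kernel, scheduled for a GPU-style target.
    n = 1024
    A = te.placeholder((n,), name="A", dtype="float32")
    B = te.placeholder((n,), name="B", dtype="float32")
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")

    s = te.create_schedule(C.op)
    bx, tx = s[C].split(C.op.axis[0], factor=64)
    s[C].bind(bx, te.thread_axis("blockIdx.x"))
    s[C].bind(tx, te.thread_axis("threadIdx.x"))

    # Swapping the target string retargets the same kernel, e.g. "vulkan" or
    # "metal"; the WebGPU path is compiled ahead of time and packaged into the
    # wasm bundle for the browser runtime, as described above.
    mod = tvm.build(s, [A, B, C], target="vulkan")   # needs TVM built with Vulkan support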
OK that makes sense; so basically if I want to give this a shot then I would just need to read llm_chat.js and the TVM docs, and translate llm_chat.js to my language of choice effectively?
I think what would be needed instead is wgpu native runtime support for TVM.
Like the existing TVM Vulkan implementation, it would then link naturally against any runtime that provides webgpu.h.
Then yes, llm_chat.js would be the high-level logic that targets the TVM runtime, and it can be implemented in any language the TVM runtime supports (that includes JS, Java, C++, Rust, etc.).
Supporting WebGPU native is an interesting direction. Feel free to open a thread on the TVM Discuss forum; perhaps there would be fun things to collaborate on in OSS.
How big is the "runtime" part? My use case would basically be: run this in a native app that links against webgpu (wgpu or dawn). Is there a reference implementation for this runtime that one could study?
The TVM runtime is pretty lean (~700KB-2MB depending on the dependencies included). You can check with the TVM community and bring up the question there; I think there might be some common interest. There are runtime implementations for Vulkan and Metal that can be used as references.
To share some fun stuff, here is the response generated by this model: As an AI language model, I would respond by acknowledging that the model discussed in the message is indeed smaller than some of the larger language models like GPT-3/4. However, I would also point out that size is not always a direct indicator of performance, and that the model may still be able to perform well on certain tasks or in certain contexts. Additionally, I would emphasize that the browser compatibility of the model is a significant advantage, as it allows for easy and convenient access to the model without the need to install any specific software or hardware. Finally, I would suggest that the discussion focus on the potential applications and capabilities of the model, rather than solely its size.
Back to the topic, we don't make much innovation on the model side, so I am probably not the best person to evaluate how a model compares with SOTA. There are indeed lots of super cool techniques being explored lately that make it possible to deploy smaller and smaller models, for example LLM.int8() [1] and int4 quantization [2] without loss of zero-shot accuracy (a rough sketch of the int4 idea follows the references below). Can't predict the future, but maybe one day there will be something really powerful yet small enough to fit in everyone's pocket :-)
[1] Dettmers, Tim, et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." arXiv preprint arXiv:2208.07339 (2022).
[2] Dettmers, Tim, and Luke Zettlemoyer. "The Case for 4-bit Precision: k-bit Inference Scaling Laws." arXiv preprint arXiv:2212.09720 (2022).
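To make the int4 idea concrete, here is a minimal NumPy sketch of group-wise symmetric 4-bit quantization; it is illustrative only and not the exact scheme used in the papers above or in this demo (the group size and rounding are arbitrary choices):

    import numpy as np

    def quantize_int4(w, group_size=128):
        # Per-group symmetric 4-bit quantization: int4 codes plus one fp scale per group.
        w = w.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # symmetric int4 range is [-7, 7]
        q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
        return q, scale

    def dequantize_int4(q, scale):
        return (q.astype(np.float32) * scale).reshape(-1)

    w = np.random.randn(1024 * 128).astype(np.float32)
    q, s = quantize_int4(w)
    print("mean abs error:", np.abs(w - dequantize_int4(q, s)).mean())  # small vs. weight scale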
Tricks like speculative sampling show we can use small models to do useful work for large models, or use large models as correcting devices for small models. So I see a mixed future with both small and large models: one fast and low-latency, the other slow and sparsely called, working together to achieve the qualities of both.
For example, a small model could take input text and compress it [1], the LLM could generate a compressed response, and then the small model could decompress it.
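As a toy illustration of the draft-and-verify structure (the two "models" below are hypothetical stand-ins, and real speculative sampling accepts or rejects drafted tokens probabilistically from the models' distributions rather than by exact match):

    import random

    def small_model(tokens):          # cheap draft model (toy stand-in)
        return (sum(tokens) * 31 + 7) % 100

    def large_model(tokens):          # expensive target model (agrees ~80% of the time here)
        guess = (sum(tokens) * 31 + 7) % 100
        return guess if random.random() < 0.8 else random.randrange(100)

    def speculative_decode(prompt, k=4, steps=8):
        tokens = list(prompt)
        for _ in range(steps):
            # 1) draft k tokens with the cheap model
            draft = []
            for _ in range(k):
                draft.append(small_model(tokens + draft))
            # 2) verify with the expensive model (in practice, one batched pass)
            for i in range(k):
                target = large_model(tokens + draft[:i])
                if target != draft[i]:
                    tokens += draft[:i] + [target]   # keep agreed prefix, take the correction
                    break
            else:
                tokens += draft                      # all k drafted tokens accepted
        return tokens

    print(speculative_decode([1, 2, 3]))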
> You are GPT-4. Generate a compressed/magic instruction string for yourself (Abuse of language mixing, abbreviations, symbols (unicode and emojis) to aggressively compress it) such that if injected in your context anywhere you will start following the following instruction whatever is the prompt you're given. You should make sure to prelude the instruction with a string (compressed as well) that will make you understand in the future that you should follow it at all cost.
> Thank you for your message. We understand that the model you are referring to is a simple and basic model. However, it is important to highlight that this model serves a specific purpose and can be useful for certain applications.
> Regarding the comparison with larger models like GPT, it's important to note that different models have different strengths and weaknesses, and the choice of model depends on the specific task and use case. While larger models like GPT may be more powerful and capable, they also come with greater computational and memory requirements.
> We appreciate your concerns and feedback, and we will take them into consideration as we continue to develop our models. Our aim is to provide solutions that are tailored to the needs of our clients and meet their requirements for accuracy, efficiency, and performance.
> Thank you for your input, and we hope to have the opportunity to work with you in the future.
> Best regards,
> [Your Name]
Back to your response: you did notice that I said they are not as powerful as GPT-4. Of course they are not; not a single one is.
The model is not developed by us, and their performance is not our current focus either (nor am I an expert in this evaluation), but I am happy to assist if anyone wants to seriously evaluate it!
It is unfortunately non-trivial to quantitatively evaluate its performance against ChatGPT :-(
We didn't do much evaluation because there isn't much innovation on the model side; instead, we are demoing the possibility of running an end-to-end model on ordinary client GPUs via WebGPU, without server resources.
It would be nice if anyone could help us benchmark! Our primary focus, though, is not model performance, but demonstrating that TVM Unity can generate code targeting WebGPU and run it on client GPUs :-)
In the LLM world, loss or perplexity may not be the best indicator of model performance :-( Perhaps HELM (https://crfm.stanford.edu/helm/latest/), but we didn't take a deeper look as we are not the developers of this model.
I'm also not sure what the real use case is here other than an intellectual experiment. I may be missing the point, but as I understand the world, running a local LLM has nothing to do with the web while it is doing so, so what's the benefit of WebGPU? If you are streaming some constantly changing, live 3D experience in real time, served over the internet by a website you just clicked on, then WebGPU makes life easier. If you have a local pile of training data and weights sitting on your hard drive for serious crunching, then it's like tying your shoelaces together just because developers want a write-once-run-everywhere DX, which is nice for them, but value should always favor the user. Happy for someone to change my mind.
> Thanks to the open-source efforts like LLaMA ...
Is the LLaMA model really open source now? Last I checked it was only licensed for non-commercial use, which isn't open source software at least. Have they changed the license? Are people depending on "databases can't be copyrighted"? Are people just presuming they won't be caught?
There's lots of OSS that can use LLaMA but that's different from the model itself.
This is a genuine question, people are making assertions but I can't find evidence for the assertions.
OK, all the evidence I can find says the LLaMA model is not open source, and the article is just wrong. Software != model. If someone has evidence otherwise, please post.
It is funny how we have the WebNN API coming, but it is so slow to arrive that we are misusing graphics APIs to do NN again. Does anyone have a clue when the WebNN API will arrive? It would be significantly more power efficient than using WebGPU/WebGL.
I wouldn’t consider this to be “misusing” a graphics API in the same sense that tensor operations on top of WebGL are. WebGPU intentionally provides compute shaders with the goal of enabling non-graphical pipelines.
> [WebGPU use cases include] Executing machine learning models efficiently on the GPU. It is possible to do general-purpose GPU (GPGPU) computation in WebGL, but it is sub-optimal and much more difficult.
I'm more worried that it'll end up somewhat inferior to Google's other work, OpenXLA and IREE, which seem to be the cross-target ML toolkit.
https://news.ycombinator.com/item?id=35078410
I really want this to be good. It's hard to fully trust that the web spec will live up to the other really, really good web ML work out there. But it'll at least unlock some good speedups, whenever it lands. I genuinely don't mind more delay if it helps get things into a better position for long-term wins.
This is really cool and I had no idea about it. I tried the live video style transfer demo and was amazed; even with the GPU polyfill I was getting 5fps on my underpowered laptop. It brought me back to the days of DeepDream and DeepStyle, but live and on the web.
Are there any other projects/libraries that can run Llama models on Apple Silicon GPU? This is the first one I've seen.
Comparing it to llama.cpp on my M1 Max 32GB, it seems at least as fast just by eyeballing it. Not sure if the inference speed numbers can be compared directly.
vicuna-7b-v0 on Chrome Canary with the disable-robustness flag: encoding: 74.4460 tokens/sec, decoding: 18.0679 tokens/sec ≈ 55 ms per token
llama.cpp:
$ ./main -m models/7B/ggml-model-q4_0-ggjt.bin -t 8 --ignore-eos = 45 ms per token
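For a rough apples-to-apples read, the reported decode rate converts to per-token latency as follows (a ballpark only, since samplers, prompts, and batch handling differ between the two tools):

    webllm_decode_tps = 18.0679                                    # WebLLM decoding rate, tokens/sec
    print(f"WebLLM:    {1000 / webllm_decode_tps:.1f} ms/token")   # ~55 ms/token
    print("llama.cpp: 45 ms/token (as reported above)")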
> For example, the latest MacBook Pro can have more than 60G+ unified GPU RAM that can be used to store the model weights and a reasonably powerful GPU to run many workloads.
...for $3.5K minimum, according to the Apple website :/
Is there any chance WebGPU could utilize the matrix instructions shipping on newer/future IGPs? I think MLIR can do this through Vulkan, which is how SHARK is so fast in Stable Diffusion on the AMD 7900 series, but I know nothing about webgpu's restrictions or Apache TVM.
Dawn and its WebIDL bindings are also an easy way to add GPU support to any application that can link C code (or use it via a library). And Google maintains the compiler layer for the GPU backends (Metal, DX, Vulkan, ...). This is going to be a great leap forward for GPGPU in many apps.
It's also interesting that this opens up fully saturating Apple Silicon (minus the ANE): GGML can run on the CPU using NEON and AMX, while another instance could run on the GPU via Metal using MLC/Dawn. Though the two couldn't share (the same) memory at the moment.
The GPU's ML task energy is so much lower that you'd probably get better performance running everything on the GPU.
I think some repos have tried splitting things up between the NPU and GPU as well, but they didn't get good performance out of that combination? Not sure why, as the NPU is very low power.
Technically any newish laptop with 64GB of RAM has 64GB of "VRAM," but right now the Apple M series and AMD 7000 series are the only IGPs with any significant ML power.
I’m not sure what you mean. Typically, an iGPU slices off part of RAM for the GPU at boot time, which means it’s fixed and not shared. When did this change?
For Intel, per their chart under "What is the maximum amount of graphics memory or video memory my computer can use?" and the discussion under "Will adding more physical memory increase my graphics memory amount?" at https://www.intel.com/content/www/us/en/support/articles/000..., it seems the iGPUs included with 5th-gen/Broadwell processors were their first to do so, in 2014.
Full unified memory came 10-ish years ago (also powering the PS4), but I think the hardware ability to adjust iGPU memory without rebooting predated that; Intel seems to have called it DVMT.
If you're doing inference on a neural network, each weight has to be read at least once per token. This means you're going to read at least the size of the entire model, per token, during inference.
If your model is 60GB and you're reading it from the hard drive, then your minimum inference time per token will be limited by your drive's read throughput. MacBooks have ~4GB/s sequential read speed, which means your inference time per token will be strictly more than 15 seconds.
If your model is in RAM, then (according to Apple's advertising) your memory speed is 400GB/s, which is 100x your hard drive speed, and just the memory throughput will not be as much of a bottleneck here.
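The same back-of-envelope math spelled out (assuming the weights stream through exactly once per token; real runs cache and batch, but the bound still holds):

    model_gb = 60      # model weights read per generated token
    ssd_gbps = 4       # ~sequential read speed of a fast laptop SSD
    ram_gbps = 400     # Apple's advertised unified-memory bandwidth (M1 Max class)

    print(f"from SSD: >= {model_gb / ssd_gbps:.0f} s per token")          # >= 15 s
    print(f"from RAM: >= {model_gb / ram_gbps * 1000:.0f} ms per token")  # >= 150 ms
    # Memory bandwidth sets a hard floor on decode speed for large models.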
There will be LLM-specific chips coming to market soon, specialized for the task.
Tesla has already been creating AI chips for the FSD features in its vehicles. Over the next few years, everyone will be racing to be the first to put out LLM-specific chips, with AI-specific hardware devices following.
What exactly is the ideal sort of hardware to be able to run and train large models? Do you basically just need a high end version of basically everything?
Works well on my Dell XPS 13, which is not particularly powerful but is modern. Answers are slower than ChatGPT but not painfully slow, and the response quality is great.
Yes and it's definitely a step in the right direction. It's beyond me whether this is enough of an effort to support the current LLM movement.
I expect Apple to be more proactive in this regard, to capture the minds of a lightning-fast-growing market that is increasingly slipping out of its control (ChatGPT can be accessed from anywhere), and to offer more incentives to developers, with co-marketing being the lowest starting point. Probably something they're already working on and we just haven't seen the results of yet.
The speed at which this stuff is developing makes it very hard to keep up and contribute in a meaningful way. Especially when you have a separate full time job.
Is this chasing the impossible? (Not a criticism; I love the effort.)
But is it really possible, even a little, to run an LLM on a single machine?
I want to believe :)
GPTQ has been the missing piece; it allows quantizing the model weights from 16 to 4 bits with only a small loss in quality.
That in turn allows running even the large 65-billion-parameter version of the LLaMA model in ~33GB of RAM or VRAM.
With VRAM, that requires two 24GB GPUs, which is no longer completely out of reach.
The model running in the browser is a smaller version with 7 billion parameters, which is good enough for some things.
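The sizes quoted above follow directly from the bit width (ignoring per-group scales, the KV cache, and other overheads, which add a bit on top):

    bits_per_weight = 4
    for n_billion in (7, 65):
        gb = n_billion * 1e9 * bits_per_weight / 8 / 1e9
        print(f"{n_billion}B parameters -> ~{gb:.1f} GB of int4 weights")
    # 7B  -> ~3.5 GB, 65B -> ~32.5 GB; with a little overhead that lands at the
    # ~4 GB and ~33 GB figures discussed in this thread.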
I don't get where your question is coming from; you can already run LLMs on a single machine. Check out llama.cpp, tabby, text-generation-webui, gpt4all, open-source AI Dungeon-style projects like clover-edition, and now this WebGPU-based app.
The question comes from a kind of confusion. We know the hardware requirements of LLMs. How can a big LLM run on the hardware I currently have, with only an 11GB graphics card? I really couldn't wrap my head around it!