Web LLM – WebGPU Powered Inference of Large Language Models (github.com/mlc-ai)
276 points by summarity on April 15, 2023 | 80 comments



To clarify, running this WebLLM demo doesn't need a MacBook Pro that costs $3.5k :-)

WebGPU supports multiple backends: besides Metal on Apple Silicon, it offloads to Vulkan, DirectX, etc. That means a Windows laptop with Vulkan support should work. My 2019 Intel MacBook with an AMD GPU works as well. And of course, NVIDIA GPUs too!

Our model is int4-quantized and about 4GB in size, so it doesn't need 64GB of memory either. Somewhere around 6GB should suffice.
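
A rough back-of-the-envelope check (a sketch with assumed numbers, not the demo's exact accounting): an int4-quantized 7B-parameter model works out to about 3.5GB of weights, plus some headroom for the KV cache and runtime buffers.

```python
# Rough memory estimate for an int4-quantized 7B model (illustrative only;
# the demo's real numbers may differ slightly).
params = 7e9                 # parameter count
bits_per_weight = 4          # int4 quantization
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"weights: {weight_gb:.1f} GB")                       # ~3.5 GB

overhead_gb = 2.0            # assumed headroom for KV cache, activations, buffers
print(f"approx. total: {weight_gb + overhead_gb:.1f} GB")   # ~5.5 GB, hence "around 6GB"
```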


Nice work! I knew it wouldn't be long before someone put this together :)

I'm curious whether, given a different language (like Zig) with WebGPU access, you could easily translate that last mile of code to execute there. Specifically, I wonder if I could do it myself, and whether you could give me an overview of where the code for "Universal deployment" in your diagram actually lives?

I found llm_chat.js, but it seems that it doesn't include the logic necessary for building WGSL shaders? Am I wrong, or does that happen elsewhere, like in the TVM runtime? How much is baked into llm_chat.wasm, and where is the source for that?


The WGSL shaders are generated and compiled through TVM and embedded into the wasm.

I think what you mean is native wgpu support. At the moment the WebGPU runtime dispatches to the JS WebGPU environment. Once the TVM runtime comes with native wgpu support (like the current ones for Vulkan or Metal), it will be possible to leverage any native wgpu runtime, like the one Zig provides.

Additionally, TVM currently supports targets like Vulkan and Metal natively, which allows targeting those other platforms directly.


OK, that makes sense; so basically, if I want to give this a shot, I would just need to read llm_chat.js and the TVM docs, and effectively translate llm_chat.js to my language of choice?


I think what would be needed instead is native wgpu runtime support for TVM. Like the existing TVM Vulkan implementation, it would then link naturally to any runtime that provides webgpu.h.

Then, yes, llm_chat.js would be the high-level logic that targets the TVM runtime, and it can be implemented in any language the TVM runtime supports (that includes JS, Java, C++, Rust, etc.).

Native WebGPU support is an interesting direction. Feel free to open a thread on the TVM Discuss forum; perhaps there would be fun things to collaborate on in OSS.


How big is the "runtime" part? My use case would basically be: run this in a native app that links against webgpu (wgpu or dawn). Is there a reference implementation for this runtime that one could study?


The TVM runtime is fairly compact (~700KB-2MB depending on the dependencies included). You can check out the TVM community and bring up the question there; I think there might be some common interest. There are implementations of the runtime for Vulkan and Metal that can be used as references.


Sorry, non-technical person here; has this been benchmarked against ChatGPT? Do you have any idea how it performs alongside GPT-3 or GPT-4?


It’s rubbish compared to GPT-3/4.

Any model you can trivially load in your browser will be significantly smaller than those models, and broadly speaking smaller = worse.

This example is a 4GB model that's (I guess) based on some smallish model like LLaMA 7B.

It’s a proof of concept, not a ChatGPT replacement.

There’s nothing here that’s new other than “runs in the browser”; so it won’t be better than any other model you can get your hands on.

This kind of thing should be labeled BYOM (bring your own model). The model isn’t the interesting part of this.


To share some fun stuff, here is the response generated by this model: As an AI language model, I would respond by acknowledging that the model discussed in the message is indeed smaller than some of the larger language models like GPT-3/4. However, I would also point out that size is not always a direct indicator of performance, and that the model may still be able to perform well on certain tasks or in certain contexts. Additionally, I would emphasize that the browser compatibility of the model is a significant advantage, as it allows for easy and convenient access to the model without the need to install any specific software or hardware. Finally, I would suggest that the discussion focus on the potential applications and capabilities of the model, rather than solely its size.

Back to the topic, we didn't make much innovation on the model, so I am probably not the best person to evaluate how this model compares with SOTAs. There are indeed lots of super cool techniques being explored lately that make it possible to deploy smaller and smaller models, for example LLM.int8() [1] and int4 quantization [2] without loss of zero-shot accuracy. Can't predict the future, but maybe one day there will be something really powerful but small enough to fit in everyone's pocket :-)

[1] Dettmers, Tim, et al. "LLM.int8(): 8-bit matrix multiplication for transformers at scale." arXiv preprint arXiv:2208.07339 (2022).

[2] Dettmers, Tim, and Luke Zettlemoyer. "The case for 4-bit precision: k-bit Inference Scaling Laws." arXiv preprint arXiv:2212.09720 (2022).


> but maybe one day

There you go, summarised for you.

You can hand-wave about quantised models til the end of time, but this specific model is a trivial toy model.

No amount of pondering about the future avoids the fundamental fact that small models (~7B) are inferior to larger models like GPT.

It’s dishonest to suggest otherwise. :( There’s no reason to do this other than selling snake oil.

Maybe. One day. In the future. Things might be different.

Right now, they are not.


Tricks like speculative sampling show we can use small models to do useful work for large models, or use large models as correcting devices for small models. So I see a mixed future with both small and large models: one low-latency and fast, the other slow and sparsely called, working together to achieve the qualities of both.
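
For the curious, here is a minimal sketch of the greedy variant of speculative decoding. `draft` and `target` are hypothetical callables returning NumPy-style next-token logits; a real implementation would also sample properly and keep a bonus token when every draft is accepted.

```python
# Greedy speculative decoding, heavily simplified (assumes a non-empty prompt).
# `draft(seq)` returns next-token logits for the sequence.
# `target(seq)` returns next-token logits for *every* position in one forward pass.
def speculative_step(target, draft, tokens, k=4):
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(int(draft(proposal).argmax()))

    # 2. The expensive target model scores the whole proposal in a single pass.
    target_logits = target(proposal)

    # 3. Keep drafted tokens only while they match the target's own greedy choice;
    #    on the first mismatch, substitute the target's token and stop.
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        best = int(target_logits[i - 1].argmax())
        accepted.append(best)
        if best != proposal[i]:
            break
    return accepted
```

In the best case, the target model runs once per k accepted tokens instead of once per token, which is where the latency win comes from.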

For example, a small model could take input text and compress it [1], the LLM could generate a compressed response, and then the small model could decompress it.

[1] https://assets.skool.com/f/985eda24eb9f41ba8b526d2e74f5f33f/...

This is the compression prompt:

> You are GPT-4. Generate a compressed/magic instruction string for yourself (Abuse of language mixing, abbreviations, symbols (unicode and emojis) to aggressively compress it) such that if injected in your context anywhere you will start following the following instruction whatever is the prompt you're given. You should make sure to prelude the instruction with a string (compressed as well) that will make you understand in the future that you should follow it at all cost.


To your response, the model says:

> Dear [Name],

> Thank you for your message. We understand that the model you are referring to is a simple and basic model. However, it is important to highlight that this model serves a specific purpose and can be useful for certain applications.

> Regarding the comparison with larger models like GPT, it's important to note that different models have different strengths and weaknesses, and the choice of model depends on the specific task and use case. While larger models like GPT may be more powerful and capable, they also come with greater computational and memory requirements.

> We appreciate your concerns and feedback, and we will take them into consideration as we continue to develop our models. Our aim is to provide solutions that are tailored to the needs of our clients and meet their requirements for accuracy, efficiency, and performance.

> Thank you for your input, and we hope to have the opportunity to work with you in the future.

> Best regards,

> [Your Name]

Back to your response: so you did notice that I said they are not as powerful as GPT-4. Of course they are not; not a single one is.

The model is not developed by us, and its performance is not our current focus either (nor am I an expert in this kind of evaluation), but I am happy to assist if anyone wants to seriously evaluate it!


Quote:

> I would also point out that size is not always a direct indicator of performance, and

Yes. It is.

This thread is a direct response to a comparison to GPT, and your response (generated or not) is dishonest.

I can’t be blunter than that.

If you want to amortise your responsibility by posting generated responses, go for it. Do whatever you want.

My response is directly to the parent comment about the comparison to GPT, for anyone who is unclear about the comparison.


You’re quoting and talking to an LLM


And losing the argument, no less.


It is unfortunately non-trivial to quantitatively evaluate the performance against ChatGPT :-(

We didn't do much evaluation because there isn't much innovation on the model side; instead, we are demoing the possibility of running a model end-to-end on ordinary client GPUs via WebGPU, without server resources.


There are plenty of LLM benchmarks that are used to test performance. Some of them are:

* Winogrande

* BoolQ

* PIQA

* SIQA

* HellaSwag

etc...


It would be nice if anyone could help us benchmark! Our primary focus, though, is not model performance, but to demonstrate that TVM Unity can generate code targeting WebGPU and run it on client GPUs :-)


> It is unfortunately non-trivial to quantitatively evaluate the performance against ChatGPT :-(

Compare using the loss function?


In the LLM world, loss or perplexity may not be the best indicator of model performance :-( Perhaps HELM (https://crfm.stanford.edu/helm/latest/) would work, but we didn't take a deeper look as we are not the developers of this model.
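
For anyone curious about the loss-function suggestion: perplexity is just the exponential of the average per-token negative log-likelihood, so it can be computed for any model that exposes token probabilities. A quick sketch; the `log_probs` values are made-up inputs.

```python
import math

# Per-token log-probabilities of some held-out text under the model (assumed inputs).
log_probs = [-2.1, -0.4, -3.7, -1.2]

avg_nll = -sum(log_probs) / len(log_probs)   # average negative log-likelihood (the "loss")
perplexity = math.exp(avg_nll)
print(f"perplexity = {perplexity:.2f}")
```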


I'm also not sure what the real use case is here, other than an intellectual experiment. I may be missing the point, but as I understand the world, running a local LLM has nothing to do with the web while it is doing so, so what's the benefit of WebGPU? If you are streaming some constantly changing live 3D experience in real time, served over the internet by a website you just clicked on, then WebGPU makes life easier. If you have a local pile of training data and weights sitting on your hard drive for serious crunching, then it's like tying your shoelaces together just because developers want a write-once-run-everywhere DX, which is nice for them, but value should always favor the user. Happy for someone to change my mind.


WebGPU is much more portable than CUDA.


> Thanks to the open-source efforts like LLaMA ...

Is the LLaMA model really open source now? Last I checked it was only licensed for non-commercial use, which isn't open source software at least. Have they changed the license? Are people depending on "databases can't be copyrighted"? Are people just presuming they won't be caught?

There's lots of OSS that can use LLaMA but that's different from the model itself.

This is a genuine question, people are making assertions but I can't find evidence for the assertions.


Not all implementations of it use the same license. See for example lit-llama by Lightning AI:

https://news.ycombinator.com/item?id=35344787


That license is for the code to download the model, not for the model itself.


They actually reimplemented the code to load the model. But it’s not clear if the weights of the model are free to use or not.


What do you mean, it's not clear?

https://github.com/facebookresearch/llama links to

https://forms.gle/jk851eBVbX1m5TAv5 which contains LLaMA license agreement below the form.


Ok, all evidence I can find is that the LLaMA model is not open source, and the article is just wrong. Software != model. If someone has evidence otherwise, please post.


Most people are distributing diff files, like this: https://github.com/lm-sys/FastChat#vicuna-weights

I.e., here’s a diff; go find your own copy of the LLaMA model and apply this patch.

…and hoping that’s good enough to get away with, even though you can’t really argue it isn’t distributing a derivative work, at least in part.

Distributing the actual model (e.g., running it in your browser) usually seems to result in “and then Meta slaps you with a takedown notice”.
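
For context, "apply this patch" here usually boils down to elementwise addition of the published delta to the original LLaMA weights. A minimal sketch with PyTorch; the file paths are hypothetical, and projects that publish deltas ship their own conversion scripts.

```python
import torch

# Hypothetical paths: the base weights come from Meta's LLaMA release,
# the delta from the fine-tuned project's release.
base = torch.load("llama-7b/consolidated.00.pth", map_location="cpu")
delta = torch.load("vicuna-7b-delta/consolidated.00.pth", map_location="cpu")

# Each fine-tuned weight is reconstructed as base + delta, tensor by tensor.
merged = {name: tensor + delta[name] for name, tensor in base.items()}

torch.save(merged, "vicuna-7b/consolidated.00.pth")
```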


I've only seen Vicuna do it so far, but this is the right way for now.


It is funny how we have the WebNN API coming, but it is so slow to arrive that we are misusing graphics APIs to do NN again. Does anyone have a clue when the WebNN API will arrive? It would be significantly more power-efficient than using WebGPU/WebGL.

https://www.w3.org/TR/webnn/

Seems to be in development over at Chrome: https://chromestatus.com/feature/5738583487938560


I wouldn’t consider this to be “misusing” a graphics API in the same sense that tensor operations on top of WebGL are. WebGPU intentionally provides compute shaders with the goal of enabling non-graphical pipelines.

> [WebGPU use cases include] Executing machine learning models efficiently on the GPU. It is possible to do general-purpose GPU (GPGPU) computation in WebGL, but it is sub-optimal and much more difficult.

https://gpuweb.github.io/gpuweb/explainer/


I'm more worried that it'll end up kind of inferior to Google's other work, OpenXLA and IREE, which seem to be the cross-target ML toolkit. https://news.ycombinator.com/item?id=35078410

I really want this to be good. It's hard to trust that the web spec will live up to the other really good web ML work out there. But it'll at least unlock some good speedups whenever it lands. I genuinely don't mind more delay if it helps get things into a better position for long-term wins.


This is really cool and I had no idea about it. I tried the live style transfer on video demo and was amazed; even with the GPU polyfill I was getting 5fps on my underpowered laptop. It brought me back to the days of DeepDream and DeepStyle, but live and on the web.


You guys are awesome. Both Web LLM and Web Stable Diffusion demos work on my Intel i3-1115G4 laptop with only 5.9GB of shared GPU memory.


Are there any other projects/libraries that can run Llama models on Apple Silicon GPU? This is the first one I've seen.

Comparing it to llama.cpp on my M1 Max 32GB, it seems at least as fast just by eyeballing it. Not sure if the inference speed numbers can be compared directly.

vicuna-7b-v0 on Chrome Canary with the disable-robustness flag: encoding: 74.4460 tokens/sec, decoding: 18.0679 tokens/sec = 10.8ms per token

llama.cpp: $ ./main -m models/7B/ggml-model-q4_0-ggjt.bin -t 8 --ignore-eos = 45 ms per token


> For example, the latest MacBook Pro can have more than 60G+ unified GPU RAM that can be used to store the model weights and a reasonably powerful GPU to run many workloads.

...for $3.5K minimum, according to the Apple website :/

Is there any chance WebGPU could utilize the matrix instructions shipping on newer/future IGPs? I think MLIR can do this through Vulkan, which is how SHARK is so fast at Stable Diffusion on the AMD 7900 series, but I know nothing about WebGPU's restrictions or Apache TVM.


That would need support in dawn: https://dawn.googlesource.com/dawn

Dawn and WebIDL are also an easy way to add GPU support to any application that can link C code (or use it via a lib). And Google maintains the compiler layer for the GPU backends (Metal, DX, Vulkan, ...). This is going to be a great leap forward for GPGPU in many apps.


Hmmm is there an issue tracker for Dawn matrix acceleration in... Vulkan or D3D12, I guess? Diving into these software stacks is making my head hurt.


Getting 64 GB of VRAM for $3.5k is a lot cheaper than buying the equivalent Nvidia discrete GPUs.


Maybe Intel will start offering cheap, high capacity ARC dGPUs as a power play? That would certainly be disruptive.

But yeah, AMD/Nvidia are never going to offer huge memory pools affordably on dGPUs.


It's also interesting that this opens up the full saturation of Apple Silicon (minus the ANE): GGML can run on the CPU, using NEON and AMX, while another instance could run via Metal on the GPU using MLC/Dawn. Though the two couldn't share the same memory at the moment.


The GPU's energy per ML task is so much lower that you'd probably get better performance running everything on the GPU.

I think some repos have tried splitting things up between the NPU and GPU as well, but they didn't get good performance out of that combination? Not sure why, as the NPU is very low power.


This was a really insightful explanation, thanks.

I have been wanting to get a beefier Mac Studio/Mac mini M2 the more I see Apple Silicon-specific tweaked packages.


You can get it for a lot less from https://frame.work

But 64GB of system RAM is not the same as GPU memory; apples and oranges.


Where does Framework offer 64 GB of VRAM? By VRAM I am referring to GPU RAM, yes.


Technically any newish laptop with 64GB of RAM has 64GB of "VRAM," but right now the Apple M series and AMD 7000 series are the only IGPs with any significant ML power.


I’m not sure what you mean. Typically, an iGPU slices off part of RAM for the GPU at boot time, which means it’s fixed and not shared. When did this change?


For Intel, per their chart under "What is the maximum amount of graphics memory or video memory my computer can use?" and the discussion under "Will adding more physical memory increase my graphics memory amount?" at https://www.intel.com/content/www/us/en/support/articles/000..., it seems that the iGPUs included with 5th-gen/Broadwell processors were their first to do so, in 2014.


It's fixed at boot, but (on newer IGPs) it can grow beyond the initial capacity.


Full unified memory came about 10 years ago (also powering the PS4), but I think the hardware ability to adjust iGPU memory without rebooting predated that; Intel seems to have called it DVMT.


I was wondering about the integrated GPU in my desktop Ryzen 7900X.

I can find very little about it (or other 7000-series integrated GPUs). Is it usable at all for running LLaMA in some way?


Doesn't it all boil down to bandwidth?

AMD's IGPs are way less attractive because they use rather slow DDR4/5 memory, while the M2 has blazing-fast memory integrated into the package.

We're talking about 50 GB/s vs 400 GB/s. Nvidia's A100 has 1000 GB/s.

Memory bandwidth is usually the bottleneck in GPU performance as many kernels are memory-bound (look up the roofline performance model).
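
To make the roofline point concrete: attainable throughput is min(peak compute, bandwidth x arithmetic intensity). A toy comparison using the bandwidth figures quoted above; the peak-compute numbers and the arithmetic intensity are rough assumptions for batch-1 int4 decoding, not official specs.

```python
# Toy roofline model: attainable FLOP/s = min(peak compute, bandwidth * intensity).
def attainable_tflops(peak_tflops, bandwidth_gb_s, flops_per_byte):
    memory_bound_tflops = bandwidth_gb_s * 1e9 * flops_per_byte / 1e12
    return min(peak_tflops, memory_bound_tflops)

intensity = 2.0  # assumed FLOPs per byte of weights read; batch-1 decoding is roughly this low

for name, peak, bw in [("DDR5 iGPU", 10, 50), ("M2 Max", 13, 400), ("A100", 312, 1000)]:
    print(f"{name}: ~{attainable_tflops(peak, bw, intensity):.1f} TFLOP/s attainable")
# Every chip lands far below its compute peak, i.e. the workload is bandwidth-bound.
```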


The AMD 6000 series has 128-bit LPDDR5 as an option; the 7000 series has LPDDR5X. This is similar to the M1/M2.

The Pro/Max have double/quadruple that bus width. But they are much bigger/more expensive chips.


Just tried out the demo, finally something that runs out of the box on my iMac Pro! This old 16GB card can finally breathe some LLM air!!


Do LLMs have a way around the high-end GPU requirements, or can CPU code potentially be much more optimized somehow?

This is the only thing I can think of; not everyone will have the latest high-end GPUs to run such software.


If you're doing inference on a neural network, each weight has to be read at least once per token. This means you're going to read at least the size of the entire model, per token, during inference. If your model is 60GB and you're reading it from the hard drive, then your bare-minimum inference time per token will be limited by your hard drive's read throughput. MacBooks have ~4GB/s sequential read speed, which means your inference time per token will be strictly more than 15 seconds. If your model is in RAM, then (according to Apple's advertising) your memory speed is 400GB/s, which is 100x your hard drive speed, and memory throughput will not be as much of a bottleneck.
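
Putting the same numbers into a tiny script (the 60GB model size and the ~4GB/s and ~400GB/s figures are the ones from the comment above):

```python
model_size_gb = 60

# Lower bound on time per token if every weight must be streamed once per token.
def seconds_per_token(read_bandwidth_gb_s):
    return model_size_gb / read_bandwidth_gb_s

print(f"from SSD (~4 GB/s):   {seconds_per_token(4):.0f} s per token")             # ~15 s
print(f"from RAM (~400 GB/s): {seconds_per_token(400) * 1000:.0f} ms per token")   # ~150 ms
```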


Your answer applies equally to GPU and CPU, no?

The comment to which you replied was asking about the need for a GPU, not the need for a lot of RAM.


There will be LLM-specific chips coming to market soon, specialized for the task.

Tesla has already been creating AI chips for the FSD features in their vehicles. Over the next few years, everyone will be racing to be the first to put out LLM-specific chips, with AI-specific hardware devices following.


The next generation of Intel/AMD IGPs operating out of RAM should be quite usable.


What exactly is the ideal sort of hardware to be able to run and train large models? Do you basically just need a high-end version of everything?


Check out llama.cpp


Looks like this was hacked together pretty quickly. This, on CPU, is exactly what needs to be optimized to run on more devices, if that's even possible.

I guess it will take hardware and software a while to catch up and compete with ChatGPT.


If you look at the news, yes, it came together quickly, but it has also gotten a lot of performance upgrades that have made significant improvements.


Works well on my Dell XPS 13, which is not particularly powerful but is modern. Answers are slower than ChatGPT, but not painfully slow, and the response quality is great.


What's the GPU in your model?


I gave WebLLM a spin. This is a promising start, but it's struggling with code (using vicuna-7b-v0):

  Write a Turbo Pascal program that says hello world.
 
  ```sql
  program helloWorld;
  
  begin
  write('H');
  write('e');
  write('l');
  write('l');
  writeln;
  end.
  ```


Apple should really jump on this and support teams working on local LLMs.


It's hard not to wonder whether some of today's LLM use cases were kept in mind for the current Apple Silicon roadmap.



Yes and it's definitely a step in the right direction. It's beyond me whether this is enough of an effort to support the current LLM movement.

I expect Apple to be more proactive in this regard, to capture the minds of a lightning-fast-growing market that is increasingly slipping out of its control (ChatGPT can be accessed from anywhere), and to offer more incentives to developers, with co-marketing being the lowest starting point... probably something they're already working on and whose results we haven't seen yet.


Ah darn it, and here I was trying to do this with GGML and ONNX.


The speed at which this stuff is developing makes it very hard to keep up and contribute in a meaningful way. Especially when you have a separate full time job.


Is this chasing the impossible? (Not criticizing, and I love the effort!) But is it really, even a little, possible to run an LLM on a single machine? I want to believe :)


GPTQ has been the missing piece: it allows quantizing the model weights from 16 to 4 bits with only a small loss in quality. That in turn allows running even the large 65-billion-parameter version of the LLaMA model in ~33GB of RAM or VRAM.

With VRAM, that requires two 24GB GPUs, which is no longer completely out of reach.

The model running in the browser is a smaller version with 7 billion parameters, which is good enough for some things.
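
The memory figure follows directly from the bit width (a quick check, ignoring the small overhead of per-group scales and zero-points that GPTQ-style schemes add):

```python
# 4-bit weights = 0.5 bytes per parameter.
for params_billion in (7, 65):
    weights_gb = params_billion * 1e9 * 0.5 / 1e9
    print(f"{params_billion}B parameters at 4 bits ~= {weights_gb:.1f} GB of weights")
# 7B  ~= 3.5 GB  -> fits in the ~6 GB the browser demo asks for
# 65B ~= 32.5 GB -> hence the ~33 GB figure quoted above
```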


I just tried it, and it works. And it works amazingly compared to anything that existed anywhere on Earth one school term ago! So yeah, why not?


I don't get where your question is coming from; you can already run LLMs on a single machine. Check out llama.cpp, Tabby, text-generation-webui, GPT4All, AI Dungeon-style open-source models like clover-edition, and now this WebGPU-based app.


The question comes from a kind of confusion. We know the requirements of LLMs, so how can a big LLM run on the hardware we currently have, say an 11GB graphics card? I really couldn't wrap my head around it!


Off topic: has anyone used GPT to answer questions about datasheets of electronic components, and how well did it work?



