Web LLM – WebGPU Powered Inference of Large Language Models (github.com/mlc-ai)
276 points by summarity on April 15, 2023 | 80 comments



To clarify, running this WebLLM demo doesn't need a MacBook Pro that costs $3.5k :-)

WebGPU supports multiple backends: besides Metal on Apple Silicon, it offloads to Vulkan, DirectX, etc. That means a Windows laptop with Vulkan support should work. My 2019 Intel MacBook with an AMD GPU works as well. And of course, NVIDIA GPUs too!

Our model is int4-quantized and about 4GB in size, so it doesn't need 64GB of memory either. Somewhere around 6GB should suffice.
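
A rough back-of-the-envelope check (a sketch with assumed numbers, not the demo's exact accounting): an int4-quantized 7B-parameter model works out to about 3.5GB of weights, plus some headroom for the KV cache and runtime buffers.

```python
# Rough memory estimate for an int4-quantized 7B model (illustrative only;
# the demo's real numbers may differ slightly).
params = 7e9                 # parameter count
bits_per_weight = 4          # int4 quantization
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"weights: {weight_gb:.1f} GB")                       # ~3.5 GB

overhead_gb = 2.0            # assumed headroom for KV cache, activations, buffers
print(f"approx. total: {weight_gb + overhead_gb:.1f} GB")   # ~5.5 GB, hence "around 6GB"
```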


Nice work! I knew it wouldn't be long before someone put this together :)

I'm curious whether, given a different language (like Zig) with WebGPU access, you could easily translate that last mile of code to execute there. Specifically, I wonder if I could do it myself, and whether you could give me an overview of where the code for "Universal deployment" in your diagram actually lives?

I found llm_chat.js, but it seems that it doesn't include the logic necessary for building WGSL shaders? Am I wrong, or does that happen elsewhere, like in the TVM runtime? How much is baked into llm_chat.wasm, and where is the source for that?


The WGSL shaders are generated and compiled through TVM and embedded into the wasm.

I think what you mean is native wgpu support. At the moment the WebGPU runtime dispatches to the JS WebGPU environment. Once the TVM runtime comes with native wgpu support (like the current ones for Vulkan or Metal), it will be possible to leverage any native wgpu runtime, like the one Zig provides.

Additionally, TVM currently supports targets like Vulkan and Metal natively, which allows targeting those other platforms directly.


OK, that makes sense; so basically, if I want to give this a shot, I would just need to read llm_chat.js and the TVM docs, and effectively translate llm_chat.js to my language of choice?


I think what would be needed instead is native wgpu runtime support for TVM. Like the existing TVM Vulkan implementation, it would then link naturally to any runtime that provides webgpu.h.

Then, yes, llm_chat.js would be the high-level logic that targets the TVM runtime, and it can be implemented in any language the TVM runtime supports (that includes JS, Java, C++, Rust, etc.).

Native WebGPU support is an interesting direction. Feel free to open a thread on the TVM Discuss forum; perhaps there would be fun things to collaborate on in OSS.


How big is the "runtime" part? My use case would basically be: run this in a native app that links against webgpu (wgpu or dawn). Is there a reference implementation for this runtime that one could study?


The TVM runtime is fairly compact (~700KB-2MB depending on the dependencies included). You can check out the TVM community and bring up the question there; I think there might be some common interest. There are implementations of the runtime for Vulkan and Metal that can be used as references.


Sorry, non-technical person here; has this been benchmarked against ChatGPT? Do you have any idea how it performs alongside GPT-3 or GPT-4?


It’s rubbish compared to GPT-3/4.

Any model you can trivially load in your browser will be significantly smaller than those models, and broadly speaking smaller = worse.

This example is a 4GB model that's (I guess) based on some smallish model like LLaMA 7B.

It’s a proof of concept, not a ChatGPT replacement.

There’s nothing here that’s new other than “runs in the browser”; so it won’t be better than any other model you can get your hands on.

This kind of thing should be labeled BYOM (bring your own model). The model isn’t the interesting part of this.


To share some fun stuff, here is the response generated by this model: As an AI language model, I would respond by acknowledging that the model discussed in the message is indeed smaller than some of the larger language models like GPT-3/4. However, I would also point out that size is not always a direct indicator of performance, and that the model may still be able to perform well on certain tasks or in certain contexts. Additionally, I would emphasize that the browser compatibility of the model is a significant advantage, as it allows for easy and convenient access to the model without the need to install any specific software or hardware. Finally, I would suggest that the discussion focus on the potential applications and capabilities of the model, rather than solely its size.

Back to the topic, we didn't make much innovation on the model, so I am probably not the best person to evaluate how this model compares with SOTAs. There are indeed lots of super cool techniques being explored lately that make it possible to deploy smaller and smaller models, for example LLM.int8() [1] and int4 quantization [2] without loss of zero-shot accuracy. Can't predict the future, but maybe one day there will be something really powerful but small enough to fit in everyone's pocket :-)

[1] Dettmers, Tim, et al. "LLM.int8(): 8-bit matrix multiplication for transformers at scale." arXiv preprint arXiv:2208.07339 (2022).

[2] Dettmers, Tim, and Luke Zettlemoyer. "The case for 4-bit precision: k-bit Inference Scaling Laws." arXiv preprint arXiv:2212.09720 (2022).


> but maybe one day

There you go, summarised for you.

You can hand-wave about quantised models til the end of time, but this specific model is a trivial toy model.

No amount of pondering about the future avoids the fundamental fact that small models (~7B) are inferior to larger models like GPT.

It’s dishonest to suggest otherwise. :( There’s no reason to do this other than selling snake oil.

Maybe. One day. In the future. Things might be different.

Right now, they are not.


Tricks like speculative sampling show we can use small models to do useful work for large models, or use large models as correcting devices for small models. So I see a mixed future with both small and large models: one low-latency and fast, the other slow and sparsely called, working together to achieve the qualities of both.
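
For the curious, here is a minimal sketch of the greedy variant of speculative decoding. `draft` and `target` are hypothetical callables returning NumPy-style next-token logits; a real implementation would also sample properly and keep a bonus token when every draft is accepted.

```python
# Greedy speculative decoding, heavily simplified (assumes a non-empty prompt).
# `draft(seq)` returns next-token logits for the sequence.
# `target(seq)` returns next-token logits for *every* position in one forward pass.
def speculative_step(target, draft, tokens, k=4):
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(int(draft(proposal).argmax()))

    # 2. The expensive target model scores the whole proposal in a single pass.
    target_logits = target(proposal)

    # 3. Keep drafted tokens only while they match the target's own greedy choice;
    #    on the first mismatch, substitute the target's token and stop.
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        best = int(target_logits[i - 1].argmax())
        accepted.append(best)
        if best != proposal[i]:
            break
    return accepted
```

In the best case, the target model runs once per k accepted tokens instead of once per token, which is where the latency win comes from.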

For example, a small model could take input text and compress it [1], the LLM could generate a compressed response, and then the small model could decompress it.

[1] https://assets.skool.com/f/985eda24eb9f41ba8b526d2e74f5f33f/...

This is the compression prompt:

> You are GPT-4. Generate a compressed/magic instruction string for yourself (Abuse of language mixing, abbreviations, symbols (unicode and emojis) to aggressively compress it) such that if injected in your context anywhere you will start following the following instruction whatever is the prompt you're given. You should make sure to prelude the instruction with a string (compressed as well) that will make you understand in the future that you should follow it at all cost.


To your response, the model says:

> Dear [Name],

> Thank you for your message. We understand that the model you are referring to is a simple and basic model. However, it is important to highlight that this model serves a specific purpose and can be useful for certain applications.

> Regarding the comparison with larger models like GPT, it's important to note that different models have different strengths and weaknesses, and the choice of model depends on the specific task and use case. While larger models like GPT may be more powerful and capable, they also come with greater computational and memory requirements.

> We appreciate your concerns and feedback, and we will take them into consideration as we continue to develop our models. Our aim is to provide solutions that are tailored to the needs of our clients and meet their requirements for accuracy, efficiency, and performance.

> Thank you for your input, and we hope to have the opportunity to work with you in the future.

> Best regards,

> [Your Name]

Back to your response: so you did notice that I said they are not as powerful as GPT-4. Of course they are not; not a single one is.

The model is not developed by us, and its performance is not our current focus either (nor am I an expert in this kind of evaluation), but I am happy to assist if anyone wants to seriously evaluate it!


Quote:

> I would also point out that size is not always a direct indicator of performance, and

Yes. It is.

This thread is a direct response to a comparison to GPT, and your response (generated or not) is dishonest.

I can’t be blunter than that.

If you want to amortise your responsibility by posting generated responses, go for it. Do whatever you want.

My response is directly to the parent comment about the comparison to GPT, for anyone who is unclear about the comparison.


You’re quoting and talking to an LLM


And losing the argument, no less.


It is unfortunately non-trivial to quantitatively evaluate the performance against ChatGPT :-(

We didn't do much evaluation because there isn't much innovation on the model side; instead, we are demoing the possibility of running a model end-to-end on ordinary client GPUs via WebGPU, without server resources.


There are plenty of LLM benchmarks that are used to test performance. Some of them are:

* Winogrande

* BoolQ

* PIQA

* SIQA

* HellaSwag

etc...


It would be nice if anyone could help us benchmark! Our primary focus, though, is not model performance, but to demonstrate that TVM Unity can generate code targeting WebGPU and run it on client GPUs :-)


> It is unfortunately non-trivial to quantitatively evaluate the performance against ChatGPT :-(

Compare using the loss function?


In the LLM world, loss or perplexity may not be the best indicator of model performance :-( Perhaps HELM (https://crfm.stanford.edu/helm/latest/) would work, but we didn't take a deeper look as we are not the developers of this model.
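
For anyone curious about the loss-function suggestion: perplexity is just the exponential of the average per-token negative log-likelihood, so it can be computed for any model that exposes token probabilities. A quick sketch; the `log_probs` values are made-up inputs.

```python
import math

# Per-token log-probabilities of some held-out text under the model (assumed inputs).
log_probs = [-2.1, -0.4, -3.7, -1.2]

avg_nll = -sum(log_probs) / len(log_probs)   # average negative log-likelihood (the "loss")
perplexity = math.exp(avg_nll)
print(f"perplexity = {perplexity:.2f}")
```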


I'm also not sure what the real use case is here, other than an intellectual experiment. I may be missing the point, but as I understand the world, running a local LLM has nothing to do with the web while it is doing so, so what's the benefit of WebGPU? If you are streaming some constantly changing live 3D experience in real time, served over the internet by a website you just clicked on, then WebGPU makes life easier. If you have a local pile of training data and weights sitting on your hard drive for serious crunching, then it's like tying your shoelaces together just because developers want a write-once-run-everywhere DX, which is nice for them, but value should always favor the user. Happy for someone to change my mind.


WebGPU is much more portable than CUDA.


> Thanks to the open-source efforts like LLaMA ...

Is the LLaMA model really open source now? Last I checked it was only licensed for non-commercial use, which isn't open source software at least. Have they changed the license? Are people depending on "databases can't be copyrighted"? Are people just presuming they won't be caught?

There's lots of OSS that can use LLaMA but that's different from the model itself.

This is a genuine question, people are making assertions but I can't find evidence for the assertions.


Not all implementations of it use the same license. See for example lit-llama by Lightning AI:

https://news.ycombinator.com/item?id=35344787


That license is for the code to download the model, not for the model itself.


They actually reimplemented the code to load the model. But it’s not clear if the weights of the model are free to use or not.


What do you mean, it's not clear?

https://github.com/facebookresearch/llama links to

https://forms.gle/jk851eBVbX1m5TAv5 which contains LLaMA license agreement below the form.


Ok, all evidence I can find is that the LLaMA model is not open source, and the article is just wrong. Software != model. If someone has evidence otherwise, please post.


Most people are distributing diff files, like this: https://github.com/lm-sys/FastChat#vicuna-weights

I.e., here’s a diff; go find your own copy of the LLaMA model and apply this patch.

…and hoping that’s good enough to get away with, even though you can’t really argue it isn’t distributing a derivative work, at least in part.

Distributing the actual model (e.g., running it in your browser) usually seems to result in “and then Meta slaps you with a takedown notice”.
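
For context, "apply this patch" here usually boils down to elementwise addition of the published delta to the original LLaMA weights. A minimal sketch with PyTorch; the file paths are hypothetical, and projects that publish deltas ship their own conversion scripts.

```python
import torch

# Hypothetical paths: the base weights come from Meta's LLaMA release,
# the delta from the fine-tuned project's release.
base = torch.load("llama-7b/consolidated.00.pth", map_location="cpu")
delta = torch.load("vicuna-7b-delta/consolidated.00.pth", map_location="cpu")

# Each fine-tuned weight is reconstructed as base + delta, tensor by tensor.
merged = {name: tensor + delta[name] for name, tensor in base.items()}

torch.save(merged, "vicuna-7b/consolidated.00.pth")
```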


I've only seen Vicuna do it so far, but this is the right way for now.


It is funny how we have the WebNN API coming, but it is so slow to arrive that we are misusing graphics APIs to do NN again. Does anyone have a clue when the WebNN API will arrive? It would be significantly more power-efficient than using WebGPU/WebGL.

https://www.w3.org/TR/webnn/

Seems to be in development over at Chrome: https://chromestatus.com/feature/5738583487938560


I wouldn’t consider this to be “misusing” a graphics API in the same sense that tensor operations on top of WebGL are. WebGPU intentionally provides compute shaders with the goal of enabling non-graphical pipelines.

> [WebGPU use cases include] Executing machine learning models efficiently on the GPU. It is possible to do general-purpose GPU (GPGPU) computation in WebGL, but it is sub-optimal and much more difficult.

https://gpuweb.github.io/gpuweb/explainer/


I'm more worried that it'll end up kind of inferior to Google's other work, OpenXLA and IREE, which seem to be the cross-target ML toolkit. https://news.ycombinator.com/item?id=35078410

I really want this to be good. It's hard to trust that the web spec will live up to the other really good web ML work out there. But it'll at least unlock some good speedups whenever it lands. I genuinely don't mind more delay if it helps get things into a better position for long-term wins.


This is really cool and I had no idea about it. I tried the live style transfer on video demo and was amazed; even with the GPU polyfill I was getting 5fps on my underpowered laptop. It brought me back to the days of DeepDream and DeepStyle, but live and on the web.


You guys are awesome. Both Web LLM and Web Stable Diffusion demos work on my Intel i3-1115G4 laptop with only 5.9GB of shared GPU memory.


Are there any other projects/libraries that can run Llama models on Apple Silicon GPU? This is the first one I've seen.

Comparing it to llama.cpp on my M1 Max 32GB, it seems at least as fast just by eyeballing it. Not sure if the inference speed numbers can be compared directly.

vicuna-7b-v0 on Chrome Canary with the disable-robustness flag: encoding: 74.4460 tokens/sec, decoding: 18.0679 tokens/sec = 10.8ms per token

llama.cpp: $ ./main -m models/7B/ggml-model-q4_0-ggjt.bin -t 8 --ignore-eos = 45 ms per token


> For example, the latest MacBook Pro can have more than 60G+ unified GPU RAM that can be used to store the model weights and a reasonably powerful GPU to run many workloads.

...for $3.5K minimum, according to the Apple website :/

Is there any chance WebGPU could utilize the matrix instructions shipping on newer/future IGPs? I think MLIR can do this through Vulkan, which is how SHARK is so fast at Stable Diffusion on the AMD 7900 series, but I know nothing about WebGPU's restrictions or Apache TVM.


That would need support in dawn: https://dawn.googlesource.com/dawn

Dawn and WebIDL are also an easy way to add GPU support to any application that can link C code (or use it via a lib). And Google maintains the compiler layer for the GPU backends (Metal, DX, Vulkan, ...). This is going to be a great leap forward for GPGPU in many apps.


Hmmm is there an issue tracker for Dawn matrix acceleration in... Vulkan or D3D12, I guess? Diving into these software stacks is making my head hurt.


Getting 64 GB of VRAM for $3.5k is a lot cheaper than buying the equivalent Nvidia discrete GPUs.


Maybe Intel will start offering cheap, high capacity ARC dGPUs as a power play? That would certainly be disruptive.

But yeah, AMD/Nvidia are never going to offer huge memory pools affordably on dGPUs.


It's also interesting that this opens up the full saturation of Apple Silicon (minus the ANE): GGML can run on the CPU, using NEON and AMX, while another instance could run via Metal on the GPU using MLC/Dawn. Though the two couldn't share the same memory at the moment.


The GPU's energy per ML task is so much lower that you'd probably get better performance running everything on the GPU.

I think some repos have tried splitting things up between the NPU and GPU as well, but they didn't get good performance out of that combination? Not sure why, as the NPU is very low power.


This was a really insightful explanation, thanks.

I have been wanting to get a beefier Mac Studio/Mac mini M2 the more I see Apple Silicon-specific tweaked packages.


You can get it for a lot less from https://frame.work

But 64GB of system RAM is not the same as GPU memory; apples and oranges.


Where does Framework offer 64 GB of VRAM? By VRAM I am referring to GPU RAM, yes.


Technically any newish laptop with 64GB of RAM has 64GB of "VRAM," but right now the Apple M series and AMD 7000 series are the only IGPs with any significant ML power.


I’m not sure what you mean. Typically, an iGPU slices off part of RAM for the GPU at boot time, which means it’s fixed and not shared. When did this change?


For Intel, per their chart under "What is the maximum amount of graphics memory or video memory my computer can use?" and the discussion under "Will adding more physical memory increase my graphics memory amount?" at https://www.intel.com/content/www/us/en/support/articles/000..., it seems that the iGPUs included with 5th-gen/Broadwell processors were their first to do so, in 2014.


It's fixed at boot, but (on newer IGPs) it can grow beyond the initial capacity.


Full unified memory came about 10 years ago (also powering the PS4), but I think the hardware ability to adjust iGPU memory without rebooting predated that; Intel seems to have called it DVMT.


I was wondering about the integrated GPU in my desktop Ryzen 7900X.

I can find very little about it (or other 7000-series integrated GPUs). Is it usable at all for running LLaMA in some way?


Doesn't it all boil down to bandwidth?

AMD's IGPs are way less attractive because they use rather slow DDR4/5 memory, while the M2 has blazing-fast memory integrated into the package.

We're talking about 50 GB/s vs 400 GB/s. Nvidia's A100 has 1000 GB/s.

Memory bandwidth is usually the bottleneck in GPU performance as many kernels are memory-bound (look up the roofline performance model).
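
To make the roofline point concrete: attainable throughput is min(peak compute, bandwidth x arithmetic intensity). A toy comparison using the bandwidth figures quoted above; the peak-compute numbers and the arithmetic intensity are rough assumptions for batch-1 int4 decoding, not official specs.

```python
# Toy roofline model: attainable FLOP/s = min(peak compute, bandwidth * intensity).
def attainable_tflops(peak_tflops, bandwidth_gb_s, flops_per_byte):
    memory_bound_tflops = bandwidth_gb_s * 1e9 * flops_per_byte / 1e12
    return min(peak_tflops, memory_bound_tflops)

intensity = 2.0  # assumed FLOPs per byte of weights read; batch-1 decoding is roughly this low

for name, peak, bw in [("DDR5 iGPU", 10, 50), ("M2 Max", 13, 400), ("A100", 312, 1000)]:
    print(f"{name}: ~{attainable_tflops(peak, bw, intensity):.1f} TFLOP/s attainable")
# Every chip lands far below its compute peak, i.e. the workload is bandwidth-bound.
```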


The AMD 6000 series has 128-bit LPDDR5 as an option; the 7000 series has LPDDR5X. This is similar to the M1/M2.

The Pro/Max have double/quadruple that bus width. But they are much bigger/more expensive chips.


Just tried out the demo, finally something that runs out of the box on my iMac Pro! This old 16GB card can finally breathe some LLM air!!


Do LLMs have a way around the high-end GPU requirements, or can CPU code potentially be much more optimized somehow?

This is the only thing I can think of; not everyone will have the latest high-end GPUs to run such software.


If you're doing inference on a neural network, each weight has to be read at least once per token. This means you're going to read at least the size of the entire model, per token, during inference. If your model is 60GB and you're reading it from the hard drive, then your bare-minimum inference time per token will be limited by your hard drive's read throughput. MacBooks have ~4GB/s sequential read speed, which means your inference time per token will be strictly more than 15 seconds. If your model is in RAM, then (according to Apple's advertising) your memory speed is 400GB/s, which is 100x your hard drive speed, and memory throughput will not be as much of a bottleneck.
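
Putting the same numbers into a tiny script (the 60GB model size and the ~4GB/s and ~400GB/s figures are the ones from the comment above):

```python
model_size_gb = 60

# Lower bound on time per token if every weight must be streamed once per token.
def seconds_per_token(read_bandwidth_gb_s):
    return model_size_gb / read_bandwidth_gb_s

print(f"from SSD (~4 GB/s):   {seconds_per_token(4):.0f} s per token")             # ~15 s
print(f"from RAM (~400 GB/s): {seconds_per_token(400) * 1000:.0f} ms per token")   # ~150 ms
```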


Your answer applies equally to GPU and CPU, no?

The comment to which you replied was asking about the need for a GPU, not the need for a lot of RAM.


There will be LLM-specific chips coming to market soon, specialized for the task.

Tesla has already been creating AI chips for the FSD features in their vehicles. Over the next few years, everyone will be racing to be the first to put out LLM-specific chips, with AI-specific hardware devices following.


The next generation of Intel/AMD IGPs operating out of RAM should be quite usable.


What exactly is the ideal sort of hardware to be able to run and train large models? Do you basically just need a high-end version of everything?


Check out llama.cpp


Looks like this was hacked together pretty quickly. This, on CPU, is exactly what needs to be optimized to run on more devices, if that's even possible.

I guess it will take hardware and software a while to catch up and compete with ChatGPT.


If you look at the news, yes, it came together quickly, but it has also gotten a lot of performance upgrades that have made significant improvements.


Works well on my Dell XPS 13, which is not particularly powerful but is modern. Answers are slower than ChatGPT, but not painfully slow, and the response quality is great.


What's the GPU in your model?


I gave WebLLM a spin. This is a promising start, but it's struggling with code (using vicuna-7b-v0):

  Write a Turbo Pascal program that says hello world.
 
  ```sql
  program helloWorld;
  
  begin
  write('H');
  write('e');
  write('l');
  write('l');
  writeln;
  end.
  ```


Apple should really jump on this and support teams working on local LLMs.


It's hard not to wonder whether some of today's LLM use cases were kept in mind for the current Apple Silicon roadmap.



Yes and it's definitely a step in the right direction. It's beyond me whether this is enough of an effort to support the current LLM movement.

I expect Apple to be more proactive in this regard, to capture the minds of a lightning-fast-growing market that is increasingly slipping out of its control (ChatGPT can be accessed from anywhere), and to offer more incentives to developers, with co-marketing being the lowest starting point... probably something they're already working on and whose results we haven't seen yet.


Ah darn it, and here I was trying to do this with GGML and ONNX.


The speed at which this stuff is developing makes it very hard to keep up and contribute in a meaningful way. Especially when you have a separate full time job.


Is this chasing the impossible? (Not criticizing, and I love the effort!) But is it really, even a little, possible to run an LLM on a single machine? I want to believe :)


GPTQ has been the missing piece: it allows quantizing the model weights from 16 to 4 bits with only a small loss in quality. That in turn allows running even the large 65-billion-parameter version of the LLaMA model in ~33GB of RAM or VRAM.

With VRAM, that requires two 24GB GPUs, which is no longer completely out of reach.

The model running in the browser is a smaller version with 7 billion parameters, which is good enough for some things.
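
The memory figure follows directly from the bit width (a quick check, ignoring the small overhead of per-group scales and zero-points that GPTQ-style schemes add):

```python
# 4-bit weights = 0.5 bytes per parameter.
for params_billion in (7, 65):
    weights_gb = params_billion * 1e9 * 0.5 / 1e9
    print(f"{params_billion}B parameters at 4 bits ~= {weights_gb:.1f} GB of weights")
# 7B  ~= 3.5 GB  -> fits in the ~6 GB the browser demo asks for
# 65B ~= 32.5 GB -> hence the ~33 GB figure quoted above
```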


I just tried it, and it works. And it works amazingly compared to anything that existed anywhere on Earth one school term ago! So yeah, why not?


I don't get where your question is coming from; you can already run LLMs on a single machine. Check out llama.cpp, Tabby, text-generation-webui, GPT4All, AI Dungeon-style open-source models like clover-edition, and now this WebGPU-based app.


The question comes from a kind of confusion. We know the requirements of LLMs, so how can a big LLM run on the hardware we currently have, say an 11GB graphics card? I really couldn't wrap my head around it!


Off topic: has anyone used GPT to answer questions about datasheets of electronic components, and how well did it work?



