WebLLM: Llama2 in the Browser (mlc.ai)
192 points by meiraleal on Aug 29, 2023 | 31 comments


I ran Llama 2 70B in my browser (Chrome Canary on a 64GB MacBook M2) using this. It took a long time to start running, but then:

prefill: 0.9654 tokens/sec, decoding: 3.2589 tokens/sec

Honestly amazed that this is even possible. I haven't even run Llama 2 70B on my laptop NOT using a web browser yet.


That is impressive. Interesting that the prefill (I'm guessing that's prompt processing) is so much slower than decoding.

It's my understanding that under normal circumstances decoding is memory-bandwidth bound, while prompt processing isn't, since prompt tokens can be batched. Is there some quirk in your setup?
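For a rough sense of scale, here's a back-of-envelope; the quantization and bandwidth numbers are assumptions for illustration, not measurements from your setup:

    // Decoding reads (roughly) every weight once per generated token,
    // so tokens/sec is capped at memory bandwidth / weight bytes.
    const params = 70e9;                 // 70B parameters
    const bytesPerParam = 0.5;           // assuming ~4-bit quantization
    const weightBytes = params * bytesPerParam;     // ~35 GB
    const bandwidthBytesPerSec = 400e9;  // assuming ~400 GB/s unified memory
    console.log((bandwidthBytesPerSec / weightBytes).toFixed(1) + " tokens/sec ceiling"); // ~11.4

Decoding at ~3 tokens/sec sits under that ceiling, while prefill can batch many prompt tokens per pass over the weights, so seeing prefill come in slower per token hints that the prefill path isn't batching effectively here.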


Strange. I'm running Llama 2 70B on Chrome Canary on a 64GB MacBook M1 Max...~1.5 years older...and seeing better performance.

It's slow but usable!

prefill: 2.1963 tokens/sec, decoding: 3.4708 tokens/sec


RTX 4080, Llama-2-7b-chat-hf-q4f32_1

prefill: 210.6742 tokens/sec, decoding: 20.7758 tokens/sec

Can't tell if it's really taking advantage of all the power of the card, though.


Mac Studio 2022, 64 GB RAM, M1 Max. Lots of other stuff running. prefill: 1.1620 tokens/sec, decoding: 2.4105 tokens/sec


This is crazy good performance!


How's the quantization?


Related: I built karpathy's llama2.c (https://github.com/karpathy/llama2.c) to WASM without modifications and ran it in the browser. It was a fun exercise to directly compare native vs. web performance. I'm getting 80% of native performance on my M1 MacBook Air and haven't spent any time optimizing the WASM side.

Demo: https://diegomarcos.com/llama2.c-web/

Code: https://github.com/dmarcos/llama2.c-web


Thanks a lot for this! I was looking for the equivalent of WebLLM that runs on CPU only.


You're welcome. Any feedback and contributions are super appreciated.


Cool. Nice example of Atwood's Law [0] (though not really JS, of course).

If somebody hasn't tried running LLMs yet, here are some lines that do the job in Google Colab or locally.

  ! git clone https://github.com/ggerganov/llama.cpp.git

  ! wget "https://huggingface.co/TheBloke/CodeLlama-7B-GGUF/resolve/main/codellama-7b.Q8_0.gguf" -P llama.cpp/models

  ! cd llama.cpp && make

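  # -ins: instruct (interactive) mode, -n -1: unlimited output tokens, -t 8: CPU threads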
  ! ./llama.cpp/main -m ./llama.cpp/models/codellama-7b.Q8_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8

[0]: https://en.wikipedia.org/wiki/Atwood's_Law


Thank you!

What are the exclamation points for though? In a *nix shell they'll expand to a command from history - copy-pasters beware!


Google Colab/Jupyter instructions to run a shell command, presumably.


Yes, sorry for being unclear. I hope people who use shells will notice.

Shameless plug: https://github.com/jankovicsandras/ml <- here are some minimal Colab / Jupyter notebooks for absolute beginners.

I just find it amazing how little effort it takes to run an LLM nowadays.


Great, but wth is wrong with it? https://i.imgur.com/gWIilWU.png


I suspect they trained it on old stories to which they added this caveat, and now “once upon a time” has become tightly coupled to the caveat in the model.


Yes, we wouldn't want to produce output that perpetuates harmful stereotypes about people who live in gingerbread houses; dangerously over-estimates the suitability of hair for safely working at height; or creates unrealistic expectations about the hospitality of people with dwarfism.

I wonder if this sort of behaviour was more nuanced in the initial model, and something like quantisation has degraded the performance?


In fairness, there are lots of things in old tales we may not want an LLM to take literally.

For instance, unlike kids, at training time an LLM isn't going to ask “It's not very nice for the parents to abandon their children in the forest, is it?”.

I know conservatives are easily triggered by such caveats, but at the same time, they are literally banning books from libraries ¯\_(ツ)_/¯


My guess is that they accidentally trained it to object to the past for being racist, i.e. "once upon a time" promotes "outdated" attitudes.


Ya, to me this is an immediate disqualification. They're building political commissars into the tech, and the rules are actual nonsense political correctness. Instead of blocking actual racism etc. they block "once upon a time"?

Throw it in the trash; it's worthless.


AI models absorb all kinds of racist/sexist/hateful speech, so they have to be neutered or they'll end up like that Microsoft AI that started spouting Nazi lingo after a day or two of training because of trolls.

Apparently AI companies can't be bothered to filter out the harmful training data, so you end up with this warning every time you reference something even remotely controversial. It paints a bleak future if AI companies keep producing these censoring AIs rather than fixing the problem with their input.


How do they load 70B weights with the ~4GB heap limit of JS/WASM?


It uses WebGPU, which I'm guessing allocates memory outside the 4GB limit?


WebGPU doesn't ingest data directly; it always goes through JS/WASM. Presumably they are streaming the data into WebGPU in small chunks. I could imagine several ways of doing it, but sometimes in browsers things that ought to work don't, so a demonstrated working method would be interesting to know about, if someone can locate the actual code that does it. (Also, it's pretty silly that you can allocate far more VRAM than RAM on the web platform.)
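One way I could imagine doing it (function name, alignment handling, and the single-buffer simplification are all mine, purely illustrative; I haven't checked the actual WebLLM code):

    // Stream a large weight file into a WebGPU buffer in small pieces, so
    // only one chunk (a few MB) ever lives on the JS/WASM heap at a time.
    async function uploadWeights(device: GPUDevice, url: string, byteLength: number): Promise<GPUBuffer> {
      const gpuBuffer = device.createBuffer({
        size: byteLength,
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
      });
      const reader = (await fetch(url)).body!.getReader();
      let offset = 0;
      let pending = new Uint8Array(0); // writeBuffer wants 4-byte-aligned sizes
      while (true) {
        const { value, done } = await reader.read();
        if (done) break;
        const chunk = new Uint8Array(pending.length + value.length);
        chunk.set(pending);
        chunk.set(value, pending.length);
        const aligned = chunk.length & ~3;            // largest multiple of 4
        device.queue.writeBuffer(gpuBuffer, offset, chunk, 0, aligned);
        offset += aligned;
        pending = chunk.slice(aligned);               // carry the unaligned tail forward
      }
      return gpuBuffer; // assumes the file length is a multiple of 4, so nothing is left over
    }

In practice the weights would have to be split across many smaller buffers (WebGPU caps individual buffer sizes), but the chunked-upload idea is the same.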


WebGPU sure is a faff to set up with all of these flags and secret settings. It'll be a while before anyone can seriously use this tech in a real product.

The RedPajama one seems to work alright. It still often ends up getting stuck in a loop, though. GTX 1080. prefill: 34.2173 tokens/sec, decoding: 19.8731 tokens/sec

llama-2 is generating pure nonsense for me, just random letters, numbers, and punctuation. Vicuna-v1-7b-q4f32_0 is slightly better. None of the fp16 models work on my GPU; I'm guessing it's a hardware limitation.

At least Vicuna generates words, but it sure makes it obvious how much these models are just autocorrect. It reads like I'm tapping the word-prediction button on my phone's keyboard.

    Human: Once upon a time

    AI: gro (2 o r (tella, asan) additionaly, as the combination, as the,as they, as the arrived, as they, the, as the as being, the, were, as the, as themselves, the, as their, the as them, the as their, the arrived, the, as the, as themselves, the, as the, as their, the, the as their, as themselves, as the, as their, the, the as their, the, as their, the, as their, the as their, the, as their, the, as their, the, as themselves, the, as their, the, as their, the, as their, the, as their, as their, the, as their, the, as their, the, as their, as their, the, as their, as their, as their, the, as their, the, as their, the, as their, as their, as their, the, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as


Interesting, llama-2 70b was coherent for me, Apple M1 Max.


*GTX 1650* prefill: 5.2211 tokens/sec, decoding: 0.4233 tokens/sec


What model variant?


This is pretty amazing.

prefill: 16.9337 tokens/sec, decoding: 0.4631 tokens/sec on a Radeon 6600XT on the 7b default model.

It does feel like something isn't quite right, as it's only using a few % of GPU/CPU - though it is using the AMD GPU! Which I have never managed to get working on Windows or Linux with llama.cpp directly.

I wonder if using WebGPU could somehow avoid all the horrendous problems of GPU support in LLMs, since there seems to be a semi-working abstraction layer here that works across M1/M2, Nvidia and AMD?
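From what I understand, that's exactly the promise of WebGPU: one API that the browser lowers to Metal on macOS, D3D12 on Windows and Vulkan on Linux, so the same kernels run on Apple, Nvidia and AMD GPUs. A minimal probe (illustrative only) looks like:

    // The same call path works regardless of vendor; the browser picks the backend.
    const adapter = await navigator.gpu?.requestAdapter();
    if (!adapter) throw new Error("WebGPU not available in this browser");
    const device = await adapter.requestDevice();
    console.log("maxBufferSize:", device.limits.maxBufferSize);
    console.log("maxStorageBufferBindingSize:", device.limits.maxStorageBufferBindingSize);

The per-GPU driver work moves into the browser's WebGPU implementation instead of each project's CUDA/ROCm/Metal backends, which is presumably why it can work on cards the native paths struggle with.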


Demo now broken?

Generate error, Error: Chat module not yet initialized, did you call chat.reload?
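For what it's worth, the error suggests generate() ran before a model was loaded with reload(), which in the demo probably means the model download or initialization failed. Roughly, with the WebLLM API of that era (treat the exact names and signatures as approximate):

    import { ChatModule } from "@mlc-ai/web-llm";

    const chat = new ChatModule();
    // reload() downloads and initializes the model; it must complete
    // before generate() can be called.
    await chat.reload("Llama-2-7b-chat-hf-q4f32_1");
    console.log(await chat.generate("Once upon a time"));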


I got prefill: 26.9719 tokens/sec, decoding: 18.8827 tokens/sec on an M1 Max 32GB laptop for Llama 2 7B chat f32. Not bad.



