WebLLM: Llama2 in the Browser (mlc.ai)
192 points by meiraleal on Aug 29, 2023 | 31 comments


I ran Llama 2 70B in my browser (Chrome Canary on a 64GB MacBook M2) using this. It took a long time to start running, but then:

prefill: 0.9654 tokens/sec, decoding: 3.2589 tokens/sec

Honestly amazed that this is even possible. I haven't even run Llama 2 70B on my laptop NOT using a web browser yet.


That is impressive. Interesting that the prefill (I'm guessing that's prompt processing) is so much slower than decoding.

It's my understanding that under normal circumstances decoding is memory-bandwidth bound, while prompt processing isn't, since prompt tokens can be batched. Is there some quirk in your setup?
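For a rough sense of scale, here's a back-of-envelope; the quantization and bandwidth numbers are assumptions for illustration, not measurements from your setup:

    // Decoding reads (roughly) every weight once per generated token,
    // so tokens/sec is capped at memory bandwidth / weight bytes.
    const params = 70e9;                 // 70B parameters
    const bytesPerParam = 0.5;           // assuming ~4-bit quantization
    const weightBytes = params * bytesPerParam;     // ~35 GB
    const bandwidthBytesPerSec = 400e9;  // assuming ~400 GB/s unified memory
    console.log((bandwidthBytesPerSec / weightBytes).toFixed(1) + " tokens/sec ceiling"); // ~11.4

Decoding at ~3 tokens/sec sits under that ceiling, while prefill can batch many prompt tokens per pass over the weights, so seeing prefill come in slower per token hints that the prefill path isn't batching effectively here.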


Strange. I'm running Llama 2 70B on Chrome Canary on a 64GB MacBook M1 Max...~1.5 years older...and seeing better performance.

It's slow but usable!

prefill: 2.1963 tokens/sec, decoding: 3.4708 tokens/sec


RTX 4080, Llama-2-7b-chat-hf-q4f32_1

prefill: 210.6742 tokens/sec, decoding: 20.7758 tokens/sec

Can't tell if it's really taking advantage of all the power of the card, though.


Mac Studio 2022, 64 GB RAM, M1 Max. Lots of other stuff running. prefill: 1.1620 tokens/sec, decoding: 2.4105 tokens/sec


This is crazy good performance!


How's the quantization?


Related: I built karpathy's llama2.c (https://github.com/karpathy/llama2.c) to WASM without modifications and ran it in the browser. It was a fun exercise to directly compare native vs. web performance. I'm getting 80% of native performance on my M1 MacBook Air and haven't spent any time optimizing the WASM side.

Demo: https://diegomarcos.com/llama2.c-web/

Code: https://github.com/dmarcos/llama2.c-web


Thanks a lot for this! I was looking for the equivalent of WebLLM that runs on CPU only.


You're welcome. Any feedback and contributions are super appreciated.


Cool. Nice example of Atwood's Law [0] (though not really JS, of course).

If somebody hasn't tried running LLMs yet, here are some lines that do the job in Google Colab or locally.

  ! git clone https://github.com/ggerganov/llama.cpp.git

  ! wget "https://huggingface.co/TheBloke/CodeLlama-7B-GGUF/resolve/main/codellama-7b.Q8_0.gguf" -P llama.cpp/models

  ! cd llama.cpp && make

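  # -ins: instruct (interactive) mode, -n -1: unlimited output tokens, -t 8: CPU threads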
  ! ./llama.cpp/main -m ./llama.cpp/models/codellama-7b.Q8_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8

[0]: https://en.wikipedia.org/wiki/Atwood's_Law


Thank you!

What are the exclamation points for though? In a *nix shell they'll expand to a command from history - copy-pasters beware!


Google Colab/Jupyter instructions to run a shell command, presumably.


Yes, sorry for being unclear. I hope people who use shells will notice.

Shameless plug: https://github.com/jankovicsandras/ml <- here are some minimal Colab / Jupyter notebooks for absolute beginners.

I just find it amazing how little effort it takes to run an LLM nowadays.


Great, but wth is wrong with it? https://i.imgur.com/gWIilWU.png


I suspect they trained it on old stories to which they added this caveat, and now “once upon a time” has become tightly coupled to the caveat in the model.


Yes, we wouldn't want to produce output that perpetuates harmful stereotypes about people who live in gingerbread houses; dangerously over-estimates the suitability of hair for safely working at height; or creates unrealistic expectations about the hospitality of people with dwarfism.

I wonder if this sort of behaviour was more nuanced in the initial model, and something like quantisation has degraded the performance?


In fairness, there are lots of things in old tales we may not want an LLM to take literally.

For instance, unlike kids, at training time an LLM isn't going to ask “It's not very nice for the parents to abandon their children in the forest, is it?”.

I know conservatives are easily triggered by such caveats, but at the same time, they are literally banning books from libraries ¯\_(ツ)_/¯


My guess is that they accidentally trained it to object to the past for being racist, i.e. "once upon a time" promotes "outdated" attitudes.


Ya, to me this is an immediate disqualification. They're building political commissars into the tech, and the rules are actual nonsense political correctness. Instead of blocking actual racism etc. they block "once upon a time"?

Throw it in the trash; it's worthless.


AI models absorb all kinds of racist/sexist/hateful speech, so they have to be neutered or they'll end up like that Microsoft AI that started spouting Nazi lingo after a day or two of training because of trolls.

Apparently AI companies can't be bothered to filter out the harmful training data, so you end up with this warning every time you reference something even remotely controversial. It paints a bleak future if AI companies keep producing these censoring AIs rather than fixing the problem with their input.


How do they load 70B weights with the ~4GB heap limit of JS/WASM?


It uses WebGPU, which I'm guessing allocates memory outside the 4GB limit?


WebGPU doesn't ingest data directly; it always goes through JS/WASM. Presumably they are streaming the data into WebGPU in small chunks. I could imagine several ways of doing it, but sometimes in browsers things that ought to work don't, so a demonstrated working method would be interesting to know about, if someone can locate the actual code that does it. (Also, it's pretty silly that you can allocate far more VRAM than RAM on the web platform.)
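One way I could imagine doing it (function name, alignment handling, and the single-buffer simplification are all mine, purely illustrative; I haven't checked the actual WebLLM code):

    // Stream a large weight file into a WebGPU buffer in small pieces, so
    // only one chunk (a few MB) ever lives on the JS/WASM heap at a time.
    async function uploadWeights(device: GPUDevice, url: string, byteLength: number): Promise<GPUBuffer> {
      const gpuBuffer = device.createBuffer({
        size: byteLength,
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
      });
      const reader = (await fetch(url)).body!.getReader();
      let offset = 0;
      let pending = new Uint8Array(0); // writeBuffer wants 4-byte-aligned sizes
      while (true) {
        const { value, done } = await reader.read();
        if (done) break;
        const chunk = new Uint8Array(pending.length + value.length);
        chunk.set(pending);
        chunk.set(value, pending.length);
        const aligned = chunk.length & ~3;            // largest multiple of 4
        device.queue.writeBuffer(gpuBuffer, offset, chunk, 0, aligned);
        offset += aligned;
        pending = chunk.slice(aligned);               // carry the unaligned tail forward
      }
      return gpuBuffer; // assumes the file length is a multiple of 4, so nothing is left over
    }

In practice the weights would have to be split across many smaller buffers (WebGPU caps individual buffer sizes), but the chunked-upload idea is the same.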


WebGPU sure is a faff to set up with all of these flags and secret settings. It'll be a while before anyone can seriously use this tech in a real product.

The RedPajama one seems to work alright. It still often ends up getting stuck in a loop, though. GTX 1080. prefill: 34.2173 tokens/sec, decoding: 19.8731 tokens/sec

llama-2 is generating pure nonsense for me, just random letters, numbers, and punctuation. Vicuna-v1-7b-q4f32_0 is slightly better. None of the fp16 models work on my GPU; I'm guessing it's a hardware limitation.

At least Vicuna generates words, but it sure makes it obvious how much these models are just autocorrect. It reads like I'm tapping the word-prediction button on my phone's keyboard.

    Human: Once upon a time

    AI: gro (2 o r (tella, asan) additionaly, as the combination, as the,as they, as the arrived, as they, the, as the as being, the, were, as the, as themselves, the, as their, the as them, the as their, the arrived, the, as the, as themselves, the, as the, as their, the, the as their, as themselves, as the, as their, the, the as their, the, as their, the, as their, the as their, the, as their, the, as their, the, as themselves, the, as their, the, as their, the, as their, the, as their, as their, the, as their, the, as their, the, as their, as their, the, as their, as their, as their, the, as their, the, as their, the, as their, as their, as their, the, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as


Interesting, llama-2 70b was coherent for me, Apple M1 Max.


*GTX 1650* prefill: 5.2211 tokens/sec, decoding: 0.4233 tokens/sec


What model variant?


This is pretty amazing.

prefill: 16.9337 tokens/sec, decoding: 0.4631 tokens/sec on a Radeon 6600XT on the 7b default model.

It does feel like something isn't quite right, as it's only using a few % of GPU/CPU - though it is using the AMD GPU! Which I have never managed to get working on Windows or Linux with llama.cpp directly.

I wonder if using WebGPU could somehow avoid all the horrendous problems of GPU support in LLMs, since there seems to be a semi-working abstraction layer here that works across M1/M2, Nvidia and AMD?
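From what I understand, that's exactly the promise of WebGPU: one API that the browser lowers to Metal on macOS, D3D12 on Windows and Vulkan on Linux, so the same kernels run on Apple, Nvidia and AMD GPUs. A minimal probe (illustrative only) looks like:

    // The same call path works regardless of vendor; the browser picks the backend.
    const adapter = await navigator.gpu?.requestAdapter();
    if (!adapter) throw new Error("WebGPU not available in this browser");
    const device = await adapter.requestDevice();
    console.log("maxBufferSize:", device.limits.maxBufferSize);
    console.log("maxStorageBufferBindingSize:", device.limits.maxStorageBufferBindingSize);

The per-GPU driver work moves into the browser's WebGPU implementation instead of each project's CUDA/ROCm/Metal backends, which is presumably why it can work on cards the native paths struggle with.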


Demo now broken?

Generate error, Error: Chat module not yet initialized, did you call chat.reload?
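For what it's worth, the error suggests generate() ran before a model was loaded with reload(), which in the demo probably means the model download or initialization failed. Roughly, with the WebLLM API of that era (treat the exact names and signatures as approximate):

    import { ChatModule } from "@mlc-ai/web-llm";

    const chat = new ChatModule();
    // reload() downloads and initializes the model; it must complete
    // before generate() can be called.
    await chat.reload("Llama-2-7b-chat-hf-q4f32_1");
    console.log(await chat.generate("Once upon a time"));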


I got prefill: 26.9719 tokens/sec, decoding: 18.8827 tokens/sec on an M1 Max 32GB laptop for Llama 2 7B chat f32. Not bad.



