That is impressive. Interesting that the prefill (I'm guessing that's prompt processing) is so much slower than decoding.
It's my understanding that under normal circumstances decoding is memory-bandwidth bound, while prompt processing isn't, thanks to batching. Is there some quirk in your setup?
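Back-of-the-envelope (illustrative numbers of my own, not from this thread): each decoded token has to stream every weight from memory once, while prefill amortizes each weight read across all the prompt tokens in the batch, so it's compute-bound instead. A quick sketch:

```ts
// Decode-speed ceiling, assuming every generated token streams all
// weights from memory once (all numbers are illustrative assumptions).
const params = 7e9;                // 7B-parameter model
const bytesPerParam = 0.5;         // ~4-bit quantization
const bandwidth = 100e9;           // ~100 GB/s memory bandwidth
const ceiling = bandwidth / (params * bytesPerParam);
console.log(`~${ceiling.toFixed(0)} tokens/sec upper bound`); // ~29
```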
Related: I built karpathy's llama2.c (https://github.com/karpathy/llama2.c) to WASM without modifications and ran it in the browser. It was a fun exercise to directly compare native vs. web perf. I'm getting 80% of native performance on my M1 MacBook Air and haven't spent any time optimizing the WASM side.
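For anyone curious, the build is basically a one-liner. A hypothetical sketch from memory (these are real emcc flags, but I haven't re-verified this exact invocation, and the .bin filenames are just the ones from the llama2.c repo):

```sh
# Compile karpathy's run.c to WASM with Emscripten; --preload-file bundles
# the weights and tokenizer into the page's virtual filesystem.
emcc run.c -O3 -lm \
  -s ALLOW_MEMORY_GROWTH=1 \
  --preload-file stories15M.bin --preload-file tokenizer.bin \
  -o run.html
```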
I suspect they trained it on old stories to which they added this caveat, and now “once upon a time” has become tightly coupled to the caveat in the model.
Yes, we wouldn't want to produce output that perpetuates harmful stereotypes about people who live in gingerbread houses; dangerously over-estimates the suitability of hair for safely working at height; or creates unrealistic expectations about the hospitality of people with dwarfism.
I wonder if this sort of behaviour was more nuanced in the initial model, and whether something like quantisation has degraded it?
In fairness, there are lots of things in old tales we may not want an LLM to take literally.
For instance, unlike kids, at training time an LLM isn't going to ask “It's not very nice for the parents to abandon their children in the forest, is it?”.
I know conservatives are easily triggered by such caveats, but at the same time, they are literally banning books from libraries ¯\_(ツ)_/¯
Ya, to me this is an immediate disqualification. They're building the political commissars into the tech, and the rules are actual political-correctness nonsense. Instead of blocking actual racism etc., they block "once upon a time"?
AI models absorb all kinds of racist/sexist/hateful speech, so they have to be neutered, or they'll end up like that Microsoft AI (Tay) that started spouting Nazi lingo after a day or two of training because of trolls.
Apparently AI companies can't be bothered to filter the harmful training data out, so you end up with this warning every time you reference something even remotely controversial. It paints a bleak future if AI companies keep producing these censoring AIs rather than fixing the problem with their input.
WebGPU doesn't ingest data directly; it always goes through JS/WASM. Presumably they are streaming the data in small chunks through to WebGPU. I can imagine several ways of doing it, but sometimes in browsers things that ought to work don't, so a demonstrated working method would be interesting to know about, if someone can locate the actual code that does it. (Also, it's pretty silly that you can allocate far more VRAM than RAM on the web platform.)
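Something along these lines is what I'd imagine, though whether the actual demo does it this way I don't know. A minimal sketch, assuming a fetchable URL and a pre-created GPUBuffer with COPY_DST usage; the carry buffer is there because writeBuffer offsets and sizes must be multiples of 4:

```ts
// Stream a large weights file into a GPUBuffer chunk by chunk, so the
// whole file never has to sit in JS memory at once. Assumes `buffer`
// has GPUBufferUsage.COPY_DST and a size rounded up to a multiple of 4.
async function streamToGpu(url: string, device: GPUDevice, buffer: GPUBuffer) {
  const reader = (await fetch(url)).body!.getReader();
  let offset = 0;
  let carry = new Uint8Array(0); // leftover bytes below the 4-byte boundary
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = new Uint8Array(carry.length + value.length);
    chunk.set(carry);
    chunk.set(value, carry.length);
    const aligned = chunk.length & ~3; // largest writable multiple of 4
    if (aligned > 0) {
      device.queue.writeBuffer(buffer, offset, chunk, 0, aligned);
      offset += aligned;
    }
    carry = chunk.slice(aligned);
  }
  if (carry.length > 0) {
    const tail = new Uint8Array(4); // zero-pad the final partial word
    tail.set(carry);
    device.queue.writeBuffer(buffer, offset, tail);
  }
}
```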
WebGPU sure is a faff to set up with all of these flags and secret settings. It'll be a while before anyone can seriously use this tech in a real product.
The RedPajama one seems to work alright. It still often ends up getting stuck in a loop, though. GTX 1080. prefill: 34.2173 tokens/sec, decoding: 19.8731 tokens/sec
llama-2 is generating pure nonsense for me, just random letters, numbers, and punctuation. Vicuna-v1-7b-q4f32_0 is slightly better. None of the fp16 models work on my GPU; I'm guessing it's a hardware limitation.
At least vicuna generates words, but it sure becomes obvious how much these models are just autocorrect. It reads like I'm tapping the word prediction button on my phone's keyboard.
Human: Once upon a time
AI: gro (2 o r (tella, asan) additionaly, as the combination, as the,as they, as the arrived, as they, the, as the as being, the, were, as the, as themselves, the, as their, the as them, the as their, the arrived, the, as the, as themselves, the, as the, as their, the, the as their, as themselves, as the, as their, the, the as their, the, as their, the, as their, the as their, the, as their, the, as their, the, as themselves, the, as their, the, as their, the, as their, the, as their, as their, the, as their, the, as their, the, as their, as their, the, as their, as their, as their, the, as their, the, as their, the, as their, as their, as their, the, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as their, as
prefill: 16.9337 tokens/sec, decoding: 0.4631 tokens/sec on a Radeon 6600XT on the 7b default model.
It does feel like something isn't quite right, as it's only using a few percent of GPU/CPU, though it is using the AMD GPU! That's something I've never managed to get working on Windows or Linux with llama.cpp directly.
I wonder if using WebGPU would somehow avoid all the horrendous problems of GPU support for LLMs, since there seems to be some sort of semi-working abstraction layer here that works across M1/M2, Nvidia, and AMD?
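That abstraction is basically WebGPU's pitch: the same calls get mapped onto Metal, D3D12, or Vulkan, with the browser picking the backend. A trivial bring-up check (standard WebGPU API, nothing specific to this demo):

```ts
// The same code runs on Apple, AMD, and Nvidia GPUs; the browser maps it
// to Metal, D3D12, or Vulkan underneath.
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error("WebGPU not available");
const device = await adapter.requestDevice();
console.log([...adapter.features], device.limits.maxBufferSize);
```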
prefill: 0.9654 tokens/sec, decoding: 3.2589 tokens/sec
Honestly amazed that this is even possible. I haven't even run Llama 2 70B on my laptop NOT using a web browser yet.