Remove "and Python 3.11" from title. Python used only for converting model to llama.cpp project format, 3.10 or whatever is fine.
Additionally, llama.cpp works fine on 10-year-old hardware that supports AVX2.
I'm running llama.cpp right now on an ancient 2013 Intel i5 MacBook with only 2 cores and 8 GB RAM: the 7B 4-bit model loads in 8 seconds into 4.2 GB of RAM and runs at about 600 ms per token.
btw: does anyone know how to disable swap per process on macOS? Even though there is enough free RAM, sometimes macOS decides to use swap instead.
> Remove "and Python 3.11" from title. Python used only for converting model to llama.cpp project format, 3.10 or whatever is fine.
As @rnosov notes elsewhere in the thread, this post has a workaround for the PyTorch issue with Python 3.11, which is why the "and Python 3.11" qualification is there.
In this particular case that doesn't matter, because the only time you run Python is for a one-off conversion against the model files.
That takes at most a minute to run, but once converted you'll never need to run it again. Actual llama.cpp model inference uses compiled C++ code with no Python involved at all.
Can you provide a link to the guide or steps you followed to get this up and running? I have a physical Linux machine with 300+ GB of RAM and would love to try out LLaMA on it, but I'm not sure where to get started to get it working with such a configuration.
Give us an update on the 30B model! I have 13B running easily on my M2 Air (24GB ram), just waiting until I'm on an unmetered connection to download the 30B model and give it a go.
I am running the 30B model on my M1 Mac Studio with 32 GB of RAM.
(venv) bherman@Rattata ~/llama.cpp$ ./main -m ./models/30B/ggml-model-q4_0.bin -t 8 -n 128
main: seed = 1678666507
llama_model_load: loading model from './models/30B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: ggml ctx size = 20951.50 MB
llama_model_load: memory_size = 1560.00 MB, n_mem = 30720
llama_model_load: loading model part 1/4 from './models/30B/ggml-model-q4_0.bin'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 2/4 from './models/30B/ggml-model-q4_0.bin.1'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 3/4 from './models/30B/ggml-model-q4_0.bin.2'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 4/4 from './models/30B/ggml-model-q4_0.bin.3'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
main: prompt: 'When'
main: number of tokens in prompt = 2
1 -> ''
10401 -> 'When'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000
When you need the help of an Auto Locksmith Kirtlington look no further than our team of experts who are always on call 24 hours a day 365 days a year.
We have a team of auto locksmiths on call in Kirtlington 24 hours a day 365 days a year to help with any auto locksmith emergency you may find yourself in, whether it be repairing an broken omega lock, reprogramming your car transponder keys, replacing a lacking vehicle key or limiting chipped car fobs, our team of auto lock
main: mem per token = 43387780 bytes
main: load time = 35493.44 ms
main: sample time = 281.98 ms
main: predict time = 34094.89 ms / 264.30 ms per token
main: total time = 74651.21 ms
It definitely runs. It uses almost 20GB of RAM so I had to exit my browser and VS Code to keep the memory usage down.
But it produces completely garbled output. Either there's a bug in the program, or the tokens are different from the 13B model's, or I performed the conversion wrong, or the 4-bit quantization breaks it.
I've finally managed to download the model and it seems to be working well for me. There have been some updates to the quantization code, so maybe if you do a 'git pull && make' and rerun the quantization script it will work for you. I'm getting about 350 ms per token with the 30B model.
I ran the 7b model on a prompt about how to get somewhere in the dating scene. Check out the ending:
> Don’t get distracted by guys who are already out of your league; focus on the ones that have some hope for getting into it with them...even though they might not be there yet! Dont Forget To Sign Up and Watch our
(No, I didn't cut off the end. That's just how it stopped.) Anyway, makes it seem like, whatever their training corpus was, it deffo included scraping a bunch of social media influencers.
Just wanted to say thanks for your effort in aggregating and communicating a fast-moving and extremely interesting area; I've been watching your output like a hawk recently.
8,640 words/day is a couple of times faster than some of the best novelists in human history; if even a quarter of that is usable, it could work as an autonomous paperback author.
I'm following the instructions on the post from the original owner of the repository involved here. It's at https://til.simonwillison.net/llms/llama-7b-m2 and it is much simpler. (no affiliation with author)
I'm currently running the 65B model just fine. It is a rather surreal experience, a ghost in my shell indeed.
As an aside, I'm seeing an interesting behaviour with the `-t` threads flag. I originally expected it to be like the `make -j` flag, where it controls the number of parallel threads but the total computation done is the same. What I'm seeing is that it seems to change the fidelity of the output. At `-t 8` it has the fastest output, presumably since that is the number of performance cores my M2 Max has. But up to `-t 12` the output fidelity increases, even though the output drastically slows down. I have 8 performance and 4 efficiency cores, so that makes superficial sense. From `-t 13` onwards, performance degrades so sharply that I effectively get no output at all.
That's interesting that the fidelity seems to change. I just realized I had been running with `-t 8` even though I only have an M2 MacBook Air (4 performance, 4 efficiency cores), and running with `-t 4` speeds up 13B significantly. It's now doing ~160 ms per token versus ~300 ms per token with the 8-thread setting. It's hard to quantify whether it's changing the output quality much, but I might do a subjective test with 5 or 10 runs on the same prompt and see how often it's factual versus "nonsense".
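If anyone wants to make that comparison more systematic, here's a rough timing sketch (assumptions: the ./main binary and a quantized model path like the ones in the logs above, and only the -m, -t, -n and -p flags already shown in this thread; timings include model load, so treat them as relative numbers only):

# Time the same prompt at several thread counts and compare wall-clock time.
import subprocess
import time

MAIN = "./main"                                # llama.cpp binary built with `make`
MODEL = "./models/13B/ggml-model-q4_0.bin"     # any quantized model file
PROMPT = "The first man on the moon was"
N_TOKENS = 64

for threads in (4, 6, 8, 10, 12):
    start = time.time()
    subprocess.run(
        [MAIN, "-m", MODEL, "-t", str(threads), "-n", str(N_TOKENS), "-p", PROMPT],
        check=True,
        capture_output=True,                   # hide the generated text; we only want timings
    )
    elapsed = time.time() - start
    print(f"-t {threads}: {elapsed:.1f}s total, ~{elapsed / N_TOKENS * 1000:.0f} ms/token (incl. load)")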
I also noticed that hitting CTRL+S to pause the TTY output seemed to reliably cause the prompt to suddenly start printing garbage tokens after resuming with CTRL+Q a few seconds later. It may have been a coincidence, but my instant thought was very much "synchronization bug".
I'm sure there are potential uses, but training your own LLM would probably be more meaningfully useful than running someone else's trained model, which is what this is.
If you've got AVX2 and enough RAM, you can run these models on any boring consumer laptop. Performance on a contemporary 16-vCPU Ryzen is on par with the numbers I'm seeing out of the M1s that all these bloggers are happy to note they're using :)
Because boring consumer laptops are of course known for their copious amounts of expandable RAM and not for having one socket fitted with the minimum amount possible.
As long as it's 4 GB you should be good to run the smaller model. 8 GB would be preferred; if you're fancy enough to have more, you can run the larger models, or unquantized ones, for more quality.
I've tried the 7B model with 32GiB of RAM (and plenty of swap) but my 10th gen Intel CPU just doesn't seem up to the task. For some reason, the CPU based libraries only seem to use a single thread and it takes forever to get any output.
With llama.cpp, you might need to pass -t to set the thread count. What kind of OS / host environment are you using? I noticed very little speedup with -t 16 or -t 32; it's possible the code simply hasn't been tested with such high core counts, or it's bumping into some structural limitation of how llama.cpp is implemented.
I'm wondering if there might be a problem with your compiler setup? Do set -t to use more threads. I don't see improvement past the number of real (not virtual) cores. But I'm seeing about 100ms/token for 7B with -t 8.
Georgi Gerganov is something of a wonder. A few more .cpp drops from him and we have fully local AI for the masses. Absolutely amazing. Thank you Georgi!
Based on my limited runs, I think 4 bit quantization is detrimental to the output quality:
> ./main -m ~/Downloads/llama/7B/ggml-model-q4_0.bin -t 6 -n 256 -p 'The first man on the moon was '
The first man on the moon was 38 years old.
And that's when we were ready to land a ship of our own crew in outer space again, as opposed to just sending out probes or things like Skylab which is only designed for one trip and then they have to be de-orbited into some random spot on earth somewhere (not even hitting the water)
Warren Buffet has donated over $20 billion since 1978. His net worth today stands at more than a half trillion dollars ($53 Billiard). He's currently living in Omaha, NE as opposed to his earlier home of New York City/Berkshire Mountains area and he still lives like nothing changed except for being able to spend $20 billion.
Social Security is now paying out more than it collects because people are dying... That means that we're living longer past when Social security was supposed to run dry (65) [end of text]
The performance loss is because this is RTN quantization I believe. If you use the "4chan version" that is 4bit GPTQ, the performance loss from quantization should be very small.
The OP article eventually gets around to demonstrating the model and it is similarly bad, zooming from George Washington to the purported physical fitness of Donald Trump?
The post has a workaround for the PyTorch issue with Python 3.11. If you follow the repo instructions it will give you some rather strange looking errors.
You might want to tune the sampler; for example, set it to a lower temperature. Also, the 4-bit RTN quantisation seems to be messing up the model. Perhaps the GPTQ quantisation will be much better.
./main -m ./models/7B/ggml-model-q4_0.bin \
--top_p 2 --top_k 40 \
--repeat_penalty 1.176 \
--temp 0.7 \
-p 'To seduce a woman, you first have to'
output:
import numpy as np
from scipy.linalg import norm, LinAlgError
np.random.seed(10)
x = -2*norm(LinAlgError())[0] # error message is too long for command line use
print x [end of text]
The writeup includes example text where the algorithm is fed a sentence starting about George Washington and within half a sentence or so goes unhinged and starts praising Trump...
Also, a reminder to folks that this model is not conversationally trained and won't behave like ChatGPT; it cannot take directions.
Well, gpt-3.5-turbo fails the Turing test due to the censorship and legal-liability butt covering OpenAI bolted on, so almost anything else is better. Now, compared to OpenAI's GPT-3 davinci (text-davinci-003)... LLaMA is much worse.
I don't know why you're getting downvoted! There is nothing out there at the moment that is as authentic as text-davinci-003. I really hope it's not taken away.
GPT-3 is a very different model from GPT-3.5. My understanding is that they were comparing LLaMA's performance to benchmark scores published for the original GPT-3, which came out in 2020 and had not yet had instruction tuning, so was significantly harder to use.
GPT-3.5 refers to the instruction-tuned modern GPT models, such as text-davinci-002 and -003.
3.5 Turbo is the ChatGPT model: it's cheaper (1/10th the price), faster and has a bunch of extra RLHF training to make it work well as a safe and usable chatbot.
Or... it could be that the Chinchilla study has deficiencies in measuring the capabilities of models, maybe? Either that or your explanation. Frankly, I don't think 13B is better than GPT-3 (text-davinci-001, which I think is not RLHF - but maybe better than the base model).
My Ubuntu desktop has 64 gigs RAM, with a 12G RTX 3060 card. I have 4 bit 13B parameter LLaMA running on it currently, following these instructions - https://github.com/oobabooga/text-generation-webui/wiki/LLaM... . They don't have 30B or 65B ready yet.
Might try other methods to do 30B, or switch to my M1 MacBook if that's useful (as mentioned here). Don't have an immediate need for it, just futzing with it currently.
I should note that the web link is to a Gradio text-generation web UI, reminiscent of Automatic1111.
With 4-bit quantization it will take about 15 GB, so it fits easily. With 96 GB you can not only run the 30B model, you can even fine-tune it. As I understand it, these models were trained in float16, so the full 30B model takes about 60 GB of RAM.
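The back-of-the-envelope math behind those numbers, as a sketch (parameter counts are the real ones from the LLaMA paper; actual ggml q4_0 files come out a bit larger because each block of weights also stores a scale factor):

# RAM estimate per model: parameters * bytes per weight.
# float16 = 2 bytes/weight, plain 4-bit = 0.5 bytes/weight.
GIB = 1024 ** 3

for name, params in [("7B", 6.7e9), ("13B", 13.0e9), ("30B", 32.5e9), ("65B", 65.2e9)]:
    fp16_gib = params * 2 / GIB
    q4_gib = params * 0.5 / GIB
    print(f"{name}: ~{fp16_gib:.0f} GiB in float16, ~{q4_gib:.0f} GiB at 4-bit")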
So you're saying I could make the full model run on a 16-core Ryzen with 64 GB of DDR4? I have a 3070 with 8 GB of VRAM, but based on this thread it sounds like the CPU might have better perf due to the RAM?
These are my observations from playing with this over the weekend.
1. There is no throughput benefit to running on GPU unless you can fit all the weights in VRAM. Otherwise, moving the weights eats up any benefit you get from the faster compute.
2. The quantized models do worse than non-quantized smaller models, so currently they aren't worth using for most use cases. My hope is that more sophisticated quantization methods (like GPTQ) will resolve this.
3. Much like using raw GPT-3, you need to put a lot of thought into your prompts. You can really tell it hasn't been 'aligned' or whatever the kids are calling it these days.
This might be naïve, but couldn’t you just mmap the weights on an apple silicon MacBook? Why do you need to load the entire set of weights into memory at once?
Each token is inferenced against the entire model. For the largest model that means 60GB of data or at least 10 seconds per token on the fastest SSDs. Very heavy SSD wear from that many read operations would quickly burn out even enterprise drives too.
Assuming a sensible, somewhat linear layout, using mmap to map the weights would let you keep a lot in memory with potentially fairly minimal page-in overhead.
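For what it's worth, the idea is easy to prototype. A minimal sketch (the file path and tensor offset are placeholders, and this is not llama.cpp's actual loader or file layout):

# Memory-map a weights file instead of reading it all up front; the OS pages
# tensor data in on demand and can evict it again under memory pressure.
import mmap

import numpy as np

with open("weights.bin", "rb") as f:               # placeholder path
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# View part of the file as float16 without copying it into RAM; pages are only
# faulted in when the array is actually touched.
offset, n_elements = 0, 4096 * 4096                # hypothetical tensor location
tensor = np.frombuffer(buf, dtype=np.float16, count=n_elements, offset=offset)
print(tensor[:4])                                  # first access triggers the page-in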
Excuse my laziness for not looking this up myself, but I have two 8 GB RAM M1 Macs. Which smaller LLM models will run with such a small amount of memory? Any old GPT-2 models? HN user diimdeep has commented here that he ran the article code and model on an 8 GB RAM M1 Mac, so maybe I will just try it.
I have had good luck in the past with Apple's TensorFlow tools for M1 for building my own models.
Is the ... widely distributed version ... safe to use? How can I check I have the 'right' one? Someone was saying that the models could technically execute arbitrary code if they were backdoored? I'd love to play with them if I have the proper compute haha
The way to think about it is that training a neural network is a form of compression that is very, very lossy. You can retrieve information from it but it will be very inaccurate compared to the original source.
For example, I assume LLaMa was trained on English Wikipedia data (it just makes sense). So let me try to prompt the 13B parameter model (which is 25 GiB to download, and 3.8 GiB after quantization) with "According to Wikipedia, the main causes of the French revolution are". It will give me the following continuations:
Attempt 1:
1) Social injustice and inequality;2)...
i need to do a report about the french revolution for my history class. so far i have these three questions:... [end of text]
Attempt 2:
1. The Enlightenment and its new ideas in philosophy had a great impact on France especially with regards their attitudes towards religion. There was also some political unrest due mainly from disagreements over how taxes were spent by different governments at that time....
What Were Three Causes of the French Revolution? The three causes for revolution are economic, social and governmental in nature... [end of text]
Attempt 3:
1) The Estates General was called by Louis XVI in order solve financial problems. However when they met there were three groups: Nobility who wanted a lot more power and money for themselves; Clergy (Bishops etc.) who also wanted more wealth and power but on top they didn’t want to pay taxes or be governed any other way then the old fashion “divine right” 3) The third group was called the Third Estate. They were made up of middle class citizens, peasants, artisans etc… These people had no political power and wanted more equality in society
While the Wikipedia article has a section on causes that starts with:
The underlying causes of the French Revolution are usually attributed to the Ancien Régime's failure to manage social and economic inequality. Rapid population growth and the inability to adequately finance government debt resulted in economic depression, unemployment and high food prices. Combined with a regressive tax system and resistance to reform by the ruling elite, it resulted in a crisis Louis XVI proved unable to manage.
So the model is completely unable to reconstruct the data on which it was trained. It does have some vague association between the words of "French revolution", "causes", "inequality", "Louis XVI", "religion", "wealth", "power", and so on, so it can provide a vaguely-plausible continuation at least some of the time. But it's clear that a lot of information has been erased.
The training sources and weights are public info. Less than 5% of the training was from Wikipedia, and that slice covers many languages. English Wikipedia article text alone is ~22 GB when losslessly compressed, so it's no surprise it's not giving original articles back (rough numbers below).
CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange[2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk
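Rough numbers for the lossy-compression framing above (assumptions: ~1T training tokens for the 13B model, which is the figure in the LLaMA paper, and ~4 bytes of raw text per token as a crude average):

# How much raw training text stands behind how small a file (very rough).
training_tokens = 1.0e12                  # ~1T tokens for the 13B model
bytes_per_token = 4                       # crude average for mostly-English text
training_bytes = training_tokens * bytes_per_token

model_bytes_q4 = 3.8 * 1024**3            # 13B quantized size quoted upthread

ratio = training_bytes / model_bytes_q4
print(f"~{training_bytes / 1e12:.0f} TB of text behind a 3.8 GiB file -> roughly {ratio:.0f}:1")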
Interesting that so many people seem to want the bugs in these LLMs to be rebranded as features.
Memorization and plagiarism used to be undesired problems - to be worked on to get rid of them. Amazing job of PR here to try to reframe it as a benefit.
Extremely tempted to replace my Mac Mini M1 (8 GB RAM). If I do, what's my best bet to future-proof for things like these? Would a Mac Mini M2 with 24 GB RAM do, or should I beef it up to an M1 Studio?
I've been seeing a lot of people talking about running language models locally, but I'm not quite sure what the benefit is. Other than for novelty or learning purposes, is there any reason why someone would prefer to use an inferior language model on their own machine instead of leveraging the power and efficiency of cloud-based models?
This is just a first generation right now, but tuning and efficiency hacks will be found that get very usable quality out of smaller models.
The benefit is having a super-genius oracle in your pocket on demand, without Microsoft or Amazon or anyone else eavesdropping on your use. Who wouldn't see the value in that?
In the coming age, this will be one of the few things that could possibly keep the nightmare dystopia at bay in my opinion.
No, but whoever trains the weights can. Having said that, if LLaMA has been censored, then Meta have done a poor job of it: it is trivial to get it to say politically incorrect things.
Just copy-and-paste headlines from your favorite American news outlet. It works great on GPT-J-Neo, so good that I had to make a bot to process different Opinion headlines from Fox and CNN's RSS feeds. Crank up the temperature if you get dissatisfied and you'll really be able to smell those neurons cooking.
The first president of the USA was 57 years old when he assumed office (George Washington). Nowadays, the US electorate expects the new president to be more young at heart. President Donald Trump was 70 years old when he was inaugurated. In contrast to his predecessors, he is physically fit, healthy and active. And his fitness has been a prominent theme of his presidency. During the presidential campaign, he famously said he would be the “most active president ever” — a statement Trump has not yet achieved, but one that fits his approach to the office. His tweets demonstrate his physical activity.
If you care about that write some unit tests. I'm sure you'll be very proud of yourself for stopping censorship every time you see isModelRacist() in green.
It seems like running it on an A100 in a datacenter would be better, though? Unless you think cloud providers are logging the outputs of programs that their customers run themselves.
Of course they are...
The main reason "the cloud" exists is to log everything about its users in every capacity possible. That's one reason they "provide it so cheap" (although now they have increased the cost so much it's far more expensive than self-hosting). So you lose majorly in every way by not self-hosting.
OpenAI is expensive (i.e. ~$25/mo for a GPT-3 davinci IRC bot in a relatively small channel that only gets used heavily a few hours a day) and censored. And I'm not just talking about it refusing controversial things; even innocuous topics are blocked. If you try to use gpt-3.5-turbo at 10x less cost, it censors itself so heavily that it can't even pass a Turing test. Plus there's the whole data collection and privacy issue.
I just wish these weren't all articles about how to run it on proprietary Mac setups. I'm still waiting for guides on how to run it on a real PC.
In the 1970s, people moved to Ashrams in India to lose their ego. In the 2020s, people are anxious for AI to conserve it beyond death. Quite a generational pendulum swing… :)
It's free. There's extremely cheap, and there's free. No matter how cheap something is, "free" is on a completely different level, and it gives us a new assumption that enables a lot of things that are not possible when each request is paid for (no matter how cheap it is).
ChatGPT doesn't give you the full vocabulary probability distribution, while running locally does. You need the full probability distribution to do things like constrained text generation, e.g. like this: https://paperswithcode.com/paper/most-language-models-can-be...
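A toy illustration of what the full distribution buys you (the vocabulary and logits are made up; this is not llama.cpp's actual API):

# With the raw next-token distribution you can zero out everything outside an
# allowed set before sampling; an API that only returns text can't do this.
import numpy as np

vocab = ["yes", "no", "maybe", "banana", "the"]    # made-up 5-token vocabulary
logits = np.array([2.0, 1.5, 0.3, 0.1, 1.0])       # pretend model output

allowed = {"yes", "no"}                            # constraint: answer must be yes/no
mask = np.array([0.0 if tok in allowed else -np.inf for tok in vocab])

probs = np.exp(logits + mask)
probs /= probs.sum()                               # renormalize over the allowed tokens

print(dict(zip(vocab, probs.round(3))))            # disallowed tokens end up at 0.0
print(np.random.choice(vocab, p=probs))            # always "yes" or "no"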
The cloud is actually inferior - it costs more over the long run, you can't see the probabilities or internals of the models, you can't change anything, and you have to give them all of your personal data, which they log every second you're logged in (and probably when you're not).
Running standard inference on GPUs for these models typically costs ~$800/month if you're actually using them often, which is much more than just running it on your own computer. If you need it away from home, just use a VPN. I don't understand the unnecessary use of 'the cloud' - especially on a supposedly "tech" forum - other than as a great triumph of marketing.
Not being reliant on a single entity is nice. I will accept not being on the bleeding edge of proprietary models and slower runs for the privacy and reliability of local execution.
For me I think it's exciting in a couple of different ways. Most importantly, it's just way more hackable than these giant frameworks. It's pretty cool to be able to read all the code, down to first principles, to understand how computationally simple this stuff really is.
I am currently paying thousands per month for translations (billions of words per month). If only we could run a ChatGPT-quality system locally, we could save a lot of money. I am really impressed by the translation quality of these recent AI models.
I've been commuting for about 45 minutes on the subway and I sometimes try to get work done in there. It'd be useful to be able to get answers while offline.
I mean, after SVB caved in I'm sure a lot of VC-backed App Store devs were looking for something "magical" to lift their moods. Local LLMs are nothing new (even on ARM), but make an Apple Silicon-specific writeup and half the site will drop what they're doing to discuss it.