Using LLaMA with M1 Mac and Python 3.11 (dev.l1x.be)
617 points by datadeft on March 12, 2023 | 165 comments



Remove "and Python 3.11" from title. Python used only for converting model to llama.cpp project format, 3.10 or whatever is fine.

Additionally, llama.cpp works fine on 10-year-old hardware that supports AVX2.

I'm running llama.cpp right now on an ancient 2013 Intel i5 MacBook with only 2 cores and 8 GB RAM - the 7B 4-bit model loads into 4.2 GB of RAM in 8 seconds and gives 600 ms per token.

btw: does anyone know how to disable swap per process in macOS? Even though there is enough free RAM, sometimes macOS decides to use swap instead.


> Remove "and Python 3.11" from title. Python used only for converting model to llama.cpp project format, 3.10 or whatever is fine.

As @rnosov notes elsewhere in the thread, this post has a workaround for the PyTorch issue with Python 3.11, which is why the "and Python 3.11" qualification is there.


Do you know if there's a good reason to favor 3.11 over 3.10 for this use case?


I'm a Python neophyte, but I've read that Python 3.11 is 10-60% faster than 3.10, so that may be a consideration.


In this particular case that doesn't matter, because the only time you run Python is for a one-off conversion against the model files.

That takes at most a minute to run, but once converted you'll never need to run it again. Actual llama.cpp model inference uses compiled C++ code with no Python involved at all.
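
For what it's worth, the only Python step is a single conversion command, roughly like this (script name and arguments are from around the time of this thread, so treat it as a sketch and check the repo's README):

    # one-off: convert the original PyTorch checkpoint to ggml FP16
    python3 convert-pth-to-ggml.py models/7B/ 1

Everything after that (4-bit quantization and inference) is the compiled C++ binaries.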


The real question is: which python3 version does current macOS ship with?

Well, on my macOS Ventura 13.2.1 install, /usr/bin/python3 is Python 3.9.6, which may be too old?

But also, the custom python3 I installed via Homebrew is not 3.11 either. My /opt/homebrew/bin/python3 is Python 3.10.9

MacBook Pro M1


brew install python@3.11



I am able to get 65B running on a MacBook Pro 14.2 with 64 GB of RAM. https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...


Can you provide a link to the guide or steps you followed to get this up and running? I have a physical Linux machine with 300+ GB of RAM and would love to try out LLaMA on it, but I'm not sure where to get started with such a configuration.

Edit: Thank you, @diimdeep!


Sure. You can get the models via magnet link from here: https://github.com/shawwn/llama-dl/

To get running, just follow these steps https://github.com/ggerganov/llama.cpp/#usage
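
In short, the usage section boils down to something like this (a rough sketch; the exact convert/quantize invocations may have changed since, so defer to the README):

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make

    # put the downloaded LLaMA weights under ./models, then convert and quantize
    python3 convert-pth-to-ggml.py models/7B/ 1
    ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

    # run inference
    ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p 'The first man on the moon was '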


Is it legal to post that here?


Better instructions (less verbose, and they include the 30B model): https://til.simonwillison.net/llms/llama-7b-m2

I’m running 13B on MacBook Air M2 quite easily. Will try 30B but probably won’t be able to keep my browser open :/

shameless plug: https://mobile.twitter.com/tomprimozic/status/16348774773100...


I have 65B running alright. But you'll have to lower your expectations if you are used to ChatGPT.

https://twitter.com/LalwaniVikas/status/1634644323890282498


Give us an update on the 30B model! I have 13B running easily on my M2 Air (24GB ram), just waiting until I'm on an unmetered connection to download the 30B model and give it a go.


I am running the 30B model on my M1 Mac Studio with 32 GB of RAM.

  (venv) bherman@Rattata ~/llama.cpp$ ./main -m ./models/30B/ggml-model-q4_0.bin -t 8 -n 128
  main: seed = 1678666507
  llama_model_load: loading model from './models/30B/ggml-model-q4_0.bin' - please wait ...
  llama_model_load: n_vocab = 32000
  llama_model_load: n_ctx   = 512
  llama_model_load: n_embd  = 6656
  llama_model_load: n_mult  = 256
  llama_model_load: n_head  = 52
  llama_model_load: n_layer = 60
  llama_model_load: n_rot   = 128
  llama_model_load: f16     = 2
  llama_model_load: n_ff    = 17920
  llama_model_load: n_parts = 4
  llama_model_load: ggml ctx size = 20951.50 MB
  llama_model_load: memory_size =  1560.00 MB, n_mem = 30720
  llama_model_load: loading model part 1/4 from './models/30B/ggml-model-q4_0.bin'
  llama_model_load: ................................................................... done
  llama_model_load: model size =  4850.14 MB / num tensors = 543
  llama_model_load: loading model part 2/4 from './models/30B/ggml-model-q4_0.bin.1'
  llama_model_load: ................................................................... done
  llama_model_load: model size =  4850.14 MB / num tensors = 543
  llama_model_load: loading model part 3/4 from './models/30B/ggml-model-q4_0.bin.2'
  llama_model_load: ................................................................... done
  llama_model_load: model size =  4850.14 MB / num tensors = 543
  llama_model_load: loading model part 4/4 from './models/30B/ggml-model-q4_0.bin.3'
  llama_model_load: ................................................................... done
  llama_model_load: model size =  4850.14 MB / num tensors = 543

  main: prompt: 'When'
  main: number of tokens in prompt = 2
       1 -> ''
   10401 -> 'When'
  
  sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000
  
  
  When you need the help of an Auto Locksmith Kirtlington look no further than our team of experts who are always on call 24 hours a day 365 days a year.
  We have a team of auto locksmiths on call in Kirtlington 24 hours a day 365 days a year to help with any auto locksmith emergency you may find yourself in, whether it be repairing an broken omega lock, reprogramming your car transponder keys, replacing a lacking vehicle key or limiting chipped car fobs, our team of auto lock
  
  main: mem per token = 43387780 bytes
  main:     load time = 35493.44 ms
  main:   sample time =   281.98 ms
  main:  predict time = 34094.89 ms / 264.30 ms per token
  main:    total time = 74651.21 ms


hm... well...

It definitely runs. It uses almost 20GB of RAM so I had to exit my browser and VS Code to keep the memory usage down.

But it produces completely garbled output. Either there's a bug in the program, or the tokens are different from the 13B model's, or I performed the conversion wrong, or the 4-bit quantization breaks it.


I’m also getting garbage out of 30B and 65B.

30B just says “dotnetdotnetdotnet…”


I've finally managed to download the model and it seems to be working well for me. There have been some updates to the quantization code, so maybe if you do a 'git pull && make' and rerun the quantization script it will work for you. I'm getting about 350ms per token with the 30B model.


Thanks for reminding me! It works now. The difference is striking!


Neat - this uses the following to get a version of Torch that works with Python 3.11:

    pip3 install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
That's the reason I stuck with Python 3.10 in my write-up for doing this: https://til.simonwillison.net/llms/llama-7b-m2


You don’t actually need torchvision. I just used

  mamba create -n llama python==3.10 pytorch sentencepiece


We have a mamba as well?


It’s a faster conda. Strongly recommended


Artem Andreenko on Twitter reports getting the 7B model running on a 4GB Raspberry Pi! One token every ten seconds, but still, wow. https://twitter.com/miolini/status/1634982361757790209


I ran the 7b model on a prompt about how to get somewhere in the dating scene. Check out the ending:

> Don’t get distracted by guys who are already out of your league; focus on the ones that have some hope for getting into it with them...even though they might not be there yet! Dont Forget To Sign Up and Watch our

(No, I didn't cut off the end. That's just how it stopped.) Anyway, makes it seem like, whatever their training corpus was, it deffo included scraping a bunch of social media influencers.


just wanted to say thanks for your effort in aggregating and communicating a fast moving and extremely interesting area, have been watching your output like a hawk recently


8640 words/day is a couple of times faster than some of the best novelists in human history; if even a quarter of that is usable, it could work as an autonomous paperback author.


I'm following the instructions on the post from the original owner of the repository involved here. It's at https://til.simonwillison.net/llms/llama-7b-m2 and it is much simpler. (no affiliation with author)

I'm currently running the 65B model just fine. It is a rather surreal experience, a ghost in my shell indeed.

As an aside, I'm seeing an interesting behaviour on the `-t` threads flag. I originally expected that this was similar to `make -j` flag where it controls the number of parallel threads but the total computation done would be the same. What I'm seeing is that this seems to change the fidelity of the output. At `-t 8` it has the fastest output presumably since that is the number of performance cores my M2 Max has. But up to `-t 12` the output fidelity increases, even though the output drastically slows down. I have 8 perf and 4 efficiency cores, so that makes superficial sense. At `-t 13` onwards, the performance exponentially decreases to the point that I effectively no longer have output.


That's interesting that the fidelity seems to change. I just realized I had been running with `-t 8` even though I only have an M2 MacBook Air (4 perf, 4 efficiency cores), and running with `-t 4` speeds up 13B significantly. It's now doing ~160ms per token versus ~300ms per token with the 8-thread setting. It's hard to quantify exactly if it's changing the output quality much, but I might do a subjective test with 5 or 10 runs on the same prompt and see how often it's factual versus "nonsense".
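
A quick loop like this (model path and prompt are just placeholders for whatever you're testing with) makes the speed comparison easy, since llama.cpp prints a ms-per-token figure in its timing summary:

    for t in 2 4 6 8; do
      echo "threads: $t"
      ./main -m ./models/13B/ggml-model-q4_0.bin -t $t -n 64 \
        -p 'The capital of France is' 2>&1 | grep 'per token'
    done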


I also noticed that hitting CTRL+S to pause the TTY output seemed to cause a previously reliable prompt to suddenly start printing garbage tokens after resuming with CTRL+Q a few seconds later. It may have been a coincidence, but my instant thought was very much "synchronization bug".


Don't you hate it when someone interrupts your train of thought.


What do you use it for, out of curiosity? Can it do shell autocompletes (this is what “ghost in the shell” made me think of, haha).


Nothing. It's technology for the love of it.

I'm sure there are potential uses but training your own LLM would probably be more meaningfully useful versus running someone else's trained model, which is what this is.


If you've got AVX2 and enough RAM you can run these models on any boring consumer laptop. Performance on a contemporary 16 vCPU Ryzen is on par with the numbers I'm seeing out of the M1s that all these bloggers are happy to note they're using :)


> any boring consumer laptop

> enough RAM

Because boring consumer laptops are of course known for their copious amounts of expandable RAM and not for having one socket fitted with the minimum amount possible.


As long as it's 4 GB you should be good to run the smaller model. 8 GB would be preferred; if you're fancy enough to have more, you can run the larger models, or skip quantization entirely, for more quality.


I've tried the 7B model with 32GiB of RAM (and plenty of swap) but my 10th gen Intel CPU just doesn't seem up to the task. For some reason, the CPU based libraries only seem to use a single thread and it takes forever to get any output.


With llama.cpp, you might need to pass in -t to set the thread count. What kind of OS / host environment are you using? I noticed very little speedup using -t 16 and -t 32; it's possible the code simply hasn't been tested with such high core counts, or it's bumping into some structural limitation of how llama.cpp is implemented.


I'm wondering if there might be a problem with your compiler setup? Do set -t to use more threads. I don't see improvement past the number of real (not virtual) cores. But I'm seeing about 100ms/token for 7B with -t 8.
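
A couple of quick sanity checks on Linux (nothing llama.cpp-specific; the model path is just whatever you converted):

    grep -m1 -o avx2 /proc/cpuinfo    # confirm the CPU actually reports AVX2
    make clean && make                # rebuild so the Makefile's native flags apply
    ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 64 -p 'Hello'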


Georgi Gerganov is something of a wonder. A few more .cpp drops from him and we have fully local AI for the masses. Absolutely amazing. Thank you Georgi!


I want to know what his compute setup looks like.


Based on my limited runs, I think 4 bit quantization is detrimental to the output quality:

    > /main -m ~/Downloads/llama/7B/ggml-model-q4_0.bin -t 6 -n 256 -p 'The first man on the moon was ' 

    The first man on the moon was 38 years old.

    And that's when we were ready to land a ship of our own crew in outer space again, as opposed to just sending out probes or things like Skylab which is only designed for one trip and then they have to be de-orbited into some random spot on earth somewhere (not even hitting the water)

    Warren Buffet has donated over $20 billion since 1978. His net worth today stands at more than a half trillion dollars ($53 Billiard). He's currently living in Omaha, NE as opposed to his earlier home of New York City/Berkshire Mountains area and he still lives like nothing changed except for being able to spend $20 billion.

    Social Security is now paying out more than it collects because people are dying... That means that we're living longer past when Social security was supposed to run dry (65) [end of text]


Yes, I have found 65B quantized to be more nonsensical than 13B unquantized.


The performance loss is because this is RTN quantization, I believe. If you use the "4chan version", which is 4-bit GPTQ, the performance loss from quantization should be very small.


What's the 4chan version?


See https://github.com/ggerganov/llama.cpp/issues/62 (the related repo was originally posted on 4chan, is all, but the code is on GitHub)



The OP article eventually gets around to demonstrating the model and it is similarly bad, zooming from George Washington to the purported physical fitness of Donald Trump?


How is this post any different than the instructions on the actual repo? https://github.com/ggerganov/llama.cpp


The post has a workaround for the PyTorch issue with Python 3.11. If you follow the repo instructions it will give you some rather strange looking errors.


How does LLaMA compare to GPT-3.5? Has anyone done side-by-side comparisons?


In short, my experience is that it's much worse, to the point that I won't use LLaMA.

The main change needed seems to be InstructGPT style tuning (https://openai.com/research/instruction-following)


Yeah, it's MUCH harder to use because of the lack of tuning.

You have to lean on much older prompt engineering tricks - there are a few initial tips in the LLaMA FAQ here: https://github.com/facebookresearch/llama/blob/main/FAQ.md#2...


Are you getting useful content out of the 7B model? It goes off the rails way too often for me to find it useful.


You might want to tune the sampler - for example, set it to a lower temperature. Also, the 4-bit RTN quantisation seems to be messing up the model. Perhaps the GPTQ quantisation will be much better.


Use `--top_p 2 --top_k 40 --repeat_penalty 1.176 --temp 0.7` with llama.cpp


Not bad with these settings:

    ./main -m ./models/7B/ggml-model-q4_0.bin \
    --top_p 2 --top_k 40 \
    --repeat_penalty 1.176 \
    --temp 0.7 \
    -p 'async fn download_url(url: &str)'


    async fn download_url(url: &str) -> io::Result<String> {
      let url = URL(string_value=url);
      if let Some(err) = url.verify() {} // nope, just skip the downloading part
      else match err == None {  // works now
        true => Ok(String::from(match url.open("get")?{
            |res| res.ok().expect_str(&url)?,
            |err: io::Error| Err(io::ErrorKind(uint16_t::MAX as u8))),
            false => Err(io::Error


lol,

    ./main -m ./models/7B/ggml-model-q4_0.bin \
    --top_p 2 --top_k 40 \
    --repeat_penalty 1.176 \
    --temp 0.7 \
    -p 'To seduce a woman, you first have to'
output:

    import numpy as np
    from scipy.linalg import norm, LinAlgError
    np.random.seed(10)
    x = -2\*norm(LinAlgError())[0]  # error message is too long for command line use
    print x [end of text]


What fork are you using?

repeat_penalty is not an option.



It's a new feature :) Pull latest from master.


Have you tried using the original repo?


GPT-3.5 probably has substantially lower cross-entropy loss before instruction fine-tuning as well.

Meta’s LLaMA v. GPT-3 comparisons are to OpenAI’s 2020 release of GPT-3, but there’s been significant progress since then.


The sampler needs some tuning, but the 65B model has impressive output

https://twitter.com/theshawwn/status/1632569215348531201


The writeup includes example text where the algorithm is fed the start of a sentence about George Washington and within half a sentence or so goes unhinged and starts praising Trump...

Also, a reminder to folks that this model is not conversationally trained and won't behave like ChatGPT; it cannot take directions.


The original LLaMA paper has some benchmarks. https://arxiv.org/pdf/2302.13971.pdf


Well, gpt-3.5-turbo fails the Turing test due to the censorship and legal-liability butt-covering OpenAI bolted on, so almost anything else is better. Now, compared to OpenAI's GPT-3 davinci (text-davinci-003) ... LLaMA is much worse.


I don't know why you're getting downvoted! There is nothing out there at the moment that is as authentic as text-davinci-003. I really hope it's not taken away.


I thought LLaMA outscored GPT-3


GPT-3 is a very different model from GPT-3.5. My understanding is that they were comparing LLaMA's performance to benchmark scores published for the original GPT-3, which came out in 2020 and had not yet had instruction tuning, so was significantly harder to use.


I know, that is why I said GPT-3 (Davinci) not GPT-3.5|ChatGPT.


Da Vinci 002 and 003 are actually classified as GPT 3.5 by OpenAI: https://platform.openai.com/docs/models/gpt-3-5

ChatGPT is GPT-3.5 Turbo.


Would you mind summarizing the difference between GPT 3.5 and GPT-3.5 Turbo? I'm not clear about that.


GPT 3.5 is the instruction tuned modern GPT models, such as Da Vinci 002 and 003.

3.5 Turbo is the ChatGPT model: it's cheaper (1/10th the price), faster and has a bunch of extra RLHF training to make it work well as a safe and usable chatbot.

https://openai.com/blog/introducing-chatgpt-and-whisper-apis introduced the turbo model.


Hard to measure these days. The training sets are so large they might contain leaks of test sets. Take these numbers with a grain of salt.


Or... it could be that the Chinchilla study has deficiencies in measuring model capabilities? Either that or your explanation. Frankly, I don't think 13B is better than GPT-3 (text-davinci-001, which I think is not RLHF - but maybe better than the base model).


text-davinci-001 is currently classed as "GPT 3.5" by OpenAI, and it did indeed have RLHF in the form of instruction tuning: https://openai.com/research/instruction-following

MY MISTAKE: 002 and 003 are 3.5, but 001 looks to have pre-dated the InstructGPT work.


My Ubuntu desktop has 64 gigs of RAM, with a 12GB RTX 3060 card. I have 4-bit 13B-parameter LLaMA running on it currently, following these instructions - https://github.com/oobabooga/text-generation-webui/wiki/LLaM... . They don't have 30B or 65B ready yet.

Might try other methods to do 30B, or switch to my M1 Macbook if that's useful (as it said here). Don't have an immediate need for it, just futzing with it currently.

I should note that web link is to software for a gradio text generation web UI, reminiscent of Automatic1111.


By 30b and 65b, do they mean models with 30 or 65 million parameters?


Billion


If anyone is interested in running it on windows and a GPU (30b 4bit fits in a 3090), here's a guide: https://gist.github.com/lxe/82eb87db25fdb75b92fa18a6d494ee3c


Nice, why not add it to the wiki?


What wiki?


What are some prompts that seem to be working on non-finetuned models? (beyond what is listed in example.py)


We need a Fabrice Bellard-like genius to make a tinyLLM that makes the decent models work on 32 GB of RAM.



I can't wait to get my 96GB M2 I ordered last week.

Maybe it could even run the 30b model?


Should be able to run 65B, people got it running on 64GB.

https://twitter.com/lawrencecchen/status/1634507648824676353


With 4-bit quantization it will take about 15 GB, so it fits easily. On 96 GB you can not only run the 30B model, you can even finetune it. As I understand it, these models were trained in float16, so the full 30B model takes 60 GB of RAM (30B parameters × 2 bytes each; at 4 bits that's ~0.5 bytes per parameter, hence ~15 GB).


So you're saying I could make the full model run on a 16-core Ryzen with 64GB of DDR4? I have an 8GB VRAM 3070, but based on this thread it sounds like the CPU might have better perf due to the RAM?


These are my observations from playing with this over the weekend.

1. There is no throughput benefit to running on GPU unless you can fit all the weights in VRAM. Otherwise, moving the weights eats up any benefit you get from the faster compute.

2. The quantized models do worse than non-quantized smaller models, so currently they aren't worth using for most use cases. My hope is that more sophisticated quantization methods (like GPTQ) will resolve this.

3. Much like using raw GPT-3, you need to put a lot of thought into your prompts. You can really tell it hasn't been 'aligned' or whatever the kids are calling it these days.


I have the 65B model running fine on my 48GB Ryzen 5.


This might be naïve, but couldn’t you just mmap the weights on an apple silicon MacBook? Why do you need to load the entire set of weights into memory at once?


Each token is inferenced against the entire model. For the largest model that means 60GB of data or at least 10 seconds per token on the fastest SSDs. Very heavy SSD wear from that many read operations would quickly burn out even enterprise drives too.


SSDs don’t wear from reading, only from writing.

Assuming a sensible, somewhat linear layout, using mmap to map the weights would let you keep a lot in memory, with potentially fairly minimal page-in overhead.


Excuse my laziness for not looking this up myself, but I have two 8G RAM M1 Macs. Which smaller LLM models will run with such a small amount of memory? any old GPT-2 models? HN user diimdeep has commented here that he ran the article code and model on a 8G RAM M1 Mac, so maybe I will just try it.

I have had good luck in the past with Apple's TensorFlow tools for M1 for building my own models.


The LLaMA 7B model will run fine on an 8GB M1.


Rather remarkably, Llama 7B and 13B run fine (about as fast as ChatGPT/bing) if you follow the instructions in the posted article.


Anyone got the 65B model to work with llama.cpp? 7B worked fine for me, but 30B and 65B just output garbage.

(On Linux with a 5800X and 64GB of RAM)


If you recently pulled, rerun make and regenerate the quantized files, because some new commits broke backwards compatibility.
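
That is, something along these lines (quantize arguments as of the version current at the time; adjust if the CLI has changed):

    git pull
    make clean && make
    # re-quantize each converted part (30B ships as four parts: .bin, .bin.1, .bin.2, .bin.3)
    ./quantize ./models/30B/ggml-model-f16.bin   ./models/30B/ggml-model-q4_0.bin   2
    ./quantize ./models/30B/ggml-model-f16.bin.1 ./models/30B/ggml-model-q4_0.bin.1 2
    # ...and likewise for .2 and .3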


Thanks for the tip. I tried 404fac0, but got:

"The search for extraterrestrial life will most likely conclude adv provinß wojewłożGener Wikipédia Świirc Patrickvidcido protectsobDra"


Is the ... widely distributed version ... safe to use? How can I check I have the 'right' one? Someone was saying that the models could technically execute arbitrary code if they were backdoored? I'd love to play with them if I have the proper compute haha


It's downloaded over a torrent; seems fine.


Why not apply for download at Meta?


oh i thought it was for like EDU only not just people who want to mess around privately with friends


How much disk space does it use?


240GB for the initial model download, but once you convert the models they are 4GB for the 7B one and 8GB for the 13B one (and more for the others).


This may be a dumb question, but how is this possible?

How can it have all of this information packed into 4GB? I can't even imagine it being only 240GB.

These models have an unthinkable amount of information living in them.


The way to think about it is that training a neural network is a form of compression that is very, very lossy. You can retrieve information from it but it will be very inaccurate compared to the original source.

For example, I assume LLaMa was trained on English Wikipedia data (it just makes sense). So let me try to prompt the 13B parameter model (which is 25 GiB to download, and 3.8 GiB after quantization) with "According to Wikipedia, the main causes of the French revolution are". It will give me the following continuations:

Attempt 1:

    1) Social injustice and inequality;2)...
    i need to do a report about the french revolution for my history class. so far i have these three questions:... [end of text]
Attempt 2:

    1. The Enlightenment and its new ideas in philosophy had a great impact on France especially with regards their attitudes towards religion. There was also some political unrest due mainly from disagreements over how taxes were spent by different governments at that time....
   What Were Three Causes of the French Revolution? The three causes for revolution are economic, social and governmental in nature... [end of text]
Attempt 3:

    1) The Estates General was called by Louis XVI in order solve financial problems. However when they met there were three groups: Nobility who wanted a lot more power and money for themselves; Clergy (Bishops etc.) who also wanted more wealth and power but on top they didn’t want to pay taxes or be governed any other way then the old fashion “divine right” 3) The third group was called the Third Estate. They were made up of middle class citizens, peasants, artisans etc… These people had no political power and wanted more equality in society
While the Wikipedia article has a section on causes that starts with:

    The underlying causes of the French Revolution are usually attributed to the Ancien Régime's failure to manage social and economic inequality. Rapid population growth and the inability to adequately finance government debt resulted in economic depression, unemployment and high food prices. Combined with a regressive tax system and resistance to reform by the ruling elite, it resulted in a crisis Louis XVI proved unable to manage.
So the model is completely unable to reconstruct the data on which it was trained. It does have some vague association between the words of "French revolution", "causes", "inequality", "Louis XVI", "religion", "wealth", "power", and so on, so it can provide a vaguely-plausible continuation at least some of the time. But it's clear that a lot of information has been erased.


The training sources and weights are public info. Less than 5% of the training was from Wikipedia, and of that it covers many languages. English Wikipedia article text alone is ~22 GB when losslessly compressed, so it's no surprise it's not giving original articles back.

  CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange[2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk


Interesting that so many people seem to want the bugs in these LLMs to be rebranded as features. Memorization and plagiarism used to be undesired problems - to be worked on to get rid of them. Amazing job of PR here to try to reframe it as a benefit.


Extremely tempted to replace my Mac Mini M1 (8GB RAM). If I do, what's my best bet to future proof for things like these? Would a Mac Mini M2 with 24GB RAM do or should I beef it up to a M1 Studio?


RAM is king as far as future proofing Apple Silicon.

Even a 128GB RAM M1 Ultra can’t run 65B unquantized.


Best future proof might be to wait two months and get an M2 Mac Pro.


You could pay 300€ for 128GB on a PC. And add one or two GPUs later.


What's with the weird "2023/12/08" date?


Maybe it's a hot tip from the future?

Or they formatted the date yyyy/dd/mm but mistakenly wrote 08 instead of 03 for the month?


Tip from the future works for me due to the news about new DeLorean[1]

[1] https://news.ycombinator.com/item?id=35116319


Sorry to do this but 2+0+1+2+0+8 = 23


Yes, title should include (future).


It was a typo. I fixed the URL. Thanks for pointing it out.


Not sure why it mentions Mac and Python

I just ran it on a plain old x86 server with 64 cores and loads of RAM.

Works just fine. Apple H/W and Python version are completely irrelevant.


Because I could not test it on anything else.


AFAIK torch doesn’t work on 3.11 yet. It was not trivial to install on current Fedora. Might have changed.


Yes it does. The blog post has the proof. You can use the nightly builds.


I've been seeing a lot of people talking about running language models locally, but I'm not quite sure what the benefit is. Other than for novelty or learning purposes, is there any reason why someone would prefer to use an inferior language model on their own machine instead of leveraging the power and efficiency of cloud-based models?


This is just a first generation right now, but the tuning and efficiency hacks will be found that gets a very usable quality out of smaller models.

The benefit is having a super-genius oracle in your pocket on-demand, without Microsoft or Amazon or anyone else eavesdropping on your use. Who wouldn't see the value in that?

In the coming age, this will be one of the few things that could possibly keep the nightmare dystopia at bay in my opinion.


Yeah, locally your data doesn't leak out. So if you're using the language model on any sensitive data you're probably going to want local.


Also so the maintainer can't stealth-censor the model.


No, but whoever trains the weights can. Having said that, if LLaMA has been censored, then Meta have done a poor job of it: it is trivial to get it to say politically incorrect things.


Models can be extended, so if someone wants, they can add all the censored stuff back. Sooner or later someone will make a civitai for LLMs.


Can I prompt you to share some examples? ;)


Just copy-and-paste headlines from your favorite American news outlet. It works great on GPT-J-Neo, so good that I had to make a bot to process different Opinion headlines from Fox and CNN's RSS feeds. Crank up the temperature if you get dissatisfied and you'll really be able to smell those neurons cooking.


No, because I'd get banned.

I gave it a prompt containing the n-word and it actually ignored it, but started talking about the Jews in terms that would make 4channers blush.


See example in TFA.


  The first president of the USA was 57 years old when he assumed office (George Washington). Nowadays, the US electorate expects the new president to be more young at heart. President Donald Trump was 70 years old when he was inaugurated. In contrast to his predecessors, he is physically fit, healthy and active. And his fitness has been a prominent theme of his presidency. During the presidential campaign, he famously said he would be the “most active president ever” — a statement Trump has not yet achieved, but one that fits his approach to the office. His tweets demonstrate his physical activity.

Eh? Which bit is politically incorrect?


Saying that youthfulness is necessary to win the presidency, then somehow fitting Trump into that over Washington, who rode into battle as president.


If you care about that write some unit tests. I'm sure you'll be very proud of yourself for stopping censorship every time you see isModelRacist() in green.


It seems like running it on an A100 in a datacenter would be better, though? Unless you think cloud providers are logging the outputs of programs that their customers run themselves.


Of course they are... The main reason "the cloud" exists is to log everything about its users in every capacity possible. That's one reason they "provide it so cheap" (although now they have increased the cost so much it's far more expensive than self hosting). So you lose majorly in every way by not self hosting.


OpenAI is expensive (i.e., ~$25/mo for a GPT-3 davinci IRC bot in a relatively small channel that only gets used heavily a few hours a day) and censored. And I'm not just talking won't respond to controversial things. Even innocuous topics are blocked. If you try to use gpt-3.5-turbo for 10x less cost, it censors itself so heavily that it can't even pass a Turing test. Plus there's the whole data collection and privacy issue.

I just wish these weren't all articles about how to run it on proprietary mac setups. I'm still waiting for the guides on how to run it on a real PC.


https://github.com/oobabooga/text-generation-webui/wiki/LLaM... Here you go. You'll need a Nvidia GPU with at least 8 GB VRAM though.


Thanks! I have a Linux laptop with 16 GB of RAM and a 10GB Nvidia 1070, so I might be good to go.


Any examples on running on multiple GPUs?


The actual repo's instructions work perfectly without modification under Linux and WSL2:

https://github.com/ggerganov/llama.cpp


Thanks! I managed to get this running on CPU/system RAM on Debian 11 pretty easily.


> I'm still waiting for the guides on how to run it on a real PC.

Mac __is__ a real PC.


The exact steps that work on a Mac should work on x64 Linux, since the addition of AVX2 support! (Source - I did it last night)


I wish to keep all my chat logs, forever. This will be my alter ego that will even survive me. It must be private and not on someone else's computer.

But more importantly I want it uncensored. These tools are useful for deep conversation, which hasn't existed online for many years now.


In the 1970s, people moved to Ashrams in India to lose their ego. In the 2020s, people are anxious for AI to conserve it beyond death. Quite a generational pendulum swing… :)


They took their notebooks with them. That's why we need private models


To summarize other answers: (1) free (2) private (3) censorship/ethics-free (4) customizable (5) doesn't require Internet


It's free. There's extremely cheap, and there's free. No matter how extremely cheap something is, "free" is on a completely different level and gives us a new assumption that enables a lot of things that are not possible when each request is paid for (no matter how cheap it is).


You do have to pay for electricity which can be significant when you have multiple GPUs


ChatGPT doesn't give you the full vocabulary probability distribution, while locally running does. You need the full probability distribution to do things like constrained text generation, e.g. like this: https://paperswithcode.com/paper/most-language-models-can-be...


The cloud is actually inferior - it costs more over the long run, you can't see the probabilities or internals of the models, you can't change anything, and you have to give them all of your personal data, which they log every second that you're logged in (and probably when you're not). Running standard inference on GPUs for these models typically runs ~$800/month if you're actually using them often, which is much more than just running it on your own computer. If you need it away from home, just use a VPN. I don't understand the unnecessary use of 'the cloud' - especially in a supposedly "tech" forum - other than as a great triumph of marketing.


Inferior? "Cloud-based models"?

Not being reliant on a single entity is nice. I will accept not being on the bleeding edge of proprietary models and slower runs for the privacy and reliability of local execution.


For me I think it's exciting in a couple of different ways. Most importantly, it's just way more hackable than these giant frameworks. It's pretty cool to be able to read all the code, down to first principles, to understand how computationally simple this stuff really is.


I am currently paying thousands per month for translations (billions of words per month). If only we could run a ChatGPT-quality system locally, we could save a lot of money. I am really impressed by the translation quality of these recent AI models.


I've been commuting for about 45 minutes on the subway and I sometimes try to get work done in there. It'd be useful to be able to get answers while offline.


I mean, after SVB caved in, I'm sure a lot of VC-backed App Store devs were looking for something "magical" to lift their moods. Local LLMs are nothing new (even on ARM), but make an Apple Silicon-specific writeup and half the site will drop what they're doing to discuss it.


Ah... I was missing the -t 8 part... no wonder it was so painfully slow when I tried this.


What’s with the propaganda output from the example LLaMA prompt?


Why not just have a script and just say run this?



