Remove "and Python 3.11" from title. Python used only for converting model to llama.cpp project format, 3.10 or whatever is fine.
Additionally, llama.cpp works fine on 10-year-old hardware that supports AVX2.
I'm running llama.cpp right now on an ancient 2013 Intel i5 MacBook with only 2 cores and 8 GB RAM: the 7B 4-bit model loads in 8 seconds into 4.2 GB of RAM and runs at about 600 ms per token.
btw: does anyone know how to disable swap per process on macOS? Even though there is enough free RAM, sometimes macOS decides to use swap instead.
> Remove "and Python 3.11" from title. Python used only for converting model to llama.cpp project format, 3.10 or whatever is fine.
As @rnosov notes elsewhere in the thread, this post has a workaround for the PyTorch issue with Python 3.11, which is why the "and Python 3.11" qualification is there.
In this particular case that doesn't matter, because the only time you run Python is for a one-off conversion against the model files.
That takes at most a minute to run, but once converted you'll never need to run it again. Actual llama.cpp model inference uses compiled C++ code with no Python involved at all.
Can you provide a link to the guide or steps you followed to get this up and running? I have a physical Linux machine with 300+ GB of RAM and would love to try out LLaMA on it, but I'm not sure where to get started to get it working with such a configuration.
Give us an update on the 30B model! I have 13B running easily on my M2 Air (24GB ram), just waiting until I'm on an unmetered connection to download the 30B model and give it a go.
I am running the 30B model on my M1 Mac Studio with 32 GB of RAM.
(venv) bherman@Rattata ~/llama.cpp$ ./main -m ./models/30B/ggml-model-q4_0.bin -t 8 -n 128
main: seed = 1678666507
llama_model_load: loading model from './models/30B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: ggml ctx size = 20951.50 MB
llama_model_load: memory_size = 1560.00 MB, n_mem = 30720
llama_model_load: loading model part 1/4 from './models/30B/ggml-model-q4_0.bin'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 2/4 from './models/30B/ggml-model-q4_0.bin.1'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 3/4 from './models/30B/ggml-model-q4_0.bin.2'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 4/4 from './models/30B/ggml-model-q4_0.bin.3'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
main: prompt: 'When'
main: number of tokens in prompt = 2
1 -> ''
10401 -> 'When'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000
When you need the help of an Auto Locksmith Kirtlington look no further than our team of experts who are always on call 24 hours a day 365 days a year.
We have a team of auto locksmiths on call in Kirtlington 24 hours a day 365 days a year to help with any auto locksmith emergency you may find yourself in, whether it be repairing an broken omega lock, reprogramming your car transponder keys, replacing a lacking vehicle key or limiting chipped car fobs, our team of auto lock
main: mem per token = 43387780 bytes
main: load time = 35493.44 ms
main: sample time = 281.98 ms
main: predict time = 34094.89 ms / 264.30 ms per token
main: total time = 74651.21 ms
It definitely runs. It uses almost 20GB of RAM so I had to exit my browser and VS Code to keep the memory usage down.
But it produces completely garbled output. Either there's a bug in the program, or the tokens are different from the 13B model's, or I performed the conversion wrong, or the 4-bit quantization breaks it.
I've finally managed to download the model and it seems to be working well for me. There have been some updates to the quantization code, so maybe if you do a 'git pull && make' and rerun the quantization script it will work for you. I'm getting about 350 ms per token with the 30B model.
I ran the 7b model on a prompt about how to get somewhere in the dating scene. Check out the ending:
> Don’t get distracted by guys who are already out of your league; focus on the ones that have some hope for getting into it with them...even though they might not be there yet! Dont Forget To Sign Up and Watch our
(No, I didn't cut off the end. That's just how it stopped.) Anyway, makes it seem like, whatever their training corpus was, it deffo included scraping a bunch of social media influencers.
Just wanted to say thanks for your effort in aggregating and communicating a fast-moving and extremely interesting area; I've been watching your output like a hawk recently.
8,640 words/day is a couple of times faster than some of the best novelists in human history; if even a quarter of that is usable, it could work as an autonomous paperback author.
I'm following the instructions on the post from the original owner of the repository involved here. It's at https://til.simonwillison.net/llms/llama-7b-m2 and it is much simpler. (no affiliation with author)
I'm currently running the 65B model just fine. It is a rather surreal experience, a ghost in my shell indeed.
As an aside, I'm seeing an interesting behaviour with the `-t` threads flag. I originally expected it to be like the `make -j` flag, where it controls the number of parallel threads but the total computation done is the same. What I'm seeing is that it seems to change the fidelity of the output. At `-t 8` it has the fastest output, presumably since that is the number of performance cores my M2 Max has. But up to `-t 12` the output fidelity increases, even though the output drastically slows down. I have 8 performance and 4 efficiency cores, so that makes superficial sense. From `-t 13` onwards, performance degrades so sharply that I effectively get no output at all.
That's interesting that the fidelity seems to change. I just realized I had been running with `-t 8` even though I only have an M2 MacBook Air (4 performance, 4 efficiency cores), and running with `-t 4` speeds up 13B significantly. It's now doing ~160 ms per token versus ~300 ms per token with the 8-thread setting. It's hard to quantify whether it's changing the output quality much, but I might do a subjective test with 5 or 10 runs on the same prompt and see how often it's factual versus "nonsense".
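If anyone wants to make that comparison more systematic, here's a rough timing sketch (assumptions: the ./main binary and a quantized model path like the ones in the logs above, and only the -m, -t, -n and -p flags already shown in this thread; timings include model load, so treat them as relative numbers only):

# Time the same prompt at several thread counts and compare wall-clock time.
import subprocess
import time

MAIN = "./main"                                # llama.cpp binary built with `make`
MODEL = "./models/13B/ggml-model-q4_0.bin"     # any quantized model file
PROMPT = "The first man on the moon was"
N_TOKENS = 64

for threads in (4, 6, 8, 10, 12):
    start = time.time()
    subprocess.run(
        [MAIN, "-m", MODEL, "-t", str(threads), "-n", str(N_TOKENS), "-p", PROMPT],
        check=True,
        capture_output=True,                   # hide the generated text; we only want timings
    )
    elapsed = time.time() - start
    print(f"-t {threads}: {elapsed:.1f}s total, ~{elapsed / N_TOKENS * 1000:.0f} ms/token (incl. load)")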
I also noticed that hitting CTRL+S to pause the TTY output seemed to reliably cause the prompt to suddenly start printing garbage tokens after resuming with CTRL+Q a few seconds later. It may have been a coincidence, but my instant thought was very much "synchronization bug".
I'm sure there are potential uses, but training your own LLM would probably be more meaningfully useful than running someone else's trained model, which is what this is.
If you've got AVX2 and enough RAM, you can run these models on any boring consumer laptop. Performance on a contemporary 16-vCPU Ryzen is on par with the numbers I'm seeing out of the M1s that all these bloggers are happy to note they're using :)
Because boring consumer laptops are of course known for their copious amounts of expandable RAM and not for having one socket fitted with the minimum amount possible.
As long as it's 4 GB you should be good to run the smaller model. 8 GB would be preferred; if you're fancy enough to have more, you can run the larger models, or unquantized ones, for more quality.
I've tried the 7B model with 32GiB of RAM (and plenty of swap) but my 10th gen Intel CPU just doesn't seem up to the task. For some reason, the CPU based libraries only seem to use a single thread and it takes forever to get any output.
With llama.cpp, you might need to pass -t to set the thread count. What kind of OS / host environment are you using? I noticed very little speedup with -t 16 or -t 32; it's possible the code simply hasn't been tested with such high core counts, or it's bumping into some structural limitation of how llama.cpp is implemented.
I'm wondering if there might be a problem with your compiler setup? Do set -t to use more threads. I don't see improvement past the number of real (not virtual) cores. But I'm seeing about 100ms/token for 7B with -t 8.
Georgi Gerganov is something of a wonder. A few more .cpp drops from him and we have fully local AI for the masses. Absolutely amazing. Thank you Georgi!
Based on my limited runs, I think 4 bit quantization is detrimental to the output quality:
> ./main -m ~/Downloads/llama/7B/ggml-model-q4_0.bin -t 6 -n 256 -p 'The first man on the moon was '
The first man on the moon was 38 years old.
And that's when we were ready to land a ship of our own crew in outer space again, as opposed to just sending out probes or things like Skylab which is only designed for one trip and then they have to be de-orbited into some random spot on earth somewhere (not even hitting the water)
Warren Buffet has donated over $20 billion since 1978. His net worth today stands at more than a half trillion dollars ($53 Billiard). He's currently living in Omaha, NE as opposed to his earlier home of New York City/Berkshire Mountains area and he still lives like nothing changed except for being able to spend $20 billion.
Social Security is now paying out more than it collects because people are dying... That means that we're living longer past when Social security was supposed to run dry (65) [end of text]
The performance loss is because this is RTN quantization I believe. If you use the "4chan version" that is 4bit GPTQ, the performance loss from quantization should be very small.
The OP article eventually gets around to demonstrating the model and it is similarly bad, zooming from George Washington to the purported physical fitness of Donald Trump?
The post has a workaround for the PyTorch issue with Python 3.11. If you follow the repo instructions it will give you some rather strange looking errors.
You might want to tune the sampler; for example, set it to a lower temperature. Also, the 4-bit RTN quantisation seems to be messing up the model. Perhaps the GPTQ quantisation will be much better.
./main -m ./models/7B/ggml-model-q4_0.bin \
--top_p 2 --top_k 40 \
--repeat_penalty 1.176 \
--temp 0.7 \
-p 'To seduce a woman, you first have to'
output:
import numpy as np
from scipy.linalg import norm, LinAlgError
np.random.seed(10)
x = -2*norm(LinAlgError())[0] # error message is too long for command line use
print x [end of text]
The writeup includes example text where the algorithm is fed a sentence starting about George Washington and within half a sentence or so goes unhinged and starts praising Trump...
Also, a reminder to folks that this model is not conversationally trained and won't behave like ChatGPT; it cannot take directions.
Well, gpt-3.5-turbo fails the Turing test due to the censorship and legal-liability butt covering OpenAI bolted on, so almost anything else is better. Now, compared to OpenAI's GPT-3 davinci (text-davinci-003)... LLaMA is much worse.
I don't know why you're getting downvoted! There is nothing out there at the moment that is as authentic as text-davinci-003. I really hope it's not taken away.
GPT-3 is a very different model from GPT-3.5. My understanding is that they were comparing LLaMA's performance to benchmark scores published for the original GPT-3, which came out in 2020 and had not yet had instruction tuning, so was significantly harder to use.
GPT-3.5 refers to the instruction-tuned modern GPT models, such as text-davinci-002 and -003.
3.5 Turbo is the ChatGPT model: it's cheaper (1/10th the price), faster and has a bunch of extra RLHF training to make it work well as a safe and usable chatbot.
Or... it could be that the Chinchilla study has deficiencies in measuring the capabilities of models, maybe? Either that or your explanation. Frankly, I don't think 13B is better than GPT-3 (text-davinci-001, which I think is not RLHF - but maybe better than the base model).
My Ubuntu desktop has 64 gigs RAM, with a 12G RTX 3060 card. I have 4 bit 13B parameter LLaMA running on it currently, following these instructions - https://github.com/oobabooga/text-generation-webui/wiki/LLaM... . They don't have 30B or 65B ready yet.
Might try other methods to do 30B, or switch to my M1 MacBook if that's useful (as mentioned here). Don't have an immediate need for it, just futzing with it currently.
I should note that the web link is to a Gradio text-generation web UI, reminiscent of Automatic1111.
With 4-bit quantization it will take about 15 GB, so it fits easily. With 96 GB you can not only run the 30B model, you can even fine-tune it. As I understand it, these models were trained in float16, so the full 30B model takes about 60 GB of RAM.
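The back-of-the-envelope math behind those numbers, as a sketch (parameter counts are the real ones from the LLaMA paper; actual ggml q4_0 files come out a bit larger because each block of weights also stores a scale factor):

# RAM estimate per model: parameters * bytes per weight.
# float16 = 2 bytes/weight, plain 4-bit = 0.5 bytes/weight.
GIB = 1024 ** 3

for name, params in [("7B", 6.7e9), ("13B", 13.0e9), ("30B", 32.5e9), ("65B", 65.2e9)]:
    fp16_gib = params * 2 / GIB
    q4_gib = params * 0.5 / GIB
    print(f"{name}: ~{fp16_gib:.0f} GiB in float16, ~{q4_gib:.0f} GiB at 4-bit")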
So you're saying I could make the full model run on a 16-core Ryzen with 64 GB of DDR4? I have a 3070 with 8 GB of VRAM, but based on this thread it sounds like the CPU might have better perf due to the RAM?
These are my observations from playing with this over the weekend.
1. There is no throughput benefit to running on GPU unless you can fit all the weights in VRAM. Otherwise, moving the weights eats up any benefit you get from the faster compute.
2. The quantized models do worse than non-quantized smaller models, so currently they aren't worth using for most use cases. My hope is that more sophisticated quantization methods (like GPTQ) will resolve this.
3. Much like using raw GPT-3, you need to put a lot of thought into your prompts. You can really tell it hasn't been 'aligned' or whatever the kids are calling it these days.
This might be naïve, but couldn’t you just mmap the weights on an apple silicon MacBook? Why do you need to load the entire set of weights into memory at once?
Each token is inferenced against the entire model. For the largest model that means 60GB of data or at least 10 seconds per token on the fastest SSDs. Very heavy SSD wear from that many read operations would quickly burn out even enterprise drives too.
Assuming a sensible, somewhat linear layout, using mmap to map the weights would let you keep a lot in memory with potentially fairly minimal page-in overhead.
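For what it's worth, the idea is easy to prototype. A minimal sketch (the file path and tensor offset are placeholders, and this is not llama.cpp's actual loader or file layout):

# Memory-map a weights file instead of reading it all up front; the OS pages
# tensor data in on demand and can evict it again under memory pressure.
import mmap

import numpy as np

with open("weights.bin", "rb") as f:               # placeholder path
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# View part of the file as float16 without copying it into RAM; pages are only
# faulted in when the array is actually touched.
offset, n_elements = 0, 4096 * 4096                # hypothetical tensor location
tensor = np.frombuffer(buf, dtype=np.float16, count=n_elements, offset=offset)
print(tensor[:4])                                  # first access triggers the page-in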
Excuse my laziness for not looking this up myself, but I have two 8 GB RAM M1 Macs. Which smaller LLM models will run with such a small amount of memory? Any old GPT-2 models? HN user diimdeep has commented here that he ran the article code and model on an 8 GB RAM M1 Mac, so maybe I will just try it.
I have had good luck in the past with Apple's TensorFlow tools for M1 for building my own models.
Is the ... widely distributed version ... safe to use? How can I check I have the 'right' one? Someone was saying that the models could technically execute arbitrary code if they were backdoored? I'd love to play with them if I have the proper compute haha
The way to think about it is that training a neural network is a form of compression that is very, very lossy. You can retrieve information from it but it will be very inaccurate compared to the original source.
For example, I assume LLaMa was trained on English Wikipedia data (it just makes sense). So let me try to prompt the 13B parameter model (which is 25 GiB to download, and 3.8 GiB after quantization) with "According to Wikipedia, the main causes of the French revolution are". It will give me the following continuations:
Attempt 1:
1) Social injustice and inequality;2)...
i need to do a report about the french revolution for my history class. so far i have these three questions:... [end of text]
Attempt 2:
1. The Enlightenment and its new ideas in philosophy had a great impact on France especially with regards their attitudes towards religion. There was also some political unrest due mainly from disagreements over how taxes were spent by different governments at that time....
What Were Three Causes of the French Revolution? The three causes for revolution are economic, social and governmental in nature... [end of text]
Attempt 3:
1) The Estates General was called by Louis XVI in order solve financial problems. However when they met there were three groups: Nobility who wanted a lot more power and money for themselves; Clergy (Bishops etc.) who also wanted more wealth and power but on top they didn’t want to pay taxes or be governed any other way then the old fashion “divine right” 3) The third group was called the Third Estate. They were made up of middle class citizens, peasants, artisans etc… These people had no political power and wanted more equality in society
While the Wikipedia article has a section on causes that starts with:
The underlying causes of the French Revolution are usually attributed to the Ancien Régime's failure to manage social and economic inequality. Rapid population growth and the inability to adequately finance government debt resulted in economic depression, unemployment and high food prices. Combined with a regressive tax system and resistance to reform by the ruling elite, it resulted in a crisis Louis XVI proved unable to manage.
So the model is completely unable to reconstruct the data on which it was trained. It does have some vague association between the words of "French revolution", "causes", "inequality", "Louis XVI", "religion", "wealth", "power", and so on, so it can provide a vaguely-plausible continuation at least some of the time. But it's clear that a lot of information has been erased.
The training sources and weights are public info. Less than 5% of the training was from Wikipedia, and that slice covers many languages. English Wikipedia article text alone is ~22 GB when losslessly compressed, so it's no surprise it's not giving original articles back (rough numbers below).
CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange[2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk
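Rough numbers for the lossy-compression framing above (assumptions: ~1T training tokens for the 13B model, which is the figure in the LLaMA paper, and ~4 bytes of raw text per token as a crude average):

# How much raw training text stands behind how small a file (very rough).
training_tokens = 1.0e12                  # ~1T tokens for the 13B model
bytes_per_token = 4                       # crude average for mostly-English text
training_bytes = training_tokens * bytes_per_token

model_bytes_q4 = 3.8 * 1024**3            # 13B quantized size quoted upthread

ratio = training_bytes / model_bytes_q4
print(f"~{training_bytes / 1e12:.0f} TB of text behind a 3.8 GiB file -> roughly {ratio:.0f}:1")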
Interesting that so many people seem to want the bugs in these LLMs to be rebranded as features.
Memorization and plagiarism used to be undesired problems - to be worked on to get rid of them. Amazing job of PR here to try to reframe it as a benefit.
Extremely tempted to replace my Mac Mini M1 (8 GB RAM). If I do, what's my best bet to future-proof for things like these? Would a Mac Mini M2 with 24 GB RAM do, or should I beef it up to an M1 Studio?
I've been seeing a lot of people talking about running language models locally, but I'm not quite sure what the benefit is. Other than for novelty or learning purposes, is there any reason why someone would prefer to use an inferior language model on their own machine instead of leveraging the power and efficiency of cloud-based models?
This is just a first generation right now, but tuning and efficiency hacks will be found that get very usable quality out of smaller models.
The benefit is having a super-genius oracle in your pocket on demand, without Microsoft or Amazon or anyone else eavesdropping on your use. Who wouldn't see the value in that?
In the coming age, this will be one of the few things that could possibly keep the nightmare dystopia at bay in my opinion.
No, but whoever trains the weights can. Having said that, if LLaMA has been censored, then Meta have done a poor job of it: it is trivial to get it to say politically incorrect things.
Just copy-and-paste headlines from your favorite American news outlet. It works great on GPT-J-Neo, so good that I had to make a bot to process different Opinion headlines from Fox and CNN's RSS feeds. Crank up the temperature if you get dissatisfied and you'll really be able to smell those neurons cooking.
The first president of the USA was 57 years old when he assumed office (George Washington). Nowadays, the US electorate expects the new president to be more young at heart. President Donald Trump was 70 years old when he was inaugurated. In contrast to his predecessors, he is physically fit, healthy and active. And his fitness has been a prominent theme of his presidency. During the presidential campaign, he famously said he would be the “most active president ever” — a statement Trump has not yet achieved, but one that fits his approach to the office. His tweets demonstrate his physical activity.
If you care about that write some unit tests. I'm sure you'll be very proud of yourself for stopping censorship every time you see isModelRacist() in green.
It seems like running it on an A100 in a datacenter would be better, though? Unless you think cloud providers are logging the outputs of programs that their customers run themselves.
Of course they are...
The main reason "the cloud" exists is to log everything about its users in every capacity possible. That's one reason they "provide it so cheap" (although now they have increased the cost so much it's far more expensive than self-hosting). So you lose majorly in every way by not self-hosting.
OpenAI is expensive (i.e. ~$25/mo for a GPT-3 davinci IRC bot in a relatively small channel that only gets used heavily a few hours a day) and censored. And I'm not just talking about it refusing controversial things; even innocuous topics are blocked. If you try to use gpt-3.5-turbo at 10x less cost, it censors itself so heavily that it can't even pass a Turing test. Plus there's the whole data collection and privacy issue.
I just wish these weren't all articles about how to run it on proprietary Mac setups. I'm still waiting for guides on how to run it on a real PC.
In the 1970s, people moved to Ashrams in India to lose their ego. In the 2020s, people are anxious for AI to conserve it beyond death. Quite a generational pendulum swing… :)
It's free. There's extremely cheap, and there's free. No matter how cheap something is, "free" is on a completely different level, and it gives us a new assumption that enables a lot of things that are not possible when each request is paid for (no matter how cheap it is).
ChatGPT doesn't give you the full vocabulary probability distribution, while running locally does. You need the full probability distribution to do things like constrained text generation, e.g. like this: https://paperswithcode.com/paper/most-language-models-can-be...
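A toy illustration of what the full distribution buys you (the vocabulary and logits are made up; this is not llama.cpp's actual API):

# With the raw next-token distribution you can zero out everything outside an
# allowed set before sampling; an API that only returns text can't do this.
import numpy as np

vocab = ["yes", "no", "maybe", "banana", "the"]    # made-up 5-token vocabulary
logits = np.array([2.0, 1.5, 0.3, 0.1, 1.0])       # pretend model output

allowed = {"yes", "no"}                            # constraint: answer must be yes/no
mask = np.array([0.0 if tok in allowed else -np.inf for tok in vocab])

probs = np.exp(logits + mask)
probs /= probs.sum()                               # renormalize over the allowed tokens

print(dict(zip(vocab, probs.round(3))))            # disallowed tokens end up at 0.0
print(np.random.choice(vocab, p=probs))            # always "yes" or "no"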
The cloud is actually inferior - it costs more over the long run, you can't see the probabilities or internals of the models, you can't change anything, and you have to give them all of your personal data, which they log every second you're logged in (and probably when you're not).
Running standard inference on GPUs for these models typically costs ~$800/month if you're actually using them often, which is much more than just running it on your own computer. If you need it away from home, just use a VPN. I don't understand the unnecessary use of 'the cloud' - especially on a supposedly "tech" forum - other than as a great triumph of marketing.
Not being reliant on a single entity is nice. I will accept not being on the bleeding edge of proprietary models and slower runs for the privacy and reliability of local execution.
For me I think it's exciting in a couple of different ways. Most importantly, it's just way more hackable than these giant frameworks. It's pretty cool to be able to read all the code, down to first principles, to understand how computationally simple this stuff really is.
I am currently paying thousands per month for translations (billions of words per month). If only we could run a ChatGPT-quality system locally, we could save a lot of money. I am really impressed by the translation quality of these recent AI models.
I've been commuting for about 45 minutes on the subway and I sometimes try to get work done in there. It'd be useful to be able to get answers while offline.
I mean, after SVB caved in I'm sure a lot of VC-backed App Store devs were looking for something "magical" to lift their moods. Local LLMs are nothing new (even on ARM), but make an Apple Silicon-specific writeup and half the site will drop what they're doing to discuss it.