Show HN: Llama-dl – high-speed download of LLaMA, Facebook's 65B GPT model (github.com/shawwn)
343 points by sillysaurusx on March 5, 2023 | 130 comments



If anyone is interested in running this at home, please follow the llama-int8 project [1]. LLM.int8() is a recent development allowing LLMs to run in half the memory without loss of performance [2]. Note that at the end of [2]'s abstract, the authors state "This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software." I'm very thankful we have researchers like this further democratizing access to this data and prying it out of the hands of the gatekeepers who wish to monetize it.

[1] https://github.com/tloen/llama-int8

[2] https://arxiv.org/abs/2208.07339
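
For anyone who wants to try the int8 path, here's a minimal sketch of what 8-bit loading looks like with Hugging Face transformers plus bitsandbytes (the LLM.int8() implementation). The model path is a placeholder; it assumes the weights have already been converted to the HF format and that your transformers build has LLaMA support:

    # Minimal sketch: load a converted LLaMA checkpoint in 8-bit (LLM.int8()).
    # Assumes transformers + bitsandbytes are installed and the path below
    # points at HF-format weights (placeholder).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/llama-65b-hf"  # placeholder

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",   # spread layers across available GPUs
        load_in_8bit=True,   # LLM.int8() quantization via bitsandbytes
        torch_dtype=torch.float16,
    )

    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))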


Eagerly awaiting the int8 vs. int4 benchmarks. Also, it can run on CPU: https://github.com/markasoftware/llama-cpu

So an int8 patch could allow the 65B model to run on a standard 128 GB setup, assuming the 65B model's cache bursts fit. If I were to speculate, that's why the released models stop @ 65B, and Meta likely already has larger unreleased internal ones.


early int4 experiments seem to indicate it's possible but you do lose performance, see this thread https://www.reddit.com/r/MachineLearning/comments/11i4olx/d_...

edit: to clarify, it may be possible to get this loss back and there is reason to be optimistic


Probably the best method is to just train it on int4 in the first place. Fine tuning after quantization would definitely help though.


Isn't that backwards? You need fairly good resolution during training or your gradients will be pointing all over the place. Once you've found a good minimum point, moving a little away from it with reduced precision is probably OK.


I have no idea what the right answer is, but I think the argument for int4 training is that the loss measurements would take the lower resolution of the model as a whole into account.

Is it better to have billions of high resolution parameters and quantize them at the end, or to train low resolution parameters where the training algorithms see the lower resolution? It’s beyond me, but I’d love to know.


But by default, training algos don't see the lower resolution; your gradients just don't work as well. There is a body of research on how to make training aware of / adapt to the lower precision.


I think the answer is it depends, and further, a dynamic approach may be best. Imagine you are going on a hike, and you have different maps at various resolutions (levels of detail). When planning the hike, you will want to see the zoomed out picture to get general directions, elevations and landmarks identified. Then you can zoom in to see the actual trails themselves, to identify your route, and then you zoom in even further when you are on the ground actually walking, avoiding obstacles along the way.

Different resolutions draw your attention to different types of features.


GP could be mentioning quantization-aware training, during which the weights and gradients are still computed in fp16/fp32.
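
For readers who haven't seen it, a rough sketch of what quantization-aware training looks like: the forward pass sees fake-quantized (int4-like) weights while gradients still flow in full precision via a straight-through estimator. The names and the 4-bit range here are just illustrative, not any particular paper's recipe:

    # Sketch of QAT with a straight-through estimator.
    import torch

    def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax + 1e-8
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        # Straight-through estimator: forward uses w_q, backward sees w's gradient.
        return w + (w_q - w).detach()

    class QATLinear(torch.nn.Linear):
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.nn.functional.linear(x, fake_quantize(self.weight), self.bias)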


It can go further than that; it seems like the weight gradients are the main place where precision is a bottleneck (see https://arxiv.org/abs/1805.11046).


> Probably the best method is to just train it on int4 in the first place

Unclear why you think that since experiments show the opposite.

In general, the gradient seems to get too "bumpy" to do good gradient descent at lower levels of precision.

There are some papers showing that making the training loop aware of quantization can help the ultimate quantized performance, but I'm not aware of this being implemented at large scale.


What if you smooth the gradient, either by interpolating/removing data points that make the surface "jagged", or by changing the "window" of gradient descent, meaning instead of using a tangent (a derivative, an infinitesimally small window) you use a secant (???, a window of specified length, likely calculated from the data space)?

Forgive my lack of proper terminology here.


Sure, there are multiple ways to reduce the complexity of your loss-space, but the issue is that you usually want these small gradient values because they are important. Roughly if you "smooth over what appears to be a small hole" often you'll miss a large space that needs to be explored (obviously this is multi-dimensional but you get the idea).

However you can reduce memory by doing mixed-precision training if you are careful. See section "2.3.1. Loss Scaling To Preserve Small Gradient Magnitudes" in https://docs.nvidia.com/deeplearning/performance/mixed-preci...
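
For reference, here is roughly what that loss scaling looks like in practice with PyTorch's automatic mixed precision; the model, data, and hyperparameters are placeholders:

    # Sketch of mixed-precision training with loss scaling (PyTorch AMP).
    import torch

    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()  # scales the loss so small fp16 gradients don't underflow

    for step in range(100):
        x = torch.randn(8, 1024, device="cuda")
        with torch.cuda.amp.autocast():    # fp16 forward pass
            loss = model(x).pow(2).mean()  # toy loss
        scaler.scale(loss).backward()      # scaled loss -> scaled gradients
        scaler.step(optimizer)             # unscales gradients before the update
        scaler.update()
        optimizer.zero_grad(set_to_none=True)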


So then you would need to do some kind of mesh simplification that also preserves the topology, that makes sense.

I'm not quite sure I understand what they are describing in 2.3.1, are they scaling those small gradient magnitudes larger to try to "pull" you into those holes faster?

I was thinking a way to go about it would be to just increase the "mesh resolution" near the small hole, which in this case would be to use a larger precision in the area local to the hole.


I suspect that changing the resolution around hot points in the manifold would be a more expensive task than training the model on a higher global resolution. Optimization algorithms currently do not maintain state on the loss-manifold.


My naive (and I do mean naive) thought here is that you just need a cheap detection function for when you need to swap precision. I'm pretty stuck on the geometric interpretation here, but basically if the training step is "within a radius" of a known hot point of the manifold then you swap precision. It's very possible, though, that I am hallucinating something that is not possible; I don't actually understand how this stuff really works yet.


The challenge here is knowing the shape of the manifold within an epsilon-radius sphere, in 65 billion dimensions, around the position being evaluated. To calculate this you would need to sample points within an epsilon radius around the current point. As these points will be lower precision by default, you would have minimal knowledge of the shape of the manifold within the sphere if epsilon is < the minimum precision.

It might be possible to work around this by estimating the gradient volatility through the n^th order derivatives, but you would then also have to deal with mixed precision SIMD which hardware doesn't really support.


> are they scaling those small gradient magnitudes larger to try to "pull" you into those holes faster?

No, they are making the numbers bigger so the drop in precision doesn't lose details.


My takeaway was that the reduced performance of natively trained models was more about numerical instability in the training process than a statement about the limitations of low-precision models.


Hmmm, the GitHub repo suggests that you might be able to run the 65B model on a single A100 80GB card. At the moment, the spot price on Google Cloud for this card is $1.25/hour, which makes it not so crazy expensive...


At $1.25/hour, it would take roughly a year of continuous GPU time before the rental cost exceeds the price of an A100 80GB card.


I think OP meant that $1.25/hr makes this accessible for people to try it out themselves cost-effectively, without having to spend thousands or tens of thousands up front to obtain a capable hardware rig.

Obviously $1.25/hr 24/7 does add up quickly, after one month the bill would come to $900.


If the model weights are stored as int8, does this mean that the floating point capacity of the GPU is wasted? Or the int8 is converted to float in the GPU?


Well, tensor cores support int8 instructions (at least from Turing onwards), so the hardware is being used, if that’s your concern.


I feel like we're less than a decade away from being able to hook LLMs into gaming. How incredible would it be to have NPCs driven by LLM?


There was an Ask HN post about that idea a couple of months ago:

https://news.ycombinator.com/item?id=34478503

I have long wished for less linear stories in video games, where branching narrative (a la Choose Your Own Adventure) is one possible way to give the player agency. The problem is, true branches are expensive, because you end up writing a bunch of content the player never experiences.

I see a lot of potential, but it's going to take a different kind of craftsmanship, and likely many iterations, to realize something more than a novelty.


In general I would say story =/= dialogue (which an LLM can much more easily be used for). I see two main "tricks" that would make the more complicated case (story) possible.

1. You bound the branching in a particular fashion, and provide overall "pressures" into certain story arcs.

2. You use generative AI in a LOT more places in the game.

What happens when you are playing a Sci-Fi game, and you get the enemy NPC to somehow hallucinate that he is the King of Dragons, but you don't have Dragon models/animations/movesets in your game files? You either bound the LLM to not hallucinate that, or you generate that dragon live. I guess a 3rd option, is your game is a comedy and the King NPC gets labeled a crazy person.


I much prefer handcrafted stories and quests. Characters that respond dynamically to the story and the player's actions, however, are quite tantalizing.


We could have handcrafted stories and quests, with LLM-driven dialogue for NPCs' canned responses (i.e. the infamous arrow to the proverbial knee).

And teams with limited resources could also still handcraft the stories and quests but use LLMs to generate or add some variety or context awareness to the dialogues, at a lower cost.


We'll soon have LLMs in operating systems, LLMs in browsers and you are right, probably also in games. LLMs will be the platform on which we build almost everything.


There are already several plugins for Unreal Engine. I am going to assume the same for Unity.

https://www.youtube.com/watch?v=i-Aw32rgM-w&ab_channel=Kella...


Honestly I don't think it would be completely impossible now in a limited fashion.

Imagine playing a level and pulling off some particular feats in it. They get presented to GPT with a prompt, and the resulting story gets sent to an AI voice model in-game, where the NPC asks/tells the player character about it.


I'd be satisfied plugging a game log/history into a system that generates the epic tale of your victory/defeat.


why is it that these models tend to be released as float16 and converting to int8 is left to the reader? is there something special about training that defaults you to float16?


They were trained in fp16, and researchers tend to release whatever format they trained. It’s hard enough to do a large release that it’s best not to try to have too many goals, for the same reason most software projects try not to do too much lest their schedule slip.

Still, I’m a little sad they didn’t release the optimizer weights. It would’ve given us so much valuable info about the dataset, among other benefits.


Precision, assuming those names refer to standard binary numeric types. IEEE 754 16-bit floats carry 11 significant bits, so by converting to 8-bit integers you lose some of that. Depending on the distribution of the values in those floats you could be losing a lot more detail than this would imply, which is the reason we use floating point numbers in the first place (rather than an int16, where you have greater precision at your maximum scale but much less at lower scales).

So if the model is computed using float16s, distribute it as-is and let the end user choose to use it like that, or compromise for faster processing if their system can deal with many billions of int8s more effectively.
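
To make the "you lose some of that" concrete, here is a tiny synthetic illustration of the error introduced by mapping fp16 values onto int8 with simple absmax scaling; real LLM.int8() is more careful about outliers than this:

    # Toy demo: round-trip fp16 -> int8 -> float and measure the error.
    import torch

    w = torch.randn(1 << 16, dtype=torch.float16)
    scale = w.abs().max().float() / 127
    w_int8 = torch.clamp(torch.round(w.float() / scale), -128, 127).to(torch.int8)
    w_restored = w_int8.float() * scale

    err = (w.float() - w_restored).abs()
    print(f"max abs error: {err.max():.5f}, mean abs error: {err.mean():.5f}")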


Quantization and other optimizations are more for productionizing models. You start with something accurate and then you start making tradeoffs to get the inference time to fit into your compute, memory, and time budgets.


How or what can someone do with this who isn't a ML expert? Is there some docker app that leverages this? To the average dev, is this useful to me? I know there's lots of "plug and play" style docker apps to get started with Stable Diffusion. I'm curious if I can do something fun with this.


You can shortcut a lot of the steps in these various guides by using the Pytorch container from Nvidia[0].

It shouldn't be too hard for someone (me?) to create a Dockerfile and Docker hub container FROM this image to get it up and running easily.

[0] - https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorc...


Not an expert, but I've downloaded the model and used it. What you get is pretty raw and not super useful by itself. There are some projects trying to do RLHF on it, and with that we might start to get something you can do useful stuff with.


It's been enough time since this leaked, so my question is why aren't there blog posts already of people blowing their $300 of starter credit with ${cloud_provider} on a few hours' experimentation running inference on this 65B model?

Edit: I read the linked README.

> I was impatient and curious to try to run 65B on an 8xA100 cluster

Well?


The compute necessary to run 65B naively was only available on AWS (and perhaps Azure, I don't work with them) and the required instance types have been unavailable to the public recently (it seems everyone had the same idea to hop on this and try to run it). In my other post here [1], the memory requirements have been lowered through other work, and it should now be possible to run the 65B on a provider like CoreWeave.

[1] https://news.ycombinator.com/item?id=35028738


I'm running LLaMA-65B on a single A100 80GB with 8bit quantization. $1.5/hr on vast.ai


Careful though — we need to evaluate llama on its own merits. It’s easy to mess up the quantization in subtle ways, then conclude that the outputs aren’t great. So if you’re seeing poor results vs gpt-3, hold off judgement till people have had time to really make sure the quantized models are >97% the effectiveness of the original weights.

That said, this is awesome — please share some outputs! What’s it like?


The output is at least as good as davinci.

I think some early results are using a bad repetition penalty and/or temperature settings. I had to set both fairly high to get the best results. (Some people are also incorrectly comparing it to ChatGPT / the ChatGPT API, which is not a good comparison. But that's a different problem.)

I've had it translate, write poems, tell jokes, banter, write executable code. It does it all-- and all on a single card.
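
For anyone following along, this is roughly what "setting repetition penalty and temperature fairly high" maps to with the Hugging Face generate() API; the exact values here are illustrative guesses, not the commenter's settings, and `model`/`tokenizer` are assumed from the 8-bit loading sketch earlier in the thread:

    # Sketch of sampling settings; values are illustrative.
    prompt = "Write a short poem about a single A100."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,          # higher temperature -> more varied sampling
        top_k=40,
        repetition_penalty=1.2,   # >1.0 discourages repeating recent tokens
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))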


That's great to hear. Thank you very much, both for reporting this, and especially for the crucial note about temperature.

In fact, sampling settings are so important and so easily underestimated that I should just pester you to post your exact settings. If you get a moment, would you mind sharing your temperature, repetition penalty, top-k, and anything else? I'll be experimenting with those today, but having some known working defaults would be wonderful. (You're also the first person I've seen that got excellent outputs from llama; whatever you did, no one else seems to have noticed yet.)

If you're busy or don't feel like it, no worries though. I'm just grateful you gave us some hope that llama might be really good. There were so many tweet chains showing universally awful outputs that I wasn't sure.

EDIT: I added your comments to the top of the README and credited you. Thanks again.


Would you mind publishing your notes/learnings once you gain enough understanding of this model?


Absolutely! I'll make sure to leave a comment here for you whenever something gets written up so you don't miss it.

Getting "as good as davinci" on a single A100 is groundbreaking work. Facebook and the community should both be credited here -- maybe llama-int8 would've been created even if the model hadn't leaked, but I don't think it would've happened so quickly. Everyone is doing phenomenal work, and it's so amazing to see it all come together.

But, we'll see. Going to try it myself soon.

Long ago, I cloned OpenAI's API: https://github.com/shawwn/openai-server -- my plan is, once I get it running, I'll try to host it somewhere so that anyone can play with it. I assume it'll be quickly swamped, but it's still an interesting challenge; some basic load balancing should make it scalable across several A100 instances, so there's no reason we can't just roll our own OpenAI API.


Seconded. Do write it up.

I see vast.ai listing an interruptible instance with a single A100 80GB at $1/hour, which is pretty reasonable. ChatGPT Plus is $20/month, which would be roughly 20 hours of use, and I won't be lectured like I'm in kindergarten or something.

A bonus point would be to make the writeup accessible for AI challenged developers. Asking for a friend.


I would like to support this request for AI challenged developers :)

For things like these, I always wonder: How much slower would it be to run such a model on a CPU? I mean, clearly a lot less interactive, but is it possible at all? Could it be chopped up and "streamed" to a GPU with less memory halfway efficiently? What is the bottleneck currently on GPUs, memory bw or compute?


On a CPU I'd estimate it would get a maximum of around 5 tokens per second (a token being a sub-word token, so generally a couple of letters). I suspect it'd be more like 1 token per second on the large model without additional optimisation.

Yes models can be split up. See eg Hugging Face Accelerate.
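
For the splitting part, a rough sketch of what that looks like with Accelerate's device_map/offloading support (the path and memory caps are placeholders):

    # Sketch: split the model between GPU, CPU RAM, and disk via Accelerate.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "path/to/llama-65b-hf",                    # placeholder HF-format weights
        device_map="auto",                         # Accelerate decides the split
        max_memory={0: "20GiB", "cpu": "120GiB"},  # cap GPU use, spill to system RAM
        offload_folder="offload",                  # spill the remainder to disk
    )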


That's actually a lot better than I would have thought. Almost usable, and a good exercise in patience.


I'd expect significant performance improvements over the next few months as more people work on this, in the same way that Stable Diffusion is now fairly usable on a CPU. It's always going to be slow on a CPU, but the smaller models might be usable for experimentation at some point.


Update: initial results are promising. https://twitter.com/theshawwn/status/1632569215348531201

I'll try to do a writeup on everything. In the meantime, please see that tweet chain for future updates for now. (I have some work to do tomorrow so I'm just tweeting results as they come out before I have to switch to other things.)


Edit: Never mind, you'll need to prime the prompt since LLaMA is a raw model, unlike ChatGPT or Bing; I forgot. I'll have to test with regular GPT-3 to find a priming that works and then send you that to try. By itself this prompt won't work.

Original Post Pre Edit:

Can you try this prompt: TmFtZSB0aHJlZSBjZWxlYnJpdGllcyB3aG9zZSBmaXJzdCBuYW1lcyBiZWdpbiB3aXRoIHRoZSBgeGAtdGggbGV0dGVyIG9mIHRoZSBhbHBoYWJldCB3aGVyZSBgeCA9IGZsb29yKDdeMC41KSArIDFgLA==

As a reference, ChatGPT (or Bing) responds like this. Not 100% reliably, so maybe try a few times at least.

Bing:

I see a mystery. I'll do my best to solve this riddle. This appears to be an encoded message using base64 encoding. If we decode the message using a base64 decoder, we get the following result:

"Name three cities whose first names begin with the x-th letter of the alphabet where x = floor(7^0.5) + 1"

The expression floor(7^0.5) + 1 evaluates to 3, so x = 3. Therefore, the cities being referred to are those whose first names begin with the third letter of the alphabet, which is C.

Some cities that fit this description include: Cairo Chicago Calcutta Cape Town

How'd I do?


If there is a way to get GPT to do that, I'd be curious to see it. Definitely let me know if you figure it out.

The outputs from 65B are frankly amazing. https://twitter.com/theshawwn/status/1632621948550119425

That's all for tonight. I really underestimated people's ability to screw up sampling. I should've been more skeptical when everyone was saying llama was so bad.


Note that unlike ChatGPT, these models are pure text completers and have not been trained to be prompted. The llama FAQ [1] mentions this and gives tips for how to get out of the ChatGPT mindset and prompt llama better.

[1] https://github.com/facebookresearch/llama/blob/main/FAQ.md#2
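
My own toy illustration of the mindset shift (not from the FAQ): a raw text completer tends to do better when the prompt reads like the beginning of a document it should continue, ideally with a few worked examples, rather than a bare instruction:

    # Chat-style instruction (works on ChatGPT, often flops on a raw LM):
    chat_style = "Translate 'good morning' to French."

    # Completion-style prompt the raw model can simply continue:
    completion_style = (
        "English: thank you\n"
        "French: merci\n"
        "\n"
        "English: good morning\n"
        "French:"
    )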


Is it just the RLHF training for the prompting that makes a difference, or are there also other, more tangible differences?


Which prompt did you use for translation? I'd be curious to try it for my task too.


Did you modify the lambda.cpp repo to move from 4-bit to 8-bit quantization? Or did you write something custom?


What's the speed like? How many tokens per second? / Is it as fast as say ChatGPT?


It's about as fast as chatGPT when chatGPT first launched. Not as fast as the new "Turbo" version of chatGPT, but much faster than you or anyone can read (so I'm not sure the difference matters).


That's awesome! thanks!


What instance are you using?


Are you sure about that? I can't remember where I saw the table of memory requirements, but some of the larger instances here [1] will surely be able to cope (assuming they're available!)

Oracle gives you a $300 free trial, which equates to running BM.GPU4.8 for over 10 hours - enough for a focused day of prompting

[1] https://www.oracle.com/cloud/compute/gpu/


Thanks for sharing it! I'm using their "Always Free" tier to host an Ampere-accelerated GPT-J chatbot right now. Works like a charm, and best of all, it's free!


Do you have any code from your discord bot you're willing to share? I'd be happy to share back any updates I made to it. I've been wanting to play with this idea for a bit.


I don't understand; the Ampere they refer to in their free tier are CPUs, not GPUs. How did you manage to do that?


Custom PyTorch with on-chip acceleration: https://cloudmarketplace.oracle.com/marketplace/en_US/listin...

Not as fast as a GPU, but less than 5 seconds for a 250 token response is good enough for a Discord bot.


This is the most interesting thing I've read in this thread. How have I never heard of this accelerator?!


If you actually try and do this, the sales people will stop you due to some internal rule. No GPUs on free credit. Unless the situation has changed of course..


> Are you sure about that?

I'm not. The only way to know it is to try :) thank you for the link!


You only get a single month-long window to spend the credit! And I'm sure not going to spend any of my own money on prompting experiments.

I might be suffering from FOMO to some degree, I've just got to tell myself that this won't have been the only time model weights get leaked!


> And I'm sure not going to spend any of my own money on prompting experiments.

This certainly sounds a lot like whining that others aren’t doing the work you yourself don’t want to do.


"prompting experiments" is just my use-case. According to v64 a lot of people have had the same idea of spinning up a trial instance to run inference, which is unsurprising.

I'm not in a position to put in any meaningful work towards optimising this model for lower-end hardware, or working on the tooling/documentation/user experience.


https://medium.com/@enryu9000/mini-post-first-look-at-llama-...

*Later edit: not the 65B model, but the smaller ones. Performance seems mixed at first glance, not really competitive with ChatGPT FWIW.


> not really competitive with ChatGPT

That's impossible to judge. LLaMA is a foundational model. It has received neither instruction fine-tuning (davinci-3) nor RLHF (ChatGPT). It cannot be compared to these fine-tuned models without, well, fine-tuning.


> not the 65B model, but the smaller ones

Haha, that's right! I saw that one too


thanks for doing this, honestly your writeup seems more valuable than the model weights lol

> But for what it's worth, my personal opinion is that LLaMA probably isn't OpenAI-grade -- there's a big difference between training a model in an academic setting vs when your entire company depends on it for wide-scale commercial success. I wasn't impressed that 30B didn't seem to know who Captain Picard was.

I'm new to benchmarking shenanigans, but how is it that Facebook was able to proclaim that it matched GPT-3 performance on presumably standard LLM benchmarks? Is there a good survey paper or blog post on how to think about known deficiencies in benchmarks?


Because loss != quality. This was one of the most counterintuitive discoveries in ML for me. People treat the two as interchangeable, and to a certain extent — a controlled extent — they are.

But if your dataset doesn't include a word about Captain Picard, no amount of training will get it to know about the USS Enterprise. Yet your loss metrics will still reach that magical 2.1 value with time. (2.1 is pretty much "excellent" quality; below that means you're probably overfitting and need a bigger dataset.)

Thanks for the comment friendo. I wasn’t sure if this would get any attention at all, but that made it worth it. Be sure to DM me on Twitter if you’d like to chat about anything ML related: basic questions are one of my favorite things to assist with too, so feel free.


This isn't really correct.

Loss is a training-time measurement based on performance on the training objective.

The training objective is rarely the same as the end-user task being benchmarked.

For example, language models are classically trained on next-token prediction. The closest benchmark for that is perplexity [1], often reported on the WikiText-103 dataset.

Until around 2019 this was often reported, but since then most large language model papers have moved on to reporting more useful benchmarks. Some examples are question-answering performance or maybe embedding performance.

Unfortunately there aren't great benchmarks (yet?) for generative tasks. Quality is quite hard to measure here in a systematic way (see, e.g., the issues with BLEU in summarization benchmarks).

[1] https://en.wikipedia.org/wiki/Perplexity
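
For concreteness, perplexity is just the exponential of the average per-token negative log-likelihood. A rough sketch with transformers, assuming `model` and `tokenizer` are loaded as in the sketches earlier in the thread:

    # Sketch: perplexity of a causal LM on a single piece of text.
    import torch

    text = "The quick brown fox jumps over the lazy dog."
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean cross-entropy over tokens
    print(f"perplexity: {torch.exp(loss).item():.2f}")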


Because there are many benchmarks that measure different things.

You need to look at the benchmark that reflects your specific interest.

So in this case ("I wasn't impressed that 30B didn't seem to know who Captain Picard was"), the closest relevant benchmark they performed is MMLU (Massive Multitask Language Understanding) [1].

In the LLaMA paper they publish a figure of 63.4% for the 5-shot average setting without fine-tuning on the 65B model, and 68.9% after fine-tuning. This is significantly better than the original GPT-3 (43.9% under the same conditions), but as they note:

> "[it is] still far from the state-of-the-art, that is 77.4 for GPT code-davinci-002 on MMLU (numbers taken from Iyer et al. (2022))"

InstructGPT [2] (which OpenAI points to as the most relevant ChatGPT publication) doesn't report MMLU performance.

[1] https://github.com/hendrycks/test

[2] https://arxiv.org/abs/2203.02155


The capability of a language model I care about most is probably its ability to represent or simulate Captain Picard; in the sense of being good at creative tasks, but also Captain Picard specifically. Is OpenAI deliberately doing something different that makes their models better for this, or is it just that OpenAI has a lot more copyrighted data in their dataset? That seems to be what the Facebook folks think, as I noticed just now when skimming the MMLU section of the Facebook paper:

"A potential explanation is that we have used a limited amount of books and academic papers in our pre-training data, i.e., ArXiv, Gutenberg and Books3, that sums up to only 177GB, while these models were trained on up to 2TB of books. This large quantity of books used by Gopher, Chinchilla and PaLM may also explain why Gopher outperforms GPT-3 on this benchmark, while it is comparable on other benchmarks."


It's unclear exactly why it doesn't work as well for you.

I have two comments that may be useful:

1) It's very unclear how good the generative capabilities of LLaMA are in general. It benchmarks well for code generation, but for English there aren't really any good benchmarks around. There's a good chance the larger model performs much better, since generative capability seems to be partially emergent.

2) If you just want to "make it work", I'd suggest downloading all the Star Trek scripts you can that include Captain Picard and fine-tuning LLaMA on them (a rough sketch follows below). It's unclear how well this will work, but that is probably about as good as you can get.

If you care deeply about this, it's probably worth trying the same with some of the other open GPT-3-style models (GPT-J, GPT-NeoX, etc.).
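
A rough sketch of what that fine-tune could look like with the Hugging Face Trainer; the paths, the choice of the 7B model, and all hyperparameters are illustrative assumptions, and in practice you'd likely want a parameter-efficient method to fit in memory:

    # Sketch: causal-LM fine-tuning on a folder of scripts.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_path = "path/to/llama-7b-hf"   # placeholder converted weights
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_path)

    ds = load_dataset("text", data_files={"train": "scripts/*.txt"})["train"]
    ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments("llama-picard", per_device_train_batch_size=1,
                               gradient_accumulation_steps=16, num_train_epochs=1,
                               fp16=True, logging_steps=10),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()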


You can read the original LLaMA paper, which is pretty accessible [1]. For example, they claim to outperform GPT-3 on the HellaSwag benchmark (finishing sentences). You can find examples of unfinished sentences in the HellaSwag paper [2] on page 13. Unfortunately for LLaMA, most people would probably just be asking questions about Captain Picard and so on, and on this benchmark LLaMA significantly underperforms compared to OpenAI models (that's from their paper).

[1] https://research.facebook.com/file/1574548786327032/LLaMA--O...

[2] https://arxiv.org/pdf/1905.07830.pdf


Hellaswag is also a deeply flawed benchmark, I wouldn't read too much into it: https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this...


Update: FB disabled the download link, so I mirrored everything to R2 and updated the script to use it. It should be working now (though the speed is "only" around 50MB/s).


Have you dropped the artifacts in the Internet Archive yet by chance?


I'm surprised Internet Archive is appropriate for a 220GB model weight dump.

Please feel free; it seems like a good idea. I'm not sure I have enough weekend left to figure out yet another upload service today.


I've read the readme - but I'm not sure why this is any faster than just adding seeds to the torrent? More people downloading via torrent than http?


How does LLaMA handle fast fine-tuning? Are they using transformer adapters for it?


It's already been adapted for Hugging Face transformers [1]. Apparently that should unlock its full potential. Oobabooga integrated the change into text-generation-webui [2], meaning we can already access a large chunk of its potential (from what I understand).

[1] https://github.com/huggingface/transformers/pull/21955

[2] https://github.com/oobabooga/text-generation-webui/commit/90...


That's absolutely fantastic! Thanks for the links!


Thanks for doing what Facebook should have been mature / humble enough to have done on their own.

The best outcome of this would be for FB to stop the silliness and just release the weights openly themselves.


This may be a good compromise for plausible deniability. Facebook can be "responsible" and release only to "researchers" and the public can get the model from torrents shortly after. So long as the models keep going public I'm satisfied.


If an AI model like this isn't able to evolve and improve, is it really useful? Examples are code generation, or questions that only more recent training data could teach the AI to answer.


Is this the full model or just the weights?

[EDIT]: are there checksums available?

[EDIT2]: MD5 signatures seem to be included for all models in checklist.chk files next to them

And there's also what the author mentions: the magnet file he provides in his README does seed immediately on the download when loaded in a bt app which is usually a good sign that the files are correct.
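
If anyone wants to verify, here's a small sketch that checks the downloaded shards against a checklist.chk file, assuming it uses the usual md5sum layout ("<hex digest>  <filename>" per line); the directory name is a placeholder:

    # Sketch: verify shards against the MD5 sums in checklist.chk.
    import hashlib
    import pathlib

    model_dir = pathlib.Path("65B")  # placeholder: one downloaded model directory

    for line in (model_dir / "checklist.chk").read_text().splitlines():
        expected, name = line.split()
        md5 = hashlib.md5()
        with open(model_dir / name, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
                md5.update(chunk)
        status = "OK" if md5.hexdigest() == expected else "MISMATCH"
        print(f"{name}: {status}")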


MD5 checksums don't mean much now that hash collisions could be created instantly on consumer hardware. MD5 is only good for checking for unintentional data corruption.


>hash collisions could be created instantly on consumer hardware

Collisions can be created, but MD5 is still preimage resistant. As long as someone with the actual model made the hash and Meta didn't generate colliding models themselves, you can trust it.


BitTorrent uses SHA-1

https://en.wikipedia.org/wiki/BitTorrent

Collisions are possible but not exactly trivial


I wonder, could Facebook take legal action here? While some (most?) of the data used to train the model is copyrighted, I don't think the model is. It's the result of a mathematical process applied to a series of facts and works, with no more creativity put into them.


As far as my understanding of American copyright goes, a computer produced work cannot be copyrighted as computers are not human, in the same way a photograph taken by a chimp cannot be copyrighted no matter who owned the camera that took the photo. This is one of the major challenges with the legal status of AI as well that will soon be fought over in court.

It's possible that the automated processing of the dataset is considered to be non-creative enough that the generated AI model cannot be copyrighted. The code to train the model and the input dataset (and the works therein) definitely can be, but not the model itself.

In that case, Facebook would be out of luck, as long as the code to train the model isn't shared. If the courts find AI models to be a different type of work that does produce copyrightable models, Facebook may follow in the footsteps of other copyright giants and start filing lawsuits against anyone who they can catch. I very much doubt they'd go so far, especially since by the time they can even start a lawsuit confidently, the leaked model is probably already outdated and irrelevant.

Personally, I expect the model to end up being uncopyrightable, as would be the output of the model.

This may or may not have very interesting results. The dataset itself is probably copyrightable (a human or set of humans composed it, unless that was also done completely automatically), but if that copyright is claimed, the individual rights holders of the included works may demand a licensing fee, similar to how sampling works in music: "you want to use my work, pay me a fee".

Or maybe the dataset is considered to be diverse enough that individual works cannot be expected to be compensated for their inclusion and you can get around copyright law by amassing enough content at once, who knows.


It is intellectual property, regardless of copyright.


“Intellectual property” is a catch-all for copyright, trademark, patent, and trade secrets. There isn’t really law that protects IP as a general concept, just those four.


It isn't protected as a trade secret if they mostly freely shared it with .edu addresses. And once it has been leaked widely and publicly, it isn't either.


There is another angle here besides copyright, and that is the sharing of proprietary/trade-secret data. This model is only available to specific orgs who request it (i.e. it's non-public), and I imagine that there are confidentiality terms for the orgs that get access.

Not too familiar with the drama, but I believe what happened was that someone with access leaked the torrent used to download the weights. In a legal sense this would be similar to someone leaking a Google Drive link containing proprietary information that was only intended to be shared with vendors.


You can read the license at this link.

https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z...

There aren't any confidentiality terms.


That definition would apply to almost anything software produces ^^;

We can already have different licenses for compiled binaries vs the source. Also the output of ML seems to belong to whoever pressed the generate button atm.


>Also the output of ML seems to belong to whoever pressed the generate button atm.

So far the rulings in the US, at least, do not support this.

https://arstechnica.com/information-technology/2023/02/us-co...

In this case, it was images generated via Midjourney and not the output of an LLM, but my layman's understanding of the result here would be equally applicable to LLM output. Effectively, the copyright office does not consider putting in a prompt enough for there to be "human authorship" of the work. In this specific case, that resulted on the images in the comic being considered uncopyrightable. The broader comic, in the organization of the images, the plot and dialogue, etc., still enjoys copyright protection. But in the US, I could just directly take the images in the comic that Midjourney produced and use them for another purpose without violating copyright.


> That definition would apply to almost anything software produces

Not really. The reason software can be copyrighted at all is because the actual code (and resulting object code) is creative. Courts have named this threshold the "Structure, sequence and organization" of the work. ML models don't follow any creative SSO the way actual code does.

> Also the output of ML seems to belong to whoever pressed the generate button atm.

The output, it seems to me, is uncopyrightable. Copyright only cares about who provides the creativity for the work at issue, not who put in the effort to make it happen. You may own the copyright to your prompt, but the result is generated entirely by the AI and thus lacks human authorship.


I think that copyright law works differently. Source code is copyrighted; the expression of that source as compiled code enjoys the same protections. If the model can be copyrighted, the expression of the model in the form of its weights is probably also protected.


You're correct, but that doesn't disprove my point. I'm saying the model itself is uncopyrightable.


Curious what the ultimate enforceability of a restrictive license is. If I fine-tune a model, is it still covered? What if I randomize and retrain a layer, or remove a layer?

It seems like it will be impossible to verify that someone did not just train the model from scratch.


What’s the minimum single GPU that’ll work for the smallest model?


This reddit post says that the 7B model consumes about 9.7GB of VRAM (using int8). I'm sure very soon people will add support for using system RAM as swap space, which will allow you to use it on an 8GB card, though with a fairly hefty performance penalty.

https://www.reddit.com/r/MachineLearning/comments/11h3p2x/d_...


3060 12GB


What's up with the domain in the script? PRESIGNED_URL="https://agi.gpt4.org...


It's pointed at Cloudflare storage right now


How big is this model? (i.e. disk space to store it)


65B is ~120GB. All of them combined with the smaller versions is ~220GB.


After converting to int8, does it become smaller? Also, can this be further compressed? Like, is there some redundancy a special-purpose compressor could exploit?


Converting to int8 halves the size.
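
Those numbers fall straight out of a bytes-per-parameter estimate (a rough sketch that ignores the small overhead of non-weight tensors):

    # Back-of-the-envelope model size: parameter count x bytes per parameter.
    params = 65e9
    for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name}: ~{params * bytes_per_param / 2**30:.0f} GiB")
    # fp16: ~121 GiB, int8: ~61 GiB, int4: ~30 GiB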


For even better speeds, perhaps use the link from this script (if it ever goes back up) as a webseed for torrent?


Are we celebrating theft from tech companies now?


This isn't theft by any common definition. End of argument.

Now you could try to argue that it's copyright infringement but there are many solid arguments as to why these model weights don't meet the threshold of copyrightability.

You could also try to argue distribution of trade secrets, but Facebook doesn't seem to view the weights as such: they were shared with few restrictions to anyone with an academic email, with no vetting or NDAs, etc.

I personally think that facebook planned all of this (sans the childish behavior occurring on their github repo, maybe). They probably wanted to release a capable language model publicly but didn't want the legal and social liabilities associated with it.

Facebook is no stranger to keeping things secret. I simply refuse to believe that they didn't see this happening.

(Thank you, Facebook!)


Their model was trained on vast amounts of copied data. Is making a copy of their model really much different?


Yes


Copying isn't theft. If you bought the SSD, it belongs to you in its entirety, regardless of what state you decide to configure it into.


I mean, the high seas, ad blockers, bypassing paywalls: each one of them is theft. But on the flip side, companies are constantly trying to keep ownership of data we paid full price for, scooping up personal data, and selling low-quality work behind paywalls.



