XGen-7B, a new 7B foundational model trained on up to 8K length for 1.5T tokens

brucethemoose2 · on June 29, 2023

> The training recipe and model architecture follow LLaMA

This is huge.

MPT and Falcon are cool, but the inference runtimes and various tooling is mostly optimized for LLaMA. If this is a drop-in replacement for 7B, it's going to catch on much faster than any other small model.

lhl · on June 29, 2023

There's also OpenLLaMA, which has a 13B version as well and is a straight drop-in (except for code generation due to multiple space tokenization: https://github.com/openlm-research/open_llama#update-0615202...).

XGen-7B is probably the superior 7B model, it's trained on more tokens and a longer default sequence length (although both presumably can adopt SuperHOT (Position Interpolation) to extend context), but larger models still probably perform better on an absolute basis.

jxy · on June 29, 2023

It seems to use a different tokenizer than LLaMA, though the neural network architecture is the same.

minimaxir · on June 29, 2023

The tokenizer appears to be the original GPT-2 tokenizer, with some curious added tokens: https://huggingface.co/Salesforce/xgen-7b-8k-base/blob/5e1ad...

imrehg · on June 29, 2023

Those look all programming related tokens, and I think they should relate to their focus of improving this model's code generation capabilities, and adding quite a bit of data related to that (BigCode Starcoder in the Second stage in pre-training https://blog.salesforceairesearch.com/xgen/#pre-training-dat... )

imrehg · on June 29, 2023

Looks like this is the limitation why it cannot just be plugged into llama.cpp (doesn't have a saved tokeniser model, and I'm not sure how would one go about creating one). Otherwise it would be cool to try it out, the metrics in the article are promising to have something running locally on an M1 Mac...

TOMDM · on June 29, 2023

From all the experimentation I've done, 7B parameter models just don't seem to be able to produce useful output reliably enough for my use cases.

What use cases do people have for these smaller LLM's?

brucethemoose2 · on June 29, 2023

7B LLaMA is a terrible general purpose model, but the finetunes are pretty good at very specific roles, like dialogue/roleplay, a dungeon master bot or even code completion.

The metrics are good though, perhaps placing this closer to 13B.

And 8K context is huge. When you can stuff that much example text in, it gives the model more to "latch onto," and its also the point where you would start worrying about RAM/VRAM consumption for a ~13B model.

Tostino · on June 29, 2023

You must have missed the memo... It's now super easy to extend the context of 2k llama models to 8k, 16k, or even 32k with just a small fine tune and a tweak to the code.

You still need the memory to be able to go that high, but it's totally doable.

thomasahle · on June 29, 2023

What memo/paper does this?

Tostino · on June 29, 2023

https://arxiv.org/abs/2306.15595

brucethemoose2 · on June 29, 2023

I saw the SuperHOT LORAs as well.

But I assumed full training would give better perplexity for large contexts, and perhaps this method would be more effective at 16K+ with an 8K model to start with.

Tostino · on June 30, 2023

Possibly, but the perplexity has shown to decrease while fine-tuning a 2048 model on larger context sizes for outputs within it's original context limit...so, more research needed.

ttt3ts · on June 29, 2023

Do you know a dataset for fine-tuning roleplaying?

aryamaan · on June 29, 2023

would appreciate resources on fine tuning

brucethemoose2 · on June 29, 2023

Honestly, I dunno. I think most people are using lit-llama or EasyLM (on TPUs) for finetuning?

QLORA is the gold standard for more affordable training.

As for datasets, just look at the open datasets the best-in-class models are using, like Vicuna or https://huggingface.co/NousResearch/Nous-Hermes-13b

Some model datasets like Manticore, Chronos or the infamous Pygmalion are more "secretive," but you can find the dataset gathering scripts on Github or in community chats.

ttt3ts · on June 29, 2023

https://huggingface.co/blog/stackllama

You can easily finetune 7B or 15B LORA model with that on consumer GPUs.

minimaxir · on June 29, 2023

That blog post demonstrates that it's not "easily" finetuneable, just possible to finetune. There's many technical considerations even beyond hardware (dataset formatting, training hyperparameter nuances) that do not make it accessible to the newbie experimenting with LLMs.

It's a rabbithole, and unfortunately there's no good shortcuts.

ttt3ts · on June 29, 2023

It isn't script kiddie level but it isn't hard I finetuned a 15B parameter reddit bot with an afternoon of time and a day of training on a 3090. Bot got a few thousand Karma in a couple of days before I turned it off (proof of concept done).

If all you have is an M1 or whatever, ya, you need a real workstation and depending on your use ChatGPT might be cheaper/better.

Solvency · on June 29, 2023

Why have there been thousands of overnight AI/GPT startups and products in the last few months and NOT a single simple intuitive "fine tuning wizard" app? That seems like such an obvious glaring gap.

minimaxir · on June 29, 2023

Because the ChatGPT API (and analogous competitors) is cheap enough that it's both faster and more cost effective to just use it instead instead of using your own model, with maybe some shenanigans to handle its shortcomings without increasing cost much if at all.. And that was before gpt-3.5-turbo-0613, which dropped the price more and is about 2-3x faster.

There are startups that do finetuning on your own data, but with zero hints on how to preprocess your data and absurd costs (both upfront training and GPUs for serving inference) that's it's extremely difficult to argue from a customer business perspective compared to just using an API.

shapefrog · on June 29, 2023

> Why have there been thousands of overnight AI/GPT startups and products in the last few months and NOT a single simple intuitive "fine tuning wizard" app?

Vapourware GPT startup inc is valued at $2bn the afternoon after you form the company and buy your first macbook.

Actual usage of Ai, fine tuning etc. I can offer you $100,000 for 30% of your company if you can demonstrate a fully working product.

SparkyMcUnicorn · on June 29, 2023

This and/or text-generation-webui training doc are a good place to start.

https://github.com/zetavg/LLaMA-LoRA-Tuner

lhl · on June 29, 2023

Here's a good writeup that goes into more depth than most of the READMEs on fine-tuning: https://erichartford.com/uncensored-models

jrflowers · on June 29, 2023

> What use cases do people have for these smaller LLM's?

None. Training a functionally useless model and releasing it is a great way to demonstrate that your company is hip and current. That way when prospective clients ask about AI you can vaguely gesture at some model that you released and say you employ cutting edge AI experts.

wokwokwok · on June 29, 2023

yep, that’s why it’s free.

If it was good, then they’d charge for it.

…what they (and everyone) is gonna do is play with smallish models to iterate on the process for relatively small expense and earn karma.

Then pay big $$$ to make a really good model for internal use and/or an api that people have to pay for.

Tldr; it’s free. It’s by sales force. You should expect it to be a) crippled and b) a loss leader for a paid product.

Not judging; it’s a fair strategy. Just saying: salesforce is not a company that just gives hundreds of thousands of dollars away for nothing.

If you want a good free open model, you’re kidding yourself if you think a corporate giant is going to kiss you on the head and give it to your for free.

jrflowers · on June 29, 2023

> If you want a good free open model, you’re kidding yourself if you think a corporate giant is going to kiss you on the head and give it to your for free.

Yep! That makes sense!

I would love to be the CEO of the company that does give away an actually useful model and little forehead kisses though. The amount of goodwill that one would generate from that would be astronomical and training costs are getting so low that nearly any company with enough cash could do it.

I look forward to waking up and hearing that the Nabisco/Canadian Tire/A&W usefully-tuned model is revolutionizing the economy and seeing the infinite amount of good press that it would generate.

fragmede · on June 29, 2023

OpenAI did this when releasing Whisper, but I mostly hear sneers about how they're not really open, and no gratitude for the "little kiss". Given that, I don't know that as CEO, I'd be very benevolent.

ilaksh · on June 30, 2023

Not quite the same but that's what Stable Foundation did with stable diffusion.

AuryGlenz · on June 29, 2023

To be fair, Stable Diffusion (especially the upcoming SDXL) are good and free.

visarga · on June 29, 2023

I don't agree. LLaMA models are great if you want to run your own models on your own systems, but only if you fine-tune them to specific tasks. The problem is that LLaMA is non-commercial. There was a need for a small efficient pre-trained model to build on. This is what Salesforce released. It's not intended to be used with general purpose prompting like chatGPT.

The problems with chatGPT are many - dependence on third party, privacy, externally imposed ideology and rules, cost, and most importantly - prompting is context-size limited and token-expensive, you can't pack much data into it.

Fine-tuning is a more powerful approach where you can actually fix the model problems instead of futzing around with the prompt and demonstrations. Yes, you got to work on your dataset. But if you don't already have it you can bootstrap with GPT-4 for a small sum.

Meta provided the training wheels - LLaMA, every company tried fine-tuning it for their purposes, but could not proceed for lack of a commercial base model. Salesforce XGen and a few other open-small-LLMs (funny how that sounds!) open the flood gates.

So the recipe is: use an existing dataset, or make one with regular GPT-4 prompting and a bit of curation. Then fine-tune a small open model. You can get it to be better than stock GPT-4, cheaper, faster and private. If you use LoRA's you can save each skill in a separate diff model just 1% the size of the base model and use a single GPU to fine-tune it, in a single day.

basch · on June 29, 2023

If your main product is a business tool to spam people, it would be in your best interest to enable as many new businesses to sprout up and start spamming people as possible. It would also be in your best interest to prevent your competitors from creating that enablement product and selling it as a service, earning revenue, and pumping that money back into their spam product.

These free models are both a defensive move against behemoths, and kindling to rapid business development.

jimmyl02 · on June 29, 2023

The main use case is that it's probably the only size consumers can run on their personal devices. If you don't want your data going into an external platform like OpenAI it's the only solution even if it's not very usuable.

m00x · on June 29, 2023

You can run big models on the cloud yourself, or with a 3090/4090 quantized.

You don't have to go to openai.

aryamaan · on June 29, 2023

What’re some models and hardware combos we can run now? I am avoiding to go to OpenAI with my office’s stuff and can use some gpu(s)

MINIMAN10000 · on June 29, 2023

You would just need a computer which can fit 2 3090s in order to run those to run something like TheBloke/airoboros-65B-gpt4-1.3-GPTQ

https://www.reddit.com/r/LocalLLaMA/wiki/models/ gives you a list of VRAM requirements to load the model into GPU VRAM. the more VRAM the computer has, the larger the model you can load in, thus making 3090s the current consumer grade king due to price to max VRAM.

This being said however most models are LLAMA based which all fall under that specific research license.

So following the rules, you would be limited to a subset of models which are foundational models which allow for commercial use

brucethemoose2 · on June 29, 2023

I can easily run LLaMA 13B on my 6GB VRAM/16GB RAM laptop using llama.cpp (specifically Kobold.cpp as the frontend).

I can barely run 33B, but anything more than 800 context and I oom. But it would run very comfortably on a bigger GPU or a 24GB+ laptop.

Theoretically some phones can comfortably handle 13B on mlc-llm though in practice its not really implemented yet.

int_19h · on June 29, 2023

llama-30B (which is actually 33B) and derivatives generally run fine with 4-bit quantization on a single RTX 3090 or 4090, although depending on group size used for quantization you may need to slightly dial down the context size.

jahewson · on June 29, 2023

These are constraints but not a use case.

verwkljdslfklj · on June 29, 2023

Isn't it obvious they're referring to the use case of using a model given those constraints?

roughly · on June 29, 2023

Yes, but I think the responder is wondering if there are useable use cases for that - like, what can you actually Do with that model. I’m in the same boat - I don’t want to ship my data to openai, I do want to run local, so I’d love to hear what other folks are Doing with models of that size.

sidibe · on June 29, 2023

"Use case of using" :)

reissbaker · on June 29, 2023

They're only modestly worse than text-davinci-003 in my experience, and you can finetune them cheaply to do simple tasks e.g. triage human input and decide where to send it. But yeah, if you can afford to run a larger model or pay OpenAI for gpt4 calls, that's gonna work a lot better.

If you're relying on prompting for the 7B models IMO you're gonna have a bad time — they're mostly toys at that: interesting output but not consistently useful. But finetuning gets better results, and it's cheap to finetune.

steppi · on June 29, 2023

I did some old school NLP before, but don’t really work in the field anymore. As a generative model. As a general purpose generative model, maybe this isn’t very useful. As a foundation to make models to perform text classification and information extraction tasks, this could be very useful. For these kinds of tasks, you can still get good results with the classic bag of words type approaches people were using 30 years ago even. I remember when transformers first came out, limitations in sequence size made them unusable for some classification tasks.

ftxbro · on June 29, 2023

The use case is people who got some fear of missing out on AI and also they are not capable of seeing the difference between the model outputs themselves. Maybe you will say I'm elitist but it's disturbing to me how many people literally can't tell the difference between gpt 3 and gpt 4 outputs, it's like the "It's the same picture" meme. Maybe that makes me a hipster gpt connoisseur elitist that I can recognize and have opinions about those differences between versions and I hate using the small ones and I think only the very biggest ones are good.

ipaddr · on June 29, 2023

It doesn't make you an elitist hipster. It puts you in a category of most people on earth. An elist hipster would swear by a custom smaller parameter model finetuned by a wizard for specific tasks and have a spell book of models.

galaxyLogic · on June 29, 2023

Right, that's the Way of the Hipster

rolisz · on June 29, 2023

I've had very good results with even a 3.5 B model (Fastchat-T5) for retrieval augmented generation (aka putting information into the context window and letting the model rephrase it).

makestuff · on June 29, 2023

I feel like this would work well as a local LLM for a home assistant type setup. Fully local instead of having Alexa send everything to the cloud.

riku_iki · on June 29, 2023

Bert had lots of use cases and this one is supposedly stronger model.

TOMDM · on June 29, 2023

I've used BERT in a number of production apps, it feels like a very apples to oranges comparison given how the AI landscape has changed since BERT's release.

riku_iki · on June 29, 2023

Then looks like you know already where smaller and weaker models are useful.

nextaccountic · on June 29, 2023

What did you use BERT for? Maybe 7B models are up to the task?

brucethemoose2 · on June 29, 2023

Also, their metric table is very interesting. It shows Falcon 7B and OpenLlama 7B much less favorably than other evaluations (including the HuggingFace leaderboard, which I am kinda suspicious of), and instruct benchmarks like that aren't seen as much.

profsummergig · on June 29, 2023

If someone could elucidate on what these phrases signify, I'd be very grateful:

1) 7B foundational model

2) 8K length

3) 1.5T tokens

raincole · on June 29, 2023

A (overly) simplified explanation:

- 7B means 7 billions parameters.

- 8K length means the size of input/output is 8K tokens.

- 1.5T tokens mean the training set has 1.5T tokens.

A: What's a parameter?

Q: More parameters your model has, more complex relationship it can represent. For example let's say you have a function f(x). This is a 2-parameter model:

f(x) = ax + b

This is a 4 parameter model:

f(x) = ax^3 + bx^2 + cx + d

As you can see as the number of parameters grows, the function is able to represent more complex relationship between f(x) and x.

A: What's a token?

Token is a way to encode text, like ASCII or Unicode. Unlike Unicode, tokenizor usually favors common combinations of alphabets. For example, "the" is a single token for GPT-3 tokenizor, but "eht" is two tokens (e and ht).

* Note that the number of parameters is more like an "upper limit" of the model's capabilities. If your a, b, c, d are just random shit, it's still a 4-parameter model, but it's still useless. The whole concept of "training" is just "finding the best parameters".

knaik94 · on June 29, 2023

1) 7B foundational model means this is a base model that has not been fine-tuned on the "prompt: response:" (instruct) structure, and it has 7 billion weights/biases. It's the working size of the model.

2) Currently every model that can run locally was trained with a 2K context size. It's a hard limit on prompt length. There have been recent advances with [A] position interpolation, but those methods explore fine-tuning/loras. This base model was trained with 8k sequences.

3. 1.5T tokens is the size of the total training corpus. Training cost and time increases with training size. [B]

A. https://arxiv.org/abs/2306.15595

B. https://www.semianalysis.com/p/the-ai-brick-wall-a-practical... (Jan 2023)

lhl · on June 29, 2023

Here are some high level answers:

"7B" refers to the number of parameters or weights for a model. For a specific model, the versions with more parameters take more compute power to train and perform better.

A foundational model is the part of a ML model that is "pretrained" on a massive data set (and usually is the bulk of the compute cost). This is usually considered the "raw" model after which it is fine-tuned for specific tasks (turned into a chatbot).

"8K length" refers to the Context Window length (in tokens). This is basically an LLM's short term memory - you can think of it as its attention span and what it can generate reasonable output for.

"1.5T tokens" refers to the size of the corpus of the training set.

In general Wikipedia (or I suppose ChatGPT 4/Bing Chat with Web Browsing) is a decent enough place to start reading/asking basic questions. I'd recommend starting here: https://en.wikipedia.org/wiki/Large_language_model and finding the related concepts.

For those going deeper, there are lot of general resources lists like https://github.com/Hannibal046/Awesome-LLM or https://github.com/Mooler0410/LLMsPracticalGuide or one I like, https://sebastianraschka.com/blog/2023/llm-reading-list.html (there are a bajillion of these and you'll find more once you get a grasp on the terms you want to surf for). Almost everything is published on arXiv, and most is fairly readable even as a layman.

For non-ML programmers looking to get up to speed, I feel like Karpathy's Zero to Hero/nanoGPT or Jay Mody's picoGPT https://jaykmody.com/blog/gpt-from-scratch/ are alternative/maybe a better way to understand the basic concepts on a practical level.

nullptr_deref · on June 29, 2023

1. The trained model has 7B parameters or weights for each neuron.

2. It can handle upto 8k tokens. Tokens are usually some representation for a word. If your tokens are characters then, "h", "e", "y" represent 3 tokens for hey. Most of the algos use byte pair encoding. For example "hand-le" has two tokens "hand" and "le". This is a very crud example which is enough to give the gist but is not accurate. You can look into byte pair encoding for more details.

3. The token size 1.5T token means they have huge variations for input and output. Simply put, it was trained on large data corpus.

I hope this simplifies it. You can research further if you are interested! Hope it helps!

thomasahle · on June 29, 2023

Don't just post ChatGPT answers as comments on hackernews.

This one doesn't even make any sense. Of course it doesn't have 7B parameters _per_ neuron.

nullptr_deref · on June 29, 2023

Sorry doc, I wrote that comment in a smartphone without putting any thought. What I wanted to say was: > there are 7B parameters. A parameter is a weight assigned to single neuron.

I hope this clarifies the answer now.

Now that is done I am quite curious on how you came up with the idea it was written by ChatGPT? I just wanted to simplify as best as I could. It’s funny you thought it that way.

What could I have done so that it didn’t sound like response from ChatGPT? I am asking it to prevent future misunderstandings. I thought my grammatical errors would be enough to show it wasn’t a ChatGPT response.

Looking forward to your reply!

johanvts · on June 29, 2023

Don’t be too critical dr. Ahle. Maybe it’s a new single-neuron architecture.

Sunhold · on June 29, 2023

Doesn't look like ChatGPT. Grammatical errors like "on large data corpus," the poor comma usage, misspelling crude, etc. are more of a human thing.

thomasahle · on June 29, 2023

Maybe you are right. I was confused by sentences like "I hope this simplifies it. You can research further if you are interested! Hope it helps!" which seemed to be responding to a prompt other than just the previous comment.

raincole · on June 29, 2023

This is a good example on why StackOverflow banned ChatGPT-generated answers.

DanAtC · on June 29, 2023

I have no idea what any of these words mean, but I'd like to. Can someone point me in the direction of an "AI for Dipshits"?

esafak · on June 29, 2023

It's an open source language model with seven billion parameters (a measure of its complexity), and a longer than typical sequence length of 8K, which allows the provision of more context when querying the model. For example, it allows you to better generate text in someone else's voice by providing a longer example of their work.

https://en.wikipedia.org/wiki/Foundation_models

sacnoradhq · on June 29, 2023

The number of parameters seems meaningless when the training sets are dogshit and hardened old gum chipped from the shoes of Gregslist and FleaBay.

OTOH, there ought to be a construction in the form of an web app that can pinch out nonrepetitive, coherent ~100 page trashy romance novels in the style of any author given name with open source or specific text(s) or transcripts with enough original input volume: Churchill, The Unabomber, psycho happy kindergarten child development IEP manual writer, The Dude, Walter (agro gun nut), Bob Ross, Grace Hopper, Ayn Rand, LBJ, The Dalai LaMa%, Hitler, Kanye (Ye), Bhad Bhabie, and the King James Bible. Ethical and generational safety features be damned; it'd be generating fucking^2 art for hilarious entertainment purposes. How does one stretch training input to something that might involve human/computer output validation to discard sticking on repetitive nonsense?

% He never saw that one coming Ow^(3 + i).

scubbo · on June 29, 2023

Do you have any recommendations for cryptocoins?

sacnoradhq · on June 30, 2023

Do you always ask flippant and rude questions that add no value to HN?

luxuryballs · on June 29, 2023

dang so this means that just like my 30yo 4GB hard drive is to my 4GB RAM phone… we ain’t seen nothing yet if we’re still counting these metrics?

seeknotfind · on June 29, 2023

Why not ask the AIs themselves? They are pretty good at explaining this type of thing.

deedree · on June 29, 2023

But if you don't know a certain amount about a subject already you won't know when it's lying to you. That would probably be the case here.

CamperBob2 · on June 29, 2023

It's basically what an interactive dialogue with Wikipedia would look like, which is still a darned useful thing.

TeMPOraL · on June 29, 2023

If Wikipedia had the conviction and focus of a 4 year old. But yeah, while tricky, it's still usually net positive.

dcre · on June 29, 2023

Sounds like a joke but it's true.

minimaxir · on June 29, 2023

Per the validation perplexity chart shown, the 8K length model performs better than the 4K length model even at <4K length, so why are they even offering the 4K model if the 8K is strictly better?

numeri · on June 29, 2023

I'd say for research purposes. HackerNews seems to tend to mostly represent the LLM consumer viewpoint, but these waves of models being released are honestly more interesting from a research than a user point of view. As a LLM user, you're (vast generalization/simplification) really just interested in the best of N models, but as a researcher, I'm super interested in each model's performance and analyzing the reasons for any differences.

With this model (and they say this in the blog post), they were testing the hypothesis that training on a longer context size would provide more performance at the same parameter count/inference FLOPs. From a quick perusal of their post, it looks like this was true, and we should train all future models with as long of a context size as we can afford.

artemonster · on June 29, 2023

Please recommend a good tutorial/book/video on modern LLMs and NNs in general, for programmers and technical people. Where you get the idea of how it works. Tried googling with dozens of queries and it just sucks, a lot of hand-wavy articles for lay people or some paid courses.

foolfoolz · on June 29, 2023

when will the llm race peak? have we peaked already?

DantesKite · on June 29, 2023

If you’re curious, you can check the progress of many open source LLM’s and how they perform on various evals here:

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

entrep · on June 29, 2023

With proprietary models: https://chat.lmsys.org/?leaderboard

bratao · on June 29, 2023

I think we are only at the beginning. Call it first generation. LLaMA was even missing many improvements that existed before it (xPos, Multi-Query Attention, Blockwise Parallel Transformer)

Many researchers are improving very fast, and I would bet that soon we will see more efficient LLMs.

sacnoradhq · on June 29, 2023

I don't think we need anymore high-flying search engines, do we?

salesforce air esearch ;)

az226 · on June 29, 2023

We’re still in the early innings.

nico · on June 29, 2023

Long way from that

kristianp · on June 29, 2023

In terms of Gartner hype cycle, we are at the peak of inflated expectations for llms I reckon.

esafak · on June 29, 2023

When it's superhuman.

phillnom · on June 29, 2023

To quote Dennis Reynolds, it hasn't even begun to peak.

rasz · on June 29, 2023

Nobody doing xB models is participating in any AI races, at this point those are useless toys with garbage output.

happycube · on June 29, 2023

How far we've come from GPT2...

TeMPOraL · on June 29, 2023

Biggest models crossed the threshold between glorified Markov chain and proto-intelligence, so sure enough, expectations shoot up.