> The training recipe and model architecture follow LLaMA
This is huge.
MPT and Falcon are cool, but the inference runtimes and various tooling are mostly optimized for LLaMA. If this is a drop-in replacement for 7B, it's going to catch on much faster than any other small model.
XGen-7B is probably the superior 7B model: it's trained on more tokens and with a longer default sequence length (although both presumably can adopt SuperHOT (Position Interpolation) to extend context). But larger models still probably perform better on an absolute basis.
Those all look like programming-related tokens, which I think relates to their focus on improving this model's code generation capabilities and adding quite a bit of data for that (BigCode Starcoder in the second stage of pre-training: https://blog.salesforceairesearch.com/xgen/#pre-training-dat... )
Looks like this is the limitation that prevents it from being plugged straight into llama.cpp (it doesn't have a saved tokeniser model, and I'm not sure how one would go about creating one). Otherwise it would be cool to try out; the metrics in the article make it promising for something running locally on an M1 Mac...
7B LLaMA is a terrible general purpose model, but the finetunes are pretty good at very specific roles, like dialogue/roleplay, a dungeon master bot or even code completion.
The metrics are good though, perhaps placing this closer to 13B.
And 8K context is huge. When you can stuff that much example text in, it gives the model more to "latch onto," and it's also the point where you start worrying about RAM/VRAM consumption for a ~13B model.
You must have missed the memo... It's now super easy to extend the context of 2k llama models to 8k, 16k, or even 32k with just a small fine tune and a tweak to the code.
You still need the memory to be able to go that high, but it's totally doable.
But I assumed full training would give better perplexity for large contexts, and perhaps this method would be more effective at 16K+ with an 8K model to start with.
Possibly, but perplexity has been shown to decrease while fine-tuning a 2048-context model on larger context sizes, even for outputs within its original context limit... so, more research needed.
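For anyone wondering what "a tweak to the code" amounts to in practice: position interpolation just rescales the rotary-embedding positions so that, say, 8K positions map back into the 0-2047 range the model saw during pre-training, and then a short fine-tune lets the model adapt. A minimal sketch of the idea (illustrative only, not the actual llama.cpp or HF code; the function and argument names here are made up):

    import torch

    def rope_angles(head_dim, positions, base=10000.0, scale=1.0):
        # Standard RoPE frequencies; scale < 1.0 is the position-interpolation tweak.
        # e.g. scale = 2048 / 8192 squeezes 8K positions into the trained 2K range.
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        angles = torch.outer(positions.float() * scale, inv_freq)
        return torch.cos(angles), torch.sin(angles)

    # Stock 2K model, unchanged:
    cos, sin = rope_angles(128, torch.arange(2048), scale=1.0)
    # The same model pushed to 8K context via interpolation (plus a short fine-tune):
    cos8k, sin8k = rope_angles(128, torch.arange(8192), scale=2048 / 8192)

The memory caveat above still applies: the KV cache and activations at 8K+ are what actually eat your RAM/VRAM; the code change itself is nearly free.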
Some model datasets like Manticore, Chronos or the infamous Pygmalion are more "secretive," but you can find the dataset gathering scripts on Github or in community chats.
That blog post demonstrates that it's not "easily" finetuneable, just possible to finetune. There are many technical considerations beyond hardware (dataset formatting, training hyperparameter nuances) that keep it from being accessible to a newbie experimenting with LLMs.
It's a rabbithole, and unfortunately there are no good shortcuts.
It isn't script-kiddie level, but it isn't hard. I finetuned a 15B parameter reddit bot with an afternoon of time and a day of training on a 3090. The bot got a few thousand karma in a couple of days before I turned it off (proof of concept done).
If all you have is an M1 or whatever, yeah, you need a real workstation, and depending on your use case ChatGPT might be cheaper/better.
Why have there been thousands of overnight AI/GPT startups and products in the last few months and NOT a single simple intuitive "fine tuning wizard" app? That seems like such an obvious glaring gap.
Because the ChatGPT API (and analogous competitors) is cheap enough that it's both faster and more cost-effective to just use it instead of hosting your own model, with maybe some shenanigans to handle its shortcomings without increasing cost much, if at all. And that was before gpt-3.5-turbo-0613, which dropped the price further and is about 2-3x faster.
There are startups that do finetuning on your own data, but with zero hints on how to preprocess that data and absurd costs (both upfront training and GPUs for serving inference), it's extremely difficult to justify from a customer business perspective compared to just using an API.
> Why have there been thousands of overnight AI/GPT startups and products in the last few months and NOT a single simple intuitive "fine tuning wizard" app?
Vapourware GPT startup inc is valued at $2bn the afternoon after you form the company and buy your first macbook.
Actual usage of AI, fine-tuning, etc.? I can offer you $100,000 for 30% of your company if you can demonstrate a fully working product.
> What use cases do people have for these smaller LLM's?
None. Training a functionally useless model and releasing it is a great way to demonstrate that your company is hip and current. That way when prospective clients ask about AI you can vaguely gesture at some model that you released and say you employ cutting edge AI experts.
…what they (and everyone else) are gonna do is play with smallish models to iterate on the process for relatively small expense and earn karma.
Then pay big $$$ to make a really good model for internal use and/or an api that people have to pay for.
TL;DR: it's free. It's by Salesforce. You should expect it to be a) crippled and b) a loss leader for a paid product.
Not judging; it's a fair strategy. Just saying: Salesforce is not a company that just gives hundreds of thousands of dollars away for nothing.
If you want a good free open model, you're kidding yourself if you think a corporate giant is going to kiss you on the head and give it to you for free.
> If you want a good free open model, you're kidding yourself if you think a corporate giant is going to kiss you on the head and give it to you for free.
Yep! That makes sense!
I would love to be the CEO of the company that does give away an actually useful model and little forehead kisses though. The amount of goodwill that one would generate from that would be astronomical and training costs are getting so low that nearly any company with enough cash could do it.
I look forward to waking up and hearing that the Nabisco/Canadian Tire/A&W usefully-tuned model is revolutionizing the economy and seeing the infinite amount of good press that it would generate.
OpenAI did this when releasing Whisper, but I mostly hear sneers about how they're not really open, and no gratitude for the "little kiss". Given that, I don't know that as CEO, I'd be very benevolent.
I don't agree. LLaMA models are great if you want to run your own models on your own systems, but only if you fine-tune them for specific tasks. The problem is that LLaMA is non-commercial. There was a need for a small, efficient pre-trained model to build on. This is what Salesforce released. It's not intended to be used with general purpose prompting like ChatGPT.
The problems with ChatGPT are many: dependence on a third party, privacy, externally imposed ideology and rules, cost, and most importantly, prompting is context-size limited and token-expensive; you can't pack much data into it.
Fine-tuning is a more powerful approach where you can actually fix the model's problems instead of futzing around with the prompt and demonstrations. Yes, you've got to work on your dataset. But if you don't already have one, you can bootstrap it with GPT-4 for a small sum.
Meta provided the training wheels with LLaMA: every company tried fine-tuning it for their purposes but could not proceed for lack of a commercially usable base model. Salesforce XGen and a few other open small LLMs (funny how that sounds!) open the flood gates.
So the recipe is: use an existing dataset, or make one with regular GPT-4 prompting and a bit of curation. Then fine-tune a small open model. You can get it to be better than stock GPT-4 on that narrow task, cheaper, faster, and private. If you use LoRAs, you can save each skill in a separate diff model just 1% the size of the base model, and use a single GPU to fine-tune it in a single day.
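A rough sketch of that recipe with HuggingFace peft, for the curious. The checkpoint name is the one I believe Salesforce published, and target_modules assumes LLaMA-style attention projection names since the architecture follows LLaMA; double-check both against the actual model card before copying:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "Salesforce/xgen-7b-8k-base"
    tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)  # custom tokenizer
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto", torch_dtype="auto")

    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumption: LLaMA-style module names
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically around 1% or less of the base weights

    # ...train on your curated / GPT-4-bootstrapped dataset with the usual Trainer loop...
    model.save_pretrained("my-skill-lora")  # saves only the small adapter, not the 7B base

Each "skill" ends up as its own tiny adapter directory you can load on top of the shared base model.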
If your main product is a business tool to spam people, it would be in your best interest to enable as many new businesses to sprout up and start spamming people as possible. It would also be in your best interest to prevent your competitors from creating that enablement product and selling it as a service, earning revenue, and pumping that money back into their spam product.
These free models are both a defensive move against behemoths, and kindling to rapid business development.
The main use case is that it's probably the only size consumers can run on their personal devices. If you don't want your data going into an external platform like OpenAI, it's the only solution, even if it's not very usable.
You would just need a computer that can fit two 3090s to run something like TheBloke/airoboros-65B-gpt4-1.3-GPTQ.
https://www.reddit.com/r/LocalLLaMA/wiki/models/ gives you a list of VRAM requirements for loading models into GPU VRAM. The more VRAM the computer has, the larger the model you can load, which makes 3090s the current consumer-grade king on price per GB of VRAM.
That being said, most models are LLaMA-based, and those all fall under that specific research license.
So, following the rules, you would be limited to the subset of foundational models that allow commercial use.
llama-30B (which is actually 33B) and derivatives generally run fine with 4-bit quantization on a single RTX 3090 or 4090, although depending on the group size used for quantization you may need to slightly dial down the context size.
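For reference, loading one of those GPTQ checkpoints looks roughly like this with the auto-gptq package (argument names from memory, so treat them as a sketch; the model cards on the hub usually carry the exact snippet):

    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    repo = "TheBloke/airoboros-65B-gpt4-1.3-GPTQ"  # the 65B mentioned above needs two 24GB cards
    tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(
        repo,
        device_map="auto",       # shard the layers across both 3090s
        use_safetensors=True,
    )

    prompt = tokenizer("Hello,", return_tensors="pt").to("cuda:0")
    print(tokenizer.decode(model.generate(**prompt, max_new_tokens=32)[0]))

For a 30B/33B GPTQ model the same code fits on a single 3090/4090, subject to the group-size/context caveat above.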
Yes, but I think the responder is wondering if there are usable use cases for that - like, what can you actually Do with that model. I'm in the same boat - I don't want to ship my data to openai, I do want to run local, so I'd love to hear what other folks are Doing with models of that size.
They're only modestly worse than text-davinci-003 in my experience, and you can finetune them cheaply to do simple tasks e.g. triage human input and decide where to send it. But yeah, if you can afford to run a larger model or pay OpenAI for gpt4 calls, that's gonna work a lot better.
If you're relying on prompting for the 7B models IMO you're gonna have a bad time — they're mostly toys at that: interesting output but not consistently useful. But finetuning gets better results, and it's cheap to finetune.
I did some old school NLP before, but don't really work in the field anymore. As a general purpose generative model, maybe this isn't very useful. As a foundation for models that perform text classification and information extraction tasks, this could be very useful. For those kinds of tasks, you can still get good results with the classic bag-of-words approaches people were using even 30 years ago. I remember when transformers first came out, limitations in sequence size made them unusable for some classification tasks.
The use case is people who have some fear of missing out on AI and also aren't capable of seeing the difference between the model outputs themselves. Maybe you will say I'm elitist, but it's disturbing to me how many people literally can't tell the difference between GPT-3 and GPT-4 outputs; it's like the "It's the same picture" meme. Maybe that makes me a hipster GPT connoisseur elitist, that I can recognize and have opinions about the differences between versions, hate using the small ones, and think only the very biggest ones are good.
It doesn't make you an elitist hipster. It puts you in the same category as most people on earth. An elitist hipster would swear by a custom smaller-parameter model finetuned by a wizard for specific tasks and have a spell book of models.
I've had very good results with even a 3.5 B model (Fastchat-T5) for retrieval augmented generation (aka putting information into the context window and letting the model rephrase it).
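For anyone who hasn't tried it, the whole trick is just retrieval plus prompt assembly; something like the sketch below, where embed() and generate() stand in for whatever embedding model and local LLM you're running (placeholder names, not a real API):

    import numpy as np

    def top_k(query_vec, chunk_vecs, k=3):
        # cosine similarity between the query and each stored document chunk
        sims = chunk_vecs @ query_vec / (
            np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
        )
        return np.argsort(-sims)[:k]

    def answer(question, chunks, chunk_vecs, embed, generate):
        hits = [chunks[i] for i in top_k(embed(question), chunk_vecs)]
        prompt = (
            "Answer using only the context below.\n\n"
            + "\n---\n".join(hits)
            + f"\n\nQuestion: {question}\nAnswer:"
        )
        return generate(prompt)

Even a small model holds up here because it only has to rephrase what's already in the prompt rather than recall facts from its weights.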
I've used BERT in a number of production apps, it feels like a very apples to oranges comparison given how the AI landscape has changed since BERT's release.
Also, their metric table is very interesting. It shows Falcon 7B and OpenLlama 7B much less favorably than other evaluations (including the HuggingFace leaderboard, which I am kinda suspicious of), and instruct benchmarks like that aren't seen as much.
- 8K length means the combined size of the input and output is 8K tokens.
- 1.5T tokens mean the training set has 1.5T tokens.
Q: What's a parameter?
A: The more parameters your model has, the more complex relationships it can represent. For example, let's say you have a function f(x). This is a 2-parameter model:
f(x) = ax + b
This is a 4 parameter model:
f(x) = ax^3 + bx^2 + cx + d
As you can see, as the number of parameters grows, the function is able to represent more complex relationships between f(x) and x.
Q: What's a token?
A: A token is a way to encode text, like ASCII or Unicode. Unlike Unicode, a tokenizer usually favors common combinations of characters. For example, "the" is a single token for the GPT-3 tokenizer, but "eht" is two tokens ("e" and "ht").
* Note that the number of parameters is more like an "upper limit" of the model's capabilities. If your a, b, c, d are just random shit, it's still a 4-parameter model, but it's still useless. The whole concept of "training" is just "finding the best parameters".
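To make that concrete with the 4-parameter example above: "training" is literally just solving for a, b, c, d from data. Here it's done with a least-squares fit; a neural net does the same thing with gradient descent and billions of parameters:

    import numpy as np

    x = np.linspace(-2, 2, 200)
    y = 1.5 * x**3 - 2.0 * x + 0.5 + np.random.normal(0, 0.1, x.shape)  # noisy observations

    a, b, c, d = np.polyfit(x, y, deg=3)  # "training": find the 4 best parameters
    print(a, b, c, d)  # should land near 1.5, 0.0, -2.0, 0.5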
1) 7B foundational model means this is a base model that has not been fine-tuned on the "prompt: response:" (instruct) structure, and it has 7 billion weights/biases. It's the working size of the model.
2) Until now, nearly every model that can run locally was trained with a 2K context size, which is a hard limit on prompt length. There have been recent advances with [A] position interpolation, but those methods rely on fine-tuning/LoRAs. This base model was trained with 8K sequences.
3) 1.5T tokens is the size of the total training corpus. Training cost and time increase with training size. [B]
"7B" refers to the number of parameters or weights for a model. For a specific model, the versions with more parameters take more compute power to train and perform better.
A foundational model is the part of a ML model that is "pretrained" on a massive data set (and usually is the bulk of the compute cost). This is usually considered the "raw" model after which it is fine-tuned for specific tasks (turned into a chatbot).
"8K length" refers to the Context Window length (in tokens). This is basically an LLM's short term memory - you can think of it as its attention span and what it can generate reasonable output for.
"1.5T tokens" refers to the size of the corpus of the training set.
In general Wikipedia (or I suppose ChatGPT 4/Bing Chat with Web Browsing) is a decent enough place to start reading/asking basic questions. I'd recommend starting here: https://en.wikipedia.org/wiki/Large_language_model and finding the related concepts.
For non-ML programmers looking to get up to speed, I feel like Karpathy's Zero to Hero/nanoGPT or Jay Mody's picoGPT https://jaykmody.com/blog/gpt-from-scratch/ are an alternative/maybe better way to understand the basic concepts on a practical level.
1. The trained model has 7B parameters or weights for each neuron.
2. It can handle up to 8K tokens. Tokens are usually some representation of a word. If your tokens are characters, then "h", "e", "y" represent 3 tokens for "hey". Most of the algos use byte pair encoding; for example, "hand-le" has two tokens, "hand" and "le". This is a very crude example which is enough to give the gist but is not accurate. You can look into byte pair encoding for more details (see the tokenizer sketch after this comment).
3. 1.5T tokens means the training data covers huge variation in inputs and outputs. Simply put, it was trained on a large data corpus.
I hope this simplifies it. You can research further if you are interested! Hope it helps!
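If you want to poke at tokenization yourself, OpenAI's tiktoken package exposes real BPE vocabularies (GPT-2's encoding here; the exact splits depend on the vocabulary, so run it rather than trust my memory):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for text in ["hey", "handle", "the", "eht"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{text!r} -> {len(ids)} token(s): {pieces}")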
Sorry doc, I wrote that comment on a smartphone without putting much thought into it. What I wanted to say was:
> there are 7B parameters. A parameter is a weight assigned to a single neuron.
I hope this clarifies the answer now.
Now that that's done, I am quite curious how you came up with the idea that it was written by ChatGPT. I just wanted to simplify as best as I could. It's funny you thought of it that way.
What could I have done so that it didn't sound like a response from ChatGPT? I am asking to prevent future misunderstandings. I thought my grammatical errors would be enough to show it wasn't a ChatGPT response.
Maybe you are right. I was confused by sentences like "I hope this simplifies it. You can research further if you are interested! Hope it helps!" which seemed to be responding to a prompt other than just the previous comment.
It's an open source language model with seven billion parameters (a measure of its complexity), and a longer than typical sequence length of 8K, which allows the provision of more context when querying the model. For example, it allows you to better generate text in someone else's voice by providing a longer example of their work.
The number of parameters seems meaningless when the training sets are dogshit and hardened old gum chipped from the shoes of Gregslist and FleaBay.
OTOH, there ought to be a construction in the form of a web app that can pinch out nonrepetitive, coherent ~100-page trashy romance novels in the style of any named author, given open source or specific text(s) or transcripts with enough original input volume: Churchill, The Unabomber, psycho happy kindergarten child development IEP manual writer, The Dude, Walter (agro gun nut), Bob Ross, Grace Hopper, Ayn Rand, LBJ, The Dalai LaMa, Hitler, Kanye (Ye), Bhad Bhabie, and the King James Bible. Ethical and generational safety features be damned; it'd be generating fucking^2 art for hilarious entertainment purposes. How does one stretch limited training input, maybe with human/computer output validation, so the model doesn't get stuck on repetitive nonsense?
Per the validation perplexity chart shown, the 8K length model performs better than the 4K length model even at <4K length, so why are they even offering the 4K model if the 8K is strictly better?
I'd say for research purposes. HackerNews seems to tend to mostly represent the LLM consumer viewpoint, but these waves of models being released are honestly more interesting from a research than a user point of view. As a LLM user, you're (vast generalization/simplification) really just interested in the best of N models, but as a researcher, I'm super interested in each model's performance and analyzing the reasons for any differences.
With this model (and they say this in the blog post), they were testing the hypothesis that training on a longer context size would provide more performance at the same parameter count/inference FLOPs. From a quick perusal of their post, it looks like this was true, and we should train all future models with as long of a context size as we can afford.
Please recommend a good tutorial/book/video on modern LLMs and NNs in general, for programmers and technical people, where you get the idea of how it all works. I tried googling with dozens of queries and it just sucks: lots of hand-wavy articles for lay people or paid courses.
I think we are only at the beginning. Call it the first generation. LLaMA was even missing many improvements that existed before it (xPos, Multi-Query Attention, Blockwise Parallel Transformer).
Many researchers are improving very fast, and I would bet that soon we will see more efficient LLMs.