I initially loved looking for obscure stuff, e.g. setting the region to Soviet Union. It's surely the case that 99% of users want* 10% of the data at most. I'll have to work on the ability to select the file and download & cache it only when a relevant query asks for it.
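A minimal sketch of what that lazy download-and-cache flow could look like (the URL, file naming, and cache location are hypothetical, not from any actual project):

```python
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "region-data"   # hypothetical cache location
BASE_URL = "https://example.com/datasets"            # hypothetical download host

def get_region_file(region: str) -> Path:
    """Fetch a per-region data file on first use, then serve it from the local cache."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local = CACHE_DIR / f"{region}.jsonl"
    if not local.exists():
        # Only download once a query actually asks for this region.
        urllib.request.urlretrieve(f"{BASE_URL}/{region}.jsonl", local)
    return local

# get_region_file("soviet-union") downloads once; later calls hit the cache.
```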
It's important to note that this model excels at reasoning.
But it was deliberately not trained on the big "web crawled" datasets, so it wouldn't learn how to build bombs etc., or otherwise be naughty.
So it is the "smartest thinking" model in its weight class, even comparable to higher-parameter models, but it is not as knowledgeable about the world and trivia.
This might change in the future but it is the current state.
* by want, I mean need. People have self-peasantized heavily over "censored models" and don't really understand how these work, and the SNR is out of whack because there are 100,000x more waifu creators and culture warriors than knowledgeable people sharing on this subject.
If you think of LLMs as having basically two properties - the ability to use natural language, and knowledge to answer questions - then small language models should be seen as simply excellent at natural language. That's great, because for many tasks general knowledge is not needed, especially for RAG.
> This might change in the future but it is the current state
I hope it doesn't change. The focus of a model shouldn't be to embed data. Retrieval is a better method of providing data to a model, and leads to fewer "sounds smart" but very wrong results.
Having less data embedded also means the model is more generally usable outside the realm of chat assistants, where you only want the model to be aware of the data you provide it. One example could be games: in a medieval fantasy setting, it would be really weird if you could get a character to start talking to you about US politics. That probably still wouldn't work with Phi-2 without fine-tuning (as I imagine it does have some data on US politics embedded), but I hope it illustrates the point.
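To make the retrieval point concrete, here's a minimal RAG-style sketch. The embed() function is a toy stand-in for a real embedding model, and llm is whatever small model you call; none of this is a specific library's API:

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: a deterministic pseudo-random unit vector."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).normal(size=384)
    return v / np.linalg.norm(v)

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents most similar to the query by cosine similarity."""
    q = embed(query)
    return sorted(docs, key=lambda d: float(q @ embed(d)), reverse=True)[:k]

def answer(query: str, docs: list[str], llm) -> str:
    """Give the model only the retrieved passages; it needs no embedded world knowledge."""
    context = "\n".join(retrieve(query, docs))
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```

Swap embed() for a real sentence-embedding model and the ranking becomes meaningful; the point is that the facts live in docs, not in the model's weights.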
It was trained on "textbook quality" synthetic data + some high quality web data.
The question is - if we train a model on synthetic data generated by GPT-4 which has copyright issues, what is the status of this model? Will MS have to delete it as well? And all models trained with GPT-4 data?
This "vibe" check that it's even better than GPT-4 Turbo is not what its Elo rating shows on the Chatbot Arena based on not 1 but thousands of user votes.
GPT-4 (Turbo) is in a league of its own still.
That depends on what real-world use you're targeting, but unfortunately I'm not aware of anything better than that leaderboard in terms of sample size and model coverage.
This is based on users choosing the better of two models at a time, and calculating an Elo rating from who beats whom.
BYOT - bring-your-own-tests style.
It gives a better picture of real-world performance and is more robust against contamination.
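For reference, the per-vote Elo update itself is simple. A sketch (the K-factor and starting rating are illustrative; the Arena's exact parameters may differ):

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one head-to-head vote (first model beat the second)."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)  # the winner gains exactly what the loser loses
    return r_winner + delta, r_loser - delta

# Start every model at the same rating and replay the votes in order:
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"])
```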
They collected over 6000 votes for Mixtral-8x7B and over 1500 for Gemini Pro.
While Elo ratings are widely used to rank performance in chess or among sports teams, here's a disclaimer by the makers of the leaderboard:
---
> Please note Arena is a "live eval" and pretty much a sampling process to estimate models capability.
> That's why we show the confidence intervals through bootstrapping. Statistically, these models (e.g., GPT-3.5, Mixtral, Gemini Pro) are very close and only looking at their ranking can be misleading.
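A rough sketch of what "confidence intervals through bootstrapping" means here, reusing elo_update from the sketch above: resample the battle log with replacement many times, recompute the ratings each time, and report percentiles. This is an illustration of the technique, not the leaderboard's actual code:

```python
import random

def bootstrap_ci(battles, n_boot=1000, alpha=0.05):
    """battles: list of (winner, loser) pairs. Returns per-model (low, high) rating bounds."""
    samples = {}
    for _ in range(n_boot):
        resampled = random.choices(battles, k=len(battles))  # sample with replacement
        ratings = {}
        for winner, loser in resampled:
            rw = ratings.setdefault(winner, 1000.0)
            rl = ratings.setdefault(loser, 1000.0)
            ratings[winner], ratings[loser] = elo_update(rw, rl)
        for model, r in ratings.items():
            samples.setdefault(model, []).append(r)
    bounds = {}
    for model, rs in samples.items():
        rs.sort()
        bounds[model] = (rs[int(alpha / 2 * len(rs))], rs[int((1 - alpha / 2) * len(rs)) - 1])
    return bounds
```

Overlapping intervals for two models mean their ranking order isn't statistically meaningful, which is exactly the caveat above.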
Exclude movies with a very low number of ratings, or potentially very low scores too.
The long tail reduction would be significant.
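Something like this, with made-up column names and thresholds (pandas, assuming a flat ratings table):

```python
import pandas as pd

movies = pd.read_csv("ratings.csv")  # hypothetical file with num_votes and avg_rating columns

# Drop the long tail: barely-rated titles and very low scorers.
kept = movies[(movies["num_votes"] >= 100) & (movies["avg_rating"] >= 2.0)]
print(f"kept {len(kept)} of {len(movies)} titles")
```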