A 0.5B parameter model with a 32k context length that also makes good use of that full window?! That's very interesting.
The academic benchmarks on that particular model relative to 1.5B-2B models are what you would expect, but it would make for an excellent base for finetuning/embedding generation.
Qwen1.5-0.5B supposedly supported up to 32k context as well, but I can't even get it to summarize a ~2k token input with any level of coherence.
I'm always excited to try a new model, so I'm looking forward to trying Qwen2-0.5B... but I wouldn't get your hopes up this much. These super tiny models seem far more experimental than the larger LLMs.
Phi-3-mini (3.8B) supports a 128k context, and it is actually a reasonably useful model in my tests. Gemma-1.1-2B-it only supports an 8k context, but it also does fairly well for summarization.
Summarization is one of the most difficult tasks for any LLM, and over a context window that long, it's crazy to think it could do it.
That context window is useful if you have a smaller data extraction task, like dates, times, place names, etc. And even then it might need to be fine-tuned. These small models are a feedstock.
What tasks do you consider a 3.8B model to be useful for? Chat applications on lesser hardware, maybe, but I'm still finding it difficult to parse what the real-world application would ever be. I do understand that the goal is to make the smallest, most efficient model that can one day compete with larger models' capabilities, and you can't get there without building these. But do these types of models have any value for any sort of product or real-world project?
I think most of the interesting applications for these small models are in the form of developer-driven automations, not chat interfaces.
A common example that keeps popping up is a voice recorder app that can provide not just a transcription of the recording (which you don't need an LLM for), but also a summary of the transcription, including key topics, key findings, and action items that were discussed in a meeting. With speaker diarization (assigning portions of the transcript to different speakers automatically), it's even possible to use an LLM to assign names to each of the speakers in the transcript, if they ever identified themselves in the meeting, and then the LLM could take that and also know who is supposed to be handling each action item, if that was discussed in the meeting. That's just scratching the surface of what should be possible using small LLMs (or SLMs, as Microsoft likes to call them).
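The glue for that kind of pipeline is mostly string-building around the model calls. A minimal sketch, assuming the diarization step already produced labeled segments (the transcription and LLM calls themselves are placeholders, not a real API):

```python
# Sketch of the meeting-notes pipeline described above. Only the glue
# logic is real; the prompt would go to an on-device LLM in a real app.
def build_meeting_prompt(transcript_segments):
    """transcript_segments: list of (speaker_label, text) pairs from diarization."""
    # Flatten the diarized transcript so the LLM can attribute
    # action items to the speakers who claimed them.
    lines = [f"{speaker}: {text}" for speaker, text in transcript_segments]
    return (
        "Summarize this meeting. List key topics, key findings, and "
        "action items with the responsible speaker:\n" + "\n".join(lines)
    )

segments = [
    ("SPEAKER_1", "Hi, this is Dana."),
    ("SPEAKER_2", "Bob here. I'll send the report Friday."),
]
print(build_meeting_prompt(segments))
```

The speaker-naming trick falls out of the same prompt: since "this is Dana" appears under SPEAKER_1, the model can map labels to names without any extra machinery.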
An on-device LLM could summarize notifications if you have a lot of catching up to do, or it could create a title for a note automatically once you finish typing the note, or it could be used to automatically suggest tags/categories for notes. That LLM could be used to provide "completions", like if the user is writing a list of things in a note, the user could click a button to have that LLM generate several more items following the same theme. That LLM can be used to suggest contextually-relevant quick replies for conversations. In a tightly-integrated system, you could imagine receiving a work phone call, and that LLM could automatically summarize your recent interactions with that person (across sms, email, calendar, and slack/teams) for you on the call screen, which could remind you why they're calling you.
LLMs can also be used for data extraction, where they can be given unstructured text, and fill in a data structure with the desired values. As an example, one could imagine browsing a job posting... the browser could use an LLM to detect that the primary purpose of this webpage is a job posting, and then it could pass the text of the page through the LLM and ask the LLM to fill in common values like the job title, company name, salary range, and job requirements, and then the browser could offer a condensed interface with this information, as well as the option to save this information (along with the URL to the job posting) to your "job search" board with one click.
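The data-extraction pattern is usually: ask for a fixed JSON schema, then defensively parse whatever comes back. A sketch, where the schema fields and the canned model reply are illustrative assumptions (small models often wrap JSON in prose, hence the brace-hunting):

```python
import json

# Hypothetical schema for the job-posting example; not a real browser API.
SCHEMA = ["job_title", "company_name", "salary_range", "job_requirements"]

def build_extraction_prompt(page_text: str) -> str:
    """Ask the model to fill a fixed JSON schema from unstructured text."""
    fields = ", ".join(f'"{f}"' for f in SCHEMA)
    return (
        "Extract the following fields from the job posting below and reply "
        f"with only a JSON object containing the keys {fields}. "
        "Use null for anything not mentioned.\n\n" + page_text
    )

def parse_extraction(raw_reply: str) -> dict:
    """Small models often wrap JSON in prose; grab the first {...} span."""
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model reply")
    data = json.loads(raw_reply[start:end + 1])
    # Keep only schema keys so downstream code sees a stable shape.
    return {k: data.get(k) for k in SCHEMA}

# Canned reply standing in for the actual model call:
reply = ('Sure! {"job_title": "Backend Engineer", "company_name": "Acme", '
         '"salary_range": null, "job_requirements": ["Go", "SQL"]}')
print(parse_extraction(reply)["job_title"])  # Backend Engineer
```

Normalizing to a fixed key set is what makes the "save to board with one click" UI possible: the interface never has to guess what shape the model returned.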
Now, it might be a little much to ask a browser to have special cases just for job postings, when there are so many similar things a user might want to save for later. So you could even let the user define new "boards": the user describes to a (hopefully larger) LLM the purpose of the board and the kinds of information they're looking for, and it generates the search parameters and data extraction tasks that a smaller LLM then runs in the background as you browse, letting the browser present that information when it's available so you can choose whether to save it to your board. The larger LLM could still potentially be on-device, but a more powerful LLM that occupies most of the RAM and processing on your device is something you'd only want to use for a foreground task, not eating up resources in the background.
LLMs are interesting because they make it possible to do things that traditional programming could not do in any practical sense. If something can be done without an LLM, then absolutely... do that. LLMs are very computationally intensive, and their accuracy is more like a human than a computer. There are plenty of drawbacks to LLMs, if you have another valid option.
If you are resource limited, remember that you can also play with the quantization to fit more parameters into less RAM. Phi-3-mini [1] (a 3.8B model) is 7.64GB with full (16-bit floating point) precision, but it is only 2.39GB when quantized to 4 bits.
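The back-of-envelope math is just parameters times bytes per parameter. A quick sketch (the overhead factor is an assumption standing in for the quantization scales/zero-points that formats like GGUF store alongside the weights):

```python
# Rough model memory footprint: params * bits / 8, in GB.
def model_size_gb(params: float, bits_per_param: float, overhead: float = 1.0) -> float:
    return params * bits_per_param / 8 / 1e9 * overhead

phi3_params = 3.8e9
print(round(model_size_gb(phi3_params, 16), 2))  # 7.6 -> matches the ~7.64GB fp16 figure
print(round(model_size_gb(phi3_params, 4), 2))   # 1.9 raw weights
```

The raw 4-bit figure comes out under 2GB; real 4-bit files land nearer the 2.39GB above because of per-group scale metadata and because some layers (e.g. embeddings) are often kept at higher precision.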
That being said, I haven't personally tested it, but I have heard good things about CodeGemma 2B [2].
CodeGemma-2B does not come in an "-it" (instruction-tuned) variant, so it can't be used in a chat context. It is just a base model designed for tab completion of code in an editor, which, I agree, it is pretty good at.
To be honest, the "Needle in a Haystack" test is the most trivial test for a model that relies on full attention; it's expected to be easy to pass if the model was trained correctly.
I just hope people don't claim "model X supports a Y context window" when the evaluation is done on "Needle in a Haystack" alone. It creates so much unnecessary hype.
I wonder if the 0.5B model would be usable for ML tasks like summarization, classification, or embeddings, replacing the small models (like spaCy's) usually used for embeddings.
It won't. Amazon kind of went that angle with MistralLite [1] (a 7B finetune), and it was barely passable as an effective summarizer. 0.5B models are pretty much useless for that.
The official Mistral-7B-v0.2 model added support for 32k context, and I think it's far better than MistralLite. Third-party finetunes are rarely amazing at the best of times.
Now, we have Mistral-7B-v0.3, which is supposedly an even better model.
My experience is that < 500M models are pretty useful when fine-tuned on traditional NLP tasks, such as text classification and sentence-/token-level labeling. A modern LM with a 32K context window could be a nice replacement for BERT, RoBERTa, or BART.
A properly finetuned model can perform better for a given use case, but even with PEFT/LoRAs, finetuning and managing "smaller" open-source LLMs (7B params) like Llama 3 is annoying. That's partially why the even-smaller ~2B Phi series of models took off.
A 0.5B model may not be that great out of the box, but there's a lot of opportunity if it's responsive to finetuning.
Yeah, smaller models are fantastic for finetuning and probably for on-device applications - they can act as a first pass on most LLM applications, and if a task requires a larger model to intervene, they can pass it off to one. I do have a Colab to finetune 0.5B 2x faster here for those interested: https://colab.research.google.com/drive/1-7tjDdMAyeCueyLAwv6...
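That first-pass idea can be sketched as a simple cascade: try the small model, escalate only when its answer looks unreliable. Everything here is a stub, and the confidence score is a stand-in for whatever signal you actually have (token logprobs, a verifier, a refusal check):

```python
from typing import Callable, Tuple

def cascade(prompt: str,
            small: Callable[[str], Tuple[str, float]],
            large: Callable[[str], str],
            threshold: float = 0.7) -> str:
    """Answer with the small model when confident, else fall back to the large one."""
    answer, confidence = small(prompt)
    if confidence >= threshold:
        return answer      # cheap path: small model was confident enough
    return large(prompt)   # expensive path: spend the compute

# Stub models for illustration:
small_model = lambda p: ("maybe", 0.4)
large_model = lambda p: "definitely"
print(cascade("classify this ticket", small_model, large_model))  # definitely
```

The threshold is the knob: set it high and you pay for the large model often but rarely ship a bad small-model answer; set it low and the cascade is mostly free.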
Tiny LLMs (oxymoron?) could be used for text completion, predictive keyboards, compression, to improve OCR and speech transcription, etc...
For these applications, you don't need a super smart model; it just needs to give out hints. For completion, the user can simply ignore the suggestion if it's not what they want; for compression, a weak model just lowers the compression ratio a bit; and for transcription, it's only used for disambiguation. For none of these applications is an LLM strictly needed, but it can improve the results.
Note that for OCR/transcription, I am a bit wary of engines that are too smart, as a lot of them are today, and LLMs go in this direction. The results are often objectively better (more words are properly transcribed), but the words that are not are often subtly wrong: the output makes sense, but it is not the right transcription. With nonsense, at least, we know it can't be trusted and we act accordingly; people can also be quite good at filling in the blanks themselves. That means more effort, but maybe better understanding in the end.