A 0.5B parameter model with a 32k context length that also makes good use of that full window?! That's very interesting.
The academic benchmarks on that particular model relative to 1.5B-2B models are what you would expect, but it would make for an excellent base for finetuning/embedding generation.
Qwen1.5-0.5B supposedly supported up to 32k context as well, but I can't even get it to summarize a ~2k token input with any level of coherence.
I'm always excited to try a new model, so I'm looking forward to trying Qwen2-0.5B... but I wouldn't get your hopes up this much. These super tiny models seem far more experimental than the larger LLMs.
Phi-3-mini (3.8B) supports a 128k context, and it is actually a reasonably useful model in my tests. Gemma-1.1-2B-it only supports an 8k context, but it also does fairly well for summarization.
Summarization is one of the most difficult tasks for any LLM, and over a context window that long, it's crazy to think it could do it.
That context window is useful if you have a smaller data extraction task, like dates, times, place names, etc. And even then it might need to be fine-tuned. These small models are a feedstock.
What tasks do you consider a 3.8B model to be useful for? Chat applications on lesser hardware, maybe, but I'm still finding it difficult to parse what the real-world application would ever be. I do understand that the goal is to make the smallest, most efficient model that can one day compete with larger models' capabilities, and you can't get there without building these. But do these types of models have any value for any sort of product or real-world project?
I think most of the interesting applications for these small models are in the form of developer-driven automations, not chat interfaces.
A common example that keeps popping up is a voice recorder app that can provide not just a transcription of the recording (which you don't need an LLM for), but also a summary of the transcription, including key topics, key findings, and action items that were discussed in a meeting. With speaker diarization (assigning portions of the transcript to different speakers automatically), it's even possible to use an LLM to assign names to each of the speakers in the transcript, if they ever identified themselves in the meeting, and then the LLM could take that and also know who is supposed to be handling each action item, if that was discussed in the meeting. That's just scratching the surface of what should be possible using small LLMs (or SLMs, as Microsoft likes to call them).
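The glue for that kind of pipeline is mostly string-building around the model calls. A minimal sketch, assuming the diarization step already produced labeled segments (the transcription and LLM calls themselves are placeholders, not a real API):

```python
# Sketch of the meeting-notes pipeline described above. Only the glue
# logic is real; the prompt would go to an on-device LLM in a real app.
def build_meeting_prompt(transcript_segments):
    """transcript_segments: list of (speaker_label, text) pairs from diarization."""
    # Flatten the diarized transcript so the LLM can attribute
    # action items to the speakers who claimed them.
    lines = [f"{speaker}: {text}" for speaker, text in transcript_segments]
    return (
        "Summarize this meeting. List key topics, key findings, and "
        "action items with the responsible speaker:\n" + "\n".join(lines)
    )

segments = [
    ("SPEAKER_1", "Hi, this is Dana."),
    ("SPEAKER_2", "Bob here. I'll send the report Friday."),
]
print(build_meeting_prompt(segments))
```

The speaker-naming trick falls out of the same prompt: since "this is Dana" appears under SPEAKER_1, the model can map labels to names without any extra machinery.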
An on-device LLM could summarize notifications if you have a lot of catching up to do, or it could create a title for a note automatically once you finish typing the note, or it could be used to automatically suggest tags/categories for notes. That LLM could be used to provide "completions", like if the user is writing a list of things in a note, the user could click a button to have that LLM generate several more items following the same theme. That LLM can be used to suggest contextually-relevant quick replies for conversations. In a tightly-integrated system, you could imagine receiving a work phone call, and that LLM could automatically summarize your recent interactions with that person (across sms, email, calendar, and slack/teams) for you on the call screen, which could remind you why they're calling you.
LLMs can also be used for data extraction, where they can be given unstructured text, and fill in a data structure with the desired values. As an example, one could imagine browsing a job posting... the browser could use an LLM to detect that the primary purpose of this webpage is a job posting, and then it could pass the text of the page through the LLM and ask the LLM to fill in common values like the job title, company name, salary range, and job requirements, and then the browser could offer a condensed interface with this information, as well as the option to save this information (along with the URL to the job posting) to your "job search" board with one click.
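The data-extraction pattern is usually: ask for a fixed JSON schema, then defensively parse whatever comes back. A sketch, where the schema fields and the canned model reply are illustrative assumptions (small models often wrap JSON in prose, hence the brace-hunting):

```python
import json

# Hypothetical schema for the job-posting example; not a real browser API.
SCHEMA = ["job_title", "company_name", "salary_range", "job_requirements"]

def build_extraction_prompt(page_text: str) -> str:
    """Ask the model to fill a fixed JSON schema from unstructured text."""
    fields = ", ".join(f'"{f}"' for f in SCHEMA)
    return (
        "Extract the following fields from the job posting below and reply "
        f"with only a JSON object containing the keys {fields}. "
        "Use null for anything not mentioned.\n\n" + page_text
    )

def parse_extraction(raw_reply: str) -> dict:
    """Small models often wrap JSON in prose; grab the first {...} span."""
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model reply")
    data = json.loads(raw_reply[start:end + 1])
    # Keep only schema keys so downstream code sees a stable shape.
    return {k: data.get(k) for k in SCHEMA}

# Canned reply standing in for the actual model call:
reply = ('Sure! {"job_title": "Backend Engineer", "company_name": "Acme", '
         '"salary_range": null, "job_requirements": ["Go", "SQL"]}')
print(parse_extraction(reply)["job_title"])  # Backend Engineer
```

Normalizing to a fixed key set is what makes the "save to board with one click" UI possible: the interface never has to guess what shape the model returned.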
Now, it might be a little much to ask a browser to have special cases just for job postings, when there are so many similar things a user might want to save for later. So you could even let the user define new "boards": the user describes to a (hopefully larger) LLM the purpose of the board and the kinds of information they're looking for, and it generates the search parameters and data extraction tasks that a smaller LLM then runs in the background as you browse, letting the browser present that information when it's available so you can choose whether to save it to your board. The larger LLM could still potentially be on-device, but a more powerful LLM that occupies most of the RAM and processing on your device is something you'd only want to use for a foreground task, not eating up resources in the background.
LLMs are interesting because they make it possible to do things that traditional programming could not do in any practical sense. If something can be done without an LLM, then absolutely... do that. LLMs are very computationally intensive, and their accuracy is more like a human than a computer. There are plenty of drawbacks to LLMs, if you have another valid option.
If you are resource limited, remember that you can also play with the quantization to fit more parameters into less RAM. Phi-3-mini [1] (a 3.8B model) is 7.64GB with full (16-bit floating point) precision, but it is only 2.39GB when quantized to 4 bits.
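The back-of-envelope math is just parameters times bytes per parameter. A quick sketch (the overhead factor is an assumption standing in for the quantization scales/zero-points that formats like GGUF store alongside the weights):

```python
# Rough model memory footprint: params * bits / 8, in GB.
def model_size_gb(params: float, bits_per_param: float, overhead: float = 1.0) -> float:
    return params * bits_per_param / 8 / 1e9 * overhead

phi3_params = 3.8e9
print(round(model_size_gb(phi3_params, 16), 2))  # 7.6 -> matches the ~7.64GB fp16 figure
print(round(model_size_gb(phi3_params, 4), 2))   # 1.9 raw weights
```

The raw 4-bit figure comes out under 2GB; real 4-bit files land nearer the 2.39GB above because of per-group scale metadata and because some layers (e.g. embeddings) are often kept at higher precision.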
That being said, I haven't personally tested it, but I have heard good things about CodeGemma 2B [2].
CodeGemma-2B does not come in an "-it" (instruction-tuned) variant, so it can't be used in a chat context. It is just a base model designed for tab completion of code in an editor, which, I agree, it is pretty good at.
To be honest, the "Needle in a Haystack" test is the most trivial test for a model that relies on full attention; it's expected to be easy to pass if the model was trained correctly.
I just hope people don't claim "model X supports a Y context window" when the evaluation is done on "Needle in a Haystack" alone. It creates so much unnecessary hype.
I wonder if the 0.5B model would be usable for ML tasks like summarization, classification, or embeddings, replacing the small models (like spaCy's) usually used for embeddings.
It won't. Amazon kind of went that angle with MistralLite [1] (a 7B finetune), and it was barely passable as an effective summarizer. 0.5B models are pretty much useless for that.
The official Mistral-7B-v0.2 model added support for 32k context, and I think it's far better than MistralLite. Third-party finetunes are rarely amazing at the best of times.
Now, we have Mistral-7B-v0.3, which is supposedly an even better model.
My experience is that < 500M models are pretty useful when fine-tuned on traditional NLP tasks, such as text classification and sentence-/token-level labeling. A modern LM with a 32K context window could be a nice replacement for BERT, RoBERTa, or BART.
A properly finetuned model can perform better for a given use case, but even with PEFT/LoRAs, finetuning and managing "smaller" open-source LLMs (7B params) like Llama 3 is annoying. That's partially why the even-smaller ~2B Phi series of models took off.
A 0.5B model may not be that great out of the box, but there's a lot of opportunity if it's responsive to finetuning.
Yeah, smaller models are fantastic for finetuning and probably for on-device applications - they can act as a first pass on most LLM applications, and if a task requires a larger model to intervene, they can pass it off to one. I do have a Colab to finetune 0.5B 2x faster here for those interested: https://colab.research.google.com/drive/1-7tjDdMAyeCueyLAwv6...
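That first-pass idea can be sketched as a simple cascade: try the small model, escalate only when its answer looks unreliable. Everything here is a stub, and the confidence score is a stand-in for whatever signal you actually have (token logprobs, a verifier, a refusal check):

```python
from typing import Callable, Tuple

def cascade(prompt: str,
            small: Callable[[str], Tuple[str, float]],
            large: Callable[[str], str],
            threshold: float = 0.7) -> str:
    """Answer with the small model when confident, else fall back to the large one."""
    answer, confidence = small(prompt)
    if confidence >= threshold:
        return answer      # cheap path: small model was confident enough
    return large(prompt)   # expensive path: spend the compute

# Stub models for illustration:
small_model = lambda p: ("maybe", 0.4)
large_model = lambda p: "definitely"
print(cascade("classify this ticket", small_model, large_model))  # definitely
```

The threshold is the knob: set it high and you pay for the large model often but rarely ship a bad small-model answer; set it low and the cascade is mostly free.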
Tiny LLMs (oxymoron?) could be used for text completion, predictive keyboards, compression, to improve OCR and speech transcription, etc...
For these applications, you don't need a super smart model; it just needs to give out hints. For completion, the user can simply ignore the suggestion if it's not what they want; for compression, a weak model just lowers the compression ratio a bit; and for transcription, it's only used for disambiguation. For none of these applications is an LLM strictly needed, but it can improve the results.
Note that for OCR/transcription, I am a bit wary of engines that are too smart, as a lot of them are today, and LLMs go in this direction. The results are often objectively better (more words are properly transcribed), but the words that are not are often subtly wrong: the output makes sense, but it is not the right transcription. With nonsense, at least, we know it can't be trusted and we act accordingly; people can also be quite good at filling in the blanks themselves. That means more effort, but maybe better understanding in the end.