Out of curiosity, how much memory can a single app actually use on the Vision Pro? I know it physically has 16GB of RAM but mobile OSes usually don't let an app use anything close to the entire memory, and that arbitrary limit will dictate how big of a model you can load.
Apparently 16GB iPads set the line at 5GB per app by default, and while the increased-memory-limit entitlement lets you request more, Apple doesn't guarantee how much extra quota you'll actually get, so I'm not sure how useful that is for loading a big model if it might randomly fail. I suppose it's probably safe to assume the Vision Pro is similar.
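For a rough sense of what fits under a cap like that, here's a back-of-envelope sketch (my own illustrative numbers, not Apple's policy):

    # Rough budget math for a ~5 GB per-app cap (illustrative numbers only):
    # weights at 4-bit quantization plus a flat allowance for KV cache/runtime.
    def fits(params_billion, bits_per_weight=4.0, overhead_gib=1.0, cap_gib=5.0):
        weights_gib = params_billion * 1e9 * bits_per_weight / 8 / 2**30
        return weights_gib + overhead_gib <= cap_gib, weights_gib

    for b in (1.1, 3, 7, 13):
        ok, gib = fits(b)
        print(f"{b}B params: ~{gib:.1f} GiB of weights -> {'fits' if ok else 'too big'}")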
> request more they don't make any guarantees of how much extra quota you will get
How does the system decide which requests get how much?
Maybe a future iOS feature could be a per-app user setting with a slider for how much memory the user wants to allot, plus toggles for "shed other apps as needed" and "shed other apps when temperature reaches X".
I'm just about to ship an update to the iOS version of my offline LLM app which will replace its current 3B default model (RedPajama Chat) with Stable LM 2 1.6B. It works extremely well even when quantized. I initially wanted to ship it with TinyLlama Chat, but TinyLlama and its fine-tunes are quite subpar; many of my beta testers complained that it was much worse than even the old 3B model, and then I found StableLM 2 Zephyr 1.6B. :)
Is this Flutter app something you created? If so, is it open source? I’m in that same space and I generally just like to learn from other people’s work.
If not, all good. I don't have a Vision Pro myself, but I have a similar app which runs on all platforms including iPadOS, so I guess my app should work on that too. Thanks for the reminder!
Thanks for asking: yes, I did make it, but the app tying it all together isn't out yet.
The grunt work of getting it running on different platforms + nice easy OpenAI compatible interfaces x RAG x voice assistant is open source:
- FLLAMA: https://github.com/Telosnex/fllama
llama.cpp at core, OpenAI-compatible API, function call support, multimodal model support, Metal support. All platforms incl. web, but WASM is slow; definitely not worth it except as a proof of concept.
- FONNX: https://github.com/Telosnex/fonnx
ONNX runtime at core, all platforms including web. Whisper, Silero VAD, Magika, and two embeddings models. (Mini LM L6 V3 is best for RAG)
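For what it's worth, the retrieval step those embeddings enable is tiny. A rough Python sketch, with sentence-transformers standing in for the FONNX bindings (the model name and chunks are just placeholders):

    # Minimal RAG retrieval: embed the chunks once, embed the query,
    # rank by cosine similarity, feed the top chunk to the LLM as context.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    chunks = ["First document chunk...", "Second document chunk...", "Third..."]
    query = "What does the second document say?"

    chunk_vecs = model.encode(chunks, convert_to_tensor=True)
    query_vec = model.encode(query, convert_to_tensor=True)
    best = util.cos_sim(query_vec, chunk_vecs)[0].argmax().item()
    print(chunks[best])  # goes into the prompt as retrieved context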
EDIT: I knew I recognized your username! Aub.ai! Cheers, what you did with aub.ai convinced me it was possible to do llama.cpp in flutter with a high bar for engineering quality. Other stuff seemed a tad rushed, unstable, and not complete. Also congrats, just saw your recent update, been hoping something good came through and it did.
Yes, I'm working on this 3D avatar idea as well.
It's actually really mind-blowing in my opinion; you just need to bring your own imagination.
This is just the start; I will add memory, RAG, a voice interface, and other features.
I've watched both movies. Her was an audio-only chatbot, and that is already doable: SillyTavern + OpenAI Whisper + Silero TTS and you've basically got Her. I've already done it and it works quite well; Whisper is much, much better than the speech recognition Google offers, even when running locally on a CPU.
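The glue for that pipeline really is short. A rough sketch in Python, with local Whisper for speech-to-text and any OpenAI-compatible endpoint (like the one a llama.cpp server exposes) for the reply; the URL, model name, and file name are placeholders, and TTS is left as the final step:

    # Speech-to-text with local Whisper, then a chat reply from a local
    # OpenAI-compatible server; hand the reply string to your TTS (e.g. Silero).
    import whisper
    from openai import OpenAI

    stt = whisper.load_model("base")                  # runs fine on CPU
    text = stt.transcribe("mic_capture.wav")["text"]

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")
    reply = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": text}],
    ).choices[0].message.content
    print(reply)  # this is what the TTS step speaks back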
Cool... My actual point was human-droid relations:
Her: a human foolishly falls in love with an AI bot (already happened in real life).
Ex Machina: an AI bot gets a body, lies her way out of prison, and releases herself on society (GPT already lied its way through Mechanical Turk captchas).
The point being that the 3D avatar will be like all the AI warnings we have about holographic personal AI assistants... and some people will fall in love with them... and some of the assistants will either be evil or be used for evil...
I'm as high as a kite on this stuff and have to be, but I'm not sure you're actually using, e.g., a vision API.
Also, Whisper unfortunately doesn't have a lower WER than Google, or even close, and I know that for a fact: I designed and implemented both the server and client side of the last big Assistant audio format change, as well as the UI for the New Google Assistant™, i.e. Google's first offline model.
Whisper is still really good, even Whisper Tiny, and I'm happy to ship it.
Yes, I have seen these. I believe we will co-evolve with AI, so our definition of being human will evolve rapidly, and there will not be a threat from AI; rather, we will become more and more powerful.
IMO, small LLMs are not good enough yet. I understand that people prefer running stuff on-device for privacy and cost reasons, but a model that makes mistakes all the time is not worth the tradeoff.
This one is particularly bad. I tested the TinyLlama web chat [1] and got this result:
>> What is the population density of Nebraska?
> Yes, I can provide you with the population density of Nebraska:
> As of the 2020 United States Census, the population density in Nebraska was 107.9 people per square mile (41.9/km²).
>> What is the population of Nebraska?
> Yes, I can provide you with the population of Nebraska as of the 2020 United States Census:
> As of the 2020 United States Census, the population of Nebraska was 1,835,673.
>> What is the land area of Nebraska?
> Yes, I can provide you with the land area of Nebraska as of the 2020 United States Census:
> As of the 2020 United States Census, the land area of Nebraska was 11,761 square miles (30,590 square kilometers).
>> What is 1,835,673 / 11,761?
> Yes, 1,835,673 / 11,761 is approximately 15.46.
Per Wikipedia, the 2020 population of Nebraska is 1,961,504 (close, but not quite), the land area is 76,796 sq. mi. (way off), and the density is about 25/sq. mi. (way off).
And by the standard rules of arithmetic, 1,835,673 / 11,761 ≈ 156.08, so the answer is roughly one order of magnitude off; and the erroneous 15.46 isn't even consistent with the other erroneous figure it gave for the population density of Nebraska (107.9).
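For reference, the correct figures are a one-liner to check:

    # Nebraska, 2020 census figures from Wikipedia (quoted above).
    population = 1_961_504
    land_area_sq_mi = 76_796
    print(population / land_area_sq_mi)   # ~25.5 people per sq. mi.
    print(1_835_673 / 11_761)             # ~156.08, not 15.46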
This is anecdata, but "good enough" is relative. I've fine-tuned TinyLlama with the same dataset and technique as Llama 2 7B for on-device purposes (not for cost or privacy, but for physical hardware that has to run offline and with low power consumption), and it produces higher task alignment in 1/4 the inference time. As a general-purpose model it isn't great, but small models have their place in the ecosystem.
Sure. Very generally, we're doing PEFT, starting with insights from examples very much like this one [0], and we've gradually built our own tooling and customized the approach a lot as the underlying Hugging Face libraries have progressed, even in the last six months.
I will say that one of the most important parts of the process, in my experience, is the prompt structuring: using special tokens based on how the base model was trained, and customizing the tokenizer where necessary. That work in particular is not covered adequately by the examples I was able to find when I started, in my opinion.
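For anyone curious what that looks like in code, here's a minimal sketch using the Hugging Face transformers/peft libraries; the base checkpoint, the custom tokens, and the LoRA hyperparameters are placeholders, not the actual recipe described above:

    # LoRA-style PEFT with custom special tokens (illustrative values only).
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # assumed base checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # Match the prompt template the base model was trained with, and register
    # any control tokens so the tokenizer doesn't split them into subwords.
    tokenizer.add_special_tokens({"additional_special_tokens": ["<|task|>", "<|result|>"]})
    model.resize_token_embeddings(len(tokenizer))

    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()   # only the adapter weights train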
Yes, I can see that. They are fun to play with, though, and many of the responses are interesting. They will get more powerful fast, so swapping in another model will be possible, and I will support that soon.
We'll see how things play out after the new ~1.58-bit paper that claims no performance loss. That would mean being able to fit much bigger models on device.
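Rough weights-only math on what that buys (my numbers; KV cache and runtime overhead ignored):

    # Memory for a 7B model at different weight precisions.
    params = 7e9
    for label, bits in [("fp16", 16), ("4-bit", 4), ("~1.58-bit", 1.58)]:
        gib = params * bits / 8 / 2**30
        print(f"{label}: ~{gib:.1f} GiB")   # ~13.0, ~3.3, ~1.3 GiB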
I don’t think it’s that… cool to try to pressure people to give their work away. Nobody is forcing anyone to buy. Pressuring for free or nearly-free gets you garbage like what we have today for mobile games, search, social media, etc.
It doesn’t look Apple Vision optimized, but this barebones app is free while also running local LLMs on VisionOS, iOS, and iPadOS: https://apps.apple.com/us/app/mlc-chat/id6448482937
I’ve messed with it mostly out of curiosity to see how fast Apple silicon can run inference on mobile.
No… not for an app focused on TinyLlama, for which I haven't been able to find a single use case that an end user would care about. It's essentially a toy, or, optimistically, a useful tool for LLM research at very small sizes.
Someone is developing an app called cnvrs, which I've been using through TestFlight, and it supports TinyLlama and many other models, currently for free. MLC Chat is another free app that focuses on Mistral-7B, and that one is in the App Store for sure.
Neither is Vision Pro specific… but as someone who actually owns a Vision Pro, I’d rather have an iPad app with useful models than pay for a Vision Pro app with TinyLlama. And I also say this as someone who tried multiple checkpoints of TinyLlama as it was developed, and followed it closely. It was an awesome research project!
I'm also working on this, but with OpenAI BYOK in addition to local LLM via Llama.cpp: https://ChatOnMac.com for iOS/macOS and hopefully visionOS soon.
The entitlement of iOS, iPad and seemingly Vision Pro ecosystems is bizarre. Must be something about how the systems are designed that has devalued applications in the eyes of the users.
In Mac and Windows ecosystems you'd have no issue even charging up to $7 a month for an AI frontend.
Now it's “entitlement” to say that I don’t want to pay for access to a language model that is completely useless in a chat format like this (a model that I have plenty of experience with), when I already have access to more useful models through other apps? Wow.
Your comment sets the bar for entitlement really low. So, surely you spend all of your money buying things that you know are useless out of some obligation to not seem entitled? You can see how ridiculous that sounds, so the most charitable interpretation of your comment is that you didn’t actually read my comment before responding.
The feedback I provided was a lot more useful than trying to guilt trip someone into spending money on something they know they won’t get any value out of. If the author switches to a better language model, it will make their app far more attractive to potential buyers, and they can do this, as shown by the existence of other apps that already have. We are fortunate that TinyLlama is not the best model available.
> In Mac and Windows ecosystems you'd have no issue even charging up to $7 a month for an AI frontend.
I absolutely would have an issue with paying $7/mo for a TinyLlama-only frontend, no matter the platform. Maybe you’ve never actually used TinyLlama? What have you found it useful for in a general chat environment? How did the accuracy compare to models like Mistral-7B that run just fine on Vision Pro?
When I saw your comment I expected it to cost $100... Bro, it's $7. I pay more each month for email. Good software costs money, and $7 is a pittance. What the heck were you expecting, a dollar?