Show HN: OfflineLLM – a Vision Pro app running TinyLlama on device (apps.apple.com)
126 points by codepixel 11 months ago | 60 comments
Hey, I built this in a day while at the Founders Inc Apple Vision Pro residency. Try it out and let me know what you think.



Out of curiosity, how much memory can a single app actually use on the Vision Pro? I know it physically has 16GB of RAM but mobile OSes usually don't let an app use anything close to the entire memory, and that arbitrary limit will dictate how big of a model you can load.


https://developer.apple.com/documentation/bundleresources/en... is how you get access to more RAM on iOS/iPadOS, but it's not marked as available for visionOS, so I can't tell.


Apparently 16GB iPads set the line at 5GB per app by default, and while that entitlement lets you request more, they don't make any guarantees about how much extra quota you'll get, so I'm not sure how useful that is for loading a big model if it might randomly fail. I suppose it's probably safe to assume the Vision Pro is similar.

https://9to5mac.com/2021/06/25/apps-can-request-access-to-mo...


>request more, they don't make any guarantees about how much extra quota you'll get

How does the system decide which requests get how much?

Maybe a new feature in a future iOS would be a per-app user setting with a slider for how much memory the user wants to allot - and a toggle to "shed other apps as needed" and "shed other apps when temp reaches X"


> Maybe a new feature in a future iOS would be a per-app user setting with a slider for how much memory the user wants to allot

And Apple even has a UI design ready for this https://kalleboo.com/linked/system7memorymanagement.png


I’ve tested this. On iPads with lots of RAM it has always worked for me, unless the app is in split-screen mode.


Apple will never do that. And the algo is undocumented.


Hello, System 7!


Your screenshots need a bit more thought put into them; you're essentially just showing some empty rectangles with no context as to what you're selling.

Also, I'd be looking towards voice as the input/output for an LLM on Vision Pro.


okay fair, the actual ui has improved since those screenshots were taken, so i will be updating them asap


Should look at MLX optimisations too. Stable LM 1.6B, which is about the same size and quantises to 4-bit really well, runs at 100 tok/s on an M2 Mac mini.

https://x.com/awnihannun/status/1750986911827832992?s=20
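The whole pipeline is only a few lines with mlx-lm; a rough Python sketch (the Hugging Face repo name and default output path are from memory, so treat them as assumptions):

    # Convert + 4-bit quantize once (output lands in ./mlx_model by default):
    #   python -m mlx_lm.convert --hf-path stabilityai/stablelm-2-zephyr-1_6b -q
    from mlx_lm import load, generate

    model, tokenizer = load("mlx_model")  # directory produced by the convert step
    text = generate(model, tokenizer,
                    prompt="Explain retrieval-augmented generation in one sentence.",
                    max_tokens=128, verbose=True)
    print(text)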


I'm just about to ship an update to the iOS version of my offline LLM app which will replace its current 3B default model (RedPajama Chat) with Stable LM 1.6B. Works extremely well even when quantized. I initially wanted to ship it with TinyLlama Chat, but TinyLlama and its fine tunes are quite subpar; many of my beta testers complained that it's much worse than even the old 3B model, and then I found StableLM 2 Zephyr 1.6B. :)

https://imgur.com/a/Imd2l9o


You might also evaluate Gemma 2B and Qwen 0.5B as alternative tiny models, FWIW.


oooh thanks for the tip, will try this


Stable LM 3B Zephyr: it's the only model below 7B that can handle RAG, i.e. understand "hey, those are documents, use them to answer these questions".

It'll work on Vision Pro too; it was quite delightful to open TestFlight, install my Flutter app that wasn't designed for Vision Pro at all, and find that everything "just worked".


https://stability.ai/news/stablelm-zephyr-3b-stability-llm works absolutely fine on the M2 processor, like 40 tok/s https://x.com/EMostaque/status/1732912442282312099?s=20

Stable LM 2 1.6B runs even faster but isn't as good at RAG. It's great at multilingual though; we are seeing it match 70B models on other languages (new version soon): https://x.com/EMostaque/status/1763269238347673796?s=20

Can fit a lot in a gigabyte file it seems.


Is this Flutter app something you created? If so, is it open source? I’m in that same space and I generally just like to learn from other people’s work.

If not, all good. I don’t have a Vision Pro myself, but I have a similar app which runs on all platforms including iPadOS, so I guess my app should work on that too. Thanks for the reminder!


Thanks for asking: yes, I did make it, but there's no app tying it all together. At least, it isn't out yet.

The grunt work of getting it running on different platforms + nice easy OpenAI compatible interfaces x RAG x voice assistant is open source:

- FLLAMA: https://github.com/Telosnex/fllama llama.cpp at core, openai compatible API, function call support, multimodal model support, Metal support. All platforms incl. web, but WASM is slow, def. not worth it except as a proof of concept.

- FONNX: https://github.com/Telosnex/fonnx ONNX runtime at core, all platforms including web. Whisper, Silero VAD, Magika, and two embedding models. (Mini LM L6 V3 is best for RAG; rough sketch below)
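Roughly, the embedding model's job in the RAG step looks like this (a Python sketch with sentence-transformers standing in for the ONNX model; the documents and query are made up):

    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    docs = [
        "FLLAMA wraps llama.cpp with an OpenAI-compatible API.",
        "FONNX runs Whisper and embedding models via ONNX Runtime.",
    ]
    doc_vecs = embedder.encode(docs, convert_to_tensor=True)

    query_vec = embedder.encode("Which package handles speech to text?",
                                convert_to_tensor=True)
    scores = util.cos_sim(query_vec, doc_vecs)[0]
    context = docs[int(scores.argmax())]  # top passage gets pasted into the LLM prompt
    print(context)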

EDIT: I knew I recognized your username! Aub.ai! Cheers, what you did with aub.ai convinced me it was possible to do llama.cpp in flutter with a high bar for engineering quality. Other stuff seemed a tad rushed, unstable, and not complete. Also congrats, just saw your recent update, been hoping something good came through and it did.


Big downvote to Gemma for reasons that are already discussed.


discussed where?


nice, congrats on shipping


thanks! move fast and ship is my mantra


Love the spirit! App looks very cool, keep up the great work :)


I can't see much benefit to this compared to SillyTavern on my desktop.

What I'd like to see in this space is an actual 3D avatar assistant that you can talk to using your voice, as if they were another person.


yes, i'm working on this 3D avatar idea as well. it's actually really mind-blowing in my opinion, you just need to bring your own imagination. this is just the start; i will add memory, RAG, a voice interface, and other features to this.


This is the future of computing. Especially when tech like vision pro becomes the size of normal sunglasses.


Just make sure you watch "Her" and "Ex Machina" and that other new one about Human-Droid relations, for inspiration and caution...


I've watched both movies. Her was an audio-only chatbot, and this is already doable: SillyTavern + OpenAI Whisper + Silero TTS and you've basically got Her. I've already done it and it works quite well; Whisper is much, much better than the speech recognition Google offers, even when running locally on a CPU.

Ex Machina was an actual physical robot. Not possible yet, but since GPTs became smart, huge investments are being made in robotics; the most recent announcement was today: https://futurism.com/the-byte/humanoid-robot-maker-deal-open...

Once this happens, a robot will basically be able to do any job a human can do.
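The "Her" part really is just glue code; a rough Python sketch (ask_local_llm and speak below are hypothetical placeholders for whatever chat backend and TTS you wire in, e.g. SillyTavern's API and Silero):

    import whisper

    stt = whisper.load_model("tiny")           # small enough to run on a CPU
    heard = stt.transcribe("mic_capture.wav")  # speech -> text
    user_text = heard["text"]

    reply = ask_local_llm(user_text)  # placeholder: SillyTavern / llama.cpp call
    speak(reply)                      # placeholder: Silero TTS (or any other TTS)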


Cool... My actual point was Human-Droid relations:

Her: Human foolishly falls in love with an AI bot (already happened in real life)

E.M.: AI bot gets a body, lies her way out of prison, and releases herself on society (GPT already lied its way through Mechanical Turk captchas.)

Point being, the 3D avatar will be like all the AI warnings we have about holographic personal AI assistants... some people will fall in love with them... and some of the assistants will either be Evil or be used for Evil...

:-)

I didn't doubt you had seen them, though.


re: GPT x bot:

Absolutely not.

I'm as high as a kite on this stuff and have to be, but I'm not sure you're actually using e.g. the vision API.

Also, Whisper unfortunately isn't lower WER than Google, or even close, and that I know for a fact: I designed & implemented both the server and client side of the last big Assistant audio format change, and also the UI for the New Google Assistant™, i.e. Google's first offline model.

Whisper is still really good, even Whisper Tiny, and I'm happy to ship it.


oh, super interesting projects, SillyTavern and Silero TTS. indeed, i believe Her is possible now


yes, i have seen these. i believe we will co-evolve with ai, so our definition of being human will evolve rapidly, and there will not be a threat from ai; rather, we will become more and more powerful


IMO, small LLMs are not good enough yet. I understand that people prefer running stuff on-device for privacy and cost reasons, but a model that makes mistakes all the time is not worth the tradeoff.


This one is particularly bad. I tested the Tinyllama web chat [1] and got this result:

>> What is the population density of Nebraska?

> Yes, I can provide you with the population density of Nebraska:

> As of the 2020 United States Census, the population density in Nebraska was 107.9 people per square mile (41.9/km²).

>> What is the population of Nebraska?

> Yes, I can provide you with the population of Nebraska as of the 2020 United States Census:

> As of the 2020 United States Census, the population of Nebraska was 1,835,673.

>> What is the land area of Nebraska?

> Yes, I can provide you with the land area of Nebraska as of the 2020 United States Census:

> As of the 2020 United States Census, the land area of Nebraska was 11,761 square miles (30,590 square kilometers).

>> What is 1,835,673 / 11,761?

> Yes, 1,835,673 / 11,761 is approximately 15.46.

Per Wikipedia, the 2020 Population of Nebraska is 1,961,504 (close, but not quite), land area is 76,796 sq. mi. (way off), density is about 25/sq. mi. (way off).

And according to the standard rules of arithmetic, 1,835,673 / 11,761 = 156.08, so the answer is almost (but not quite) one order of magnitude off, and the erroneous 15.46 isn't even consistent with the other erroneous figure it gave for the population density of Nebraska (107.9).
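For what it's worth, the check is a couple of lines of Python:

    # The figures cited above (per Wikipedia) vs. the model's own numbers.
    population = 1_961_504               # Nebraska, 2020 census
    land_area_sq_mi = 76_796
    print(population / land_area_sq_mi)  # ~25.5 people/sq mi, not 107.9

    # Even dividing the model's own (wrong) figures doesn't give its answer:
    print(1_835_673 / 11_761)            # ~156.08, not 15.46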

[1]: https://huggingface.co/spaces/TinyLlama/tinyllama-chat


This is anecdata, but "good enough" is relative. I've finetuned TinyLlama with the same dataset and technique as Llama 2 7B for on-device purposes (not for cost or privacy, but for physical hardware that has to run offline with low power consumption), and it produces higher task alignment in 1/4 the inference time. As a general purpose model it isn't great, but small models have their place in the ecosystem.


Care to elaborate on the finetune? It's surprisingly hard to come across useful finetuning examples.


Sure. Very generally, we're doing PEFT, starting with insights from examples very much like this one [0], and we've gradually built our own tooling and customized the approach a lot as the underlying Hugging Face libraries have progressed, even in the last 6 months.

I will say that one of the most important parts of the process I've found is the prompt structuring: using special tokens based on how the base models were trained, and customizing the tokenizer where necessary. That work in particular is not covered adequately by the examples I was able to find when I started, in my opinion.

[0] https://medium.com/@kshitiz.sahay26/fine-tuning-llama-2-for-...
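In case it's useful, a stripped-down sketch of that kind of setup (the model id, target modules, and hyperparameters here are illustrative, not our actual config):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # The prompt-structuring point: format training text with the same
    # special tokens / chat template the base model was trained on.
    sample = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Classify this log line: ..."},
         {"role": "assistant", "content": "..."}],
        tokenize=False,
    )

    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only the adapter weights will train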


Some Mistral 7B fine tunes are borderline usable, but yes it's still very marginal.


yes, i can see that. they are fun to play with though, and many of the responses are interesting. they will get more powerful fast, so swapping in another model will be possible, and soon i will support this


we'll see how things play out after the new paper with ~1.6 bits per weight and no performance loss. That would mean being able to fit much bigger models on device.


> IMO, small LLMs are not good enough yet

Sure, but the only way they're going to get there is by people iterating on them while they're still crap


I think you should reevaluate the price target…


I don’t think it’s that… cool to try to pressure people to give their work away. Nobody is forcing anyone to buy. Pressuring for free or nearly-free gets you garbage like what we have today for mobile games, search, social media, etc.


GP doesn’t necessarily mean lower; I could see $15 being a fair price for this.


everyone assuming I mean cheaper...


Clarity is always welcome and appreciated in such comments. Others on the thread did specifically mean cheaper, though.


It doesn’t look Vision Pro optimized, but this barebones app is free while also running local LLMs on visionOS, iOS, and iPadOS: https://apps.apple.com/us/app/mlc-chat/id6448482937 I’ve messed with it mostly out of curiosity, to see how fast Apple silicon can run inference on mobile.


to be higher or lower?


When it comes to the $3k novelty face computer you can likely get away with higher, at least until someone else does it too


No… not for an app focused on TinyLlama, which I haven’t been able to find a single use case for that an end user would care about. It’s essentially a toy, or optimistically a useful tool for LLM research at very small sizes.

Someone is developing an app called cnvrs, which I’ve been using through TestFlight, and it supports TinyLlama and many other models, currently for free. MLC Chat is another free app that focuses on Mistral-7B, and that one is in the App Store for sure.

Neither is Vision Pro specific… but as someone who actually owns a Vision Pro, I’d rather have an iPad app with useful models than pay for a Vision Pro app with TinyLlama. And I also say this as someone who tried multiple checkpoints of TinyLlama as it was developed, and followed it closely. It was an awesome research project!


I'm also working on this, but with OpenAI BYOK in addition to local LLM via Llama.cpp: https://ChatOnMac.com for iOS/macOS and hopefully visionOS soon.


The entitlement of the iOS, iPadOS, and seemingly Vision Pro ecosystems is bizarre. It must be something about how the systems are designed that has devalued applications in the eyes of users.

In the Mac and Windows ecosystems you'd have no issue charging even up to $7 a month for an AI frontend.


Now it's “entitlement” to say that I don’t want to pay for access to a language model that is completely useless in a chat format like this (a model that I have plenty of experience with), when I already have access to more useful models through other apps? Wow.

Your comment sets the bar for entitlement really low. So, surely you spend all of your money buying things that you know are useless out of some obligation to not seem entitled? You can see how ridiculous that sounds, so the most charitable interpretation of your comment is that you didn’t actually read my comment before responding.

The feedback I provided was a lot more useful than trying to guilt trip someone into spending money on something they know they won’t get any value out of. If the author switches to a better language model, it will make their app far more attractive to potential buyers, and they can do this, as shown by the existence of other apps that already have. We are fortunate that TinyLlama is not the best model available.

> In the Mac and Windows ecosystems you'd have no issue charging even up to $7 a month for an AI frontend.

I absolutely would have an issue with paying $7/mo for a TinyLlama-only frontend, no matter the platform. Maybe you’ve never actually used TinyLlama? What have you found it useful for in a general chat environment? How did the accuracy compare to models like Mistral-7B that run just fine on Vision Pro?


Price is fine, I wouldn't stress it at this stage.


how much would you pay?


When I saw your comment I expected it to cost $100... Bro, it's $7. I pay more each month for email. Good software costs money and $7 is a pittance. What the heck were you expecting, a dollar??


$10-$15 actually.


[flagged]


seems to be unnecessary criticism?


Not really. It’s just a really simple hype-tech mashup and should be called out for it.


[flagged]


I bet most people who paid $3500 for a face computer are happy to pay $6.99 for an AI app.

Myself, I'll just keep running Ollama on my Linux laptop and let the Apple fans spend their money.



