Show HN: OfflineLLM – a Vision Pro app running TinyLlama on device (apps.apple.com)
126 points by codepixel 11 months ago | 60 comments
Hey, I built this in a day while at the Founders Inc Apple Vision Pro residency. Try it out and let me know what you think.



Out of curiosity, how much memory can a single app actually use on the Vision Pro? I know it physically has 16GB of RAM but mobile OSes usually don't let an app use anything close to the entire memory, and that arbitrary limit will dictate how big of a model you can load.


https://developer.apple.com/documentation/bundleresources/en... is how you get access to more RAM on iOS/iPadOS, but it's not marked as available for visionOS, so I can't tell.


Apparently 16GB iPads set the line at 5GB per app by default, and while that entitlement lets you request more, they don't make any guarantees about how much extra quota you'll get, so I'm not sure how useful that is for loading a big model if it might randomly fail. I suppose it's probably safe to assume the Vision Pro is similar.

https://9to5mac.com/2021/06/25/apps-can-request-access-to-mo...


>request more, they don't make any guarantees about how much extra quota you'll get

How does the system decide which requests get how much?

Maybe a new feature in a future iOS would be a per-app user setting with a slider for how much memory the user wants to allot - and a toggle to "shed other apps as needed" and "shed other apps when temp reaches X"


> Maybe a new feature in a future iOS would be a per-app user setting with a slider for how much memory the user wants to allot

And Apple even has a UI design ready for this https://kalleboo.com/linked/system7memorymanagement.png


I’ve tested this. On iPads with lots of RAM it has always worked for me, unless the app is in split-screen mode.


Apple will never do that. And the algo is undocumented.


Hello, System 7!


Your screenshots need a bit more thought put into them; you're essentially just showing some empty rectangles with no context as to what you're selling.

Also, I'd be looking towards voice as the input/output for an LLM on Vision Pro.


okay fair, the actual ui has improved since those screenshots were taken, so i will be updating them asap


Should look at MLX optimisations too. Stable LM 1.6B, which is about the same size and quantises to 4-bit really well, runs at 100 tok/s on an M2 Mac mini.

https://x.com/awnihannun/status/1750986911827832992?s=20
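The whole pipeline is only a few lines with mlx-lm; a rough Python sketch (the Hugging Face repo name and default output path are from memory, so treat them as assumptions):

    # Convert + 4-bit quantize once (output lands in ./mlx_model by default):
    #   python -m mlx_lm.convert --hf-path stabilityai/stablelm-2-zephyr-1_6b -q
    from mlx_lm import load, generate

    model, tokenizer = load("mlx_model")  # directory produced by the convert step
    text = generate(model, tokenizer,
                    prompt="Explain retrieval-augmented generation in one sentence.",
                    max_tokens=128, verbose=True)
    print(text)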


I'm just about to ship an update to the iOS version of my offline LLM app which will replace its current 3B default model (RedPajama Chat) with Stable LM 1.6B. Works extremely well even when quantized. I initially wanted to ship it with TinyLlama Chat, but TinyLlama and its fine tunes are quite subpar; many of my beta testers complained that it's much worse than even the old 3B model, and then I found StableLM 2 Zephyr 1.6B. :)

https://imgur.com/a/Imd2l9o


You might also evaluate Gemma 2B and Qwen 0.5B as alternative tiny models, FWIW.


oooh thanks for the tip, will try this


Stable LM 3B Zephyr: it's the only model below 7B that can handle RAG, i.e. understand "hey, those are documents, use them to answer these questions".

It'll work on Vision Pro too; it was quite delightful to open TestFlight, install my Flutter app that wasn't designed for Vision Pro at all, and find that everything "just worked".


https://stability.ai/news/stablelm-zephyr-3b-stability-llm works absolutely fine on the M2 processor, like 40 tok/s https://x.com/EMostaque/status/1732912442282312099?s=20

Stable LM 2 1.6B runs even faster but isn't as good at RAG. It's great at multilingual though; we are seeing it match 70B models on other languages (new version soon): https://x.com/EMostaque/status/1763269238347673796?s=20

Can fit a lot in a gigabyte file it seems.


Is this Flutter app something you created? If so, is it open source? I’m in that same space and I generally just like to learn from other people’s work.

If not, all good. I don’t have a Vision Pro myself, but I have a similar app which runs on all platforms including iPadOS, so I guess my app should work on that too. Thanks for the reminder!


Thanks for asking: yes, I did make it, but there's no app tying it all together. At least, it isn't out yet.

The grunt work of getting it running on different platforms + nice easy OpenAI compatible interfaces x RAG x voice assistant is open source:

- FLLAMA: https://github.com/Telosnex/fllama llama.cpp at core, openai compatible API, function call support, multimodal model support, Metal support. All platforms incl. web, but WASM is slow, def. not worth it except as a proof of concept.

- FONNX: https://github.com/Telosnex/fonnx ONNX runtime at core, all platforms including web. Whisper, Silero VAD, Magika, and two embedding models. (Mini LM L6 V3 is best for RAG; rough sketch below)
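Roughly, the embedding model's job in the RAG step looks like this (a Python sketch with sentence-transformers standing in for the ONNX model; the documents and query are made up):

    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    docs = [
        "FLLAMA wraps llama.cpp with an OpenAI-compatible API.",
        "FONNX runs Whisper and embedding models via ONNX Runtime.",
    ]
    doc_vecs = embedder.encode(docs, convert_to_tensor=True)

    query_vec = embedder.encode("Which package handles speech to text?",
                                convert_to_tensor=True)
    scores = util.cos_sim(query_vec, doc_vecs)[0]
    context = docs[int(scores.argmax())]  # top passage gets pasted into the LLM prompt
    print(context)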

EDIT: I knew I recognized your username! Aub.ai! Cheers, what you did with aub.ai convinced me it was possible to do llama.cpp in flutter with a high bar for engineering quality. Other stuff seemed a tad rushed, unstable, and not complete. Also congrats, just saw your recent update, been hoping something good came through and it did.


Big downvote to Gemma for reasons that are already discussed.


discussed where?


nice, congrats on shipping


thanks! move fast and ship is my mantra


Love the spirit! App looks very cool, keep up the great work :)


I can't see much benefit to this compared to SillyTavern on my desktop.

What I'd like to see in this space is an actual 3D avatar assistant that you can talk to using your voice, as if they were another person.


yes, i'm working on this 3D avatar idea as well. it's actually really mind-blowing in my opinion, you just need to bring your own imagination. this is just the start; i will add memory, RAG, a voice interface, and other features to this.


This is the future of computing. Especially when tech like vision pro becomes the size of normal sunglasses.


Just make sure you watch "Her" and "Ex Machina" and that other new one about Human-Droid relations, for inspiration and caution...


I've watched both movies. Her was an audio-only chatbot, and this is already doable: SillyTavern + OpenAI Whisper + Silero TTS and you've basically got Her. I've already done it and it works quite well; Whisper is much, much better than the speech recognition Google offers, even when running locally on a CPU.

Ex Machina was an actual physical robot. Not possible yet, but since GPTs became smart, huge investments are being made in robotics; the most recent announcement was today: https://futurism.com/the-byte/humanoid-robot-maker-deal-open...

Once this happens, a robot will basically be able to do any job a human can do.
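The "Her" part really is just glue code; a rough Python sketch (ask_local_llm and speak below are hypothetical placeholders for whatever chat backend and TTS you wire in, e.g. SillyTavern's API and Silero):

    import whisper

    stt = whisper.load_model("tiny")           # small enough to run on a CPU
    heard = stt.transcribe("mic_capture.wav")  # speech -> text
    user_text = heard["text"]

    reply = ask_local_llm(user_text)  # placeholder: SillyTavern / llama.cpp call
    speak(reply)                      # placeholder: Silero TTS (or any other TTS)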


Cool... My actual point was Human-Droid relations:

Her: Human foolishly falls in love with an AI bot (already happened in real life)

E.M.: AI bot gets a body, lies her way out of prison, and releases herself on society (GPT already lied its way through Mechanical Turk captchas.)

Point being, the 3D avatar will be like all the AI warnings we have about holographic personal AI assistants... some people will fall in love with them... and some of the assistants will either be Evil or be used for Evil...

:-)

I didn't doubt you had seen them, though.


re: GPT x bot:

Absolutely not.

I'm as high as a kite on this stuff and have to be, but I'm not sure you're actually using e.g. the vision API.

Also, Whisper unfortunately isn't lower WER than Google, or even close, and that I know for a fact: I designed & implemented both the server and client side of the last big Assistant audio format change, and also the UI for the New Google Assistant™, i.e. Google's first offline model.

Whisper is still really good, even Whisper Tiny, and I'm happy to ship it.


oh, super interesting projects, SillyTavern and Silero TTS. indeed, i believe Her is possible now


yes, i have seen these. i believe we will co-evolve with ai, so our definition of being human will evolve rapidly, and there will not be a threat from ai; rather, we will become more and more powerful


IMO, small LLMs are not good enough yet. I understand that people prefer running stuff on-device for privacy and cost reasons, but a model that makes mistakes all the time is not worth the tradeoff.


This one is particularly bad. I tested the Tinyllama web chat [1] and got this result:

>> What is the population density of Nebraska?

> Yes, I can provide you with the population density of Nebraska:

> As of the 2020 United States Census, the population density in Nebraska was 107.9 people per square mile (41.9/km²).

>> What is the population of Nebraska?

> Yes, I can provide you with the population of Nebraska as of the 2020 United States Census:

> As of the 2020 United States Census, the population of Nebraska was 1,835,673.

>> What is the land area of Nebraska?

> Yes, I can provide you with the land area of Nebraska as of the 2020 United States Census:

> As of the 2020 United States Census, the land area of Nebraska was 11,761 square miles (30,590 square kilometers).

>> What is 1,835,673 / 11,761?

> Yes, 1,835,673 / 11,761 is approximately 15.46.

Per Wikipedia, the 2020 Population of Nebraska is 1,961,504 (close, but not quite), land area is 76,796 sq. mi. (way off), density is about 25/sq. mi. (way off).

And according to the standard rules of arithmetic, 1,835,673 / 11,761 = 156.08, so the answer is almost (but not quite) one order of magnitude off, and the erroneous 15.46 isn't even consistent with the other erroneous figure it gave for the population density of Nebraska (107.9).
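For what it's worth, the check is a couple of lines of Python:

    # The figures cited above (per Wikipedia) vs. the model's own numbers.
    population = 1_961_504               # Nebraska, 2020 census
    land_area_sq_mi = 76_796
    print(population / land_area_sq_mi)  # ~25.5 people/sq mi, not 107.9

    # Even dividing the model's own (wrong) figures doesn't give its answer:
    print(1_835_673 / 11_761)            # ~156.08, not 15.46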

[1]: https://huggingface.co/spaces/TinyLlama/tinyllama-chat


This is anecdata, but "good enough" is relative. I've finetuned TinyLlama with the same dataset and technique as Llama 2 7B for on-device purposes (not for cost or privacy, but for physical hardware that has to run offline with low power consumption), and it produces higher task alignment in 1/4 the inference time. As a general purpose model it isn't great, but small models have their place in the ecosystem.


Care to elaborate on the finetune? It's surprisingly hard to come across useful finetuning examples.


Sure. Very generally, we're doing PEFT, starting with insights from examples very much like this one [0], and we've gradually built our own tooling and customized the approach a lot as the underlying Hugging Face libraries have progressed, even in the last 6 months.

I will say that one of the most important parts of the process I've found is the prompt structuring: using special tokens based on how the base models were trained, and customizing the tokenizer where necessary. That work in particular is not covered adequately by the examples I was able to find when I started, in my opinion.

[0] https://medium.com/@kshitiz.sahay26/fine-tuning-llama-2-for-...
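In case it's useful, a stripped-down sketch of that kind of setup (the model id, target modules, and hyperparameters here are illustrative, not our actual config):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # The prompt-structuring point: format training text with the same
    # special tokens / chat template the base model was trained on.
    sample = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Classify this log line: ..."},
         {"role": "assistant", "content": "..."}],
        tokenize=False,
    )

    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only the adapter weights will train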


Some Mistral 7B fine tunes are borderline usable, but yes it's still very marginal.


yes, i can see that. they are fun to play with though, and many of the responses are interesting. they will get more powerful fast, so swapping in another model will be possible, and soon i will support this


we'll see how things play out after the new paper with ~1.6 bits per weight and no performance loss. That would mean being able to fit much bigger models on device.


> IMO, small LLMs are not good enough yet

Sure, but the only way they're going to get there is by people iterating on them while they're still crap


I think you should reevaluate the price target…


I don’t think it’s that… cool to try to pressure people to give their work away. Nobody is forcing anyone to buy. Pressuring for free or nearly-free gets you garbage like what we have today for mobile games, search, social media, etc.


GP doesn’t necessarily mean lower; I could see $15 being a fair price for this.


everyone assuming I mean cheaper...


Clarity is always welcome and appreciated in such comments. Others on the thread did specifically mean cheaper, though.


It doesn’t look Vision Pro optimized, but this barebones app is free while also running local LLMs on visionOS, iOS, and iPadOS: https://apps.apple.com/us/app/mlc-chat/id6448482937 I’ve messed with it mostly out of curiosity, to see how fast Apple silicon can run inference on mobile.


to be higher or lower?


When it comes to the $3k novelty face computer you can likely get away with higher, at least until someone else does it too


No… not for an app focused on TinyLlama, which I haven’t been able to find a single use case for that an end user would care about. It’s essentially a toy, or optimistically a useful tool for LLM research at very small sizes.

Someone is developing an app called cnvrs, which I’ve been using through TestFlight, and it supports TinyLlama and many other models, currently for free. MLC Chat is another free app that focuses on Mistral-7B, and that one is in the App Store for sure.

Neither is Vision Pro specific… but as someone who actually owns a Vision Pro, I’d rather have an iPad app with useful models than pay for a Vision Pro app with TinyLlama. And I also say this as someone who tried multiple checkpoints of TinyLlama as it was developed, and followed it closely. It was an awesome research project!


I'm also working on this, but with OpenAI BYOK in addition to local LLM via Llama.cpp: https://ChatOnMac.com for iOS/macOS and hopefully visionOS soon.


The entitlement of the iOS, iPadOS, and seemingly Vision Pro ecosystems is bizarre. It must be something about how the systems are designed that has devalued applications in the eyes of users.

In the Mac and Windows ecosystems you'd have no issue charging even up to $7 a month for an AI frontend.


Now it's “entitlement” to say that I don’t want to pay for access to a language model that is completely useless in a chat format like this (a model that I have plenty of experience with), when I already have access to more useful models through other apps? Wow.

Your comment sets the bar for entitlement really low. So, surely you spend all of your money buying things that you know are useless out of some obligation to not seem entitled? You can see how ridiculous that sounds, so the most charitable interpretation of your comment is that you didn’t actually read my comment before responding.

The feedback I provided was a lot more useful than trying to guilt trip someone into spending money on something they know they won’t get any value out of. If the author switches to a better language model, it will make their app far more attractive to potential buyers, and they can do this, as shown by the existence of other apps that already have. We are fortunate that TinyLlama is not the best model available.

> In the Mac and Windows ecosystems you'd have no issue charging even up to $7 a month for an AI frontend.

I absolutely would have an issue with paying $7/mo for a TinyLlama-only frontend, no matter the platform. Maybe you’ve never actually used TinyLlama? What have you found it useful for in a general chat environment? How did the accuracy compare to models like Mistral-7B that run just fine on Vision Pro?


Price is fine, I wouldn't stress it at this stage.


how much would you pay?


When I saw your comment I expected it to cost $100... Bro, it's $7. I pay more each month for email. Good software costs money and $7 is a pittance. What the heck were you expecting, a dollar??


$10-$15 actually.


[flagged]


seems to be unnecessary criticism?


Not really. It’s just a really simple hype-tech mashup and should be called out for it.


[flagged]


I bet most people who paid $3500 for a face computer are happy to pay $6.99 for an AI app.

Myself, I'll just keep running Ollama on my Linux laptop and let the Apple fans spend their money.



