GPU-Accelerated LLM on an Orange Pi (mlc.ai)
214 points by tosh on Aug 15, 2023 | 80 comments



I'm surprised we haven't seen dedicated boxes to self-host your uncensored & private LLM yet.

A bit like how you can self-host your apps at home on an Umbrel box.

I wonder if the NVIDIA Jetson series would be the hardware that makes the most sense?


Jetson is basically like Apple. Theoretically it's good, but the models with enough RAM are too expensive.

Smartphones aside, little Ryzen 6000 boxes would be OK.

Used DDR5 laptops with a little discrete GPU would be even better. I have one with a broken screen that may be dedicated to this very task.


You can get an Orin with 8GB. Or the 32/64GB AGX module for 1.3/2.3k (3k for the dev kit). Not cheap, but plenty of RAM and a 60W maximum power target.

You could maybe run something on the 4GB Jetson Nano?

But very slow: https://www.reddit.com/r/LocalLLaMA/comments/12c7w15/the_poi...


IIRC 32GB of shared RAM is not enough for llama 70B, and 8GB is just barely enough for llama 7B... So yeah, that value proposition is not good at all.

A 32GB+ DDR5 laptop with a dGPU and some VRAM of its own will (IIRC, just barely) do llama 70B for far less money and a similar TDP.
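
Rough math, assuming 4-bit quantization (numbers are approximate):

    7B  params x ~0.5 bytes/param ≈ 3.5 GB of weights -> fits in 8 GB with room for the KV cache
    70B params x ~0.5 bytes/param ≈ 35 GB of weights  -> already over 32 GB before KV cache/overhead

which is why the laptop route only works by splitting layers between the dGPU's VRAM and system RAM.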



$15,000…


The Apple Lisa was $25k when adjusted for inflation… prices are for early adopters


Yes, and the Lisa failed, because there just weren’t enough adopters at that price.

I struggle to see any sizable market for the tinybox, but I wish them good luck.


Lessons learned on the Lisa were applied to the Mac for less money to sell to more people. I suspect the same thing will happen with their VR/AR/XR headset.


They only got a second crack at it because they were floating on sales of the Apple II. If the Lisa had been their first product, we'd be saying "Apple who?"

The Vision is a whole different universe… with their cash and position they could (and may) take 10 cracks at it.


Doesn't make the tactic any more or less relevant...let the people with money and motivation become the first real-world usability testers, then optimize for the rest of the population.


Organization run by Geohot and no actual product delivered yet...


$15k for a box with 144 GB GPU RAM is not bad, but I'm not clear on how they're going to run that from a single 1600W PSU. That would be 6x 24GB GPUs, and I'm pretty sure you'd need 2x 1600W PSUs and two separate 15amp circuits to run such a thing at home (in the US).
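
Rough math, assuming ~350W per 24GB consumer card at stock limits (these numbers are assumptions, not their spec):

    6 GPUs x ~350W ≈ 2100W before you count CPU, fans, and PSU losses
    US 120V x 15A  = 1800W peak, ~1440W at the 80% continuous-load rule

so a single 1600W PSU only pencils out if every card is power-limited to roughly 200W, and even then a standard 15A circuit is marginal.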


You can undervolt GPUs without losing that much performance.


With Nvidia this is (typically) done by setting the power limit[0]. Even at the default power limit of my RTX 4090s (480 watts) I don't think I've ever seen an ML workload get close to that, and as the referenced article demonstrates you can more or less set your power limit to around 75% of max without losing much if any performance, depending on workload.

It doesn't take much testing to come up with the ideal power limit for your given workload(s).

[0] - https://www.pugetsystems.com/labs/hpc/nvidia-gpu-power-limit...
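
For reference, the basic commands look roughly like this (flags from memory; check nvidia-smi --help for your driver version, and note that setting the limit needs root):

    nvidia-smi -q -d POWER        # show current/default/min/max power limits
    sudo nvidia-smi -pm 1         # persistence mode, so the setting doesn't reset when the driver unloads
    sudo nvidia-smi -i 0 -pl 360  # cap GPU 0 at 360 W (~75% of a 480 W default)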


But realistically, the 2x PSU case was a common crypto mining setup and yeah, you have your electrician install two circuits right next to each other. Or use a 240V PSU.


That's exactly what I imagine would need to happen, I'm just surprised that the specs are claiming all of that with a single 1600W PSU. It's the kind of thing that doesn't inspire a lot of confidence that someone has thought through the details of what they are claiming to sell as part of this pre-order.


Nitpick - almost all modern power supplies are auto-ranging between 100-240V. In 240V locales at 80% constant load on a 15amp circuit this is 2880 watts. In the US you would (ideally) drop a 240V circuit from the panel and make sure you have sufficient supply amperage from the utility provider.


You can build your own and you don’t need 6 GPUs unless you’re training


I believe you missed the beginning of this thread? We’re specifically discussing pre-built solutions.


For inference? Mobile phones (unless you need it always on). Kind of works already with 7B weights, will keep getting better


I've been bamboozled by the Jetson series over at least three generations on a variety of platforms (TX1, Nano, AGX, Orin Nano).

They have their "special place" for certain applications but the software is a mess (old driver and CUDA versions, Jetpack is still based on Ubuntu 20.04), the ARM cores are (very) weak relative to ARM flagships, and the performance/price ratio makes no sense unless you really need the form factor and energy efficiency. Oh yeah and the SD card storage is typically frustratingly slow and often unreliable.

A $500 Jetson Orin Nano devkit with 8GB of shared RAM has roughly 10% of the performance of even ancient cards like the GTX 1070 (8GB of VRAM alone) that you can throw in a random used x86_64 tower or whatever for $300 all-in. For a couple of hundred dollars more you can get a more recent GPU with higher compute capability, extra storage, more system RAM, whatever. It's significantly higher power usage and a larger form factor, but with power optimization, scaling, etc. that makes very little practical difference for occasional inference workloads in "typical home use". I have this configuration idling (with models loaded) at 20 watts, scaling up to around 100-150W for however long given inference loads take to execute, and then scaling back down.


I was thinking that powerful used phones will be extremely valuable in the coming years, since they are fairly cheap and more powerful than these devices.


Vendors have been pushing hard for trade-in value when new phones are purchased to keep perfectly fine phones out of the ecosystem.


Not gonna happen. Perf/watt on phones is terrible compared to consumer GPUs


Orange Pi 5 has an NPU. I wonder if it'd be any faster than using the GPU.


It's not supported by TVM yet, but there is support for Qualcomm Hexagon.

You can kinda see some of the supported backends gated behind flags in the cmake file: https://github.com/mlc-ai/relax/blob/mlc/CMakeLists.txt


It might have been more work to convert the model for the RK3588 NPU, even though Rockchip provides an SDK and an automated conversion tool that should help (the SDK includes a simulator for the NPU, so the converted model can be tried on a PC before being deployed on a board like the Orange Pi):

https://wiki.t-firefly.com/en/ROC-RK3588S-PC/usage_npu.html


How many tokens per second do you think we can get out of this 6TFlops NPU?


For prompt ingestion... I dunno.

Unbatched token generation is basically RAM bandwidth limited, as the entire model has to be cycled through for each token. I bet theoretical performance is similar to the GPU, albeit with much lower power consumption.
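
A rough sanity check in Python (the bandwidth and model-size numbers here are assumptions, not measurements of this board):

    # Single-stream decode is roughly bandwidth-bound: every new token streams ~all the weights once.
    weights_gb = 3.5        # assumed: 7B model quantized to ~4 bits/param
    bandwidth_gbs = 15.0    # assumed: usable LPDDR4X bandwidth on an RK3588-class SBC
    print(f"upper bound ~{bandwidth_gbs / weights_gb:.1f} tok/s")  # ~4.3 tok/s; real numbers come in lower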


Nice achievement.

How many users would realistically be able to use it at the same time when running on such a device? I am interested in its scalability.


That's a tricky question. You're going to have to multiplex the use of the device, but since these are mostly 'ping-pong' style uses you can use something called a 'utilization factor' to figure out a reasonable upper bound at which you still get an answer to your query in acceptable time. The typical mechanism is an input queue with a single worker using the device. The cut-off is when the queue becomes unacceptably long, in which case you would have to throw an error or be content with waiting (possibly much) longer for your answer. This is usually capped by some hard limit on the length of the queue (for instance: available memory) or by the queue filling up faster than it can empty, even over a complete daily cycle. Once that happens you need more hardware.
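
A minimal sketch of that queue-plus-single-worker setup in Python (the cut-off and the run_inference() call are made up for illustration):

    import queue
    import threading

    MAX_PENDING = 8                                   # assumed cut-off for an "acceptable" wait
    requests = queue.Queue(maxsize=MAX_PENDING)

    def run_inference(prompt):                        # stand-in for the real model call
        return f"answer to: {prompt}"

    def worker():                                     # single worker: the one device is used serially
        while True:
            prompt, out, done = requests.get()
            out.append(run_inference(prompt))
            done.set()

    threading.Thread(target=worker, daemon=True).start()

    def submit(prompt):
        out, done = [], threading.Event()
        try:
            requests.put((prompt, out, done), block=False)   # shed load instead of queueing forever
        except queue.Full:
            raise RuntimeError("queue full: wait, reject, or buy more hardware")
        done.wait()
        return out[0]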


Actually, many inference systems instead batch all requests within a time period and submit them as a single shot. It increases the average latency but handles more requests per unit time. (At least, this is my understanding of how production serving of expensive models that support batching works.)
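
A sketch of that time-window batching in Python (window size, batch size, and run_batched_inference() are illustrative assumptions):

    import queue
    import threading
    import time

    WINDOW_S, MAX_BATCH = 0.05, 32       # assumed: collect for up to 50 ms or 32 requests, whichever comes first
    pending = queue.Queue()

    def run_batched_inference(prompts):  # stand-in for a single batched forward pass
        return [f"answer to: {p}" for p in prompts]

    def batch_worker():
        while True:
            batch = [pending.get()]                     # block until the first request arrives
            deadline = time.monotonic() + WINDOW_S
            while len(batch) < MAX_BATCH:
                try:
                    batch.append(pending.get(timeout=max(deadline - time.monotonic(), 0)))
                except queue.Empty:
                    break
            results = run_batched_inference([p for p, _, _ in batch])
            for (_, out, done), r in zip(batch, results):   # hand each caller its own answer
                out.append(r)
                done.set()

    threading.Thread(target=batch_worker, daemon=True).start()

    def submit(prompt):                  # higher average latency, but more requests served per unit time
        out, done = [], threading.Event()
        pending.put((prompt, out, done))
        done.wait()
        return out[0]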


I've done a bunch of optimization for GPU code (in CUDA) and there are typically a few bottlenecks that really matter:

- memory bandwidth

- interconnect bandwidth between the CPU and GPU

- interconnect bandwidth between GPUs

- thermals and power if you're doing a good job of optimizing the rest

I don't see how a batching mechanism would improve on any of those, superficially it looks as though that would make matters worse rather than better. Can you explain where the advantage comes from?


It's a latency vs. throughput tradeoff. I was surprised as well. But most GPUs can do 32 inferences in the same time as they can do 1 inference. They have all the parallel units required and there are significant setup costs that can be amortized since all the inferences share the same model, weights, etc.

https://groq.com/wp-content/uploads/2020/05/GROQP002_V2.2.pd... the "batching" section of https://docs.nvidia.com/deeplearning/tensorrt/archives/tenso... https://le.qun.ch/en/blog/2023/05/13/transformer-batching/


Very interesting, thank you. I will point one of my colleagues who is busy with this stuff to these, and I thank you on his behalf as well; it is exactly the kind of thing they are engaged in.


I think in the case of LLM inference the main bottleneck is streaming the weights from VRAM to CU/SM/EU (whatever naming your GPU vendor of choice uses).

If you're doing inference on multiple prompts at the same time by batching, you don't spend more time streaming. But each streamed weight gets used for, say, 32 calculations instead of 1, making better use of the GPU's compute resources.
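
Back-of-the-envelope with assumed numbers (4-bit 7B weights ≈ 3.5 GB, ~20 GB/s of memory bandwidth):

    ~20 GB/s / 3.5 GB ≈ 6 decode steps per second, regardless of batch size
    batch = 1  ->  ~6 tok/s total
    batch = 32 ->  still ~6 steps/s of weight streaming, but ~190 tok/s aggregate
                   (ignoring the extra compute and per-request KV-cache traffic)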


"Scalability" and "Single Board Computer" don't really belong in the same sentence. That said, today you can get a refurbished mini PC with a lot more power, for a lot less money than the higher end SBCs. But I didn't see any info on how portable this project is to other hardware.


I think the biggest advantage here is that you can run it on the GPU using shared memory, and I'm not sure how widespread that is on mini PCs (at least not on Intel NUCs).

You could run it using OpenVINO on Intel CPUs, but the performance would probably take a hit. It would be a lot easier though, since you can just use ggml.


Given the low cost of the setup, I'd expect this to be a single-user solution. Maybe something enabling better smart home / smart device interactions?


If you can run it as an AI Horde worker, and the home usage is sporadic, you could definitely support more than one person.

Otherwise, ~1.5 tokens/s is definitely the minimum you'd want when streaming tokens to a single person.


Not many, since it's slow to begin with.

You'll get log_2-based scaling efficiency with nearly any batch-size increase, subject to some limitations (memory, etc).

That should be enough at least to roughly sketch it out.


I had to make a minor modification to the code to make the Rust compiler happy: just add a `.as_slice()` where the compilation fails. I'll submit a PR if it's not fixed already.


Ah please help us by submitting a PR! I noticed the rust build failed last night but didn’t get a chance to look into it


All PRs greatly appreciated! Is this on the Rust TVM bindings?


Best part is that they are using TVM.


This was the part that intrigued me. After years of silently begging PyTorch to support ROCm, suddenly seeing models in general support so many different platforms feels overwhelming, and really good.

I'm crediting llama.cpp of all things for being the boost to really up the ante on open source model compilation.

Whatever it is, at least, many of these open source things feel like they just 'happen', as an eventuality, but in order for that to happen it takes a lot of work from a lot of people! Really happy to see the dream of this particular kind of democratization opening widely! :)


Is it possible a pool of these running smaller models could somehow be cheaply combined into an MoE approach (as GPT-4 supposedly is) to create something cheap but higher quality?


I'm already getting 1.5 tok/s on Ubuntu running on Android via UserLAnd w/ llama.cpp (v2-Q4). Don't really see the acceleration. If anything I need my phone to do something actually useful at, let's say, 7-10 tok/s.


Human speech is in the 2-4 tokens per second range, I think that's about where my frustration limit is.


mlc should already be pretty fast on Vulkan


Does MLC work for vision models? For example, the docs mention --max-seq-len MAX_ALLOWED_SEQUENCE_LENGTH as a command-line option, which seems to imply that it only accepts language models?

Also, they don't seem to say anything about the input model's format. PyTorch weights? ONNX?


Yup, here's their web stable diffusion repo: https://github.com/mlc-ai/web-stable-diffusion

The input is a model (weights + runtime lib) compiled via the mlc-llm project: https://mlc.ai/mlc-llm/docs/compilation/compile_models.html


Does the Mali GPU have shared RAM with the CPU on this device? Is this common for ARM devices?


Any idea how the GPU compares to Nvidia's Jetson series?


It is difficult to find information about the performance of the ARM GPUs that can be compared with that of NVIDIA/AMD/Intel.

However it seems that the Mali-G610 MC4 is in about the same range as the cheaper models of the old Jetson Xavier.

The newer Jetson Orin models have much faster Ampere GPUs with between 1024 and 2048 FP32 ALUs. Nevertheless, the various Jetson Orin models have a price between 4 times and 13 times higher than an SBC with RK3588 and 16 GB DRAM (especially all the Orin models with more than 8 GB DRAM are very expensive), and the ratio between their prices is much greater than the ratio between their performances.

Any small computer with AMD Phoenix offers a much better GPU performance per dollar than any NVIDIA Orin. The use of NVIDIA Orin is justified only when one needs a device that is qualified for an automotive environment.


Thanks for the answer.

> Any small computer with AMD Phoenix offers a much better GPU performance per dollar than any NVIDIA Orin.

> The use of NVIDIA Orin is justified only when one needs a device that is qualified for an automotive environment.

Nvidia Orin will use significantly less energy though.


> Nvidia Orin will use significantly less energy though.

Not really.

A Ryzen 7 7840U has a GPU with 768 FP32 ALUs @ 2.7 GHz and an NPU that can do 10 TOPS, and it has a default TDP of 28 W.

The top Jetson AGX Orin models consume up to 60 W or 75 W, but they are so expensive that it does not make sense to compare them with a computer with 7840U and 32 GB of LPDDR5x-7500 that costs 3 times less.

A comparison that makes more sense is with a Jetson Orin NX 16GB (still significantly more expensive), which has a GPU with 1024 FP32 ALUs @ 0.918 GHz and it has a default TDP of 25 W.

For graphics tasks, Jetson Orin NX would be several times slower than an AMD Phoenix, due to its low GPU clock frequency and much slower CPU cores. The same is true for any programs executed on the CPU cores.

On the other hand, for AI inference, Jetson Orin has very fast tensor cores, so it can be many times faster than an AMD GPU or an ARM GPU, i.e. Jetson Orin NX 16 GB is claimed to be able to do 100 TOPS, so if this is the main intended application it can be worthwhile. Nevertheless, the usefulness of the Jetson Orin models for AI inference is diminished by the fact that their price increases very steeply when more memory is desired.


I've been thinking of this. It's just fascinating to me to have a small device that you can converse with and knows almost everything. Perfect for preppers / survivalists. Store it in a faraday cage along with a solar generator.


> knows almost everything

It really doesn't. It doesn't even know what it knows and what it doesn't know. Without ways to check up on whether what it told you is true or not you may well end up in more trouble than where you were before.


How about a local wikipedia dump, with precalculated embeddings? Then you can perform a similarity search first and feed the results to the LLM.

It’s less likely to hallucinate this way.
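
A minimal sketch in Python of that retrieve-then-generate loop; the embedding library, model name, and the load_wikipedia_chunks()/run_llm() calls are assumptions for illustration, not a specific recommendation:

    import numpy as np
    from sentence_transformers import SentenceTransformer   # one possible local embedding model

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = load_wikipedia_chunks()                         # hypothetical: your local dump, split into passages
    doc_vecs = embedder.encode(chunks, normalize_embeddings=True)   # precalculate once and store on disk

    def answer(question, k=3):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(doc_vecs @ q)[-k:]                  # cosine similarity, since the vectors are normalized
        context = "\n\n".join(chunks[i] for i in top)
        prompt = (f"Answer using only the context below.\n\n{context}\n\n"
                  f"Question: {question}\nAnswer:")
        return run_llm(prompt)                               # hypothetical: whatever local model you run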


That would make a lot more sense. That way at least you have a chance to check up on the output, lest your first meal of 'Hedysarum alpinum' ends up being your last.


The key to this being a good solution is actually building a corpus representing as much of the reference knowledge needed in the scenario as possible. The idea that the answer is Wikipedia way underestimates the scope.

The Wikipedia patch doesn’t make much sense to me.

What percent of the important questions being asked in this doomsday scenario actually have their answer in Wikipedia?

If 50% of the time you are left trusting raw LLaMa, then you don’t really have a decent solution.

I do appreciate the sentiment tho that future or finetuned LLMs might fit on an RPi or whatever, and be good enough.


I think the theory is that an LLM can integrate the knowledge of Wikipedia and become something greater than the sum of its parts, by applying reasoning that is explained in one article to situations in other topics where the reasoning might not be so well explained. Then you can ask naive questions in new scenarios where you might not have the background knowledge (or simply the mental energy) to figure out a right answer on your own, and it powers through for you. AFAIK current LLMs are not this abstract: if one type of reasoning is usually applied in one scenario and a different type in another, they have no context beyond the words themselves and which topics humans usually jump to from a given topic.


This concept already exists and is in practice at many companies that need knowledge-driven results. https://arxiv.org/abs/2005.11401


There was another paper out recently that adds to this: https://arxiv.org/pdf/2308.04430.pdf. Looks like a more flexible approach to document storage, and it outperforms retrieval in context.

They trained up their own LLM, but from the text it seems like it might be possible to use any LLaMA-style LM without retraining. Not sure though, need to give it a proper look.


> a local wikipedia dump

There exists (at least) a project to train and query an LLM on local documents: privateGPT - https://github.com/imartinez/privateGPT

It should provide links to the source with the relevant content, to check the exact text:

> You'll need to wait 20-30 seconds (depending on your machine) while the LLM model consumes the prompt and prepares the answer. Once done, it will print the answer and the 4 sources it used as context from your documents

You will have noticed, in that first sentence, that it may not be practical, especially on an Orange Pi.


It would be great to see such a project implemented; I wonder how well it would perform.


Yes, especially having fact checked output of LLMs would be a nice step in the right direction. Throwing out the hallucinated bits and keeping the good stuff would make LLMs a lot more applicable.


Isn't that a bit of a holy grail though? If your software can fact check the output of LLMs and prevent hallucinations then why not use that as the AI to get the answers in the first place?


Because you - hopefully - have a check against something that is on average of higher quality than the combined input of an LLM.

I'm not sure whether this can work or not, but it would be nice to see a trial. You could probably do this by hand if you wanted to, by breaking up the answer from an LLM into factoids, checking each of those individually, and assigning a score to each based on the amount of supporting evidence. I'd love that as a browser plug-in too.


https://arxiv.org/pdf/2308.04430.pdf is interesting from that point of view. They've tackled it from the perspective of avoiding copyrighted content in training while still including it at inference, but I think it ought to mean less hallucination because they also (claim to) solve the attribution problem.


Nice one, thank you. Added to my 'read later today' list, the abstract looks very interesting.


My hypothesis is that including supporting information in the LLM's prompt roughly changes the task from text generation, which is very hallucination-prone, to text summarization (or reformulation with some reasoning), which is less likely to hallucinate.

That was my personal experience in general with ChatGPT as well as LLaMa1/2.


A friend and colleague of mine just tried this and the first results are quite promising.


What is the current state of "correspondence between fed text and output text"? I.e., when fed in training e.g. that "Spain invested $6,000 in Columbus' first voyage", how reliably will LLMs repeat that notion exactly?

This is without taking reasoning and consistency into account. And already this notion that I picked at random is not without issues: dollars computed how? And it is not difficult to state that Columbus reached the Caribbean in 1492; it is more complex to "decide" the year of the siege of Troy out of the many dates proposed.

But already at the simplified level of clear, settled notions: if LLMs are told that "A is B", and in the absence of inconsistency in the training corpus, what is the failure rate (i.e. of then outputting something critically different)?

> ways to check up

Some LLMs work as search engines, outputting not just their tentative answer but linked references. A reasonably safe practice at this stage is to use LLMs that way: ask, then use the output to check the references.


Yes like a person


Depending on an LLM for survival is a good way to end up dead.

English Wikipedia will fit on an SD card. It's more valuable and more practical.


"I am confident those red berries you are describing are perfectly fine to eat"


The energy use alone would probably rule this out for survival, no?



