Ask HN: Which LLMs can run locally on most consumer computers
75 points by FezzikTheGiant 4 months ago | 94 comments
Are there any? I was thinking about LLM-based agents in games, and this will probably only be viable when most devices can handle LLMs running locally.



I've been curious as to when games would adopt any of these new technologies, but I think they are simply too slow for now?

I think we're at least 10-15 years from being able to run low-latency agents that "RAG" themselves into the games they are part of, where there are hundreds of them: some of them NPCs, others controlling some game mechanic or checking whether the output from other agents is acceptable or needs to be run again.

At the moment a 16GB MacBook Air can run Phi-3 Medium (14B), which is extremely impressive, but at 7 tokens per second it's way too slow for any kind of gaming. You need to 100x the performance, and we need 5+ hardware generations before I can see this happening.

Unless there's some other application?


> for games: I think they are simply too slow for now?

I think it's two-fold. The primary reason is that it's likely very difficult to maintain a designer's storyline vision and desired "atmosphere / feel", because LLMs currently "go off the rails" too easily. The second is that the teams with enough funding to properly fine-tune generative AI for dialog, level/environment creation, character generation, etc. are generally making AAA or AAA-adjacent games, which already need so much of a consumer GPU's VRAM that there's not a lot left over for large ML models to run in parallel.

I do think, though, that we should already be seeing indie games doing more with LLMs and 3D character/level/item generation than we are. Of course AI Dungeon has been trailblazing this for a long time, but I just expected to see more widely-recognized success by now from many projects. I take this as a signal that it's hard to make a "good" game using AI generation. If anyone has any suggestions for open-world games with a significant amount of AI generation that allows player interaction to significantly affect the in-game universe, I'd be very interested in play-testing them. Can be any genre / style / budget. I just want to see more of what people are accomplishing in this space.

My hope is that there will be space both for the current style of game, where every aspect is created/designed by a human, and for games of various types where the world is given an overall narrative/aesthetic/vision by the creators but the details are implemented by AI. That would allow true open-world play where you can finally just walk into any shop, and use RAG/etc. to allow complete continuity over months/years of play, where characters remember the conversations/interactions/actions of you and anyone else playing in the same world.

I do think there's something of an "end-game" for this where a game is released that has no game at all in it, but rather generates games for each player based on what they want to play that day, and creates them as you play them. But I'd like to imagine that this won't replace other games (even if it does take a bit of the air out of the room), but rather exist alongside games with human-curated experiences.


I think any NPC with dialogue important to a goal (a quest, a tutorial, etc.) is going to be hard to use generative AI for. It not only needs to be coherent with the story, but it needs to correctly include certain ideas. E.g. if the NPC gives a quest to go find some item at some location, it needs to say what the item is and where it is.

I think we're currently stuck in a local minimum where AI isn't up to the task of making a coherent player-interactable world, but an incoherent or fragmented and non-interactable world isn't impressive enough (like No Man's Sky).


Agreed for current systems. I’m sure we’ll get models in the future which will facilitate this but for now LLMs don’t really stay on task like a professional human would.

And even in AI Dungeon the AI plays so fast and loose that it breaks immersion. Like if I'm doing a space-trading roleplay, it doesn't consider things like making sure the product I'm buying or selling meets a specific spec, and often a vendor will start offering to buy Product X from me while I'm negotiating purchasing Product X from them. This "type" of continuity problem happens constantly in AI Dungeon.

We’re just not there yet, but I have confidence we’ll get there. I think it’s possible even with our current model/training paradigms but we aren’t using RLHF for game applications yet.


I totally think we'll get there, I just don't think we're there yet.

I really think the next step is a heavily AI-integrated version of D&D where the DM can serve as a "filter" for some of the more unhinged output (where appropriate; an intentionally incoherent goblin with some text-to-speech could be phenomenal).

I think that's about where we're at, and I'm expecting a wave of "AI-enhanced" D&D apps any day now. They probably already exist and I just haven't seen them. I would imagine there are still occasional issues with the AI utterly choking; I see it every once in a while on some of my more "fantasy" prompts where I get too specific and it just ignores what I asked.


> I think any NPC with dialogue important to a goal (a quest, a tutorial, etc.) is going to be hard to use generative AI for. It not only needs to be coherent with the story, but it needs to correctly include certain ideas. E.g. if the NPC gives a quest to go find some item at some location, it needs to say what the item is and where it is.

That was my experience when I was experimenting with using current LLMs to generate quests. You can of course ask for both a human-readable quest description and also a JSON object (according to some schema) describing the important quest elements, but the failure rate of the results was too high. Maybe 10% of quests would have some important mismatch between the description and the JSON; the description would mention an important object but it would be left out of the JSON, or the JSON would mention an important NPC but the description wouldn't, etc.
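
For illustration, here's a minimal sketch of the kind of post-generation consistency check this pushes you toward; the schema and field names are made up for the example, and a naive substring match obviously misses paraphrases, but it does catch the "JSON names something the prose never mentions" failures so you can regenerate:

    # Reject quest generations where the JSON and the prose disagree.
    # Hypothetical schema: {"item": ..., "location": ...}
    import json

    def quest_is_consistent(description: str, quest_json: str) -> bool:
        """True only if every entity named in the JSON also appears in the prose."""
        quest = json.loads(quest_json)
        required = [quest.get("item", ""), quest.get("location", "")]
        return all(name and name.lower() in description.lower() for name in required)

    desc = "Bring me the Amulet of Yendor from the Sunken Crypt east of town."
    good = '{"item": "Amulet of Yendor", "location": "Sunken Crypt"}'
    bad = '{"item": "Amulet of Yendor", "location": "Frozen Spire"}'

    print(quest_is_consistent(desc, good))  # True
    print(quest_is_consistent(desc, bad))   # False -> regenerate or repair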

As a player, I think it would get frustrating quickly if 10% of quests were unsolvable, especially since, as a player, you don't know when a quest is unsolvable; maybe you just haven't found the item/NPC yet.


Yeah, 10% roughly jibes with what I would expect under the assumption that the generated text needs to be non-deterministic (i.e. no careful prompt tuning and no turning the temperature down to basically 0).

An interesting flip side I was just thinking about is the AI saying too much. NPCs keeping secrets until the player gets enough reputation or does a favor or whatever is pretty common. I wonder how good they are at keeping those secrets.

Prompt injection is one thing, and vaguely equivalent to cheat codes, which is fine, but what is the likelihood that a player just asking for more info ends with the AI spitting out the secret without completing the quest? And will the AI know to unlock the next area or whatever, now that there's no reason for the player to do that NPC's quest?

Should be neat stuff, I'm looking forward to how this all works together when the kinks get ironed out.


To some degree, yes. But there's a low value-to-cost ratio in that exact UX.

Take a single character in the game, and give that character the depth and nuance of a true experience with a Zen master / inquiry facilitator, powered by AI. IXCoach.com can do a phenomenal job powering this, so literally the only code needed for an MVP is the mod + character API.

Then the cost-benefit ratio is 400x, and in a day of coding you have taken a game that is mostly pure entertainment and provided a means for depth, nuance and personal development that literally leads the market.

I pinged the executive producer of CD Projekt Red about this; it's viable.

https://www.linkedin.com/in/danhernberg/


Current games that use LLMs only activate the model when the user is talking to the NPC. But in order to create a truly dynamic story that is completely random yet to the point, the agents need to interact with each other as well: say there are around 100 agents in the game, they need to interact with one another to generate some emergent behavior. The form of that interaction is an open question here: will it be in natural language, or just some embeddings or states?

But this thing still has a long way to go.


I agree in the context of LLMs running locally. For API-connected games, cloud support for nuanced conversations would be a tremendous value add. Take a hit like Cyberpunk, create a mod that wires into a custom AI from ixcoach.com... we could literally integrate the most nuanced self-inquiry practices into the top games this way.

If anyone working on top games through mods wants to explore this, let me know; Next AI Labs would be interested in supporting such efforts.


There are mods for Skyrim right now that run an NPC's dialog and lore through a small 7B model and output text dialog. Heck, if you wanted you could run a 2B Whisper model and get reasonably decent voice support.

It's all very exciting, if a little janky.


If we're just talking about NPCs in a video game, I bet the game studios have the resources to train a very specific LLM optimized for NPCs. Lots of training data could probably be stripped out; after all, your average quest-giver in Skyrim doesn't need to know how to implement Black-Scholes in Rust.


The problem is that you need two GPUs, and the AI one can't be from AMD. We aren't 15 years away; more like two or three. NPUs are coming, and DDR6 plus quad-channel memory would get you decent performance on small LLMs like Llama 3.

You're also forgetting that batch performance is already an order of magnitude better than single session inference.


I agree for the most part, but I still think some pretty cool games can come out of local LLMs. Suck Up!, for example, though not local AFAIK, is a pretty cool one.


There are a few games that use LLMs and voice; they are usually hilariously janky.


Could you name some?


How in the world would this be tested? Anything pertaining to game logic needs to be deterministic.

I can't see LLMs in games being used for anything more than some random NPC voice quips. And whose voice would be used? Would voice actors be okay with this?

There are already too many bad games, we certainly don't need thousands more with AI-generated drivel dialogue, although having human writers is not a panacea either way.


Have other AI agents test the game in thousands of scenarios. Voice actors are not needed, SOTA TTS systems can synthesize a brand new voice from a description.


See llamafile (https://github.com/Mozilla-Ocho/llamafile), a standalone packaging of llama.cpp that runs an LLM locally. It will use the GPU, but falls back on the CPU. CPU-only performance of small, quantized models is still pretty decent, and the page lists estimated memory requirements for currently popular models.


+100 to this. I don't think many people reading this thread realize how easy they've made it to run an LLM locally. It's a great start if you want to kick multiple tires (be careful to clean up; the gigs add up).

> wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF...

> chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

> ./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile -ngl 999

https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/


Maybe a dumb question, but I think anyone reading this question would know a good answer for me. If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally? "Best" in this case means I want to get the best/smartest answers to my questions about these PDFs. They're all full-text PDFs, studies and results on a specific genetic condition that I'd like to understand better by asking it some smart questions.


LlamaIndex can make this task possible in a very few (surprisingly few) lines of code: https://docs.llamaindex.ai/en/stable/understanding/putting_i...

You'll likely want to move beyond the first examples so you can choose models & methods. Either way, LI has tons of great documentation and was originally built for this purpose. They also have a commercial Parsing product with very generous free quotas (last I checked)
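
For reference, a minimal sketch along the lines of their starter tutorial (the directory path and question are placeholders; assumes llama-index is installed and an LLM/embedding backend is configured, which is OpenAI by default):

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    # Load every PDF in the folder, embed it, and build a queryable index.
    documents = SimpleDirectoryReader("./papers").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # Ask questions grounded in the indexed PDFs.
    query_engine = index.as_query_engine()
    response = query_engine.query("Which treatments were evaluated, and with what outcomes?")
    print(response)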


If it's just for you, may I suggest OpenAI's Python notebook examples. This was the one I used to get started.

https://cookbook.openai.com/examples/parse_pdf_docs_for_rag

There are several other examples like this, but I got stuck in the jargon of LangChain, LlamaIndex, etc.


Not self-hosted, but Google NotebookLM is OK at that: https://notebooklm.google.com/

You can also upload files to ChatGPT and ask questions about them.


Is there any validity to the idea of using a higher-level LLM to generate the initial data, and then copying that data to a lower-level LLM for actual use?

For example, another comment asked:

"If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally?"

So what if you used a paid LLM to analyze these PDFs and create the data, but then moved that data to a weaker LLM in order to run question-answer sessions on it? The idea being that you don't need the better LLM at this point, as you've already extracted the data into a more efficient form.


Maybe. You'd need to develop such a "more efficient" format. Turning unstructured text into knowledge graphs has gotten attention lately, though I'm honestly skeptical of how useful those will turn out to be. Oftentimes you just can't break unstructured data down into structured data without losing a ton of information. Turning the data into an intermediary format not directly understandable by humans (say, very high-density embeddings) might be a more promising path.


Yes, this can work. I’ve done that in a few cases.

In fact, if you split the data preprocessing into small enough steps, they could also be run on weaker LLMs. It would take a lot more time, but it is doable.


There is actually a specific application of this concept for generating synthetic data for training datasets, called UDAPDR [0].

It or something like it could likely be applied to any form of generation including what you are describing.

[0] - https://github.com/primeqa/primeqa/tree/4ae1b456dbe9f75276fe...


Yes, this model works in many cases.

For example, ask the (better, costlier) Claude Opus to generate high-quality prompts, which get fed into (worse, cheaper) Claude Sonnet.
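
A rough sketch of that split using the Anthropic SDK; the model IDs and the task here are illustrative assumptions, not a recommendation of specific versions:

    # Pay for the strong model once to design a reusable prompt, then run the
    # cheap model many times with it.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    design = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content":
                   "Write a reusable system prompt for summarizing genetics papers "
                   "for a careful non-expert reader. Output only the prompt."}],
    )
    system_prompt = design.content[0].text

    answer = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=800,
        system=system_prompt,
        messages=[{"role": "user", "content": "Summarize: <paper text here>"}],
    )
    print(answer.content[0].text)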


Yes, that is what I am doing on some projects


I was able to successfully run Llama 3 8B, Mistral 7B, Phi, and other 7B models using Ollama [1] on my M1 MacBook Air.

[1] https://ollama.com
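
If it helps anyone, the Ollama Python client makes this scriptable in a few lines. A minimal sketch, assuming the Ollama server is running and the model has already been pulled (e.g. "ollama pull llama3"):

    import ollama  # pip install ollama; talks to the local Ollama server

    # Stream a reply from a locally running model.
    for chunk in ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": "Give a one-line greeting for a tavern keeper."}],
        stream=True,
    ):
        print(chunk["message"]["content"], end="", flush=True)
    print()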


Are they able to run at a good speed? I'm just wondering what the economics would look like if I want to create agents in my games. I don't think many players are going to be willing to go with usage-based / token-based pricing. That's the biggest roadblock to building LLM-based games right now.

Is there a way to reliably package these models with existing games and make them run locally? This would virtually make inference free right?

What I think, from my limited understanding of this field, is that if smaller models can run on consumer hardware reliably and speedily, that would be a game changer.


> Are they able to run at a good speed?

Not on most consumer computers, which likely lack a dedicated GPU. My M2 struggles with a 7B model (it's the only thing that makes it warm), and the token speed is unbearable. I switched to remote APIs for the speed.

If you are targeting gamers with a GPU, the answer may change, but as others have pointed out, there are numerous issues here.

> This would virtually make inference free right?

Yes-ish, if you are only counting your dollars; however, it will slow their computer down and have slow response times, which will impact adoption of your game.

If you want to go this route, I'd start with a 2B sized model, and not worry about shipping it nicely. Get some early users to see if this is the way forward.

I suspect that remote LLM calls with sophisticated caching (cross-user / per-convo / pre-gen'd) are something worth exploring as well. IIRC, people suspected gpt-3.5-turbo was caching common queries and avoiding the LLM when it could, for speed.
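
As a sketch of the cheapest version of that idea, an exact-match cache keyed on a normalized prompt; call_llm is a placeholder for whatever backend you use:

    # Naive exact-match response cache: only helps for repeated/common prompts,
    # but those are exactly the calls worth skipping the LLM for.
    import hashlib

    _cache: dict[str, str] = {}

    def cached_reply(prompt: str, call_llm) -> str:
        key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
        if key not in _cache:
            _cache[key] = call_llm(prompt)  # only hit the model on a cache miss
        return _cache[key]

    # Usage: cached_reply("What do you sell?", call_llm=my_backend_function)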


You could also ship a couple of them and let the game/user choose which one to run depending on the hardware.


This is something I was considering as well - thanks


There isn't. For games you would need vLLM, because batch size is more important than latency. Something that people don't seem to understand is that an NPC doesn't need to generate tokens faster than its TTS can speak. You only need to minimize the time to first token.


The biggest roadblock is not running the model on the user's machine, that's barely an issue with 7B models on a gaming PC. The difficulty is in getting the NPC to take interesting actions with a tangible effect on the game world as a result of their conversation with the player.


The generative agents paper takes a pretty decent shot at this I think


Here [1] is a reference for the tokens/sec of Llama 3 on different Apple hardware. You can evaluate whether this is acceptable performance for your agents. I would assume the tokens/sec would be much lower if the LLM agent is running alongside the game, as the game would also be using a portion of the CPU and GPU. I think this is something you need to test on your own to determine its usability.

You can also look into lower-parameter models (3B, for example) to determine whether the balance between accuracy and performance fits your use case.

>Is there a way to reliably package these models with existing games and make them run locally? This would virtually make inference free right?

I don't have any knowledge of game dev, so I can't comment on that part, but yes, packaging it locally would make the inference free.

[1] https://github.com/ggerganov/llama.cpp/discussions/4167


Thanks! This is helpful. I was thinking about the Phi models - those might be useful for this task. Will look into how those can be run locally as well.


I just ran phi3:mini[1] with Ollama on an Apple M3 Max laptop, on battery set to "Low power" (mentioned because that makes some things run more slowly). phi3:mini output roughly 15-25 words/second. The token rate is higher but I don't have an easy way to measure that.

Then llama3:8b[2]. It output 28 words/second. This is higher despite the larger model, perhaps because llama3 obeyed my request to use short words.

Then mixtral:8x7b[3]. That output 10.5 words/second. It looked like 2 tokens/word, as the pattern was quite repetitive and visible, but again I have no easy way to measure it.

That was on battery, set to "Low power" mode, and I was impressed that even with mixtral:8x7b, the fans didn't come on at all for the first 2 minutes of continuous output. Total system power usage peaked at 44W, of which about 38W was attributable to the GPU.

[1] https://ollama.com/library/phi3 [2] https://ollama.com/library/llama3 [3] https://ollama.com/library/mixtral
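
(In case it's useful: one way to get an actual tokens/sec figure out of Ollama, rather than counting words, is the eval stats it returns with a response. A minimal sketch; the field names follow the Ollama REST API docs, and their exact shape in the Python client is an assumption on my part.)

    import ollama

    resp = ollama.chat(
        model="phi3:mini",
        messages=[{"role": "user", "content": "Write a 200-word story using short words."}],
    )
    tokens = resp["eval_count"]             # tokens generated
    seconds = resp["eval_duration"] / 1e9   # eval_duration is reported in nanoseconds
    print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tokens/sec")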


Well since OP doesn't seem to want to: Thank you for your response.

I came across this thread while doing some research, and it's been helpful.

(I hate how common Tragedy of the Commons is. =/)


What? Chill out, buddy - there's such a thing as time zones; I was just sleeping.


The general rule is that VRAM in GB ≈ parameter count in billions (I'm generalizing about GGUF finetunes here):

8 GB VRAM cards can run 7B models

16 GB VRAM cards can run 13B models

24 GB VRAM cards can run up to 33B models

Now to your question: what can most computers run? You need to look at the tiny but specialized models. I would think 3B models could be run reasonably well even on the CPU. IntelliJ has an absolutely microscopic <1B model that it uses for code completion locally. It's quite good, and I don't notice any delay.
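
The rule of thumb falls out of simple arithmetic: weight memory is roughly parameter count times bits per weight divided by 8, plus headroom for the KV cache and activations. A quick back-of-the-envelope sketch (which ignores that context/KV-cache overhead, hence the extra margin in the numbers above):

    def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
        """Approximate GB needed just for the weights: params * bits / 8."""
        return params_billion * bits_per_weight / 8

    for params in (3, 7, 13, 33):
        print(f"{params}B @ 4.5 bpw ~= {approx_weight_gb(params, 4.5):.1f} GB of weights")
    # The remainder of the 8/16/24 GB card goes to KV cache, activations, and the OS.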


Perhaps there's a simple explanation but why does 24GB of VRAM offer such a large relative uplift in parameter count? (is memory bandwidth a factor rather than just the total memory amount?)


So, this is a bit misleading. For whatever reason the models tend to be released in certain parameter sizes. 7B models are popular; the next highest is 13B, with few in between (some 11B), and likewise the jump from 13B is straight to 33B. You can run finetunes of a 33B model that have been cut down a little and fit them on a 24GB card. Likewise, those 13B models running on 16GB cards have a lot of headroom: you don't need to run as cut-down a model, and you can run it with more context (i.e. the amount of your chat it can hold in memory).

I hope that helps; it's not 1:1, and it's a bit confusing.


Thank you, that's helpful context.


Probably quantisation.

I own a 4090 and I can only run very heavily quantised 33B models. It's not really worth it.

My LLM server with a 16GB GPU mainly runs Llama 3 with an expanded context window, which also costs much more memory.


Yeah, I have a 3090 and 64GB of RAM. I can run an 8x7B and get pretty decent performance out of it with partial offloading.


Really?? For me it's terrible doing that. I also have 64GB RAM but meh. It's so bad when I can no longer offload everything. The tokens literally drizzle in. With full offloading they appear faster than I can read (8B llama3 with 8 bit quant). On a Radeon Pro VII with 16GB (HBM2 memory!)


Oh man, I hate to say it, but it's likely your AMD card. Yes, they can run LLMs and SD, but badly. Larger models are usable for me with partial offloading, but you're right that fully loading the model into VRAM is really preferable.


I don't think so, because when I run it on the 4090 I get the same issue (in a system with 5800X3D and 64GB RAM also). I just don't use the 4090 for LLM because I have it for playing VR games and I don't want to tie it up for a 24/7 LLM server :) Also, it's very power-hungry. I do run that one on Windows and the Radeon server is Linux but I don't think that matters a lot. Using the same software stack too (ollama).

In fact the Radeon which cost me only 300 bucks new performs almost as well running LLMs as the 4090 which really surprised me! I think the fast memory (the Radeon has the same 1TB/s memory bandwidth as the 4090!) helps a lot there.

When I run a local model (significantly) bigger than the 24GB VRAM on the 4090 it won't even load for 15 minutes while the 4090 is pegged at 100% all the time. Eventually I just gave up.


>When I run a local model (significantly) bigger than the 24GB VRAM on the 4090 it won't even load for 15 minutes while the 4090 is pegged at 100% all the time. Eventually I just gave up.

Yeah the key here is partial offloading. If you're trying to offload more layers than your GPU has memory for, you're gonna have a bad time. I find it kind of infuriating that this is still kind of a black art. There's definitely room for better tooling here.

Regardless, with 24GB of VRAM, I try to limit my offloading to 20GB and let the rest go to RAM. Maybe it's the nature of the 8x7B model I run that makes it better at offloading than other large models. I'm not sure. I wouldn't try the 70B models for sure.


I run Mistral 7B and Llama 3 locally using jan.ai on a 32GB Dell laptop and get about 6 tokens per second with a context window of 8k. It's definitely usable if you're patient. I'm glad I also have a Hugging Face account though.


Seconded - IMHO Jan has the cleanest UI and most straightforward setup of all the LLM frontends available now.

https://jan.ai/

https://github.com/janhq/jan


Related question: what's the minimum GPU that's roughly equivalent to Microsoft's Copilot+ spec NPU?

I imagine that Copilot+ will become the target minimum spec for many local LLM products and that most local LLM vendors will use GPU instead of NPU if a good GPU is available.


I was looking for a new laptop and was wondering the same. This NPU thing might be one of Microsoft's bets that pays off and makes all pre-NPU hardware obsolete quickly. Though of course they have doubled down on various failed projects before (Windows on ARM, Windows phones, etc.).


The NPU in the Snapdragon SoC used by the Windows Surface laptops was quoted to be ~ 40 trillion ops/s (TOPS).

Nvidia 4070 Ti has roughly the same performance: https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3...

Of course, I'm massively oversimplifying, but it should be in the ballpark.


No, the Nvidia 4070 Ti has much higher performance. TOPS is for integer operations: the 4070 Ti has ~40 float32 TFLOPS and 641 TOPS (https://www.nvidia.com/fr-fr/geforce/graphics-cards/40-serie...), which I would say is peak TOPS for int4 operations, judging by comparison with the 4080 datasheet, and a bit more than half that for int8 operations (https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid..., page 34). I did not find the datasheet for the 4070 Ti.


Basically any GPU with at least 32GB RAM and 12 TFLOPs.


A "caniuse" equivalent for LLMs depending on machine specs would be extremely useful!


There are too many variables at play, unfortunately.

One can run local LLMs even on a Raspberry Pi, although it will be horribly slow.


Maybe it wouldn’t be an algorithm, maybe it would be a reporting site where you can review your experience if there’s no way to calculate it.


LocalLLaMA subreddit usually has some interesting benchmarks and reports.

Here is one example, testing performance of different GPUs and Macs with various flavours of Llama:

https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...


LM Studio on macOS provides an estimate of whether a model will run on the GPU, and it also lets you partially offload.

The underlying CLI tools do this, the app makes it easier to see and manage.


Running them at the edge is definitely possible on most hardware, but not ideal by any means. You'll have to set latency and throughput expectations fairly low if you don't have a GPU to utilize. This is why I'd disagree with your statement re: viability; it's really going to be most viable if you centralize the inference in a distributed cloud environment, off-device.

Thankfully, between Llama 3 8B [1] and Mistral 7B [2] you have two really capable generic instruction models you can use out of the box that could run locally for many folks. And the base models are straightforward to fine-tune if you need different capabilities more specific to your game use cases.

CPU/sysmem offloading is an option with gguf-based models but will hinder your latency and throughput significantly.

The quantized versions of the above models do fit easily on many consumer-grade GPUs (4-5 GB for the weights themselves quantized at 4 bpw), but it really depends on how much of your VRAM you want to dedicate to the model weights vs. actually running your game.

[1] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

[2] https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
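
For what it's worth, here is roughly what partial offloading of a GGUF quant looks like via llama-cpp-python; the model path is a placeholder, and the right layer count depends on how much VRAM you are willing to give up:

    from llama_cpp import Llama  # pip install llama-cpp-python (built with GPU support)

    llm = Llama(
        model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=20,  # layers pushed to VRAM; lower this to leave room for the game
        n_ctx=4096,       # context window; the KV cache grows with this
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Greet the player in one short sentence."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])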


Check out Ollama; it's built to run models locally. Llama 3 8B runs great locally for me, 70B is very slow. Plenty of options.


Quantized 6-8B models run well on consumer GPUs. My concern would be VRAM limits, given you'll likely be expecting the card to do compute _and_ graphics.

Without a GPU I think it will likely be a poor experience, but it won't be long until you'll have to go out of your way to buy consumer hardware that doesn't integrate some kind of TPU.


I have been using local LLMs as a daily driver and built https://recurse.chat for it. I've used Llama 3, WizardLM 2, and Mistral mostly, and sometimes I just try out models from Hugging Face (recently added support for adding them from Hugging Face: https://x.com/recursechat/status/1794132295781322909).


Quantized 4/5-bit 8B models with a medium-short context might be shippable. Still, it's going to require a nice GPU for all that RAM. Plus, you would have to support AMD; I would experiment with llama.cpp, as it runs on many architectures.

Hope your game doesn’t have a big texture budget.


Seems like there is high potential for NPC text generation from LLMs, especially a model that is trained to produce NPC dialog alongside discrete data that can be processed to correlate the content of the speech with the state of the game. This is going to be a tough challenge with a lot of room for research and creative approaches to producing immersive experiences. Unfortunately, only single-player and cooperative experiences will be practical for the foreseeable future, since it's trivial to totally break the immersion with some prompt poisoning.

Even more than LLMs, I'm curious about how transformers can be used to produce more convincing game AI in the areas where they are notoriously bad like 4x games.


Gemma 2B and Phi-3 3B, if you run them at Q4 quantization. I wouldn't bother with anything larger than 4B parameters; you're just not going to be able to reliably expect an end-user to run that size of model on a phone yet.


I assume the question is rather which LLM can cover most of the tasks while delivering decent quality. I would prefer an architecture using different LLMs for different tasks, rather like 'specialists' instead of simple 'agents'. I usually take the main task, divide it into smaller tasks, and see what I can use to solve each problem. Sometimes rule-based approaches are already enough for a sub-task, and an LLM would be not only overkill but also more difficult to implement and maintain.


So what is your answer to the question?


Depends on what you want to do. Just for testing, most of the 7B models are a good compromise between quality and performance (i.e., execution time).


Is there a way to reliably package these models with existing games and make them run locally? This would virtually make inference free right?

What I think, from my limited understanding of this field, is that if smaller models can run on consumer hardware reliably and speedily, that would be a game changer.


> This would virtually make inference free right?

Not really. Inference is never "free" unless you cache the result (which is just a static output) or unless you reduce complexity (which yields procedurally less-usable outputs).


Can you explain further? Why would it not be free if it's running locally?


I thought you meant "free" in terms of computational cost; running locally is technically free of charge, but it also requires a lot of processing power. Inferencing a properly sized LLM will potentially starve the rest of your software of GPU/CPU/memory access, so you have to plan accordingly.


I imagine you would have to solve some tricky scheduling issues to run an LLM on the GPU while it's also busy rendering the game. Frames need to be rendered at a more or less consistent rate no matter what, but the LLM would likely have erratic, spiky GPU utilisation depending on what the agents are doing, so you would have to throttle the LLM execution very carefully. Probably doable but I don't think there's any existing framework support for that.
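
A crude sketch of one shape that throttling could take: run generation on a background thread and only drain its output within a small per-frame time budget. (This only paces how much output the game consumes per frame; pacing the GPU work itself is the harder problem described above. The generate_tokens function is a stand-in for a real streaming backend.)

    import queue
    import threading
    import time

    token_queue: "queue.Queue[str]" = queue.Queue()

    def generate_tokens():
        """Placeholder for a streaming LLM call; pushes tokens as they arrive."""
        for tok in ["Well ", "met, ", "traveler."]:
            time.sleep(0.05)  # pretend inference latency
            token_queue.put(tok)

    threading.Thread(target=generate_tokens, daemon=True).start()

    FRAME_BUDGET_MS = 2.0              # LLM-output work allowed per frame
    for frame in range(60):            # stand-in for the game loop
        deadline = time.perf_counter() + FRAME_BUDGET_MS / 1000
        while time.perf_counter() < deadline:
            try:
                print(token_queue.get_nowait(), end="", flush=True)
            except queue.Empty:
                break
        time.sleep(1 / 60)             # rest of the frame: rendering, physics, etc.
    print()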


Or have two GPUs.


That also works, but approximately zero gamers have two discrete GPUs. You can't even rely on users having both an integrated GPU and a discrete GPU; there are a lot of systems which only have one or the other.


You can 100% do that with quantized models that are 8B and below. Take a look at Ollama to experiment. For incorporating into a game I would probably use llama.cpp or Candle.

The game itself is not going to have much VRAM to work with on older GPUs, though, unless you use something fairly tiny like Phi-3 Mini.

There are a lot more options if you can establish that the user has a 3090 or 4090.


There definitely are smaller LLMs that can run on consumer computers, but as for their performance... You would be lucky to get a full sentence. On the other hand, sending and receiving responses as text is probably the fastest and most realistic way to implement these things in games.


I've gone past the 8k context window with very good text generation on llama3. I don't know what you're smoking.


Check out this subreddit for a decent "source of truth": reddit.com/r/localllama


Nah, too many fanboys thinking their CPU testing is actually using LLMs.

They will say things like "It's a GPU inside a CPU". No, that is the marketers telling you about integrated GPUs.

There is a huge divide between CPU and GPU people. GPU people are doing applications. CPU people are... happy that they got anything to run.


A MacBook Pro with 128GB RAM runs Llama 3 70B entirely in memory and on the GPU. It's remarkable to have a performant LLM that smart and that fast on a (pro)sumer laptop.


Mistral is pretty good, and delivers solid results.


Interesting - is it viable, do you think, to package an LLM like that with an existing game and run it locally? I assume it will be intensive to run, but wouldn't that eliminate inference costs?


It would be intensive, but it's very doable. You could use KoboldCpp or something like that with an endpoint exposed only on the local machine and use that. You'll likely run into issues with GPU vendors and ensuring that you've got the right software versions running, but with some checking, it should be viable. Maybe include a fallback in case the system can't produce results in a timely manner.
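
A sketch of what the game-side call might look like against a locally exposed, OpenAI-compatible endpoint (KoboldCpp, llama.cpp's server, and Ollama all offer one, though the port and model name below are assumptions you'd need to adjust):

    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={
            "model": "local-model",  # many local servers ignore or loosely match this
            "messages": [{"role": "user", "content": "Stay in character as the blacksmith."}],
            "max_tokens": 128,
        },
        timeout=30,  # the fallback path mentioned above: bail out if this trips
    )
    print(resp.json()["choices"][0]["message"]["content"])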


Why would you get costs with a local model?


Yeah, that's what I'm saying - it would eliminate inference costs. What I was asking is how feasible it is to package these local LLMs with another standalone app, for example a game.


Oh sorry. Hm..I actually have no idea. It sounds like a neat idea though. :)



