
For running LLMs, I think most people just dive into https://www.reddit.com/r/LocalLLaMA/ and start reading.

Not sure what the equivalent is for image generation; it's either https://www.reddit.com/r/StableDiffusion/ or one of the related subreddits it links to.

Sadly, I've yet to find anyone doing "daily ML-hobbyist news" content creation, summarizing the types of articles that appear on these subreddits. (Which is a surprise to me, as it's really easy to find e.g. "daily homelab news" content creators. Please, someone, start a "daily ML-hobbyist news" blog/channel! Given that the target audience would essentially be "people who will get an itch to buy a better GPU soon", the CPM you'd earn on ad impressions would be really high...)

---

That being said, just to get you started, here are a few things to know at present about "what you can run locally":

1. Most models (of the architectures people care about today) will probably fit on a GPU whose VRAM is something like 1.5x the size of the model's weights. (At the 8-bit quantization many people run, a model's weights take roughly as many gigabytes as the model has billions of parameters.) So e.g. a "7B" (7-billion-parameter) model will fit on a GPU with 12GB of VRAM; see the back-of-the-envelope sketch after this list. (You can potentially squeeze even tighter if you have a machine with integrated graphics + a dedicated GPU, and you're using the integrated graphics to drive your display, leaving the dedicated GPU's VRAM free to hold only the model.)

2. Models come in all sorts of sizes. Many open-source ML models are huge (70B, 120B, 144B: sizes you'd need datacenter-class GPUs to run), but versions of these same models get released that have been heavily cut down (pruned and/or quantized) to force them into smaller VRAM budgets. There are also natively small 5B, 3B, 1B, even 0.5B models (although the last two are usually special-purpose models).

3. Surprisingly, depending on your use-case, smaller models (or small quants of larger models) can "mostly" work perfectly well! They just have more edge-cases where something will send them off the rails, spiralling into nonsense, so they're less reliable than their larger cousins. You might have to give them more prompting, and try regenerating their output from the same prompt several times, to get good results (the second sketch below does exactly that).

4. Apple Silicon Macs have a GPU and a Neural Engine that read from/write to the same unified memory the CPU uses. While this makes them slower for inference than "real" GPUs with dedicated VRAM, it also means that if you happen to own a Mac with 16GB of RAM, you already own something that can run 7B models (the second sketch below shows what that looks like). Apple Silicon Macs are, oddly enough, the "cheapest" things you can buy in terms of model-capacity-per-dollar. (Unlike a "real" GPU, they won't be especially quick and won't have any capacity for concurrent model inference, so you'd never use one as a server backing an Inference-as-a-Service business. But for home use? No real downsides.)
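
To make the VRAM arithmetic in point 1 concrete, here's a back-of-the-envelope sketch in Python. The bytes-per-weight figures and the 1.5x overhead factor are rules of thumb I'm assuming here, not exact numbers; real usage also depends on context length and on the runtime you use:

    # Rough "will it fit?" estimate. Assumed rules of thumb, not exact figures:
    # weights take ~(params x bytes-per-weight), plus ~1.5x headroom for the
    # KV cache, activations, and runtime overhead.
    BYTES_PER_WEIGHT = {
        "fp16": 2.0,  # unquantized half precision
        "q8": 1.0,    # 8-bit quantization
        "q4": 0.5,    # 4-bit quantization (what most hobbyists actually run)
    }

    def estimated_vram_gb(params_billions, quant="q8", overhead=1.5):
        return params_billions * BYTES_PER_WEIGHT[quant] * overhead

    for quant in ("fp16", "q8", "q4"):
        print(f"7B @ {quant}: ~{estimated_vram_gb(7, quant):.1f} GB")
    # prints roughly 21.0 (needs a 24GB card), 10.5 (fits in 12GB), 5.2 (fits in 8GB)

Read it the other way around and you get point 1's rule of thumb: an 8-bit 7B model sits comfortably inside 12GB of VRAM.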

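And to make point 4 (plus point 3's "regenerate a few times" advice) concrete, here's a minimal sketch of running a quantized 7B model on a 16GB Apple Silicon Mac. I'm assuming llama-cpp-python with its Metal backend; the GGUF filename is just a placeholder for whichever 7B-class model you download:

    # pip install llama-cpp-python   (uses the Metal backend on Apple Silicon)
    from llama_cpp import Llama

    llm = Llama(
        model_path="./some-7b-instruct.Q4_K_M.gguf",  # placeholder: any quantized 7B GGUF file
        n_gpu_layers=-1,  # offload every layer to the GPU / unified memory
        n_ctx=4096,       # context window; larger contexts use more memory
    )

    # Per point 3: with small models it's worth sampling a few completions
    # and keeping the best one.
    prompt = "Q: Explain unified memory in one paragraph.\nA:"
    for _ in range(3):
        out = llm(prompt, max_tokens=200, temperature=0.8)
        print(out["choices"][0]["text"].strip())
        print("---")

The same script should run unchanged on a PC with an NVIDIA card, as long as llama-cpp-python was built with the CUDA backend instead of Metal.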

