It's awesome that stuff like this is open source, but even if you have a basement rig with 4 NVIDIA GeForce RTX 5090 graphics cards (a $15-20k machine), can it even run with any reasonable context window that isn't crawling along at 10 tokens/sec?
Frontier models are far exceeding even the most hardcore consumer hobbyist requirements. This is even further
~20 tokens/second is actually pretty good. I see he's using the Q5 quant of the model. I wonder how it scales with larger contexts. The same guy published a video today with the new 3.2 version: https://www.youtube.com/watch?v=b6RgBIROK5o
As pointed out by a sibling comment, MoE consists of a router and a number of experts (e.g. 8). These experts can be imagined as parts of the brain with specializations, although in reality they probably don't work exactly like that. They aren't separate models; they are components of a single large model.
Typically, input gets routed to a small number of experts, e.g. the top 2, leaving the others inactive. This reduces the amount of activation/processing required (a toy sketch of the routing is below).
Mixtral is an example of a model that's designed like this. Clever people have created converters to transform dense models into MoE models, and these days many popular models are also available in an MoE configuration.
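To make the routing concrete, here's a toy sketch in NumPy. The shapes, expert count, and softmax-over-selected gating are illustrative assumptions, not any particular model's implementation:

    import numpy as np

    def moe_layer(x, router_w, experts, top_k=2):
        """Route one token to its top_k experts and mix their outputs.

        x:        (d_model,) a token's hidden state
        router_w: (d_model, n_experts) router/gating weights
        experts:  list of callables, each mapping (d_model,) -> (d_model,)
        """
        logits = x @ router_w                      # score every expert
        top = np.argsort(logits)[-top_k:]          # pick the top_k experts
        gates = np.exp(logits[top])
        gates /= gates.sum()                       # softmax over the selected experts only
        # Only the top_k experts run; the rest stay inactive -- that's the compute saving.
        return sum(g * experts[i](x) for g, i in zip(gates, top))

    # Toy usage: 8 experts, 2 active per token.
    rng = np.random.default_rng(0)
    d, n_experts = 16, 8
    experts = [(lambda w: (lambda h: np.tanh(h @ w)))(rng.normal(size=(d, d))) for _ in range(n_experts)]
    router_w = rng.normal(size=(d, n_experts))
    print(moe_layer(rng.normal(size=d), router_w, experts).shape)  # (16,)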
That's not really a good summary of what MoEs are. It's better to think of them as sublayers that get routed through (like how the brain only lights up certain pathways) rather than as actual separate models.
The gain from MoE is that you can have a large model that's efficient; it lets you decouple parameter count from computation cost. I don't see how anthropomorphizing MoE <-> brain affords insight deeper than 'less activity means less energy used'. These are totally different systems; IMO this shallow comparison muddies the water and does a disservice to each field of study. There's been loads of research showing there's redundancy in MoE models, e.g. Cerebras has a paper [1] where they selectively prune half the experts with minimal loss across domains -- I'm not sure you could disable half the brain without a stupefying difference.
> I don't see how anthropomorphizing MoE <-> brain affords insight deeper than 'less activity means less energy used'.
I'm not saying it is a perfect analogy, but it is by far the most familiar one for people to describe what sparse activation means. I'm no big fan of over-reliance on biological metaphor in this field, but I think this is skewing a bit on the pedantic side.
re: your second comment about pruning, not to get in the weeds but I think there have been a few unique cases where people did lose some of their brain and the brain essentially routed around it.
As someone with a basement rig of 6x 3090s: not really. It's quite slow, since with that many params (685B) it's offloading basically all of it into system RAM. I limit myself to models with <144B params; then it's quite an enjoyable experience. GLM 4.5 Air has been great in particular.
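For what it's worth, here's roughly what that kind of partial offload looks like with llama-cpp-python; the GGUF path and the layer split are placeholders, and the right split depends entirely on the quant and how much VRAM you actually have free:

    from llama_cpp import Llama

    # Hypothetical GGUF path; point it at whatever quant fits your rig.
    llm = Llama(
        model_path="models/glm-4.5-air-q4_k_m.gguf",
        n_gpu_layers=40,   # layers kept in VRAM; the rest spill to system RAM (the slow part)
        n_ctx=8192,        # context window; the KV cache also competes for VRAM
    )

    out = llm("Summarize why MoE models are cheap to run per token.", max_tokens=128)
    print(out["choices"][0]["text"])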
Home rigs like that are no longer cost-effective. You're better off buying an RTX Pro 6000 outright. That holds for the sticker price, the supporting hardware, the electricity to run it, and the cooling for the room you use it in.
I was just watching this video about a Chinese piece of industrial equipment, designed for replacing BGA chips such as flash or RAM with a good deal of precision:
Eh, I don't see the risk, no pun intended. It's not collimated, and it's not going to be in focus anywhere but on-target. It's also probably in the long-wave range >>1000 nm that's not focused by the eye. At the end of the day it's no different from any other source of spot heating. I get more nervous around some of the LED flashlights you can buy these days.
It's 45w of lasing power. I have a scar on my hand that's 15 years old from running one of those at 10% power and getting a reflection from a bare metal sheet.
This will absolutely scar, if not char, your cornea faster than you can blink.
That's (again) less power than a flashlight puts out these days, so the beam had to be tightly focused in your case. That isn't how these things work.
There is nothing special about "lasing power." It amounts to a 45-watt light bulb, nothing more and nothing less.
A 45 watt light bulb spreads the energy in all directions - at 1 meter away that's about 3.6 watts per square meter, or roughly 0.0000036 watts per square millimeter. The laser is putting 45 watts into that same square millimeter at the same distance.
Of course the laser is tightly focused. That's pretty much one of the defining properties of laser devices. How else do you think the laser is heating the microprocessors in the video?
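Quick back-of-the-envelope numbers, assuming the beam is spread over something like a 5x5 mm chip footprint rather than focused to a point:

    import math

    P = 45.0  # watts, for both the bulb and the laser

    # A bare 45 W bulb radiates into a full sphere; irradiance at 1 m:
    bulb_w_per_m2 = P / (4 * math.pi * 1.0**2)   # ~3.6 W/m^2
    bulb_w_per_mm2 = bulb_w_per_m2 / 1e6         # ~3.6e-6 W/mm^2

    # The laser delivers the same 45 W into a ~5x5 mm spot:
    laser_w_per_mm2 = P / (5 * 5)                # 1.8 W/mm^2

    print(f"bulb:  {bulb_w_per_mm2:.1e} W/mm^2")
    print(f"laser: {laser_w_per_mm2:.1f} W/mm^2")
    print(f"ratio: ~{laser_w_per_mm2 / bulb_w_per_mm2:.0e}x")

That's roughly half a million times the power density of the bulb at the same distance, which is the whole point of the comparison.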
They will be using a beam spreader to conform to the size of the targeted IC, which is usually on the order of 5x5 mm and up. For smaller parts they will be reducing the power.
They shouldn't be focusing it to a point under any conditions. Whether it's as safe as it could be is a different question, of course. For instance, you'd like to think that the act of configuring it for a smaller beam footprint would reduce the power at the same time, as opposed to requiring a separate adjustment that might be overlooked by the operator. Would have been nice if the video had addressed that and other safety considerations, for sure.
A lot depends on the exact wavelength. 1400 nm and longer is much less worrisome than near-visible IR.
That's obviously not a good-faith or technically-accurate description of what's happening here, or else everybody in that video would be carrying a white cane, along with everybody who uses this type of equipment in the phone repair business.
About all we can agree on, I think, is that neither of us knows enough about the product to argue about it usefully.
Yeah, the pricing for the RTX Pro 6000 is surprisingly competitive with the gamer cards (at actual prices, not MSRP). A 3x5090 rig will require significant tuning/downclocking to run from a single North American 15A plug, and the cost of the higher-powered supporting equipment (cooling, PSU, UPS, etc.) will make up the price difference, not to mention the future expansion possibilities.
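Rough numbers on the outlet budget, assuming the 5090's roughly 575 W board power and the usual 80% continuous-load rule for a 15 A / 120 V circuit (both are assumptions; adjust for your actual hardware and wiring):

    # Power budget for a 3x5090 rig on a single North American outlet.
    circuit_w = 120 * 15             # 1800 W nominal for a 15 A / 120 V circuit
    continuous_w = circuit_w * 0.8   # ~1440 W usable under the common 80% continuous-load rule

    gpu_w = 575                      # approximate RTX 5090 board power (assumption)
    gpus = 3
    rest_of_system_w = 300           # CPU, RAM, fans, PSU losses (rough guess)

    total_w = gpus * gpu_w + rest_of_system_w
    print(f"draw ~{total_w} W vs ~{continuous_w:.0f} W available -> "
          f"shave ~{total_w - continuous_w:.0f} W via power limits")

Hence the significant downclocking, or a second circuit.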
I run it all the time, token generation is pretty good. It's just that large contexts are slow, but you can hook up a DGX Spark via the Exo Labs stack and outsource token prefill to it. The upcoming M5 Ultra should be faster than the Spark at token prefill as well.
> I run it all the time, token generation is pretty good.
I feel like because you didn't actually talk about prompt processing speed or token/s, you aren't really giving the whole picture here. What is the prompt processing tok/s and the generation tok/s actually like?
I addressed both points - I mentioned you can offload token prefill (the slow part, 9 t/s) to a DGX Spark, and token generation is at 6 t/s, which is acceptable.
6 tok/sec might be acceptable for a dense model that doesn't do thinking, but for something like DeepSeek 3.2 that does reason, 6 tok/sec isn't acceptable for anything but async/batched stuff, sadly. Even a response of just a few hundred tokens takes the better part of a minute to write, and for anything except the smallest prompts the reasoning trace will easily push the output past 1000 tokens (close to three minutes at that rate).
Maybe my 6000 Pro spoiled me, but for actual usage, 6 or even 9 tok/sec is too slow for a reasoning/thinking model. To be honest, kind of expected on CPU though. I guess it's cool that it can run on Apple hardware, but it isn't exactly a pleasant experience at least today.
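For concreteness, the wait times those rates imply (plain arithmetic; the token counts are just representative reply and reasoning-trace lengths):

    # Wall-clock wait for a response generated at a given tokens/sec rate.
    def wait_seconds(tokens, tok_per_sec):
        return tokens / tok_per_sec

    for rate in (6, 9):
        for n in (100, 1000, 4000):   # short reply, reply + reasoning, long reasoning trace
            print(f"{n:>5} tokens @ {rate} tok/s -> {wait_seconds(n, rate):6.0f} s")

At 6 tok/s a 1000-token reasoning trace is close to three minutes before you see much of a final answer.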
Dunno, DeepSeek on MacStudio doesn't feel much slower than when using it directly on deepseek.com; 6t/s is still around 24 characters per second which is faster than many people could read. I also have a 6000 Pro, but you won't fit any large model there; to run DeepSeek R1/3.1/3.2 671B at Q4 you'd need 5-6 of them, depending on the communication overhead. MacStudio is the simplest solution to run it locally.
> 6t/s is still around 24 characters per second which is faster than many people could read.
But again, not if you're using thinking/reasoning, which if you want to use this specific model properly, you are. Then you have a huge delay before the actual response comes through.
> MacStudio is the simplest solution to run it locally.
Obviously, that's Apple's core value proposition after all :) One does not acquire a state-of-the-art GPU and then expect simple stuff, especially when it's a fairly uncommon and new one. You cannot really be afraid of diving into CUDA code and similar fun rabbit holes. Simply two very different audiences for the two alternatives, and the Apple way is the simpler one, no doubt about it.
People with basement rigs generally aren't the target audience for these gigantic models. You'd get much better results out of a smaller MoE model like Qwen3's 30B-A3B or 235B-A22B variants if you're running a homelab setup.
Reproducibility of results is also important in some cases.
There is consumer-ish hardware that can run large models like DeepSeek 3.x, slowly. If you're using LLMs for a specific purpose that is well served by a particular model, you don't want to risk AI companies deprecating it in a couple of months and pushing you to a newer model (that may or may not work better in your situation).
And even if the AI service providers nominally serve the same model, there are cases where high reproducibility requires running the same inference software, or even the same hardware.
If you're just using OpenAI or Anthropic, you don't get that level of control.
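One concrete way to get that control locally is to log everything that affects an output next to the output itself; a rough sketch, with illustrative fields rather than any standard schema:

    import hashlib, json, platform, time

    def weights_sha256(path, chunk=1 << 20):
        """Hash the exact weights file so a later run can verify it's the same model."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while blob := f.read(chunk):
                h.update(blob)
        return h.hexdigest()

    def log_run(prompt, completion, model_path, sampling, logfile="runs.jsonl"):
        record = {
            "timestamp": time.time(),
            "model_sha256": weights_sha256(model_path),
            "sampling": sampling,          # temperature, top_p, seed, ...
            "host": platform.platform(),   # results can drift across hardware/software stacks
            "prompt": prompt,
            "completion": completion,
        }
        with open(logfile, "a") as f:
            f.write(json.dumps(record) + "\n")

Pin the sampling parameters and seed too; with hosted APIs you usually can't pin the software or hardware underneath at all.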
Who is the target audience of these free releases? I don't mind free and open information sharing, but I have wondered what's in it for the people who spent unholy amounts of energy on scraping, developing, and training.
There are plenty of 3rd party and big cloud options to run these models by the hour or token. Big models really only work in that context, and that’s ok. Or you can get yourself an H100 rack and go nuts, but there is little downside to using a cloud provider on a per-token basis.
> There are plenty of 3rd party and big cloud options to run these models by the hour or token.
Which ones? I wanted to try a large base model for automated literature (fine-tuned models are a lot worse at it) but I couldn't find a provider which makes this easy.
Lambda.ai used to offer per-token pricing but they have moved up market. You can still rent a B200 instance for sub $5/hr which is reasonable for experimenting with models.
https://app.hyperbolic.ai/models
Hyperbolic offers both GPU hosting and token pricing for popular OSS models. The token-based options are easy because they're usually a drop-in replacement for OpenAI API endpoints.
You have to rent a GPU instance if you want to run the latest or custom stuff, but if you just want to play around for a few hours it's not unreasonable.
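To show what "drop-in replacement" means in practice: the usual pattern is to keep the OpenAI client and swap the base URL, key, and model name. The URL and model id below are placeholders; check the provider's docs for the real ones:

    from openai import OpenAI

    # Same client library as for OpenAI itself; only base_url, key, and model change.
    client = OpenAI(
        base_url="https://api.example-provider.com/v1",  # placeholder endpoint
        api_key="YOUR_PROVIDER_KEY",
    )

    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3.2",  # placeholder model id; use whatever the provider lists
        messages=[{"role": "user", "content": "Hello from an OpenAI-compatible endpoint."}],
    )
    print(resp.choices[0].message.content)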
I don't see any large base models there. A base model is a pretrained foundation model without fine tuning. It just predicts text.
> Lambda.ai used to offer per-token pricing but they have moved up market. You can still rent a B200 instance for sub $5/hr which is reasonable for experimenting with models.
A B200 is probably not enough: it has just 192 GB RAM while DeepSeek-V3.2-Exp-Base, the base model for DeepSeek-V3.2, has 685 billion BF16 parameters. Though I assume they have larger options. The problem is that all the configuration work is then left to the user, which I'm not experienced in.
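Rough memory math for the weights alone (ignoring KV cache and activations, and treating the Q4 figure as a loose approximation):

    # How many 192 GB GPUs just to hold the weights of a 685B-parameter model?
    params = 685e9
    bytes_per_param = {"BF16": 2, "FP8": 1, "Q4 (approx)": 0.5}

    for fmt, b in bytes_per_param.items():
        total_gb = params * b / 1e9
        gpus = total_gb / 192
        print(f"{fmt:12s}: ~{total_gb:,.0f} GB -> ~{gpus:.1f} x 192 GB GPUs (weights only)")

So even at 4-bit you're looking at a couple of B200s for the weights before the KV cache enters the picture.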
Thanks. They do indeed have a single base model: Llama 3.1 405B BASE. This one is a bit older (July 2024) and probably not as good as the base model for the new DeepSeek release. But that might be the best one can do, as there don't seem to be any inference providers which have deployed a DeepSeek or even Kimi base model.
That's the final, fine-tuned model. The base model (pretraining only, no instruction SFT, RLHF, RLVR etc) is this one: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp-Base
It's apparently not offered at any inference provider, nor are older DeepSeek base models.
I run a bunch of smaller models on a 12 GB VRAM 3060 and it's quite good. For larger open models I'll use OpenRouter. I'm looking into on-demand instances with cloud/VPS providers, but haven't explored the space too much.
I feel like private cloud instances that run on demand are still in the spirit of consumer hobbyism. It's not as good as having it all local, but the bootstrapping cost plus the electricity to run it seems prohibitive.
I'm really interested to see if there's a space for consumer TPUs that satisfy use cases like this.
FWIW, it looks like OpenRouter's two providers for this model (one of which is DeepSeek itself) are only running it at around 28 tps at the moment.