Hacker News | FloatArtifact's comments

How does this keyword spotting compare with a grammar- or intent-based approach for speech recognition commands alongside dictation?

How does keyword spotting handle complex phrases as commands?


Self-Hosting like it's 2025...uhhgg...

Don't get me wrong, I love some of the software suggested. However, it's yet another post that doesn't take backups as seriously as the rest of the self-hosting stack.

Backups are stuck in 2013. We need plug-and-play backups for containers! No more rolling your own with ZFS datasets and backing up data at the filesystem level (using sanoid/syncoid or any other alternative to manage snapshots).


Why not zfs snapshots? Besides using Hyper-V machine snapshots, that's been the easiest way, by far, for me. No need to worry about the 20 different proprietary tools that go with each piece of software.

Each VM or container gets a data mount on a zvol. Containers go on the OS mount, and each OS has its own volume (so most VMs end up with 2 volumes attached).


Well, one argument not to use ZFS is simply the resources it takes. It eats up a lot of RAM. Also, I'm under the impression that you can't live-snapshot a database without risking corruption.


Best decision of last year for my homelab: run everything in Proxmox VMs/containers and back up to a separate Proxmox Backup Server instance.

Fully automated, incremental, verified backups, and restoring is one click of a button.


Yes, I'm considering that if I can't find a solution that is plug-and-play for containers, independent of the OS and file system. Although I don't mind something abstracting on top of ZFS, the mental overhead of ZFS's snapshot paradigm can lead to its own complexities. A traditional backup-and-restore front end would be great.

I find it strange that Docker, which already knows your volumes, app data, and config, can't automatically back up and restore databases and configs. Jeez, they could have built it right into Docker.


rclone is great for this.

One could set up a Docker Compose service that uses rclone to gzip and back up your docker volumes to something durable to get this done. An even more advanced version of this would automate testing the backups by restoring them into a clean environment and running some tests with BATS or whatever testing framework you want.
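
A rough sketch of that idea (the volume name, rclone remote, and paths below are made-up placeholders, and note it takes no consistent snapshot of anything, which matters for databases):

    # Archive a named Docker volume and push it to an rclone remote.
    import datetime
    import pathlib
    import subprocess

    VOLUME = "myapp_data"                  # assumed Docker volume name
    REMOTE = "s3backup:my-bucket/backups"  # assumed rclone remote, set up via `rclone config`
    workdir = pathlib.Path("/tmp/volume-backups")
    workdir.mkdir(parents=True, exist_ok=True)

    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = f"{VOLUME}-{stamp}.tar.gz"

    # Use a throwaway container to tar the volume contents without touching host paths.
    subprocess.run([
        "docker", "run", "--rm",
        "-v", f"{VOLUME}:/data:ro",
        "-v", f"{workdir}:/backup",
        "alpine", "tar", "czf", f"/backup/{archive}", "-C", "/data", ".",
    ], check=True)

    # Ship the archive to the configured rclone remote.
    subprocess.run(["rclone", "copy", str(workdir / archive), REMOTE], check=True)

Wrap that in a small image, run it on a schedule, and add the restore-and-test step on top, and you're most of the way there.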


Rclone won't take a consistent snapshot, so you either need to shut down the thing or use some other tool to export the data first.


zfs/btrfs snapshot and then rclone that snapshot?


I think that'd break deleting incremental snapshots unless you tried uploading a gigantic blob of the entire filesystem, wouldn't it?

Meaning you'd need to upload full snapshots on a fixed interval


Well, I see your post now.


I've made a very recent (1hr ago) Show HN post and it's not visible. Should be that one: https://news.ycombinator.com/item?id=43519517


You won't spend a cent, but it'll cost you more than a dollar. Spend time with people in their day-to-day life, intentionally. That's truly a gift that can impact anyone, not just people with low income. It might just change you, too. You won't be able to do it with many people, but do it with a few.


Don't make it an AI-only note-taking product.

Make it a hybrid product. Let the end user also mark elements such as video, slides, original audio, and transcription, especially by aligning the speech-to-text output with the recorded audio, video, and slides. Allow the user to scrub through the timeline and mark what they believe is important.

This allows the user to be at the center of the product, yet scale their use case and manual input as needed with generative AI. In addition, this provides additional context for the AI to produce tailored output based on the user's input.


The main concept is to build a tool that records the meeting and is neither locked into one LLM nor isolated from current workflows.


My comment is not about LLM provider lock-in. It's about merging a traditional note-taking app with an LLM augmented approach. It's up to you if you want your product to stand out.


Got it.


Will the battery be user-replaceable?


Seems like Apple's M2 is a sweet spot for AI performance at 800 GB/s of memory bandwidth, which can be had refurbished for under $1,500 with 64 gigs of RAM.


Where for $1500?


Not on Apple Refurbs. That would cost you about $2200.

And the M2 Max has a memory bandwidth of 400GB/s.


Whoops, I got confused between the Max and the Ultra for memory bandwidth. But I have, on occasion, months ago, seen refurbs for that price.


I’m guessing a reference to M2 Ultra? Not sure about that price though…


M2 Ultra refurb was over $4,000, last I checked.


What we need is a platform for benchmarking hardware for AI models: with X hardware you get Y tokens per second at Z latency for a given context prefill. So, a standard testing methodology per model, with user-supplied benchmarks. Yes, I recognize there's going to be some variability based on different versions of the software stack and encoders.

The end-user experience should start with selecting the models of interest to run, and then output hardware builds with price tracking for components.


Agreed. I opened the comments to write that nearly all of those articles spend very few words on the hardware, its cost, and the performance compared to using a web service. The result is that I'm left with the feeling that I have to spend about $1,000 plus setup time (HW, SW) and power to get something that could be slower and less accurate than the current free plan of ChatGPT.


They didn't increase the memory bandwidth. You get the same memory bandwidth that's already available on the M2 Ultra Studio. Yes, yes, of course you can get 512 gigabytes of uRAM for 10 grand.

The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: you can have enough uRAM and the new chip's increased processing speed for AI, but the memory bandwidth stays the same.

So there must be a min-max performance ratio between memory bandwidth and the size of the memory pool in relation to the processing power.


Since no one specifically answered your question yet: yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B MoE that "only" has 37B activations per pass. You'd probably expect something in the ballpark of 20-30 tok/s for text generation (depending on how much of the MBW can actually be utilized).

From my napkin math, the M3 Ultra's TFLOPs are still relatively low (around 43 FP16 TFLOPs?), but that should be more than enough to handle bs=1 token generation (should be way under 10 FLOPs/byte for inference). Now, as far as its prefill/prompt processing speed goes... well, that's another matter.
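
Rough back-of-the-envelope version of that in code; the 819 GB/s peak figure and the bandwidth-utilization fractions are assumptions, so treat the outputs as ballpark only:

    # bs=1 decode estimate for DeepSeek-R1 Q4_K_M on an M3 Ultra, from the numbers above.
    model_bytes   = 404e9   # Q4_K_M GGUF size
    total_params  = 671e9
    active_params = 37e9    # MoE activations per forward pass
    bytes_per_token = model_bytes * active_params / total_params  # ~22 GB read per token

    mem_bw = 819e9          # assumed peak memory bandwidth, bytes/s
    for util in (0.5, 0.7, 0.9):  # assumed fraction of peak bandwidth actually achieved
        print(f"{util:.0%} of peak BW -> ~{util * mem_bw / bytes_per_token:.0f} tok/s")

    # Compute ceiling: roughly 2 FLOPs per active weight at bs=1.
    fp16_flops = 43e12      # rough M3 Ultra FP16 estimate from above
    print(f"compute-bound ceiling -> ~{fp16_flops / (2 * active_params):.0f} tok/s")

Either way it comes out memory-bandwidth-bound, which is why 20-30 tok/s seems plausible and why prefill is the separate question.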


I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.


Pretty sure this has absolutely nothing to do with DeepSeek, or even local LLMs at large, which have been a thing for a while and an obvious use case since the original Llama leak and llama.cpp coming around.

Fact is, Mac Pros in the Intel days supported 1.5TB RAM in some configurations[1], and that was the expectation of their high-end customer base 6 years ago. They needed to address the gap for those customers, so they would have shipped such a product regardless. Local LLM is the cherry on top. DeepSeek in particular almost certainly had nothing to do with it. They will still need to double their supported RAM in their SoC to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.

[1]: https://support.apple.com/en-us/101639


The thing that people are excited about here is unified memory that the GPU can address. Mac Pro had discrete GPUs with their own memory.


I understand why they are excited about it—just pointing out it is a happy coincidence. They would have and should have made such a product to address the need of RAM users alone, not VRAM in particular, before they have a credible case to cut macOS releases on Intel.


Intel integrated graphics technically also used unified memory, with standard DRAM.


Those also have terrible performance and worse bandwidth. I am not sure they are really relevant, to be honest.


Did the Xeons in the Mac Pro even have integrated graphics?


So did the Amiga, almost 40 years ago...


You mean this? ;) http://de.wikipedia.org/wiki/Datei:Amiga_1000_PAL.jpg

RIP Jay Miner who watched his unified memory daughters Agnus, Denise and Paula be slowly murdered by Jack Tramiel's vengeance against Irving Gould. [Why couldn't the shareholders have stormed their boardroom 180 days before the company ran out of cash, installed interim management who, in turn, would have brought back the megalomaniac Founder that would, until his dying breath, keep spreading their cash to the super brilliant geniuses that made all the magic chips happen and then turn the resulting empire over to ops people to make their workplace so uncomfortable they all retire early and live happily ever after on tropical islands and snowy mountain tops?]


Yep! Though one could argue the Amiga wasn't true unified memory due to the chip RAM limitations. Depending on the Agnus revision, you'd be limited to 512, 1 meg, or 2 meg max of RAM addressable by the custom chips ("chip RAM".)


fun fact: M-series that are configured to use more than 75% of shared memory for GPU can make the system go boom...something to do with assumptions macOS makes that can be fixed by someone with a "private key" to access kernel mode (maybe not a hardware limit).


I messed around with that setting on one of my Macs. I wanted to load a large LLM model and it needed more than 75% of shared memory.


That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.

That said, I doubt it was explicitly for R1, but rather based on where the industry was a few years ago, when GPT-3's 175B was SOTA but the industry was still looking larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.


The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.


Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.


As mentioned elsewhere in this thread, unified memory has existed long before Apple released the M1 CPU, and in fact many Intel processors that Apple used before supported it (though the Mac pros that supported 1.5TB of RAM did not, as they did not have integrated graphics).

The presence of unified memory does not necessarily make a system better. It’s a trade off: the M-series systems have high memory bandwidth thanks to the large number of memory channels, and the integrated GPUs are faster than most others. But you can’t swap in a faster GPU, and when using large LLMs even a Mac Studio is quite slow compared to using discrete GPUs.


Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?


> they specifically built this M3 Ultra for DeepSeek R1 4-bit

Which came out in what, mid January? Yeah, there's no chance Apple (or anyone) has built a new chip in the last 45 days.


Don't they build these Macs just-in-time? The bandwidth doesn't change with the RAM, so surely it couldn't have been that hard to just... use higher capacity RAM modules?


"No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).

It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)


No one is saying they built a new chip.

But the decision to come to market with a 512GB sku may have changed from not making sense to “people will buy this”.


Dies are designed in years.

This was just a coincidence.


What part of “no one is saying they designed a new chip” is lost here?


Sorry, none of us are fanboys trying to shape "Apple is great" narratives.


I don’t think you understand hardware timelines if you think this product had literally anything to do with anything DeepSeek.


Chip? Yes. Product? Not necessarily...

It's not completely out of the question that the 512gb version of M3 Ultra was built for their internal Apple silicon servers powering Private Compute Cloud, but not intended for consumer release, until a compelling use case suddenly arrived.

I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.


DeepSeek R1 came out Jan 20.

Literally impossible.


The scenario is that the 512gb M3 Ultra was validated for the Mac Studio, and in volume production for their servers, but a business decision was made to not offer more than a 256gb SKU for Mac Studio.

I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.


From “internal only” to “delivered to customers” in 6 weeks is literally impossible.


This change is mostly just using higher density ICs on the assembly line and printing different box art with a SKU change. It does not take much time, especially if they had planned it as a possible product just in case management changed its mind.


That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers. The unit economics simply don't work. Plus, Apple is using OpenAI to provide its larger models anyway, so the need never even existed.


Apple is positively building custom servers, and quantities are closer to the 100k range than 1000 [0]

But I agree they are not using m3 ultra for that. It wouldn’t make any sense.

0. https://www.theregister.com/AMP/2024/06/11/apple_built_ai_cl...


That could be why they're also selling it as the Mac Studio M3 Ultra


My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory but it’s almost impossible to do that that close to a launch. Especially when memory is fused not just a module you can swap.


Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.

See photo without heatspreader here: https://wccftech.com/apple-m2-ultra-soc-delidded-package-siz...


I think by "fused" I meant it's stuck onto the SoC module, not part of the SoC as I may have worded it. While you could maybe still add memory later in the manufacturing process, it's probably not easy, especially if you need more chips and a larger module, which might cause more design problems. The memory is close because the controller is in the SoC. So the memory controller probably would also change with higher memory sizes, which would mean this cannot be a last-minute change.


Sheesh, the...comments on that link.


$10k to run a 4 bit quantized model. Ouch.


That's today. What about tomorrow?


The M4 MacBook Pro with 128GB can run a 32B-parameter model with 8-bit quantization just fine.


[flagged]


I'm downvoting you because your use of language is so annoying, not because I work for Apple.


So, Microsoft?


what?


Sorry, an apostrophe got lost in "PO's"


[flagged]


Are you comparing the same models? How did you calculate the TOPS for the M3 Ultra?


An M3 Ultra is two M3 Max chips connected via fabric, so physics.

Did not mean to shit on anyone's parade, but it's a trap for novices, with the caveat that you reportedly can't buy a GB10 until "May 2025" and the expectation that it will be severely supply constrained. For some (overfunded startups running on AI monkey code? Youtube Influencers?), that timing is an unacceptable risk, so I do expect these things to fly off the shelves and then hit eBay this Summer.


> they specifically built this M3 Ultra for DeepSeek R1 4-bit.

This makes sense. They started gluing M* chips together to make Mac Studios three years ago, which must have been in anticipation of DeepSeek R1 4-bit


Any ideas on power consumption? I wonder how much power that would use. It looks like it would be more efficient than everything else that currently exists.


Looks like up to 480W listed here

https://www.apple.com/mac-studio/specs/


Thanks!!


The M2 Ultra Mac Pro could reach a maximum of 330W according to Apple:

https://support.apple.com/en-us/102839

I assume it is similar.


I would be curious what context window size would be expected when generating a ballpark 20 tokens per second using DeepSeek-R1 Q4 on this hardware?


Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.


> Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

"Memory bandwidth usage should be limited to the 37B active parameters."

Can someone do a deep dive on the above quote? I understand that having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.

Context window?

How much of the model can actively be processed, despite being fully loaded into memory, given the memory bandwidth?


With a mixture of experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well so that reduces the size of the tensors you write to memory.


What people who haven't actually worked with this stuff in practice don't realize is that the above statement only holds for batch size 1, sequence size 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which, in turn, is a problem on a weak machine like this Mac or an "EPYC build" someone else mentioned). Even for inference, batch size greater than 1 (more than one inference at a time) or sequence size greater than 1 (speculative decoding) could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.


No one should be buying this for batch inference obviously.

I remember right after OpenAI announced GPT-3, I had a conversation with someone where we tried to predict how long it would be before GPT-3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B-parameter GPT-3 at 16-bit precision, and I think that's pretty cool.


Sure, nuance.

This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.


Birkin bags appreciate in value. This is more like a Lexus. It's a well-crafted luxury good that will depreciate relatively slowly.


Have you seen prices on Lexus LFAs now? They haven't depreciated ha ha. And for those that don't know: https://www.youtube.com/watch?v=fWdXLF9unOE


Computers don't usually depreciate slowly


Relatively, as in a Mac or a Lexus will depreciate slower than other computers/cars.


It used to be very true, but with Apple's popularity the second-hand market is quite saturated (especially since there are many people buying them impulsively).

Unless you have a specific configuration, depreciation isn't much better than an equivalently priced PC. In fact, my experience is that the long tail value of the PC is better if you picked something that was high-end.


I don't know. Can't imagine it's easy to sell a used Windows laptop directly to begin with, and those big resellers probably offer very little. Even refurbished Dell Latitudes seem to go for cheap on eBay. I've had an easy time selling old Macs, or high-end desktop market might be simple too.


Macs are easy to sell if they are BtO with custom configuration, in that case you may not lose too much. But the depreciation hit hard on the base models, the market is flooded because people who buy those machines tend to change them often or are people who were trying, confused, etc.

Low-end PCs (mostly laptops) don't keep value very well, but then again you probably got them cheap on a deal or something like that, so your depreciation might actually not be as bad as an equivalent Mac. The units you are talking about are enterprise stuff that gets swapped every 3 years or so, mostly for accounting reasons, but it's not the type of thing I would advise anyone to buy brand new (the strategy would actually be to pick up a second-hand unit).

High-end PCs, laptops or big desktops keep their value pretty well because they are niche by definition and very rare. Depending on your original choice you may actually have a better depreciation than an equivalently priced Mac because there are fewer of them on sale at any given time.

It all depends on your personal situation, strength of local market, ease of reselling through platforms that provides trust and many other variables.

What I meant is that it's not the early 2000's anymore, where you could offload a relatively new Mac (2-3 years) very easily; while not being hit by big depreciation because they were not very common.

In my medium sized town, there is a local second-hand electronic shop where they have all kinds of Mac at all kind of price points. High-end Razers sell for more money and are a rare sight. It's pretty much the same for iPhones, you see 3 years old models hit very hard with depreciation while some niche Android phones take a smaller hit.

Apple went for a weird strategy where they simultaneously pursued luxury pricing by overcharging for the things that make the experience much better (RAM/storage), but also tried to make it affordable to the masses (by largely compromising on things they shouldn't have).

Apple's behavior created a shady second-hand market with lots of moving parts (things being shipped in and out of China), and this is all their doing.


Well these listed prices are asks, not bids, so they only give an upper bound on the value. I've tried to sell obscure things before where there are few or 0 other sellers, and no matter what you list it for, you might never find the buyer who wants that specific thing.

And the electronic shop is probably going to fetch a higher price than an individual seller would, due to trust factor as you mentioned. So have you managed to sell old Windows PCs for decent prices in some local market?


Pretty much. In addition, PyTorch on the Mac is abysmally bad. As is Jax. Idk why Apple doesn't implement proper support, seems important. There's MLX which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.


This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.


I run Mistral Large locally on two A6000's, in 4 bits. It's nice, but $10K in GPUs buys a lot of subscriptions. Plus some of the strongest LLMs are now free (Grok, DeepSeek) for web use.


I hear you. I make these decisions for a public company.

When engineers tell me they want to run models in the cloud, I tell them they are free to play with it, but that isn't a project going into the roadmap. OpenAI/Anthropic and others are much cheaper in terms of tokens per dollar thanks to economies of scale.

There is still value in running your models for privacy issues however, and that’s the reason why I pay attention to efforts in reducing the cost to run models locally or in your cloud provider.


No one who is using this for home use cares about anything except batch size 1 sequence size 1.


What if you're doing bulk inference? The efficiency and throughput of bs=1 s=1 is truly abysmal.


People want to talk to their computer, not service requests for a thousand users.


For decode, MoE is nice for either bs=1 (decoding for a single user), or bs=<very large> (do EP to efficiently serve a large amount of users).

Anything in between suffers.


Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.


Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.


> The question is whether an LLM will run with usable performance at that scale.

This is the big question to have answered. Many people claim Apple can now reliably be used as an ML workstation, but from the numbers I've seen in benchmarks, the models may fit in memory, but the tok/sec performance is so slow it doesn't feel worth it compared to running on NVIDIA hardware.

Although it'd be expensive as hell to get 512GB of VRAM from NVIDIA today, maybe moves like this from Apple could push down the prices at least a little bit.


It is much slower than nVidia, but for a lot of personal-use LLM scenarios, it's very workable. And it doesn't need to be anywhere near as fast considering it's really the only viable (affordable) option for private, local inference, besides building a server like this, which is no faster: https://news.ycombinator.com/item?id=42897205


It's fast enough for me to cancel monthly AI services on a mac mini m4 max.


Could you maybe share a lightweight benchmark: the exact model (+ quantization if you're using it), runtime, settings, and how many tokens/second you're getting? Or just a log of the entire run with the stats, if you're using something like llama.cpp, LMDesktop or ollama?

Also, would be neat if you could say what AI services you were subscribed to, there is a huge difference between paid Claude subscription and the OpenAI Pro subscription for example, both in terms of cost and the quality of responses.


Hm, the AI services over 5 years cost half of a minimal M4 Max configuration, which can barely run a severely lobotomized LLaMA 70B. And they provide significantly better models.


Sure, with something like Kagi you even get many models to choose from for a relatively low price, but not everybody likes to send over their codebase and documents to OpenAI.


It's probably much worse than that, with the falling prices of compute.


Smaller, dumber models are faster than bigger, slower ones.

What model do you find fast enough and smart enough?


Not OP but I am finding the Qwen 2.5 32b distilled with DeepSeek R1 model to be a good speed/smartness ratio on the M4 Pro Mac Mini.


I'm running the same exact models.


How much RAM?


It takes between 22GB-37GB depending on the context size etc. from what I've observed.


Thanks!


I presume you're using the Pro, not the Max.

Anyways, what ram config, and what model are you using?


How much RAM are you running on?


Do we know if it's slower because the hardware is not as well suited for the task, or is it mostly a software issue -- the code hasn't been optimized to run on Apple Silicon?


AFAICT the neural engine has accelerators for CNNs and integer math, but not the exact tensor operations in popular LLM transformer architectures that are well-supported in GPUs.


The neural engine is perfectly capable of accelerating matmults. It's just that autoregressive decoding in single batch LLM inference is memory bandwidth constrained, so there are no performance benefits to using the ANE for LLM inference (although, there's a huge power efficiency benefit). And the only way to use the neural engine is via CoreML. Using the GPU with MLX or MPS is often easier.


I have to assume they’re doing something like that in the lab for 4 years from now.


Memory bandwidth is the issue


> The question is whether an LLM will run with usable performance at that scale.

For the self-attention mechanism, memory bandwidth requirements scale ~quadratically with the sequence length.


Someone has got to be working on a better method than that. Hundreds of billions are at stake.


Guess what? I'm on a mission to completely max out all 512GB of mem...maybe by running DeepSeek on it. Pure greed!


You could always just open a few Chrome tabs…


It may not be Firefox in terms of hundreds or thousands of tabs but Chrome has gotten a lot more memory efficient since around 2022.


[flagged]


I downvote all Reddit-style memes, jokes, reference humor, catchphrases, and so on. It’s low-effort content that doesn’t fit the vibe of HN and actively makes the site worse for its intended purpose.


>Edit: WTF, someone downvoted "Enjoy the upvotes?" Pathetic.

You should read the HN posting guidelines if you want to understand why. Although I guess in this case it's mostly someone's fat-thumbed downvote.


Give Cities Skylines 2 a try.


It doesn't support Macs yet


Any idea what the SRAM to uRAM ratio is on these new GPUs? If they have meaningfully higher SRAM than the Hopper GPUs, it could lead to meaningful speedups in large model training.

If they didn't increase the memory bandwidth, then 512GB will enable longer context lengths and that's about it, right? No speedups.

For any speedups you may need some new variant of FlashAttention3, or something along similar lines, purpose-built for Apple GPUs.


I don't know what you mean by s and u, but there is only one kind of memory in the machine, that's what unified memory means.


I assume they mean SRAM versus unified (D)RAM?


Yeah they did? The M4 has a max memory bandwidth of 546GBps, the M3 Ultra bumps that up to a max of 819GBps.

(and the 512GB version is $4,000 more rather than $10,000 - that's still worth mocking, but it's nowhere near as much)


Not that dramatic of an increase actually - the M2 Max already had 400GB/s and M2 Ultra 800GB/s memory bandwidth, so the M3 Ultra's 819GB/s is just a modest bump. Though the M4's additional 146GB/s is indeed a more noticeable improvement.


Also should note that 800/819GB/s of memory bandwidth is actually VERY usable for LLMs. Consider that a 4090 is just a hair above 1000GB/s


Does it work like that though at this larger scale? 512GB of VRAM would be across multiple NVIDIA cards, so the bandwidth and access is parallelized.

But here it looks more of a bottleneck from my (admittedly naive) understanding.


For inference the bandwidth is generally not parallelized, because the data has to go through the model layer by layer. The most common model-splitting method assigns each GPU a subset of the LLM layers, and it doesn't take much bandwidth to send the intermediate activations over PCIe to the next GPU.
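
A minimal sketch of that layer-split idea (PyTorch used purely for illustration, not any particular framework's pipeline API; both "devices" are CPU here so it runs anywhere):

    import torch
    import torch.nn as nn

    hidden = 512
    layers = [nn.Linear(hidden, hidden) for _ in range(8)]

    # Hypothetical two-device split; with real hardware these would be "cuda:0"/"cuda:1".
    dev0, dev1 = torch.device("cpu"), torch.device("cpu")
    stage0 = nn.Sequential(*layers[:4]).to(dev0)  # first half of the layers lives on device 0
    stage1 = nn.Sequential(*layers[4:]).to(dev1)  # second half lives on device 1

    x = torch.randn(1, hidden, device=dev0)
    h = stage0(x)    # runs where its weights live
    h = h.to(dev1)   # only this small activation tensor crosses the device boundary
    y = stage1(h)    # no weights ever move after the initial placement
    print(y.shape)   # torch.Size([1, 512])

The weights stay put after placement; only the hidden-state tensor hops between devices, which is why the interconnect isn't the bottleneck for this kind of split.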


My understanding is that the GPU must still load its assigned layer from VRAM into registers and L2 cache for every token, because those aren’t large enough to hold a significant portion. So naively, for a 24GB layer, you‘d need to move up to 24GB for every token.


But the memory bandwidth is only part of the equation; the 4090 is at least several times faster at compute compared to the fastest Apple CPU/GPU.


They didn't increase the memory bandwidth. You get the same memory bandwidth that's already available on the M2 Ultra Studio. Yes, yes, of course you can get 512 gigabytes of uRAM for 10 grand.

The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: you can have enough uRAM and the increased processing speed of the new M3 chip for AI, but the memory bandwidth stays the same.


> if a llm will run with usable performance at that scale?

Yes.

The reason: MoE. They are able to run at a good speed because they don't load all of the weights into the GPU cores.

For instance, DeepSeek R1 uses 404 GB in Q4 quantization[0], containing 256 experts of which 8 are routed to[1] (very roughly 13 GB per forward pass). With a memory bandwidth of 800 GB/s[2], the Mac Studio will be able to output 800/13 = 62 tokens per second.

[0]: https://ollama.com/library/deepseek-r1

[1]: https://arxiv.org/pdf/2412.19437

[2]: https://www.apple.com/newsroom/2025/03/apple-unveils-new-mac...


This doesn’t sound correct.

You don’t know which expert you’ll need for each layer, so you either keep them all loaded in memory or stream them from disk


In RAM, yes. But if you compute an activation, you need to load the weights from RAM to the GPU core.


Got you, yeah, I misread your comment the first time around.


Note that 404 < 512


You seem like you know what you are talking about... mind if I ask what your thoughts on quantization are? It's unclear to me if quantization affects quality... I feel like I've heard both yes and no arguments.


There is no question that quantization degrades quality. The GGUF R1 uses Q4_K_M, which, on Llama-3-8B, increases the perplexity by 0.18[0]. Many plots show increasing degradation as you quantize more[1].

That said, it is possible to train a model in a quantization-aware way[2][3], which improves the quality a bit, although not higher than the raw model.

Also, a loss in quality may not be perceptible in a specific use-case. Famously LMArena.ai tested Llama 3.1 405B with bf16 and fp8, and the latter was only 2 Elo points below, well within measurement error.

[0]: https://github.com/ggml-org/llama.cpp/blob/master/examples/q...

[1]: https://github.com/ggml-org/llama.cpp/discussions/5063#discu...

[2]: https://pytorch.org/blog/quantization-aware-training/

[3]: https://mistral.ai/news/ministraux


I don't know what I'm talking about but when I first asked your question this https://gist.github.com/Artefact2/b5f810600771265fc1e3944228... helped start me on a path to understanding. I think.

But if you don't already know, the question you're asking is not at all something I could distill down into a sentence or two that would make sense to a layperson. Even then, I know I couldn't distill it at all, sorry.

Edit: I found this link I referenced above on quantized models by bartowski on huggingface https://huggingface.co/bartowski/Qwen2.5-Coder-14B-GGUF#whic...


I did my own experiments, and it looks like (surprisingly) Q4KM models often outperform Q6 and Q8 quantised models.

For bigger models (in the 8B - 70B range) Q4KM is very good; there is no noticeable degradation compared to full FP16 models.


I returned an M2 Max Studio with 96GB RAM; unquantized Llama 3.1 70B was dog slow, not an interactive pace. I'm interested in offline LLMs but couldn't see how it was going to produce a $3,000 ROI.


It would be really cool if there was an "are we there yet" website for reasonable offline AI.

It could track different hardware configurations and reasonably standardized benchmark performance per model. I know there are benchmarks buried in the llama.cpp GitHub repository.


There seems to be a LOT of interest in such a site in the comments here. There seem to be multiple IP issues with sharing your code repo with an online service so I feel a lot of folks are waiting for the hardware to make this possible.

We need a SWE-bench for open-source LLMs, and for each model to have 3DMark-like benchmarks on various hardware setups.

I did find this which seems very helpful but is missing the latest models and hardware options. https://kamilstanuch.github.io/LLM-token-generation-simulato...


Looks like he bases the benchmarks off of https://github.com/ggml-org/llama.cpp/discussions/4167

I get why he calls it a simulator, as it can simulate token output. That's an important aspect for evaluating a use case if you need to get a sense of how much token output matters, beyond a simple tokens-per-second figure.


The M3 Ultra is the only configuration that supports 512GB and it has memory bandwidth of 819GB/s.


True, I also noticed that bigger models run slower at the same memory bandwidth (makes sense).


Yeah, I don’t think RAM is the bottleneck. Which is unfortunate. It feels like a missed opportunity for them. I think Apple partly became popular because it enabled creatives and developers.


> I don’t think RAM is the bottleneck

Not the size/amount, but the memory bandwidth usually is.

