Hacker News | FloatArtifact's comments

How does this keyword spotting compare with a grammar- or intent-based approach for speech recognition commands alongside dictation?

How does keyword spotting handle complex phrases as commands?


Self-Hosting like it's 2025...uhhgg...

Don't get me wrong, I love some of the software suggested. However, it's yet another post that doesn't take backups as seriously as the rest of the self-hosting stack.

Backups are stuck in 2013. We need plug-and-play backups for containers! No more rolling your own with ZFS datasets and backing up data at the filesystem level (using sanoid/syncoid or any other alternative to manage snapshots).


Why not zfs snapshots? Besides using Hyper-V machine snapshots, that's been the easiest way, by far, for me. No need to worry about the 20 different proprietary tools that go with each piece of software.

Each VM or container gets a data mount on a zvol. Containers go on the OS mount, and each OS has its own volume (so most VMs end up with 2 volumes attached).


Well, one argument not to use ZFS is simply the resources it takes. It eats up a lot of RAM. Also, I'm under the impression that you can't live-snapshot a database without risking corruption.


Best decision of last year for my homelab: run everything in Proxmox VMs/containers and back up to a separate Proxmox Backup Server instance.

Fully automated, incremental, verified backups, and restoring is one click of a button.


Yes, I'm considering that if I can't find a solution that is plug-and-play for containers, independent of the OS and file system. Although I don't mind something abstracting on top of ZFS, the mental overhead of ZFS's snapshot paradigm can lead to its own complexities. A traditional backup-and-restore front end would be great.

I find it strange that Docker, which already knows your volumes, app data, and config, can't automatically back up and restore databases and configs. Jeez, they could have built it right into Docker.


rclone is great for this.

One could set up a Docker Compose service that uses rclone to gzip and back up your docker volumes to something durable to get this done. An even more advanced version of this would automate testing the backups by restoring them into a clean environment and running some tests with BATS or whatever testing framework you want.
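
A rough sketch of that idea (the volume name, rclone remote, and paths below are made-up placeholders, and note it takes no consistent snapshot of anything, which matters for databases):

    # Archive a named Docker volume and push it to an rclone remote.
    import datetime
    import pathlib
    import subprocess

    VOLUME = "myapp_data"                  # assumed Docker volume name
    REMOTE = "s3backup:my-bucket/backups"  # assumed rclone remote, set up via `rclone config`
    workdir = pathlib.Path("/tmp/volume-backups")
    workdir.mkdir(parents=True, exist_ok=True)

    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = f"{VOLUME}-{stamp}.tar.gz"

    # Use a throwaway container to tar the volume contents without touching host paths.
    subprocess.run([
        "docker", "run", "--rm",
        "-v", f"{VOLUME}:/data:ro",
        "-v", f"{workdir}:/backup",
        "alpine", "tar", "czf", f"/backup/{archive}", "-C", "/data", ".",
    ], check=True)

    # Ship the archive to the configured rclone remote.
    subprocess.run(["rclone", "copy", str(workdir / archive), REMOTE], check=True)

Wrap that in a small image, run it on a schedule, and add the restore-and-test step on top, and you're most of the way there.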


Rclone won't take a consistent snapshot, so you either need to shut down the thing or use some other tool to export the data first.


zfs/btrfs snapshot and then rclone that snapshot?


I think that'd break deleting incremental snapshots unless you tried uploading a gigantic blob of the entire filesystem, wouldn't it?

Meaning you'd need to upload full snapshots on a fixed interval


Well, I see your post now.


I've made a very recent (1hr ago) Show HN post and it's not visible. Should be that one: https://news.ycombinator.com/item?id=43519517


You won't spend a cent, but it'll cost you more than a dollar. Spend time with people in their day-to-day life, intentionally. That's truly a gift that can impact anyone, not just people with low income. It might just change you, too. You won't be able to do it with many people, but do it with a few.


Don't make it an AI-only note-taking product.

Make it a hybrid product. Let the end user also mark elements such as video, slides, original audio, and transcription, especially by aligning the speech-to-text output with the recorded audio, video, and slides. Allow the user to scrub through the timeline and mark what they believe is important.

This allows the user to be at the center of the product, yet scale their use case and manual input as needed with generative AI. In addition, this provides additional context for the AI to produce tailored output based on the user's input.


The main concept is to build a tool that records the meeting and is neither locked into one LLM nor isolated from current workflows.


My comment is not about LLM provider lock-in. It's about merging a traditional note-taking app with an LLM augmented approach. It's up to you if you want your product to stand out.


Got it.


Will the battery be user-replaceable?


Seems like Apple's M2 is a sweet spot for AI performance at 800 GB/s of memory bandwidth, which can be had refurbished for under $1,500 with 64 gigs of RAM.


Where for $1500?


Not on Apple Refurbs. That would cost you about $2200.

And the M2 Max has a memory bandwidth of 400GB/s.


Whoops, I got confused between the Max and the Ultra for memory bandwidth. But I have, on occasion, months ago, seen refurbs for that price.


I’m guessing a reference to M2 Ultra? Not sure about that price though…


M2 Ultra refurb was over $4,000, last I checked.


What we need is a platform for benchmarking hardware for AI models: with X hardware you get Y tokens per second at Z latency for a given context prefill. So, a standard testing methodology per model, with user-supplied benchmarks. Yes, I recognize there's going to be some variability based on different versions of the software stack and encoders.

The end-user experience should start with selecting the models of interest to run, and then output hardware builds with price tracking for components.


Agreed. I opened the comments to write that nearly all of those articles spend very few words on the hardware, its cost, and the performance compared to using a web service. The result is that I'm left with the feeling that I have to spend about $1,000 plus setup time (HW, SW) and power to get something that could be slower and less accurate than the current free plan of ChatGPT.


They didn't increase the memory bandwidth. You get the same memory bandwidth that's already available on the M2 Ultra Studio. Yes, yes, of course you can get 512 gigabytes of uRAM for 10 grand.

The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: you can have enough uRAM and the new chip's increased processing speed for AI, but the memory bandwidth stays the same.

So there must be a min-max performance ratio between memory bandwidth and the size of the memory pool in relation to the processing power.


Since no one specifically answered your question yet: yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B MoE that "only" has 37B activations per pass. You'd probably expect something in the ballpark of 20-30 tok/s for text generation (depending on how much of the MBW can actually be utilized).

From my napkin math, the M3 Ultra's TFLOPs are still relatively low (around 43 FP16 TFLOPs?), but that should be more than enough to handle bs=1 token generation (should be way under 10 FLOPs/byte for inference). Now, as far as its prefill/prompt processing speed goes... well, that's another matter.
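
Rough back-of-the-envelope version of that in code; the 819 GB/s peak figure and the bandwidth-utilization fractions are assumptions, so treat the outputs as ballpark only:

    # bs=1 decode estimate for DeepSeek-R1 Q4_K_M on an M3 Ultra, from the numbers above.
    model_bytes   = 404e9   # Q4_K_M GGUF size
    total_params  = 671e9
    active_params = 37e9    # MoE activations per forward pass
    bytes_per_token = model_bytes * active_params / total_params  # ~22 GB read per token

    mem_bw = 819e9          # assumed peak memory bandwidth, bytes/s
    for util in (0.5, 0.7, 0.9):  # assumed fraction of peak bandwidth actually achieved
        print(f"{util:.0%} of peak BW -> ~{util * mem_bw / bytes_per_token:.0f} tok/s")

    # Compute ceiling: roughly 2 FLOPs per active weight at bs=1.
    fp16_flops = 43e12      # rough M3 Ultra FP16 estimate from above
    print(f"compute-bound ceiling -> ~{fp16_flops / (2 * active_params):.0f} tok/s")

Either way it comes out memory-bandwidth-bound, which is why 20-30 tok/s seems plausible and why prefill is the separate question.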


I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.


Pretty sure this has absolutely nothing to do with DeepSeek, or even local LLMs at large, which have been a thing for a while and an obvious use case since the original Llama leak and llama.cpp coming around.

Fact is, Mac Pros in the Intel days supported 1.5TB RAM in some configurations[1], and that was the expectation of their high-end customer base 6 years ago. They needed to address the gap for those customers, so they would have shipped such a product regardless. Local LLM is the cherry on top. DeepSeek in particular almost certainly had nothing to do with it. They will still need to double their supported RAM in their SoC to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.

[1]: https://support.apple.com/en-us/101639


The thing that people are excited about here is unified memory that the GPU can address. Mac Pro had discrete GPUs with their own memory.


I understand why they are excited about it—just pointing out it is a happy coincidence. They would have and should have made such a product to address the need of RAM users alone, not VRAM in particular, before they have a credible case to cut macOS releases on Intel.


Intel integrated graphics technically also used unified memory, with standard DRAM.


Those also have terrible performance and worse bandwidth. I am not sure they are really relevant, to be honest.


Did the Xeons in the Mac Pro even have integrated graphics?


So did the Amiga, almost 40 years ago...


You mean this? ;) http://de.wikipedia.org/wiki/Datei:Amiga_1000_PAL.jpg

RIP Jay Miner who watched his unified memory daughters Agnus, Denise and Paula be slowly murdered by Jack Tramiel's vengeance against Irving Gould. [Why couldn't the shareholders have stormed their boardroom 180 days before the company ran out of cash, installed interim management who, in turn, would have brought back the megalomaniac Founder that would, until his dying breath, keep spreading their cash to the super brilliant geniuses that made all the magic chips happen and then turn the resulting empire over to ops people to make their workplace so uncomfortable they all retire early and live happily ever after on tropical islands and snowy mountain tops?]


Yep! Though one could argue the Amiga wasn't true unified memory due to the chip RAM limitations. Depending on the Agnus revision, you'd be limited to 512, 1 meg, or 2 meg max of RAM addressable by the custom chips ("chip RAM".)


fun fact: M-series that are configured to use more than 75% of shared memory for GPU can make the system go boom...something to do with assumptions macOS makes that can be fixed by someone with a "private key" to access kernel mode (maybe not a hardware limit).


I messed around with that setting on one of my Macs. I wanted to load a large LLM model and it needed more than 75% of shared memory.


That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.

That said, I doubt it was explicitly for R1, but rather based on where the industry was a few years ago, when GPT-3's 175B was SOTA but the industry was still looking larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.


The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.


Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.


As mentioned elsewhere in this thread, unified memory has existed long before Apple released the M1 CPU, and in fact many Intel processors that Apple used before supported it (though the Mac pros that supported 1.5TB of RAM did not, as they did not have integrated graphics).

The presence of unified memory does not necessarily make a system better. It’s a trade off: the M-series systems have high memory bandwidth thanks to the large number of memory channels, and the integrated GPUs are faster than most others. But you can’t swap in a faster GPU, and when using large LLMs even a Mac Studio is quite slow compared to using discrete GPUs.


Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?


> they specifically built this M3 Ultra for DeepSeek R1 4-bit

Which came out in what, mid January? Yeah, there's no chance Apple (or anyone) has built a new chip in the last 45 days.


Don't they build these Macs just-in-time? The bandwidth doesn't change with the RAM, so surely it couldn't have been that hard to just... use higher capacity RAM modules?


"No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).

It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)


No one is saying they built a new chip.

But the decision to come to market with a 512GB sku may have changed from not making sense to “people will buy this”.


Dies are designed in years.

This was just a coincidence.


What part of “no one is saying they designed a new chip” is lost here?


Sorry, none of us are fanboys trying to shape "Apple is great" narratives.


I don’t think you understand hardware timelines if you think this product had literally anything to do with anything DeepSeek.


Chip? Yes. Product? Not necessarily...

It's not completely out of the question that the 512gb version of M3 Ultra was built for their internal Apple silicon servers powering Private Compute Cloud, but not intended for consumer release, until a compelling use case suddenly arrived.

I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.


DeepSeek R1 came out Jan 20.

Literally impossible.


The scenario is that the 512gb M3 Ultra was validated for the Mac Studio, and in volume production for their servers, but a business decision was made to not offer more than a 256gb SKU for Mac Studio.

I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.


From “internal only” to “delivered to customers” in 6 weeks is literally impossible.


This change is mostly just using higher density ICs on the assembly line and printing different box art with a SKU change. It does not take much time, especially if they had planned it as a possible product just in case management changed its mind.


That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers. The unit economics simply don't work. Plus, Apple is using OpenAI to provide its larger models anyway, so the need never even existed.


Apple is positively building custom servers, and quantities are closer to the 100k range than 1000 [0]

But I agree they are not using m3 ultra for that. It wouldn’t make any sense.

0. https://www.theregister.com/AMP/2024/06/11/apple_built_ai_cl...


That could be why they're also selling it as the Mac Studio M3 Ultra


My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory but it’s almost impossible to do that that close to a launch. Especially when memory is fused not just a module you can swap.


Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.

See photo without heatspreader here: https://wccftech.com/apple-m2-ultra-soc-delidded-package-siz...


I think by "fused" I meant it's stuck onto the SoC module, not part of the SoC as I may have worded it. While you could maybe still add memory later in the manufacturing process, it's probably not easy, especially if you need more chips and a larger module, which might cause more design problems. The memory is close because the controller is in the SoC. So the memory controller probably would also change with higher memory sizes, which would mean this cannot be a last-minute change.


Sheesh, the...comments on that link.


$10k to run a 4 bit quantized model. Ouch.


That's today. What about tomorrow?


The M4 MacBook Pro with 128GB can run a 32B-parameter model with 8-bit quantization just fine.


[flagged]


I'm downvoting you because your use of language is so annoying, not because I work for Apple.


So, Microsoft?


what?


Sorry, an apostrophe got lost in "PO's"


[flagged]


Are you comparing the same models? How did you calculate the TOPS for the M3 Ultra?


An M3 Ultra is two M3 Max chips connected via fabric, so physics.

Did not mean to shit on anyone's parade, but it's a trap for novices, with the caveat that you reportedly can't buy a GB10 until "May 2025" and the expectation that it will be severely supply constrained. For some (overfunded startups running on AI monkey code? Youtube Influencers?), that timing is an unacceptable risk, so I do expect these things to fly off the shelves and then hit eBay this Summer.


> they specifically built this M3 Ultra for DeepSeek R1 4-bit.

This makes sense. They started gluing M* chips together to make Mac Studios three years ago, which must have been in anticipation of DeepSeek R1 4-bit


Any ideas on power consumption? I wonder how much power that would use. It looks like it would be more efficient than everything else that currently exists.


Looks like up to 480W listed here

https://www.apple.com/mac-studio/specs/


Thanks!!


The M2 Ultra Mac Pro could reach a maximum of 330W according to Apple:

https://support.apple.com/en-us/102839

I assume it is similar.


I would be curious what context window size would be expected when generating a ballpark 20 tokens per second using DeepSeek-R1 Q4 on this hardware?


Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.


> Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

"Memory bandwidth usage should be limited to the 37B active parameters."

Can someone do a deep dive on the above quote? I understand that having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.

Context window?

How much of the model can actively be processed, despite being fully loaded into memory, given the memory bandwidth?


With a mixture of experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well so that reduces the size of the tensors you write to memory.


What people who haven't actually worked with this stuff in practice don't realize is that the above statement only holds for batch size 1, sequence size 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which, in turn, is a problem on a weak machine like this Mac or an "EPYC build" someone else mentioned). Even for inference, batch size greater than 1 (more than one inference at a time) or sequence size greater than 1 (speculative decoding) could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.


No one should be buying this for batch inference obviously.

I remember right after OpenAI announced GPT-3, I had a conversation with someone where we tried to predict how long it would be before GPT-3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B-parameter GPT-3 at 16-bit precision, and I think that's pretty cool.


Sure, nuance.

This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.


Birkin bags appreciate in value. This is more like a Lexus. It's a well-crafted luxury good that will depreciate relatively slowly.


Have you seen prices on Lexus LFAs now? They haven't depreciated ha ha. And for those that don't know: https://www.youtube.com/watch?v=fWdXLF9unOE


Computers don't usually depreciate slowly


Relatively, as in a Mac or a Lexus will depreciate slower than other computers/cars.


It used to be very true, but with Apple's popularity the second-hand market is quite saturated (especially since there are many people buying them impulsively).

Unless you have a specific configuration, depreciation isn't much better than an equivalently priced PC. In fact, my experience is that the long tail value of the PC is better if you picked something that was high-end.


I don't know. Can't imagine it's easy to sell a used Windows laptop directly to begin with, and those big resellers probably offer very little. Even refurbished Dell Latitudes seem to go for cheap on eBay. I've had an easy time selling old Macs, or high-end desktop market might be simple too.


Macs are easy to sell if they are BtO with custom configuration, in that case you may not lose too much. But the depreciation hit hard on the base models, the market is flooded because people who buy those machines tend to change them often or are people who were trying, confused, etc.

Low-end PCs (mostly laptops) don't keep value very well, but then again you probably got them cheap on a deal or something like that, so your depreciation might actually not be as bad as an equivalent Mac. The units you are talking about are enterprise stuff that gets swapped every 3 years or so, mostly for accounting reasons, but it's not the type of thing I would advise anyone to buy brand new (the strategy would actually be to pick up a second-hand unit).

High-end PCs, laptops or big desktops keep their value pretty well because they are niche by definition and very rare. Depending on your original choice you may actually have a better depreciation than an equivalently priced Mac because there are fewer of them on sale at any given time.

It all depends on your personal situation, strength of local market, ease of reselling through platforms that provides trust and many other variables.

What I meant is that it's not the early 2000's anymore, where you could offload a relatively new Mac (2-3 years) very easily; while not being hit by big depreciation because they were not very common.

In my medium sized town, there is a local second-hand electronic shop where they have all kinds of Mac at all kind of price points. High-end Razers sell for more money and are a rare sight. It's pretty much the same for iPhones, you see 3 years old models hit very hard with depreciation while some niche Android phones take a smaller hit.

Apple went for a weird strategy where they simultaneously pursued luxury pricing by overcharging for the things that make the experience much better (RAM/storage), but also tried to make it affordable to the masses (by largely compromising on things they shouldn't have).

Apple's behavior created a shady second-hand market with lots of moving parts (things being shipped in and out of China), and this is all their doing.


Well these listed prices are asks, not bids, so they only give an upper bound on the value. I've tried to sell obscure things before where there are few or 0 other sellers, and no matter what you list it for, you might never find the buyer who wants that specific thing.

And the electronic shop is probably going to fetch a higher price than an individual seller would, due to trust factor as you mentioned. So have you managed to sell old Windows PCs for decent prices in some local market?


Pretty much. In addition, PyTorch on the Mac is abysmally bad. As is Jax. Idk why Apple doesn't implement proper support, seems important. There's MLX which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.


This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.


I run Mistral Large locally on two A6000's, in 4 bits. It's nice, but $10K in GPUs buys a lot of subscriptions. Plus some of the strongest LLMs are now free (Grok, DeepSeek) for web use.


I hear you. I make these decisions for a public company.

When engineers tell me they want to run models in the cloud, I tell them they are free to play with it, but that isn't a project going into the roadmap. OpenAI/Anthropic and others are much cheaper in terms of tokens per dollar thanks to economies of scale.

There is still value in running your models for privacy issues however, and that’s the reason why I pay attention to efforts in reducing the cost to run models locally or in your cloud provider.


No one who is using this for home use cares about anything except batch size 1 sequence size 1.


What if you're doing bulk inference? The efficiency and throughput of bs=1 s=1 is truly abysmal.


People want to talk to their computer, not service requests for a thousand users.


For decode, MoE is nice for either bs=1 (decoding for a single user), or bs=<very large> (do EP to efficiently serve a large amount of users).

Anything in between suffers.


Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.


Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.


> The question is whether an LLM will run with usable performance at that scale.

This is the big question to have answered. Many people claim Apple can now reliably be used as an ML workstation, but from the numbers I've seen in benchmarks, the models may fit in memory, but the tok/sec performance is so slow it doesn't feel worth it compared to running on NVIDIA hardware.

Although it'd be expensive as hell to get 512GB of VRAM from NVIDIA today, maybe moves like this from Apple could push down the prices at least a little bit.


It is much slower than nVidia, but for a lot of personal-use LLM scenarios, it's very workable. And it doesn't need to be anywhere near as fast considering it's really the only viable (affordable) option for private, local inference, besides building a server like this, which is no faster: https://news.ycombinator.com/item?id=42897205


It's fast enough for me to cancel monthly AI services on a mac mini m4 max.


Could you maybe share a lightweight benchmark: the exact model (+ quantization if you're using it), runtime, settings, and how many tokens/second you're getting? Or just a log of the entire run with the stats, if you're using something like llama.cpp, LMDesktop or ollama?

Also, would be neat if you could say what AI services you were subscribed to, there is a huge difference between paid Claude subscription and the OpenAI Pro subscription for example, both in terms of cost and the quality of responses.


Hm, the AI services over 5 years cost half of a minimal M4 Max configuration, which can barely run a severely lobotomized LLaMA 70B. And they provide significantly better models.


Sure, with something like Kagi you even get many models to choose from for a relatively low price, but not everybody likes to send over their codebase and documents to OpenAI.


It's probably much worse than that, with the falling prices of compute.


Smaller, dumber models are faster than bigger, slower ones.

What model do you find fast enough and smart enough?


Not OP but I am finding the Qwen 2.5 32b distilled with DeepSeek R1 model to be a good speed/smartness ratio on the M4 Pro Mac Mini.


I'm running the same exact models.


How much RAM?


It takes between 22GB-37GB depending on the context size etc. from what I've observed.


Thanks!


I presume you're using the Pro, not the Max.

Anyways, what ram config, and what model are you using?


How much RAM are you running on?


Do we know if it's slower because the hardware is not as well suited for the task, or is it mostly a software issue -- the code hasn't been optimized to run on Apple Silicon?


AFAICT the neural engine has accelerators for CNNs and integer math, but not the exact tensor operations in popular LLM transformer architectures that are well-supported in GPUs.


The neural engine is perfectly capable of accelerating matmults. It's just that autoregressive decoding in single batch LLM inference is memory bandwidth constrained, so there are no performance benefits to using the ANE for LLM inference (although, there's a huge power efficiency benefit). And the only way to use the neural engine is via CoreML. Using the GPU with MLX or MPS is often easier.


I have to assume they’re doing something like that in the lab for 4 years from now.


Memory bandwidth is the issue


> The question is whether an LLM will run with usable performance at that scale.

For the self-attention mechanism, memory bandwidth requirements scale ~quadratically with the sequence length.


Someone has got to be working on a better method than that. Hundreds of billions are at stake.


Guess what? I'm on a mission to completely max out all 512GB of mem...maybe by running DeepSeek on it. Pure greed!


You could always just open a few Chrome tabs…


It may not be Firefox in terms of hundreds or thousands of tabs but Chrome has gotten a lot more memory efficient since around 2022.


[flagged]


I downvote all Reddit-style memes, jokes, reference humor, catchphrases, and so on. It’s low-effort content that doesn’t fit the vibe of HN and actively makes the site worse for its intended purpose.


>Edit: WTF, someone downvoted "Enjoy the upvotes?" Pathetic.

You should read the HN posting guidelines if you want to understand why. Although I guess in this case it's mostly someone's fat-thumbed downvote.


Give Cities Skylines 2 a try.


It doesn't support Macs yet


Any idea what the SRAM to uRAM ratio is on these new GPUs? If they have meaningfully higher SRAM than the Hopper GPUs, it could lead to meaningful speedups in large model training.

If they didn't increase the memory bandwidth, then 512GB will enable longer context lengths and that's about it, right? No speedups.

For any speedups you may need some new variant of FlashAttention3, or something along similar lines, purpose-built for Apple GPUs.


I don't know what you mean by s and u, but there is only one kind of memory in the machine, that's what unified memory means.


I assume they mean SRAM versus unified (D)RAM?


Yeah they did? The M4 has a max memory bandwidth of 546GBps, the M3 Ultra bumps that up to a max of 819GBps.

(and the 512GB version is $4,000 more rather than $10,000 - that's still worth mocking, but it's nowhere near as much)


Not that dramatic of an increase actually - the M2 Max already had 400GB/s and M2 Ultra 800GB/s memory bandwidth, so the M3 Ultra's 819GB/s is just a modest bump. Though the M4's additional 146GB/s is indeed a more noticeable improvement.


Also should note that 800/819GB/s of memory bandwidth is actually VERY usable for LLMs. Consider that a 4090 is just a hair above 1000GB/s


Does it work like that though at this larger scale? 512GB of VRAM would be across multiple NVIDIA cards, so the bandwidth and access is parallelized.

But here it looks more of a bottleneck from my (admittedly naive) understanding.


For inference the bandwidth is generally not parallelized, because the data has to go through the model layer by layer. The most common model-splitting method assigns each GPU a subset of the LLM layers, and it doesn't take much bandwidth to send the intermediate activations over PCIe to the next GPU.
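
A minimal sketch of that layer-split idea (PyTorch used purely for illustration, not any particular framework's pipeline API; both "devices" are CPU here so it runs anywhere):

    import torch
    import torch.nn as nn

    hidden = 512
    layers = [nn.Linear(hidden, hidden) for _ in range(8)]

    # Hypothetical two-device split; with real hardware these would be "cuda:0"/"cuda:1".
    dev0, dev1 = torch.device("cpu"), torch.device("cpu")
    stage0 = nn.Sequential(*layers[:4]).to(dev0)  # first half of the layers lives on device 0
    stage1 = nn.Sequential(*layers[4:]).to(dev1)  # second half lives on device 1

    x = torch.randn(1, hidden, device=dev0)
    h = stage0(x)    # runs where its weights live
    h = h.to(dev1)   # only this small activation tensor crosses the device boundary
    y = stage1(h)    # no weights ever move after the initial placement
    print(y.shape)   # torch.Size([1, 512])

The weights stay put after placement; only the hidden-state tensor hops between devices, which is why the interconnect isn't the bottleneck for this kind of split.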


My understanding is that the GPU must still load its assigned layer from VRAM into registers and L2 cache for every token, because those aren’t large enough to hold a significant portion. So naively, for a 24GB layer, you‘d need to move up to 24GB for every token.


But the memory bandwidth is only part of the equation; the 4090 is at least several times faster at compute compared to the fastest Apple CPU/GPU.


They didn't increase the memory bandwidth. You get the same memory bandwidth that's already available on the M2 Ultra Studio. Yes, yes, of course you can get 512 gigabytes of uRAM for 10 grand.

The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: you can have enough uRAM and the increased processing speed of the new M3 chip for AI, but the memory bandwidth stays the same.


> if a llm will run with usable performance at that scale?

Yes.

The reason: MoE. They are able to run at a good speed because they don't load all of the weights into the GPU cores.

For instance, DeepSeek R1 uses 404 GB in Q4 quantization[0], containing 256 experts of which 8 are routed to[1] (very roughly 13 GB per forward pass). With a memory bandwidth of 800 GB/s[2], the Mac Studio will be able to output 800/13 = 62 tokens per second.

[0]: https://ollama.com/library/deepseek-r1

[1]: https://arxiv.org/pdf/2412.19437

[2]: https://www.apple.com/newsroom/2025/03/apple-unveils-new-mac...


This doesn’t sound correct.

You don’t know which expert you’ll need for each layer, so you either keep them all loaded in memory or stream them from disk


In RAM, yes. But if you compute an activation, you need to load the weights from RAM to the GPU core.


Got you, yeah, I misread your comment the first time around.


Note that 404 < 512


You seem like you know what you are talking about... mind if I ask what your thoughts on quantization are? It's unclear to me if quantization affects quality... I feel like I've heard both yes and no arguments.


There is no question that quantization degrades quality. The GGUF R1 uses Q4_K_M, which, on Llama-3-8B, increases the perplexity by 0.18[0]. Many plots show increasing degradation as you quantize more[1].

That said, it is possible to train a model in a quantization-aware way[2][3], which improves the quality a bit, although not higher than the raw model.

Also, a loss in quality may not be perceptible in a specific use-case. Famously LMArena.ai tested Llama 3.1 405B with bf16 and fp8, and the latter was only 2 Elo points below, well within measurement error.

[0]: https://github.com/ggml-org/llama.cpp/blob/master/examples/q...

[1]: https://github.com/ggml-org/llama.cpp/discussions/5063#discu...

[2]: https://pytorch.org/blog/quantization-aware-training/

[3]: https://mistral.ai/news/ministraux


I don't know what I'm talking about but when I first asked your question this https://gist.github.com/Artefact2/b5f810600771265fc1e3944228... helped start me on a path to understanding. I think.

But if you don't already know, the question you're asking is not at all something I could distill down into a sentence or two that would make sense to a layperson. Even then, I know I couldn't distill it at all, sorry.

Edit: I found this link I referenced above on quantized models by bartowski on huggingface https://huggingface.co/bartowski/Qwen2.5-Coder-14B-GGUF#whic...


I did my own experiments, and it looks like (surprisingly) Q4KM models often outperform Q6 and Q8 quantised models.

For bigger models (in the 8B - 70B range) Q4KM is very good; there is no noticeable degradation compared to full FP16 models.


I returned an M2 Max Studio with 96GB RAM; unquantized Llama 3.1 70B was dog slow, not an interactive pace. I'm interested in offline LLMs but couldn't see how it was going to produce a $3,000 ROI.


It would be really cool if there was an "are we there yet" website for reasonable offline AI.

It could track different hardware configurations and reasonably standardized benchmark performance per model. I know there are benchmarks buried in the llama.cpp GitHub repository.


There seems to be a LOT of interest in such a site in the comments here. There seem to be multiple IP issues with sharing your code repo with an online service so I feel a lot of folks are waiting for the hardware to make this possible.

We need a SWE-bench for open-source LLMs, and for each model to have 3DMark-like benchmarks on various hardware setups.

I did find this which seems very helpful but is missing the latest models and hardware options. https://kamilstanuch.github.io/LLM-token-generation-simulato...


Looks like he bases the benchmarks off of https://github.com/ggml-org/llama.cpp/discussions/4167

I get why he calls it a simulator, as it can simulate token output. That's an important aspect for evaluating a use case if you need to get a sense of how much token output matters, beyond a simple tokens-per-second figure.


The M3 Ultra is the only configuration that supports 512GB and it has memory bandwidth of 819GB/s.


True, I also noticed that bigger models run slower at the same memory bandwidth (makes sense).


Yeah, I don’t think RAM is the bottleneck. Which is unfortunate. It feels like a missed opportunity for them. I think Apple partly became popular because it enabled creatives and developers.


> I don’t think RAM is the bottleneck

Not the size/amount, but the memory bandwidth usually is.

