More

JustFinishedBSG · 2025-07-11T16:54:05 1752252845

This doesn’t change the VRAM usage, only the compute requirements.

selfhoster11 · 2025-07-11T19:11:08 1752261068

It does not have to be VRAM, it could be system RAM, or weights streamed from SSD storage. Reportedly, the latter method achieves around 1 token per second on computers with 64 GB of system RAM.

R1 (and K2) is MoE, whereas Llama 3 is a dense model family. MoE actually makes these models practical to run on cheaper hardware. DeepSeek R1 is more comfortable for me than Llama 3 70B for exactly that reason - if it spills out of the GPU, you take a large performance hit.

If you need to spill into CPU inference, you really want to be multiplying a different set of 32B weights for every token compared to the same 70B (or more) instead, simply because the computation takes so long.

refulgentis · 2025-07-11T19:31:12 1752262272

The amount of people who will be using it at 1 token/sec because there's no better option, and have 64 GB of RAM, is vanishingly small.

IMHO it sets the local LLM community back when we lean on extreme quantization & streaming weights from disk to say something is possible*, because when people try it out, it turns out it's an awful experience.

* the implication being, anything is possible in that scenario

selfhoster11 · 2025-07-12T06:34:07 1752302047

Good. Vanishingly small is still more than zero. Over time, running such models will become easier too, as people slowly upgrade to better hardware. It's not like there aren't options for the compute-constrained either. There are lots of Chinese models in the 3-32B range, and Gemma 3 is particularly good too.

I will also point out that having three API-based providers deploying an impractically-large open-weights model beats the pants of having just one. Back in the day, this was called second-sourcing IIRC. With proprietary models, you're at the mercy of one corporation and their Kafkaesque ToS enforcement.

refulgentis · 2025-07-12T12:06:19 1752321979

You said "Good." then wrote a nice stirring bit about how having a bad experience with a 1T model will force people to try 4B/32B models.

That seems separate from the post it was replying to, about 1T param models.

If it is intended to be a reply, it hand waves about how having a bad experience with it will teach them to buy more expensive hardware.

Is that "Good."?

The post points out that if people are taught they need an expensive computer to get 1 token/second, much less try it and find out it's a horrible experience (let's talk about prefill), it will turn them off against local LLMs unnecessarily.

Is that "Good."?

jimjimwii · 2025-07-13T17:18:48 1752427128

Had you posted this comment in the early 90s about linux instead of local models, it would have made about the same amount of sense but aged just as poorly as this comment will.

I'll remain here happily using 2.something tokens / second model.

apitman · 2025-07-14T03:29:09 1752463749

But local aka desktop Linux is still an awful experience for most people. I use Arch btw

selfhoster11 · 2025-07-15T08:15:52 1752567352

I'd rather use Arch over a genuine VT100 than touch Windows 11, so the analogy remains valid - at least you have a choice at all, even if you are in a niche of a niche.

homarp · 2025-07-11T21:56:27 1752270987

agentic loop can run all night long. It's just a different way to work: prepare your prompt queue, set it up, check result in the morning, adjust. 'local vibe' in 10h instead of 10mn is still better than 10 days of manual side coding.

hereme888 · 2025-07-12T09:14:39 1752311679

Right on! Especially if its coding abilities are better than Claude 4 Opus. I spent thousands on my PC in anticipation of this rather than to play fancy video games.

Now, where's that spare SSD...

maven29 · 2025-07-11T16:56:34 1752252994

You can probably run this on CPU if you have a 4090D for prompt processing, since 1TB of DDR4 only comes out to around $600.

For GPU inference at scale, I think token-level batching is used.

zackangelo · 2025-07-11T17:45:36 1752255936

Typically a combination of expert level parallelism and tensor level parallelism is used.

For the big MLP tensors they would be split across GPUs in a cluster. Then for the MoE parts you would spread the experts across the GPUs and route to them based on which experts are active (there would likely be more than one if the batch size is > 1).

t1amat · 2025-07-11T17:13:21 1752254001

With 32B active parameters it would be ridiculously slow at generation.

selfhoster11 · 2025-07-11T19:15:28 1752261328

DDR3 workstation here - R1 generates at 1 token per second. In practice, this means that for complex queries, the speed of replying is closer to an email response than a chat message, but this is acceptable to me for confidential queries or queries where I need the model to be steerable. I can always hit the R1 API from a provider instead, if I want to.

Given that R1 uses 37B active parameters (compared to 32B for K2), K2 should be slightly faster than that - around 1.15 tokens/second.

CamperBob2 · 2025-07-12T18:56:30 1752346590

That's pretty good. Are you running the real 600B+ parameter R1, or a distill, though?

selfhoster11 · 2025-07-14T01:54:07 1752458047

The full thing, 671B. It loses some intelligence at 1.5 bit quantisation, but it's acceptable. I could actually go for around 3 bits if I max out my RAM, but I haven't done that yet.

apitman · 2025-07-14T03:32:05 1752463925

I've seen people say the models get more erratic at higher (lower?) quantization levels. What's your experience been?

selfhoster11 · 2025-07-15T08:14:36 1752567276

If you mean clearly, noticeably erratic or incoherent behaviour, then that hasn't been my experience for >=4-bit inference of 32B models, or in my R1 setup. I think the others might have been referring to this happening with smaller models (sub-24B), which suffer much more after being quantised below 4 or 5 bits.

My R1 most likely isn't as smart as the output coming from an int8 or FP16 API, but that's just a given. It still holds up pretty well for what I did try.

JustFinishedBSG · 2025-07-01T06:34:11 1751351651

Very uncommon / inexistant in private appartements except I guess if you live in a very very very upscale appartement.

Installing AC is actually not allowed in many places ( because of urbanism laws)

Only possible AC is those single hose mobile units which are wildly inefficient and close to useless while burning energy.

xoa · 2025-07-01T17:52:55 1751392375

>Only possible AC is those single hose mobile units which are wildly inefficient and close to useless while burning energy.

FWIW at least in the US (and I can't imagine they wouldn't be available worldwide) there are also dual hose portable AC units which can perform fairly decently, at least far better than single hose. I needed to use one for awhile at an old office (I think it was a Whynter model) and it was effective. There are also more exotic portable units that use water as the fluid dump, but that requires having a sufficient water source that you can utilize, and probably isn't going to be doable in a residential unit in a city. We had a couple at the chemistry lab I worked in 10-15 years ago that hooked into the lab water lines.

JustFinishedBSG · 2025-06-23T06:58:27 1750661907

Is Typst’s typesetting quality on par with « bare » LaTeX ? with LaTeX + microtype ?

It may be stupid and vain but for me if it doesn’t at least match the former it’s a no-go

Paapaa · 2025-06-23T07:52:33 1750665153

Yes, it uses a very similar algorith as LaTeX. It also incorporates already some microtype features out of the box. So the typesetting quality is very good and easily comparable to LaTeX. Working with Typst is so much easier and faster than with LaTeX so you will be more productive. Many things can be done without resorting to external packages and scripting is a breeze compared to LaTeX.

Just try it out. It is free, open source and very easy to setup. Just install the extension Tinymist on VSCode, that is all you need.

thezoq2 · 2025-06-23T09:05:00 1750669500

Until 0.13 it wasn't quite as good as latex in my experience, it mainly inserted more hyphens than LaTeX.

As of this version, it would be very hard to tell a difference in my experience

JustFinishedBSG · 2025-06-05T15:15:49 1749136549

Yes but an app that never pushes update can also do that

JustFinishedBSG · 2025-06-05T15:14:04 1749136444

I'm sure shipping a >150GB file to every user is perfectly fine and sound engineering.

NicolaiS · 2025-06-05T19:05:49 1749150349

Parent comment writes: "ship[ing] the tree root hash", for a merkle tree ("bitcoin style") this would just be a single (small) hash no matter the tree size, i.e. 32 bytes is enough.

VWWHFSfQ · 2025-06-05T15:25:20 1749137120

It's not _that_ far off from shipping a 3GB chrome webapp disguised as a desktop app (cough electron)

kstrauser · 2025-06-05T15:34:39 1749137679

What’s a couple orders of magnitude between friends?

JustFinishedBSG · 2025-06-03T16:22:26 1748967746

I am quite honestly baffled by all these posts. I'm not a hardcore skeptic, at least I don't think so, I have tried most of the vibe coding tools, and I regularly try new ones and continue using many; and I do NOT have the experience described in all these posts.

So two possibilities come to mind:

1. I, somehow, use every single one of those tools wrong.

2. Or; you and the other people raving about Claude code are solving trivial problems.

Of course, I'm human so I'm inclined to believe I do not use all tools wrong. But reading your experience and those similar to yours I can't help but think two things:

- you are solving incredibly rote, already solved, problem that the LLMs are just able to spit out by heart. I do too, and yet I don't have the same AI successes so damn your problems must be especially trivial.

- You are so wildly overpaid that paying hundreds (I even saw people "boasting" about spending thousands a day !) of dollar a month is perfectly justifiable.

I don't think I'm wrong on the last point, and if I can see it then employers will be able to see it too. Why on earth would anyone pay 600$/h (your rates according to one of your post), when apparently I can just pay 200$/mo and get the same thing?

And don't tell me "because it takes skill to use Claude Code", you just wrote a blog post telling it mostly doesn't (apparently).

Don't take this whole post as a mean criticism (and if it reads like it, I'm sorry), I am just truly flabergasted.

zparky · 2025-06-03T18:20:58 1748974858

Disclaimer: I haven't experimented with AI coding

Every time I see a post like yours, I see the reply saying "yeah you're using it wrong". I can't say either 1 or 2 is true, but if you saw the recent post[0] about cloudflare making an API entirely with Claude, it could be more of a case of not using the tools to their potential.

[0] https://news.ycombinator.com/item?id=44159166

yuck39 · 2025-06-04T15:29:22 1749050962

Don't rule out possibility 1.

Sometimes you have to learn how to frame the problem in a way to get the results that you want. These tools need lots of context, not just about the rest of the code base but the problem itself. You can think of it a bit like how the early adopters of high level programming languages had to fight against compilers to get the assembly output that they wanted.

For example, if I tell an LLM to generate a python script that finds the square of a number I might want: def square(x): return x * x

but it may give me: print("Enter a number:") x = int(input()) print("The square is", x * x)

This is a very very simple example but I think it illustrates my point. If you provide enough context to the exact problem you want to solve the results are astronomically better.

jdance · 2025-06-04T05:21:48 1749014508

Yesterday I asked both claude and gemini to downgrade a package on manjaro and both failed miserably

I use these tools a lot and I would say they fail 50% of the time for my use cases and I baby them every step

”Is it just me” is such a flawed mindset. The reality is staring you in the face

JustFinishedBSG · 2025-05-21T10:19:09 1747822749

Kind of expensive but there isn't actually a lot of choice for fonts with matching math fonts so for my PhD thesis I used Minion 3 + MinionMath.

For mono fonts there are a lot of nice choices but I used PragmataPro for no other reasons that I own it and it provided a nice readable contrast.

Otherwise for the free options, Palatino+mathpazo or StixTwoText + StixTwoMath are quite good options. Honestly anything but ComputerCM is a good option; it's imho not a very good font nowadays; it's way too thin. It was designed with the assumption that it will be printed on old, fairly bad, printers with significant ink overspill.

PJ_Maybe · 2025-05-21T12:14:20 1747829660

MLModern [0] is a thicker version of Donald Knuth's Computer Modern and the Latin Modern project, in case you are interested.

[0] https://ctan.org/pkg/mlmodern

SubjectToChange · 2025-05-21T14:22:55 1747837375

+1 for Minion Pro, Minion Math, and PragmataPro. Those fonts have been my preferred defaults for years now. While they are expensive, it is worth it to write in a beautiful font and to compensate the artistic work that went into making them.

trueismywork · 2025-05-21T15:35:46 1747841746

Does MinionMath cover all ligature? My reason for mostly sticking with Latin Modern Math is because I don't have to worry about random characters being unprintable.

JustFinishedBSG · 2025-05-23T14:45:51 1748011551

MinionMath only covers the Unicode Math part. It doesn't cover everything but it's definitely extensive ( 3300 symbols ! ): http://www.typoma.com/data/MinionMath_Release_1_026.pdf

MinionPro / Minion 3 is Adobe's flagship font so it basically support every language and ligature under the sun

trueismywork · 2025-05-24T03:50:18 1748058618

Thanks.

JustFinishedBSG · 2025-05-15T12:48:47 1747313327

It would "know" the same way it "knows" anything else: The probability of the sequence "I don't know" would be higher than the probability of any other sequence.

Sharlin · 2025-05-15T22:05:13 1747346713

Exactly. It's easy to imagine a component in the net that the model is steered towards when nothing else has a high enough activation.

JustFinishedBSG · 2025-04-29T20:51:43 1745959903

Watt isn’t a measure of energy. Without how long it takes for a human and ChatGPT to solve the task then the comparison doesn’t teach anything

roschdal · 2025-04-30T04:48:47 1745988527

You're absolutely right — watt is a unit of power, not energy. To make a meaningful comparison, we need to estimate how much energy (in joules) each system uses to solve the same task.

Let’s define a representative task: answering a moderately complex question.

1. Human Brain Power use: ~20 watts (on average while awake)

Time to think: ~10 seconds to answer a question

Energy used: 20

watts × 10

seconds = 200

joules 20watts×10seconds=200joules

2. ChatGPT (GPT-4) Estimate per query: Based on research and datacenter estimates, GPT-4 may use:

Around 2–3 kWh per 1000 queries, which is 7.2–10.8 megajoules

Per query: 7.2

MJ 1000 = 7200

joules 1000 7.2MJ

=7200joules per response (lower bound) 10.8

MJ 1000 = 10 , 800

joules 1000 10.8MJ

=10,800joules per response (upper bound)

Comparison Human: ~200 joules

ChatGPT: ~7,200 to 10,800 joules

Conclusion: The human brain is about 36–54 times more energy-efficient than ChatGPT at answering a single question.

Or in percent: 3,600% to 5,400% more efficient

JustFinishedBSG · 2025-04-22T11:26:27 1745321187

Ofc Mistral model is a lot worse but for $14.99 you get access to AFP news. So OpenAI for $60 for the same things would be a huge joke