Hacked Nvidia 4090 GPU driver to enable P2P

andersa · 2024-04-12T14:08:14 1712930894

Incredible! I'd been wondering if this was possible. Now the only thing standing in the way of my 4x4090 rig for local LLMs is finding time to build it. With tensor parallelism, this will be both massively cheaper and faster for inference than a H100 SXM.

I still don't understand why they went with 6 GPUs for the tinybox. Many things will only function well with 4 or 8 GPUs. It seems like the worst of both worlds now (use 4 GPUs but pay for 6 GPUs, don't have 8 GPUs).

georgehotz · 2024-04-12T17:15:58 1712942158

tinygrad supports uneven splits. There's no fundamental reason for 4 or 8, and work should almost fully parallelize on any number of GPUs with good software.

We chose 6 because we have 128 PCIe lanes, aka 8 16x ports. We use 1 for NVMe and 1 for networking, leaving 6 for GPUs to connect them in full fabric. If we used 4 GPUs, we'd be wasting PCIe, and if we used 8 there would be no room for external connectivity aside from a few USB3 ports.

andersa · 2024-04-12T19:57:56 1712951876

That is very interesting if tinygrad can support it! Every other library I've seen had the limitation on dividing the heads, so I'd (perhaps incorrectly) assumed that it's a general problem for inference.

spmurrayzzz · 2024-04-13T16:10:03 1713024603

There are some interesting hacks you can do like replicating the K/V weights by some factor which allows them to be evenly divisible by whatever number of gpus you have. Obviously there is a memory cost there, but it does work.

Emerson1 · 2024-04-27T07:42:17 1714203737

how could you go about testing this on say llama3 70b with two 4090s - vllm supports tensor parallelism, would the expectation be that inference would be faster with p2p? How would you update the nvidia driver tho? Thanks any thought appreciated

WanderPanda · 2024-04-13T00:17:46 1712967466

Did you at least front run the market and stocked up of 4090ies before this release? Also gamers are probably not too happy about these developments :D

tinco · 2024-04-13T13:10:44 1713013844

4090's have consistently been around 2000 dollars. I don't think there's many gamers who would be affected by price fluctuations of the 4090 or even the 4080.

presides · 2024-04-13T16:27:39 1713025659

This is out of touch; they were mad before and they will be mad again. Lots of people spend a huge chunk of their modest disposable income on high end gaming gear, and the only upside of these issues for them is that eventually, YEARS down the line, capacity/supply issues MIGHT calm down in a way that yields some benefits.

They're going to realize soon enough that they've basically just been told that the extremely shitty problem they thought they'd moved beyond is back with a vengeance and the next generation of gaming cards has the potential to make the past few rounds of scalping shit-shows look tame.

swalsh · 2024-04-14T19:38:29 1713123509

Gamers have a TON of really good really affordable options. But you kind of need 24gb min unless you're using heavy quantization. So 3090 and 4090's are what local llm people are building with (mostly 3090's as you can get then for about $700, and they're dang good)

davidzweig · 2024-04-12T18:06:09 1712945169

Is it possible a similar patch would work for P2P on 3090s?

btw, I found a Gigabyte board on Taobao that is unlisted on their site: MZF2-AC0, costs $900. 2 socket Epyc and 10 PCIE slots, may be of interest. A case that should fit, with 2x 2000W Great Wall PSUs and PDU is 4050 RMB (https://www.toploong.com/en/4GPU-server-case/644.html). You still need blower GPUs.

georgehotz · 2024-04-12T18:25:16 1712946316

It should if your 3090s have Resizable BAR support in the VBIOS. AFAIK most card manufacturers released BIOS updates enabling this.

Re: 3090 NVLink, that only allows pairs of cards to be connected. PCIe allows full fabric switch of many cards.

Ratiofarmings · 2024-04-12T21:15:46 1712956546

In cases where they didn't, the techpowerup vBIOS collection solves the problem.

learnedmachine · 2024-04-15T19:12:56 1713208376

Thanks for the amazing work! I tried the driver with some 3090s (all of which show the 32G line with lspci -s 01:00.0 -v) and while torch says I have p2p access, I can't get it to work with anything as I get illegal memory access errors.

davidzweig · 2024-04-13T04:16:30 1712981790

Update, I checked with the case company, toploong, they say that board is a 5mm too big or so for the case.

cjbprime · 2024-04-12T18:23:40 1712946220

Doesn't nvlink work natively on 3090s? I thought it was only removed (and here re-enabled) in 4090.

qeternity · 2024-04-12T19:40:52 1712950852

This not not nvlink.

doctorpangloss · 2024-04-12T17:30:06 1712943006

Have you compared 3x 3090-3090 pairs over NVLink?

IMO the most painful thing is that since these hardware configurations are esoteric, there is no software that detects them and moves things around "automatically." Regardless of what people thing device_map="auto" does, and anyway, Hugging Face's transformers/diffusers are all over the place.

AnthonyMouse · 2024-04-12T21:01:56 1712955716

Is there any reason you couldn't use 7? 8 PCIe lanes each seems more than sufficient for NVMe and networking.

Tepix · 2024-04-12T14:42:45 1712932965

6 GPUs because they want fast storage and it uses PCIe lanes.

Besides the goal was to run a 70b FP16 model (requiring roughly 140GB VRAM). 6*24GB = 144GB

andersa · 2024-04-12T15:07:20 1712934440

That calculation is incorrect. You need to fit both the model (140GB) and the KV cache (5GB at 32k tokens FP8 with flash attention 2) * batch size into VRAM.

If the goal is to run a FP16 70B model as fast as possible, you would want 8 GPUs with P2P, for a total of 192GB VRAM. The model is then split across all 8 GPUs with 8-way tensor parallelism, letting you make use of the full 8TB/s memory bandwidth on every iteration. Then you have 50GB spread out remaining for KV cache pages, so you can raise the batch size up to 8 (or maybe more).

renewiltord · 2024-04-12T15:20:56 1712935256

I’ve got a few 4090s that I’m planning on doing this with. Would appreciate even the smallest directional tip you can provide on splitting the model that you believe is likely to work.

andersa · 2024-04-12T15:23:41 1712935421

The split is done automatically by the inference engine if you enable tensor parallelism. TensorRT-LLM, vLLM and aphrodite-engine can all do this out of the box. The main thing is just that you need either 4 or 8 GPUs for it to work on current models.

renewiltord · 2024-04-12T16:03:32 1712937812

Thank you! Can I run with 2 GPUs or with heterogeneous GPUs that have same RAM? I will try. Just curious if you already have tried.

andersa · 2024-04-12T16:06:10 1712937970

2 GPUs works fine too, as long as your model fits. Using different GPUs with same VRAM however, is highly highly sketchy. Sometimes it works, sometimes it doesn't. In any case, it would be limited by the performance of the slower GPU.

renewiltord · 2024-04-12T16:49:10 1712940550

All right, thank you. I can run it on 2x 4090 and just put the 3090s in different machine.

Tepix · 2024-04-14T00:13:27 1713053607

I know there's some overhead, it's not my calculation.

https://www.tweaktown.com/news/97110/tinycorps-new-tinybox-a...

Quote: "Runs 70B FP16 LLaMA-2 out of the box using tinygrad"

Related: https://github.com/tinygrad/tinygrad/issues/3791

liuliu · 2024-04-12T15:59:52 1712937592

6 seems reasonable. 128 Lanes from ThreadRipper needs to have a few for network and NVMe (4x NVMe would be x16 lanes, and 10G network would be another x4 lanes).

numpad0 · 2024-04-12T15:40:54 1712936454

I was googling public NVIDIA SXM2 materials the other day, and it seemed SXM2/NVLink 2.0 just was a six-way system. NVIDIA SXM had updated to versions 3 and 4 since, and this isn't based on none of those anyway, but maybe there's something we don't know that make six-way reasonable.

andersa · 2024-04-12T15:43:14 1712936594

It was probably just before running LLMs with tensor parallelism became interesting. There are plenty of other workloads that can be divided by 6 nicely, it's not an end-all thing.

dheera · 2024-04-12T19:12:08 1712949128

What is a six-way system?

TylerE · 2024-04-13T03:58:44 1712980724

oid school way of saying core (or in this case GPU), basically.

boromi · 2024-04-13T18:01:52 1713031312

Any chance you could share the details of the build you'd go for. I need a server for our lab, but am kinda out of my depth with all the options/

ShamelessC · 2024-04-12T14:52:28 1712933548

> Many things will only function well with 4 or 8 GPUs

What do you mean?

andersa · 2024-04-12T14:54:26 1712933666

For example, if you want to run low latency multi-GPU inference with tensor parallelism in TensorRT-LLM, there is a requirement that the number of heads in the model is divisible by the number of GPUs. Most current published models are divisible by 4 and 8, but not 6.

bick_nyers · 2024-04-12T15:48:55 1712936935

Interesting... 1 Zen 4 EPYC CPU yields a maximum of 128 PCIE lanes so it wouldn't be possible to put 8 full fat GPUs on while maintaining some lanes for storage and networking. Same deal with Threadripper Pro.

andersa · 2024-04-12T15:52:09 1712937129

It should be possible with onboard PCIe switches. You probably don't need the networking or storage to be all that fast while running the job, so it can dedicate almost all of the bandwidth to the GPU.

I don't know if there are boards that implement this, though, I'm only looking at systems with 4x GPUs currently. Even just plugging in a 5kW GPU server in my apartment would be a bit of a challenge. With 4x 4090, the max load would be below 3kW, so a single 240V plug can handle it no issue.

bick_nyers · 2024-04-12T17:39:42 1712943582

I've seen it done with a PLX Multiplexer as well, but they add quite a bit of cost:

https://c-payne.com/products/pcie-gen4-switch-backplane-4-x1...

Not sure if there exists an 8-way PCIE Gen 5 Multiplexer that doesn't cost ludicrous amounts of cash. Ludicrous being a highly subjective and relative term of course.

namibj · 2024-04-12T19:22:31 1712949751

98 lanes of PCIe 4.0 fabric switch as just the chip (to solder onto a motherboard/backplane) costs 850$ (PEX88096). You could for example take 2 x16 GPUs, pass then through (2216=64 lanes), and have 2 x16 that bifurcate to at least x4 (might even be x2, I didn't find that part of the docs just now) for anything you want, plus 2 x1 for minor stuff. They do claim to have no problems being connected up into a switching fabric, and very much allow multi-host operations (you will need signal retimers quite soon, though).

They're the stuff that enables cloud operators to pool like 30 GPUs across like 10 CPU sockets while letting you virtually hot-plug them to fit demand. Or when you want to make a SAN with real NVMe-over-PCIe. Far cheaper than normal networking switches with similar ports (assuming hosts doing just x4 bifurcation, it's very comparable to a 50G Ethernet port. The above chip thus matches a 24 port 50G Ethernet switch. Trading reach for only needing retimers, not full NICs, in each connected host. Easily better for HPC clusters up to about 200 kW made from dense compute nodes.), but sadly still lacking affordable COTS parts that don't require soldering or contacting sales for pricing (the only COTS with list prices seem to be Broadcom's reference designs, for prices befitting an evaluation kit, not a Beowulf cluster).

pests · 2024-04-13T00:17:36 1712967456

I really like the information about how the cloud providers do their multiplexing, thanks. There was some tech posted here a few month ago that was similar I found very interesting - plug all devices, ram, hard drives, and CPUS into a larger fabric and a way to spin up "servers" of any size from the pool of resources... wish I could remember the name now.

nit: HN formatting messed up your math in the second sentence, I believe you italicized on accident using * for equations.

thangngoc89 · 2024-04-12T16:17:07 1712938627

8 GPUs x 16 PCIe lanes each = 128 lanes already.

That’s the limit of single CPU platforms.

segfaultbuserr · 2024-04-12T14:55:20 1712933720

It's more difficult to split your work across 6 GPUs evenly, and easier when you have 4 or 8 GPUs. The latter setups have powers of 2, which for example, can evenly divide a 2D or 3D grid, but 6 GPUs are awkward to program. Thus, the OP argues that a 6-GPU setup is highly suboptimal for many existing applications and there's no point to pay more for the extra 2.

cjbprime · 2024-04-12T16:09:02 1712938142

I don't think P2P is very relevant for inference. It's important for training. Inference can just be sharded across GPUs without sharing memory between them directly.

andersa · 2024-04-12T16:10:16 1712938216

It can make a difference when using tensor parallelism to run small batch sizes. Not a huge difference like training because we don't need to update all weights, but still a noticeable one. In the current inference engines there are some allreduce steps that are implemented using nccl.

Also, paged KV cache is usually spread across GPUs.

namibj · 2024-04-12T18:48:57 1712947737

It massively helps arithmetic intensity to batch during inference, and the desired batch sizes by that tend to exceed the memory capacity of a single GPU. Thus desire to do training-like cluster processing to e.g. use a weight for each inference stream that needs it every time it's fetched from memory. It's just that you can't fit 100+ inference streams of context on one GPU, typically, thus the desire to shard along less-wasteful (w.r.t. memory bandwidth) dimensions than entire inference streams.

qeternity · 2024-04-12T19:45:54 1712951154

You are talking about data parallelism. Depending on the model tensor parallelism can still be very important for inference.

corn13read2 · 2024-04-12T14:35:39 1712932539

A macbook is cheaper though

tgtweak · 2024-04-12T14:47:01 1712933221

The extra $3k you'd spend on a quad-4090 rig vs the top mbp... ignoring the fact you can't put the two on even ground for versatility (very few libraries are adapted to apple silicone let alone optimized).

Very few people that would consider an H100/A100/A800 are going to be cross-shopping a macbook pro for their workloads.

LoganDark · 2024-04-12T17:01:01 1712941261

> very few libraries are adapted to apple silicone let alone optimized

This is a joke, right? Have you been anywhere in the LLM ecosystem for the past year or so? I'm constantly hearing about new ways in which ASi outperforms traditional platforms, and new projects that are optimized for ASi. Such as, for instance, llama.cpp.

cavisne · 2024-04-12T17:20:21 1712942421

Nothing compared to Nvidia though. The FLOPS and memory bandwidth is simply not there.

spudlyo · 2024-04-12T19:52:10 1712951530

The memory bandwidth of the M2 Ultra is around 800GB/s verses 1008 GB/s for the 4090. While it’s true the M2 has neither the bandwidth or the GPU power, it is not limited to 24G of VRAM per card. The 192G upper limit on the M2 Ultra will have a much easier time running inference on a 70+ billion parameter model, if that is your aim.

Besides size, heat, fan noise, and not having to build it yourself, this is the only area where Apple Silicon might have advantage over a homemade 4090 rig.

LoganDark · 2024-04-12T20:45:04 1712954704

It doesn't need GPU power to beat the 4090 in benchmarks: https://appleinsider.com/articles/23/12/13/apple-silicon-m3-...

int_19h · 2024-04-12T20:52:07 1712955127

It doesn't beat RTX 4090 when it comes to actual LLM inference speed. I bought a Mac Studio for local inference because it was the most convenient way to get something fast enough and with enough RAM to run even 155b models. It's great for that, but ultimately it's not magic - NVidia hardware still offers more FLOPS and faster RAM.

LoganDark · 2024-04-12T21:13:29 1712956409

> It doesn't beat RTX 4090 when it comes to actual LLM inference speed

Sure, whisper.cpp is not an LLM. The 4090 can't even do inference at all on anything over 24GB, while ASi can chug through it even if slightly slower.

I wonder if with https://github.com/tinygrad/open-gpu-kernel-modules (the 4090 P2P patches) it might become a lot faster to split a too-large model across multiple 4090s and still outperform ASi (at least until someone at Apple does an MLX LLM).

numpad0 · 2024-04-14T15:17:06 1713107826

PSA for all people who are still being misled by hand-wavy Apple M1 marketing charts[1] implicating total dominance of M-series wondersilicon obsoleting all Intel/NVIDIA PCs:

There are benchmark data showing that an Apple M2 Ultra is 47% and 60% slower against Xeon W9 and RTX 4090, or 0.35% and 2% slower against i9-13900K and RTX 4060 Ti, respectively, in Geekbench 5 Multi-threaded and OpenCL Compute tests.

Apple Silicon Macs are NOT faster than competing desktop computers, nor M1 was massively faster than NVIDIA 3070(Desktop - 2x faster than Laptop variant M1 was compared against) for that matter. They just offer up to 128GB shared RAM/VRAM options in slim desktops and laptops, which is handy for LLM, that's it.

Please stop taking Apple marketing materials at full face value or above. Thank you.

  1: https://i.extremetech.com/imagery/content-types/03ekiQwNudC75iOK4AMuEkw/images-2.jpg    
  2: screenshot from[4]: https://www.igorslab.de/wp-content/uploads/2023/06/Apple-M2-ULtra-SoC-Geekbench-5-Multi-Threaded.jpg  
  3: screenshot from[4]: https://www.igorslab.de/wp-content/uploads/2023/06/Apple-M2-ULtra-SoC-Geekbench-5-OpenCL-Compute.jpg  
  4: https://wccftech.com/apple-m2-ultra-soc-isnt-faster-than-amd-intel-last-year-desktop-cpus-50-slower-than-nvidia-rtx-4080/

dragonwriter · 2024-04-12T21:22:32 1712956952

> The 4090 can't even do inference at all on anything over 24GB, while ASi can chug through it even if slightly slower.

Common LLM runners can split model layers between VRAM and system RAM; a PC rig with a 4090 can do inference on models larger than 24G.

Where the crossover point where having the whole thing on Apple Silicon unified memory vs. doing split layers on a PC with a 4090 and system RAM is, I don't know, but its definitely not “more than 24G and a 4090 doesn't do anything”.

LoganDark · 2024-04-12T23:25:04 1712964304

> Common LLM runners can split model layers between VRAM and system RAM; a PC rig with a 4090 can do inference on models larger than 24G.

Sure and ASi can do inference on models larger than the Unified Memory if you account for streaming the weights from the SSD on-demand. That doesn't mean it's going to be as fast as keeping the whole thing in RAM, although ASi SSDs are probably not particularly bad as far as SSDs go.

treprinum · 2024-04-13T11:27:13 1713007633

Slightly slower in this case is like 10x. I have M3 Max with 128GB RAM, 4090 trashes it on anything under 24GB, then M3 Max trashes it on anything above 24GB, but it's like 10x slower at it than 4090 on <24GB.

LoganDark · 2024-04-12T20:26:56 1712953616

Yeah. Let me just walk down to Best Buy and get myself a GPU with over 24 gigabytes of VRAM (impossible) for less than $3,000 (even more impossible). Then tell me ASi is nothing compared to Nvidia.

Even the A100 for something around $15,000 (edit: used to say $10,000) only goes up to 80 gigabytes of VRAM, but a 192GB Mac Studio goes for under $6,000.

Those figures alone proves Nvidia isn't even competing in the consumer or even the enthusiast space anymore. They know you'll buy their hardware if you really need it, so they aggressively segment the market with VRAM restrictions.

andersa · 2024-04-12T20:32:32 1712953952

Where are you getting an A100 80GB for $10k?

LoganDark · 2024-04-12T20:36:03 1712954163

Oops, I remembered it being somewhere near $15k but Google got confused and showed me results for the 40GB instead so I put $10k by mistake. Thanks for the correction.

A100 80GB goes for around $14,000 - $20,000 on eBay and A100 40GB goes for around $4,000 - $6,000. New (not from eBay - from PNY and such), it looks like an 80GB would set you back $18,000 to $26,000 depending on whether you want HBM2 or HBM2e.

Meanwhile you can buy a Mac Studio today without going through a distributor and they're under $6,000 if the only thing you care about is having 192GB of Unified Memory.

And while the memory bandwidth isn't quite as high as the 4090, the M-series chips can run certain models faster anyway, if Apple is to be believed

andersa · 2024-04-12T14:58:37 1712933917

Sure, it's also at least an order of magnitude slower in practice, compared to 4x 4090 running at full speed. We're looking at 10 times the memory bandwidth and much greater compute.

chaostheory · 2024-04-12T21:47:42 1712958462

Yeah, even a Mac Studio is way too slow compared to Nvidia which is too bad because at $7000 maxed to 192gb it would be an easy sell. Hopefully, they will fix this by m5. I don’t trust the marketing for m4

faeriechangling · 2024-04-13T19:08:14 1713035294

Buying a MacBook for AI is great if you were already going to buy a MacBook, as this makes it a lot more cost competitive. It's also great if what you're doing is REALLY privacy sensitive, such as if you're a lawyer, where uploading client data to OpenAI is probably not appropriate or legal.

But in general, I find the appeal is narrow because either consumer GPUs are better for training in general and inferencing at scale[1]. Cloud services also allow the vast majority of individuals to get higher quality inferencing at lower cost. The result is Apple Silicon's appeal being quite niche.

[1] Mind you, Nvidia considers this a licensing violation, not that GeoHot has historically ever been all scared to violate a EULA and force a company to prove its terms have legal force.

llm_trw · 2024-04-12T16:15:01 1712938501

So is a TI-89.

amelius · 2024-04-12T18:32:59 1712946779

And looks way cooler

numpad0 · 2024-04-12T16:28:50 1712939330

4x32GB(128GB) DDR4 is ~$250. 4x48GB(192GB) DDR5 is ~$600. Those are even cheaper than upgrade options for Macs($1k).

papichulo2023 · 2024-04-12T18:59:56 1712948396

No many consumer mobo support 192GB DDR5.

faeriechangling · 2024-04-13T19:33:39 1713036819

Most consumer mobo's I see support this even if the setup isn't on the QVL. If a DDR5 motherboard support 4 sticks at all you can probably run 192gb on it so long as you update the BIOS firmware. The problem is running at rated speeds.

AMD tends to be worse than Intel, and I hear people having to run anywhere between DDR5-3200 to DDR5-5200. You are better off running two sticks, because even with 2 sticks you really can't run larger models with acceptable performance anyways, much less with 4.

There is competition to apple on the low end (dual channel fast DDR5) and on the high end (8+ channel like Xeon/Epyc/AmpereOne). In the middle, Apple is sort of crushing because if you run a true 4 channel system you're going to get poor performance if you load up a 192gb model, and if you compare pricing to 96gb/128gb apple systems, there's not all that much of a cost advantage and you have to make a lot of sacrifices to get there. The truth is that Apple really doesn't have all that much competition right now and won't for the foreseeable future.

papichulo2023 · 2024-04-13T21:35:55 1713044155

Hopefully Qualcom will free us of this 2 channels noghtmare.

wtallis · 2024-04-14T17:05:30 1713114330

I don't think it's realistic to pin your hopes on Qualcomm given that they're unlikely to care about supporting anything other than LPDDR with their laptop processors.

faeriechangling · 2024-04-14T03:02:53 1713063773

I’m optimistic about APUs personally like AMDs upcoming Strix Halo APU with a 256-bit memory bus competing at the lower end of the market, but that will only provide so much competition.

wtallis · 2024-04-12T20:05:00 1712952300

If it supports DDR5 at all, then it should be at most a firmware update away from supporting 48GB dual-rank DIMMs. There are very few consumer motherboards that only have two DDR5 slots; almost all have the four slots necessary to accept 192GB. If you are under the impression that there's a widespread limitation on consumer hardware support for these modules, it may simply be due to the fact that 48GB modules did not exist yet when DDR5 first entered the consumer market, and such modules did not start getting mentioned on spec sheets until after they existed.

imtringued · 2024-04-13T09:15:57 1712999757

You don't want to use more than two slots because you only have two memory channels. The overclocking potential of DDR5 is extremely high when you only run two DIMMs. All the way up to 8000. Meanwhile if you go for populating all four slots, you are limited significantly below 5000. Almost a 50% performance drop if you are willing to overclock your RAM.

wtallis · 2024-04-13T09:21:58 1713000118

If you want to run something that doesn't fit in 96GB of RAM, you'll get better performance from having enough RAM. Yes, having two dual-rank DIMMs per channel will force you to run at a slower speed, but it's still far faster than your SSD. The second slot per channel exists precisely because many people really do want to use it.

ojbyrne · 2024-04-12T20:05:03 1712952303

A lot that have specs showing they support a max of 4x32 DDR5 actually support 4x48 DDR5 via recent BIOS updates.

papichulo2023 · 2024-04-13T21:37:23 1713044243

In the specs yeap, in practice hardly anyone got it working. As far as I saw in reddit, it requires customizing timings to make 4 slots work over 6000 Mhz at the same time.

thangngoc89 · 2024-04-12T15:16:27 1712934987

training on MPS backend is suboptimal and really slow.

wtallis · 2024-04-12T18:41:37 1712947297

Do people do training on systems this small, or just inference? I could see maybe doing a little bit of fine-tuning, but certainly not from-scratch training.

redox99 · 2024-04-12T21:01:15 1712955675

If you mean train llama from scratch, you aren't going to train it on any single box.

But even with a single 3090 you can do quite a lot with LLMs (through QLoRA and similar).

thangngoc89 · 2024-04-13T04:33:55 1712982835

Yep. Price/performance of multiple 4090s system are way better than the professional cards (Axxx). Also deep learning outside of LLM has many different usage.

chriskanan · 2024-04-13T14:53:33 1713020013

This is great news. As an academic, I'm aware of multiple labs that built boxes with 4090s, not realizing that Nvidia had impaired P2P communication among cards. It's one of the reasons I didn't buy 4090s, despite them being much more affordable for my work. It isn't nvlink, but Nvidia has mostly gotten rid of that except for their highest end cards. It is better than nothing.

Late last year, I got quotes for machines with four nvlink H100s, but the lead time for delivery was 13 months. I could get the non-nvlink ones in just four months. For now, I've gone with four L40S cards to hold my lab over but supply chain issues and gigantic price increases are making it very hard for my lab to do it's work. That's not nearly enough to support 6 PhD students and a bunch of undergrads.

Things were a lot easier when I could just build machines with two GPUs each with Nvlink for $5K each and give one to each student to put under their desks, which is what I did back in 2015-2018 at my old university.

uniqueuid · 2024-04-14T15:56:41 1713110201

And before that, Nvidia made our lives harder by phasing out blower-style designs in consumer cards that we could put in servers. In my lab, I'd take a card for 1/4 the price that has half the MTBF over a card for full price anytime.

photonbeam · 2024-04-14T18:12:27 1713118347

How does cost compare with some of the GPU-cloud providers?

uniqueuid · 2024-04-14T18:49:07 1713120547

Not op, but I found this benchmark of whisper large-v3 interesting [1]. It includes the cloud provider's pricing per gpu, so you can directly calculate break-even timing.

Of course, if you use different models, training, fine tuning etc. the benchmarks will differ depending on ram, support of fp8 etc.

[1] https://blog.salad.com/whisper-large-v3/

jstanley · 2024-04-12T12:10:07 1712923807

What does P2P mean in this context? I Googled it and it sounds like it means "peer to peer", but what does that mean in the context of a graphics card?

__alexs · 2024-04-12T12:15:09 1712924109

It means you can send data from the memory of 1 GPU to another GPU without going via RAM. https://xilinx.github.io/XRT/master/html/p2p.html

ot1138 · 2024-04-12T12:25:08 1712924708

Is this really efficient or practical? My understanding is that the latency required to copy memory from CPU or RAM to GPU negates any performance benefits (much less running over a network!)

llm_trw · 2024-04-12T12:57:05 1712926625

Yes, the point here is that you do a direct write from one cards memory to the other using PCIe.

In older NVidia cards this could be done through a faster link called NVLink but the hardware for that was ripped out of consumer grade cards and is only in data center grade cards now.

Until this post it seemed like they had ripped all such functionality of their consumer cards, but it looks like you can still get it working at lower speeds using the PCIe bus.

acka · 2024-04-13T15:10:17 1713021017

> In older NVidia cards this could be done through a faster link called NVLink but the hardware for that was ripped out of consumer grade cards and is only in data center grade cards now.

NVLink is still very much available in both RTX 3090 and A6000, both of which are still on the market. It was indeed removed from the RTX 40 series{0].

[0]: https://www.pugetsystems.com/labs/articles/nvidia-nvlink-202...

spxneo · 2024-04-12T15:36:29 1712936189

so whats stopping from somebody buying a ton of GPUs that are cheap and wiring it up via P2P like we saw with crypto mining

genewitch · 2024-04-12T18:51:28 1712947888

crypto mining only needs 1 PCIe lane per GPU, so you can fit 24+ GPUs on a standard consumer CPU motherboard (24-32 lanes depending on the CPU). Apparently ML workloads require more interconnect bandwidth when doing parallel compute, so each card in this demo system uses 16 lanes, and therefore requires 1.) full size slots, and 2.) epyc[0] or xeon based systems with 128 lanes (or at least greater than 32 lanes).

per 1 above crypto "boards" have lots of x1 (or x4) slots, the really short PCIe slots. You then use a riser that uses USB3 cables to go to a full size slot on a small board, with power connectors on it. If your board only has x8 or x16 slots (the full size slot) you can buy a breakout PCIe board that splits that into four slots, using 4 USB-3 cables, again, to boards with full size slots and power connectors. These are different than the PCIe riser boards you can buy for use with cases that allow the GPUs to be placed vertically rather than horizontally, as those have full x16 "fabric" that interconnect between the riser and the board with the x16 slot on them.

[0] i didn't read the article because i'm not planning on buying a threadripper (48-64+ lanes) or an epyc (96-128 lanes?) just to run AI workloads when i could just rent them for the kind of usage i do.

myself248 · 2024-04-12T19:11:45 1712949105

Oooo, got a link to one of these fabric boards? I've been playing with stupid PCIe tricks but that's a new one on me.

genewitch · 2024-04-12T20:28:34 1712953714

https://www.amazon.com/gp/product/B07DMNJ6QM/

i used to use this one when i had all (three of my) nvme -> 4x sata boardlets and therefore could not fit a GPU in a PCIe slot due to the cabling mess.

myself248 · 2024-04-12T22:52:30 1712962350

Oh, um, just a flexible riser.

I thought we were using "fabric" to mean "switching matrix".

wmf · 2024-04-12T15:55:58 1712937358

That's what this thread is about. Geohot is doing that.

wtallis · 2024-04-12T18:29:07 1712946547

Crypto mining could make use of lots of GPUs in a single cheap system precisely because it did not need any significant PCIe bandwidth, and would not have benefited at all from p2p DMA. Anything that does benefit from using p2p DMA is unsuitable for running with just one PCIe lane per GPU.

numpad0 · 2024-04-12T19:52:38 1712951558

PCIe P2P still has to go up to a central hub thing and back because PCIe is not a bus. That central hub thing is made by very few players(most famously PLX Technologies) and it costs a lot.

wtallis · 2024-04-12T20:09:46 1712952586

PCIe p2p transactions that end up routed through the CPU's PCIe root complex still have performance advantages over split transactions using the CPU's DRAM as an intermediate buffer. Separate PCIe switches are not necessary except when the CPU doesn't support routing p2p transactions, which IIRC was not a problem on anything more mainstream than IBM POWER.

numpad0 · 2024-04-13T08:53:09 1712998389

Maybe not strictly necessary, but a separate PCIe backplane just for P2P bandwidth bypasses topology and bottleneck mess[1][2] of PC platform altogether and might be useful. I suspect this was the original premise for NVLink too.

1: https://assets.hardwarezone.com/img/2023/09/pre-meteror-lake...

2: https://www.gigabyte.com/FileUpload/Global/MicroSite/579/inn...

sparky_ · 2024-04-12T14:54:01 1712933641

I take it this is mostly useful for compute workloads, neural networks, LLM and the like -- not for actual graphics rendering?

CYR1X · 2024-04-12T15:28:21 1712935701

jmalicki · 2024-04-12T19:18:43 1712949523

For very large models, the weights may not fit on one GPU.

Also, sometimes having more than one GPU enables larger batch sizes if each GPU can only hold the activations for perhaps one or two training examples.

There is definitely a performance hit, but GPU<->GPU peer is less latency than GPU->CPU->software context switch->GPU.

For "normal" pytorch training, the training is generally streamed through the GPU. The model does a batch training step on one batch while the next one is being loaded, and the transfer time is usually less than than the time it takes to do the forward and backward passes through the batch.

For multi-GPU there are various data parallel and model parallel topologies of how to sort it, and there are ways of mitigating latency by interleaving some operations to not take the full hit, but multi-GPU training is definitely not perfectly parallel. It is almost required for some large models, and sometimes having a mildly larger batch helps training convergence speed enough to overcome the latency hit on each batch.

zamadatix · 2024-04-12T12:56:56 1712926616

Peer to peer as in one pcie slot directly to another without going through the CPU/RAM, not peer to peer as in one PC to another over the network port.

brrrrrm · 2024-04-12T12:27:33 1712924853

Yea. It’s one less hop through slow memory

publicmail · 2024-04-13T03:23:15 1712978595

PCIe busses are like a tree with “hubs” (really switches).

Imagine you have a PC with a PCIe x16 interface which is attached to a PCIe switch that has four x16 downstream ports, each attached to a GPU. Those GPUs are capable of moving data in and out of their PCIe interfaces at full speed.

If you wanted to transfer data from GPU0 and 1 to GPU2 and 3, you have basically 2 options:

- Have GPU0 and 1 move their data to CPU DRAM, then have GPU2 and 3 fetch it

- Have GPU0 and 1 write their data directly to GPU2 and 3 through the switch they’re connected to without ever going up to the CPU at all

In this case, option 2 is better both because it avoids the extra copy to CPU DRAM and also because it avoids the bottleneck of two GPUs trying to push x16 worth of data up through the CPUs single x16 port. This is known as peer to peer.

There are some other scenarios where the data still must go up to the CPU port and back due to ACS, and this is still technically P2P, but doesn’t avoid the bottleneck like routing through the switch would.

whereismyacc · 2024-04-12T12:41:00 1712925660

this would be directly over the memory bus right? I think it's just always going to be faster like this if you can do it?

toast0 · 2024-04-12T19:41:49 1712950909

There's not really any busses in modern computers. It's all point to point messaging. You can think of a computer as a distributed system in a way.

PCI has a shared address space which usually includes system memory (memory mapped i/o). There's a second, smaller shared address space dedicated to i/o, mostly used to retain compatability with PC standards developed by the ancients.

But yeah, I'd expect to typically have better throughput and latency with peer to peer communication than peer to system ram to peer. Depending on details, it might not always be better though, distributed systems are complex, and sometimes adding a seperate buffer between peers can help things greatly.

fulafel · 2024-04-14T05:25:46 1713072346

Yes, networking is similarly pointless.

haunter · 2024-04-12T12:13:23 1712924003

Shared memory access for Nvidia GPUs

https://developer.nvidia.com/gpudirect

CamperBob2 · 2024-04-12T15:17:42 1712935062

The correct term, and the one most people would have used in the past, is "bus mastering."

wmf · 2024-04-12T16:00:38 1712937638

PCIe isn't a bus and it doesn't really have a concept of mastering. All PCI DMA was based on bus mastering but P2P DMA is trickier than normal DMA.

publicmail · 2024-04-13T03:09:17 1712977757

I consider it bus mastering when the endpoints initiate the transactions

amelius · 2024-04-12T18:34:22 1712946862

Stupid terminology. Might as well call an RS-232 link "peer to peer".

userbinator · 2024-04-12T13:48:12 1712929692

I wish more hardware companies would publish more documentation and let the community figure out the rest, sort of like what happened to the original IBM VGA (look up "Mode X" and the other non-BIOS modes the hardware is actually capable of - even 800x600x16!) Sadly it seems the majority of them would rather tightly control every aspect of their products' usage since they can then milk the userbase for more $$$, but IMHO the most productive era of the PC was also when it was the most open.

rplnt · 2024-04-12T14:25:21 1712931921

Then they couldn't charge different customers different amounts for the same HW. It's not a win for everyone.

axus · 2024-04-12T14:32:15 1712932335

The price of 4090 may increase now, in theory locking out some features might have been a favor for some of the customers.

Sayrus · 2024-04-13T13:54:34 1713016474

But it wouldn't if all cards supporting this were "unlocked" by default and thus the other "enterprise-grade" cards weren't that much more expensive. Of course that'd reduce profits by a lot.

paulmd · 2024-04-14T18:21:57 1713118917

it probably would - you saw exactly that outcome with mining.

for a lot of these demand bursts, demand is so high it cannot be sated even consuming 100% or 200% of typical GPU production.

cards like RX 6500XT that simply don't have the RAM to participate were less affected, but even then you've got enough cross-elasticity (demand from people being crowded out of other product segments) that tends to pump prices to 2-3x the "normal" clearance prices we see today. And yes, absolutely anything that can mine in any capacity will get pulled in during that sort of boom/bubble, not just "high-end"/"enterprise".

greggsy · 2024-04-12T23:16:45 1712963805

Which (as controversial as it sounds in this kind of forum) is a sensible pricing model to recover and fund R&D and finance operations.

golergka · 2024-04-12T18:41:17 1712947277

If I'm a hardware manufacturer and my soft lock on product feature doesn't work, I'll switch to a hardware lock instead, and the product will just cost more.

matheusmoreira · 2024-04-14T16:37:45 1713112665

> the most productive era of the PC was also when it was the most open

The openness certainly was great but it's not actually required. People can figure out how to work with closed systems. Adversarial interoperability was common. People would reverse engineer things and make the software work whether or not the manufacturer wanted it.

It's the software and hardware locks that used to be rare and are now common. Cryptography was supposed to be something that would empower us but it ended up being used against us to lock us out of our own machines. We're no longer in the driver's seat. Our operating systems don't even operate the system anymore. Our free Linux systems are just the "user OS" in the manufacturer's unknowable amalgamation of silicon running proprietary firmware, just a little component to be sandboxed away from the real action.

mhh__ · 2024-04-12T16:05:56 1712937956

nvidia's software is their moat

thot_experiment · 2024-04-12T22:48:59 1712962139

That's a huge overstatement, it's a big part of the moat for sure, but there are other significant components (hardware, ecosystem lock-in, heavy academic incentives)

mhh__ · 2024-04-13T23:11:00 1713049860

No software -> hardware is massively hobbled. Evidence: AMD.

Ecosystem -> Software. At the moment especially people are looking for arbitrages everywhere i.e. inference costs / being able to inference at all (llama.cpp)

Academics -> Also software but easily fiddled with a bit of spending as you say.

No1 · 2024-04-12T15:14:31 1712934871

The original justification that Nvidia gave for removing Nvlink from the consumer grade lineup was that PCIe 5 would be fast enough. They then went on to release the 40xx series without PCIe 5 and P2P support. Good to see at least half of the equation being completed for them, but I can’t imagine they’ll allow this in the next gen firmware.

HPsquared · 2024-04-12T10:32:57 1712917977

Is this one of those features that's disabled on consumer cards for market segmentation?

mvkel · 2024-04-12T14:16:56 1712931416

Sort of.

An imperfect analogy: a small neighborhood of ~15 houses is under construction. Normally it might have a 200kva transformer sitting at the corner, which provides appropriate power from the grid.

But there is a transformer shortage, so the contractor installs a commercial grade 1250kva transformer. It can power many more houses than required, so it's operating way under capacity.

One day, a resident decides he wants to start a massive grow farm, and figures out how to activate that extra transformer capacity just for his house. That "activation" is what geohot found

bogwog · 2024-04-12T15:01:12 1712934072

That's a poor analogy. The feature is built in to the cards that consumers bought, but Nvidia is disabling it via software. That's why a hacked driver can enable it again. The resident in your analogy is just freeloading off the contractor's transformer.

Nvidia does this so that customers that need that feature are forced to buy more expensive systems instead of building a solution with the cheaper "consumer-grade" cards targeted at gamers and enthusiasts.

bpye · 2024-04-12T20:00:20 1712952020

This isn’t even the first time a hacked driver has been used to unlock some HW feature - https://github.com/DualCoder/vgpu_unlock

captcanuk · 2024-04-12T22:47:18 1712962038

There was also this https://hackaday.com/2013/03/18/hack-removes-firmware-crippl... using resistors and a different one before that used a graphene lead pencil to enable functionality.

segfaultbuserr · 2024-04-12T15:08:16 1712934496

Except that in the computer hardware world, the 1250 kVA transformer was used not because of shortage, but because of the fact that making a 1250 kVA transformer on the existing production line and selling it as 200 kVA, is cheaper than creating a new production line separately for making 200 kVA transformers.

hatthew · 2024-04-12T20:18:19 1712953099

And then because this residential neighborhood now has commercial grade power, the other lots that were going to have residential houses built on them instead get combined into a factory, and the people who want to buy new houses in town have to pay more since residential supply was cut in half.

zten · 2024-04-14T19:39:51 1713123591

This represents pretty well how gamers (residential buyers) are going to feel when the next generation of consumer cards are scooped up for AI.

HPsquared · 2024-04-12T21:11:18 1712956278

Excellent analogy of the other side of this issue.

cesarb · 2024-04-12T21:06:24 1712955984

That's a bad analogy, because in your example, the consumer is using more of a shared resource (the available transformer, wiring, and generation capacity). In the case of the driver for a local GPU card, there's no sharing.

A better example would be one in which the consumer has a dedicated transformer. For instance, a small commercial building which directly receives 3-phase 13.8 kV power; these are very common around here, and these buildings have their own individual transformers to lower the voltage to 3-phase 127V/220V.

m3kw9 · 2024-04-12T16:37:49 1712939869

Where is the hack in this analogy

pixl97 · 2024-04-12T17:30:44 1712943044

Taking off the users panel on the side of their house and flipping it to 'lots of power' when that option had previously been covered up by the panel interface.

cesarb · 2024-04-12T21:21:49 1712956909

Except that this "lots of power" option does not exist. What limits the amount of power used is the circuit breakers and fuses on the panel, which protect the wiring against overheating by tripping when too much power is being used (or when there's a short circuit). The resident in this analogy would need to ensure that not only the transformer, but also the wiring leading to the transformer, can handle the higher current, and replace the circuit breaker or fuses.

And then everyone on that neighborhood would still lose power, because there's also a set of fuses upstream of the transformer, and they would be sized for the correct current limit even when the transformer is oversized. These fuses also protect the wiring upstream of the transformer, and their sizing and timings is coordinated with fuses or breakers even further upstream so that any fault is cleared by the protective device closest to the fault.

stavros · 2024-04-12T23:39:11 1712965151

There are analogies, and then there's this.

Dylan16807 · 2024-04-13T02:31:27 1712975487

They're pointing out how the analogy doesn't work, so it's fine.

Nobody's taking more than their share of any resources when they enable this feature.

rustcleaner · 2024-04-12T15:45:37 1712936737

I am sure many will disagree-vote me, but I want to see this practice in consumer devices either banned or very heavily taxed.

xandrius · 2024-04-12T18:38:04 1712947084

You're right. Especially because you didn't present your reasons.

wmf · 2024-04-12T23:20:51 1712964051

Of course power users want an end to price discrimination because it benefits them... at a cost of more expensive products for the masses.

yogorenapan · 2024-04-12T19:01:57 1712948517

Curious as to your reasoning,

imtringued · 2024-04-13T09:24:55 1713000295

Well, they have zero incentives to implement and test this feature for consumer GPUs. Multi GPU setups never really worked that well for gaming.

ivanjermakov · 2024-04-12T13:09:56 1712927396

I was always fascinated by George Hotz's hacking abilities. Inspired me a lot for my personal projects.

jgpc · 2024-04-12T14:34:18 1712932458

I agree. It is fascinating. When you observe his development process (btw, it is worth noting his generosity in sharing it like he does) he gets frequently stuck on random shallow problems which a perhaps more knowledgable engineer would find less difficult. It is frequent to see him writing really bad code, or even wrong code. The whole twitter chapter is a good example. Yet, himself, alone just iterating resiliently, just as frequently creates remarkable improvements. A good example to learn from. Thank you geohot.

zoogeny · 2024-04-12T16:25:54 1712939154

This matches my own take. I've tuned into a few of his streams and watched VODs on YouTube. I am consistently underwhelmed by his actual engineering abilities. He is that particular kind of engineer that constantly shits on other peoples code or on the general state of programming yet his actual code is often horrendous. He will literally call someone out for some code in Tinygrad that he has trouble with and then he will go on a tangent to attempt to rewrite it. He will use the most blatant and terrible hacks only to find himself out of his depth and reverting back to the original version.

But his streams last 4 hours or more. And he just keeps grinding and grinding and grinding. What the man lacks in raw intellectual power he makes up for (and more) in persistence and resilience. As long as he is making even the tiniest progress he just doesn't give up until he forces the computer to do whatever it is he wants it to do. He also has no boundaries on where his investigations take him. Driver code, OS code, platform code, framework code, etc.

I definitely couldn't work with him (or work for him) since I cannot stand people who degrade the work of others while themselves turning in sub-par work as if their own shit didn't stink. But I begrudgingly admire his tenacity, his single minded focus, and the results that his belligerent approach help him to obtain.

ctrw · 2024-04-13T04:58:12 1712984292

There are developers who have breadth and developers who have depth. He is very much on the breadth end of the spectrum. It isn't lack of intelligence but lack of deep knowledge of esoteric fields you will use once a decade.

That said I find it a bit astonishing how little Ai he uses on his streams. I convert all the documentation I need into a rag system that I query stupid questions against.

spirobelv2 · 2024-04-13T11:26:39 1713007599

link your github. want to see your raw intellectual power

zoogeny · 2024-04-15T17:02:47 1713200567

I know what I said about lacking raw intellectual power probably feels like a personal attack rather than a description. However, that comment is in comparison to guys like Peter Norvig or Donald Knuth, not random Hacker News mid-wits like myself.

I had a younger cousin who wanted to start a career in software engineering. He asked me, assuming my years of experience had some merit, what programming languages to learn, what code editor to use, what platforms and frameworks to study. I told him the most important thing he could do is to be persistent. The computer will constantly humble you. Your coworkers will constantly try to rail-road you into solutions that are sub-optimal. You have to be resilient and keep going no matter what, you can't ever give up.

I think it is fair to say that George excels at what I consider to be the most important aspect of programming. And if he could manage to stop disparaging others in his streams, suggesting that everyone else is stupid and that the code they write is rotten, I could very easily look over the fact that he is frequently careless and hasty.

yazzku · 2024-04-13T16:42:28 1713026548

It's over 9000.

gorkish · 2024-04-13T03:52:22 1712980342

If a stopped clock is right twice a day, relentlessly winding a clock forward will make it right quite frequently. That is geohot.

vrnvu · 2024-04-12T13:12:34 1712927554

I agree, I feel so inspired with his streams. Focus and hard work, the key to good results. Add a clear vision and strategy, and you can also accomplish “success”.

Congratulations to him and all the tinygrad/comma contributors.

sambull · 2024-04-12T13:18:30 1712927910

He's got that focus like a military pilot on a long flight.

postalrat · 2024-04-12T13:45:00 1712929500

Any time I open guys steam half of it is some sort of politics

CYR1X · 2024-04-12T15:29:40 1712935780

You can blame chat for that lol

gaws · 2024-04-13T16:08:03 1713024483

He should ban chat and focus on development. Leave the political talk to the kids in their respective Discord servers.

Jerrrry · 2024-04-12T14:04:00 1712930640

His Xbox360 laptop was the crux of teenage-motivation, for me.

llm_trw · 2024-04-12T10:34:28 1712918068

Skimming the readme this is p2p over PCIe and not NVLink in case anyone was wondering.

formerly_proven · 2024-04-12T11:15:29 1712920529

RTX 40 doesn’t have NVLink on the PCBs, though the silicon has to have it, since some sibling cards support it. I’d expect it to be fused off.

llm_trw · 2024-04-12T12:59:47 1712926787

A cursory google search suggests that it's been removed at the silicon level.

steeve · 2024-04-12T13:20:05 1712928005

Some do: https://wccftech.com/gigabyte-geforce-rtx-4090-pcb-shows-lef...

jsheard · 2024-04-12T13:37:48 1712929068

I'm pretty sure that's just a remnant of a 3090 PCB design that was adapted into a 4090 PCB design by the vendor. None of the cards based on the AD102 chip have functional NVLink, not even the expensive A6000 Ada workstation card or the datacenter L40 accelerator, so there's no reason to think NVLink is present on the silicon anymore below the flagship GA100/GH100 chips.

HeatrayEnjoyer · 2024-04-12T11:50:24 1712922624

How to unfuse it?

magicalhippo · 2024-04-12T12:25:51 1712924751

I don't know about this particular scenario, but typically fuses are small wires or resistors that are overloaded so they irreversibly break the connection. Hence the name.

Either done during manufacture or as a one-time programming[1][2].

Though sometimes reprogrammable configuration bits are sometimes also called fuse bits. The Atmega328P of Arduino fame uses flash[3] for its "fuses".

[1]: https://www.nxp.com/docs/en/application-note/AN4536.pdf

[2] https://www.intel.com/programmable/technical-pdfs/654254.pdf

[3]: https://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-7810-...

HeatrayEnjoyer · 2024-04-12T13:10:15 1712927415

Wires, flash, and resistors can be replaced

mschuster91 · 2024-04-12T13:15:08 1712927708

Not at the scale we're talking about here. These structures are very thin, far thinner than bond wires which is about the largest structure size you can handle without a very, very specialized lab. And you'd need to unsolder the chip, de-cap it, hope the fuse wire you're trying to override is at the top layer, and that you can re-cap the chip afterwards and successfully solder it back on again.

This may be workable for a nation state or a billion dollar megacorp, but not for your average hobbyist hacker.

z33k · 2024-04-12T13:43:00 1712929380

You’re absolutely right. In fact, some billion dollar megacorps use fuses as a part of hardware DRM for this reason.

magicalhippo · 2024-04-12T14:56:07 1712933767

These are part of the chip, thus microscopic and very inaccessible.

There are some good images here[1] of various such fuses, both pristine and blown. Here's[2] a more detailed writeup examining one type.

It's not something you fix with a soldering iron.

[1]: https://semiengineering.com/the-benefits-of-antifuse-otp/

[2]: https://www.eetimes.com/a-look-at-metal-efuses/

metadat · 2024-04-12T15:25:24 1712935524

I miss the days when you could do things like connecting the L5 bridges on the surface of the AMD Athlon XP Palomino [0] CPU packaging with a silver trace pen to transform them into fancier SMP multi-socket capable Athlon MPs, e.g. Barton [1].

https://arstechnica.com/civis/threads/how-did-you-unlock-you...

Some folks even got this working with only a pencil, haha.

Nowadays, silicon designers have found highly effective ways to close off these hacking avenues, with techniques, such as the microscopic, nearly invisible, and as parent post mentions, totally inaccessible e-fuses.

[0] https://upload.wikimedia.org/wikipedia/commons/7/7c/KL_AMD_A...

[1] https://en.wikichip.org/w/images/a/af/Atlhon_MP_%28.13_micro...

aceazzameen · 2024-04-13T09:03:20 1712999000

I'm one of those folks that did it with a pencil. Haha. Maybe I was lucky? That was my first overclock and it ran pretty well.

mepian · 2024-04-12T13:36:53 1712929013

Use a Focused Ion Beam instrument.

klohto · 2024-04-12T10:43:54 1712918634

afaik 4090 doesn’t support 5.0 so you are limited to 4.0 speeds. Still an improvement.

jsheard · 2024-04-12T10:32:28 1712917948

It'll be nice while it lasts, until they start locking this down in the firmware instead on future architectures.

mnau · 2024-04-12T15:00:23 1712934023

Sure, but that was something that was always going to happen.

So it's better to have it at least for one generation instead of no generation.

jagrsw · 2024-04-12T10:29:21 1712917761

Was it George himself, or a person working for a bounty that was set up by tinycorp?

Also, a question for those knowledgeable about the PCI subsys: it looked like something NVIDIA didn't care about, rather than something they actively wanted to prevent, no?

toast0 · 2024-04-12T15:21:26 1712935286

PCI devices have always been able to read and write to the shared address space (subject to IOMMU); most frequently used for DMA to system RAM, but not limited to it.

So, poking around to configure the device to put the whole VRAM in the address space is reasonable, subject to support for resizable BAR or just having a fixed size large enough BAR. And telling one card to read/write from an address that happens to be mapped to a different card's VRAM is also reasonable.

I'd be interested to know if PCI-e switching capacity will be a bottleneck, or if it'll just be the point to point links and VRAM that bottlenecks. Saving a bounce through system RAM should help in either case though.

namibj · 2024-04-12T19:56:21 1712951781

Fixed large bar exists in some older accelerator cards like e.g. iirc the MI50/MI60 from AMD (the data center variant of the Radeon Vega VII, the first GPU with PCIe 4.0, also famous for dominating memory bandwidth until the RTX 40-series took that claim back. It had 16GB of HBM delivering 1TB/s memory bandwidth).

It's notably not compatible with some legacy boot processes and iirc also just 32bit kernels in general, so consumer cards had to wait for resizable BAR to get the benefits of large BAR (that being notably direct flat memory mapping of VRAM so CPUs and PCIe peers can directly read and write into all of VRAM, without dancing through a command interface with doorbell registers. AFAIK it allows a GPU to talk directly to NICs and NVMe drives by running the driver in GPU code (I'm not sure how/if they let you properly interact with doorbell registers, but polled io_uring as an ABI would be no problem (I wouldn't be surprised if some NIC firmware already allows offloading this).

mtlynch · 2024-04-12T10:42:07 1712918527

Commits are by geohot, so it looks like George himself.

throw101010 · 2024-04-12T12:59:21 1712926761

I've seen him work on tinygrad on his Twitch livestream couple times, so more than likely him indeed.

squarra · 2024-04-12T12:54:15 1712926455

He also documented his progress on the tinygrad discord

throwaway8481 · 2024-04-12T19:00:36 1712948436

I feel like I should say something about discord not being a suitable replacement for a forum or bugtracker.

guywhocodes · 2024-04-12T20:21:20 1712953280

We are talking about a literal monologue while poking at a driver for a few hours, this wasn't a huge project.

rfoo · 2024-04-12T11:54:32 1712922872

Glad to see that geohot is back being geohot, first by dropping a local DoS for AMD cards, then this. Much more interesting :p

jaimehrubiks · 2024-04-12T14:09:02 1712930942

Is this the same guy that hacked the PS3?

dji4321234 · 2024-04-12T16:24:38 1712939078

He has a very checkered history with "hacking" things.

He tends to build heavily on the work of others, then use it to shamelessly self-promote, often to the massive detriment of the original authors. His PS3 work was based almost completely on a presentation given by fail0verflow at CCC. His subsequent self-promotion grandstanding world tour led to Sony suing both him and fail0verflow, an outcome they were specifically trying to avoid: https://news.ycombinator.com/item?id=25679907

In iPhone land, he decided to parade around a variety of leaked documentation, endangering the original sources and leading to a fragmentation in the early iPhone hacking scene, which he then again exploited to build on the work of others for his own self-promotion: https://news.ycombinator.com/item?id=39667273

There's no denying that geohotz is a skilled reverse engineer, but it's always bothersome to see him put onto a pedestal in this way.

pixelpoet · 2024-04-12T17:03:29 1712941409

There was also that CheapEth crypto scam he tried to pull off.

samtheprogram · 2024-04-12T20:16:09 1712952969

To me that was obvious satire of the crypto scene.

ansible · 2024-04-12T21:57:37 1712959057

I don't think people can tell what is satire or not in the crypto scene anymore. Someone issue a "rug pull token" and still received 8.8 ETH (approx $29K USD), while telling people it was a scam.

https://www.web3isgoinggreat.com/?id=rug-pull-token

pixelpoet · 2024-04-12T21:55:17 1712958917

Ah yes, nothing like a bit of hypocrisy to make a point. It's okay though, as long as it's people we don't agree with, defrauding them is fine.

samtheprogram · 2024-04-13T02:11:21 1712974281

The website literally stated it was not for speculation, they didn't want the price to go up, and there were multiple ways to get some for free.

If people were reckless, greedy, and/or lazy because of the crypto hype and got "defrauded" without doing any amount of due diligence -- that's kinda the point.

georgehotz · 2024-04-13T05:17:07 1712985427

I actually lost about $5k on cheapETH running servers. Nobody was "defrauded", I think these people don't understand how forks work. It's a precursor to the modern L2 stuff, I did this while writing the first version of Optimism's fraud prover. https://github.com/ethereum-optimism/cannon

I suspect most of the people who bring this up don't like me for other reasons, but with this they think they have something to latch on to. Doesn't matter that it isn't true and there wasn't a scam, they aren't going to look into it since it agrees with their narrative.

delfinom · 2024-04-12T16:40:23 1712940023

[flagged]

StressedDev · 2024-04-12T21:09:42 1712956182

Who is melon?

aeyes · 2024-04-13T01:24:54 1712971494

elon musk?

mikepurvis · 2024-04-12T15:00:32 1712934032

Yes, but he spent several years in self-driving cars (https://comma.ai), which while interesting is also a space that a lot of players are in, so it's not the same as seeing him back to doing stuff that's a little more out there, especially as pertains to IP.

nolongerthere · 2024-04-12T15:25:09 1712935509

Did he abandon this effort? That would be pretty sad bec he was approaching the problem from a very different perspective.

Topgamer7 · 2024-04-12T15:30:50 1712935850

He stepped down from it. https://geohot.github.io//blog/jekyll/update/2022/10/29/the-...

cjbprime · 2024-04-12T16:12:29 1712938349

It's still a company, still making and selling products, and I think he's still pretty heavily involved in it.