
The problem is that performance achievements on AMD consumer-grade GPUs (RX 7900 XTX) are not representative of, or transferable to, datacenter-grade GPUs (MI300X). Consumer GPUs are based on the RDNA architecture, while datacenter GPUs are based on the CDNA architecture, and AMD is not expected to release the unifying UDNA architecture until sometime around 2026 [1]. At CentML we are currently working on integrating AMD CDNA and HIP support into our Hidet deep learning compiler [2], which will also power inference workloads for Nvidia GPUs, AMD GPUs, Google TPUs and AWS Inf2 chips on our platform [3].

[1] https://www.jonpeddie.com/news/amd-to-integrate-cdna-and-rdn.... [2] https://centml.ai/hidet/ [3] https://centml.ai/platform/


The problem is that the specs of AMD consumer-grade GPUs do not translate into compute performance when you try to chain more than one together.

I have 7 Nvidia 4090s under my desk happily chugging along on week-long training runs. I once managed to get a Radeon VII to run for six hours without shitting itself.


> I have 7 NVidia 4090s under my desk

I have 6 Radeon Pro VII under my desk (in a single system BTW), and they run hard for weeks until I choose to reboot e.g. for Linux kernel updates.

I bought them "new old stock" for $300 apiece. So that's $1800 for all six.


How does the compute performance compare to 4090s for these workloads?

(I realise it will be significantly lower, just trying to get as much of a comparison as possible.)


The Radeon VII is special compared to most older (and current) affordable GPUs in that it used HBM, giving it memory bandwidth comparable to modern cards (~1 TB/s), and it has a reasonable FP64 ratio (1:4) instead of (1:64). So this card can still be pretty interesting for running memory-bandwidth-intensive FP64 workloads. Anything affordable released afterward by either AMD or Nvidia crippled FP64 throughput to below what an AVX-512 many-core CPU can realistically do.


If we are talking about FP64, are your workloads more like fluid dynamics than ML training?


The 4090 offers 82.58 teraflops of single-precision performance compared to the Radeon Pro VII's 13.06 teraflops.


On the other hand, for double precision a Radeon Pro VII is many times faster than an RTX 4090 (due to its 1:2 vs. 1:64 FP64:FP32 ratio).

Moreover, for workloads limited by memory bandwidth, a Radeon Pro VII and an RTX 4090 will run at about the same speed, regardless of what kind of computations are performed. Being limited by memory bandwidth is said to happen frequently in ML/AI inference.
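
A rough back-of-the-envelope sketch of both points, using the FP32 figures quoted in this thread; the ~1 TB/s bandwidth and the 7B FP16 model are illustrative assumptions, not measurements:

    # FP64 throughput derived from the FP32 specs and the FP64:FP32 ratios
    radeon_pro_vii_fp64 = 13.06 / 2   # 1:2 ratio  -> ~6.5 TFLOPS
    rtx_4090_fp64 = 82.58 / 64        # 1:64 ratio -> ~1.3 TFLOPS

    # Bandwidth-bound inference: tokens/s ~= bandwidth / bytes read per token,
    # assuming the weights of a 7B FP16 model are re-read once per token
    bandwidth_bytes_per_s = 1.0e12    # both cards sit at roughly ~1 TB/s
    bytes_per_token = 7e9 * 2         # 7B params * 2 bytes (FP16)
    tokens_per_s = bandwidth_bytes_per_s / bytes_per_token   # ~70 tokens/s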


Double precision is not used in either inference or training as far as I know.


Even single precision, which the previous poster quoted, is seldom used for inference or training.

Because the previous poster had mentioned only single precision, where the RTX 4090 is better, I had to complete the picture with double precision, where the RTX 4090 is worse, and memory bandwidth, where the RTX 4090 is about the same; otherwise people might believe that progress in GPUs over 5 years has been much greater than it really is.

Moreover, memory bandwidth is very relevant for inference, much more relevant than FP32 throughput.


For people wondering, the FP64 numbers:

Titan V: 7.8 TFLOPs

AMD Radeon Pro VII: 6.5 TFLOPs

AMD Radeon VII: 3.52 TFLOPs

4090: 1.3 TFLOPs


For inference sure, for training: no.


Are you running ML workloads or solving differential equations?

The two are rather different and one market is worth trillions, the other isn't.


I think there is some money to be made in machine learning too.


Wow, are these 7 RTX 4090s in a single setup? Care to share more about how you built it (case, cooling, power, ...)?


Most of these are just an EPYC server platform, some cursed risers and multiple PSUs (though cryptominer server PSU adapters are probably better). See https://nonint.com/2022/05/30/my-deep-learning-rig/ and https://www.mov-axbx.com/wopr/wopr_concept.html.


Looks like a fire hazard :)


WOPR read is the best IMO.


You might find the journey of Tinycorp's Tinybox interesting: it's a machine with 6 to 8 4090 GPUs, and you should be able to track down a lot of their hardware choices, including pictures, on their Twitter and in George's livestreams.


EPYC + Supermicro + C-Payne retimers/cabling. 208-240V power is typically mandatory for the most affordable power supplies (chain a server/crypto PSU from ParallelMiner for the GPUs to an ATX PSU for general use).

Beyond that, not much else.


Basically this but with an extra card on the x8 slot for connecting my monitors: https://www.youtube.com/watch?v=C548PLVwjHA

There's a bunch of similar setups and there are a couple of dozen people that have done something similar on /r/localllama.


I'd like to know too


How do you manage heat? I'm looking at a hashcat build with a few 5090s, and water cooling seems to be the sensible solution if we scale beyond two cards.


What motherboard are you using to have space and ports for 7 of them?


The ASRock Rack ROMED8-2T has seven PCIe x16 slots. They're too close together to directly put seven 4090s on the board, but you'd just need some riser cables to mount the cards on a frame.


What software stack do you use for training?


It looks like AMD's CDNA GPUs are supported by Mesa, which ought to suffice for Vulkan Compute and SYCL support. So there should be ways to run ML workloads on the hardware without going through HIP/ROCm.


Love both games, spent countless hours on them. Unreal's single-player story was great (for its time). Are there any servers still online for UT?


I recommend the hidet backend in torch.compile - it implements many advanced model-specific optimizations automatically. https://github.com/hidet-org/hidet
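
A minimal usage sketch, assuming hidet is installed (pip install hidet) alongside a CUDA build of PyTorch; the toy model and input here are just placeholders:

    import torch
    import hidet  # makes the 'hidet' backend available to torch.compile

    model = torch.nn.Linear(1024, 1024).cuda().eval()
    x = torch.randn(8, 1024, device='cuda')

    # compile with the hidet backend instead of the default inductor
    model_opt = torch.compile(model, backend='hidet')
    with torch.no_grad():
        y = model_opt(x)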


Oh, this looks great, thank you for bringing it up! I'll have to give it a try, but it seems like the FSDP limitation on torch.compile might carry over?


   that feeling you get when you realize the person who lived ~2300 years before you is smarter than you now...


Keyword - survived.

Everything must become better.


Is this watch better though? Or is it just some gadget to be thrown away in a few years?


I would gather Google gives up on this branch of hardware within 3 years. You're better off getting an Apple Watch SE.


I strongly disagree with your comment. First of all, ALL kids must survive, not most. Safety and the safety of knowing are two different things. Lastly, this tech looks a lot less intrusive than the other watch that everyone is currently getting for their kids; this one appears to have more activity-engaging features.


> First of all, ALL kids must survive, not most.

I don't think anyone would really want to live in a world where we've done what's necessary such that literally all kids survive the various accidents and perils they might face out in the world. Such a world would be sanitized into oblivion.

This is just a variation on the "security vs. freedom" stuff. You can have perfect security if you don't allow for any freedom. But hopefully we can agree that a world with no freedom isn't one we want to live in.

But sure, let's step back from the extreme that you introduced. Are the downsides of pervasive 24/7 tracking and surveillance worth the (possible and as-yet unproven) increase in good outcomes? I can see that many people here seem to think it is, but I don't agree.


Very interesting project and good progress on making private LLM use cases more accessible and usable. Please keep going!


The next iteration of this will be video generated on demand with GenAI running closer to the request, ideally at the point of request.


I think what will eventually emerge are LLM-architecture-specific ICs.


Seems quite likely that there are already several in the works.


Should I be worried? I just landed at SFO.


For what it's worth, the last major earthquake here killed 63 people, the significant majority of whom were on various freeways and bridges that collapsed. All infrastructure has undergone significant retrofitting since then.


Yes


No.


