
The problem is that performance achievements on AMD consumer-grade GPUs (RX 7900 XTX) are not representative of, or transferable to, datacenter-grade GPUs (MI300X). Consumer GPUs are based on the RDNA architecture, while datacenter GPUs are based on the CDNA architecture, and AMD is not expected to release the unifying UDNA architecture until sometime around 2026 [1]. At CentML we are currently working on integrating AMD CDNA and HIP support into our Hidet deep learning compiler [2], which will also power inference workloads for Nvidia GPUs, AMD GPUs, Google TPUs and AWS Inf2 chips on our platform [3].

[1] https://www.jonpeddie.com/news/amd-to-integrate-cdna-and-rdn.... [2] https://centml.ai/hidet/ [3] https://centml.ai/platform/


The problem is that the specs of AMD consumer-grade GPUs do not translate into compute performance when you try to chain more than one together.

I have 7 Nvidia 4090s under my desk happily chugging along on week-long training runs. I once managed to get a Radeon VII to run for six hours without shitting itself.


> I have 7 NVidia 4090s under my desk

I have 6 Radeon Pro VII under my desk (in a single system BTW), and they run hard for weeks until I choose to reboot e.g. for Linux kernel updates.

I bought them "new old stock" for $300 apiece. So that's $1800 for all six.


How does the compute performance compare to 4090s for these workloads?

(I realise it will be significantly lower, just trying to get as much of a comparison as possible.)


The Radeon VII is special compared to most older (and current) affordable GPUs in that it used HBM, giving it memory bandwidth comparable to modern cards (~1 TB/s), and it has a reasonable FP64 ratio (1:4) instead of (1:64). So this card can still be pretty interesting for running memory-bandwidth-intensive FP64 workloads. Anything affordable released afterward by either AMD or Nvidia crippled FP64 throughput to below what an AVX-512 many-core CPU can realistically do.


If we are talking about FP64, are your workloads more like fluid dynamics than ML training?


The 4090 offers 82.58 teraflops of single-precision performance compared to the Radeon Pro VII's 13.06 teraflops.


On the other hand, for double precision a Radeon Pro VII is many times faster than an RTX 4090 (due to its 1:2 vs. 1:64 FP64:FP32 ratio).

Moreover, for workloads limited by memory bandwidth, a Radeon Pro VII and an RTX 4090 will run at about the same speed, regardless of what kind of computations are performed. Being limited by memory bandwidth is said to happen frequently in ML/AI inference.
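
A rough back-of-the-envelope sketch of both points, using the FP32 figures quoted in this thread; the ~1 TB/s bandwidth and the 7B FP16 model are illustrative assumptions, not measurements:

    # FP64 throughput derived from the FP32 specs and the FP64:FP32 ratios
    radeon_pro_vii_fp64 = 13.06 / 2   # 1:2 ratio  -> ~6.5 TFLOPS
    rtx_4090_fp64 = 82.58 / 64        # 1:64 ratio -> ~1.3 TFLOPS

    # Bandwidth-bound inference: tokens/s ~= bandwidth / bytes read per token,
    # assuming the weights of a 7B FP16 model are re-read once per token
    bandwidth_bytes_per_s = 1.0e12    # both cards sit at roughly ~1 TB/s
    bytes_per_token = 7e9 * 2         # 7B params * 2 bytes (FP16)
    tokens_per_s = bandwidth_bytes_per_s / bytes_per_token   # ~70 tokens/s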


Double precision is not used in either inference or training as far as I know.


Even single precision, which the previous poster quoted, is seldom used for inference or training.

Because the previous poster had mentioned only single precision, where the RTX 4090 is better, I had to complete the picture with double precision, where the RTX 4090 is worse, and memory bandwidth, where the RTX 4090 is about the same; otherwise people might believe that progress in GPUs over 5 years has been much greater than it really is.

Moreover, memory bandwidth is very relevant for inference, much more relevant than FP32 throughput.


For people wondering, the FP64 numbers:

Titan V: 7.8 TFLOPs

AMD Radeon Pro VII: 6.5 TFLOPs

AMD Radeon VII: 3.52 TFLOPs

4090: 1.3 TFLOPs


For inference sure, for training: no.


Are you running ML workloads or solving differential equations?

The two are rather different and one market is worth trillions, the other isn't.


I think there is some money to be made in machine learning too.


Wow, are these 7 RTX 4090s in a single setup? Care to share more about how you built it (case, cooling, power, ...)?


Most of these are just an EPYC server platform, some cursed risers and multiple PSUs (though cryptominer server PSU adapters are probably better). See https://nonint.com/2022/05/30/my-deep-learning-rig/ and https://www.mov-axbx.com/wopr/wopr_concept.html.


Looks like a fire hazard :)


WOPR read is the best IMO.


You might find the journey of Tinycorp's Tinybox interesting: it's a machine with 6 to 8 4090 GPUs, and you should be able to track down a lot of their hardware choices, including pictures, on their Twitter and in George's livestreams.


EPYC + Supermicro + C-Payne retimers/cabling. 208-240V power is typically mandatory for the most affordable power supplies (chain a server/crypto PSU from ParallelMiner for the GPUs to an ATX PSU for general use).

Beyond that, not much else.


Basically this but with an extra card on the x8 slot for connecting my monitors: https://www.youtube.com/watch?v=C548PLVwjHA

There's a bunch of similar setups and there are a couple of dozen people that have done something similar on /r/localllama.


I'd like to know too


How do you manage heat? I'm looking at a hashcat build with a few 5090s, and water cooling seems to be the sensible solution if we scale beyond two cards.


What motherboard are you using to have space and ports for 7 of them?


The ASRock Rack ROMED8-2T has seven PCIe x16 slots. They're too close together to directly put seven 4090s on the board, but you'd just need some riser cables to mount the cards on a frame.


What software stack do you use for training?


It looks like AMD's CDNA GPUs are supported by Mesa, which ought to suffice for Vulkan Compute and SYCL support. So there should be ways to run ML workloads on the hardware without going through HIP/ROCm.


Love both games, spent countless hours on them. Unreal's single-player story was great (for its time). Are there any servers still online for UT?


I recommend the hidet backend in torch.compile - it implements many advanced model-specific optimizations automatically. https://github.com/hidet-org/hidet
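
A minimal usage sketch, assuming hidet is installed (pip install hidet) alongside a CUDA build of PyTorch; the toy model and input here are just placeholders:

    import torch
    import hidet  # makes the 'hidet' backend available to torch.compile

    model = torch.nn.Linear(1024, 1024).cuda().eval()
    x = torch.randn(8, 1024, device='cuda')

    # compile with the hidet backend instead of the default inductor
    model_opt = torch.compile(model, backend='hidet')
    with torch.no_grad():
        y = model_opt(x)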


Oh, this looks great, thank you for bringing it up! I'll have to give it a try, but it seems like the FSDP limitation on torch.compile might carry over?


   that feeling you get when you realize the person who lived ~2300 years before you is smarter than you now...


Keyword - survived.

Everything must become better.


Is this watch better though? Or is it just some gadget to be thrown away in a few years?


I would gather Google gives up on this branch of hardware within 3 years. You're better off getting an Apple Watch SE.


I strongly disagree with your comment. First of all, ALL kids must survive, not most. Safety and the safety of knowing are two different things. Lastly, this tech looks a lot less intrusive than the other watch that everyone is currently getting for their kids; this one appears to have more activity-engaging features.


> First of all, ALL kids must survive, not most.

I don't think anyone would really want to live in a world where we've done what's necessary such that literally all kids survive the various accidents and perils they might face out in the world. Such a world would be sanitized into oblivion.

This is just a variation on the "security vs. freedom" stuff. You can have perfect security if you don't allow for any freedom. But hopefully we can agree that a world with no freedom isn't one we want to live in.

But sure, let's step back from the extreme that you introduced. Are the downsides of pervasive 24/7 tracking and surveillance worth the (possible and as-yet unproven) increase in good outcomes? I can see that many people here seem to think it is, but I don't agree.


Very interesting project and good progress on making private LLM use cases more accessible and usable. Please keep going!


The next iteration of this will be video generated on demand with GenAI running closer to the request, ideally at the point of request.


I think what will eventually emerge are LLM-architecture-specific ICs.


Seems quite likely that there are already several in the works.


Should I be worried? I just landed at SFO.


For what it's worth, the last major earthquake here killed 63 people, the significant majority of whom were on various freeways and bridges that collapsed. All infrastructure has undergone significant retrofitting since then.


Yes


No.


