AMD may get across the CUDA moat (hpcwire.com)
551 points by danzheng on Oct 6, 2023 | 302 comments



I was able to use ROCm recently with PyTorch, and after pulling some hair out it worked quite well. The Radeon GPU I had on hand was a bit old and underpowered (RDNA2) and it only supported matmul in fp64, but for the job I needed done I saw a 200x increase in it/s over the CPU despite the need to cast everywhere, and that made me super happy.

Best of all is that I simply set the device to `torch.device('cuda')` rather than OpenCL, which does wonders for compatibility and keeps the code simple.
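
For anyone who wants to see what that looks like, here's a minimal sketch (assuming a ROCm build of PyTorch; the fp64 dtype just mirrors the limitation above):

    import torch

    # On a ROCm build of PyTorch the AMD GPU is exposed through the
    # regular "cuda" device, so CUDA-targeting code usually runs as-is.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    a = torch.randn(1024, 1024, dtype=torch.float64, device=device)
    b = torch.randn(1024, 1024, dtype=torch.float64, device=device)
    c = a @ b  # matmul in fp64, per the limitation above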

Protip: Use the official ROCm PyTorch base Docker image [0]. The AMD setup is finicky and depends on specific versions of the SDK, drivers, and libraries; it will be much harder to get working if you try to install them separately.

[0]: https://rocm.docs.amd.com/en/latest/how_to/pytorch_install/p...


Sigh. It's great that these container images exist to give people an easy on-ramp, but they definitely don't work for every use case (especially once you're in embedded, where space matters and you might not be online to pull multi-GB updates from some registry).

So it's important that vendors don't feel let off the hook to provide sane packaging just because there's an option to use a kitchen-sink container image they rebuild every day from source.


I know it's still different from what you're looking for, so you probably already know this, but many projects like this have the Dockerfile on GitHub, which shows exactly how they set up the image. For example:

https://github.com/RadeonOpenCompute/ROCm-docker/blob/master...

They also have some for Fedora. Looks like for this you need to install their repo:

    curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add - \
    && printf "deb [arch=amd64] https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" | tee /etc/apt/sources.list.d/rocm.list \
    && printf "deb [arch=amd64] https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" | tee /etc/apt/sources.list.d/amdgpu.list \
then install Python, a couple of other dependencies (build-essential, etc.), and then the package in question: rocm-dev

So they are doing the packaging. There might even be documentation elsewhere for that type of setup.


Oh yeah, I mean... having the source for the container build is kind of table stakes at this point. No one would accept a 10gb mystery meat blob as the basis of their production system. It's bad enough that we still accept binary-only drivers and proprietary libraries like TensorRT.

I think my issue is more with the mindset that it's okay to have one narrow slice of supported versions of everything that are "known to work together", where those are what's in the container and anything outside of them means you're immediately pooched.

This is not hypothetical btw, I've run into real problems around it with libraries like gproto, where tensorflow's bazel build pulls in an exact version that's different from the default one in nixpkgs, and now you get symbol conflicts when something tries to link to the tensorflow C++ API while linking to another component already using the default gproto. I know these problems are solvable with symbol visibility control and whatever, but that stuff is far from universal and hard to get right, especially if the person setting up the build rules for the library doesn't themselves use it in that type of heterogeneous environment (like, everyone at Google just links the same global proto version from the monorepo so it doesn't matter).


> I think my issue is more with the mindset that it's okay to have one narrow slice of supported versions of everything that are "known to work together", where those are what's in the container and anything outside of them means you're immediately pooched.

I hear you. I think docker has been a plague on the quality of software. It's allowed "works for me" to become the norm, except it's now pronounced "works on the official docker image". It seems to be especially true in the ML sphere where compiling things is so temperamental that there's a lot of binaries being distributed.

Docker was meant to be a deployment platform, not a distribution medium.


I don't know what world you live in, but this is a problem for any software development.

You need to ensure that there is only one version of any library used globally throughout the code and that the set of versions is compatible with each other, and preferably you also want everything to be built against the same toolchain with the same flags.

That usually means onboarding third-party libraries into your own build system.


I'd say with semver becoming far better known, this is not a problem for "any" software development. The developer gets the choice to pick libraries that are stable, often also influencing language choice. Mistakes happen, Guava broke the Java ecosystem for about two years, but it's never something that is accepted as just a fact of software development, it is a mistake.

Wanting to hold the Python+C ecosystem more accountable is fair, I think. At least in my own experience from around half a year ago, Anaconda doesn't work and you need a Dockerfile for any sort of reproducibility, which has its own issues since GPU access with Docker isn't that easy. That means developers from the vendors working with Anaconda, for example, on solving the issue rather than just hoping for contributors to do it. If AMD were to make easy, reproducible builds without root or a VM a reality, that would be reason enough to try their hardware. If not, hopefully Nvidia does, and then there really would be no way across the moat, for me at least.


Semver is a joke and doesn't work. Languages like C and C++ can easily have problems if you link code built with different versions together (even if you aim for them to be compatible, or even if they are indeed the same source version but with subtly different flags), and there are no good solutions for this, except not doing it.

A docker container is not really any different from any other process; the main difference is that it runs in a chroot pretty much.


> problems if you link code built with different versions

But that has nothing to do with semver.

Semver gives you information about when you can replace one version with another version. It doesn't promise that you can mix multiple versions together.


It gives you information about intent, not reality.

And you are mixing multiple versions if you are building against version x.y and linking against version x.(y+z).


Maybe I misunderstood "built with", because I thought you were talking about the compiler version there. I know semver is just intent, but the intent doesn't even touch mixing internal data from multiple versions.

If linking against a different version of the code breaks like that, that sounds like someone did semver wrong. If that happens a lot to you, then oh, I'm sorry about that happening.


Every versioning scheme necessarily describes intent, not reality.


This would be a job for Guix. Much better than Docker, and exportable to a lot of formats. Or just build a VM from the CLI, an ad-hoc environment, a Docker export, or a direct rootfs to deploy and run on any compatible machine.


It’s not a universal problem. A lot of modern languages allow multiple versions of a library to be pulled in to the same code base, through different dependency paths. (Eg nodejs, rust). It’s not a perfect answer by any means, but it’s nice not needing to worry about some package pulling in an inconvenient version of one of its dependencies.

Also, just to name it, it’s ridiculous that a specific graphics card manages to restrict the version of gproto that you’re using. You don’t have this problem with nvidia drivers, since cuda stuff is much less fiddly. AMD needs to pull a finger out and fix the bugs in their stack that make it so fragile like this.


In NixOS, I can install multiple versions of libraries

Or rather, I install no versions of libraries because NixOS will put them all in the store in different folders, and will compile the executable to use the correct path (or patch the elf when needed)

it has an issue with pip because it's allergic to just randomly executing things as part of package management, but pip in general is wtf


Ironically I'm having this problem in a Nix build context because of the broken approach Nix takes to packaging bazel—which itself is largely a consequence of the larger issue I'm grouching about here: unbundling tensorflow's locked dependencies is very hard to do when the underlying source is written to assume it's only targeting the exact version specified in the build rules. You can't just switch it to target the gproto in nixpkgs because then you get compilation failures.


That's trivial with Guix.


> No one would accept a 10gb mystery meat blob as the basis of their production system

Well, except for cuda. Which is a massive pile of proprietary software that people are using in production anyway.


If anything, the situation with TensorRT shows that companies are absolutely willing to accept a multi-gig meat blob.


> No one would accept a 10gb mystery meat blob as the basis of their production system

Heh, if only. When working with F100s I've seen many terrible, terrible things.


I feel the same way, especially about build systems. OpenSSL and V8 are among a large list of things that have horrid build systems. The only way to build them sanely is to use some rando's CMake fork, and then it Just Works. Literally a two-liner in your build system to add them to your project with a sane CMake script.


I was part of a Nix migration over the past two years, and literally one of the first things we checked is that there was already a community-maintained tensorflow+gpu package in nixpkgs because without that the whole thing would have been a complete non-starter, and we sure as heck didn't have the resources or know-how to figure it out for ourselves as a small DevOps team just trying to do basic packaging.


> So it's important that vendors don't feel let off the hook to provide sane packaging just because there's an option to use a kitchen-sink container image they rebuild every day.

Sadly if e.g. 95% of their users can use the container, then it could make economical sense to do it that way.


> especially once you're in embedded

is this a real problem? exactly which embedded platform has a device that ROCm supports?


Robotic perception is the one relevant to me. You want to do object recognition on an industrial x86 or Jetson-type machine, without having to use Ubuntu or whatever the one "blessed" underlay system is (either natively or implicitly because you pulled a container based on it).


>industrial x86 or Jetson-type machine

that's not embedded dev. if you

1. use underpowered devices to perform sophisticated tasks

2. using code/tools that operate at extremely high levels of "abstraction"

don't be surprised when all the inherent complexity is tamed using just more layers of "abstraction". if that becomes a problem for your cost/power/space budget then reconsider choice 1 or choice 2.


Not sure this is worth an argument over semantics, but modern "embedded" development is a lot bigger than just microcontrollers and wearables. IMO as soon as you're deploying a computer into any kind of "appliance", or you're offline for periods of time, or you're running on batteries or your primary network connection is wireless... then yeah, you're starting to hit the requirements associated with embedded and need to seek established solutions for them, including using distros which account for those requirements.


fwiw CompTIA classifies an embedded engineer/developer as " those who develop an optimized code for specific hardware platforms."


> IMO as soon as you're deploying a computer into any kind of "appliance", or you're offline for periods of time, or you're running on batteries or your primary network connection is wireless

yes and in those instances you do not reach for pytorch/tensorflow on top of ubuntu on top of x86 with a discrete gpu and 32gb of ram. instead you reach for C and a micro, or some arm soc that supports baremetal or at most an rtos. that's embedded dev.

so i'll repeat myself: if you want to run extremely high-level code then don't be "surprised pikachu" when the underpowered platform that you chose due to concrete, tight budgets doesn't work out.


The hardware can be fast, actually. Here’s an example of relatively modern industrial x86: https://www.onlogic.com/ml100g-41/ That thing is probably faster than half of currently sold laptops.

However, containers or Ubuntu Linux don’t perform great in that environment. Ubuntu is for desktops, containers are for cloud data centers. An offline stand-alone device is different. BTW, end users typically aren’t aware that the thing is a computer at all.

Personally, I usually pick Alpine or Debian Linux for similar use cases, bare metal i.e. without any containers.


> Ubuntu is for desktops

Tell that to their (much larger, more profitable, and better-funded) server org. This is far from true.


It also works much better as a server. Snaps work really well for things like certbot

On Desktop you have to worry about things like... UIs, sound, Wine, etc.


That is the moat they are trying to cross. Imagine you have a PyTorch app that runs on iOS, ARM-based, AMD-based and Intel … cloud, or embedded. Just imagine. You scale and embed according to your business case, not according to any one firm's current strategy.

Or at least you keep the option open, even if that day never comes. Or it comes in ways we just aren't aware of now, like the internet. Did you need to use IBM running SNA to provide a token-ring-based network? In 1980 …

Imagine it, and let us, or them, compete …


Not that I want to encourage gatekeeping in the first place, but you'll have more success if you have a clue what the other person is talking about (and some idea of what embedded looks like outside of tiny micros, and how the concerns about abstractions extend beyond how much computational power is available).


Clearly you've never used an Nvidia Jetson and have no idea what it is. You don't need a discrete GPU; it has a quite sophisticated GPU in the SoC. It's Nvidia's embedded platform for ML/AI.


Better times are to come if the tide shifts so we can have a compatibility layer. The key is the tide. Obviously Nvidia would try to sue … and that would be a sign that we finally have real competition. That is where innovation happens.

"x86 cannot do 64-bit; let us do this and that so the market can use only our CPUs. Repeat with me: x86-64 is impossible."

Not sure Apple is in this; otherwise the real great competition would come.


> Best of all is that I simply set the device to `torch.device('cuda')` rather than OpenCL, which does wonders for compatibility

Man oh man, where did we go wrong that CUDA is the more compatible option over OpenCL?


It must be a misnomer on PyTorch's side. Clearly it's neither CUDA nor OpenCL.

AMD should just get its shit together. This is ridiculous. Not the name, but the fact that you can only do FP64 on a GPU. Everybody is moving to FP16 and AMD is stuck on doubles?


I believe the fp64 limitation came from the laptop-grade GPU I had rather than inherent to AMD or ROCm.

The API level I could target was at least two or three versions behind the latest they have to offer.


Might very well be true. I don't blame anyone for not diving deeper into figuring out why this stuff doesn't work.

But this is one of the great strengths of CUDA: I can develop a kernel on my workstation, my boss can demo it on his laptop, and we can deploy it on Jetsons or the multi-GPU cluster with minimal changes, and I can be sure that everything runs everywhere.


There is indeed something excellent about CUDA from a user perspective that is hard to beat. I do high-level DNN work and it is not clear to me what it is or why that is. Anytime I have worked on optimizing for mobile hardware (not Jetson, but actual phones or accelerators), it is just a world of hurt and incompatibilities. This notion that operators or subgraphs can be accelerated by lower-level closed blobs... I wonder if that is part of the issue. But then why doesn't OpenCL just work? I thought it gave a CUDA-kernel-like general purpose abstraction.

I just don't understand the details enough to understand why things are problematic without CUDA :(


Sorry, still trying to install some dependencies for DNN and CUDA, not sure why it says my Clang version is too new (!)


FP64 is what HPC is built on. FP32 works on the cards too (same rate or faster). I don't know the status of FP16 or FP8.

Some architectures provide fast FP16->FP32 and FP32->FP16 conversion instructions so you can DIY the memory bandwidth saving - that always seemed reasonable to me, but I don't know if the AMD hardware people are going, or will go, down that path.
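
In PyTorch terms the DIY version would look roughly like this (a sketch, assuming conversions are cheap but native FP16 math isn't available):

    import torch

    # Keep the tensors in fp16 to halve memory traffic...
    w16 = torch.randn(4096, 4096).half()
    x16 = torch.randn(32, 4096).half()

    # ...but upcast to fp32 for the arithmetic, so only the cheap
    # fp16<->fp32 conversions are needed, not native fp16 matmul.
    y32 = x16.float() @ w16.float()
    y16 = y32.half()  # downcast again for storage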


Sure but Radeon cards are not HPC accelerators. A modest 7800XT for example, which would be a great card for SD, has 76 TFlops@FP16, 37TF@FP32 and 1.16TF@FP64.

Keeping all those FPUs busy is another problem and not easy, but in cases where it can be done FP32 is clearly desirable.


More importantly, if you specify FP16, yet the hardware only supports FP32, then the library should emit a warning but work anyway, doing transparent casts behind your back as necessary.
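
Something like this hypothetical wrapper (names made up, not a real library API) is what I mean:

    import warnings
    import torch

    def matmul_autocast(a, b, supported=(torch.float32, torch.float64)):
        # If the requested dtype isn't supported, upcast, warn, and cast back.
        if a.dtype not in supported:
            target = supported[0]
            warnings.warn(f"{a.dtype} unsupported on this device, casting to {target}")
            return (a.to(target) @ b.to(target)).to(a.dtype)
        return a @ b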


This has always been the case. OpenCL is a shit show


Have you gotten it to work with Whisper by any chance?


Whisper is actually a great example of why Nvidia has such a stronghold on ML/AI and why it’s so difficult to compete.

There’s getting something to “work”, which is often enough of a challenge with ROCm. Then there’s getting it to work well (next challenge).

Then there’s getting it to work as well as Nvidia/CUDA.

With Whisper, as one example, you should be running it with ctranslate2[0]. On their list of supported platforms you won’t find ROCm.

When you really start to look around you’ll find that ROCm is (at best) still very much in the “get it to work (sometimes)” stage. In most cases it’s still a long way away from getting it to work well, and even further away from making it actually competitive with Nvidia for serious use cases and applications.

People get excited about the progress ROCm has made getting basic things to work with PyTorch, and this is good - progress is progress. But when you're saving 20% on the hardware while the equivalent Nvidia product is often somewhere between 5-10x as performant (at a fraction of the development time) because of vastly superior software support, you realize pretty quickly that Nvidia is actually a bargain compared to AMD.

I’m desperately rooting for Nvidia to have some actual competition, but after six years of ROCm and my own repeated failed attempts to have it make any sense overall, I’m only more and more skeptical that real competition in the space will come from AMD.

[0] - https://github.com/OpenNMT/CTranslate2


While I agree that it's much more effort to get things working on AMD cards than it is with Nvidia, I was a bit surprised to see this comment mention Whisper being an example of "5-10x as performant".

https://www.tomshardware.com/news/whisper-audio-transcriptio... is a good example of Nvidia having no excuse for being double the price when it comes to Whisper inference, with the 7900 XTX being directly comparable with the 4080, albeit with higher power draw. To be fair it's not using ROCm but Direct3D 11, but for the performance/price argument's sake that detail is not relevant.

EDIT: Also using CTranslate2 as an example is not great as it's actually a good showcase why ROCm is so far behind CUDA: It's all about adapting the tech and getting the popular libraries to support it. Things usually get implemented in CUDA first and then would need additional effort to add ROCm support that projects with low amount of (possibly hobbyist) maintainers might not have available. There's even an issue in CTranslate2 where they clearly state no-one is working to get ROCm supported in the library. ( https://github.com/OpenNMT/CTranslate2/issues/1072#issuecomm... )


> While I agree that it's much more effort to get things working on AMD cards than it is with Nvidia, I was a bit surprised to see this comment mention Whisper being an example of "5-10x as performant".

It easily is. See the benchmarks[0] from faster-whisper, which uses CTranslate2. That's 5x faster than the OpenAI reference code on a Tesla V100. Needless to say, something like a 4080 easily multiplies that.

> https://www.tomshardware.com/news/whisper-audio-transcriptio... is a good example of Nvidia having no excuse for being double the price when it comes to Whisper inference, with the 7900 XTX being directly comparable with the 4080, albeit with higher power draw. To be fair it's not using ROCm but Direct3D 11, but for the performance/price argument's sake that detail is not relevant.

With all due respect to the author of the article, this is "my first entry into ML" territory. They talk about a 5-10 second delay; my project can do sub-1-second times[1] even with ancient GPUs thanks to CTranslate2. I don't have an RTX 4080, but if you look at the performance stats for the closest thing (RTX 4090) the performance numbers are positively bonkers - completely untouchable for anything ROCm based. Same goes for the other projects I linked; lmdeploy does over 100 tokens/s in a single session with Llama 2 13B on my RTX 4090 and almost 600 tokens/s across eight simultaneous sessions.

> EDIT: Also using CTranslate2 as an example is not great as it's actually a good showcase why ROCm is so far behind CUDA: It's all about adapting the tech and getting the popular libraries to support it. Things usually get implemented in CUDA first and then would need additional effort to add ROCm support that projects with low amount of (possibly hobbyist) maintainers might not have available. There's even an issue in CTranslate2 where they clearly state no-one is working to get ROCm supported in the library. ( https://github.com/OpenNMT/CTranslate2/issues/1072#issuecomm... )

I don't understand what you're saying here. It (along with the other projects I linked here[2]) are fantastic examples of just how far behind the ROCm ecosystem is. ROCm isn't even on the radar for most of them as your linked issue highlights.

Things always get implemented in CUDA first (ten years in this space and I've never seen ROCm first) and ROCm users either wait months (minimum) for sub-par performance or never get it at all.

[0] - https://github.com/guillaumekln/faster-whisper#benchmark

[1] - https://heywillow.io/components/willow-inference-server/#ben...

[2] - https://news.ycombinator.com/item?id=37793635#37798902


I've had luck with an RX5700XT and whisper.cpp built with clblast. Works like a charm, not entirely a scarring experience getting it to work (easier than most other stuff which was surprising to me).

One arcane detail is that whereas for PyTorch I have to set the env var HSA_OVERRIDE_GFX_VERSION to 10.3.0, getting it to run with whisper.cpp and llama.cpp requires setting it to 10.1.0. Good luck and may it cost you less hair than it did me.
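
In case it saves someone a search, the override has to be in the environment before the ROCm runtime initializes; for the PyTorch case that means something like this (whisper.cpp/llama.cpp get the same variable from the shell instead):

    import os

    # Set before importing torch so the ROCm/HSA runtime sees it.
    os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

    import torch
    print(torch.cuda.is_available())  # should now report the RDNA card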


Fp64??


https://en.wikipedia.org/wiki/Double-precision_floating-poin...

NVIDIA fp32 (H100) has 2x more TFLOPS than AMD's fp32 (MI250) and AI doesn't need fp64 precision.


Lol, it was meant as "I wouldn't be caught dead using fp64".


Hardware limitation.


CUDA is the only reason I have an Nvidia card, but if more projects start migrating to a more agnostic environment, I'll be really grateful.

Running Nvidia on Linux isn't as much fun. Fedora and Debian can be incredibly reliable systems, but when you add an Nvidia card, I feel like I am back on Windows Vista, with kernel crashes from time to time.


My Arch system would occasionally boot to a black screen. When this happened, no amount of tinkering could get it back. I had to reinstall the whole OS.

Turns out it was a conflict between nvidia drivers and my (10 year old) Intel integrated GPU. But once I switched to an AMD card, everything works flawlessly.

Ubuntu based systems barely worked at all. Incredibly unstable and would occasionally corrupt the output and barf colors and fragments of the desktop all over my screens.

AMD on arch has been an absolute delight. It just. Works. It's more stable than nvidia on windows.

For a lot of reasons-- but mainly Linux drivers-- I've totally sworn off nvidia cards. AMD just works better for me.


As a counter-argument, I ran Arch Linux + nvidia GPUs + Intel CPUs between 2012 and 2020, and still run Arch + nvidia (now with AMD CPU) to this day. I won't say it has been bug free at all, but it generally works pretty well. If you find a problem in Arch that you cannot fix without reinstalling, you do not sufficiently understand the problem or Arch itself. "Installing" Arch is refreshingly manual and "simple" compared to the magic that is other Linux distros or the closed source OSes.


I tried using an Nvidia card with OBS to record my screen and it kind of freezes in Wine. I switched from x11 to Wayland and now Wine shows horizontal lines (!) and performs like crap.

Even my 4GB RX 570 from years ago gives a better experience doing this. You just install OBS from flathub, Wayland works, everything works without any setup or tinkering. You click record and you can record your gameplay footage.


I use OBS on Linux with NVIDIA card fairly regularly.

It works flawlessly.

Never used Wine + OBS, though.


Update: OBS is not broken

just ALL videos on my system are broken, I can't play back video past like half speed so the sound gets really choppy

Thanks, Nvidia


I'm sure that I could have fixed it, but I gave up after spending multiple evenings on it. Have you ever spent hours debugging a system exclusively in text mode? It isn't fun. Reinstalling the OS takes less than 30 minutes. It's a clear choice for me


Yes in fact, I have spent hours debugging a system from the console. links/lynx is a godsend. I agree though, reinstalling is certainly easier. This is more of a philosophical argument than a practical one. I installed Arch to really learn Linux, not just to get work done. If I just wanted to get work done, I'd have used Fedora, Ubuntu, or Debian.

Anyway, no judgement, just my POV.


I ran a laptop with the swappable dedicated Nvidia and integrated Intel GPU for a decade with no issues. Used to use something called Bumblebee to swap between them depending on workload, actually worked surprisingly well given the circumstances. Eventually I just dropped back to integrated only when I stopped doing anything intensive with the machine.


I run Arch as well and AMD is only "good". I would have a problem every now and then where my RX560 would lose its mind coming out of sleep and I'd have to reboot.

But the other problem that really bugs me is the "AMD reset bug" that you trip over with most AMD GPUs. This is when you pass through a second GPU through to another OS running under KVM, and is what lets you run Linux and (say) Windows simultaneously with full GPU hardware acceleration on the guest. The reset bug means the GPU will hang upon shutdown of the guest and only a reboot will let you recover the card. This is a silicon level bug that has existed for many years across many generations of cards and AMD can't be arsed to fix it. Projects like "vendor-reset" help for some cards, but gnif2 has basically given up (he mentioned he even personally raised the issue with Lisa Su). Even AMDs latest cards like the 7800 XT are affected. NVidia works flawlessly here.


I have used Pop OS and Ubuntu with NVIDIA card, and honestly, I never faced any serious problem.

After every kernel upgrade, I just have to reinstall the nvidia drivers and the cuda toolkit.

Everything works as before after I do that. I don't face any problems at all.


I'm not sure what card you have but I've never really had any major problems running Nvidia + Intel integrated graphics on Arch, Ubuntu etc.


> CUDA is the only reason I have an Nvidia card, but if more projects start migrating to a more agnostic environment, I'll be really grateful.

What AMD really needs is to have 100% feature parity with CUDA without changing a single line of code. Maybe for this to happen it needs to add hardware features or something (I see people saying that CUDA as an API is very tailored to the capabilities of nvidia GPUs), I don't know.

If AMD relies on people changing their code to make it portable, it already lost.


The idea was supposed to be people convert cuda to hip, which is a pretty similar language, either by hand or by running a tool called 'hipify' that comes with rocm. You can then compile that unmodified for amdgpu or for nvptx.

I think where that idea goes wrong is in order to compile it unmodified for nvptx, you need to use a toolchain which knows hip and nvptx, which the cuda toolchain does not. Clang can mostly compile cuda successfully but it's far less polished than the cuda toolchain. ROCm probably has the nvptx backend disabled, and even if it's built in, best case it'll work as well as clang upstream does.

What I'm told does work is keeping all the source as cuda and using hipify as part of a build process when using amdgpu - something like `cat foo.cu | hipify | clang -x hip -` - though I can't personally vouch for that working.

The original idea was people would write in opencl instead of cuda but that really didn't work out.


Both ideas are already lost before starting: HIP isn't polyglot like CUDA is, and OpenCL is mostly stuck in C.


> I see people saying that CUDA as an API is very tailored to the capabilities of nvidia GPUs

I'm wondering how true that is, because that could give NVidia issues in the future if they need to redesign their GPU should they hit some limit with the current designs. Dependence on certain instructions makes sense, but there's nothing technical preventing AMD from implementing those instructions, only legal mumbo jumbo.


I think that could work too. I wonder if they could do a translation layer, something like what Apple did with the M1 chips, JIT-translating x86 to ARM.


That's a fun idea. Qemu parses a binary into something very like a compiler IR, optimises it a bit, then writes it out as a binary for the same or another target in JIT like fashion. So that sort of thing can be built. Apple's rosetta is functionally similar, I expect it does the same sort of thing under the hood. Valgrind is another from the same architecture.

It would be a painful reverse engineering process - the cuda file format is sort of like elf, but with undocumented bonus constraints, and you'd have to reverse the instruction encoding to get sass, which isn't documented, or try to take it directly to ptx which is somewhat documented, and then convert that onward.

It would be far more difficult than compiling cuda source directly. I'm not sure anyone would pay for a cuda->amdgpu conversion tool, and it's hard to imagine AMD making one as part of ROCm.


Not just feature parity, but proper UX. Things need to just work, without spending hours or days to make them work.


Blame Nvidia. They are the ones that got the industry hooked on a proprietary API.


Why would I blame NVIDIA? If it wasn't for them, we'd still only have needlessly cumbersome APIs and ecosystems. They did what Khronos always failed to do: they created something that is easy, powerful, and fast. Khronos always heavily neglects the easy part.


Blame them for being anti competitive and anti consumer.


How are they preventing the competition from creating something better than CUDA? And how does it hurt consumers that they are providing a fantastic product that others refuse to provide?


I see these complaints from time to time and I never understand them.

I've literally been running nvidia on linux since the TNT2 days and have _never_ had this sort of issue. That's across many drivers and many cards over the many many years.


I've had kernel panics that disappeared when I started using the on board intel graphics instead of the nvidia.

Your statement makes no sense. It's like a smoker claiming that since he didn't die of lung cancer, smoking is 100% safe.


Describing kernel panics and general nightmare scenarios as the general course with Nvidia doesn’t make sense either.

Nvidia has 80% market share of the discrete GPU desktop market and at least 90% market share of cloud/datacenter.

Nvidia GPUs are used almost exclusively for every cloud powered AI service and to train virtually every ML model in existence. Almost always on Linux.

Do you really think any of this would be possible if what you are describing was anything approaching the typical experience starting at the /driver/ level?

Nvidia would have never achieved their market dominance nor held on to it this long if the issues you’ve experienced impacted anything approaching a statistically significant number of users or applications.

Nvidia gets a lot of hate on HN and elsewhere (much of it fair) but I will never understand the people who claim it doesn’t work and get the job done (often very well).


People use flaky software all the time. As long as it mostly works most of the time, most people put up with it. Examples: Windows in the ’90s and ’00s, or any AAA game on first release in the last 10 years.


I have a friend at the Facebook AI Research lab and I assure you they would not tolerate any level of fundamental flakiness from their 8,000 GPU cluster. Talent, opportunity cost, and time to market in general is so crucial in AI no one has any time or patience for the "oddball Linux desktop" experiences people are describing here.

Gaming users may tolerate some flakiness for their hobby but these AI companies dealing in the nine-figure range (minimum) absolutely do not.


My guess is when FB does run into such flakiness they email ____.____@nvidia.com as part of some support contract they have and go "Yo, we see this issue, figure it out and fix it".

But I can promise you after reading things like the LKML for decades and a number of different Microsoft blogs, that everyone on this planet experiences flakiness issues at times and has to figure out how to adjust their workload to avoid it until the issue is discovered and fixed.


He has described to me, in detail, some of the challenges they have had. I'm not saying it's exhaustive but I'm pretty sure if their experience with the fundamental software stack was what people here are claiming I would never hear the end of it.

Actually, no. Obviously they have Nvidia support but in one especially obscure issue he was describing Meta took it as an internal challenge and put three teams on it in competition. Naturally his team won (of course) ;).

Of course all software has flakiness - I'm not taking the ridiculous position that Nvidia is the first company in history to deliver perfect anything.

What I am saying is these anecdotal reports (primarily from Linux desktop hobbyists/enthusiasts) of "It's broken, it doesn't work. Nvidia sucks because it locked up my patched kernel ABC with Wayland XYZ on my bleeding edge rolling release and blah blah blah" (or whatever) are extreme edge cases and in no way representative of 99% of the Nvidia customer base and use cases.

Show me anything (I don't care what it is) and I'll find someone who has a horror story about it. Nvidia gets a lot of heat from the Linux desktop situation over the years and some people clearly hold an irrational hatred and grudge.

Nvidia isn't perfect but it's very hard to argue they don't deliver generally working solutions - actually best of breed in their space as demonstrated by their overwhelmingly dominant market share I highlighted originally.


On the flip side, one of the reasons I'm loyal to nvidia is a combination of two things.

1. They supported linux when no one else did, 2. I've never experienced instability from their drivers, and as I mentioned before, I've been running their cards under linux since the TNT2 days.


Nvidia is bad when combined with Wine/Firefox/Chrome on Wayland

Which is literally only 1% of users anyway


Doesn't justify a kernel panic.


Same but linux experience is a steep and bumpy function of hardware.

My guess: something like laptop GPU switching failed badly in the nvidia binary, earning it a reputation.


That was my experience, Nvidia Optimus (which is what allows dynamic switching between the integrated and dedicated GPU in laptops) was completely broken (as in a black screen, not just crashes or other issues) for several years, and Nvidia didn't care to do anything about it.


Yeah, Optimus was a huge PITA. I remember fighting with workarounds like bumblebee and prime for years. Also Nvidia dragged their feet on Wayland support for a few years too (and simultaneously was seemingly intent on sabotaging Nouveau).


I tried bumblebee again recently, and it works shockingly well now. I have a thinkpad T530 from 2013 with an NVS5400m.

There is some strange issue with some games where they don't get full performance from the dGPU, but more than the iGPU. I have to use optirun to get full performance.

It also has problems when the computer wakes from sleep. For whatever reason, hardware video decoding doesn't work after entering standby. Makes steam in home streaming crash on the client, but flipping to software decoding usually works fine.

The important part is that battery life is almost as good with bumblebee as it is with the dGPU turned off. No more fucking with Prime or rebooting into BIOS to turn the GPU back on.


I don't run laptops except when work requires it and that tends to be windows so that may explain the difference in experience.


I understand it, but I also haven't had any trouble since I figured out the right procedure for me on fedora (which probably took some time, but it's been so long that I can't remember). Whenever I read people having issues it sounds like they are using a package installed via dnf for the driver/etc. I've always had issues with dkms and the like and just install the latest .run from nvidia's website whenever I have a kernel update (I made a one-line script to call it with the silent option and flags for signing for secure boot so I don't really think about it). No issues in a very long time even with the whackiness of prime/optimus offloading on my old laptop.


actually, it's a good point because that's how I always install nvidia drivers as well. Never from the local package manager.


So you don‘t recommend going the rpm-fusion route?


Well tnt2 should be pretty well supported by now ;-)


lmao, touche :)


I have been using NVIDIA cards for compute capabilities only, both personally and at work, for nearly a decade. I've had dozens and dozens of different issues involving the hardware, the drivers, integration with the rest of the OS, version compatibilities, ensuring my desktop environment doesn't try to use the NVIDIA cards, etc. etc.

Having said that - I (or rarely, other people) have almost always managed to work out those issues and get my systems to work. Not in all cases though.


I use a rolling distro (OpenSUSE Tumbleweed) and have had zero issues with my NVIDIA card despite it pulling the kernel and driver updates as they get released. The driver repo is maintained by NVIDIA itself, which is amazing.


Do you use wayland, multiple monitors, and/or play games or is it just for ML/AI?


I do all of those things with my 3070 and it works just fine. Most of them will depend on your DE's Wayland implementation.

I'm not here to disparage anyone experiencing issues, but my experience on the NixOS rolling-release channel has also been pretty boring. There was a time when my old 1050 Ti struggled, but the modern upstream drivers feel just as smooth as my Intel system does.


Yeah, with my CUDA setup it feels like I just duct-taped my deployment together. I am very hesitant to make changes and it’s not easy to replicate.


I often have issues booting to the installer or first boot after install with an NVidia GPU.

Pop_OS, Fedora and OpenSUSE work out of the box. Those are all Wayland I believe. Debian/Ubuntu distros are a bad time. I think they’re still X11. It’s ironic because X11 is supposed to be the more stable window manager.


I think they moved to Wayland on 23.04 or 23.10. I just recently installed both to try and get a 7800xt working with PyTorch and the default was Wayland.


X11 is not a window manager.


Xorg


Neither.


Oh I get it, I originally said “window manager” rather than “window server.” You’re so helpful, thank you!


Those problems might just be GNOME-related at this point. I've been daily-driving two different Nvidia cards for ~3 years now (1050 Ti then 3070 Ti) and Wayland has felt pretty stable for the past 12 months. The worst problem I had experienced in that time was Electron and Java apps drawing incorrectly in xWayland, but both of those are fixed upstream.

I'm definitely not against better hardware support for AI, but I think your problems are more GNOME's fault than Nvidia's. KDE's Wayland session is almost flawless on Nvidia nowadays.


If GNOME can tank the kernel, it ain't GNOME's fault.


I really hope that with KDE 6 I can finally switch to Wayland!


I'm using KDE on Debian 12 with an AMD GPU on Wayland, and it works. It's still a bit annoying compared with X11 for a few programs (Eclipse, DBeaver... I need to launch both with flags to not use the Wayland backend). But I can even play AAA games without problems.


Nvidia on Linux is more like running Windows 95 from the gulag, and you're covered in ticks. I absolutely detest Nvidia because of the Linux hell they've created.


From my recent experience, ROCm and hipcc allow you to port cuda programs fairly easily to AMD arch. The compiling is so much faster than nvcc too.


Yeah, nvidia linux support is meh, but still much better than amd.


>> Yeah, nvidia linux support is meh, but still much better than amd.

Can not confirm. I used nvidia for years when it was the only option. Then used the nouveau driver on a well supported card because it worked well and eliminated hassle. Now I'm on AMD APU and it just works out of the box. YMMV of course. We do get reports of issues with AMD on specific driver versions, but I can't reproduce.


Is it better than AMD? I have had literally no graphics issues on my 6650 XT with swaywm using the built in kernel drivers.


This week I upgraded my kernel on a 2017 workstation to 6.5.5 and when I rebooted and looked at 'dmesg' there were no less than 7 kernel faults with stack traces in my 'dmesg' from amdgpu. Just from booting up. This is a no-graphical-desktop system using a Radeon Pro W5500, which is 3.5 years old (I just had the card and needed something to plug in for it to POST.)

I have come to accept that graphics card drivers and hardware stability ultimately comes down to whether or not ghosts have decided to haunt you.


I think the problems are the Pro drivers and ROCm being buggy, not the open source graphics drivers.


Guess I'm also doing something wrong. Never had any serious issues with either Nvidia or AMD on Linux (and only a few annoyances on RDNA2 shortly after release)...


I never had an issue with nVidia drivers on Linux in the past 5 years, but recently bought a laptop with a 4090 and an AMD CPU. Now I get random freezes, often right after I log in to Cinnamon, but I can't really tell if it's the nVidia driver for the 4090, the AMDGPU driver for the integrated RDNA, kernel 6.2, or a Cinnamon issue. The laptop just hangs and stops responding to the keyboard so I can't log in to a console and dmesg it.


The main issue with Nvidia on Linux AIUI is that they don't release the source code for their drivers.


That might be a philosophical problem that never prevented me from training models on Linux. The half-baked half-crashing AMD solutions just lead to wasting time I can spend on ML research instead.


I literally gave away my last laptop with a discrete nVidia card because it wasted so much of my time.


Not my experience. The open source AMD drivers are much more pleasant to deal with than the closed source Nvidia ones.


In the closed source days of fglrx or whatever it's called I'd agree. Since they went open source, hard disagree. AMD graphics work in Linux about as well as Intel always has.


As someone who was tasked with trying to get nvidia working on Ubuntu, it’s a pretty terrible experience.

I have a nvidia laptop with popos. That works well.


Yup, thank the hobbyists. PyTorch is allowing other hardware. Stable Diffusion is working on M-series chips, Intel Arc, and AMD.

Now what I'd like to see is real benchmarks for compute power. Might even get a few startups to compete in this new area.


It isn't the hobbyists who are making sure that PyTorch and other frameworks runs well on these chips, but teams of engineers who work for NVIDIA, AMD, Intel, etc. who are doing this as their primary assigned jobs, in exchange for money from their employer, who are paying those salaries because they want to sell chips into the enormous demand for running PyTorch faster.

Hobbyist and open-source are definitely not synonyms.


Special mention to Facebook and Google AI research teams that maintain PyTorch and Tensorflow respectively. And also to ptrblck on the PyTorch forums [1] who has the answer to basically every question it seems. He alone is probably responsible for hundreds of millions of dollars of productivity gain.

[1] https://discuss.pytorch.org/u/ptrblck/summary


People don't usually get employed to make things with no demand, and people who work for companies with a budget line don't really care how much the nVidia tax is. You can thank hobbyists for creating a lot of demand for compatibility with other cards.


There are so many billions of dollar being spent on this hardware that everyone other than Nvidia is doing everything they can to make competition happen.

Eg: https://www.intel.com/content/www/us/en/developer/videos/opt...

https://www.intel.com/content/www/us/en/developer/tools/onea...

https://developer.apple.com/metal/tensorflow-plugin/

Large scale opensource is, outside of a few exceptions, built by engineers paid to build it.


I can only point you to cloud financial results and the huge cost of the AI race. Note also the story recently about OpenAI looking at building their own chips. Companies absolutely care immensely about the cost of GPUs. It's billions of dollars.


There is huge demand for AMD cards that can efficiently multiply matrices together. The issue is that while there are currently isolated cases where people can make them do that, it doesn't seem to be possible at the scale that it needs to happen at.

AMD are being dragged along by the market. Willingly, they aren't fighting it, but their focus has been on other areas.


> but their focus has been on other areas.

They've shifted a large pool of experienced engineers from legacy software projects to AI and moved the team under a veteran Xilinx AI director. Fingers crossed we should see significant changes in 2024.


As a new owner of a 7800XT I’m excited.


Look at the earnings call:

https://www.fool.com/earnings/call-transcripts/2023/08/01/ad...

it's literally ALL AI, server, enterprise talk - AI is mentioned 64 times

AMD literally doesn't care about gaming anymore, server is their primary focus


Re: startups, Geohot raised a few million for this already. https://tinygrad.org/


Didn't he do what he always does? Rake in a ton of money, fart around, and then cash out exclaiming it's everyone else's fault?

The way he stole Fail0verflow's work with the PS3 security leak after failing to find a hypervisor exploit for months absolutely soured any respect I had for him at the time


Yep, did exactly that. IMO he threw a fit, even though AMD was working with him squashing bugs. https://github.com/RadeonOpenCompute/ROCm/issues/2198#issuec...


To be fair, kernel crashes from running an AMD provided demo loop isn’t something he should have to work with them on. That’s borderline incompetence. His perspective was around integration into his product, where every AMD bug is a bug in his product. They deserve criticism, and responded accordingly (actual resources to get their shit together). It’s not like GPU accelerated ML is some new thing.



That's a tough issue to read through, thanks for the link. 'Your demo code on a system set up exactly as you describe dereferences null in the kernel and falls over.' Fuzz testing + a vaguely reasonable kernel debugging workflow should make things like that much harder to find.


> The way he stole Fail0verflow's work with the PS3 security leak after failing to find a hypervisor exploit for months absolutely soured any respect I had for him at the time

That sounds interesting. I tried googling about it but can't really find much other than that failoverflow found a key and didn't release it, and then geohot released his own subsequently. I'd love to hear more about how directly he "stole" the work from the Fail0verflow team.

edit: Reading some sibling comments here, it seems you are either mistaken and/or were exaggerating your claim about the "theft" here. As far as I can tell, he simply took their findings and made his own version of an exploit that they had detailed publicly. That may be in poor taste in this particular community but it's certainly not theft. I do agree that his behavior there was lacking in decency, but not to the degree implied here where I was thinking he _literally_ stole their exploit by hacking them, or something similar to that.


People here generally try to bash people who are much smarter than them, throwing shade at their background. They will say that he abandoned his first company, gave up on tinygrad, but both of them are very much alive projects.


Do you have a source on the stealing part? A quick Google search didn't result in anything


Marcan (of Asahi Linux fame) has talked about it many times before. But an abridged version

Fail0verflow demoed how they were able to derive the private signing keys for the Sony PlayStation 3 console at, I believe, CCC.

Geohot after watching the livestream raced into action to demo a "hello world!" jailbreak application and absolutely stole their thunder without giving any credit


If they demoed something then they released it publicly and it was fair game?

In any case he absolutely did credit them, it's easily verifiable: https://web.archive.org/web/20110104040706/http://geohot.com...

Sony sued them both, after all!


This apparently worked pretty well for him, as I still remember him primarily as "that guy who hacked PS3". Some people let someone else do the hard technical core, then do all the other easy but boring stuff and claim 100% credit.


I remember geohot as being one of the people who developed a fairly successful jailbreak for iPhone. I understand that iPhone jailbreaking is often standing on the shoulders of predecessors, but I believe he does deserve significant credit for at least one popular iPhone jailbreak.


Are you interested in being factually correct, or are you interested in hating? If it's the former, I think you should do some research. If it's the latter, :salute:

I had GPT-4 do some research for you, hopefully you will incorporate it in future comments you make about me. https://chat.openai.com/share/d0fa24e9-3ed7-4b17-8497-24bfdd...


Wow, TIL


Obligatory Lex Fridman podcast, where he discusses it: https://youtu.be/dNrTrx42DGQ?t=2408


Pytorch is just using Google's OpenXLA now, & OpenXLA is the actual cross platform thing, no? I'm not very well versed in this area, so pardon if mistaken. https://pytorch.org/blog/pytorch-2.0-xla-path-forward/


> Pytorch is just using Google's OpenXLA now

this is so far from accurate it should be considered libelous; from the link

> PyTorch/XLA is set to migrate to the open source OpenXLA

so PyTorch on the XLA backend is set to migrate to use OpenXLA instead of XLA. but basically everyone moved from XLA to OpenXLA because there is no more OSS XLA. so that's it. in general, PyTorch has several backends, including plenty of homegrown CUDA and CPU kernels. in fact the majority of your PyTorch code runs through PyTorch's own kernels.


You can use OpenXLA, but it's not the default. The main use case for OpenXLA is running PyTorch on Google TPUs. OpenXLA also supports GPUs, but I am not sure how many people use that. Afaik JAX uses OpenXLA as a backend to run on GPUs.

If you use torch.compile() in PyTorch, you use TorchInductor and OpenAI's Triton by default.
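
A minimal sketch of that default path (with the inductor backend spelled out explicitly):

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())

    # TorchInductor is the default backend; on GPU it generates Triton
    # kernels, on CPU it generates C++/OpenMP code.
    compiled = torch.compile(model, backend="inductor")

    out = compiled(torch.randn(8, 128))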


Thank you for saying something useful here. I was vaguely under the impression that pytorch 2.0 had fully flipped to defaulting to openxla. That seems to not be the case.

Good to hear more than a cheap snub. OpenAI Triton as the reason other GPUs work is a real non-shit answer, it seems. And interesting to hear JAX too. Thank you for being robustly useful & informative.


Wrong.


CUDA is the result of years of NVIDIA supporting the ecosystem. Some people like to complain because they bought hardware that was cheaper but can't use it for what they want; when you buy NVIDIA, you aren't buying only the hardware, but the insane amount of work they have put into the ecosystem. The same goes for Intel: MKL and scikit-learn-intelex aren't free to develop.

AMD has the hardware, but the support for HPC is non-existent outside of the joke that is BLIS and AOCL.

I really wish for more competitors to enter the HPC market, but AMD has a shitload of work to do.


> AMD has the hardware, but the support for HPC is non-existent outside of the joke that is BLIS and AOCL.

You are probably two years behind the state of the art. The world's largest supercomputer, OLCF's Frontier, runs AMD CPUs and GPUs. It's emphatically using ROCm, not just BLIS and AOCL. See for example: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html

That's hardly non-existent support for HPC.


Agreed... the main gap is support on consumer and workstation cards, which is where nVidia made headway, but that has started to erode very recently. ROCm works pretty well for me; I have had a lot more problems with specific packagers than with the ROCm layer.


Exactly. NVIDIA's core focus on AI way before it was cool has led to them being in this advantageous position. For AMD, just being a price-friendly competitor to Intel and Nvidia was the motto.


Yeah, that's a pretty shortsighted take on things. Do you really believe that Nvidia hasn't taken steps to make sure their moat is as wide as possible?


The thing about owning the CUDA spec is that Nvidia can add new features quickly without having to argue with other hardware vendors. I find that a positive thing overall.

Also, I choose to pay the ~$120 Windows tax once (per box), everything works very well, and I don't have the driver issues that some fraction of other users seem to have with Linux and Nvidia cards. Seems like a good use of my time.


Literally never had an issue with Nvidia and Linux in decades. Despite this, my windows installs have all sorts of issues.. as always


There is only limited empirical evidence of AMD closing the gap that NVidia has created in science or ML software. Even when considering PyTorch only, the engineering effort to maintain specialized ROCm solutions alongside the CUDA ones is not trivial (think FlashAttention, or any customization that optimizes your own model). If your GPUs only need to run a simple ML workflow nonstop for a few years, maybe there exist corner cases where the finances make sense. It is hard for AMD now to close the gap across the scientific/industrial software base of CUDA. NVidia feels like a software company for the hardware they produce; luckily they make their money from hardware and thus cannot lock down the software libraries.

(Edited "no" to "limited" empirical evidence after a fellow user mentioned El Capitan.)


ROCm has HIP (1) which is a compatibility layer to run CUDA code on AMD GPUs. In theory, you only have to adjust #includes, and everything should just work, but as usual, reality is different.

Newer backends for AI frameworks like OpenXLA and OpenAI Triton directly generate GPU native code using MLIR and LLVM, they do not use CUDA apart from some glue code to actually load the code onto the GPU and get the data there. Both already support ROCm, but from what I've read the support is not as mature yet compared to NVIDIA.

1: https://github.com/ROCm-Developer-Tools/HIP


The fact that El Capitan is AMD says that at least for Science/HPC there definitely is evidence of a closing gap.


Thanks. You are actually right that this new supercomputer might move the needle once it is in production mode. I will wait and see how it goes.


I don't understand why developers of PyTorch and similar don't use OpenCL. Open standard, runs everywhere, similar performance - what's the downside??


I don't know for sure why the early PyTorch team picked CUDA, but my guess is simplicity and performance. NVidia optimizes CUDA better than OpenCL and provides tons of useful performance-tuning tools. It is hard to match CUDA performance with OpenCL even on the same NVidia GPU hardware, and making performant code compatible across different GPUs with OpenCL is also hard. I know examples of scientific codes that became simpler and faster (on NVidia hardware) by going from OpenCL to CUDA, but I haven't yet heard of examples the other way around.


I think the article's claim that "PyTorch has dropped the drawbridge on the CUDA moat" is way over-optimistic. Yes, PyTorch is widely used by researchers and by users to quickly iterate over various ways to use the models, but when it comes to inference there are huge gains to be had by going a different route. Llama.cpp has shown 10x speedups on my hardware, for example (32GB of GPU RAM + 32GB of CPU RAM), for models like falcon-40b-instruct; for much smaller models on the CPU (under 10B) I saw up to 3x speedups just by switching to ONNX and OpenVINO.

Apple has shown us in practice the benefits of CPU/GPU memory sharing; will AMD be able to follow in their footsteps? The article claims AMD has a design with up to 192GB of shared RAM. Apple is already shipping a design with the same amount of RAM (if you can afford it). I wish them (AMD) success, but I believe they need to aim higher than just matching Apple at some unspecified future date.


CUDA is the foundation.

NVIDIA's moat is the years of work built by the OSS community, big corporations, and research institutes.

They spent all that time building for CUDA, and a lot of implicit design decisions are derived from CUDA's characteristics.

That will be the main challenge.


It depends on the domain. Increasingly people's interfaces to this stuff are the higher level libraries like tensorflow, pytorch, numpy/cupy, and to a lesser degree accelerated processing libraries such as opencv, PCL, suitesparse, ceres-solver, and friends.

If you can add hardware support to a major library and improve on the packaging and deployment front while also undercutting on price, that's the moat gone overnight. CUDA itself only matters in terms of lock-in if you're calling CUDA's own functions.


What I meant is that all this stuff has 15 years of implicit accumulation of knowledge, tips, and even hacks built into the software.

No matter what you depend on, you'll have a slew of major or minor obstacles and annoyances.

That, collectively, is the moat itself.

As you said, it's already clear that replacing CUDA itself is not that daunting.


Does AMD have a solution for forward device compatibility (like PTX for NVidia)?

Last time I looked into ROCm (two years ago?), you seemed to have to compile stuff explicitly for the architecture you were using, so if a new card came out, you couldn't use it without a recompile.


Not natively, but AdaptiveCpp (previously hipSYCL, then OpenSYCL) has a single-source, single-compiler-pass mode, where they basically store LLVM IR as an intermediate representation.

https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/...

The performance penalty was within a few percent, at least according to the paper (figures 9 and 10): https://cdrdv2-public.intel.com/786536/Heidelberg_IWOCL__SYC...


I don't know what they do with ROCm, but with OpenCL, the answer is: Certainly. It's called SPIR:

https://www.khronos.org/spir/


> Crossing the CUDA moat for AMD GPUs may be as easy as using PyTorch.

Nvidia has spent a huge amount of work to make code run smoothly and fast. AMD has to work hard to catch up. ROCm code is slower, has more bugs, doesn't have enough features, and they have compatibility issues between cards.


Lisa has said that they are committed to improving ROCm, especially for AI workloads. Recent releases (5.6/5.7) prove that.


> Nvidia has spent a huge amount of work to make code run smoothly and fast.

Well, let's say "smoother" rather than "smoothly".

> ROCm code is slower

On physically-comparable hardware? Possible, but that's not an easy claim to make, certainly not as expansively as you have. References?

> has more bugs

Possible, but - NVIDIA keeps their bug database secret. I'm guessing you're concluding this from anecdotal experience? That's fair enough, but then - say so.

> ROCm ... doesn't have enough features and

Likely. AMD has both spent less in that department (and had less to spend, I guess); plus, and no less importantly, it tried to go along with the OpenCL initiative as specified by the Khronos consortium, while NVIDIA sort of "betrayed" the initiative by investing in its vendor-locked, incompatible ecosystem and letting their OpenCL support decay in some respects.

> they have compatibility issues between cards.

such as?


I wouldn’t say ROCm code is “slower”, per se, but in practice that’s how it presents. References:

https://github.com/InternLM/lmdeploy

https://github.com/vllm-project/vllm

https://github.com/OpenNMT/CTranslate2

You know what’s missing from all of these and many more like them? Support for ROCm. This is all before you get to the really wildly performant stuff like Triton Inference Server, FasterTransformer, TensorRT-LLM, etc.

ROCm is at the “get it to work stage” (see top comment, blog posts everywhere celebrating minor successes, etc). CUDA is at the “wring every last penny of performance out of this thing” stage.

In terms of hardware support, I think that one is obvious. The U in CUDA originally stood for unified. Look at the list of chips supported by Nvidia drivers and CUDA releases. Literally anything from at least the past 10 years that has Nvidia printed on the box will just run CUDA code.

One of my projects specifically targets Pascal up - when I thought even Pascal was a stretch. Cue my surprise when I got a report of someone casually firing it up on Maxwell when I was pretty certain there was no way it could work.

A Maxwell laptop chip. It also runs just as well on an H100.

THAT is hardware support.


I am not so sure.

Everyone knows that CUDA is a core competency of Nvidia and they have stuck to it for years and years refining it, fixing bugs, and making the experience smoother on Nvidia hardware.

On the other hand, AMD has not had the same level of commitment. They used to sing the praises of OpenCL. And then there is ROCm. Tomorrow, it might be something else.

Thus, Nvidia CUDA will get a lot more attention and tuning from even the portability layers because they know that their investment in it will reap dividends even years from now, whereas their investment in AMD might be obsolete in a few years.

In addition, even if there is theoretical support, getting specific driver support and working around driver bugs is likely to be more of a pain with AMD.


This is what people complain about, but at the same time there aren't enough cards, so the people with AMD cards want to use them. So they fix the bugs, or report them to AMD so they can fix them, and it gets better. Then more people use them and submit patches and bug reports, and it gets better.

At some point the old complaints are no longer valid.


People complain about Nvidia being anticompetitive with CUDA, but I don't really see it. They saw a gap in the standards for on-GPU compute and put tons of effort into a proprietary alternative. They tied CUDA to their own hardware, which sorta makes technical sense given the optimizations involved, but it's their choice anyway. They still support the open standards, but many prefer CUDA and will pay the Nvidia premium for it because it's actually nicer. They also don't have CPU marketshare to tie things to.

Good for them. We can hope the open side catches up either by improving their standards, or adding more layers like this article describes.


CUDA was released in 2007 and the development of it started even earlier - possibly even in the 90s. Back then nobody else cared about GPU compute. OpenCL came out 2 years after that.


Not true. People got interested in general-purpose GPU compute (GPGPU) in early 2000s when video cards with programmable shaders became available. https://en.wikipedia.org/wiki/General-purpose_computing_on_g...

People made a programming language & a compiler/runtime for GPGPU in 2004: https://en.wikipedia.org/wiki/BrookGPU


Everything has old beginnings that the specialists will remember, but GPU compute really reached mass popularity and became a large selling point for Nvidia in the 2010s.


And the question that remains for most once AMD catches up: will the duopoly bring prices down to a reasonable level for hobbyists or bootstrapped startups, or will AMD just gouge like NVidia?


I think in this case the changes needed to make AMD useful will open the market to other players as well (e.g. Intel).

PyTorch is already walking down this path and while CUDA-based performance is significantly better, that is changing and of course an area of continued focus.

It's not that people don't like Nvidia, rather it's just that there is a lot of hardware out there that can technically perform competitively, but the work needs to be done to bring it into the circle.


Last I checked I saw the H100 was about two gens more advanced for certain components (tensor cores, bfloats, cache, mem bandwidth) - but my research may have been wrong as admittedly I'm not as familiar with AMDs offerings for GPU.


They are not behind... https://www.tomshardware.com/news/amd-expands-mi300-with-gpu...

You can also actually buy them as opposed to the nVidia offerings which you are going to have to fight for.


A simplistic economic take would suggest that the competition would result in lower prices, but given two players in the market who knows.


My intuition is along the lines that if AMD had a competing product earlier, then it would have kept prices down. But since Nvidia has shown what the market will pay, AMD won't be able to resist overcharging. It will probably come down a little, but nowhere near to the point of affordability.

I sure hope I'm wrong.


AMD might have to charge less to break into customers that are already bought into Nvidia. There has to be a discount to cover the switching costs + still provide savings (or access).


AMD will have to provide a REALLY steep discount to convince me to come back.


It is oligopoly pricing.

https://www.investopedia.com/terms/o/oligopoly.asp

With that few competitors pricing would not change much.


That's mostly when there isn't a lot of price elasticity of demand. If you're Comcast and Verizon, each customer wants one internet connection and you're not going to change the size of the market much by offering better prices.

If you're AMD and NVIDIA and lowering the price would double the number of customers, you might very well want to do that, unless you're supply constrained -- which has been the issue because they're both bidding against everyone else for limited fab capacity. But that should be temporary.

This is also a market with a network effect. If all your GPUs are $1000 and nobody can afford them then nobody is going to write code for them, and then who wants them? So the winning strategy is actually to make sure that there are kind of okay GPUs available for less than $300 and make sure lots of people have them, then sell very expensive ones that use the same architecture but are faster.

That has been the traditional model, but the lack of production capacity meant that they've only been making the overpriced ones recently. Which isn't actually in their interests once the supply of fab capacity loosens up.


I think you may be mixing up the demand curve with the supply curve? The supply curve in that case would be 'vertical' at some point. Monopoly prices and oligopoly pricing do not necessarily want to be at MR=MC. Usually that pricing is where max profit is (closer to MD=MC). In most cases that is never to the right of MR=MC in this style of market, but to the left of it. It makes sense to make fewer items and make more money, as you are not consuming your profit on support-type items or factories.

> This is also a market with a network effect

That is what the demand curve describes. In your hypothetical that would mean the demand curve is more vertical in slope.

Higher prices are more likely due to several factors. Participants willing to pay more (crypto and AI). But also fewer companies making the things than 15 years ago, so less supply and oligopoly-style pricing. Plus one company being the linchpin for building the chips and another company consuming large portions of its supply. The supply curve shifted left and up, while the demand curve is moving up and to the right. There is no 'one thing' that causes it. But oligopoly pricing is very much in effect, with 3 companies making the things.

> Which isn't actually in their interests once the supply of fab capacity loosens up

Which would change the supply curve and they would re-evaluate which way to move the price. That could mean bad things or nothing happens other than possible lower prices (eventually).


Actually, there is already a market like this that they are in: gaming. Most GPUs in use are low to mid-range (see Steam's hardware survey). AI has to, and will, come down to that level for everyday use and gaming. You cannot just have gaming on Intel… well, you did, then Steam worked hard and realized the Steam Deck. You can have totally different software, like j and a did. Hence you really can't have one N to rule them all for long. So thanks for it, and for all the fish; without it we might still be doing GPUs only for numerical-computing research.


AI doesn't have all that much application to gamedev - people think "videogames have lots of AI", and don't realize that "game AI" and "stable diffusion/GPT AI" are about as related as Java and Javascript.

Game "AI" is meant to be fun to play (and win) against, they're not meant to be smart; that's why zombie games are so successful. Most "game AI" are finite state machines, throwing a neural network at the issue would be absurd overkill.

I'm sure there will be some AI applications in games (like procedural world generation or such, perhaps) but it's not the obvious connection that most people think.


Prices seemed to have lowered when AMD came out with CPUs competitive with Intel's.


Price difference between 13900K and AMD Ryzen 9 7950x is not big - the latest 7950X3D is about on par with the higher clocked 13900KS as well.


because intel lowered their prices


I was in the market last month - Intel was the better choice because AM5 boards and DDR5 were too expensive.

Ryzen 9 7950X: $799 on release. Intel 13900K: $589.


Looking at the CPU market, competition did lead to lower prices: AMD are the best, but their CPUs are very reasonably priced because Intel is close behind.

In the gaming market for GPUs, Nvidia has no competition except in some niche areas. Overall, their lead in upscaling software is too commanding so they can price how they want. Customers are paying 15-20% premiums for the same raw hardware performance, all to access Nvidia's DLSS, because there's no good competition.


This is not a binary question. Two players, while not ideal, are better than just one.


AMD prices will go up because of the newfound ability to gouge for AI/ML/GPGPU workloads. Nvidia's will likely go down, but I don't expect it will be by much. The market demand is high, so the equilibrium price will also be high. Supply isn't at pandemic / crypto-rush lows, but the supply of cards useful for CUDA/ROCm still is.


When AMD caught up to Intel in CPUs, prices went down (at least compared to when Intel had a complete monopoly). The same was true when AMD gaming cards were more competitive. Chip manufacturers have shown themselves willing to both raise prices when they can and lower them when they must.


If the margins and demand are there, Intel will eventually show up.


Intel already showed up three or four times but their software is as bad as AMD's used to be.


Thankfully, software can be fixed over time as AMD has shown. Lack of another competitor can't be fixed as easily.


Is either in doubt?


Wouldn't be surprised if a bunch of the investment is a hype bubble and a demand correction forces a price correction. Maybe not immediately, but at Intel's pace - they managed to miss out on the mining bubble - I wouldn't be surprised if they release into a correction.


Demand will push AMD prices up by a couple hundred bucks and Nvidia cards down by a couple hundred bucks. A hobbyist customer will be neither better nor worse off.


In general I think it will lower prices, though certainly not as much as if there were 4+ players on the market, where it's hard to anticipate your rivals. A 2-body system is pretty straightforward, a 3-body system can be stable for a while with some restrictions, and a 4-body problem is really damn hard...


Why would their investors allow anything else? I’m sure they see it as a huge loss like intel and mobile.


>There is also a version of PyTorch that uses AMD ROCm, an open-source software stack for AMD GPU programming. Crossing the CUDA moat for AMD GPUs may be as easy as using PyTorch.

Unfortunately, since the AMD firmware doesn't reliably do what it's supposed to, those ROCm calls often don't either. That's if your AMD card is even still supported by ROCm: the AMD RX 580 I bought in 2021 (the great GPU shortage) had its ROCm support dropped in 2022 (4 years of support total).

The only reliable interface in my experience has been via opencl.


has opencl actually improved enough to be competitive?


I thought ONNX is supposed to be the ultimate common denominator for machine learning model cross platform compatibility


Do you mean OpenCL using Rusticl or something else? And what DL framework, if any?


I should clarify that I mean for personal use, not commercial or institutional. Currently, CLBlast via llama.cpp for LLMs. Or, further in the past, just pure OpenCL for things with AMD cards.


ROCm works fine on my 2016 Vega Frontier edition, for what it's worth.


When coding with Vulkan, for graphics or compute (the latter is the relevant one here), you need to have CPU code (written in C++, Rust, etc.), then serialize it as bytes, then have shaders which run on the graphics card. This 3-step process creates friction, much in the same way backend/serialization/frontend does in web dev: duplication of work, type checking not crossing the bridge, the shader language being limited, etc.

My understanding is CUDA's main strength is avoiding this. Do you agree? Is that why it's such a big deal? Ie, why this article was written, since you could always do compute shaders on AMD etc using Vulkan.
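
To make the friction point concrete, here's the kind of thing I mean, as a rough, untested CUDA sketch (the names are made up). One struct definition and one compiler see both the host and device side, so there's no hand-written byte serialization layer and type mismatches are caught at compile time:

    // One translation unit, one type definition, checked end-to-end by one compiler.
    struct Particle { float x, y, z; float mass; };

    __global__ void integrate(Particle* p, float dt, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x += dt / p[i].mass;   // device code uses the host-visible struct directly
    }

    // Host-side launch: no descriptor sets, no SPIR-V loading, no manual byte packing.
    void step(Particle* d_particles, float dt, int n) {
        integrate<<<(n + 127) / 128, 128>>>(d_particles, dt, n);
    }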


NVidia hardware/CUDA stack is great, but I also love to see competition from AMD, George Hotz’s Tiny Corp, etc.

Off topic, but I am also looking with great interest at Apple Silicon SOCs with large internal RAM. The internal bandwidth also keeps getting better which is important for running trained LLMs.

Back on topic: I don’t own any current Intel computers but using Colab and services like Lambda Labs GPU VPSs is simple and flexible. A few people here mentioned if AMD can’t handle 100% of their workload they will stick with Intel and NVidia - understandable position, but there are workarounds.


Don’t agree at all. PyTorch is one library - yes, it’s important that it supports AMD GPUs but it’s not enough.

The ROCm libraries just aren’t good enough currently. The documentation is poor. AMD need to heavily invest in their software ecosystem around it, because library authors need decent support to adopt it. If you need to be a Facebook sized organisation to write an AMD and CUDA compatible library then the barrier to entry is too high.


Disagree that the ROCm libraries are poor. Their integration with everything else is poor because everything else is so highly Nvidia-centric, and AMD can't just write to the same API because it's copyrighted by Nvidia (see Oracle's Java case).

The adoption of CUDA has been such a coup for Nvidia, it's going to take some time to dismantle it.


I don’t use high level frameworks like PyTorch because my work is in computational physics so I do actually use the lower level libraries. The documentation doesn’t even come close although it has got better. But they’re just not at feature parity, and that’s not on anyone but AMD currently. They need to invest more in the core libraries.

Just look at cuFFT vs rocFFT, for example… they aren't even close to being at feature parity - things like multi-GPU support are totally missing and callbacks are still "experimental". These are pretty basic features - bear in mind that when people ported from CPU codes, CUDA had to support these because they existed in FFTW (transforms over multiple CPUs rather than GPUs, though, via MPI).


Regurgitated months-old content. blogspam


I don't understand the author's argument (if there is one) - pytorch has existed for ages. AMD's Instinct MI* range has existed for years now. If these are the key ingredients why has it not already happened?


I call it the 90% problem. If AMD works for 90% of my projects, I would still buy NVIDIA, which works for 100%, even though I’m paying a premium


I'm lazy, so it's 99% for me. I don't even mess with AMD CPUs; I know they're not exactly the same instruction set as Intel, and more importantly they work with a different (and less mainstream) set of mobos, so I don't want em. If AMD manages to pull more customers their way, that's great, it just means lower Intel premium for me.


That's an interesting take. AMD mobos are no "less mainstream" than Intel ones are... When you choose a CPU you are also choosing a compatible mobo chipset. The companies that make motherboards are mostly the same, so there should be no big difference between those.

Also, while the CPU instruction sets are not exactly equal, the same is true for Intel processors of different generations too. And it doesn't matter one bit... Unless there is a bug in CPU you will never notice the difference, because it is taken care of at the compiler / kernel level.

Intel does have some advantages (and disadvantages too) over AMD, just not those.


I have no idea what you're talking about. AMD and Intel match on the ISA in any case you'd typically see. Moreover, Intel is currently using AMD's instruction set: x86_64 was designed by AMD and used to be called AMD64.


It's never this simple. Their SIMD extensions differ for one, or at least did in the past.


How long ago is that past?

If we are talking about 3D-Now, that is long dead and buried. If we are talking about the latest AVX-whatever, not even Intel is consistent, with different processor families supporting different subsets and applying different clock policies.


AVX-512, which Intel had since 2016 and AMD caught up with supposedly in 2022, is probably what I was thinking of. So, long ago but also not long ago ;)

Yes it was Intel's own spec, so of course they're gonna implement it first, but that's exactly what I mean. This is a recurring dance, and I'll pay a little more for the one that sets the standard. If this weren't a thing, they'd both just be commodity.


What mainstream board company is Intel-only? Maybe a decade ago on AM3(+), but on AM4/AM5 I haven't seen a mainboard partner not offer equivalent board SKUs for both Intel and AMD.


As an owner of some Sapphire Rapids parts, let me just direct you to: https://edc.intel.com/content/www/us/en/design/products-and-...


To see errata tracked by Intel is a good sign.


Forgot to also mention iGPU and other on-chip accelerators being different and Intel usually having the edge there.


If the AI hype persists the CUDA moat will be less relevant in ~2 yrs.

Historically HPC was simply not sufficiently interesting (in commercial sense) for people to throw serious resources in the direction of making it a mass market capability.

NVIDIA first capitalized on the niche crypto industry (which faded) and was then well positioned to jump into the AI hype. The question is how much of the hype will become real business.

The critical factor for the post-CUDA world is not any circumstantial moat but who will be making money servicing stable, long term computing needs. I.e., who will be buying this hardware not with speculative hot money but with cashflow from clients that regularly use and pay for a HPC-type application.

These actors will be the long term buyers of commercially relevant HPC and they will have quite a bit of influence on this market.


It's worth noting that AMD also has a ROCm port of Tensorflow.


When I try to install rocm-ml-sdk on Arch linux it'll tell me the total installed size would be about 18GB.

What can possibly explain this much bloat for what should essentially be a library on top of a graphics driver as well as some tools (compiler, profiler etc.)? A couple hundred MB I could understand if they come with graphical apps and demos, but not this..


A regular TensorFlow installation, just the Python library, is a 184 MB wheel that unpacks to about 1.2 GB of stuff. I have no clue what mess goes in there, but it's a lot.

Still, if you're right that this package seems to take 18 GB disk size, something weird is going on.


There's a lot of kernels that are specialized for particular sets of input parameters and tuned for improved performance on specific hardware, which makes the libraries a couple hundred megabytes per architecture. The ROCm libraries are huge because they are fat binaries containing native machine code for ~13 different GPU architectures.


He's not wrong. I did a fresh Arch install to try and get a 7800XT working with ROCm and PyTorch, and was confused about how I ran out of space until I saw that ROCm was 18GB.


ROCm is great. We were able to run and finetune LLMs on AMD Instincts with parity to NVIDIA A100s - and built an SDK that's as easy to use as HuggingFace or easier (Lamini). Or at the very least, our designer is able to finetune/train the latest LLMs on them, like Llama 2 70B and Mistral 7B, with ease. The ROCm library isn't as easy to use as CUDA because, as another poster said, the ecosystem was built around CUDA. For example, it's even called ".cuda()" in PyTorch to put a model on a GPU, when in reality you'd use it for an AMD GPU too.


Nope. PyTorch is not enough; you have to do some C++ occasionally (as the code there can be optimized radically, as we see in llama.cpp and the like). ROCm is unusable compared to CUDA (4x more code for the same problem).

I don't understand why everyone neglects good, usable and performant lower-level APIs. ROCm is fast, low-level, but much much harder to use than CUDA, and the market seems to agree.


The amount of random wrong stuff about pytorch in this thread is pretty funny.


Anyone who has to work in this ecosystem surely thinks this is a naive take


For someone who doesn't work in this ecosystem, can you elaborate? What's the real situation currently?


Nvidia CUDA was first to market and easier to work with than OpenCL, which was the only competition for the first decade and was then abandoned. Because of this, all the people serious about this are using Nvidia hardware, and therefore all the code is written for Nvidia hardware.

The only way I could see AMD making inroads is if they were willing to provide the level of power Nvidia puts in a data center, at consumer prices and with relaxed licensing, to justify retooling the entire ML chain to work on a different architecture.

Geohot has documented his troubles trying to go all in on AMD and he's back on Nvidia now I believe.


I know a lot of people don't like George; I dislike plenty of people who are doing the right thing (including, by some measures, sama and siebel while they were pushing YC forward).

But not admitting the tinygrad project is the best Rebel Alliance on this is just a matter of letting vibe overcome results.


As a former ETH miner I learned the hard way that saving a few bucks on hardware may not be worth operational issues.

I had a miner running with Nvidia cards and a miner running with AMD cards. One of them had massive maintenance demands and the other did not. I will not state which brand was better imho.

Currently I estimate that running miners and running gpu servers has similar operational requirements and finally at scale similar financial considerations.

So, whatever is cheapest to operate in terms of time expenditure, hw cost, energy use,… will be used the most.

P.S.: I ran the mining operation not to earn money but mainly out of curiosity. And it was a small-scale business powered by a PV system and an attached heat pump.


I ran 150,000+ AMD cards for mining ETH. Once I fully automated all the vbios installs and individual card tuning, it ran beautifully. Took a lot of work to get there though!

Fact is that every single GPU chip is a snowflake. No two operate the same.


Have you ever written about this enterprise? This sounds super unique and I would be very interested in hearing about how it was run and how it turned out.


It was unique - not many people on the planet, that I know of, have run as many GPUs as I have, especially without a giant company and large teams of people behind them. For the tech team, it was just me and one other guy. Everything had to be automated because there was no way we could survive otherwise.

I've put a bunch of comments here on HN about the stuff I can talk about.

It no longer exists after PoS.


what type of cards did you have? what did you do with them after PoS? How did you even buy so many cards? Sorry, like the other commenter I'm extremely curious


Primarily 470,480,570,580. We also ran a very large cluster of PS5 APU chips too.

Got the chips directly from AMD. Since these are 4-5 year old chips, they were not going to ever be used. It is more ROI efficient with ETH mining to use older cards than newer ones.

Had a couple OEM manufacture the cards specially for us with 8gb, heatsinks instead of fans (lower power usage) and no display ports (lower cost).

They will be recycled as there isn't much use for them now.

I'm also no longer with the company.


Cool! Were the PS5 APUs actually attached to a PS5 motherboard, or were they repurposed entirely?


Asrock bc-250. This is some hardware that I wouldn't have purchased, if given the choice, especially that close to ETH PoS.

That said, I made it work, which was an insane amount of work, and it mined really well.


Unless they get their act together regarding CUDA polyglot tooling, I seriously doubt it.


On my PC workstation (Debian Testing) I have absolutely no problems running an NVIDIA PNY Quadro P2200, which I'm going to upgrade to a PNY Quadro RTX 4000 soon. I'd love to switch to an AMD Radeon, but the very short (and shrinking) list of ROCm-supported cards makes this move highly improbable in the near future.


This article doesn’t address the real challenge [in my mind].

Framework support is one thing, but what about the million standalone CUDA kernels that have been written, especially common in research. Nobody wants to spend time re-writing/porting those, especially when they probably don’t understand the low-level details in the first place.

Not to mention, what is the plan for comprehensive framework support? I’ve experienced the pain of porting models to different hardware architectures where various ops are unsupported. Is it realistic to get full coverage of e.g., PyTorch?


Someone could reimplement CUDA for AMD hardware. That would be legal because copying APIs for compatibility purposes is not copyright infringement. (See Google LLC v. Oracle America, Inc., 593 U.S. ___ (2021)).

AMD is unlikely to do this, however, because it would commodify their own products under their competitor’s API.

A third party could do it though. It may make sense as an open source project.


Research kernels mostly turn to ash upon publication anyway. The wheel turns and the next post-doc gives ROCm a try and we move on


I suspect that AMD will use their improved compatibility with the leading ML stack for data center deals. Presumably by offering steep discounts over NVIDIA’s GPUs. This might help them to break into the market.

Individual ML practitioners will probably not be tempted to switch to AMD cards anytime soon. Whatever the price difference is: it will hardly offset the time that is subsequently sunk into working around remaining issues resulting from a non-CUDA (and less mature) stack underneath PyTorch.


Is there any reason OpenCL is not the standard in implementations like PyTorch? Similar performance, open standard, runs everywhere - what's the downside?


IIRC, ease of implementation (for the GPU kernels), and cross-compatibility (the same bytecode can be loaded by multiple models of GPU).


How is CUDA-C that much easier than OpenCL? Having ported back and forth myself, the base C-like languages are virtually identical. Just sub "__syncthreads();" for "barrier(CLK_LOCAL_MEM_FENCE);" and so on. To me the main problem is that Nvidia hobbles OpenCL on their GPUs by not updating their CL compiler to OpenCL 2.0, so some special features are missing, such as many atomics.
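
For anyone who hasn't done such a port, here's a rough, untested sketch of what the substitutions look like on a toy kernel (the OpenCL C equivalents are in the trailing comments; the shared-memory staging is only there to show the barrier):

    // CUDA version; OpenCL C equivalent shown line by line in comments.
    __global__ void scale(const float* in, float* out, float factor, int n) {
        // OpenCL: __kernel void scale(__global const float* in, __global float* out, float factor, int n)
        __shared__ float tile[256];                       // OpenCL: __local float tile[256];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;  // OpenCL: int gid = get_global_id(0);
        int lid = threadIdx.x;                            // OpenCL: int lid = get_local_id(0);
        if (gid < n) tile[lid] = in[gid] * factor;
        __syncthreads();                                  // OpenCL: barrier(CLK_LOCAL_MEM_FENCE);
        if (gid < n) out[gid] = tile[lid];
    }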


Never used it myself, these are just the main reasons I've heard from friends.


The ease of implementation using CUDA means that your code becomes effed for life, because it is no longer valid C/C++, unless you totally litter it with #ifdefs to special-case for CUDA. In my own proprietary AI inference pipeline I've ended up code-generating to a bunch of different backends (OpenCL SPIR-V, Metal, CUDA, HLSL, CPU w. OpenMP), giving no special treatment to CUDA, and the resulting code is much cleaner and builds with standard open source toolchains.


> The ease of implementation using CUDA means that your code becomes effed for life

yes, yes it absolutely does. establishing market dominance as everyone wants to use CUDA but almost nobody wants to write their kernel twice.


Downsides are that it can't express a bunch of stuff CUDA or OpenMP can, plus the Nvidia OpenCL implementation is worse than their CUDA one. So OpenCL is great if you want a lower-performance way of writing a subset of the programs you want to write.


AMD playing catch-up is a good thing: their SW solution is intended to run on any HW, and with HIP being basically line-for-line compatible with CUDA, it makes porting very easy. They did it with FSR, and they are doing it with ROCm. Hopefully it takes off, as it's a more open ecosystem for the industry. Necessity is the mother of invention and all that.


For LLM inference, a shoutout to MLC LLM, which runs LLM models on basically any API that's widely available: https://github.com/mlc-ai/mlc-llm


TL;DR:

1. Since PyTorch has grown very popular, and there's an AMD backend for that, one can switch GPU vendors when doing Generative AI work.

2. Like NVIDIA's Grace+Hopper CPU-GPU combo, AMD is/will be offering "Instinct MI300A", which improves performance over having the GPU across a PCIe bus from a regular CPU.


> AMD May Get Across the CUDA Moat

I really wish they would, and properly, as in: fully open solution to match CUDA.

CUDA is a cancer on the industry.


What's wrong with CUDA? I avoided it for years because it's proprietary, but about one year ago I started using it because all the alternatives (OpenGL/Vulkan compute, OpenCL, WebGPU, ...) couldn't quite do what I wanted, and it turned out to be a game changer. Nothing comes close to it. Now I'm hooked because there simply isn't an alternative that's as easy to use, yet powerful and fast.

I wish there was an open alternative, but NVIDIA did several things right that others, especially Khronos, do not: The UX is top-notch. It makes the common cases easy yet still fast, and from there you can optimize to your hearts content. Khronos, however, usually completely over-engineers things and makes the common case hard and cumbersome with massive entry barriers.


> What's wrong with CUDA?

Read on

> it's proprietary

Yes indeed, proprietary

> Now I'm hooked

There you go.

> I wish there was an open alternative

So does the rest of the industry.

Specifically, it forces you to run your stuff on NVidia hardware and gives you exactly zero guarantee of future support.

Good luck trying to reproduce whatever research you are currently conducting in 10 years time.

Vendor lock-in + no forward compatibility guarantee = surefire recipe for getting milked to the bone by NVidia.


Those are some poor arguments, imho, because there literally is no other option than CUDA. The alternatives are so bad, it's far better to be vendor-locked and being able to get stuff done, than not being able to get stuff done at all.

As I said, I avoided it for years because of the reasons you mentioned. Turns out I could not avoid it any longer because it's the only (meaningful) option that could do what I needed, has serious support, and great UX. And NVIDIA is hardly to blame because they simply made sure to build a good product. It can't stop AMD, Intel or Khronos from creating a competitive alternative, but so far they haven't.

And regarding support, so far NVIDIA has shown excellent continuous support for CUDA, whereas OpenCL and OpenGL are the ones that went down. And I've chosen CUDA over ROCm precisely for support reasons, because AMD has always treated it as some kind of side gig with an uncertain future.


Can we just get wgsl compute good enough and over the line instead, and do away with these moats?


Not happening. WGSL wants to support the lowest common denominator, so it'll always mainly be a 5-year-old mobile-phone API. Also, if you want to beat CUDA, you'll need functionality that's completely missing in compute shaders, especially WGSL - like pointers and pointer casting (and that GLSL buffer reference extension is the worst emulation of that feature I've ever seen).
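
For example, this sort of thing is bread-and-butter in a CUDA kernel and has no direct WGSL equivalent (rough, untested sketch, names made up): reinterpreting one raw device allocation as a small header plus a typed payload with plain pointer casts.

    __global__ void unpack(const unsigned char* blob, float* out, int n) {
        // Assume the first 16 bytes are a header whose first int is the element count,
        // and the rest of the blob is a float payload. No re-binding or re-declaring
        // buffers, just pointer arithmetic and casts.
        int count = *reinterpret_cast<const int*>(blob);
        const float* payload = reinterpret_cast<const float*>(blob + 16);
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && i < count) out[i] = payload[i];
    }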


The language extensions feature is designed to provide these kinds of facilities is it not?


I am hoping for SYCL and SPIR-V to gain traction...


Can I buy an MI300 or even rent one in a cloud?


Soon. The card is coming in Q4. The early shipments are likely all going to LLNL's El Capitan Exascale computer: https://www.tomshardware.com/news/amds-instinct-mi300-moves-...


That’s like saying Ford is gonna catch Tesla.


Do you see that as an inevitability or an impossibility?


No, not really. They have similar enough silicon, they "just" need some software to make it work.


They are just too late even if they catch up. Until they make a leap like they did with ryzen nothing will happen.


>They are just too late even if they catch up.

Late certainly, too late I don't think so.

If you can field a competitively priced consumer card that can run llama fast then you're already halfway there because then the ecosystem takes off. Especially since nvidia is being really stingy with their vram amounts.

H100 & datacenter is a separate battle certainly, but on mindshare I think some deft moves from AMD will get them there quite fast once they pull their finger out their A and actually try sorting out the driver stack.


>If you can field a competitively priced consumer card

if this unicorn were to show up, what's to say that all the non-consumers won't just scarf up these equally performant yet lower priced cards causing the supply-demand situation we're in now? the only difference would be a sudden supply of the expensive Nvidia cards that nobody wants because of their price.


The thing that causes it to be competitively priced is having enough production capacity to prevent that from happening.

One way to do that may be to produce a card on an older process node (or the existing one when a new one comes out) that has a lot of VRAM. There is less demand for the older node so they can produce more of them and thereby sell them for a lower price without running out.


>if this unicorn were to show up

A unicorn like that showed up a couple hours ago. Someone posted a guide for getting llama to run on a 7900xtx

https://old.reddit.com/r/LocalLLaMA/comments/170tghx/guide_i...

It's still slow and janky but this really isn't that far away.

I don't buy that AMD can't make this happen if they actually tried.

Go on fiverr, get them to compile a list of top 100 people in the DIY LLM space, send them all free 7900XTXs. Doesn't matter if half of it is wrong, just send it. Next take 1.2m USD, post a dozen 100k bounties against llama.cpp that are AMD specific - support & optimise the gear. Rinse and repeat with every other hobbyist LLM/stable diffusion project. A lot of these are zero profit open source / passion / hobby projects. If 6 figure bounties show up it'll absolute raise pulses. Next do all the big youtubers in the space - carefully on that one so that it doesn't come across as an attempted pay-off...but you want them to know that you want this space to grow and are willing to put your money where your mouth is.

That'll cost AMD what 2m 3m? To move the needle on a multi billion market? That's the cheapest marketing you've ever seen.

As I said the datacenter & enterprise market is another beast entirely full of moats and strategy, but I don't see why a suitably motivated senior AMD exec can't tackle the enthusiast market single handedly with a couple of emails, a cheque book and a tshirt that has the nike slogan on it.

>what's to say that all the non-consumers won't just scarf up these equally performant yet lower priced cards

It doesn't matter. They're in the business of selling cards. To consumers, to datacenters, to your grandmother. From a profit driven capitalist company the details don't matter as long as there is traction & volume. The above - opening up even the possibility of a new market - is gold in that perspective. And from a consumer perspective anything that breaks the nvidia cuda monopoly is a win.


llama.cpp, ExLlama, and MLC LLM have all had ROCm inferencing for months (here are a bunch of setup instructions I've written up, for Linux and Windows: https://llm-tracker.info/books/howto-guides/page/amd-gpus ) - but I don't think that's the problem (and it wouldn't drive lots of volume or have much downstream impact in any case).

The bigger problem is the training/research support. E.g., there's no official support for AMD GPUs in bitsandbytes, and no support at all for FlashAttention/FA2 (nothing that 100K in hardware/grants to Dettmers' or Dao's labs wouldn't fix, I suspect).

The real elephant, though, is that AMD still doesn't seem to grasp that the lack of support for consumer cards and home/academic devs in general has been disastrous (while Nvidia supports CUDA on basically every single GPU they've made since 2010) - just last week there was this mindblowing thread where it turns out an AMD employee is paying out of pocket for AMD GPUs to support build/CI for drivers on Debian. I mean, WTF, that's stupidity that's beyond embarrassing and gets into negligence territory IMO: https://news.ycombinator.com/item?id=37665784


Wow, that is really awkward... AMD should be donating the cards and even paying extra for the privilege - this is an important step for getting satisfied consumers. I hope they notice and rectify this situation so that Debian (and with it all downstream distros, like Ubuntu) can provide better support for their cards. I mean, that's a no-brainer...


>an AMD employee is paying out of pocket for AMD GPUs

I hope he's at least getting an employee discount! I guess AMD is not a fan of the 20% concept either


I was running llama on a w7900 a month ago, with 48gb of VRAM and excellent performance. ROCm support got a lot better really recently.


Inference ≠ Training



