Anyone following geohot's current tinygrad struggles has seen this proven right in front of them. AMD GPUs are practically unusable for any serious ML work, and he had to learn it the hard way, having dropped $100K into AMD GPUs assuming the drivers would work, and when they didn't, he even personally offered to fix them.
It's not just geohot: AMD has an amazing opportunity here to work with many other talented engineers who are more than willing to put resources into this if they had good documentation and tools to work with. It's really puzzling to me that AMD isn't taking this opportunity more seriously.
A shortened version of what was happening: the firmware/hardware is so bad that instead of fixing it, the AMD team just added automatic resets for when the card locks up (which happens all the time under compute workloads), and even those resets don't work; the whole computer has to be restarted.
This serious bug has been open since May, and AMD doesn't seem to be responding as seriously as it should.
Is this the same geohot that 9+ months ago declared he was "done with AMD"?
Isn't geohot infamous for stealing other people's work?
PEBCAK?
That said, ROCm only officially supports a fraction of its product line, and an odd smattering throughout at that. It's a joke compared to CUDA, which will run on damn near anything. And AMD has a long, long history of dogshit drivers (at least on Windows).
AMD just doesn't seem to give enough of a shit to invest money into securing top talent for this, and NVIDIA will continue to stomp them.
That same bug is still open, not fixed. Azure announced access to an AMD GPU cloud under NDA, but the cards are unusable for compute work because they lock up randomly.
AMD is a deeply unserious company. They could have made a boatload of money for shareholders like Nvidia did, but AMD management looks very bad to me.
Shareholders of AMD should look into it and keep firing top executives/the CEO until morale improves.
I think the problem AMD has is that they just don't have enough engineers and can't hire more, because Nvidia (and to a lesser extent Apple, AWS, Google, and Microsoft) just gobbles up everyone who has any experience with this sort of thing.
A long time ago AMD decided to focus 100% on budget consumer graphics (including consoles), and that was the right decision at the time. However, being in a low-margin business, it seems they don't have the people (or the budget to hire them at the last minute) to pump out the R&D for a generic neural-network platform without pulling people away from their consumer graphics division.
I don't understand this - aren't almost all ML NN models built in PyTorch, and aren't these compiled/JIT'd into a lower-level format - and can we not have various backends/drivers for that, such as CUDA / ROCm / VNNI?
The article is unsatisfying because it doesn't explain WHY CUDA reigns supreme.
One hypothesis put forward is that the main alternative, ROCm, is just not very complete and not very fast - that's a good argument.
Another hypothesis that is not considered: CUDA reigns supreme because NVIDIA GPUs reign supreme.
But people don't write CUDA code... they write PyTorch code?!
Nobody else seems to be willing to invest serious funding, including market-rate pay for SWEs, into compelling alternatives. I believe AMD's TC for senior software engineers tops out at around $200k in the Bay Area.
The problems you generally experience are:
* Inexplicably poor performance
* Poor (and sometimes incorrect) documentation
* Difficulties debugging
* Crashes and hangs
I started playing around with porting some CUDA code to ROCm/HIP on a Ryzen laptop APU I had. While an "unsupported" configuration (which was understood), it all worked until AMD suddenly and explicitly blocked the ability to run on APUs. Currently the only way to get back to work on that project on that particular computer would be to run a closed-source patched driver from some rando on the internet. Needless to say, I lost interest.
Last I checked, there were only 7 consumer SKUs that could run AMD's current compute stack, the oldest being one generation old. Even among the enterprise hardware they only support ~2 generations back. So you can't even grab some old, cheap recycled gear on eBay to hack on their ecosystem.
Meanwhile, I can pull anything with an NVIDIA logo on it from a junkyard and it'll happily run CUDA code that I wrote for the 8800 GTX 15+ years ago.
I'm an AI compiler engineer, and AMD's hiring process was... non-competitive. Companies are hiring left and right at a fast clip, and here's AMD wanting you to fly out in a month. I love their CPUs, but... come on. You've got to be serious to compete.
They do write CUDA code, oh boy do they ever. PyTorch is just a coordinator for CUDA or sometimes Metal kernels. New AI architectures and algorithms often end up needing a new or tuned kernel. Look at Flash Attention for an example of one of those that had a big impact.
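To make that concrete, here's a minimal hypothetical sketch (nothing to do with Flash Attention itself, and the names are made up) of the kind of hand-written kernel that sits behind a single framework op: a fused bias-add + ReLU that makes one pass over memory where two stock ops would make two. Something like this would then get exposed to PyTorch through its C++/CUDA extension machinery.

```
// Hypothetical fused bias-add + ReLU kernel: one pass over global memory
// instead of two separate elementwise ops.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fused_bias_relu(const float* __restrict__ x,
                                const float* __restrict__ bias,
                                float* __restrict__ y,
                                int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < rows * cols) {
        float v = x[i] + bias[i % cols];             // bias-add (row-major layout)
        y[i] = v > 0.0f ? v : 0.0f;                  // ReLU, fused in the same pass
    }
}

int main() {
    const int rows = 1024, cols = 512, n = rows * cols;
    float *x, *b, *y;                                // unified memory keeps the demo short
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&b, cols * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i)    x[i] = (i % 7) - 3.0f;
    for (int j = 0; j < cols; ++j) b[j] = 0.5f;

    fused_bias_relu<<<(n + 255) / 256, 256>>>(x, b, y, rows, cols);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x); cudaFree(b); cudaFree(y);
    return 0;
}
```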
The tooling around ROCm is not as good (debuggers, profilers, etc.), and at least in my tangential experience (that is, involving GPGPU computation, but not for ML), custom operations are faster when written in CUDA than in a high-level Python wrapper (or, for that matter, using tools like OpenMP). Just as we write all our genuinely performance-demanding code in C/C++, we write all our performance-sensitive GPU code in CUDA (and obviously, performance is the entire point of putting in the effort to write GPU code).
I wonder where Mojo (the new programming language from Chris Lattner's company) fits into all this? Their promise is to be a superset of Python (like C++ was to C) and resolve all the hardware interface issues.
I know it's still in development, but I'm curious whether someone has played around with it for the kind of needs discussed on this page.
> I don't understand this - aren't almost all ML NN models built in PyTorch, and aren't these compiled/JIT'd into a lower-level format - and can we not have various backends/drivers for that, such as CUDA / ROCm / VNNI?
PyTorch already does. But if you're saying "NN" and "PyTorch", that already means you're outside the audience for CUDA I'm talking about in the article. My own stuff was usually Bayesian hierarchical models, which at least at the time made PyTorch completely useless (that was nearly a decade ago though; maybe that specific use case has improved).
If you've tried to write genuinely new (or different enough) NNs or entirely different models, PyTorch is too high-level, and sometimes even TF is too. Even aside from that, if you're a maintainer of BLAS or of some specific library for sparse MM with very specific distributions that are optimized for it...
Anyway, those are the key cases, but even aside from that, if you've ever tried to do non-vanilla stuff even with some higher-level libraries, nothing works as well as it should. You get random, inscrutable errors that certainly do exist on NVIDIA GPUs/stuff-based-on-CUDA-under-the-hood, but way, way fewer of them. For newer, custom stuff, hitting numerical overflows or other completely breaking problems on alternative backends that don't happen / work just fine on the CPU or CUDA backend is not really that uncommon. Or the CUDA backend is just ridiculously faster. If you're doing something annoying, new, and complicated enough, there's no point in taking on the aggravation.
The people who write the stuff that is used in PyTorch or other libraries definitely write CUDA code (in C++ etc). And then the people who use PyTorch just build on top of that.
I deliberately tried to keep it accessible and have non-technical (or just non-software) audiences also be able to get an intuition for why CUDA has such strong lock-in. Otherwise, the pushback I've often gotten is "just rewrite it" or "it's just software" -- and if it were really that simple, people wouldn't need to be yelling so much at AMD across so many comments. Basically, people who can't fathom why software technical debt can ever be a thing. Or, if it is, China has infinite money and time anyway.
A high-level analysis would say that Huawei, AMD, and Intel should all easily be able to invest enough to make this work and compete with CUDA to push their hardware platforms. The reality is that decentralized decision-making by users also makes it an expensive, uncertain bet that people will adopt. A bunch of the lower-level, underlying libraries that things are built on, AND the researchers who do bleeding-edge research, still have a huge amount of experience in, and stuff built on, CUDA.
I am not sure CUDA is the moat, but yes, software is the moat.
To first order nobody writes any CUDA, and even if you do, you are probably bad at it. The language is slightly easier to use than OpenCL, but writing really performant code is still a nightmare (a pipeline of asynchronous memory copies from global to shared memory is not easy to program, but it is a requirement for full performance on tensor cores).
So no, the moat really isn't the language. It's not even the libraries; it's the integration of the libraries into third-party software like PyTorch, JAX, etc. This is the truly massive advantage NVIDIA has, and they got it by being early and by being installed in an awful lot of machines.
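For anyone curious what that pipelining actually looks like, here is a rough sketch using the libcudacxx cuda::pipeline API: a simplified, single-block, double-buffered global-to-shared copy with a stand-in "compute" step (TILE and the kernel are made up for illustration; real tensor-core kernels layer the MMA work on top of this, and the async copy is only hardware-accelerated on Ampere and newer).

```
// Double-buffered global->shared copy pipeline: fetch tile t+1 while tile t
// is being consumed. Needs a recent toolkit and -std=c++17 or newer.
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda/pipeline>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements per stage; also the block size here

__global__ void scale_tiles(const float* __restrict__ in, float* __restrict__ out,
                            int n_tiles) {
    __shared__ float buf[2][TILE];  // two buffers: copy one tile while using the other
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
    auto block = cg::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &state);

    int fetch = 0;
    for (int t = 0; t < n_tiles; ++t) {
        // Keep up to two copies in flight ahead of the compute below.
        for (; fetch < n_tiles && fetch < t + 2; ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, buf[fetch % 2], in + fetch * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();                       // tile t has landed in shared memory
        out[t * TILE + threadIdx.x] = buf[t % 2][threadIdx.x] * 2.0f;  // stand-in compute
        pipe.consumer_release();                    // free the buffer for the next copy
    }
}

int main() {
    const int n_tiles = 64, n = n_tiles * TILE;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);
    scale_tiles<<<1, TILE>>>(in, out, n_tiles);     // one block for simplicity
    cudaDeviceSynchronize();
    printf("out[513] = %f (expect %f)\n", out[513], 2.0f * 513);
    cudaFree(in); cudaFree(out);
    return 0;
}
```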
”To first order nobody writes any CUDA and even if you do you are probably bad at it” is such an anti-intellectual stance, and it gets repeated so often that it irks me. It's the speaker protecting their ego, and it gets said about everything they don't understand. It is said about compilers, about static typing, about pretty much anything the speaker does not yet know.
At least say why people wouldn't be good at it. The documentation is poor, the GPUs are a black box, or anything in that vein. Then they can help you learn instead of preemptively dismissing it.
I used to work at NVidia on the design of their tensor cores. As you can imagine, I had to be rather familiar with various kinds of high performance kernels that people are talking about in this thread.
I second the GP: nobody in their right mind would try to compete with the performance or functionality of libraries like cuDNN or cuBLAS.
NVidia pays for an army of exceptionally skilled folks to write these high performance kernels, working hand in hand with the architects that design the hardware, and with access to various sophisticated tools and performance models beyond what is available to the general public.
It would be like trying to compete against Olympians, to use an analogy that we can all understand.
I must be in a niche where we're consistently crushing cublas, cusolver and cudnn, sometimes cutlass with internship-level competency, mostly because our problem-sizes are not in the cone of optimization of the Olympians of NVIDIA. Large batches of small matrices, specific matrix forms, long kernel pipelines...
Also, until all of these libraries are made amenable to kernel fusion, or at least offer prologue/epilogue hooks, they can be beaten on memory bandwidth by pretty lowly-optimized fused kernels that avoid the extra global memory traffic.
I'm very glad cuFFT and cuBLAS are getting 'device' (Dx) versions, and NVIDIA is getting wiser on the kernel-fusion track. They're amazingly fast and game-changing but they're still not covering a big chunk of the original libraries.
Also, a lot of problems that are amenable to GPU compute are not expressed in BLAS/DNN terms and can still be very, very simply expressed as CUDA code, and still extract huge performance gains over CPUs, without a chance that the Olympians will ever take an interest in your problem space.
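As a concrete (hypothetical) illustration of that niche, here is roughly what the "huge batch of tiny matrices" pattern looks like: one thread block per 8x8 matrix, operands staged in shared memory, one thread per output element. Nothing clever, which is the point: for shapes outside the libraries' sweet spot, even code at this level can be competitive.

```
// Batched 8x8 matrix multiply: one block per matrix, one thread per C element.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 8;  // tiny matrix dimension

__global__ void batched_matmul_8x8(const float* A, const float* B, float* C,
                                   int batch) {
    int m = blockIdx.x;                   // which matrix in the batch
    if (m >= batch) return;
    int r = threadIdx.y, c = threadIdx.x;

    __shared__ float sA[N][N], sB[N][N];
    sA[r][c] = A[m * N * N + r * N + c];  // stage both operands in shared memory
    sB[r][c] = B[m * N * N + r * N + c];
    __syncthreads();

    float acc = 0.0f;
    for (int k = 0; k < N; ++k)           // each thread computes one C element
        acc += sA[r][k] * sB[k][c];
    C[m * N * N + r * N + c] = acc;
}

int main() {
    const int batch = 100000, elems = batch * N * N;
    float *A, *B, *C;
    cudaMallocManaged(&A, elems * sizeof(float));
    cudaMallocManaged(&B, elems * sizeof(float));
    cudaMallocManaged(&C, elems * sizeof(float));
    for (int i = 0; i < elems; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(N, N);                      // 64 threads: one per output element
    batched_matmul_8x8<<<batch, block>>>(A, B, C, batch);
    cudaDeviceSynchronize();
    printf("C[0] = %f (expect %f)\n", C[0], 2.0f * N);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```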
> we're consistently crushing cublas, cusolver and cudnn, sometimes cutlass.
I know you probably don't mean to say that Nvidia can't write good CUDA, but this does sort of illustrate how hard that is. I've seen similar cases (tiny matrix multiplied by enormous matrix) in which it was possible to write something faster than Nvidia's library. I'm not sure if this has been addressed since though.
> they can be beaten on memory bandwidth with pretty lowly-optimized kernels
This is partly why I believe most CUDA code probably isn't "good" - there's this enormous gulf between acceptable and good which often isn't worth crossing.
I meant to say that they optimized deeply for known and popular use cases, and that it doesn't take an ungodly amount of expertise to perform better, depending on the way you express your problem, its dimensions, or whatever they didn't cover (edit to add: if your use case doesn't fit).
I also meant to say that the domain is full of low-hanging fruit if your problem falls outside what NVIDIA has optimized deeply. An intern may beat the cuXXX libraries with a little work, and you can work your way up to max perf, yes, with serious effort.
There are probably thousands of man-hours sunk into BLAS on Intel hardware, and anyone who has seriously tried to do AVX2/AVX-512 knows it's hard to reach actual max perf on all problems. Yet I don't read 'only Intel experts can write efficient code'. It's no more true for CUDA than for other parallel or memory-weird architectures I've worked on. Yes, it's different, but getting max perf has always been hard on any modern hardware.
As for the gulf between acceptable and good, the problem is similar here too: people stop when they've reached their goal or feel they can scale more efficiently by other means. I really don't see the difference from heavily optimized x86 stuff. We keep seeing new things you can do to improve AVX-512 code, or new places where you can apply it (JSON parsing, UTF validation...), and it's been out for a while too. There hasn't been any free lunch there for a long, long time.
I don't think I'm saying anything revolutionary or derogatory when I say that e.g. linear algebra with big batches of small complex-valued matrices, or thin/very-tall matrix multiplication, or 1D-complex convolutions with large filters are not in the main path of the NVIDIA engineers (I did say 'niche').
Some things are not heavily optimized by NVIDIA, it's fine, and a good thing too that they can focus their effort on what's useful to the overall community.
What I'm saying is that very often a naive kernel written by hand and optimized by a non-expert for some months can reach better performance than library code that isn't optimized for niche use cases. Which is a testament to how easy it is to get good or OK (not optimal) performance...
I don't know about PyTorch (I was talking about niche use cases?), but TensorRT allows custom kernels, and it's worth using them and plonking in a house-implemented kernel if you know what your bottleneck is and no one has bothered writing a less generic version yet... again, intern-level competency (not senior CUDA optimizer).
Sorry, I thought this article/thread was all about pyTorch/AI and NVidia's moat in this area vs AMD and other competitors, so my comments are written in that specific context.
If I have lost track of the conversation, please accept my apologies.
I gave an example of why people wouldn't be good at it with the pipelined asynchronous memory copies. Take a look at the link below to the documentation. It's just plain difficult to do something as basic as moving data into shared memory efficiently. Others have given far more detailed responses.
You probably won't like this, but I'm also going to suggest you take a look at the HN guidelines about assuming good faith, and around responding to the argument instead of calling names. My comment might have irked you but that's not actually a basis for deciding I'm anti intellectual, that I'm protecting my ego, and that I really just need someone to help me learn.
I've worked in one of the top computing labs, with top GPU computing startups, have investor money from Nvidia, wrote CUDA for years, and hire people to write GPU code. And I would say most people -- even Nvidia employees and our own -- are individually bad at writing good CUDA code: it takes a highly multi-skilled team working together to make anything more than demoware. Most people who say they can write CUDA, when you scratch a little at the items I put below, you realize they can only do so for some basic one-offs. Think some finance person running one job for a month, but not the equivalent of a senior Python/Java/C++ developer shipping whatever reliable backend code they're hired to write, code that lives on.
To give a feel, while at Berkeley, we had an award-winning grad student working on autotuning CUDA kernels and empirically figuring out what does / doesn't work well on some GPUs. Nvidia engineers would come to him to learn about how their hardware and code works together for surprisingly basic scenarios.
It's difficult to write great CUDA code because it needs to excel in multiple specializations at the same time:
* It's not just writing fast low-level code, but knowing which algorithms to use. So you or your code reviewer needs to be an expert at algorithms. Worse, those algorithms are high-level, unknown to most programmers, and also specific to hardware models: think scenarios like NUMA-aware data-parallel algorithms for irregular computations. The math is generally non-traditional too, e.g., esoteric matrix tricks to manipulate sparsity and numerical stability.
* You ideally write for one or more generations of architectures, and each architecture changes all sorts of basic constants around memory/thread/etc. counts at multiple layers of the architecture. If you're good, you also add some sort of autotuning & JIT layer around that to adjust for different generations, models, and inputs (a toy sketch of that runtime-tuning idea follows right after this list).
* This stuff needs to compose. Most folks are good at algorithms, software engineering, or performance... not all three at the same time. Doing this for parallel/concurrent code is one of the hardest areas of computer science. Ex: Maintaining determinism, thinking through memory life cycles, enabling async vs sync frameworks to call it, handling multitenancy, ... . In practice, resiliency in CUDA land is ~non-existent. Overall, while there are cool projects, the Rust etc revolution hasn't happened here yet, so systems & software engineering still feels like early unix & c++ vs what we know is possible.
* AI has made it even more interesting nowadays. The types of processing on GPUs are richer now, multi+many GPU is much more of a thing, and disk IO as well. For big national lab and genAI foundation model level work, you also have to think about many racks of GPUs, not just a few nodes. While there's more tooling, the problem space is harder.
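As promised above, a toy sketch of just the "adjust for different generations/models" bullet: rather than hard-coding launch constants, query the device and let the occupancy API pick a block size at runtime. The kernel and sizes are made up; real autotuning layers go much further, timing candidate variants and caching or JITting specializations.

```
// Runtime launch tuning: query the device, let the occupancy API pick sizes.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)       // grid-stride loop: any launch shape works
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // the basic constants differ per architecture
    int min_grid = 0, block = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, saxpy, 0, 0);
    printf("%s (SM %d.%d): launching %d blocks of %d threads\n",
           prop.name, prop.major, prop.minor, min_grid, block);

    saxpy<<<min_grid, block>>>(3.0f, x, y, n);
    cudaDeviceSynchronize();
    printf("y[0] = %f (expect 5.0)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```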
This is very hard to build for. Our solution early on was figuring out how to raise the abstraction level so we didn't have to. In our case, we figured out how to write ~all our code as operations over dataframes that we compiled down to OpenCL/CUDA, and Nvidia thankfully picked that up with what became RAPIDS.AI. Maybe more familiar to the HN crowd, it's basically the precursor and GPU / high-performance / energy-efficient / low-latency version of what the duckdb folks recently began on the (easier) CPU side for columnar analytics.
It's hard to do all that kind of optimization, so IMO it's a bad idea for most AI/ML/etc teams to do it. At this point, it takes a company at the scale of Nvidia to properly invest in optimizing this kind of stack, and software developers should use higher-level abstractions, whether pytorch, rapids, or something else. Having lived building & using these systems for 15 years, and worked with most of the companies involved, I haven't put any of my investment dollars into AMD nor Intel due to the revolving door of poor software culture.
Chip startups also have funny hubris here, where they know they need to try, but end up having hardware people run the show and fail at it. I think it's a bit different this time around because many can focus just on AI inferencing, and that doesn't need as much of what the above is about, at least for current generations.
Edit: If not obvious, much of our code that merits writing with CUDA in mind also merits reading research papers to understand the implications at these different levels. Imagine scheduling that into your agile sprint plan. How many people on your team regularly do that, and in multiple fields beyond whatever simple ICML pytorch layering remix happened last week?
If there is a niche that is at the intersection of multiple specialties, and it includes GPU acceleration, there is a good chance it is ripe for a startup to get an early-mover advantage. E.g., real-time foundation models for audio in non-English/non-Chinese languages that run small & offline in cars.
Unfortunately, Nvidia has a culture of open sourcing all CUDA code, so if any startup shows something works commercially, Nvidia will rewrite, likely ultimately better, and give away for free, so more companies will do it and buy more GPUs.
If I were any of these companies, I'd totally invest many billions in the ecosystem here. TensorFlow (Google) and PyTorch (Facebook) are great examples; it can work. Otherwise, hw companies will continue to lose relevance in the growing server market, and SW companies will have an ever-growing Nvidia tax.
But it's not easy for the hw cos. OpenCL was more of a hw-company thing (Intel, AMD, mobile chip cos), and while they spend billions on adventures all the time, their SW leadership culture has been bad. They fail to make sustained & deep ecosystem investments, and instead look like small feudal orgs that get their projects pulled arbitrarily whenever the VPs rearrange themselves. For example, Intel bringing back its old CEO was a scary signal to me on this front. Intel specifically had the internal talent (I'm not sure if they still do), just not at the management level, and definitely not culturally at the highest leadership level.
Jensen at Nvidia has always been a special CEO here, even when they were helping game companies make their engines, and I'm guessing that taught him the value of long-term vertical SW & ecosystem investment. Instead of Intel unifying on x86 and c++ (compilers, vtune, Intel tbb, ...), and letting Microsoft / Linux / DB people go higher, Jensen went all the way up the stack to get at full utilization, and unified teams internally on that over 1-2 decades.
Apple is a funnier case. I can see them doing it and then pulling the plug. Eg, Chris Lattner making Swift and then they failed to retain him, and their revolving door of frameworks overall. Internally, they do have the technical talent and $, but I don't understand the culture and commercial alignment.
Finally... I do think the increasing importance of AI inferencing, and yet its simultaneous simplicity, has opened a disruption opportunity here. We are still at a tiny % of where it is going. The ONNX, PyTorch, transformers, etc. ecosystems are still in their early days from that perspective. It's fast for a hardware co like Groq to port a new model. So I don't rule out big changes here, and those being used to drive the rest of the ecosystem, like your q on ROCm.
ML researchers in statistics departments write stuff in R, which makes everyone scream. ML researchers absolutely do.
My point in the article was basically that the class was "indoctrinating" (too strong a word, but you get the point) future ML researchers in the superiority of using CUDA, with NVIDIA spending company resources to continuously do so in these classes, year after year.
This hits the nail on the head. Nvidia got all the programmers excited about using their GPUs first and now they have all the software targeting their hardware.
If you could compile CUDA for Intel and AMD it's not going to perform well. When you program a GPU you aren't just writing task specific code, you are also writing hardware specific code. So having developer mindshare matters much more than having a nice programming language.
In ML many people write pytorch and not CUDA. But even in ML the choice of precision is driven by the data types Nvidia can deal with efficiently - this is a moat which is nothing to do with CUDA.
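A small hypothetical example of that data-type point: the same axpy-style loop written against packed half-precision types and intrinsics from cuda_fp16.h. The math is unchanged; what changed is that the code is now written in terms the vendor's hardware is good at (needs an sm_53+ GPU for fp16 arithmetic).

```
// Packed fp16 axpy using cuda_fp16.h intrinsics (compile with -arch=sm_53 or newer).
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void fill(__half2* x, __half2* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { x[i] = __float2half2_rn(1.0f); y[i] = __float2half2_rn(2.0f); }
}

__global__ void half2_axpy(const __half2* x, __half2* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 a = __float2half2_rn(3.0f);        // scalar broadcast into both 16-bit lanes
    if (i < n) y[i] = __hfma2(a, x[i], y[i]);  // fused multiply-add on two halves at once
}

__global__ void read_back(const __half2* y, float* out) {
    *out = __low2float(y[0]);                  // convert one lane back to fp32 to check
}

int main() {
    const int n = 1 << 20;                     // n packed pairs = 2n half values
    __half2 *x, *y; float* check;
    cudaMallocManaged(&x, n * sizeof(__half2));
    cudaMallocManaged(&y, n * sizeof(__half2));
    cudaMallocManaged(&check, sizeof(float));
    int grid = (n + 255) / 256;
    fill<<<grid, 256>>>(x, y, n);
    half2_axpy<<<grid, 256>>>(x, y, n);
    read_back<<<1, 1>>>(y, check);
    cudaDeviceSynchronize();
    printf("y[0] low lane = %f (expect 5.0)\n", *check);
    cudaFree(x); cudaFree(y); cudaFree(check);
    return 0;
}
```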
Yes. Writing CUDA that calls cuBLAS or CUB is still writing CUDA. Lots of kernels and functions (functors, etc.) are "business requirements" more so than math libraries. It's no different from the CPU world: there are far more CRUD apps than BLAS libraries written, and writing a CRUD app that calls a BLAS library doesn't mean you're not "writing CPU code". Someone has to write those systems of linear equations for BLAS to solve.
The world is deeper than just assembly and BLAS tuning, and you can get extremely far in CUDA just by gluing together the primitives they give. Python is popular in the AI/ML space, but far from the only way to do that.
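A minimal sketch of that "gluing primitives is still CUDA" point: a host program that hands a matrix multiply to cuBLAS. No hand-tuned kernel in sight, but you're still dealing with device buffers, synchronization, and cuBLAS's column-major conventions (sizes here are arbitrary; link with -lcublas).

```
// Host-side CUDA program that delegates C = A * B to cuBLAS.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 512;                       // all matrices n x n
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS is column-major; with constant-filled matrices the layout doesn't matter here.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();
    printf("C[0] = %f (expect %f)\n", C[0], 2.0f * n);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```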
What I said was "to first order nobody writes any CUDA". Using "to first order" in that way is probably an abuse of terminology, but my intent was to say the majority of people using GPUs do not write CUDA, not that literally nobody does (which would be absurd).
If I were an AMD shareholder, I'd seriously be considering a vote to remove CEO Lisa Su. They make nearly identical products to NVIDIA's, yet that other company is worth literally ten times as much, because PyTorch actually works on their cards. Why isn't she prioritizing firmware that doesn't crash?
> Why isn't she prioritizing firmware that doesn't crash?
I used to work in the GPU industry and this sort of view is both pervasive and misguided.
GPUs are immensely complex machines. It is really hard to get them to work, let alone work with high performance.
Because of this, and in spite of the amount of time and resources spent on validation and verification, the hardware often contains flaws. It is the responsibility of the drivers to work around these flaws in various ways. When a flaw hasn't been discovered and worked around yet, you perceive it as the GPU being unstable or crashing.
There is no fast simple solution to this. You need a finely tuned corporate machine from beginning to end. Better hiring processes, better management, better design processes, better verification processes, better software development practices, better marketing and sales, better customer relations. Everything.
>GPUs are immensely complex machines. It is really hard to get them to work, let alone work with high performance.
This is like saying combustion engines are immensely complex machines when your car suddenly loses power on the highway for no apparent reason and then when you restart the engine it works for another five minutes again. When you drive on normal roads it works flawlessly. It must be the engine, right? After all, it is the most complicated aspect!
Except in reality it is far more likely for it to be a problem in the electronics driving the fuel pump or spark plug.
AMD most likely has some sort of buffer overflow or deadlock in their GPU drivers that is causing difficult to diagnose problems. It is very unlikely that the silicon itself is broken when it works fine for playing video games and it also works fine when your GPU is one of the few officially supported by ROCm.
> AMD most likely has some sort of buffer overflow or deadlock in their GPU drivers that is causing difficult to diagnose problems. It is very unlikely that the silicon itself is broken when it works fine for playing video games and it also works fine when your GPU is one of the few officially supported by ROCm
Thank you for sharing your opinion. My experience writing GPU device drivers was different.
Drivers are relatively simple compared to the underlying hardware and the hardware programming interface between the two reflects that. As a result of that, driver developers spend a ton of their time chasing down hardware bugs. Drivers are also intrinsically simpler to debug, not just because they are smaller but also because you often have better tools to inspect what is going on.
Another factor to consider is that software bugs are fixed, while hardware bugs are most often worked around in software. This is done out of necessity, because the process of spinning a new hardware revision is extraordinarily expensive and avoided at all cost.
But again, it's just how things went down in my personal experience and yours may be different.
Not if this moat could be leveraged into a monopoly on AI chips, to the detriment of society.
I want to see competition in this space.
Unfortunately, the market rally in Nvidia stock suggests that most investors expect this monopoly to eventuate.
Therefore, it is in society's interest to ensure that such a software moat is not established. Look what happened to the web browser when Microsoft held a monopoly on it, and look at what is happening with Chrome, the Apple App Store, etc.
> Look what happened to the web browser when Microsoft held a monopoly on it, and look at what is happening with Chrome, the Apple App Store, etc.
Realistically, what happened is that after a few decades of development, competitors arose and took the market. In the meantime, Microsoft became rich. Who cares?
So why isn't there a Windows competitor today in the consumer PC market? Surely it's a bigger market than AI chips.
The answer is that the entrenchment of the tools, the software, and the inertia of a de facto standard is what prevents new entrants. The time to stop it is to nip it in the bud: prevent the monopoly from forming, rather than hope that after the monopoly forms, some competitor will break it.
Nobody is kneecapping anyone. The ask is to turn CUDA into a standard for which Nvidia is one implementation.
Instead, what you have today is this:
> You may not reverse engineer, decompile or disassemble any portion of the output generated using SDK elements for the purpose of translating such output artifacts to target a non-NVIDIA platform.
CUDA is a moat because AMD and Intel are run by morons^W^W^W run by people who can't swallow the fact that software is more important than hardware.
Intel should be shoveling out 16GB Arc graphics cards for free to every graduate program in the country who can fill out a web form. In a couple years, they'd displace NVIDIA.
AMD needs to be funding a CUDA shim that allows people to port stuff directly to their cards. And they need to NOT be segmenting the consumer and professional cards software ecosystems.
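For what it's worth, AMD's HIP is roughly that shim today: hipify does a mostly mechanical CUDA-to-HIP translation for code like the sketch below (the hipify equivalents in the comments are from memory, so treat them as approximate). For simple code the porting story is largely solved; the gaps are in the long tail of features and in which GPUs the stack actually supports.

```
// Plain CUDA; the comments show (approximately) what the HIP port looks like.
#include <cstdio>
#include <cuda_runtime.h>   // HIP: #include <hip/hip_runtime.h>

__global__ void add_one(float* data, int n) {            // kernel body is unchanged under HIP
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float* d;
    cudaMallocManaged(&d, n * sizeof(float));             // HIP: hipMallocManaged
    for (int i = 0; i < n; ++i) d[i] = float(i);

    add_one<<<(n + 255) / 256, 256>>>(d, n);              // HIP: same syntax under hipcc,
                                                          // or hipLaunchKernelGGL(...)
    cudaDeviceSynchronize();                               // HIP: hipDeviceSynchronize
    printf("d[10] = %f (expect 11.0)\n", d[10]);
    cudaFree(d);                                           // HIP: hipFree
    return 0;
}
```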
Yes, there has been progress. However, when you look at the amount of money that AMD and Intel throw at software vs how much NVIDIA throws at software, it's an instant facepalm moment.
NVIDIA is 100% vulnerable--if it weren't for the fact that their competitors are idiots.
>NVIDIA is 100% vulnerable--if it weren't for the fact that their competitors are idiots.
I think Nvidia sees it too. That's why they're moving upstream by providing the entire stack: CUDA, GPUs, interconnect chips, networking chips, racks, OS, software, models.
I think the "CUDA moat" people like OP are underselling Nvidia. They're positioning themselves as the full-stack AI provider. Forget CUDA.
The moat that CUDA currently provides is what gives Nvidia the room to move up. CUDA is a stepping stone - something stable they can rely on to cement an even higher position (tell me that the full stack is less vendor-locked-in than CUDA and I'll have a bridge to sell you).
How hard do you think it is to find engineers who are
- Great at legacy C++ code.
- Great at new C++ code.
- Great at embedded/high performance/distributed code.
- Are experts in Linear Algebra and Calculus
- Are competent at Machine Learning and similar problems.
Now imagine, that after you find ~10-50 competent senior engineers who can each segment and train 1-5 engineers, you also need to hire 10-20 managers, PMs and directors who are smart enough to do more than "copy NVidia's offering from last year", and wise enough to still build a 1:1 compatibility layer.
Apple is likely seeing more traction on their metal API by virtue that it is reasonably well guaranteed to be around in ~5 years, and is common on multiple device platforms that students/devs use or customers deploy.
All of this describes video game programmers to me (well, 4 out of 5 at least). Given that there have been thousands of game-industry layoffs recently, anyone looking to build their AI teams should be diving through LinkedIn looking for laid-off video game programmers.
My understanding of the games industry is that a small fraction of game programmers are working on core game engines and low level graphics kernels. Is that inaccurate?
We're talking about a trillion dollars of market cap here. If the difficulty is in hiring up to ~70 people with specialized but not obscure skills, perhaps the executives should be revisited.
It's kind of hilarious to be saying that Apple is more likely to be seeing traction on Metal of all things, when all but the last one of those requirements fit graphics programmers in Vulkan or DirectX, both of which have far more traction than Metal, and that last requirement is pretty easy to pick up if you're an expert in linear algebra and calculus.
It gets even stranger when considering that as major GPU makers, both AMD and Intel have lots of access to such talent.
Vulkan only has traction on Android, and a couple of Linux titles.
Metal has 20% of the desktop market, and the whole of the iOS/iPad/watchOS markets combined.
Even with Android market share, many folks keep using OpenGL ES, because Vulkan tooling on Android sucks and isn't available to Java/Kotlin developers like OpenGL ES is, so only game engines like Godot/Unreal/Unity make use of Vulkan in practice.
Genuine questions. What are your use cases? What do you do? How much experience?
My personal experience shows CUDA to in fact be a very deep moat. In ~12 years of CUDA and ~6 of ROCm (since Vega), I've never met a professional who says otherwise, including those at top500.org AMD sites.
From what I've seen online, this take really seems to come from some kind of Linux-desktop Nvidia grudge/bad experience, or just good ol' gaming/desktop team-red-vs-green-vs-blue nonsense.
Many things can be said about Nvidia and all kinds of things can be debated but suggesting that Nvidia has > 90% market share simply and solely because people drink Nvidia kool-aid is a wild take.
I have 40+ yrs of HPC/AI apps/performance engineering experience & I was one of the 1st people to port LAPACK and a number of other numerical libs to CUDA. Moreover, many of those major DoE + AI sites are my customers.
You should not confuse AMD's general & long-standing indifference/incompetence wrt SW with the actual difficulty of providing a portable SW path for acceleration. As Woody Allen once said: "90% of success is showing up"
But what happened in AI, when, in a very short period of time, almost everyone moved away from writing their code directly in CUDA to writing it in frameworks like TensorFlow and PyTorch, is all the evidence anyone needs to show just how weak that SW obstacle is.
I'm working on a project ATM at one of the DoE sites you're likely referring to... Maybe we'll bump into each other!
Ah yes, PyTorch:
1) Check the issues, PRs, etc. on the torch GitHub. Considering its market share, ROCm has a multiple of the number of open and closed issues. There is still much work to be done for things as basic as overall stability.
2) torch is the bare minimum. Consider flash attention. On CUDA it just runs, of course, with sliding-window attention, ALiBi, and PagedAttention. The ROCm fork? Nope. Then check out the xFormers situation on ROCm. Prepare to spend your time messing around with ROCm, spelunking GH issues/PRs/blogs, etc., and going one by one through frameworks and libraries instead of `pip install` and actually doing your work.
3) Repeat for hundreds of libraries, frameworks, etc depending on your specific use case(s).
Then, once you have a model and need to serve it up for inference so your users can actually make use of it and you can get paid? With CUDA you can choose between torchserve, HF TEI/TGI, Nvidia Triton Inference Server, vLLM, and a number of others. vLLM has what I would call (at best) "early" support that requires patches to ROCm, isn't feature complete, and regularly has commits to fix yet another show-stopping bug/crash/performance regression/whatever.
Torch support is a good start but it's just that - a start.
I almost spewed my coffee when reading your grandparent comments.
One of the first teams that ported LAPACK to CUDA (CULA) is apparently being paid handsomely by Nvidia [1],[2].
Interestingly, DCompute is a little-known effort to support compute on CUDA and OpenCL in the D language, and it was done by a part-time undergrad student [3].
I strongly believe we need a very capable language to make advancement much easier in HPC/AI/etc., and the D language fits the bill and then some. Heck, it even beat the BLAS libraries that other so-called data languages, namely Matlab and Julia, still heavily depend on for their performance to this very day, and it did so in style back in 2016, more than seven years ago [4]. The DCompute implementation by the part-timer in 2017 actually depended on this native D implementation of the linear algebra routines, Mir [5].
[1] CULA: hybrid GPU accelerated linear algebra routines:
I got paid to do the LAPACK port, back in the mid 2000s, for a federal contractor working on satellite imaging type apps. I was still a good coder, back then... Took me about a month, as I recall. Maybe 6 weeks.
But I'm one of those old-school HPC guys who believes that libraries are mostly irrelevant, and absolutely no substitute for compilers and targeted code generation.
Julia is cool, btw. It could very well end up supplanting Fortran, once they fix the poor code-generation performance issues.
I think you are right about the libraries; that's why there's currently an initiative in the D ecosystem to make the D compiler, DMD, available as a library, the aim probably being that the compiler should be the only thing needed to run the library, without extra code [1].
I really wish some modern language would try supplanting Fortran for HPC, and personally my bet is on D.
Mostly true, and you'll get no argument from me on the "AMD & Intel are fuckwits" front. Intel does OK, but AMD in particular has completely dropped the ball on the SW front, and has been doing so for at least 25 yrs.
The point I was glibly trying to get across was that even a small effort on the part of AMD to treat the SW side as seriously as NVidia does would have yielded great benefits, and not have left them so far behind.
Also, there is a lot of work going on in the gcc & llvm toolchain to not only use OpenMP to target accelerators in computationally intensive loops but, in the case of llvm, to also target tensor instructions for more efficient code generation (https://lists.llvm.org/pipermail/llvm-dev/2021-November/1537...).
It took the AI folks less than 18 months to almost completely move away from CUDA to TensorFlow and then PyTorch... LLVM, imho, is going to do the same for sci/eng and general code bases in the next 2 years.
> AMD needs to be funding a CUDA shim that allows people to port stuff directly to their cards. And they need to NOT be segmenting the consumer and professional cards software ecosystems.
The problem being that despite years of work and despite all the marketing hype, it's still missing basic features that are 10+ years old on the Nvidia side. If you can't do dynamic parallelism, then kernels can never launch kernels, for example. It has "partial support" for texture unit access. Inter-process communication is not supported. Etc.
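For readers who haven't used it: dynamic parallelism just means a kernel can launch other kernels on its own, without a round trip to the host. A minimal sketch of the feature in question (hypothetical kernels; compile with nvcc -rdc=true, sm_35 or newer, plus -lcudadevrt on older toolkits):

```
// Dynamic parallelism: a parent kernel launches a child grid from the device.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void child(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parent(float* data, int n) {
    // A single device thread decides, at runtime, to launch more GPU work.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        child<<<(n + 255) / 256, 256>>>(data, n);
}

int main() {
    const int n = 1024;
    float* d;
    cudaMallocManaged(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) d[i] = 1.0f;
    parent<<<1, 32>>>(d, n);
    cudaDeviceSynchronize();   // the parent grid finishes only after its child grid does
    printf("d[0] = %f (expect 2.0)\n", d[0]);
    cudaFree(d);
    return 0;
}
```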
I don't know much about CUDA and NVIDIA, but it has always surprised me how hardware companies are so bad at producing good software tooling for their hardware.
Many microcontroller companies have terrible software support: no free C/C++ compilers, clunky IDEs, too much reliance on 3rd party software providers, no decent code libraries...
Even if they have software support, the code is bad and bloated. Look at ST's HAL libraries, for example. Thankfully, an open source or free tool often comes to the rescue, usually through the efforts of dedicated individual programmers. But billion-dollar companies relying on such 3rd party tooling seems insane to me.
Chasing compatibility is a waste of time and ultimately counterproductive. The important software is open source, they can just add direct support for their stuff. What they need to do is fix the stability of their drivers, make their stuff work on every GPU they sell or have sold in the past few years (as CUDA always has), and pay employees to integrate support into all the popular open source projects while fixing every bug that gets reported.
And they need to release high-RAM versions of their next gaming GPUs. More than anything else that will incentivize people to switch. If they're selling 36 GB while Nvidia is still selling 24 GB, people will do what it takes to move over.
> What they need to do is fix the stability of their drivers, make their stuff work on every GPU they sell or have sold in the past few years (as CUDA always has), and pay employees to integrate support into all the popular open source projects while fixing every bug that gets reported.
This takes a ton of employees, which is hard for a company with a fraction of Nvidia's software headcount. (On that note, there are 1185 engineering job postings on the AMD site right now... https://careers.amd.com/careers-home/jobs?categories=Enginee...)
Skimping on software is penny wise and pound foolish. This is worth a trillion in market cap. If it cost a hundred billion dollars it would still be worth it. They need to swallow their pride, admit they've been wrong this whole time, and start hiring software engineers like crazy.
That's why they have like 1000 job postings in a market where nobody is hiring. I keep trying to convince people I know to apply, but they think they're not good enough for some reason.
I don't think the person put in the clause by himself; surely AMD would have known and agreed to it. They would own the output of any of his work anyway; it was only with their approval that he was allowed to take ownership or release it. I do think this is splitting hairs: AMD isn't the savior here, but I wouldn't say they don't get any credit either.
Lock-in should be broken. CUDA is one of the worst things about this whole ecosystem. Looks like AMD came close to breaking it, but they abandoned developing the translation layer.
It would take away a huge chunk of their advantage, no doubt about it. Let Nvidia compete on merit instead of lock-in. Then you can say their advantage lies in being better. But Nvidia is very lock-in oriented, which undermines the claim that they are so much better than everyone.
CUDA IS the merit. They’ve been developing the software stack to make GPU programming accessible for 2 decades now. They’re a software company as much or more than they are a hardware company. Ignoring this fundamentally misunderstands why they’re in the position they’re in now.
It's not the merit - it's the moat (i.e. lock-in) as the linked article states. Merit in this context would mean something you can compare across different GPUs. For CUDA - you can't. I.e. it's a tool to force you to use Nvidia.
Chip War has a great section on how the Soviet Union tried a “just copy/steal” strategy in semiconductors and fell hopelessly behind because of it. It’s a great theoretical idea to just copy/steal and fast-follow, but semiconductors, AI, and other “harder technologies” require building human and intellectual capital that will get better with time. From there, you need to have the prior generation to keep up with ever-increasing complexity and difficulty as these things get more advanced.
I disagree with your section on Huawei and China. China isn't just trying to copy/steal AI. In terms of models, China is a bit behind in LLMs but arguably ahead in self-driving cars. China is throwing everything at semiconductor manufacturing instead, because that's where their bottleneck truly is - not CUDA. Had Huawei had access to TSMC's 5nm and 3nm, they might already be equal to Nvidia in raw GPU prowess. After all, HiSilicon's Kirin already matched/exceeded Qualcomm before the Trump ban. Their 5G chips/implementation were well ahead of anyone else's. In software, it's easier for China to adopt a CUDA alternative because China is usually really good at unifying under one vision - especially when it has to.