
If I were an AMD shareholder, I'd seriously be considering a vote to remove CEO Lisa Su. AMD makes nearly identical products to NVIDIA's, yet NVIDIA is worth literally ten times as much, because PyTorch actually works on its cards. Why isn't she prioritizing firmware that doesn't crash?


> Why isn't she prioritizing firmware that doesn't crash?

I used to work in the GPU industry and this sort of view is both pervasive and misguided.

GPUs are immensely complex machines. It is really hard to get them to work, let alone work with high performance.

Because of this, and in spite of the amount of time and resources spent on validation and verification, the hardware often contains flaws. It is the responsibility of the drivers to work around these flaws in various ways. When a flaw hasn't been discovered and worked around yet, you perceive it as the GPU being unstable or crashing.

There is no fast, simple solution to this. You need a finely tuned corporate machine from beginning to end: better hiring processes, better management, better design processes, better verification processes, better software development practices, better marketing and sales, better customer relations. Everything.


> GPUs are immensely complex machines. It is really hard to get them to work, let alone work with high performance.

This is like saying combustion engines are immensely complex machines when your car suddenly loses power on the highway for no apparent reason, and after you restart the engine it runs for only another five minutes. When you drive on normal roads it works flawlessly. It must be the engine, right? After all, it is the most complicated part!

Except in reality it is far more likely for it to be a problem in the electronics driving the fuel pump or spark plug.

AMD most likely has some sort of buffer overflow or deadlock in their GPU drivers that is causing difficult to diagnose problems. It is very unlikely that the silicon itself is broken when it works fine for playing video games and it also works fine when your GPU is one of the few officially supported by ROCm.


> AMD most likely has some sort of buffer overflow or deadlock in their GPU drivers that is causing difficult to diagnose problems. It is very unlikely that the silicon itself is broken when it works fine for playing video games and it also works fine when your GPU is one of the few officially supported by ROCm

Thank you for sharing your opinion. My experience writing GPU device drivers was different.

Drivers are relatively simple compared to the underlying hardware, and the hardware programming interface between the two reflects that. As a result, driver developers spend a ton of their time chasing down hardware bugs. Drivers are also intrinsically simpler to debug, not just because they are smaller but also because you often have better tools to inspect what is going on.

Another factor to consider is that software bugs are fixed, while hardware bugs are most often worked around in software. This is done out of necessity, because spinning a new hardware revision is extraordinarily expensive and is avoided at all costs.

But again, it's just how things went down in my personal experience and yours may be different.


You want to fire someone who helped get AMD on top of Intel?

Pretty bad idea, especially in the midst of the AI hype.


AMD has a CPU division too, and Zen basically resurrected AMD against Intel.


"Why isn't she prioritizing firmware that doesn't crash?"

Why can't xyz company build apps/websites/products that don't have bugs?



