I've worked in one of the top computing labs, with top GPU computing startups, h...

j7ake · on March 24, 2024

Thanks for the insight. Looks like the principle of “doing things that don’t scale” works surprisingly well even in the ML space.

lmeyerov · on March 24, 2024

Agreed.

If there is a niche that is at the intersection of multiple specialties, and it includes GPU acceleration, there is a good chance it is ripe for a startup to get an early mover advantage. Eg, real-time foundation models for audio around non-english/non-chinese that works small & offline in cars.

Unfortunately, Nvidia has a culture of open sourcing all CUDA code, so if any startup shows something works commercially, Nvidia will rewrite, likely ultimately better, and give away for free, so more companies will do it and buy more GPUs.

aurareturn · on March 24, 2024

In your opinion, is it hopeless for something like ROCm to compete given that even CUDA is extremely hard for all parties?

What do you think about Apple's Metal?

lmeyerov · on March 24, 2024

If I was any of these companies, I'd totally invest many billions in ecosystem here. Tensorflow (Google) and pytorch (Facebook) are great examples, it can work. Otherwise, hw companies will continue to lose relevance in the growing server market, and SW companies will have an ever growing Nvidia tax.

But it's not easy for the hw co's. OpenCL was more of a hw company thing (Intel, AMD, mobile chip co's), and while they spend billions on adventures all the time, their SW leadership culture has been bad. They fail to do sustained & deep ecosystem investment, and instead look like small feudal orgs that get their projects pulled arbitrarily whenever the VPs rearrange themselves. For example, given that Intel brought back its old CEO, that was a scary signal to me for this front. Intel specifically had the internal talent, I'm not sure if they still do, just not at the management level, and definitely not culturally at the highest leadership level.

Jensen at Nvidia has always been a special CEO here, even when they were helping game companies make their engines, and I'm guessing that taught him the value of long-term vertical SW & ecosystem investment. Instead of Intel unifying on x86 and c++ (compilers, vtune, Intel tbb, ...), and letting Microsoft / Linux / DB people go higher, Jensen went all the way up the stack to get at full utilization, and unified teams internally on that over 1-2 decades.

Apple is a funnier case. I can see them doing it and then pulling the plug. Eg, Chris Lattner making Swift and then they failed to retain him, and their revolving door of frameworks overall. Internally, they do have the technical talent and $, but I don't understand the culture and commercial alignment.

Finally.. I do think the increasing importance of AI inferencing, yet simultaneous simplicity of it, has opened a disruption opportunity here. We are still at a tiny % of where it is going. Onyx, pytorch, transformers, etc ecosystem are still early days from that perspective. It's fast for a hardware co like Groq to port a new model. So I don't rule out big changes here, and those being used to drive the rest of the ecosystem, like your q on ROCm.