
Wow! It's incredible how Nvidia has created the dark voodoo magic, and how only they can deliver the strong juju for AI. How are they so incredibly smart and powerful!?

I wonder if it has anything to do with the strategy they used in 3D graphics, where game developers ended up writing for NV drivers in order to maximise performance, and Nvidia abused their market position and used every trick in the book to make AMD cards run poorly. People complained about AMD driver quality, but the actual problem was that they were not NV drivers and AMD couldn't defeat their software moat.

So here we are again, this time with AI. You'd think we'd have learnt our lesson, but people are fooled yet again: instead of understanding that diversity and competition are the lifeblood of their art, myopia and amnesia are the order of the day.

Tinygrad are doing god's work and I won't be giving Nvidia a single fucking cent of my money until the software is hardware neutral and there is real competition.




No, I think it is much more due to AMD failing to invest seriously in software at all. Honestly, they have had years. It is difficult to see how Nvidia abused their market position in AI when this is effectively a new market.

I find this vague reflexive anti-corpo leftism that seems to have become extremely popular post-2020 really tiresome.


For my part I'm not a leftist, and I'm not so much anti-corpo as pro-free market. I can still acknowledge that Nvidia is persistently pursuing anticompetitive policies and abusing their market position, without putting AMD on a pedestal or assuming it would be different if the shoe were on the other foot.


> I find this vague reflexive anti-corpo leftism that seems to have become extremely popular post-2020 really tiresome.

Ah ideology, such a great alternative to actual thinking. Don't investigate or reason, just blame it on the 'lefties'. Tiresome indeed.

Not sure how me simply stating the obvious makes me a 'lefty'. If you think monopolies, regardless of how they come about, are a good idea, and that companies should be allowed to lock up an important market for any reason, then that makes you a corporatist fascist, right? Wow, this mindless name calling is so much fun! I feel like a total genius.

The simple fact is that the nature of software, its complexity and its dependence on a multitude of fairly arbitrary technical choices, makes it very effective as a moat, even if that's not intentional. CUDA etc. is 100% a software compatibility issue, and that's it. There's more than one way to skin a cat, but we're stuck with this one. Nvidia isn't interested in interoperability, even though it's critical for the industry in the longer term. I wouldn't be either if it were money in my pocket.
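To make the moat concrete, here's a minimal sketch (assuming PyTorch; the specifics are my own illustration) of the kind of arbitrary convention that calcifies into lock-in:

    import torch

    # What most code in the wild does: hardcode Nvidia's backend name.
    # This raises on any machine without a CUDA device:
    #   x = torch.randn(512, 512, device="cuda")

    # The portable version is one extra line, but few bother to write it:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(512, 512, device=device)
    print(device, (x @ x.T).shape)

Tellingly, ROCm builds of PyTorch reuse the "cuda" device string for compatibility, which says a lot about how deep the convention already runs.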

The point that is entirely missed here is that we, as a community, are screwing up by not steering the field toward better hardware compatibility, as in anyone being able to produce new hardware. In the rush to improve on or try out the latest model or software we have lost sight of this, and it will be to our great detriment. With the concentration of money in one company we will have a lot less innovation overall. Prices will be higher and resources will be misallocated. Everyone suffers.

It's very possible that AI withers on the vine due to lagging hardware. It's going to need a lot of compute, and maybe a different kind of compute to boot. We may need a million or a billion times what we have to even get close to AGI. But if one company locks that up, and uses that position to squeeze every dollar out of its customers (really, have a look at the almost comical 'upgrades' Nvidia offers in their GPUs other than at the very high end), then it's going to take much longer to progress, and maybe we never get there because some small group of talented maverick researchers was never able to get their hands on the hardware they needed and never produced some critical breakthrough.


>The point that is entirely missed here is that we, as a community, are screwing up by not steering the field toward better hardware compatibility,

No, there is a limit to the amount of handwringing, begging, and crying the community can do; none of it would have forced AMD to take GPGPU computing seriously. CUDA didn't spring out of nowhere. It's 16 years old, and in that time many people have begged AMD to properly support OpenCL or ROCm. It's not the community's fault that AMD didn't take this field seriously until it was too late. Seriously, the consumer GPUs don't even get official ROCm support, but somehow it's nvidia's fault that AMD didn't care to support ROCm.

I'm sure AMD will wake up now that CUDA is a trillion dollar market, but it's unfair to blame users for supporting CUDA. nvidia has invested in open source for more than a decade now, and there were people who foresaw the current situation and tried to develop more open backends for frameworks like torch. Unfortunately, developers don't work for free; nvidia spent the money and AMD did not. It's not users' fault that they didn't work, for free, to get tensorflow working on AMD.

Geohot[1] nearly gave up on AMD entirely when their own drivers didn't work. This isn't new; AMD is culpable for the current situation, and the community didn't end up here out of indifference.

[1] https://github.com/ROCm/ROCm/issues/2198


Sarcasm aside...

Can we drop the "Nvidia is the only self-interested evil company in existence" schtick?

I'm not being "fooled" by anyone. I've been trying to use ROCm since the initial release six years ago (on Vega at the time). I've spent thousands of dollars on AMD hardware over the years hoping to see progress for myself. I've burned untold amounts of time fighting with ROCm, hoping it's even remotely a viable competitor to CUDA/Nvidia.

Here we are in 2024 and they're still doing braindead stuff like dropping a new ROCm release to support their flagship $1000 consumer card a full year after release...

ROCm 6 looks good? Check the docker containers[0]. Their initial ROCm 6 containers only supported Python 3.9 for some strange reason, even though the previous ROCm 5.7 containers were based on Python 3.10. Python 3.10 is more-or-less the minimum for nearly anything out there.
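If you're wondering why 3.9 is a dealbreaker, here's a minimal illustration (my own example, not from their containers; the function name is hypothetical). Annotation syntax that half the ecosystem now ships is a hard error before 3.10:

    # Works on Python 3.10+, raises TypeError at definition time on 3.9,
    # because `str | None` (PEP 604 unions on built-in types) only landed
    # in 3.10 and function annotations are evaluated eagerly by default.
    def load_checkpoint(path: str | None = None) -> dict:
        return {"path": path}

    print(load_checkpoint("model.pt"))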

It took them 1.5 months to address this... This is merely one example; spend some time actually working with this and you will find dozens of similar "WTF?!?" bombs all over the place.

I suggest you put your money and time where your mouth is (as I have) to actually try to work with ROCm. You will find that it is nowhere near the point of actually being a viable competitor to CUDA/Nvidia for anyone who's trying to get work done.
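If you do try it, the smoke test is short. A sketch, assuming a ROCm build of PyTorch:

    import torch

    # On ROCm builds torch.version.hip is set (it's None on CUDA builds),
    # and the HIP backend answers to "cuda" for API compatibility.
    print("hip runtime:", torch.version.hip)
    print("device visible:", torch.cuda.is_available())
    if torch.cuda.is_available():
        x = torch.randn(1024, 1024, device="cuda")
        print("matmul ok:", (x @ x).sum().item())

Getting that far is usually the easy part; it's everything after that falls over.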

> Tinygrad are doing god's work

Tinygrad is packaging hardware with off-the-shelf components plus a substantial markup. There is nothing special about this hardware, and they aren't doing anything you couldn't have done in the past year. They have been vocal in calling out AMD, but show me their commits to ROCm and I'll agree they are "doing god's work".

We'll save the work being done on their framework for another thread.

[0] - https://hub.docker.com/r/rocm/pytorch/tags


> There is nothing special about this hardware and they aren't doing anything you couldn't have done in the past year.

What they are doing is all of the hardware engineering work that it takes to build something like this. You're dismissing the amount of time they spent on figuring stuff like this out:

"Beating back all the PCI-E AER errors was hard, as anyone knows who has tried to build a system like this."


> "Beating back all the PCI-E AER errors was hard, as anyone knows who has tried to build a system like this."

Define "hard".

The crypto mining community has had this working for at least half a decade with AMD cards. With Nvidia it's a non-issue. I'd be very, very curious to get more technical details on what new work they did here.


I ran 150,000 AMD cards for mining and we didn't run into that problem because we bought systems with PCIe baseboards (12x cards) instead of dumb risers. I'd be interested in finding out more details as well, but it seems he doesn't want to share that in public.

That said, if you think any of this is easy, you're the one who should define that word.


I never used the word easy, I never used the word hard. He used the word hard, you used the word easy.

With that said.

Easy: Assembling off-the-shelf PC components to provide what is fundamentally no different from what gamers/miners build every day. Six cards in a machine and two power supplies is low-end mining. Also see the x8 GPU machines with multiple power supplies that have been around forever. I'm not quite sure why you're arguing this so hard; you're more than familiar with these things.

Hard: Show me something with a BOM. Some manufacturing? PCB? Fab? Anything.

FWIW, for someone who is frequently promoting their startup here, you come across as pretty antagonistic. I'm not attacking you, just saying that for someone like myself who has been intrigued by what you're working on, it gives me pause in terms of what I'd charitably refer to as potential personality/relationship issues.

Everyone has those days, just thought it was worth mentioning.


I totally get where you're coming from with the mining use case being straightforward. However, when it comes to AI, it's a different ball game. Each use case has its own set of requirements and optimizations, which is why many big mining operations find it challenging to shift towards AI. It's not just about assembling parts; it requires deeper technical know-how.

For mining, the focus is mainly on the GPUs, and specifics like bus speed or other components aren't as critical. You could get by with a basic setup: a $35 CPU, 4GB of RAM, a 100-meg network, and PXE booting without any local storage. Even older GPUs like the RX470s did the job perfectly until the very end.

But what George is working on is something else entirely. It's not just about the number of GPUs; it's about creating a cohesive system where every component plays its part and is configured correctly. This complexity of tying everything together is what makes it challenging. George is incredibly talented, and the fact that he's been dedicating himself to the tinybox project for a year now really speaks volumes about the intricacies involved.

Please don't think I'm trying to be confrontational - that's not my intention in the slightest. I appreciate your perspective, but I'm just trying to offer a different angle based on my own experience in this field.

While it might seem that this hardware isn't groundbreaking or that the developments could have been achieved earlier, it's important to recognize the innovation and hard work behind it. This isn't just about putting together existing pieces; it's about creating something that works better as a whole than the sum of its parts.

I'm confident that if we were to talk in person, we'd get along just fine.


isn't tinygrad's value add the software they provide on top of the open source drivers to make it all work? why should they commit to ROCm if that's the product they're trying to sell?



