Building a deep learning rig (samsja.github.io)
164 points by dvcoolarun 7 months ago | 106 comments



Note that they shared part two recently: https://samsja.github.io/blogs/rig/part_2/

For those talking about breakeven points and cheap cloud compute, you need to factor in the mental difference it makes running a test locally (which feels free) vs setting up a server and knowing you're paying per hour it's running. Even if the cost is low, I do different kinds of experiments knowing I'm not 'wasting money' every minute the GPU sits idle. Once something is working, then sure scaling up on cheap cloud compute makes sense. But it's really, really nice having local compute to get to that state.


Lots of people really underestimate the impact of that mental state and the activation energy it creates towards doing experiments - having some local compute is essential!


This. In the second article, the author touches on this a bit.

With a local setup, I often think, "Might as well run that weird xyz experiment overnight" (instead of idling). On a cloud setup, the opposite is often the case: "Do I really need that experiment, or can I shut down the server to save money?" Makes a huge difference over longer periods.

For companies or if you just want to try a bit, then the cloud is a good option, but for (Ph.D.) researchers, etc., the frictionless local system is quite powerful.


I have the same attitude towards gym memberships - it really helps to know I can just go in for 30 minutes when I feel like it without worrying whether I’d be getting my money’s worth.


It depends on your local energy prices and your internet speed as well. It may actually be cheaper and faster to spin up a cloud instance. I bought an 80 core server to do some heavy lifting back in 2020. It doesn't even have a GPU, but it costs me around 4 euros per day to run. For that price I can keep a cloud GPU instance running. And even the boot up time isn’t any slower.


What cloud GPU instance are you talking about here? Most of the GPUs cost around 2 to 40 dollars an hour. I would love to know the provider who is offering one for 4 dollars a day...


Runpod Cloud GPU is ~0.60 $/hr. 16 GB I think. (Not the serverless pods, mind you. Those are ~ 4x more expensive.)


That's a great point! I'd agree that just the extra emotional motivation from having your own thing is worth a ton. I get some distance down that road by having a large-RAM, no-GPU box, so that things are slow but at least possible for random small one-offs.


I was thinking of doing something similar, but I am a bit sceptical about how the economics on this works out. On vast.ai renting a 3x3090 rig is $0.6/hour. The electricity price of operating this in e.g. Germany is somewhere about $0.05/hour. If the OP paid 1700 EUR for the cards, the breakeven point would be around (haha) 3090 hours in, or ~128 days, assuming non-stop usage. It's probably cool to do that if you have a specific goal in mind, but to tinker around with LLMs and for unfocused exploration I'd advise folks to just rent.
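As a back-of-envelope check, here's that breakeven math as a tiny Python sketch (numbers are the ones quoted above, with EUR/USD mixed just as in the comment, and non-stop usage assumed):

    # Breakeven for buying a 3x3090 rig vs. renting an equivalent on vast.ai.
    # Assumptions: prices as quoted above, 100% utilization, constant rates.
    purchase = 1700           # EUR paid for the 3x3090 rig
    rent_per_hour = 0.60      # $/hour for a 3x3090 rig on vast.ai
    power_per_hour = 0.05     # $/hour of electricity when running locally

    savings_per_hour = rent_per_hour - power_per_hour
    breakeven_hours = purchase / savings_per_hour
    print(f"{breakeven_hours:.0f} hours (~{breakeven_hours / 24:.0f} days)")
    # -> roughly 3090 hours, i.e. about 128-129 days of non-stop use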


> On vast.ai renting a 3x3090 rig is $0.6/hour. The electricity price of operating this in e.g. Germany is somewhere about $0.05/hour.

Are you factoring in the varying power usage in that electricity price?

The electricity cost of operating locally will vary depending on the actual system usage. When idle, it should be much cheaper. Whereas in cloud hosts you pay the same price whether the system is in use or not.

Plus, with cloud hosts, reliability is not guaranteed. Especially with vast.ai, where you're renting other people's home infrastructure. You might get good bandwidth and availability on one host, but when that host disappears, you'd better hope you made a backup (which vast.ai charges for separately). If you did, you still need to spend time restoring it to another, hopefully equally reliable, host, which can take hours depending on the amount of data and bandwidth.

I recently built an AI rig and went with 2x3090s, and am very happy with the setup. I evaluated vast.ai beforehand, and my local experience is much better, while my electricity bill is not much higher (also in EU).


Well rented cloud instances shouldn't idle in the first place.


Sure, but unless you're using them for training, the power usage for inference will vary a lot. And it's cumbersome to shut down the instance while you're working on something else and have to start it back up when you need to use it again. During that time, the vast.ai host could disappear.


Most people don't think of storage costs and network bandwidth. I have about 2tb of local models. What's the cost of storing this in the cloud? If I decide not to store them in the cloud, I have to transfer them in anytime I want to run experiments. Build your own rig so you can run experiments daily. This is a budget rig and you can even build cheaper.


Let me add that moving data in and out of vast.ai is extremely painful. I might be overprivileged with a 1000 MBit line but these vast.ai instances have highly variable bandwidth in my experience; plus even when advertising good speeds I'm sometimes doing transfers in the 10-100 KiB/s range.


Data as well. I have a 100TB NAS I can use for data storage, and it was honestly pretty cheap overall.


Well if you are not using a rented machine during a period of time, you should release it.

Agreed on reliability and data transfer, that's a good point.

Out of curiosity, what do you use a 2x3090 rig for? Bulk, non-time-sensitive inference on down-quanted models?


> Well if you are not using a rented machine during a period of time, you should release it.

If you're using them for inference, your usage pattern is unpredictable. I could spend hours between having to use it, or minutes. If you shut it down and release it, the host might be gone the next time you want to use it.

> what do you use a 2x3090 rig for? Bulk, non-time-sensitive inference on down-quanted models?

Yeah. I can run 7B models unquantized, ~13-33B at q8, and ~70B at q4, at fairly acceptable speeds (>10tk/s).
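As a rough illustration of why those sizes fit in 48 GB, here's a weights-only VRAM estimate (my own sketch; it ignores KV cache, activations, and framework overhead, which add a few GB on top):

    # Weights-only VRAM needed at different quantization levels (illustrative).
    BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

    def weights_gib(params_billion: float, quant: str) -> float:
        return params_billion * 1e9 * BYTES_PER_PARAM[quant] / 1024**3

    for params, quant in [(7, "fp16"), (13, "q8"), (33, "q8"), (70, "q4")]:
        print(f"{params}B @ {quant}: ~{weights_gib(params, quant):.0f} GiB "
              "(2x3090 = 48 GB total)")
    # 7B fp16 ~13 GiB, 13B q8 ~12 GiB, 33B q8 ~31 GiB, 70B q4 ~33 GiB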


if you are just using it for inference, i think an appropriate comparison would just be like a together.ai endpoint or something - which allows you to scale up pretty immediately and likely is more economical as well.


Perhaps, but self-hosting is non-negotiable for me. It's much more flexible, gives me control of my data and privacy, and allows me to experiment and learn about how these systems work. Plus, like others mentioned, I can always use the GPUs for other purposes.


to each their own. if you are having highly sensitive conversations with your GAI that someone would bother snooping in your docker container, figuring out how you are doing inference, and then capturing it in real time - you have a different risk tolerance than me.

i do think that cloud GPUs can cover most of this experimentation/learning need.


together.ai is really good, but there is a price mismatch for small models (a 1B model is not 10x cheaper than a 10B model).

This is obviously because they are forced to use high-memory cards.

Are there ideal cards for small (1-2B) models? That is, higher flops/$ on crippled memory?


> built an AI rig and went with 2x3090s,

Is there a go-to card for small (1-2B) models?

Something with much better flops/$ but purposely crippled with low memory.


with runpod/vast, you can request a set amount of time - generally if I request from Western EU or North America the availability is fine on the week-to-month timescale.

fwiw I find runpod's vast clone significantly better than vast and there isn't really a price premium.


For me "economics" are:

- if I have it locally, I'll play with it

- if not, I won't (especially with my data)

- if I have something ready for a long run I may or may not want to send it somewhere (it's not going to be on 3090s for sure if I send it)

- if I have a requirement to have something public, I'd probably go for per-usage pricing with e.g. [0].

[0] https://www.runpod.io/serverless-gpu


With the current more-or-less dependency on CUDA and thus Nvidia hardware it's about making sure you actually have the hardware available consistently.

I've had VERY hit-or-miss results with Vast.ai, and I'm convinced people are cheating their evaluation stuff, because when the rubber meets the road it's very clear performance isn't what it's claimed to be. Then you still need to be able to actually get them...


use runpod, and yeah, i think vast.ai has some scams, especially in the Asian and Eastern European nodes.


For me the economics is when I'm not using it to do AI stuff, I can use it to play games with max settings.

Unfortunately my CFO (a.k.a Wife) does not share the same understanding.


I fear that someday I will die and my wife will sell off all my stuff for what I said I paid for it.

(not really, but it is a joke I read someplace and I think it applies to a lot of couples).


Unless you are training, you never hit peak watts. When inferring, the wattage is still minimal. I'm running inference now and using 20%. GPU 0 is using more because I have it as the main GPU. Idle power sits at about 5%.

Device 0 [NVIDIA GeForce RTX 3060] PCIe GEN 3@16x RX: 0.000 KiB/s TX: 55.66 MiB/s GPU 1837MHz MEM 7300MHz TEMP 43°C FAN 0% POW 43 / 170 W GPU[|| 5%] MEM[|||||||||||||||||||9.769Gi/12.000Gi]

Device 1 [Tesla P40] PCIe GEN 3@16x RX: 977.5 MiB/s TX: 52.73 MiB/s GPU 1303MHz MEM 3615MHz TEMP 22°C FAN N/A% POW 50 / 250 W GPU[||| 9%] MEM[||||||||||||||||||18.888Gi/24.000Gi]

Device 2 [Tesla P40] PCIe GEN 3@16x RX: 164.1 MiB/s TX: 310.5 MiB/s GPU 1303MHz MEM 3615MHz TEMP 32°C FAN N/A% POW 48 / 250 W GPU[|||| 11%] MEM[||||||||||||||||||18.966Gi/24.000Gi]
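If you want to log this kind of idle-vs-load power programmatically rather than eyeballing nvtop, a minimal sketch with the NVML Python bindings (assuming the nvidia-ml-py / pynvml package is installed) looks like:

    # Print current power draw, power limit, and utilization per GPU via NVML.
    import pynvml

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)
        if isinstance(name, bytes):          # older pynvml versions return bytes
            name = name.decode()
        power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000           # mW -> W
        limit_w = pynvml.nvmlDeviceGetPowerManagementLimit(h) / 1000
        util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
        print(f"GPU {i} ({name}): {power_w:.0f}/{limit_w:.0f} W, {util}% util")
    pynvml.nvmlShutdown()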


When you compute the breakeven point, did you factor in that you still own the cards and can resell them? I bought my 3090s for $1000, and after 1 year I think they'd go for more on the open market if I resold them now.


Interesting. I checked it out. The providers running your docker container have access to all your data.


I just made a clone of diskprices.com for GPUs specifically for AI training, and it has a power and depreciation calculator: https://gpuprices.us

You can expect a GPU to last 5 years. So for a 128-day breakeven you are only looking at about 7% utilization. If you are doing training runs, I think you are going to beat it easily.

P.S. Coincidentally or not, shortly after it got mentioned on Hacker News, Best Buy ran out of both RTX 4090s and RTX 4080s. They used to top the chart. Turns out at decent utilization they win due to the electricity costs.
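As an illustration of what such a power-plus-depreciation number looks like, here's a tiny sketch (with my own assumed figures for card price, wattage, and electricity, not the site's):

    # Effective $/hour of owning a GPU: straight-line depreciation + electricity.
    card_price = 1500        # assumed purchase price in $
    lifetime_hours = 5 * 365 * 24
    watts_at_load = 350      # assumed average draw while training
    electricity = 0.30       # assumed $/kWh (EU-ish price)

    depreciation = card_price / lifetime_hours
    power_cost = (watts_at_load / 1000) * electricity
    print(f"~${depreciation + power_cost:.3f}/hour at full utilization")
    # -> roughly $0.034 + $0.105 = ~$0.14/hour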


Exactly. And you rarely see machines from Germany on vast. Might as well run a data center in Bermuda. [0]

[0] https://www.royalgazette.com/general/business/article/202307...


the current economics is a lowball to get customers. it's absolutely not going to be the market price once commercial interests have locked in their products.

but if you're just goofing around and not planning to create anything production worthy, it's a great deal.


> the current economics is a lowball to get customers.

vast.ai is basically a clearinghouse. they are not doing some VC subsidy thing

in general, community clouds are not suitable for commercial use.


Well maybe you could rent it out to others for 256 days at $0.3/hour, tinker, and sell it for parts after you get bored with it. ;)


Breakeven point would be less than 128 days due to the (depreciating) resale value of the rig.


Well, almost. GPUs have not been depreciating. The cost of 3090s and 4090s has gone up. Folks are selling them for what they paid or even more. With the recent 40-series SUPER cards from Nvidia, I'm not expecting any new releases within a year. AMD & Intel still have a ways to go before major adoption. Startups are buying up consumer cards. So I sadly expect prices to stay more or less the same.


If it isn’t depreciating that supports the parent’s bigger point even more.


He can use these cards for 128 days non-stop and resell them, recouping almost the full purchase price since OP bought them cheap. Buying doesn't mean you have to use the GPUs to the point where they end up being worth 0; yes, there is risk of GPUs going bad, but c'mon... Renting is money you will never see again.



This is the new startup from George Hotz. I would like him to succeed, but I'm not so optimistic about their chances of selling a $15k box that is most likely less than $10k in parts. Most people would do much better by buying second-hand 3090s or similar and connecting them into a rig.


Not necessarily. I'm not sure about AMD GPUs, but he tweeted that AMD supports linking all 6 together. If that's the case, then 6 of those XTXs should crush 6 3090s. We techies will definitely decide to build vs. buy. However, businesses would definitely decide to buy vs. build.


I think businesses are more likely to rent and techies are more likely to build. So he is betting on a niche: techies who want to buy a $15k device, or companies that do not want to rent. Also companies willing to go the AMD GPU route instead of Nvidia, which has much better tooling and far more experts on the job market.


He keeps hopping from thing to thing


Sure, if you want to waste your time on getting stuff working on AMD instead of spending it on actual model training...


This.

People complain about the "Nvidia tax". I don't like monopolies and I fully support the efforts of AMD, Intel, Apple, anyone to chip away at this.

That said as-is with ROCm you will:

- Absolutely burn hours/days/weeks getting many (most?) things to work at all. If you get it working you need to essentially "freeze" the configuration because an upgrade means do it all over again.

- In the event you get it to work at all you'll realize performance is nowhere near the hardware specs.

- Throw up your hands and go back to CUDA.

Between what it takes to get ROCm to work and the performance issues the Nvidia tax becomes a dividend nearly instantly once you factor in human time, less-than-optimal performance, and opportunity cost.

Nvidia says roughly 30% of their costs are on software. That's what you need to do to deliver something that's actually usable in the real world. With the "Nvidia tax" they're also reaping the benefit of the ~15 years they've been sinking resources into CUDA.


Wow! It's incredible how Nvidia has created the dark voodoo magic, and how only they can deliver the strong juju for AI. How are they so incredibly smart and powerful!?

I wonder if it has anything to do with the strategy they used in 3D graphics, where game developers ended up writing for NV drivers in order to maximise performance, and Nvidia abused their market position and used every trick in the book to make AMD cards run poorly. People complained about AMD driver quality, but the actual problem was that they were not NV drivers and AMD couldn't defeat their software moat.

So here we are again, this time with AI. You'd think we'd learnt our lesson but instead people are fooled yet again and instead of understanding the importance of diversity and competition in the lifeblood of their art, myopia and amnesia is the order of the day.

Tinygrad are doing god's work and I won't be giving Nvidia a single fucking cent of my money until the software is hardware neutral and there is real competition.


No, I think it is much more due to AMD absolutely failing to invest heavily in software at all. Honestly, they have had years - it is difficult to see how Nvidia abused their market position in AI when this is effectively a new market.

I find this vague reflexive anti-corpo leftism that seems to have become extremely popular post-2020 really tiresome.


For my part, I'm not a leftist, and I'm not so much anti-corpo as pro-free market. I can still acknowledge that Nvidia is persistently pursuing anticompetitive policies and abusing their market position, without putting AMD on a pedestal and assuming it would be different if the shoe were on the other foot.


> I find this vague reflexive anti-corpo leftism that seems to have become extremely popular post-2020 really tiresome.

Ah ideology, such a great alternative to actual thinking. Don't investigate or reason, just blame it on the 'lefties'. Tiresome indeed.

Not sure how me simply stating the obvious makes me a 'lefty'. If you think monopolies, regardless of how they come about, are a good idea, and that companies should be allowed to lock up an important market for any reason, then that makes you a corporatist fascist, right? Wow, this mindless name-calling is so much fun! I feel like a total genius.

The simple fact is that the nature of software, its complexity and dependence on a multitude of fairly arbitrary technical choices, makes it very effective as a moat, even if that's not intentional. CUDA, etc. is 100% a software compatibility issue, and that's it. There's more than one way to skin a cat, but we're stuck with this one. Nvidia isn't interested in interoperability, even though it's critical for the industry in the longer term. I wouldn't be either if it were money in my pocket.

The point that is entirely missed here is that we, as a community, are screwing up by not steering the field toward better hardware compatibility, as in anyone being able to produce new hardware. In the rush to improve or try out the latest model or software we have lost sight of this, and it will be to our great detriment. With the concentration of money in one company we will have a lot less innovation overall. Prices will be higher and resources will be misallocated. Everyone suffers.

It's very possible that AI withers on the vine due to lagging hardware. It's going to need a lot of compute, and maybe a different kind of compute to boot. We may need a million or a billion times what we have to even get close to AGI. But if one company locks that up, and uses that position to squeeze out every dollar from its customers (really, have a look at the almost comical 'upgrades' Nvidia offers in their GPUs other than at the very high end) then it's going to take much longer to progress, and maybe we never get there because some small group of talented maverick researchers were never able to get their hands on the hardware they needed and never produce some critical breakthrough.


>The point that is entirely missed here is that we, as a community, are screwing up by not steering the field toward better hardware compatibility,

No, there is a limit to the amount of handwringing, begging, and crying the community can do that would have forced AMD to take GPGPU computing seriously. CUDA didn't spring out of nowhere; it's 16 years old, and in that time many people have begged AMD to properly support OpenCL or ROCm. It's not the community's fault that AMD didn't take this field seriously until it was too late. Seriously, the consumer GPUs don't even get official ROCm support, but somehow it's Nvidia's fault that AMD didn't care to support ROCm.

I'm sure AMD will wake up now that CUDA is a trillion-dollar market, but it's unfair to blame users for supporting CUDA. Nvidia has invested in open source for more than a decade now, and there were people who foresaw the current situation and tried to develop more open backends for frameworks like torch. Unfortunately, developers don't work for free; Nvidia spent the money and AMD did not. It's not users' fault that they didn't work, for free, to get TensorFlow working on AMD.

Geohot[1] nearly gave up on AMD entirely when their own drivers didn't work. This isn't new; AMD is culpable for the current situation. The community didn't end up here due to indifference.

[1] https://github.com/ROCm/ROCm/issues/2198


Sarcasm aside...

Can we drop the "Nvidia is the only self-interested evil company in existence" schtick?

I'm not being "fooled" by anyone. I've been trying to use ROCm since the initial release six years ago (on Vega at the time). I've spent thousands of dollars on AMD hardware over the years hoping to see progress for myself. I've burned untold amounts of time fighting with ROCm, hoping it's even remotely a viable competitor to CUDA/Nvidia.

Here we are in 2024 and they're still doing braindead stuff like dropping a new ROCm release to support their flagship $1000 consumer card a full year after release...

ROCm 6 looks good? Check the docker containers[0]. Their initial release for ROCm only supported Python 3.9 for some strange reason even though the previous ROCm 5.7 containers were based on Python 3.10. Python 3.10 is more-or-less the minimum for nearly anything out there.

It took them 1.5 months to address this... This is merely one example, spend some time actually working with this and you will find dozens of similar "WTF?!?" bombs all over the place.

I suggest you put your money and time where your mouth is (as I have) to actually try to work with ROCm. You will find that it is nowhere near the point of actually being a viable competitor to CUDA/Nvidia for anyone who's trying to get work done.

> Tinygrad are doing god's work

Tinygrad is packaging hardware with off the shelf components plus substantial markup. There is nothing special about this hardware and they aren't doing anything you couldn't have done in the past year. They have been vocal on calling out AMD but show me their commits to ROCm and I'll agree they are "doing god's work".

We'll save the work being done on their framework for another thread.

[0] - https://hub.docker.com/r/rocm/pytorch/tags


> There is nothing special about this hardware and they aren't doing anything you couldn't have done in the past year.

What they are doing is all of the hardware engineering work that it takes to build something like this. You're dismissing the amount of time they spent on figuring stuff like this out:

"Beating back all the PCI-E AER errors was hard, as anyone knows who has tried to build a system like this."


> "Beating back all the PCI-E AER errors was hard, as anyone knows who has tried to build a system like this."

Define "hard".

The crypto mining community has had this working for at least half a decade with AMD cards. With Nvidia it's a non-issue. I'd be very, very curious to get more technical details on what new work they did here.


I ran 150,000 AMD cards for mining and we didn't run into that problem because we bought systems with PCIe baseboards (12x cards) instead of dumb risers. I'd be interested in finding out more details as well, but it seems he doesn't want to share that in public.

That said, if you think any of this is easy, you're the one who should define that word.


I never used the word easy, I never used the word hard. He used the word hard, you used the word easy.

With that said.

Easy: Assembling off the shelf PC components to provide what is fundamentally no different than what gamers/miners build every day. Six cards in a machine and two power supplies is low-end mining. Also see the x8 GPU machines with multiple power supplies that have been around forever. I'm not quite sure why you're arguing this so hard, you're more than familiar with these things.

Hard: Show me something with a BOM. Some manufacturing? PCB? Fab? Anything.

FWIW for someone that is frequently promoting their startup here you come across as pretty antagonistic. I'm not attacking you, just saying that for someone like myself that has been intrigued by what you're working on it gives me pause in terms of what I'd charitably refer to as potential personality/relationship issues.

Everyone has those days, just thought it was worth mentioning.


I totally get where you're coming from with the mining use case being straightforward. However, when it comes to AI, it's a different ball game. Each use case has its own set of requirements and optimizations, which is why many big mining operations find it challenging to shift towards AI. It's not just about assembling parts; it requires a deeper technical know-how.

For mining, the focus is mainly on GPUs, and the specifics like bus speed or other components aren't as critical. You could get by with a basic setup - a $35 CPU, 4GB of RAM, 100meg network, and PXE booting without any local storage. Even older GPUs like the RX470s did the job perfectly until the very end.

But what George is working on is something else entirely. It's not just about the number of GPUs; it's about creating a cohesive system where every component plays its part and is configured correctly. This complexity of tying everything together is what makes it challenging. George is incredibly talented, and the fact that he's been dedicating himself to the tinybox project for a year now really speaks volumes about the intricacies involved.

Please don't think I'm trying to be confrontational - that's not my intention in the slightest. I appreciate your perspective, but I'm just trying to offer a different angle based on my own experience in this field.

While it might seem that this hardware isn't groundbreaking or that the developments could have been achieved earlier, it's important to recognize the innovation and hard work behind it. This isn't just about putting together existing pieces; it's about creating something that works better as a whole than the sum of its parts.

I'm confident that if we were to talk in person, we'd get along just fine.


isn't tinygrad's value add the software they provide on top of the open source drivers to make it all work? why should they commit to ROCm if that's the product they're trying to sell?


> Between what it takes to get ROCm to work

It's not that bad. You just copy and paste ~10 bash commands from the official guide. The 7900 XTX is now officially supported by AMD. Andrew Ng says it's much better than 1 year ago and isn't as bad as people say.


Resistance is useless! Let's just accept our fate and toe the line. Why feel bad about paying essentially double, or getting half the compute for our money, when we can just choose the easy route, accept our fate, and feed the monopoly a little more money so they can charge us even more? Who needs competition!


Isn't the entire point of tinygrad and the tinybox the "Apple style" of "we are building this software to work best on our hardware"?


it's not there yet - and i don't really understand why they don't just try to upstream stuff to pytorch


As I mentioned in comments to this post on Twitter, you can beat this on compute with a pretty regular $6000 2x4090 system (but with less total VRAM).


$15k!


Which is not unreasonable for that amount of hardware.

You have to ask yourself if you want to drop that kind of money on consumer GPUs, which launched late 2022. But then again, with that kind of money you are stuck with consumer GPUs either way, unless you want to buy Ada workstation cards for 6k each and those are just 4090s with p2p memory enabled. Hardly worth the premium, if you don't absolutely have to have that.


I believe the Ada workstation cards are typically 1-slot cards,

which means you could build a 4-GPU server from normal cases.

Most of the 4090 cards are 2-3 slot cards.


The beefy workstation cards are 2 slots, but yeah the 4090 cards are usually 3.something slots, which is ridiculous. The few dual slot ones are water cooled.


The workstation cards also run at 300 watts, and it looks like the 4090 goes to 450.

So you are getting a more practical card for the price.

If you are making a mining-type rig, then yeah, the extra price is wasted money.

But if you want to build a normal machine, the workstation cards are the most reasonable choice for anything more than 2 GPUs.


I find it challenging to get my 4090s to consume more than 300 watts. There are also a lot of articles, benchmarks, etc. around showing you can dramatically limit power while reducing perf by insignificant amounts (single-digit %).


> which means you could build a 4-GPU server from normal cases.

Only if you already live near an airport and are accustomed to the sound of planes lifting off and flying away.


Somewhat tangential question, but I'm wondering if anyone knows of a solution (or Google search terms for this):

I have a 3U supermicro server chassis that I put an AM4 motherboard into, but I'm looking at upgrading the Mobo so that I can run ~6 3090s in it. I don't have enough physical PCIE slots/brackets in the chassis (7 expansion slots), so I either need to try to do some complicated liquid cooling setup to make the cards single slot (I don't want to do this), or I need to get a bunch of riser cables and mount the GPU above the chassis. Is there like a JBOD equivalent enclosure for PCIE cards? I don't really think I can run the risers out the back of the case, so I'll likely need to take off/modify the top panel somehow. What I'm picturing in my head is basically a 3U to 6U case conversion, but I'm trying to minimize cost (let's say $200 for the chassis/mount component) as well as not have to cut metal.


You'll need something like EPYC/Xeon CPUs and motherboards, which not only have many more PCIe lanes but also allow bifurcation. Once you have that, you can get bifurcated risers and have many GPUs. These risers use normal cables, not the typical gamer PCIe risers, which are pretty hard to arrange. You won't get this for just $200 though.

For the chassis, you could try a 4U Rosewill like this: https://www.youtube.com/watch?v=ypn0jRHTsrQ, though I'm not sure if 6 3090s would fit. You're probably better off getting a mining chassis; it's easier to set up and cool, and also cheaper, unless you plan on putting them in a server rack.


Comino sells a 6x 4090 box as a product: https://www.comino.com/

They have single-slot GPU waterblocks but would want something like $400 or more each for them individually.


I really enjoy and am inspired by the idea that people like Dettmer (and probably this Samsja person) are the spiritual successors to homebrew hackers in the 70s and 80s. They have pretty intimate knowledge of many parts of the whole goddamn stack, from what's going on in each hardware component, to how to assemble all the components into a rig, up to all the software stuff: algorithms, data, orchestration, etc.

Am also inspired by embedded developers for the same reason


For large VRAM models, what about selling one of the 3090s, and putting the money towards an NVLink and a motherboard with two x16 PCIe slots (and preferably spaced so you don't need riser cables)?


IME NVLink would be overkill for this. Model parallelism means you only need bandwidth to transfer the intermediate activations (/gradients + optimizer state) at the seams, and inference speed is generally slow enough that even PCIe x8 won't be a bottleneck.
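A quick illustrative calculation of why PCIe x8 isn't the bottleneck during token-by-token inference (all numbers are my assumptions for a 70B-class model, not from the thread):

    # Activations crossing a pipeline split per generated token, vs. PCIe x8.
    hidden_dim = 8192        # assumed hidden size of a 70B-class model
    bytes_per_value = 2      # fp16 activations
    tokens_per_second = 10   # the speed quoted elsewhere in this thread

    needed = hidden_dim * bytes_per_value * tokens_per_second   # bytes/s
    pcie3_x8 = 8e9                                              # ~8 GB/s

    print(f"needed ~{needed/1e3:.0f} KB/s vs ~{pcie3_x8/1e9:.0f} GB/s available")
    # -> ~160 KB/s against ~8 GB/s: orders of magnitude of headroom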


Full riser cables like they used don't impact performance. Hanging it off an open-air frame is IMO better; it keeps everything cooler, not just the GPUs but the motherboard and surrounding components. With only 2 24 GB GPUs they are not going to be able to run larger models. You can't experiment with 70B models without offloading to CPU, which is super slow. The best models are 70B+ models.


48 GB suffices for 4-bit inference and QLoRA training of a 70B model. ~80 GB allows you to push it to 8-bit (which is nice, of course), but full-precision finetuning is completely out of reach either way.

Though you're right, of course, that PCIe will totally suffice for this case.


Why do you need x16 pcie slots if you can use nvlink?


NVLink is to connect the cards to each other. To connect them to the board you need the PCIe slots.


We are talking about increasing the inter-card bandwidth, assuming that's a bottleneck. It can be done either by increasing PCIe bandwidth or by using NVLink. If you use NVLink, increasing PCIe does not provide any additional benefit because NVLink is much faster than PCIe.

P.S. The mobo (B450 Steel Legend) already has 2 PCIe x16 slots, so the recommendation does not make sense to me.


> I just got my hands on a mining rig with 3 rtx 3090 founder edition for the modest sum of 1.7k euros.

I would prefer a tutorial on how to do this.


I've been slowly expanding my HTPC/media server into a gaming server and a box for running LLMs (and possibly diffusion models?) locally to play around with. I think it's becoming clear that the future of LLMs will be local!

My box has a Gigabyte B450M, Ryzen 2700X, 32GB RAM, Radeon 6700XT (for gaming/streaming to steam link on Linux), and an "old" Geforce GTX 1650 with a paltry 6GB of RAM for running models on. Currently it works nicely with smaller models on ollama :) and it's been fun to get it set up. Obviously, now that the software is running I could easily swap in a more modern NVidia card with little hassle!

I've also been eyeing the B450 Steel Legend as a more capable board for expansion than the Gigabyte board; this article gives me some confidence that it is a solid board.


Does anyone have any good recommendations for an EPYC server-grade motherboard that can take 3x3090? My current motherboard (Strix TRX40-XE) has memory issues now: 2 slots cause boot errors no matter what memory is inserted. I plan to sell the Threadripper. The other option is to just swap out the current motherboard for a TRX Zenith Extreme, but I feel server-grade would be better at this point after experiencing issues. Is Supermicro worth it?


If you're just going to stick to 3 GPUs, then a lot of consumer gaming motherboards would be more than sufficient. Check out the Z270, X99, X299. If you really want EPYC, go to eBay and search for "gigabyte mz32-ar0 motherboard". The majority of them are going to come from China and they are all pretty much used. If you have plans to go even bigger, then I say go for a new WRX80.


I have this motherboard - a big downside is that many of the PCIe slots will overhang into the RAM if used for a GPU. I can't use two channels in my current ML machine because of this, and I have single-slot 4090s.


It might not be the answer you are looking for, but I would take a look at the components published by System76/Lambda Labs, such as this, to pick one that would suit me: https://github.com/system76/thelio/blob/master/Thelio%20Comm...


Heh I went the opposite direction from zenith to ASRock tnt

Are you on the latest bios? I have had good results flashing the "E" bios in the TNT folder based on the FTP site mentioned here

https://www.reddit.com/r/ASRock/s/GztBuD9INh


H12SSL-i or H12SSL-NT

ROMED8U-2T


If you would like to put Kubernetes on top of this kind of setup, this repo is helpful: https://github.com/robrohan/skoupidia

The main benefit is you can shut off nodes entirely when not using them, and then when you turn them back on they just rejoin the cluster.

It also helps with managing different types of devices and workloads (TPU vs GPU vs CPU).


I love the idea of a "poor man's cluster" of hardware that I can continually add to. Old ereaders, phones, tablets, family laptops, everything.

I'm not sure what I'd use it for.


Just sharing.

2 x RTX4090 workstation guide

You can put two air-cooled 4090s in the same ATX case if you do enough research.

https://github.com/eul94458/Memo/blob/main/dual_rtx4090works...


Just ordered a 15k Threadripper platform because it's the only way to cheaply get around the PCIe x16 bottleneck. The mining rigs are neat because the space you need for consumer GPUs is a big issue.

Those rigs need PCIe riser slots, which are also limited.

Looks like the primary value is the rig and the cards. They'll need another 1-2k for a Threadripper and then the riser slots.


Availability is tight, I think, but check out the Ampere Altra stuff; they have an absurd number of PCIe lanes compared to AMD and especially Intel, if you can suffer the ARM architecture.

They also have some ML inference stuff on chip themselves.


But then you need to deal with ARM compile issues. A lot of common packages are available for ARM, but x86 is still the least likely to distract your development.


Unless you are training, maximizing the PCIe lanes is truly overrated. You certainly don't want to be running at 1x speed, but 8x speed is enough with minimal impact. 8*3 = 24 lanes; most CPUs can provide that. I'm running off a 2012 HP Z820, which yields 3x16/1x8. So for anyone going for a build, don't throw money at CPUs. IMHO: GPU first, then your motherboard (read the spec sheets), then CPU-supported PCIe lanes & storage speed.


This is nice. I would've used one of those ETH mining cases that support multiple GPUs. eBay has them for $100-150 these days.


I strongly, strongly suspect most people doing this are significantly short of the breakeven prices for transitioning from cloud 3090s.

inb4 there are no cloud 3090s: yes there are, just not in formal datacenters


It's not always about cost. Sometimes the ergonomics of a local machine are nicer.


Are M1/M2/M3 Max Macs any good for this?


Way slower than 1 GPU, at many times the cost. If you don't mind waiting minutes instead of seconds, Macs are reasonable.


It depends on what you're trying to do, but I've got an M1, and doing inference with llama2-uncensored using Ollama, I get results within seconds.
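For anyone curious what that looks like in practice, a minimal sketch of querying a local Ollama server (assuming it's running on the default port 11434 and the model has already been pulled):

    # One-shot generation against a local Ollama instance's HTTP API.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama2-uncensored",
            "prompt": "Summarize the trade-offs of local vs. cloud GPUs.",
            "stream": False,   # return one JSON object instead of a stream
        },
        timeout=300,
    )
    print(resp.json()["response"])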


Depends what you're doing. An M1 Max takes around a minute for 1 SDXL image, and the machine feels like it's choking while it does it, while a 3090 will do it in 9 seconds without breaking a sweat.

Llama is definitely a bit of a different story though.


I'm thinking more about the training side, because it could be compelling to buy a beefily specced M3 Max if it can replace what a dedicated GPU rig could do and also be a daily driver.


I thought this looked like a cryptocurrency miner. Seems the crypto to AI pivot is legit happening. And good. Would rather we boiled the oceans for something marginally more valuable than in-game tokens we traded for fiat funds in this video game we call life.



