All GB/s without FLOPS – Nvidia CMP 170HX Review (niconiconi.neocities.org)
140 points by dannyw on Oct 29, 2023 | 74 comments



I love finding hacky ways to save money on hardware, but unfortunately NVidia is just too good at the game and the 170HX was destined for the landfill at birth. God forbid a few of us enthusiasts get A100 performance for under a grand.

The next best thing is a 3090 or the like with a broken PCIe power connector or some other minor defect. My 3090 is simply missing the bit that holds the clip of the power connector in; however, it's a snug fit anyway, and with the cables crammed into my case as they are, I don't think it's going anywhere. I paid $200 less than market for that 3090 as a result. Less than a gram of plastic. $200 off.

Meanwhile, as the article points out, AMD is nowhere near as hostile towards its customer base, and modified Radeon cards can apparently be had for $100 or so (from China). The caveat of course is no CUDA support, so it’s kind of moot.


There is some CUDA support on AMD. I'm using it on a daily basis, and it's much more production-ready than you would expect. Do you use PyTorch or something else?


How do you have “some” CUDA support? I’m aware of the AMD HIP API that ports CUDA code to run on AMD GPUs, but that’s not CUDA at that point. I’m also aware of the geohot (George Hotz) project for bringing native CUDA to AMD GPUs, but I think he abandoned it because AMD wasn’t throwing him any bones.

Cupy on Python is mostly what I use.
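
For anyone curious, a minimal sketch of what that looks like (assuming the ROCm build of CuPy; the same script runs unchanged against the CUDA build on an Nvidia card):

    import cupy as cp  # the ROCm build of CuPy exposes the same CUDA-style API

    # allocate and multiply two matrices on the GPU
    a = cp.random.rand(4096, 4096, dtype=cp.float32)
    b = cp.random.rand(4096, 4096, dtype=cp.float32)
    c = cp.matmul(a, b)

    # copy the result back to the host as a NumPy array
    print(cp.asnumpy(c).sum())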


This is what they are referring to: https://github.com/ROCm-Developer-Tools/HIPIFY


PyTorch can use the 'cuda' device on AMD GPUs; I'm not talking about HIP. It works as a drop-in replacement if your ROCm API level matches your target CUDA level (e.g. matmul on fp16 won't be supported on old AMD GPUs).
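
A minimal sketch of what "drop-in" means here (assuming the ROCm build of PyTorch is installed; the script itself never mentions AMD):

    import torch

    # on the ROCm build of PyTorch, these report the AMD GPU
    print(torch.cuda.is_available())
    print(torch.cuda.get_device_name(0))

    device = torch.device("cuda")  # same device string as on Nvidia
    x = torch.randn(2048, 2048, device=device, dtype=torch.float16)
    y = x @ x  # fp16 matmul; may be unsupported on older AMD GPUs, as noted above
    print(y.float().mean().item())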


I love this, but in practice I find things like this are yet another problem I have to work through. I have a hard enough time getting both the nvcc and nvidia-smi commands to work at the same time, or two Nvidia GPUs (one internal and one external via TB3) to work on the same driver version. PopOS usually “just works”; on most other distros it’s some kind of hassle, especially if the Nvidia GPU has to handle the display output too. I try to have an AMD APU as the CPU so I can have solid graphics support on Linux.

Anyway, do you know if this is just for pytorch? Is performance roughly 100% of what you’d get on an “equivalent” Nvidia GPU?


This technically uses HIP. AFAIK, they ran the CUDA code for pytorch through HIPIFY


Fascinating, I didn't know about this. Thank you for sharing this bit!


> but I think he abandoned it because AMD wasn’t throwing him any bones.

I haven't heard about the "bringing native CUDA to AMD GPUs", sounds really interesting. I did come across a picture of Geohot with a ton of AMD GPUs though, wasn't that enough for him or what?


HN is where I heard about it! About 3 months ago I think.

I think this is it: https://news.ycombinator.com/item?id=36189705


Could a card like this work for something like object recognition? Those tiny Google TPUs that are supposed to be 25 dollars are 100 dollars now, and I’m wondering: if the prices get low enough, would these be a viable alternative (along with an undervolt)?


Raspberry Pis and other “small things” like Project Coral are ridiculously overpriced. It’s ironic because whenever I see someone post on Reddit about their tiny rack of Raspberry Pis in their closet, people always ask in the comments what they’re using the cluster for, and the OP always says something along the lines of “just fucking around.”

If I’m on a budget, my first choice is probably going to be an older, used Nvidia GPU. Maybe a 1050 or one of the older Quadros. The Quadros will get you the most VRAM per dollar, but the GPU will be rather weak, in the realm of a 1050 perhaps, while the 1050 will have better/longer driver support and be physically smaller, quieter, and use less energy.

If those aren’t an option, I would take a closer look at ROCm and HIP. It seems like AMD is prioritizing support for newer GPUs like the Radeon VII, so a $50 RX480 is probably not a good investment, despite the low price.


AMD is hostile in a much more meaningful way - APIs are junk, drivers are buggy, hardware doesn't work. Useless for AI. You will get nothing done. Save your money. Better to get working hardware for more than non-working hardware for less. (Tokens/s)/dollar is 0 on AMD.

You will waste time.


That’s not hostility, that’s just suckage. Suckage can be solved with time, even if it occasionally takes years. Hostility can only be solved by the customer base jumping ship.


It has been a decade+. If you build a company on their GPUs, you will fail. You're just adding implementation complexity for no reason. Unfixable software.

But all I'm doing is warning. The consensus viewpoint is not this so you can listen to HN consensus or you can listen to me.


> The consensus viewpoint is not this so you can listen to HN consensus or you can listen to me.

Can you corroborate your points? None of it really aligns with my experience. Nvidia hardware seems quite popular and effective for raster solutions, accelerated RT, dedicated AI and even low-power handheld gaming. I'm typing this out on a Linux box with an Nvidia GPU right now :P

It's worth noting that Nvidia isn't a saint, sure. They play for keeps, and CUDA is limited to paying customers only. CUDA doesn't have open-source alternatives, though. Some things do part of what CUDA does really well (or better), but nobody is making a full-stack replacement. Apple is investing in the Accelerate framework, which has almost no industry/datacenter application; AMD is doubling down on OpenBLAS and community support. Intel is half-assing some proprietary frameworks and pushing them into demos to look good.

It would be great if these incumbent companies would pool their vast resource advantage to write, deliver, test and maintain a cross-platform GPGPU library. But that's a lot to ask, and it's easier to just disrupt the entire market with a single integrated package.


> Intel is half-assing some proprietary frameworks and pushing it into demos for a good look.

My understanding is that the likes of oneAPI are supposed to enable non-Nvidia GPUs to work on CUDA workloads? Is oneAPI one of these proprietary frameworks you mean?


I should have said "The HN consensus viewpoint is apparently not this based on this thread"


Someone asking you to pay money for quality isn't hostility.


But them asking for as much money as (they think) you can bear, because you’ve got nowhere else to go, is. And in a market with a single real choice, the difference between the two is one of quantity, not of quality. I’d say Nvidia is leaning towards the hostility side these days, although my absolute revulsion for software locks may be colouring my perspective.

To be clear, “hostility” is not the word I would’ve chosen, as it attributes emotions to entities that don’t really experience them. Perhaps it’s more useful to talk about whether the company cares if the customers feel exploited or not; and I don’t think Nvidia does (think this will hurt their sales).


I agree it isn’t hostility, it’s monopoly.


Only if you consider software a commodity.

Which... sorry to inject my personal opinion here, but it's not. Software is a finite intellectual product designed by motivated human laborers. The hardware can be a commodity, and the design can be a competitive advantage, but the software layer is specifically what people consider "monopolized".

Nvidia is not the only company designing GPGPU hardware, and they're not the only company capable of affording commodity silicon from TSMC. The only high-demand thing they entirely control seems to be CUDA, a software feature other companies are too lazy to reproduce. Maybe it's the rest of the market that's being anticompetitive?


I love Nvidia. Sure it's "closed" in that there's no alternative that uses the same API. But they have wonderful developer support, solid APIs, and are primarily responsible for the rapid rise in GPGPU computing. And the costs aren't _that_ bad. I've been around a long time. The amount of computing power in a 4090 consumer GPU is mind blowing.


The driver side has sucked on AMD since the cards were still ATI, though.


I am hearing the same claims repeated over and over again.

On linux they are simply not true.

So are we talking about Windows? Are we talking games?


For GPU compute drivers on the majority of their consumer cards on Linux the claims are most certainly true.


Huh? Not sure if I'm misunderstanding but I'm on Arch and I've been running my 6750XT with SD since like February. Got SDXL running a few months ago and have played with oobabooga a bit. Also compiled whisper.cpp with HIPblas the other day.

I also play a few dozen hours of games a month, some new, some old, some AAA, some indie. All through Steam's Proton with no driver issues whatsoever.


You may be running SDXL, but according to benchmarks I’ve seen, nowhere near the speed of, say, a 3070 or a 3080 12GB (if you want an Nvidia product with comparable VRAM).


The situation has been changing rapidly since summer. According to my research, if you combine their desktop and CUDA bugginess (for the 7900 XTX), they are approaching the Nvidia level of bugginess on Linux, which is surprisingly quite high (much, much higher than what it was, say, 5 years ago, when Nvidia just worked). One only needs a glimpse at their forum https://forums.developer.nvidia.com/c/gpu-graphics/145 to gauge the situation.

I just bought a 4090, and the desktop experience I get is much worse than what I had with the GPU embedded in the Ryzen 7950X: Wayland doesn't work, in Xorg there is tearing in mpv, and alt-tab sometimes breaks in GNOME. When I launch memory-intensive CUDA kernels, the whole desktop becomes unresponsive. The driver spews Xid errors in dmesg and breaks for certain applications, such as EmberGen.


I’ve stopped trying to use NVidia on Linux. I will usually have the display plugged into a Ryzen APU and have the NVidia GPU headless for compute stuff.


There are some positive movements, though. For example, the addition of the explicit synchronization that the Nvidia driver needs to function properly into DRI3 / Xwayland (https://gitlab.freedesktop.org/xorg/xserver/-/merge_requests...) is active as of right now, and it seems that it will succeed. The corresponding mutter issue: https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3300


I will say my experience with Nvidia + Wayland has been better than most it seems. It may be because I typically use Fedora which sticks to pure GNOME while Ubuntu and Pop_OS bastardize GNOME.

Still, my main complaint is that moving windows around and so on is not smooth. I forget what the term for it is… the window gets jagged and it’s like parts of it are moving at different speeds.


My last nvidia card was bought in 2007.

Since then I am on AMD. I game, I build games, work on GPU related stuff.

I refuse to buy Nvidia until they open source their drivers.

I don't care about Windows; I am on Linux solely, and there, from my experience, AMD is doing an excellent job.


AMD's drivers weren't open in 2007 either and for a number of years after that.


First open source amdgpu release was 2015, so 8 years later. You could probably reasonably stretch an Nvidia 8800 GTX (the absolute top card in 2007) to 2012 if you were thrifty with the settings and resolution. But by 2015 you couldn't run most games at all I would wager.


I never said they were in 2007, I said I bought my last Nvidia card in 2007.

In 2011 I was using R600, without any problems. Since then the situation improved steadily, especially when Steam got native support.


Maybe also blame Nvidia for making a closed source API in your rant.


> drivers are buggy

so you don't use linux


Anything that does not live in AMD's ecosystem manages to avoid suckage. The AMD linux drivers are fine because Linus would not let them get away with the sort of shit they have to be doing in their internal repos. As geohot memorably noted, "this will generate dead loop." If their commit messages are like that in Linux, what is their code like when nobody is looking?

The ROCM drivers are shit. They somehow manage to get an enormous edifice of effort 99% working, then they bungle their package repository. Repeatedly. AMD have a tremendous ability to shoot themselves in the foot five feet from the end of the race, and the thing is, at this point you have to anticipate it. They have the capability to succeed, but not the temperament.


ROCm support also puzzles me because it covers an exclusive selection of GPUs instead of the whole lineup of a given architecture family. I keep reading about how there's ROCm support for the 7900 XTX, and I guess there's no support for the 7600?


> (Tokens/s)/dollar is 0 on AMD.

You do realise llama.cpp works on some AMD cards, right?
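
For example, with a hipBLAS/ROCm build of the llama-cpp-python bindings (the model path below is just a placeholder), something like this runs fully offloaded on a supported AMD card:

    from llama_cpp import Llama

    # n_gpu_layers=-1 offloads every layer to the GPU (ROCm via hipBLAS here,
    # CUDA on an Nvidia card); the .gguf path is a placeholder.
    llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf", n_gpu_layers=-1)

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])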


You on windows by any chance?


The 170HX was created to solve two problems Nvidia had:

    1. A small group of transient users were buying cards in bulk, preventing long-term users from getting them.
    2. Those users would dump the cards back onto the market in a few years, creating a glut and price fluctuations.

The 170HX and its limitations make perfect sense when you look at it from that perspective. Had the LLM boom and the 170HX not happened, Nvidia would have been struggling with a saturated market and a tanked stock price. So yeah, it sucks that you can't use the perfect piece of hardware, but then, that was its purpose.


Did this and similar GPUs actually solve the supply-demand problem of their consumer GTX/RTX cards? I remember the consumer GPUs being price-gouged until crypto crashed, not before.


This thing sold for 4,200 dollars, so no. Consumer Nvidia cards were cheaper and offered good enough hashrates. Also, you couldn't find any GPUs on shelves for a solid year.


"Unfortunately, all Nvidia GPUs since recent years have VBIOS digital signature checks, making VBIOS modification impossible"

This is not necessarily true. As seen with Android devices, you can defeat digital-signature-checking mechanisms by varying voltage levels in order to get the device to completely skip the checks as if they were never there.

https://research.nccgroup.com/2020/10/15/theres-a-hole-in-yo...

I'm sure a similar strategy could be developed here.


> I'm sure a similar strategy could be developed here.

Slow down there. Glitching is almost never a practical long term strategy. It can take hours (or even days, depending on the target) to successfully bypass a check just once without other follow on effects. Glitching is useful if you need to bypass some mitigation once, such as to extract cryptographic keys, but it's not something you want to do every time you turn on your PC. Glitching gets substantially less reliable with every passing generation due to scaling (increased density/lower Vth increases the odds that you'll corrupt something else, particularly with EM fault injection) and design complexity (glitching out-of-order cores is a HUGE pain).


Well, I believe there are some modified versions of nvflash floating around that'll let you flash anything with a valid signature.

Of course, the only things that'll POST are usually other vendors' images for the same card model (with different power limits).


Glitching almost always requires removing capacitors. Good enough for dumping things out of a device once or twice. But GPUs that consume hundreds of watts will not be stable without those bypass capacitors.

So, sure, you can skip the verification checks, but your GPU won’t be stable enough to be useful for anything


Isn't the flash chip on Nvidia boards a generic thing that someone could buy themselves, flash using existing EEPROM-writing gear, then solder onto the board?

Also, as the chip and board here seems like an A100 reference design, using an A100 VBIOS image shouldn't fail any signature checks.


Probably, but the A100 BIOS can likely detect that it is running on something other than an A100 and bail.

The lack of memory would be the most obvious difference. The A100 has 80GB, this has 8GB.

And I really suspect Nvidia has some way of explicitly locking a chip to a given product ID, like efuses that the BIOS firmware can check on boot.


You can just buy the same chip, if someone somehow decided to check the flash chip vendor.

A more sensible way of stopping that would be to write the EEPROM encrypted with a key burned into the GPU itself, but I doubt Nvidia bothered; the money lost to the few people willing to take their GPU apart and replace a chip is insignificant.


Yes, but the public key and product ID for the verification is in the GPU, not the external flash.


Ahhh. That sucks then. :/


That’s a lot to parse, so I’m kinda hand waving, but the memory bandwidth emphasis seems like a great fit for most LLMs at least if not also some ViTs and other attention-style architectures on both training and inference? Certainly sounds like the price is right.

Am I missing something key either conceptually or by failing to read all the stats closely?


Its tensor cores and floating-point math have been either artificially or actually disabled - it would be very, very slow. And 8GB of VRAM is really low as well.


It is actually nice to see people trying to make old crypto-mining hardware work. For the most part, it is all e-waste, but I'm sure there are a few gems out there.

My operation ran mostly on super old AMD GPUs (RX 470 - 580), so there really isn't much use for them now anyway.


Surely they’re good for budget ML hobbyists? I’m stuck inferencing on CPU and it’s pretty painful, I figure a GPU upgrade would be good for some workloads, though VRAM would be a struggle (maybe stick to whisper models…).
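
For a sense of scale, a hedged sketch with the openai-whisper package (the audio file name is made up; the smaller checkpoints are the ones that would fit in 8GB of VRAM):

    import whisper

    # per the openai-whisper README, "small" wants roughly ~2GB of VRAM and
    # "medium" ~5GB, so both fit on an 8GB card; "large" (~10GB) would not.
    model = whisper.load_model("small", device="cuda")

    result = model.transcribe("meeting.wav")  # placeholder audio file
    print(result["text"])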


One would think, but they have only 8GB of VRAM, no display ports, and they are air-cooled (big heat sink) too. Plus, there are about 120,000 of these cards sitting in a warehouse in the middle of nowhere; selling one or two at a time really isn't economical.

That said, if you have a few big container trucks and want to pick them all up, I can put you in touch with the right people... heh.


I'm interested! Email in profile.


Emailed


Could you send one over? My RX 580 is beginning to explode, and as a student I don't have much money to buy another one.


They don't explode.


The best-value GPU right now for SoTA LLMs is probably the Nvidia RTX 6000. You can connect them together with NVLink, they have 48GB, and you can fit 4 of them in a high-end consumer PCIe 4.0 motherboard. Enough to fine-tune even Llama-2, albeit with a batch size of 1 :)


Unfortunately, you have to specify which generation of "RTX 6000" you mean, since nVidia likes to reuse product names for no good reason besides confusing customers. The one you're talking about is around 8 grand USD. The one that comes up first when you Google "RTX 6000" is $2400 and change, and not particularly useful.


Thanks for the clarification. I bought them for $5k (without VAT) in Sweden last month.


Only the RTX A6000 (Ampere) has NVLink (not the RTX 6000! Those are 24GB Turings!), and as far as Nvidia states, they only work in 2x NVLink. Memory pooling is not really a real thing, as much as they say it is.

They run about $4K on eBay and have 303 TFLOPS of tensor-core perf in sparse BF16, so ~150 dense. They do have 48GB of memory, which is great, but at 768GB/s. Source: https://www.nvidia.com/content/dam/en-zz/Solutions/design-vi...

The 4090s run $1,599 for a Founders Edition or a Gigabyte Windforce V2 (my choice). They have at least 165 TFLOPS of BF16, 330 TFLOPS of FP16 with FP16 accumulate, and 660 TFLOPS if you use sparsity. They also support FP8 at 660 TFLOPS dense, 1,321 TFLOPS sparse (!!).

Unless you need the 2x (but ~25% slower) memory, the 4090s are much better choices. You get the same scaling over 2 cards anyway.

There is also the RTX A6000 Ada, which is $8K and based on the same Ada chip as the 4090, except with 48GB of memory. Lower power and clock speeds result in slightly lower peak TFLOPS numbers. You really, really pay for memory.


> Enough to fine tune even LLama-2, albeit with a batch size of 1 :)

At FP16? I think you need much less (~2 48GB cards) to finetune 70B with increasing levels of optimization.

I think you can even do it on a single card with QLORA.
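
A rough sketch of that single-card QLoRA setup with transformers + peft + bitsandbytes (whether the 70B variant actually fits in 48GB still depends on sequence length and batch size; the model ID is the gated meta-llama repo):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # 4-bit NF4 quantization, as in the QLoRA paper
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf",   # gated repo; requires access approval
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the small LoRA adapters are trained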


I have not been able to get it working with 2 4090s yet, but I think that's because the library I chose to use (axolotl) does not support model parallelism at all.


I wish Facebook would release the raw 34B model... Or better yet, that Mistral would release one.


As if NVIDIA is going to let the non-wealthy plebs anywhere near DIY AI


I wasn't able to block the obnoxious anime girl on the right side with ublock. I've never had that before. Does anyone know how to find a way around that?


The problem is that it isn't an element; it's a background-image on the body. You'd need to override the stylesheet for body, or block the image's URL (/img/niconiconi.png).
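
If it helps, something along these lines in uBlock Origin's "My filters" should do it (untested against the live page):

    ! strip the background image from <body>
    niconiconi.neocities.org##body:style(background-image: none !important)
    ! or just block the image itself
    ||niconiconi.neocities.org/img/niconiconi.png$image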


Thank you!



