
I remember Nvidia getting hacked pretty badly a few years ago. IIRC, the hackers threatened to release everything they had unless Nvidia open-sourced its drivers. Maybe they got what they wanted.

[0] https://portswigger.net/daily-swig/nvidia-hackers-allegedly-...



For Nvidia, the most likely reason they've strongly avoided Open Sourcing their drivers isn't anything like that.

It's simply a function of their history. They used to have high-priced, professional-level graphics cards ("Nvidia Quadro") using exactly the same chips as their consumer graphics cards.

The BIOS of the cards was different, enabling different features. So people wanting those features cheaply would buy the consumer graphics cards and flash the matching Quadro BIOS to them. Worked perfectly fine.

Nvidia naturally wasn't happy about those "lost sales", so began a game of whack-a-mole to stop BIOS flashing from working. They did stuff like adding resistors to the boards to tell the card whether it was a Geforce or Quadro card, and when that was promptly reverse engineered they started getting creative in other ways.

Meanwhile, they couldn't really Open Source their drivers because then people could see what the "Geforce vs Quadro" software checks were. That would have opened the door to software countermeasures being developed.

---

In the last few years, the professional cards and gaming cards have moved to different chips. So the BIOS tricks are no longer relevant.

Which means Nvidia can "safely" Open Source their drivers now, and they've begun doing so.

--

Note that this is a copy of my comment from several months ago, as it's just as relevant now as it was then: https://news.ycombinator.com/item?id=38418278


Very interesting, thanks for the perspective. I suspect the recent loss of face they experienced with the transition to Wayland, which happened around the same time this motivation evaporated, probably plays a part too, though.

I swore off ever again buying Nvidia, or any laptops that come with Nvidia, after all this. Maybe in 10 years they'll have managed to right the brand perceptions of people like myself.


interesting timing to recall that story. now the same trick is used for h100 vs whatever the throttled-for-embargo-wink-wink Chinese version is called.

but those companies are really averse to open sourcing because they can't be sure they own all the code. it's decades of copy-pasting reference implementations, after all


> now the same trick is used for h100 vs whatever the throttled-for-embargo-wink-wink Chinese version

No. H20 is a different chip designed to be less compute-dense (by having different combinations of SM/L2$/HBM controller). It is not a throttled chip.

A800 and H800 are A100/H100 with some areas of the chip physically blown away (fused off) and reconfigured. They are also not simply throttled.


that's what nvidia told everyone in mar 23... but there's a reason why h800 were included last minute on the embargo in oct 23.


That's not just what NVIDIA claimed; that's what I have personally verified.

> there's a reason why h800 were included last minute

No. The Oct '22 restrictions were by themselves significantly easier to meet than the Oct '23 ones. NVIDIA just needed to kill 4 NVLink lanes off the A100 and you got the A800. For the H100 you kill some more NVLink until, on paper, NVLink bandwidth is roughly back at A800 level, and voila.

BIS was certainly pissed off by NVIDIA's attempt at being creative to sell the best possible product to China, so they actually lowered the allowed compute numbers AGAIN in Oct '23. That's what killed the H800.


I see. thanks for the details.


The explanation could also be as simple as fear of patent trolls.


I doubt it. It's probably a matter of constantly being prodded by their industry partners (e.g. Red Hat), constantly being shamed by the community, and reducing the amount of maintenance they need to do to keep their driver stack updated and working on new kernels.

The meat of the drivers is still proprietary; this just allows them to be loaded without a proprietary kernel module.
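If you want to check which flavour is actually installed on a given box, one rough sketch (Python, assuming modinfo is on the PATH and that an "nvidia" module is present; the open kernel modules ship with a dual MIT/GPL license tag, while the legacy blob reports a proprietary "NVIDIA" license):

    # Sketch: guess whether the installed nvidia kernel module is the open
    # flavour by reading its license tag via modinfo.
    import subprocess

    lic = subprocess.run(
        ["modinfo", "-F", "license", "nvidia"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    print("nvidia kernel module license:", lic)
    print("open kernel module" if "GPL" in lic else "proprietary kernel module")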


Nvidia has historically given zero fucks about the opinions of their partners.

So my guess is it's to do with LLMs. They are all in on AI, and having more of their code be part of training sets could make tools like ChatGPT/Claude/Copilot better at generating code for Nvidia GPUs.


Yup. nVidia wants those fat compute center checks to keep coming in. It's an unsaturated market, unlike gaming consoles, home gaming PCs, and design/production workstations. They got a taste of that blockchain dollar, and now AI looks to double down on the demand.

The best solution is to have the industry eat their dogfood.


I also see this as the main reason. GPU drivers for Linux were, as far as I know, just a niche use case; maybe CUDA planted a small seed, and the AI hype is the flower. Now the industry, not the users, demands drivers, so this became a demanded feature instead of a niche user wish.

A bit sad, but hey, welcome anyways.


I suspect it's mainly the reduced maintenance and the reduced workload needed for support, especially with more platforms coming to be supported (not so long ago there was no ARM64 nvidia support; now they are shipping their own ARM64 servers!)

What really changed the situation is that Turing-architecture GPUs bring a new, more powerful management CPU, which has enough capacity to essentially run the OS-agnostic parts of the driver that used to be provided as a blob on Linux.


Am I correct in reading that as Turing architecture cards include a small CPU on the GPU board, running parts of the driver/other code?


In the Turing microarchitecture, nVidia replaced their old "Falcon" CPU with an NV-RISCV RV64 core, which runs various internal tasks.

"Open Drivers" from nVidia include different firmware that utilizes the new-found performance.


How well isolated is this secondary computer? Do we have reason to fear the proprietary software running on it?


As well isolated as anything else on the bus.

So you better actually use IOMMU
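If you want to see how finely your box actually splits things up, here's a rough sketch (Python, assuming the usual /sys/kernel/iommu_groups layout on Linux) that lists each IOMMU group and its devices. A GPU that shares a group with other devices can't really be isolated from them.

    # Sketch: list IOMMU groups and the PCI devices in each by walking sysfs.
    # No groups directory usually means the IOMMU is disabled or unsupported.
    from pathlib import Path

    groups = Path("/sys/kernel/iommu_groups")
    if not groups.is_dir():
        print("no IOMMU groups exposed - IOMMU disabled or unsupported")
    else:
        for group in sorted(groups.iterdir(), key=lambda p: int(p.name)):
            devices = sorted(d.name for d in (group / "devices").iterdir())
            print(f"group {group.name}: {', '.join(devices)}")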


Ah, yes, the magical IOMMU controller that everybody just assumes to be implemented perfectly across the board. I'm expecting this to be like Hyperthreading, where we find out 20 years later that the feature was faulty/maybe_bugdoored since inception in many/most/all implementations.

Same thing with USB3/TB controllers, NPUs, etc. that everybody just expects to be perfectly implemented to spec, with flawless firmware.


It's not perfect or anything, but it's usually a step up[1], and the funny thing is that GPUs generally had fewer "interesting" compute facilities to jump over from; they were just easier to access. My first 64-bit laptop, my first Android smartphone, and the first few iPhones had more MIPS32le cores with possible DMA access to memory than main CPU cores, and that was just counting one component of many (the wifi chip).

Also, Hyperthreading wasn't itself faulty or "bugdoored". The tricks necessary to get high performance out of CPUs were, and then there was Intel deciding to drop various good precautions in the name of still higher single-core performance.

Fortunately, after several years, IOMMU availability has become more common (the laptop I'm writing this on seems to have proper separate groups for every device).

[1] There's always the OpenBSD approach of navel-gazing about writing "secure" C code, slowly becoming obsolescent thanks to being behind in performance and features, and ultimately getting pwned because the C focus and the refusal to implement "complex" features that would help mitigate access result in a pwnable SMTPd running as root.


All fine and well, but I always come back to "If I were a manufacturer/creator of some work/device/software that does something in the plausible realm of 'telecommunication', how do I make sure that my product can always comply with https://en.wikipedia.org/wiki/Lawful_interception requests? Allow for ingress/egress of data/commands at as low a level as possible!"

So as a director of a chipset company, it would seem like a no-brainer to have to tell my engineers, unfortunately, not to fix some exploitable bug in the IOMMU/chipset. Unless I never want to sell devices that could potentially be used to move citizens' internet packets around in a large-scale deployment.

And implement/not_fix something similar in other layers as well, e.g. ME.


If your product is supposed to comply with Lawful Interception, you're going to implement proper LI interfaces, not leave bullshit DMA bugs in.

The very point of Lawful Interception involves explicit, described interfaces, so that all parties involved can do the work.

The systems with LI interfaces also often end up in jurisdictions that simultaneously put high penalties on giving access to them without specific authorization - I know, I had to sign some really interesting legalese once due to working in an environment where we had to balance Lawful Interception, post-facto access to data, and telecommunications privacy laws.

Leaving backdoors like that is for Unlawful Interception, and the danger of such approaches was vividly exposed when Chinese intelligence services exploited the NSA backdoor in Juniper routers (the infamous Dual_EC_DRBG RNG).


> you better actually use IOMMU

Is this feature commonly present on PC hardware? I've only ever read about it in the context of smartphone security. I've also read that nvidia doesn't like this sort of thing because it allows virtualizing their cards, which is supposed to be an "enterprise" feature.


Relatively common nowadays. It used to be delineated as a feature in Intel chips as part of their vPro line, but I think it's baked in now. Generally an IOMMU is needed for performant PCI passthrough to VMs, and Windows uses it for Device Guard, which tries to prevent DMA attacks.


Mainstream consumer x86 processors have had IOMMU capability for over a decade, but for the first few years it was commonly disabled on certain parts for product segmentation (e.g. the i5-3570K had overclocking but no IOMMU, while the i5-3570 had an IOMMU but limited overclocking). That practice died off approximately when Thunderbolt started to catch on, because not having an IOMMU when using Thunderbolt would have been very bad.


Seems to me that Zen 4 has no issues at all, but bridges/switches require additional interfaces to further fan-out access controls.


It's hard to believe one of the highest valued companies in the world cares about being shamed for not having open source drivers.


They care when it affects their bottom line, and customers leaving for the competition does that.

I don't know if that's what's happening here, honestly, and you're right that they don't care about being shamed. But building a reputation for being hard to work with and target, especially in a growing market like Linux (still tiny, but growing nonetheless, and becoming significantly more important in the areas where non-gaming GPU use is concerned), can start to erode sales and B2B relationships, the latter particularly if you make the programmers and PMs hate using your products.


> customers leaving for the competition does that

What competition?

I do agree that companies don't really care for public sentiment as long as business is going on as usual. Nvidia is printing money with their data center hardware [1], which is where half of their yearly revenue comes from.

https://nvidianews.nvidia.com/news/nvidia-announces-financia...


> in a growing market like Linux

Isn't Linux 80% of their market? ML et al is 80% of their sales, and ~99% of that is Linux.


True, although note that the Linux market itself is increasing in size due to ML. Maybe "increasingly dominant market" is a better phrase here.


Hah, good point. The OP was pedantically correct. The implication in "growing market share" is that "market share" is small, but that's definitely reading between the lines!


Right, and that's where most of their growth is.


Having products that require a bunch of extra work due to proprietary drivers, especially when their competitors don't require that work, is not good.


The biggest chunk of that "extra work" would be installing Linux in the first place, given that almost everything comes with Windows out of the box. An additional "sudo apt install nvidia-drivers" isn't going to stop anyone who already got that far.


Does the "everything comes with Windows out of the box" still apply for the servers and workstations where I imagine the vast majority of these high-end GPUs are going these days?


Tainted kernel. Having to sort out Secure Boot problems caused by the use of an out-of-tree module. DKMS. Annoying weird issues with different kernel versions, and problems running the bleeding edge.
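The "tainted" part is at least easy to check for yourself; a small sketch (Python, with the bit meanings taken from the kernel's tainted-kernels documentation) that decodes the taint mask:

    # Sketch: decode the kernel taint mask, flagging the bits set by
    # proprietary / out-of-tree / unsigned modules such as nvidia's.
    TAINT_BITS = {
        0: "proprietary module loaded (P)",
        12: "out-of-tree module loaded (O)",
        13: "unsigned module loaded (E)",
    }

    with open("/proc/sys/kernel/tainted") as f:
        taint = int(f.read().strip())

    if taint == 0:
        print("kernel is not tainted")
    else:
        for bit, label in TAINT_BITS.items():
            if taint & (1 << bit):
                print("tainted:", label)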


Most cloud instances come with Linux out of the box.


I mean I've personally given our Nvidia rep some light hearted shit for it. Told him I'd appreciate if he passed the feedback up the chain. Can't hurt to provide feedback!


The kernel modules are not the user-space drivers, which are still proprietary.


Ooops. Missed that part.

Re-reading that story is kind of wild. I don't know how valuable what they allegedly got would be (silicon, graphics and chipset files) but the hackers accused Nvidia of 'hacking back' and encrypting their data.

Reminds me of a story I heard about Nvidia hiring a private military company to guard their cards after entire shipments started getting 'lost' somewhere in Asia.


Wait what? That PMC story got me. Where can I find more info on that lmao?


I'd heard the story firsthand from a guy in San Jose. Never looked it up until now. This is the closest thing I could find to it, in which case it sounds like it's been debunked.

[0] https://www.pcgamer.com/no-half-a-million-geforce-rtx-30-ser...

[1] https://www.geeknetic.es/Noticia/20794/Encuentran-en-Corea-5...


Much of the black magic has been moved from the drivers to the firmware anyway.


they did release it. a magic drive i have seen, but totally do not own, has it



