NVIDIA Transitions Fully Towards Open-Source Linux GPU Kernel Modules

shanoaice · on July 18, 2024

There is little meaning for NVIDIA to open-source only the driver portion of their cards, since they heavily rely on proprietary firmware and userspace lib (most important!) to do the real job. Firmware is a relatively small issue - this is mostly same for AMD and Intel, since encapsulation reduces work done on driver side and open-sourcing firmware could allow people to do some really unanticipated modification which might heavily threaten even commercial card sale. Nonetheless at least for AMD they still keep a fair share of work done by driver compared to Nvidia. Userspace library is the worst problem, since they handle a lot of GPU control related functionality and graphics API, which is still kept closed-source.

The best thing we can hope is improvement on NVK and RedHat's Nova Driver can put pressure on NVIDIA releasing their user space components.

gpderetta · on July 18, 2024

It is meaningful because, as you note, it enables a fully opensource userspace driver. Of course the firmware is still proprietary and it increasingly contains more and more logic.

sscarduzio · on July 18, 2024

Which in a way is good because the hardware will more and more perform identically on Linux as on Windows.

matheusmoreira · on July 18, 2024

Doesn't seem like a bad tradeoff so long as the proprietary stuff is kept completely isolated with no access to any other parts of my system.

justinclift · on July 18, 2024

Personally, I somewhat wonder about that. The firmware (proprietary) which runs on the gpu seems like it'll have access to do things over the gpu PCIe bus, including read system memory, and access other devices (including network gear). Reading memory of remote hosts (ie RDMA) is also a thing which Nvidia gpus can do.

foresto · on July 18, 2024

Is that not solvable using an IOMMU (assuming hardware that has one)?

justinclift · on July 18, 2024

No idea personally. :)

riehwvfbk · on July 19, 2024

An IOMMU does solve it, at the cost of some performance. The GPU can only access memory that the IOMMU allows, and the part that programs the IOMMU is open source.

RDMA requires a special network card and is opt-in - an RDMA NIC cannot access any random memory, only specially registered regions. One could argue that a NIC FW bug could cause arbitrary memory accesses, but that's another place where an IOMMU would help.

justinclift · on July 19, 2024

Awesome, thanks. :)

bayindirh · on July 18, 2024

The GLX libraries are the elephant(s) in the room. Open source kernel modules mean nothing without these libraries. On the other hand AMD and Intel uses "pltform GLX" natively, and with great success.

gpderetta · on July 23, 2024

Mesa already provides good open source GLX and Vulkan libraries. An open source NVIDIA kernel driver enables interoperability with Mesa exactly like Intel and AMD.

bayindirh · on July 23, 2024

Half of the trade secrets NVIDIA has are living in their own GLX libraries. Even if you install the open source kernel module, these GLX libraries are installed (just did it on a new cluster).

I’m not holding my breath about these libraries to be phased out and NVIDIA integrates to the platform GLX any time soon.

I think NVIDIA will resist moving to a firmware only model (ala AMD & Intel) as long as they can, preferably forever.

pabs3 · on July 18, 2024

The firmware is also signed, so you can't even do reverse engineering to replace it.

paulmd · on July 18, 2024

the open kernel driver also fundamentally breaks the limitation about geforce gpus not being licensed for use in the datacenter. that provision is a driver provision and CUDA does not follow the same license as the driver... really the only significant limitation is that you aren't allowed to use the CUDA toolkit to develop for non-NVIDIA hardware, and some license notice requirements if you redistribute the sample projects or other sample sourcecode. and yeah they paid to develop it, it's proprietary source code, that's reasonable overall.

https://docs.nvidia.com/cuda/eula/index.html

ctrl-f "datacenter": none

so yeah, I'm not sure where the assertion of "no progress" and "nothing meaningful" and "this changes nothing" come from, other than pure fanboyism/anti-fans. before you couldn't write a libre CUDA userland even if you wanted to - the kernel side wasn't there. And now you can, and this allows retiming and clock-up of supported gpus even with nouveau-style libre userlands. Which of course don't grow on trees, but it's still progress.

honestly it's kinda embarrassing that grown-ass adults are still getting their positions from what is functionally just some sick burn in a 2004 viral video or whatever, to the extent they actively oppose the company moving in the direction of libre software at all. but I think with the "linus torvalds" citers, you just can't reason those people out of a position that they didn't reason themselves into. Not only is it an emotionally-driven (and fanboy-driven) mindset, but it's literally not even their own position to begin with, it's just something they're absorbing from youtube via osmosis.

Apple debates and NVIDIA debates always come down to the anti-fans bringing down the discourse. It's honestly sad. https://paulgraham.com/fh.html

it also generally speaks to the long-term success and intellectual victory of the GPL/FSF that people see proprietary software as somehow inherently bad and illegitimate... even when source is available, in some cases. Like CUDA's toolchain and libraries/ecosystem is pretty much the ideal example of a company paying to develop a solution that would not otherwise have been developed, in a market that was (at the time) not really interested until NVIDIA went ahead and proved the value. You don't get to ret-con every single successful software project as being retroactively open-source just because you really really want to run it on a competitor's hardware. But people now have this mindset that if it's not libre then it's somehow illegitimate.

Again, most CUDA stuff is distributed as source, if you want to modify and extend it you can do so, subject to the terms of the CUDA license... and that's not good enough either.

Zambyte · on July 19, 2024

Can you link the source code for CUDA please? Thanks.

Edit since I'm being downvoted: I did search for it and could not find it.

paulmd · on July 19, 2024

https://github.com/NVIDIA/cccl

AshamedCaptain · on July 18, 2024

I really don't know where this crap about "Moving everything to the firmware" is coming from. The kernel part of the nvidia driver has always been small, and this is the only thing they are open-sourcing (they have been announcing it for months now......). The immense majority of the user-space driver is still closed and no one has seen any indications that this may change.

I see no indications either that either nvidia nor any of the rest of the manufacturers has moved any respectable amount of functionality to the firmware. If you look at the opensource drivers you can even confirm by yourself that the firmware does practically nothing -- the size of the binary blobs of AMD cards are minuscule for example, and long are the times of ATOMBIOS. The drivers are literally generating bytecode-level binaries for the shader units in the GPU, what do you expect the firmware could even do at this point? Re-optimize the compiler output?

There was an example of a GPU that did move everything to the firmware -- the videocore on the raspberry pi, and it was clearly a completely distinct paradigm, as the "driver" would almost literally pass through OpenGL calls to a mailbox, read by the secondary ARM core (more powerful than the main ARM core!) that was basically running the actual driver as "firmware". Nothing I see on nvidia indicates a similar trend, otherwise RE-ing it would be trivial, as happened with the VC.

ploxiln · on July 18, 2024

https://lwn.net/Articles/953144/

> Recently, though, the company has rearchitected its products, adding a large RISC-V processor (the GPU system processor, or GSP) and moving much of the functionality once handled by drivers into the GSP firmware. The company allows that firmware to be used by Linux and shipped by distributors. This arrangement brings a number of advantages; for example, it is now possible for the kernel to do reclocking of NVIDIA GPUs, running them at full speed just like the proprietary drivers can. It is, he said, a big improvement over the Nouveau-only firmware that was provided previously.

> There are a number of disadvantages too, though. The firmware provides no stable ABI, and a lot of the calls it provides are not documented. The firmware files themselves are large, in the range of 20-30MB, and two of them are required for any given device. That significantly bloats a system's /boot directory and initramfs image (which must provide every version of the firmware that the kernel might need), and forces the Nouveau developers to be strict and careful about picking up firmware updates.

noch · on July 18, 2024

>> I see no indications either that either nvidia nor any of the rest of the manufacturers has moved any respectable amount of functionality to the firmware.

Someone who believes this could easily prove that they are correct by "simply" taking their 4090 and documenting all its functionality, as was done with the [7900 xtx](https://github.com/geohot/7900xtx).

You can't say "I see no indications/evidence" unless you have proven that there is no evidence, no?

paulmd · on July 18, 2024

so basically “if you really think there’s no proof of a positive claim, then you won’t mind conclusively proving the negation”?

no, that’s not how either logical propositions or burden of proof works

exe34 · on July 19, 2024

He has already told you how to prove it: enumerate the functionality of the driver - the GPU and the code are finite, bounded environments. You can absolutely prove that there is no tea in a cup, that there are no coins in a purse, that there is no cat in a box, etc.

noch · on July 19, 2024

> no, that’s not how either logical propositions or burden of proof works

I think you're missing the point, perhaps intentionally to make a smart-sounding point?

We're programmers, working on _specific physical things_. If I claim that my CPU's branch predictor is not doing something, it is only prudent to find out what it is doing, and enumerate the finite set of what it contains.

Does that make sense? The goal is to figure out _how things actually work_ rather than making claims and arguing past each other until the end of time.

Perhaps you don't care about what the firmware blobs contain, and so you'd rather have an academic debate about logical propositions, but I care about the damn blobs, because it matters for my present and future work.

cpgxiii · on July 18, 2024

These aren't necessarily conflicting assessments. The addition of the GSP to Turing and later GPUs does mean that some behavior can be moved on-device from the drivers. Device initialization and management is an important piece of behavior, certainly, but in the context of the all work done by the Nvidia driver (both kernel and user-space), it is a relatively tiny portion (e.g. compiling/optimizing shaders and kernels, video encode/decode, etc).

phendrenad2 · on July 19, 2024

There IS meaning because this makes it easier to install Nvidia drivers. At least, it reduces the number of failure modes. Now the open-source component can be managed by the kernel team, while the closed-source portion can be changed as needed, not dictated by kernel API changes.

matheusmoreira · on July 18, 2024

Why is the user space component required? Won't they provide sysfs interfaces to control the hardware?

cesarb · on July 18, 2024

It's something common to all modern GPUs, not just NVIDIA: most of the logic is in a user space library loaded by the OpenGL or Vulkan loader into each program. That library writes a stream of commands into a buffer (plus all the necessary data) directly into memory accessible to the GPU, and there's a single system call at the end to ask the operating system kernel to tell the GPU to start reading from that command buffer. That is, other than memory allocation and a few other privileged operations, the user space programs talk directly to the GPU.

bradyriddle · on July 17, 2024

I remember Nvidia getting hacked pretty bad a few years ago. IIRC, the hackers threatened to release everything they had unless they open sourced their drivers. Maybe they got what they wanted.

[0] https://portswigger.net/daily-swig/nvidia-hackers-allegedly-...

justinclift · on July 18, 2024

For Nvidia, the most likely reason they've strongly avoided Open Sourcing their drivers isn't anything like that.

It's simply a function of their history. They used to have high priced professional level graphics cards ("Nvidia Quadro") using exactly the same chips as their consumer graphics cards.

The BIOS of the cards was different, enabling different features. So people wanting those features cheaply would buy the consumer graphics cards and flash the matching Quadro BIOS to them. Worked perfectly fine.

Nvidia naturally wasn't happy about those "lost sales", so began a game of whack-a-mole to stop BIOS flashing from working. They did stuff like adding resistors to the boards to tell the card whether it was a Geforce or Quadro card, and when that was promptly reverse engineered they started getting creative in other ways.

Meanwhile, they couldn't really Open Source their drivers because then people could see what the "Geforce vs Quadro" software checks were. That would open up software countermeasures being developed.

---

In the most recent few years the professional cards and gaming cards now use different chips. So the BIOS tricks are no longer relevant.

Which means Nvidia can "safely" Open Source their drivers now, and they've begun doing so.

--

Note that this is a copy of my comment from several months ago, as it's just as relevant now as it was then: https://news.ycombinator.com/item?id=38418278

SuperNinKenDo · on July 18, 2024

Very interesting, thanks for the perspective. I suspect all the recent loss of face they experienced with the transition to Wayland happening around the time that this motivation evaporated also probably plays a part too though.

I swore off ever again buying Nvidia, or any laptops that come with Nvidia, after all this. Maybe in 10 years they'll have managed to right the brand perceptions of people like myself.

1oooqooq · on July 18, 2024

interesting timing to recall that story. now the same trick is used for h100 vs whatever the throttled-for-embargo-wink-wink Chinese version is called.

but those companies are really adverse to open sourcing because they can't be sure they own all the code. it's decades of copy pasting reference implementations after all

rfoo · on July 18, 2024

> now the same trick is used for h100 vs whatever the throttled-for-embargo-wink-wink Chinese version

No. H20 is a different chip designed to be less compute-dense (by having different combinations of SM/L2$/HBM controller). It is not a throttled chip.

A800 and H800 are A100/H100 with some area of the chip physically blown up and reconfigured. They are also not simply throttled.

1oooqooq · on July 18, 2024

that's what nvidia told everyone in mar 23... but there's a reason why h800 were included last minute on the embargo in oct 23.

rfoo · on July 18, 2024

That's not what NVIDIA claimed, that's what I have personally verified.

> there's a reason why h800 were included last minute

No. Oct 22 restrictions are by itself significantly easier than Oct 23 one. NVIDIA just need to kill 4 NVLink lanes off A100 and you get A800. For H100 you kill some more NVLink until on paper NVLink bandwidth is roughly at A800 level again and then voila.

BIS is certainly pissed off by NVIDIA's attempt at being creative to sell the best possible product to China. So they actually lowered allowed compute number AGAIN in Oct 23. That's what killed H800.

1oooqooq · on July 20, 2024

I see. thanks for the details.

CamperBob2 · on July 18, 2024

The explanation could also be as simple as fear of patent trolls.

dralley · on July 17, 2024

I doubt it. It's probably a matter of constantly being prodded by their industry partners (i.e. Red Hat), constantly being shamed by the community, and reducing the amount of maintenance they need to do to keep their driver stack updated and working on new kernels.

The meat of the drivers is still proprietary, this just allows them to be loaded without a proprietary kernel module.

chillfox · on July 18, 2024

Nvidia has historically given zero fucks about the opinions of their partners.

So my guess is it's to do with LLMs. They are all in on AI, and having more of their code be part of training sets could make tools like ChatGPT/Claude/Copilot better at generating code for Nvidia GPUs.

da_chicken · on July 18, 2024

Yup. nVidia wants those fat compute center checks to keep coming in. It's an unsaturated market, unlike gaming consoles, home gaming PCs, and design/production workstations. They got a taste of that blockchain dollar, and now AI looks to double down on the demand.

The best solution is to have the industry eat their dogfood.

jmorenoamor · on July 18, 2024

I also see this as the main reason. GPU drivers for Linux, as far as I know, were just a niche use case, maybe CUDA planted a small seed, and the AI hype is the flower. Now the industry, not the users, demand drivers, so this became a demanded feature instead of a niche user wish.

A bit sad, but hey, welcome anyways.

p_l · on July 17, 2024

I suspect it's mainly the reduced maintenance and reduction of workload needed to support, especially with more platforms coming to be supported (not so long ago there was no ARM64 nvidia support, now they are shipping their own ARM64 servers!)

What really changed the situation is that Turing architecture GPUs bring new, more powerful management CPU, which has enough capacity to essentially run the OS-agnostic parts of driver that used to be provided as blob on linux.

knotimpressed · on July 17, 2024

Am I correct in reading that as Turing architecture cards include a small CPU on the GPU board, running parts of the driver/other code?

p_l · on July 17, 2024

In Turing microarchitecture, nVidia replaced their old "falcon" cpu with NV-RISCV RV64 chip, running various internal tasks.

"Open Drivers" from nVidia include different firmware that utilizes the new-found performance.

matheusmoreira · on July 18, 2024

How well isolated is this secondary computer? Do we have reason to fear the proprietary software running on it?

p_l · on July 18, 2024

As well isolated as anything else on the bus.

So you better actually use IOMMU

stragies · on July 18, 2024

Ah, yes, the magical IOMMU controller, that everybody just assumes to be implemented perfectly across the board. I'm expecting this to be like Hyperthreading, where we find out 20 years later, that the feature was faulty/maybe_bugdoored since inception in many/most/all implementations.

Same thing with USB3/TB-controllers, NPUs, etc that everybody just expects to be perfectly implemented to spec, with flawless firmwares.

p_l · on July 18, 2024

It's not perfect or anything, but it's usually a step up[1], and the funniest thing is that GPUs generally had less of ... "interesting" compute facilities to jump over from, just easier to access usually. My first 64 bit laptop, my first android smartphone, first few iPhones, had more MIPS32le cores with possible DMA access to memory than the main CPU cores, and that was just counting one component of many (the wifi chip).

Also, Hyperthreading wasn't itself faulty or "bugdoored". The tricks necessary to get high performance out of CPUs were, and then there was intel deciding to drop various good precautions in name of still higher single core performance.

Fortunately, after several years, IOMMU availability becomes more common (current laptop I'm writing this on has proper separate groups for every device it seems)

[1] There's always the OpenBSD of navel gazing about writing "secure" C code, becoming slowly obsolescent thanks to being behind in performance and features, and ultimately getting pwned because your C focus and not implementing "complex" features helping mitigate access results in pwnable SMTPd running as root.

stragies · on July 18, 2024

All fine and well, but I always come back to "If I were a manufacturer/creator of some work/device/software, that does something in the plausible realm of 'telecommunication', how do make sure, that my product can always comply with https://en.wikipedia.org/wiki/Lawful_interception requests? Allow for ingress/egress of data/commands at as low a level as possible!"

So as a chipset creator company director it would seem like a no-brainer to me to have to tell my engineers unfortunately to not fix some exploitable bug in the IOMMU/Chipset. Unless I want to never sell devices that could potentially be used to move citizens internet packets around in a large scale deployment.

And implement/not_fix something similar in other layers as well, e.g. ME.

p_l · on July 18, 2024

If your product is supposed to comply with Lawful Interception, you're going to implement proper LI interfaces, not leave bullshit DMA bugs in.

The very point of Lawful Interception involves explicit, described interfaces, so that all parties involved can do the work.

The systems with LI interfaces also often end up in jurisdictions that simultaneously put high penalties on giving access to them without specific authorizations - I know, I had to sign some really interesting legalese once due to working in environment where we had to balance both Lawful Interception, post-facto access to data, and telecommunications privacy laws.

Leaving backdoors like that is for Unlawful Interception, and the danger of such approaches is greatly exposed in form of Chinese intelligence services exploiting NSA backdoor in Juniper routers (infamous DRBG_EC RNG)

matheusmoreira · on July 18, 2024

> you better actually use IOMMU

Is this feature commonly present on PC hardware? I've only ever read about it in the context of smartphone security. I've also read that nvidia doesn't like this sort of thing because it allows virtualizing their cards which is supposed to be an "enterprise" feature.

brendank310 · on July 18, 2024

Relatively common nowadays. It used to be delineated as a feature in Intel chips as part of their vPro line, but I think it’s baked in. Generally an IOMMU is needed for performant PCI passthrough to VMs, and Windows uses it for DeviceGuard which tries to prevent DMA attacks.

wtallis · on July 18, 2024

Mainstream consumer x86 processors have had IOMMU capability for over a decade, but for the first few years it was commonly disabled on certain parts for product segmentation (eg. i5-3570K had overclocking but no IOMMU, i5-3570 had IOMMU but limited overclocking). That practice died off approximately when Thunderbolt started to catch on, because not having an IOMMU when using Thunderbolt would have been very bad.

p_l · on July 18, 2024

Seems to me that Zen 4 has no issues at all, but bridges/switches require additional interfaces to further fan-out access controls.

kabes · on July 17, 2024

It's hard to believe one of the highest valued companies in the world cares about being shamed for not having open source drivers.

commodoreboxer · on July 17, 2024

They care when it affects their bottom line, and customers leaving for the competition does that.

I don't know if that's what's happening here, honestly, but you're right that they don't care about being shamed, but building a reputation of being hard to work with and target, especially in a growing market like Linux (still tiny, but growing nonetheless, and becoming significantly more important in the areas where non-gaming GPU use is concerned) can start to erode sales and B2B relationships, and the latter particularly if you make the programmers and PMs hate using your products.

gessha · on July 18, 2024

> customers leaving for the competition does that

What competition?

I do agree that companies don’t really care for public sentiment as long as business is going as usual. Nvidia is printing money with their data center hardware [1] where half of their yearly revenue comes from.

https://nvidianews.nvidia.com/news/nvidia-announces-financia...

bryanlarsen · on July 17, 2024

> in a growing market like Linux

Isn't Linux 80% of their market? ML et al is 80% of their sales, and ~99% of that is Linux.

fngjdflmdflg · on July 17, 2024

True, although note that the Linux market itself is increasing in size due to ML. Maybe "increasingly dominant market" is a better phrase here.

bryanlarsen · on July 18, 2024

Hah, good point. The OP was pedantically correct. The implication in "growing market share" is that "market share" is small, but that's definitely reading between the lines!

lmm · on July 17, 2024

Right, and that's where most of their growth is.

nailer · on July 17, 2024

Having products that require a bunch of extra work due to proprietary drivers, especially when their competitors don't require that work, is not good.

josefx · on July 18, 2024

The biggest chunk of that "extra work" would be installing Linux in the first place, given that almost everything comes with Windows out of the box. An additional "sudo apt install nvidia-drivers" isn't going to stop anyone who already got that far.

sam_bristow · on July 18, 2024

Does the "everything comes with Windows out of the box" still apply for the servers and workstations where I imagine the vast majority of these high-end GPUs are going these days?

Arch-TK · on July 18, 2024

Tainted kernel. Having to sort out secure boot problems caused by use of an out of tree module. DKMS. Annoying weird issues with different kernel versions and problems running the bleeding edge.

nailer · on July 18, 2024

Most cloud instances come with Linux out of the box.

ZeroCool2u · on July 18, 2024

I mean I've personally given our Nvidia rep some light hearted shit for it. Told him I'd appreciate if he passed the feedback up the chain. Can't hurt to provide feedback!

nicce · on July 17, 2024

Kernel modules are not user-space drivers which are still proprietary.

bradyriddle · on July 17, 2024

Ooops. Missed that part.

Re-reading that story is kind of wild. I don't know how valuable what they allegedly got would be (silicon, graphics and chipset files) but the hackers accused Nvidia of 'hacking back' and encrypting their data.

Reminds me of a story I heard about Nvidia hiring a private military to guard their cards after entire shipments started getting 'lost' somewhere in asia.

spookie · on July 18, 2024

Wait what? That PMC story got me. Where can I find more info on that lmao?

bradyriddle · on July 18, 2024

I'd heard the story first hand from a guy in san jose. Never looked it up until now. This is the closest thing I could find to it. In which case it sounds like it's been debunked.

[0] https://www.pcgamer.com/no-half-a-million-geforce-rtx-30-ser...

[1] https://www.geeknetic.es/Noticia/20794/Encuentran-en-Corea-5...

porphyra · on July 17, 2024

Much of the black magic has been moved from the drivers to the firmware anyway.

nicman23 · on July 18, 2024

they did release it. a magic drive i have seen, but totally do not own, has it

creata · on July 17, 2024

Huh. Sway and Wayland was such a nightmare on Nvidia that it convinced me to switch to AMD. I wonder if it's better now.

(IIRC the main issue was https://gitlab.freedesktop.org/xorg/xserver/-/issues/1317 , which is now complete.)

snailmailman · on July 17, 2024

Better as of extremely recently. Explicit sync fixes most of the issues with flickering that I’ve had on Wayland. I’ve been using the latest (beta?) driver for a while because of it.

I’m using Hyprland though so explicit sync support isn’t entirely there for me yet. It’s actively being worked on. But in the last few months it’s gotten a lot better

JasonSage · on July 17, 2024

> Better as of extremely recently.

Yup. Anecdotally, I see a lot of folks trying to run wine/games on Wayland reporting flickering issues that are gone as of version 555, which is the most recent release save for 560 coming out this week. It's a good time to be on the bleeding edge.

Fr0styMatt88 · on July 18, 2024

On latest NixOS unstable and KDE + Wayland is still a bit of a dumpster fire for me (3070 + latest NV drivers). In particular there’s a buffer wait bug in EGL that needs fixing on the Nvidia side that causes the Plasma UI to become unresponsive. Panels are also broken for me, with icons not showing.

Having said that, the latest is a pain on X11 right now as well, with frequent crashing of Plasma, which atleast restarts itself.

There’s a lot of bleeding on the bleeding edge right at this moment :)

JasonSage · on July 18, 2024

That's interesting, maybe it's hardware-dependent? I'm doing nixos + KDE + Wayland and I've had almost no issues in day-to-day usage and productivity.

I agree with you that there's a lot of bleeding. Linux is nicer than it used to be and there's less fiddling required to get to a usable base, but still plenty of fiddling as you get into more niche usage, especially when it involves any GPU hardware/software. Yet somehow one can run Elden Ring on Steam via Proton with a few mouse clicks and no issues, which would've been inconceivable to me only a few years ago.

Fr0styMatt88 · on July 18, 2024

Yeah it’s pretty awesome overall. I think the issues are from a few things on my end:

- I’ve upgraded through a few iterations starting with Plasma 6, so my dotfiles might be a bit wonky. I’m not using Home Manager so my dotfiles are stateful.

- Could be very particular to my dock setup as I have two docks + one of the clock widgets.

- Could be the particular wallpaper I’m using (it’s one of the dynamic ones that comes with KDE).

- It wouldn’t surprise me if it’s related to audio somehow as I have Bluetooth set-up for when I need it.

I’m sure it’ll settle soon enough :)

postcert · on July 18, 2024

I've been having a similar flakiness with plasma on Nixos (proprietary + 3070 as well). Sadly can't say whether it did{n't} happen on another distro as I last used Arch around the v535 driver.

I found it funny how silently it would fail at times. After coming out of a game or focusing on something I'd scratch my head as to where did the docks/background went. I'd say you're lucky in that it recovered itself, generally I needed to run `plasmashell` in the alt+f2 run prompt.

asyx · on July 17, 2024

I think it's X11 stuff that is using Vulkan for rendering that is still flickering in 555. This probably affects pretty much all of Proton / Wine gaming.

doix · on July 18, 2024

Any specific examples that you know should be broken? I am on X11 with 555 drivers and an nvidia gpu. I don't have any flickering when I'm gaming, it's actually why I stay on X11 instead of transitioning to wayland.

johnny22 · on July 18, 2024

They are probably talking about running the game in a wayland session via xwayland, since wine's wayland driver is not part of proton yet.

hulitu · on July 17, 2024

You can always use X11. /s

bornfreddy · on July 18, 2024

I know that was a joke, but - as someone who is still on X, what am I missing? Any practical advantages to using Wayland when using a single monitor on desktop computer?

vetinari · on July 18, 2024

Even that single monitor can be hidpi, vrr or hdr (this one is still wip).

Arch-TK · on July 18, 2024

I have a 165 DPI monitor. This honestly just works with far less hassle on X. I don't have to listen to anyone try to explain to me how fractional scaling doesn't make sense (real explanation for why it wasn't supported). I don't have to deal with some silly explanation for why XWayland applications just can't be non-blurry with a fractional or non-1 scaling factor. I can just set the DPI to the value I calculated and things work in 99% of cases. In 0.9% of the remaining cases I need to set an environment variable or pass a flag to fix a buggy application and in the 0.1% of cases I need to make a change to the code.

VRR has always worked for me on single monitor X. I use it on my gaming computer (so about twice a year).

guilhas · on July 19, 2024

Same, can't understand people evangelizing Wayland

I have a laptop 10.1 2560x1600 with a 32' monitor, and another 27', never had any problem

Wayland has practically no advantages, you have to spend hours configuring, and still have apps working badly... they are always just a month away from having "everything" fixed

Maybe Wayland is the future but I'll keep using Xorg distros for the foreseeable future

vetinari · on July 19, 2024

You guys must be using some different X11 than the rest of us.

Basically, with X11 and hidpi, all you can do is to set up the system to announce dpi with certain value and hope, that the clients will cope. Some can (I know of exactly two of them: Chrome and Firefox), others will up bump up the font size and hopefully are using a layout, so the window sizes will adjust to accommodate the textboxes, but all the non-text assets will stay low-res how they were, because they do not have any other. Apps for remote desktop access or vm console won't be able to display remote/vm correctly. And the rest will just ignore that and you get tiny stuff on the display.

And this is just the hidpi issue with single display. Won't go into the problems when running with multiple displays, with different dpi.

I also do not have a faintest idea of what "setting up Wayland" might mean. What did you set up? How? The only thing that needs to "set up" is to pick a wayland session in the display manager. There's no xorg.conf for wayland, setting up drivers, etc. What did you configure "for hours"?

I've been using 4K 27" for over a decade, and Wayland, since Fedora made it default. Since I have no 20-year old xdotool scripts, or others that inject events or try to grab pixmaps, I've had no problem.

Arch-TK · on July 20, 2024

It's possible there might be a misunderstanding as to what "working" means. For me, if there's vaseline anywhere on my screen, that's strictly worse than tiny fonts I need a magnifying glass for. I'd rather have no scaling than nearest neighbour interpolation.

> You guys must be using some different X11 than the rest of us.

Speak for yourself, I know plenty of people who are able to get non-96-DPI working on X with just Xft.dpi and some environment variables.

> Some can (I know of exactly two of them: Chrome and Firefox), others will up bump up the font size and hopefully are using a layout, so the window sizes will adjust to accommodate the textboxes, but all the non-text assets will stay low-res how they were, because they do not have any other.

This is an application bug (non text assets not getting scaled up) and will hardly be fixed with anything other than vaseline the text and icons on an equivalently non-DPI-change supporting application on wayland.

The vast majority of modern software works just fine.

> Apps for remote desktop access or vm console won't be able to display remote/vm correctly.

Does Wayland solve this in any other way other than to vaseline it all up? xfreerdp has /scale. When it comes to VMs I use through spice you just set their DPI settings individually to match your host, then you get nice scaling without vaseline. AFAIK in wayland this all gets vaselined.

> And this is just the hidpi issue with single display. Won't go into the problems when running with multiple displays, with different dpi.

Don't run multiple displays with different DPI. It's an unsolvable problem in the X11/Wayland ecosystem. You need to keep everything as postscript or something equivalent all the way up until the point you know which monitor it's rendered on.

Of all the things Wayland could have actually gone out and fixed, this is one they eschewed in favour of "ah screw it, just give all the applications some graphics buffers and let them figure it out".

> I also do not have a faintest idea of what "setting up Wayland" might mean. What did you set up? How? The only thing that needs to "set up" is to pick a wayland session in the display manager. There's no xorg.conf for wayland, setting up drivers, etc. What did you configure "for hours"?

I know exactly what guilhas means.

Some people are not content with Ubuntu Gnome at a integer scaling factor, they're running highly bespoke setups where everything from the display manager to the screen-grab stuff is customized or custom written. So you spend a lot of time and effort switching to sway, switching to wayland, switching to wayland native versions of a terminal, fixing firefox so it starts in wayland mode, fiddling with the nonsensical scaling settings to actually get firefox to render at the right size, figuring out how to get your screenshot binding to work again, figuring out how to get all your applications to start in the right version, being dismayed when something which still uses X11 runs in XWayland and looks like vaseline because of weird design decisions which are incomprehensible (meanwhile that same application with Xft.dpi set to the right value renders flawlessly).

Eventually you get it all back up and running and you play with it for a week and you spot 20 things which subtly work differently or outright break, you spend hours looking for a solution to only get half of it working.

Right now wayland works mostly fine for the Ubuntu Gnome user or the Kubuntu user (except issues getting non-integer scaling factors working or issues with things needing XWayland) but it's nowhere near as easy to get up and running for someone running a non-standard setup.

joecool1029 · on July 17, 2024

It's buggy still with sway on nvidia. I really thought the 555 driver would wrinkle out last of the issues but it still has further to go. Switched to kde plasma 6 on wayland since then and it's been great, not buggy at all.

XorNot · on July 18, 2024

Easy Linux use is what keeps me firmly on AMD. This move may earn them a customer.

modzu · on July 18, 2024

why switch to amd and not just switch to X? :D

whalesalad · on July 18, 2024

once you go Wayland you usually don’t go back :)

Void_Kitty · on July 22, 2024

I tried wayland (on amd) and found it annoying to work with compaired to x11 without any apparent benefits, wayland is definitely the future, but i don't think the future is now

kiney · on July 18, 2024

I tested wayland for a while to see what the hype is about. No uoside lits of small workflows broken. Back to Xorg it was.

account42 · on July 18, 2024

Why not both?

sillywalk · on July 17, 2024

From the github repo[0]:

Most of NVIDIA's kernel modules are split into two components:

    An "OS-agnostic" component: this is the component of each kernel module that is independent of operating system.

    A "kernel interface layer": this is the component of each kernel module that is specific to the Linux kernel version and configuration.

When packaged in the NVIDIA .run installation package, the OS-agnostic component is provided as a binary:

[0] https://github.com/NVIDIA/open-gpu-kernel-modules

p_l · on July 17, 2024

That was the "classic" drivers.

The new open source ones effectively move majority of the OS-agnostic component to run as blob on-GPU.

arghwhat · on July 17, 2024

Not quite - it moves some logic to the GSP firmware, but the user-space driver is still a significant portion of code.

The exciting bits there is the work on NVK.

p_l · on July 17, 2024

Yes, I was not including userspace driver in this, as a bit "out of scope" for the conversation :D

hypeatei · on July 17, 2024

How is the NVIDIA driver situation on Linux these days? I built a new desktop with an AMD GPU since I didn't want to deal with all the weirdness of closed source or lacking/obsolete open source drivers.

jcranmer · on July 17, 2024

I built my new-ish computer with an AMD GPU because I trusted in-kernel drivers better than out-of-kernel DKMS drivers.

That said, my previous experience with the DKMS driver stuff hasn't been bad. If you use Nvidia's proprietary driver stack, then things should generally be fine. The worst issues are that Nvidia has (historically, at least; it might be different for newer cards) refused to implement some graphics features that everybody else uses, which means that you basically need entirely separate codepaths for Nvidia in window managers, and some of them have basically said "fuck no" to doing that.

mepian · on July 17, 2024

The current stable proprietary driver is a nightmare on Wayland with my 3070, constant flickering and stuttering everywhere. Apparently the upcoming version 555 is much better, I'm sticking with X11 until it comes out. I never tried the open-source one yet, not sure if it supports my GPU at all.

bcrescimanno · on July 17, 2024

The 555 version is the current version. It was officially released on June 27.

https://www.phoronix.com/news/NVIDIA-555.58-Linux-Driver

JasonSage · on July 17, 2024

In defense of the parent, upcoming can still be a relative term, albeit a bit misleading. For example: I'm running the 550 drivers still because my upstream nixos-unstable doesn't have 555 for me yet.

zxexz · on July 18, 2024

I love NixOS, and the nvidia-x11 package is truly wonderful and captures so many options. But having such a complex package makes updating and regression testing take time. For ML stuff I ended up using it as the basis for an overlay, and ripping out literally everything I don’t need, which makes it a matter of minutes usually to make the changes requires to upgrade when a new driver is released I’m running completely headless because these are H100 nodes, and I just need persistenced and fabricmanager, and GDRMA (which wasn’t working at all, causing me to go down this rabbit hole of stripping everything away until I could figure out why).

postcert · on July 18, 2024

I was going to say specialisations might be useful for you to keep a previous driver version around for testing but you might be past that point!

Having the ability to keep alternate configurations for $previous_kernel and $nvidia_stable have been super helpful in diagnosing instead of rolling back.

mananaysiempre · on July 17, 2024

> nixos-unstable doesn't have 555

Version 555.58.02 is under “latest” in nixos-unstable as of about three weeks ago[1]. (Somebody should check with qyliss if she knows the PR tracker is dead... But the last nixos-unstable bump was two days ago, so it’s there.)

[1] https://github.com/NixOS/nixpkgs/commit/4e15c4a8ad30c02d6c26...

JasonSage · on July 17, 2024

`nvidia-smi` shows that my driver version is 550.78. I ran `nixos-rebuild switch --upgrade` yesterday. My nixos channel is `nixos-unstable`.

Do you know something I don't? I'd love to be on the latest version.

I should have written my post better, it implies that 555 does not exist in nixpkgs, which I never meant. There's certainly a phrasing that captures what I'm seeing more accurately.

atrus · on July 18, 2024

Are you using flakes? If you don't do `nix flake update` there won't be all that much to update.

JasonSage · on July 18, 2024

I am! I forgot about this. Mental model check happening.

(Still on 550.)

mananaysiempre · on July 18, 2024

I did not mean to chastise you or anything, just to suggest you could be able to have a newer driver if you had missed the possibility.

The thing is, AFAIU, NVIDIA has several release channels for their Linux driver[1] and 555 is not (yet?) the "production" one, which is what NixOS defaults to (550 is). If you want a different degree of freshness for your NVIDIA driver, you need to say so explicitly[2]. The necessary incantation should be

  hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.latest;

This is somewhat similar to how you get a newer kernel by setting boot.kernelPackages to linuxPackages_latest, for example, if case you've ever done that.

[1] https://www.nvidia.com/en-us/drivers/unix/

[2] https://nixos.wiki/wiki/Nvidia

JasonSage · on July 18, 2024

I had this configuration but was lacking a flake update to move my nixpkgs forward despite the channel, which I can understand much better looking back.

Thanks for the additional info, this HM thread has helped me quite a bit.

SushiHippie · on July 17, 2024

The versions that nixos provides are based on the files in this repo

https://github.com/aaronp24/nvidia-versions

See: https://github.com/NixOS/nixpkgs/blob/9355fa86e6f27422963132...

You could also opt to use the latest driver instead of stable: https://nixos.wiki/wiki/Nvidia

mepian · on July 17, 2024

Yep, I'm on openSUSE Tumbleweed, and it's not rolled out there yet. I would rather wait than update my drivers out-of-band.

gmokki · on July 18, 2024

I switched to Wayland 10 years ago when it became an option ok Fedora. First thing I had to do was to drop NVIDIA and switch to Intel GPU, and past 5 years to AMD GPU. Makes a big difference if the upstream kernel is supported.

Maybe NVIDIA drivers have kind of worked on 12 month old kernels that Ubuntu on average uses.

misterbishop · on July 17, 2024

this is resolved in 555 (currently running 555.58.02). my asus zephyrus g15 w/ 3060 is looking real good on Fedora 40. there's still optimizations needed around clocking, power, and thermals. but the graphics presentation layer has no issues on wayland. that's with hybrid/optimus/prime switching, which has NEVER worked seamlessly for me on any laptop on linux going back to 2010. gnome window animations remain snappy and not glitchy while running a game. i'm getting 60fps+ running baldurs gate 3 @ 1440p on the low preset.

robviren · on July 17, 2024

Had similar experience with my Legion 5i 3070 with Wayland and Nvidia 555, but my HDMI out is all screwed up now of course. Working on 550. One step forward and one step back.

misterbishop · on July 18, 2024

is there a mux switch?

llmblockchain · on July 17, 2024

I have a 3070 on X and it has been great.

levkk · on July 17, 2024

Same setup here. Multiple displays don't work well for me. One of the displays doesn't often get detected after resuming screen saver.

llmblockchain · on July 17, 2024

I have two monitors connected to the 3070 and it works well. The only issue I had was suspending, the GPU would "fall of the bus" and not get its power back when the PC woke up. I had to add the kernel line "pcie_aspm=off" to prevent the GPU from falling asleep.

So... not perfect, but it works.

josephg · on July 17, 2024

Huh. I’m using 2 monitors connected to a 4090 on Linux mint - which is still using X11. It works flawlessly, including DPI scaling. Wake from sleep is fine too.

I haven’t tried wayland yet. Sounds like it might be time soon given other comments in this thread.

anon291 · on July 17, 2024

I've literally never had an issue in decades of using NVIDIA and linux. They're closed source, but the drivers work very consistently for me. NVIDIA's just the only option if you want something actually good and to run ML workloads as well.

sqeaky · on July 17, 2024

> but the drivers work very consistently for me

The problem with comments like this is that you never know if you will be me or you on your graphics card or laptop.

I have tried nvidia a few times and kept getting burnt. AMD just works. I don't get the fastest ML machine, but I am just a tinkerer there and OpenCL works fine for my little toy apps and my 7900XTX blazes through every wine game.

If you need it professionally than you need it, warts an all. For any casual user that 10% extra gaming performance needs to weighed against reliability.

Workaccount2 · on July 17, 2024

It also depends heavily on the user.

A mechanic might say "This car has never given me a problem" because the mechanic doesn't consider cleaning an idle bypass circuit or adjusting valve clearances to be a "problem". To 99% percent of the population though, those are expensive and annoying problems because they have no idea what those words even mean, much less the ability to troubleshoot, diagnose, and repair.

lyu07282 · on July 17, 2024

a lot has probably to do with not really understanding their distributions package manager and lkms specifically, I also always suspected that most Linux users don't know if they are using Wayland or X11 and the issues they had were actually Wayland specific ones they wouldn't have with Nvidia/x11 and come to think of it, how would they even know if it's a GPU driver issue in the first place? Guess I'm the mechanic in your analogy.

vetinari · on July 18, 2024

If there's an issue with Nvidia/Wayland and there isn't with AMD/Wayland or Intel/Wayland, it is Nvidia issue then, not Wayland one.

sqeaky · on July 17, 2024

When I run Gentoo or Arch, I know. But when I run Ubuntu or Fedora, should I have needed to know?

On plenty of distros "I want to install it and forget about is reasonable" and on both Gentoo and Ubuntu I have rebooted from a working system into a system where the display stopped working, at least on Gentoo I was ready because I broke it somehow.

lyu07282 · on July 18, 2024

Absolutely I once had an issue with kernel/user-space driver version mismatch in Ubuntu, trivial to fix and the kernel logs tell you what's wrong. But yeah I get that most users don't read their kernel logs and it shouldn't be an expectation to do so for normal users of linux. The experiences are just very different, it's why the car mechanic analogy fits so well.

I think it also got so much better over time, I've been using Linux since debian woody (22 years ago) the stuff you had to deal with back then heavily skews my perspective on what users today see as unacceptable brokenness in the Nvidia driver.

anon291 · on July 18, 2024

I've run NixOS for almost a decade now and I honestly would not recommend anything else. I've had many issues with booting on almost every distro. They're about as reliable as Windows in that regard. NixOS has been absolutely rock solid; beyond anything I could possibly have hoped for. In the extremely rare case my system would not boot, I've either found a hardware problem that would affect anyone, or I could just revert to a previous system revision and boot up. Never had any problem. No longer use anything else because it's just too risky

chasil · on July 17, 2024

If you use a search engine for "Torvalds Nvidia" you will discern a certain attitude towards Nvidia as a corporation and its products.

This might provide you a suggestion that alternate manufacturers should be considered.

I have confirmed this to be the case on Google and Bing, so DuckDuckGo and Startpage will also exhibit this phenomena.

Dylan16807 · on July 18, 2024

An opinion on support from over ten years ago is not a very strong suggestion.

chasil · on July 18, 2024

Your problem there is that both search engines place this image and backstory at the top of the results, so neither Google nor Bing agree with any of you.

If you think they're wrong, be sure to let them know.

lyu07282 · on July 18, 2024

What torvalds is complaining about is absolutely true, but the problem is that most users do not give a shit about those issues. Torvalds disagreement wasn't about bugs in-, or complains about the quality of the proprietary driver, he complained about nvidias lack of open source contributions and bad behavior towards the kernel developer community. But users don't care if they run a proprietary driver as long as it works (and it does work fine for most people).

So you see now why that's not very relevant to end-users experiences they were talking about?

chasil · on July 18, 2024

Dylan16807 · on July 18, 2024

Do you think Google and Bing are endorsing top results, and in particular endorsing a result like that in the specific context of what manufacturers I consider buying from?

That's the only way they would be disagreeing with me.

dahart · on July 18, 2024

Torvalds has said nasty mean things to a lot of people in the past, and expressed regret over his temper & hyperbole. Try searching for something more recent https://youtu.be/wvQ0N56pW74

lmm · on July 17, 2024

> AMD just works. I don't get the fastest ML machine, but I am just a tinkerer there and OpenCL works fine for my little toy apps and my 7900XTX blazes through every wine game.

That's the opposite of my experience. I'd love to support open-source. But the AMD experience is just too flaky, too card-dependent. NVidia is rock-solid (maybe not for Wayland, but I never wanted Wayland in the first place).

sqeaky · on July 18, 2024

What kind of flakiness? The only AMD GPU problem I have had involved a lightning strike killing a card while I was gaming.

My nvidia problems are generally software and update related. The NVidia stuff usually works on popular distros, but as soon anything custom or a surprise update happens then there is a chance things break.

lmm · on July 19, 2024

> What kind of flakiness?

Black screens, X server crashes, OpenGL programs either crashing or running slow. Just general unreliability. Different driver versions seemed more reliable than others, which meant I was always very reluctant to upgrade, which then gives you more problems as you end up pinning old versions which then makes it harder to troubleshoot online...

> My nvidia problems are generally software and update related. The NVidia stuff usually works on popular distros, but as soon anything custom or a surprise update happens then there is a chance things break.

I mean if you run mixed versions then yeah that will work for some upgrades and not others. A decent package manager should prevent that; some distros refuse to put effort into packaging the nvidia-drivers out of principle. But if you keep the drivers in sync (which is what the official package from NVidia themselves does, it's not their fault some distros choose to explode it into multiple packages) and properly rebuild just the kernel module every time you do a kernel upgrade (or just reinstall the whole driver if you prefer), then it's rock solid.

pizza234 · on July 17, 2024

Up to a couple of years ago, before permanently moving to AMD GPUs, I couldn't even boot Ubuntu with an Nvida GPU. This was because Ubuntu booted by default with Nouveau, which didn't support a few/several series (I had at least two different series).

The cards worked fine with binary drivers once the system was installed, but AFAIR, I had to integrate the binary driver packages in the Ubuntu ISO in order to boot.

I presume that now, the situation is much better, but necessiting binary drivers can be a problem in itself.

resoluteteeth · on July 17, 2024

Are you using wayland or are you still on x11? My experience was that the closed source drivers were fine with x11 but a nightmare with wayland.

bobajeff · on July 17, 2024

I did when my card stopped being supported by all the distros because it was too old while the legacy driver didn't fully work the same.

Keyframe · on July 18, 2024

Me too. Now I have a laptop with discrete nvidia and an eGPU with 3090 in it, a desktop with 4090, another laptop with another discrete nvidia.. all switching combinations work, acceleration works, game performance is on par with windows (even with proton to within a small percentage or even sometimes better). All out of the box with stock Ubuntu and installing driver from Nvidia site.

The only "trick" is I'm still on X11 and probably will stay. Note that I did try wayland on few occasions but I steered away (mostly due to other issues with it at the time).

isatty · on July 18, 2024

Likewise. Rock solid for decades in intel + nvidia proprietary drivers even when doing things like hot plugging for passthroughs.

anon291 · on July 18, 2024

Yeah I once worked at a cloud gaming company that used Wine on Linux on NVIDIA to stream cloud games. They were the only real option for multi-game performance, and very rock solid in terms of uptime. I truly have no idea what people are talking about. Yes I use X11.

l33tman · on July 17, 2024

Same here, been using the nvidia binary drivers on a dozen computers with various other HW and distros for decades with never any problems whatsoever.

drdaeman · on July 17, 2024

3090 owner here.

Wayland is even worse mess than it normally is. Used to flicker real bad before 555.58.02, less so with the latest driver - but still has some glitches with games. A bunch of older Electron apps still fail to render anything and require hardware acceleration disabled. I gave up trying to make it all work - can't get rid of all the flicker and drawing issues, plus Wayland seems to be a real pain in the ass with HiDPI displays.

X11 sort of works, but I had to entirely disable DPMS or one of my monitors never comes back online after going to sleep. I thought it was my KVM messing up, but that happened even with a direct connection... no idea what's going on there.

CUDA works fine, save for the regular version compatibility hiccups.

senectus1 · on July 17, 2024

4070ti super here, X11 is fine, i have zero issues.

Wayland is mostly fine, though i get some windowframe glitches when maxing them to the monitor and a another issue that i'm pretty sure is wayland but it has obnly happened a couple of times and it locks the whole device up. I cant prove it yet.

adrian_b · on July 18, 2024

I am not using Wayland and I do not have any intention to use it, therefore I do not care for any problems caused by Wayland not supporting NVIDIA and demanding that NVIDIA must support Wayland.

I am using only Linux or FreeBSD on all my laptop, desktop or server computers.

On desktop and server computers I did not ever have the slightest difficulty with the NVIDIA proprietary drivers, either for OpenGL or for CUDA applications or for video decoding/encoding or for multiple monitor support, with high resolution and high color depth, on either Gentoo/Funtoo Linux or FreeBSD, during the last two decades. I also have AMD GPUs, which I use for compute applications (because they are older models, which still had FP64 support). For graphics applications they frequently had annoying bugs, unlike NVIDIA (however my AMD GPUs have been older models, preceding RDNA, which might be better supported by the open-source AMD drivers).

The only computers on which I had problems with NVIDIA on Linux were those laptops that used the NVIDIA Optimus method of coexistence with the Intel integrated GPUs. Many years ago I have needed a couple of days to properly configure the drivers and additional software so that the NVIDIA GPU was selected when desired, instead of the Intel iGPU. I do not know if any laptops with NVIDIA Optimus still exist. The laptops that I bought later had video outputs directly from the NVIDIA GPU, so there was no difference between them and desktops and the NVIDIA drivers worked flawlessly.

Both on Gentoo/Funtoo Linux and FreeBSD I never had to do anything else but to give the driver update command and everything worked fine. Moreover, NVIDIA has always provided a nice GUI application "NVIDIA X Server Settings", which provides a lot of useful information and which makes very easy any configuration tasks, like setting the desired positions of multiple monitors. A few years ago there was nothing equivalent for the AMD or Intel GPU drivers, but that might have changed meanwhile.

tadasv · on July 17, 2024

great. rtx 4090 works out of the box after installing drivers from non-free. That's on debian bookworm.

littlecranky67 · on July 20, 2024

I got my nvidia 1060 back then during the crypto crysis when the price of AMD GPUs were inflated due to miners. Hesitant and scepital about Linux support, I upgraded the same machine with that GPU since 2016 von Ubuntu 14.04, to 18.04 and now 24.04 - without any nvidia driver issues anytime whatsoever. When I read about issues with nvidias drivers, it is mostly people with rare distro or rolling release ones, with changing kernel versions very frequently and failure to recompile with the binary drivers. For LTS distros you will likely have no issues.

jppittma · on July 17, 2024

4070 worked out of the box on my arch system. I used the closed source drivers and X11 and I've not encountered a single problem.

My prediction is that it will continue to improve if only because people want to run nvidia on workstations.

tgsovlerkhgsel · on July 18, 2024

My experience with an AMD iGPU on Linux was so bad that my next laptop will be Intel. Horrible instability to the point where I could reliably crash my machine by using Google Maps for a few minutes, on both Chrome and Firefox. It got fixed eventually - with the next Ubuntu release, so I had a computer where I was afraid to use anything with WebGL for half a year.

mathfailure · on July 17, 2024

Depends on the version of drivers: 550 version results into black screen (you have to kill and restart X server) after waking up from sleep. 535 version doesn't have this bug. Don't know about 555.

Also tearing is a bitch. Still. Even with ForceCompositionPipeline.

art0rz · on July 17, 2024

I've been running Arch with KDE under Wayland on two different laptops both with NVIDIA GPUs using proprietary drivers for years and have not run into issues. Maybe I'm lucky? It's been flawless for me.

lyu07282 · on July 17, 2024

The experiences always vary quite a lot, it depends so much on what you do with it. For example discord doesn't support screen sharing with Wayland, it's just one small example but those can add up over time. Another example is display rotation which was broken in kde for a long time (recently fixed).

DaoVeles · on July 17, 2024

I have never had an issue with them. That said I typically go mid range on cards so they are usually hardened architecture due to a year or two of being in the high end.

devwastaken · on July 18, 2024

KDE plasma 6 + Nvidia beta 555 works well. Have to make .desktop files to launch some applications explicitly Wayland.

green-salt · on July 17, 2024

Whatever pop_os uses has been quite stable for my 4070.

tormeh · on July 17, 2024

Pop uses X by default because of Nvidia.

segmondy · on July 17, 2024

plug, install then play, I got 3 different Nvidia GPU sets and all running without any issue, nothing crazy to do but follow installation instructions.

anonym29 · on July 18, 2024

To some of us, running any closed source software in userland qualifies as quite crazy indeed.

jcalvinowens · on July 17, 2024

Throwing the tarball over the wall and saying "fetch!" is meaningless to me. Until they actually contribute a driver to the upstream kernel, I'll be buying AMD.

aseipp · on July 18, 2024

You can just use Nouveau and NVK for that if you just need workstation graphics (and the open-gpu-modules source code/separate GSP release has been a big uplift to Nouveau too, at least.)

jcalvinowens · on July 18, 2024

Nouveau is great, and I absolutely admire what the community around it has been able to achieve. But I can't imagine choosing that over AMD's first class upstream driver support today.

neop1x · on July 20, 2024

IIRC hardware video decoding of HEVC didn't work for me with nouveau

einpoklum · on July 17, 2024

The title of this statement is misleading:

NVIDIA is not transitioning to open-source drivers for its GPUs; most or all user-space parts of the drivers (and most importantly for me, libcuda.so) are closed-source; and as I understand from others, most of the logic is now in a binary blob that gets sent to the GPU.

Now, I'm sure this open-sourcing has its uses, but for people who want to do something like a different hardware backend for CUDA with the same API, or to clear up "corners" of the API semantics, or to write things in a different-language without going through the C API - this does not help us.

floam · on July 17, 2024

NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules

or

NVIDIA Transitions Towards Fully Open-Source GPU Kernel Modules?

slashdave · on July 17, 2024

Not much point in a "partially" open-source kernel module.

floam · on July 17, 2024

But “fully towards” is pretty ambiguous, like an entire partial implementation.

Anyhow I read the article, I think they’re saying fully as in exclusively, like there eventually will not be both a closed source and open source driver co-maintained. So “fully open source” does make more sense. The current driver situation IS partially open source, because their offerings currently include open and closed source drivers and in the future the closed source drivers may be deprecated?

einpoklum · on July 17, 2024

See my answer. It's not going to be fully-open-source drivers, it's rather that all drivers will have open-source kernel modules.

slashdave · on July 18, 2024

You can argue against proprietary firmware, but is this all that different from other types of devices?

einpoklum · on July 18, 2024

Other device manufacturers with proprietary drivers don't engage in publicity stunts to make it sound like their drivers are FOSS or that they embrace FOSS (or just OSS).

j4hdufd8 · on July 17, 2024

haven't read it but probably the former

throwadobe · on July 17, 2024

"towards" basically negates the "fully" before it for all real intents and purposes

magicloop · on July 17, 2024

Remember that time when Linus looked at the camera and gave Nvidia the finger. Has that time now passed? Is it time to reconcile? Or are there still some gotchas?

jaimex2 · on July 18, 2024

These are kernel modules not the actual drivers. So the finger remains up.

CivBase · on July 18, 2024

Too late for me. I tried switching to Linux years ago but failed because of the awful state of NVIDIA's drivers. Switched to AMD least year and it's been a breeze ever since.

Gaming on Linux with an NVIDIA card (especially an old one) is awful. Of course Linux gamers aren't the demographic driving this recent change of heart so I expect it to stay awful for a while yet.

berkeleyjunk · on July 17, 2024

As someone who is pretty skeptical and reads the fine print, I think this is a good move and I really do not see a downside (other than the fact that this probably strengthens the nVidia monoculture).

vlovich123 · on July 17, 2024

AFAIK I believe all they did was move the closed source user space driver code to their opaque firmware blob leaving a thin shim in the kernel.

In essence I don’t believe that much has really changed here.

adrian_b · on July 18, 2024

Having as open-source all the kernel, more precisely all the privileged code, is much more important for security than having as open-source all the firmware of the peripheral devices.

Any closed-source privileged code cannot be audited and it may contain either intentional backdoors, or, more likely, bugs that can cause various undesirable effects, like crashes or privilege escalation.

On the other hand, in a properly designed modern computer any bad firmware of a peripheral device cannot have a worse effect than making that peripheral unusable.

The kernel should take care, e.g. by using the I/O MMU, that the peripheral cannot access anything where it could do damage, like the DRAM not assigned to it or the non-volatile memory (e.g. SSDs) or the network interfaces for communicating with external parties.

Even when the peripheral is so important as the display, a crash in its firmware would have no effect if the kernel had reserved some key combination to reset the GPU (while I am not aware of such a useful feature in Linux, its effect can frequently be achieved by switching, e.g. with Alt+F1, to a virtual console and then back to the GUI, the saving and restoring of the GPU state together with the switching of the video modes being enough to clear some corruption caused by a buggy GPU driver or a buggy mouse or keyboard driver).

In conclusion, making the NVIDIA kernel driver as open source does not deserve to have its importance minimized. It is an important contribution to a more secure OS kernel.

The only closed-source firmware that must be feared is that which comes from the CPU manufacturer, e.g. from Intel, AMD, Apple or Qualcomm.

All such firmware currently includes various features for remote management that are not publicly documented, so you can never be sure if they can be properly disabled, especially when the remote management can be done wirelessly, like through the WiFi interface of the Intel laptop CPUs, so you cannot interpose an external firewall to filter the network traffic of any "magic" packets.

A paranoid laptop user can circumvent the lack of control over the firmware blobs from the CPU manufacturer by disconnecting the internal antennas and using an external cheap and small single-board computer for all wired and wireless network access, which must run a firewall with tight rules. Such a SBC should be chosen among those for which complete hardware documentation is provided, i.e. including its schematics.

stragies · on July 18, 2024

Everything you wrote assumes the IOMMUs across the board to be 100% correctly implemented without errors/bugdoors.

People used to believe similar things about Hyperthreading, glitchability, ME, Cisco, boot-loaders, ... the list goes on.

adrian_b · on July 18, 2024

There still is a huge difference between running privileged code on the CPU, for which there is nothing limiting what it can do, and code that runs on a device, which should normally be contained by the I/O MMU, except if the I/O MMU is buggy.

The functions of an I/O MMU for checking and filtering the transfers are very simple, so the probability of non-intentional bugs is extremely small in comparison with the other things enumerated by you.

stragies · on July 18, 2024

Agreed, that the feature-set of IOMMU is fairly small, but is this function not usually included in one of the Chipset ICs, which do run a lot other code/functions alongside a (hopefully) faithful correct IOMMU routine?

Which -to my eyes- would increase the possibility of other system parts mucking with IOMMU restrictions, and/or triggering bugs.

saagarjha · on July 18, 2024

Did you run this through a LLM? I'm not sure what the point is of arguing with yourself and bringing up points that seem tangential to what you started off talking about (…security of GPUs?)

adrian_b · on July 18, 2024

I have not argued with myself. I do not see what made you believe this.

I have argued with "I don’t believe that much has really changed here", which is the text to which I have replied.

As I have explained, an open-source kernel module, even together with closed-source device firmware, is much more secure than a closed-source kernel module.

Therefore the truth is that a lot has changed here, contrary to the statement to which I have replied, as this change makes the OS kernel much more secure.

stkdump · on July 17, 2024

But the firmware runs directly on the hardware, right? So they effectively rearchitected their system to move what used to be 'above' the kernel to 'below' the kernel, which seems like a huge effort.

vlovich123 · on July 17, 2024

It’s some effort but I bet they added a classical serial CPU to run the existing code. In fact, [1] suggests that’s exactly what they did. I suspect they had other reasons to add the GSP so the amortized cost of moving the driver code to firmware was actually not that large all things considered and in the long term reduces their costs (eg they reduce the burden further of supporting multiple OSes, they can improve performance further theoretically, etc etc)

[1] https://download.nvidia.com/XFree86/Linux-x86_64/525.78.01/R...

p_l · on July 17, 2024

That's exactly what happened - Turing microarchitecture brought in new[1] "GSP" which is capable enough to run the task. Similar architecture happens AFAIK on Apple M-series where the GPU runs its own instance of RTOS talking with "application OS" over RPC.

[1] Turing GSP is not the first "classical serial CPU" in nvidia chips, it's just first that has enough juice to do the task. Unfortunately without recalling the name of the component it seems impossible to find it again thanks to search results being full of nvidia ARM and GSP pages...

mepian · on July 17, 2024

>the name of the component

Falcon?

p_l · on July 17, 2024

THANK YOU, that was the name I was forgetting :)

here's[1] a presentation from nvidia regarding (unsure if done or not) plan for replacing Falcon with RISC-V, [2] suggests the GSP is in fact the "NV-RISC" mentioned in [1]. Some work on reversing Falcon was apparently done for Switch hacking[3]?

[1] https://riscv.org/wp-content/uploads/2016/07/Tue1100_Nvidia_... [2] https://www.techpowerup.com/291088/nvidia-unlocks-gpu-system... [3] https://github.com/vbe0201/faucon

knotimpressed · on July 17, 2024

Would you happen to have a source or any further readings about Apple M-series GPUs running their own RTOS instance?

p_l · on July 17, 2024

Asahi Linux documentation has pretty good writeup.

The GPU is described here[1] and the mailbox interface used generally between various components is described here [2]

[1] https://github.com/AsahiLinux/docs/wiki/HW%3AAGX#overview

[2] https://github.com/AsahiLinux/docs/wiki/HW%3AASC

imtringued · on July 17, 2024

Why? It should make it much easier to support Nvidia GPUs on Windows, Linux, Arm/x86/RISC-V and more OSes with a single firmware codebase per GPU now.

stkdump · on July 17, 2024

Yes makes sense, in the long run it should make their life easier. I just suspect that the move itself was a big effort. But probably they can afford that nowadays.

rldjbpin · on July 18, 2024

mind the wording they've used here - "fully towards open-source" and not "towards fully open-source".

big difference. almost nobody is going to give you the sauce hidden behind blobs. but i hope the dumb issues of the past (imagine using it on laptops with switchable graphics) go away slowly with this and it is not only for pleasing the enterprise crowd.

svlasov · on July 19, 2024

All `g_bindata_k*.c` files are essentially blobs with no source provided:

https://github.com/NVIDIA/open-gpu-kernel-modules/tree/main/...

doctoboggan · on July 18, 2024

My guess is Meta and/or Amazon told Nvidia that they would contribute considerable resources to development as long as the results were open source. Both companies bottom lines would benefit from improved kernel modules, and like another commenter said elsewhere, Nvidia doesn't have much to lose.

benjiweber · on July 17, 2024

I wonder if we'll ever get hdcp on nvidia. As much as I enjoy 480p video from streaming services.