> and I annoyingly frequently have had to reinstall the OS from scratch Is there...

codetrotter · on Oct 25, 2021

One of the reasons is that the computer in question has a problem where suddenly everything completely freezes, and I have to forcibly reboot the computer with the hardware reset button. And sometimes after rebooting I find that files on the file system have become corrupted.

Sometimes the machine will run for days without freezing up. Sometimes it can freeze up after just a couple of hours.

Sometimes I am browsing the web when it happens. Sometimes I am writing code and compiling stuff when it happens.

I think the longest uptime I have had on it after the problem started without it freezing has been maybe 8 days. It’s been like this for a long while. A couple of years at least.

And when it freezes it’s not just the UI. It even stops responding to ping.

A few months back someone suggested that it may be due to a faulty PSU. But even after changing the PSU to a new one, it still happens.

And it’s been happening both with different Linux distros and with FreeBSD.

My hardware is:

- MSI B350 Tomahawk motherboard

- AMD Ryzen 7 1700 CPU

- MSI GeForce GTX 1060 6GB graphics card

- Crucial DDR4 2133MHz 32GB ECC (2x16GB) CT2K16G4WFD8213 RAM

I bought those components in March of 2018, according to the order history in my account on the store where I bought it.

I don’t remember if I ever ran a memory test, but I will run one soon anyway because I suspect that faulty RAM may be the reason. It just takes a long time to run a memory test on 32 GB of RAM, so whenever I think about running it I don’t have time to do so at that moment. But as I am typing this I realize that I’ve got to go ahead and actually find time for it soon.

Another couple of ideas I’ve had are that maybe the graphics card or the nvidia drivers are to blame. Or that I may have damaged some component with static electricity at some point. Or maybe it’s a thermal issue.

I most recently did a fresh install earlier today, and at the same time I switched out the SSD that was in it for another one. As far as I can remember though, I think I was having this kind of problem even before I was using the SSD that I switched out today. But I am not sure about that. So it could be that the SSD was the problem. But it may be a few days until it freezes again if it’s not the reason. Or it could happen in some hours.

I’ve also tried to look for clues in /var/log/syslog but haven’t really known what to look for specifically. And since even pinging the machine stops working when it freezes I am not sure the system would even be able to write to any logs at the moment that it freezes, since it seems that at that point pretty much everything has stopped working.

Currently I am running tail -F /var/log/syslog in a terminal in the hopes of either seeing something relevant frozen on screen, or in the surrounding output when I reboot the machine since I will then know which line of output I should begin looking in the vicinity of.
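
A rough sketch of the "look near the last line before the reboot" idea, in Python. It assumes the syslog file also receives kernel messages, so that the kernel's "Linux version" boot banner marks where each new boot starts. The function name and the regex are illustrative only, and a hard freeze may well leave nothing useful in the log at all.

```python
import re
from collections import deque

# Assumption: the kernel boot banner ("Linux version ...") is forwarded to the
# syslog file, so every occurrence marks the start of a new boot.
BOOT_BANNER = re.compile(r"kernel:.*Linux version ")


def lines_before_each_boot(lines, context=5):
    """For every boot banner found, return the `context` lines logged just before it."""
    recent = deque(maxlen=context)  # sliding window of the most recent lines
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if BOOT_BANNER.search(line):
            # Whatever is in the window was written right before this boot.
            results.append(list(recent))
            recent.clear()
        recent.append(line)
    return results
```

Passing an open file object for /var/log/syslog as `lines` yields one list per boot found in the file; the last entries of each list are the final messages written before the machine went down.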

But yeah, a memory test is the next thing I plan to do after that.

bilange · on Oct 26, 2021

I *had* to get myself out of lurking mode to reply specifically to you; this issue seems widespread for 1st-gen Ryzen. I see your chipset is also close enough to mine (X370), and I felt a strong "déjà vu" reading your freezing symptoms.

I reused my now old X370 Ryzen build to run TrueNAS Scale (based on Debian), and I get hard lockups like yours.

The tweaks in my personal notes on the subject seem to stabilize things a bit but not completely; it's a mixture of BIOS settings tweaks and kernel boot parameters that seems to help partially. Things I tried/applied with varying degrees of success:

- Disabling Cool&Quiet

- Disabling C-States

- Gear Down Mode: Disabled

- Power Down Mode: Disabled

- VDDSCR_SOC: Offset by +0.00625 V (seemed to stabilize things on Windows)

- Someone in the kernel bug report mentioned the need to power off (as opposed to just reset) so that all the BIOS settings are applied correctly (I didn't try it myself yet)

See these links for more info:

- https://bugzilla.kernel.org/show_bug.cgi?id=196683 (a very long bug report thread; people describe many things they tried to stabilize their builds, along with kernel parameter ideas)

- https://gist.github.com/diracs-delta/876d74d030f80dc899fc58a...

- https://web.archive.org/web/20201020144021/https://www.truen... (linked from Archive.org as TrueNAS WAS specifically mentioning Ryzen stability in the first paragraphs of this page)

Good luck; and if you ever find out how to get rid of those freezes completely, let me know :)

(edit: formatting)

codetrotter · on Oct 26, 2021

Thank you for emerging from lurk mode for me :) I will try those things.

codetrotter · on Oct 26, 2021

I found some of the settings but not all of them. Here are the ones I found and have now changed:

- Global C-state Control: Auto -> Disabled

- AMD Cool’n’Quiet: Auto -> Disabled

- CPU NB/SoC Voltage: Auto -> Offset Mode; CPU NB/SoC Offset Mode Mark: +; CPU NB/SoC Offset Voltage: Auto

(The offset value can only be auto with my machine, not a custom value it seems.)

The only ones I couldn’t find were Gear Down Mode and Power Down Mode.

Clicked save and exit in the UEFI.

Then I powered down the machine and even flipped the on-off button of the PSU to off and let it stay off for ~20 seconds for good measure. Then turned the PSU back on and then powered the machine back on.

Currently reading the bug report thread and will try some of those things as well.

codetrotter · on Oct 26, 2021

Now I've read a bit of those links and also read a bit of the following other links:

- https://utcc.utoronto.ca/~cks/space/blog/linux/KernelRcuNocb...

- https://access.redhat.com/documentation/en-us/red_hat_enterp...

- https://help.ubuntu.com/community/Grub2/Setup

And I've changed the following line in my /etc/default/grub from:

    GRUB_CMDLINE_LINUX=""

to

    GRUB_CMDLINE_LINUX="rcu_nocbs=0-15 processor.max_cstate=5"

since my CPU has 16 threads. And I've saved it and have run

    sudo update-grub
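
The `0-15` range above follows from the logical CPU count (16 threads means CPUs 0 through 15). A minimal Python sketch of that arithmetic, for anyone with a different CPU; the helper names are made up for illustration, and `processor.max_cstate=5` is simply the value used in this comment, not a recommendation.

```python
import os


def rcu_nocbs_param(logical_cpus):
    """Build an rcu_nocbs= value covering CPUs 0 .. logical_cpus - 1."""
    if logical_cpus < 1:
        raise ValueError("need at least one logical CPU")
    if logical_cpus == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{logical_cpus - 1}"


def suggested_cmdline(logical_cpus=None, max_cstate=5):
    """Combine both parameters into the string that goes in GRUB_CMDLINE_LINUX."""
    # os.cpu_count() reports logical CPUs (threads) and may return None.
    n = logical_cpus if logical_cpus is not None else (os.cpu_count() or 1)
    return f"{rcu_nocbs_param(n)} processor.max_cstate={max_cstate}"
```

Calling `suggested_cmdline(16)` reproduces the string used in the GRUB line above.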

Now I'm about to reboot the computer and then hopefully it will be more stable from now on :)

Thanks again for the help bilange.

codetrotter · on Oct 26, 2021

Having now turned the computer back on I've also confirmed that these flags are indeed now being passed to the kernel when it is booted, as seen in the output of

    cat /proc/cmdline

which shows the following:

    BOOT_IMAGE=/boot/vmlinuz-5.11.0-38-generic root=UUID=4dcba509-efff-4ccc-a099-f919240c767c ro rcu_nocbs=0-15 processor.max_cstate=5 quiet splash vt.handoff=7

And that's the "rcu_nocbs=0-15 processor.max_cstate=5" we added to our GRUB2 config shown right inside of there.
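
The same check can be scripted instead of eyeballed. This is only a sketch in Python with hypothetical helper names: it splits the command line on whitespace and reports which expected parameters are absent.

```python
def missing_kernel_params(cmdline, expected):
    """Return the entries of `expected` that are not whole tokens in `cmdline`."""
    present = set(cmdline.split())
    return [param for param in expected if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

For example, `missing_kernel_params(read_cmdline(), ["rcu_nocbs=0-15", "processor.max_cstate=5"])` returns an empty list when both flags made it through GRUB.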

djokkataja · on Oct 26, 2021

> It just takes a long time to run a memory test on 32 GB of RAM

Maybe run memtest while you sleep? :)

> Or maybe it’s a thermal issue.

I'm no expert on thermal issues by any means, but I did have a laptop that overheated frequently in the past. It was very easy to identify that it was a heat problem, because the fans would start screaming and then it would reliably overheat and power itself off anytime I tried to do anything CPU-intensive with it. Sticking it in a freezer "solved" the problem.

Anyway, given that your freezes are completely random, it doesn't sound like a thermal issue?

codetrotter · on Oct 26, 2021

> Maybe run memtest while you sleep? :)

Yup, did that tonight. The test is finished now and it found zero errors. The test took 7 hours and 56 minutes to run. I used a bootable USB stick with MemTest86 v9.3 Free to test it.

d_tr · on Oct 26, 2021

Some ideas just in case...

- Even if the RAM modules are not faulty, ECC is not officially supported by your CPU and motherboard, right? So this could be a problem... They might also not be correctly configured, so check the UEFI settings.

- Regarding the GPU, you could try lowering your PCIe mode to a slower one, especially if you are using a riser cable.

- I have had overly bent SATA cables cause complete freezes every few hours on a brand new PC. I luckily got the hint quickly because the HDD LED was staying lit.

- What about the quality of the power going into your PSU?

- Have you tried rebooting more gracefully with the SysRq commands?
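
On the SysRq point: on Linux, which SysRq functions are allowed is controlled by the bitmask in /proc/sys/kernel/sysrq, so it is worth checking that before relying on Alt+SysRq+S (sync), U (remount read-only), B (reboot). Below is a small Python sketch that decodes that bitmask; the bit meanings are taken from the kernel's sysrq documentation, the function name is made up, and if the machine is frozen hard enough to stop answering ping, the keys may not be processed at all.

```python
# Bit meanings for /proc/sys/kernel/sysrq, per the kernel's sysrq documentation.
SYSRQ_BITS = {
    2: "console log level",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps",
    16: "sync",
    32: "remount read-only",
    64: "signal processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nice all RT tasks",
}


def decode_sysrq(value):
    """Translate the sysrq sysctl value into the list of enabled functions."""
    if value == 0:  # 0 disables SysRq completely
        return []
    if value == 1:  # 1 enables every function
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]
```

A value such as 176 (128 + 32 + 16) decodes to sync, remount read-only and reboot/poweroff, which is enough for the S, U, B part of the sequence.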

codetrotter · on Oct 26, 2021

> Even if the RAM modules are not faulty, ECC is not officially supported by your CPU and motherboard, right?

Correct.

> So this could be a problem... They might also not be correctly configured, so check the UEFI settings.

Most of the settings in UEFI that I can find about RAM are related to overclocking, and I am not using those features.

But it may well be as you say that using ECC RAM with a non-ECC motherboard could cause problems.

> I have had overly bent SATA cables cause complete freezes every few hours on a brand new PC. I luckily got the hint quickly because the HDD LED was staying lit.

My SATA cables are cheap ones I bought off of eBay. I may try and replace them.

> What about the quality of the power going into your PSU?

The freezing has happened in all of the different places I have lived during this time, so I don’t think so. Unless the power cord could be doing something, but the power cord looks fine on the outside at least.

> Have you tried rebooting more gracefully with the SysRq commands?

Going to try that in the future.

Thank you.

codetrotter · on Oct 26, 2021

> > Even if the RAM modules are not faulty, ECC is not officially supported by your CPU and motherboard, right?

> Correct.

I should also add for the record that my mistake here was that when I bought the system, I checked that the Ryzen 7 1700 supports ECC RAM, but I was unaware that the motherboard also needed to support it in order for ECC to work.
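
One way to see whether ECC is actually active under Linux is to look at the kernel's EDAC entries in sysfs. The Python sketch below assumes an EDAC driver is loaded and exposes memory controllers under /sys/devices/system/edac/mc; an empty result only suggests, and does not prove, that ECC reporting is not working. The function name is hypothetical.

```python
from pathlib import Path


def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Return corrected/uncorrected error counts per EDAC memory controller."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC directory: driver not loaded or ECC not being reported.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts
```

If this returns something like a single `mc0` entry with zero counts, ECC reporting is at least wired up; non-zero corrected counts would point back at the RAM.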

Someone in a sibling comment pointed me to some links and gave me some info about UEFI settings to try along with some kernel boot arguments, and I've applied these. Hopefully it will be enough to take care of the issue, or at least to improve the stability a bit. But if the system continues to be unstable then I will probably try to save up a bit of money to replace the RAM that I have with non-ECC RAM, or alternatively save up for a new motherboard which supports both ECC RAM and my current CPU and can also host a newer generation of Ryzen CPU that I can buy later on.

Also, I went to an electronics store today after reading your comment and bought a new SATA cable that I now use instead of the one that I had.