High System Load with Low CPU Utilization on Linux? (2020) (tanelpoder.com)
90 points by tanelpoder on Sept 23, 2022 | 12 comments



Great write-up about the troubleshooting process!

Regarding the exact case, there is a slightly deeper issue. XFS enqueues inode changes to the journal buffers twice: the mtime change is scheduled prior to the actual data being written, and the inode with the updated file size is placed in the journal buffers just after. If the drive is overloaded, the relatively tiny (just a few megs) journal buffers may overflow with mtime changes, and the file system becomes pathologically synchronous. However, since 4.1something, XFS supports the `lazytime` mounting option that delays the mtime updates until a more substantial change is written. Without it, the journal queue fills up at roughly the speed of your write() calls; with it, at the pace of the actual data hitting the disk, so even in highly congested conditions your application can write asynchronously -- that is, until dirty_ratio stops your system dead in its tracks.
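
If anyone wants to check whether this is actually in effect on their box, here's a minimal Python sketch that looks for lazytime in the mount options, assuming the option shows up in /proc/self/mounts when it is active:

  # Minimal sketch: check whether a mount point carries the lazytime option
  # by parsing /proc/self/mounts (fields: device, mountpoint, fstype, options, ...).
  def has_lazytime(mountpoint="/"):
      with open("/proc/self/mounts") as f:
          for line in f:
              dev, mnt, fstype, opts = line.split()[:4]
              if mnt == mountpoint:
                  return "lazytime" in opts.split(",")
      return False

  print(has_lazytime("/"))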


Author of the blog post here. Thanks for the feedback and the extra XFS details (I'm not a big XFS expert). Is the XFS "lazytime" the same thing as the "relatime" mount option?


No, relatime applies to how often atime is updated, while lazytime controls how often all three file timestamps (atime, mtime and ctime) are written out to disk. They are orthogonal: you can have strictatime+lazytime to get accurate atime tracking that generates no disk IO on reads. The downside is, of course, that if your system crashes, the non-persisted atimes will be unreliable.
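
One nuance worth noting: lazytime only defers the inode writeback, the in-memory timestamps stay accurate, so a plain stat() still shows current values. A minimal Python sketch (the file name is just a placeholder):

  # Print the three timestamps for a file. With lazytime, stat() still returns
  # up-to-date in-memory values; only writing the inode back to disk is deferred.
  import os, time

  st = os.stat("somefile")
  for name, ts in (("atime", st.st_atime), ("mtime", st.st_mtime), ("ctime", st.st_ctime)):
      print(name, time.ctime(ts))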


A bit off-topic, but I'm amazed at how well Linux handles large numbers of threads. I have a program that runs bots against each other playing the game of go, where each go-bot plays every other go-bot one hundred times, which is highly CPU-intensive. I launch one go-bot at a time, which battles every other go-bot it hasn't already played against, often with close to 900 threads running go-bot battles at once. I have six cores with two hyper-threads each, and all of them are pegged at 100% usage for hours on end by this program. The processes run at normal priority, i.e. not changed via "nice".

When I first tried this, I was prepared to hard-boot, since I was almost sure it would make my desktop unusable, but it didn't. I can even play some fairly CPU- and GPU-intensive games without too many hiccups while this is going on. If I weren't paying attention, I probably wouldn't know the battles were running.


> The main point of this article was to demonstrate that high system load on Linux doesn’t come only from CPU demand, but also from disk I/O demand

Great article, but this summary was no surprise to me. I've only ever seen high load from disk I/O. When I first clicked the link, I thought to myself, "well, it's disk I/O, but let's see how we get to the punchline."


The main reason for mentioning the disk I/O demand was that all the other (classic) Unixes only include CPU demand as part of system load, not I/O. This will confuse sysadmins coming from a Solaris/HP-UX/AIX/BSD background.

The other reason is that I've troubleshot plenty of Linux load spike problems that are about CPU demand spikes only, usually due to some spinlock that gets held unusually long, some interrupt storm issue, or some sort of "database logon storm" where connection pools in the app server suddenly create thousands of additional DB connections...


IO wait is counted against load by Linux. So high IO pressure, as reported in `/proc/pressure`, will cause high load. When you consider reclaim (dropped file pages that then need IO to be read back on demand), it makes sense: high IO wait will delay reads (and writes), slowing everything down.
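
Both are trivial to peek at, e.g. with a quick Python sketch (assumes a kernel with PSI enabled):

  # Read IO pressure and the load average side by side.
  # /proc/pressure/io has lines like:
  #   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
  #   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
  with open("/proc/pressure/io") as f:
      print(f.read().strip())
  with open("/proc/loadavg") as f:
      print("loadavg:", f.read().strip())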

I'm liking this project https://github.com/facebookincubator/below

It's packaged in Fedora.


How does pSnapper translate [kworker/6:99], [kworker/6:98], etc. into the pattern (kworker/*:*)? I would like similar functionality for log analysis.

Edit: Never mind. I skipped over the key paragraph here:

> By default, pSnapper replaces any digits in the task’s comm field before aggregating (the comm2 field would leave them intact). Now it’s easy to see that our extreme system load spike was caused by a large number of kworker kernel threads (with “root” as process owner). So this is not about some userland daemon running under root, but a kernel problem.
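
For the log-analysis use case, the digit replacement itself is a one-liner. A minimal Python sketch of the same idea (not pSnapper's actual code):

  # Strip digits from task names before aggregating, so kworker/6:99 and
  # kworker/6:98 collapse into the same bucket.
  import re
  from collections import Counter

  names = ["kworker/6:99", "kworker/6:98", "kworker/2:1", "sshd"]
  counts = Counter(re.sub(r"\d+", "*", n) for n in names)
  print(counts)  # Counter({'kworker/*:*': 3, 'sshd': 1})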


My old laptop suddenly became very slow many years ago. top told me it was 99% in wait status. It was the hard disk failing; I think it was trying to read some bad sector again and again. Shut down, bought a new HDD, restored from backup, solved. BTW, that new HDD failed years later with a bad clack-clack-clack noise. Backups saved me again.


I thought PSI (Pressure Stall Information) was the modern way?

https://news.ycombinator.com/item?id=17580620


PSI tells you whether you have a problem or not; it's up to you to use the techniques described here to find the cause.


Yep, I once tested (and verified) that PSI doesn't account for asynchronous I/O completion waits (io_getevents), just like Linux system load doesn't. So both of them miss the async I/O completion waits, as a thread waiting for async I/O completion is in S state (not D). Once the problem is so big that the block device I/O queues fill up and even io_submit() calls hang, those threads sleep in D state, so they contribute both to system load and PSI.

Edit: I understand that PSI was created as an easy (and cheap) metric to query, to see whether your Android mobile device has a problem (and some apps need to be killed), and not really for a full-blown troubleshooting drilldown of server workloads.
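
If you want to eyeball the D-state threads without extra tools, here's a rough Python sketch of the /proc sampling idea (not psn's actual implementation):

  # Count threads currently in D (uninterruptible sleep), i.e. the ones that
  # contribute to both Linux load and PSI. Threads sleeping in S state (e.g.
  # waiting in io_getevents) won't show up here, which is the point above.
  import glob

  d_tasks = []
  for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
      try:
          data = open(path).read()
      except OSError:
          continue  # thread exited while we were scanning
      comm = data[data.index("(") + 1:data.rindex(")")]
      state = data[data.rindex(")") + 2]
      if state == "D":
          d_tasks.append(comm)
  print(len(d_tasks), "threads in D state:", d_tasks[:10])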



