Hanging the Linux core dump pipe helper (rachelbythebay.com)
81 points by weinzierl on April 30, 2018 | 13 comments



In re the 'and nothing improves' sentiment at the end: there _is_ a safety limit, and it's even documented in core(5):

/proc/sys/kernel/core_pipe_limit

When collecting core dumps via a pipe to a user-space program, it can be useful for the collecting program to gather data about the crashing process from that process's /proc/[pid] directory. In order to do this safely, the kernel must wait for the program collecting the core dump to exit, so as not to remove the crashing process's /proc/[pid] files prematurely. This in turn creates the possibility that a misbehaving collecting program can block the reaping of a crashed process by simply never exiting.

Since Linux 2.6.32, the /proc/sys/kernel/core_pipe_limit file can be used to defend against this possibility. The value in this file defines how many concurrent crashing processes may be piped to user-space programs in parallel. If this value is exceeded, then those crashing processes above this value are noted in the kernel log and their core dumps are skipped.

A value of 0 in this file is special. It indicates that unlimited processes may be captured in parallel, but that no waiting will take place (i.e., the collecting program is not guaranteed access to /proc/<crashing-PID>). The default value for this file is 0.
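
For illustration, here's a minimal sketch of setting that limit from C; the value 4 is an arbitrary example, and writing the file needs root (a shell "echo 4 > /proc/sys/kernel/core_pipe_limit" does the same):

    #include <stdio.h>

    int main(void) {
        /* Allow at most 4 crashing processes to be piped to user-space
           helpers at once; crashes beyond that are noted in the kernel
           log and their dumps skipped. Needs root. */
        FILE *f = fopen("/proc/sys/kernel/core_pipe_limit", "w");
        if (!f) {
            perror("core_pipe_limit");
            return 1;
        }
        fprintf(f, "4\n");
        return fclose(f) == 0 ? 0 : 1;
    }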

(edit: in case that came off as curt, i didn't intend it to; i do agree that far too many things suck because "it's documented in a defunct mailing list somewhere and if you don't know it's your own fault")


This is also documented in "Documentation/sysctl/kernel.txt" in the kernel tree. Looking through the history, it seems it was added at the same time as the feature.

Situations where people complain about "insufficient documentation" while not bothering to read it are very common and somewhat demoralizing for the people who take the time to write such explanations.


I find, more often than not, that the kernel people are reasonably good with documentation. User space, on the other hand, seems to treat documentation either as a chore best avoided or as something to bludgeon people with when an API is mangled between releases.


Kernel space has the benefit of a very clear public/private boundary. The kernel internals have much worse docs than user space once you get over that wall.


"no waiting will take place" confused me for a moment.

I guess what they meant is that the kernel doesn't wait for the crash handler to exit, but of course it does wait for it to read the crash dump from stdin.
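
To make that concrete, here's a hypothetical bare-bones helper (the path and core_pattern line are invented for illustration): the kernel streams the core image over the helper's stdin, and it's the read loop that the kernel-side write blocks on, not the helper's exit (unless core_pipe_limit is nonzero).

    /* Registered with something like (hypothetical path):
       echo "|/usr/local/bin/coredrop %p" > /proc/sys/kernel/core_pattern */
    #include <unistd.h>

    int main(void) {
        char buf[1 << 16];
        /* Drain the core image from stdin; a real helper would save it.
           If we never read, the kernel's pipe write stalls instead. */
        while (read(0, buf, sizeof buf) > 0)
            ;
        return 0;
    }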


Certain types of script really really really should always start with

    alarm 10
or similar. This has saved me from myself way too many times.


What does that do?


It's a way to send a SIGALRM signal to your running process [1] using the alarm() syscall. You can also catch the signal [2] and act on it. For example, it's used by "timeout" [3].

[1] https://dokk.org/manpages/alarm.2#DESCRIPTION [2] https://dokk.org/manpages/sigaction.2#DESCRIPTION [3] https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/tim...
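
A minimal C sketch of the catch-it-yourself variant (the 10-second budget and the pause() standing in for real work are just placeholders):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t timed_out;

    static void on_alarm(int signo) {
        (void)signo;
        timed_out = 1;
    }

    int main(void) {
        struct sigaction sa = { .sa_handler = on_alarm };
        sigaction(SIGALRM, &sa, NULL); /* no SA_RESTART: blocking calls return EINTR */

        alarm(10);  /* ask for SIGALRM in 10 seconds */
        pause();    /* stands in for work that might hang forever */

        if (timed_out)
            fprintf(stderr, "gave up after 10 seconds\n");
        return timed_out;
    }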


Probably sends a SIGALRM signal in 10 seconds.

https://linux.die.net/man/2/alarm


Which, if you haven't taken precautions to make it otherwise, will shoot your process in the head.

"Dear operating system, please shoot me in the head in ten seconds if I haven't already finished" is a really useful thing to ask for sometimes.


Can't you just make core files land on a FUSE-mounted filesystem, and from there decide if you want to keep the file?


That only moves the point of failure from one userspace process to another. The FUSE handler can hang just like the core dump pipe helper.


Presumably you'd then just have to worry about memory consumption. You could send them to /dev/shm as well, but if there isn't anywhere to put them, there will still be problems.
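
For what it's worth, pointing dumps at tmpfs is just a core_pattern change; a sketch (the pattern itself is an arbitrary example, and this needs root):

    #include <stdio.h>

    int main(void) {
        /* Write dumps straight to tmpfs instead of piping to a helper.
           %e = executable name, %p = PID. Equivalent to:
           echo '/dev/shm/core.%e.%p' > /proc/sys/kernel/core_pattern */
        FILE *f = fopen("/proc/sys/kernel/core_pattern", "w");
        if (!f) {
            perror("core_pattern");
            return 1;
        }
        fprintf(f, "/dev/shm/core.%%e.%%p\n");
        return fclose(f) == 0 ? 0 : 1;
    }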



