> Everything will probably be just fine if you kill-9 something. If a program fails permanently and dramatically when it's kill-9'd, you should "remove it from your filesystem", because it can't handle other, unavoidable abrupt failures either.
There is a lot of middle ground between failing "permanently and dramatically" and shutting down cleanly.
What about a program that can recover after an abrupt termination, but only after a time-consuming recovery from a journal? What about a program that can recover after an abrupt termination, but only after you manually remove a lock file? These are cases where it's not "just fine" to use SIGKILL, but not so bad as to warrant removing the program.
Generally your life will be easier if you don't take the sledgehammer approach to killing processes, and at least try a non-SIGKILL first.
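That "try the gentle signal first" approach can be sketched in a few lines of Python; the grace period and polling interval here are illustrative choices, not anything prescribed:

```python
import os
import signal
import time

def terminate(pid: int, grace_seconds: float = 10.0) -> None:
    """Ask politely with SIGTERM; escalate to SIGKILL only if ignored."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)   # signal 0 checks existence, delivers nothing
        except ProcessLookupError:
            return            # process exited on its own terms
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)  # the sledgehammer, as a last resort
```

One caveat: if `pid` is a child of the calling process, the existence check keeps succeeding on the zombie until someone calls `waitpid`, so the loop would run out the full grace period; for your own children, poll with `os.waitpid(pid, os.WNOHANG)` instead.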
Your reply is the real answer to the linked question. SIGTERM serves a real purpose, and almost always it works to end the process cleanly and promptly.
If a process doesn't respond reasonably to SIGTERM, then you should consider removing it from your filesystem.
"If a process doesn't respond reasonably to SIGTERM, then you should consider removing it from your filesystem."
That's an overreaction. I've seen programs fail to respond even to SIGKILL for reasons outside their control, like being stuck in an uninterruptible operation on NFS or some other "unusual" filesystem, where the kernel can't act on the SIGKILL. "Listing a directory" is not exactly a crazy thing to do.
And no wiggling out by saying "well, that's the kernel, not the process", the process never gets a chance to "handle" SIGKILL, so it can't really screw it up, either. Arguably, all such failures are the kernel; the kernel should not expose any sequence of calls that causes SIGKILL to fail, and over time they tend to be fixed (I haven't seen this on my modern Linux machine in a long time, even playing with some funny stuff), but it has happened and will probably continue to happen as new stuff comes out.
>And no wiggling out by saying "well, that's the kernel, not the process", the process never gets a chance to "handle" SIGKILL, so it can't really screw it up, either.
Technically, you are correct: the process doesn't even know it's been SIGKILL'd. However, there are other things it could do to gracefully handle the scenario on its next start.
Is there an already-existing lock file? Prompt for its removal.
An already-existing PID file? Again, ask what to do. Include tools to fix records that may have been left in an inconsistent state.
So on and so forth.
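A startup check along those lines might look like the following sketch (the pid-file path and messages are made up for illustration); signal 0 is the standard trick for probing whether a pid is still alive:

```python
import os
import sys

PIDFILE = "/tmp/myapp.pid"  # hypothetical location for this example

def pid_is_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)   # signal 0: existence check, nothing is delivered
    except ProcessLookupError:
        return False      # no such process
    except PermissionError:
        return True       # exists, but owned by another user
    return True

def acquire_pidfile() -> None:
    if os.path.exists(PIDFILE):
        text = open(PIDFILE).read().strip()
        stale_pid = int(text) if text.isdigit() else None
        if stale_pid is not None and pid_is_alive(stale_pid):
            sys.exit(f"another instance is already running (pid {stale_pid})")
        # The previous instance was SIGKILL'd, OOM-killed, or lost power:
        # clean up its leftovers instead of refusing to start.
        print(f"removing stale pid file (pid {stale_pid} is gone)")
        os.remove(PIDFILE)
    with open(PIDFILE, "w") as f:
        f.write(str(os.getpid()))
```

A real daemon would also want to guard against pid reuse and take the file with an atomic create, but the shape of the recovery logic is the same.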
That said, I've never had to do a SIGKILL. However, I still expect my programs to sanely recover from power outages, clumsy interns, RAID failures, and other "acts of god" that may suddenly cause a program to end before it has a chance to clean up after itself. It's part of making robust programs.
It's even worse than that -- kill is just a command that sends signals to a process, some of which happen to (gracefully or not) stop it. You can even define your own handling for some of them!
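For example, SIGUSR1 and SIGUSR2 are reserved for whatever meaning an application wants to give them; a common convention (assumed here, not mandated by anything) is "reload your config" rather than "stop":

```python
import os
import signal

reload_count = 0

def on_usr1(signum, frame):
    # Instead of exiting, re-read configuration (here just a counter).
    global reload_count
    reload_count += 1

signal.signal(signal.SIGUSR1, on_usr1)

# `kill -USR1 <pid>` from a shell would do the same thing --
# a "kill" that kills nothing.
os.kill(os.getpid(), signal.SIGUSR1)
```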
> If a process doesn't respond reasonably to SIGTERM, then you should consider removing it from your filesystem.
What's "reasonable"? MySQL/InnoDB will start consolidating the journal and flushing the buffer pool in preparation for a shutdown. On machines with a large amount of RAM allocated to the buffer pool, that will take quite some time. Is that still reasonable?
Absolutely. Those are healthy parts of a clean shutdown, which is what SIGTERM was really asking for, isn't it?
InnoDB is designed to recover gracefully from a SIGKILL, too. The journal task is "saved" for later. That's why they tell you in DBA school, "don't use MyISAM tables. They're not safe." Because if they received a SIGKILL or power outage at the wrong moment, writes could have been lost. Amiright?
What you don't expect is for a process that receives SIGTERM to fail hard, and require an expensive journal recovery or suffer some unrecoverable data loss as a result.
Well, actually: recovering from a SIGKILL will (or at least used to) take much longer at startup than a clean shutdown on SIGTERM takes, so shutting down properly handles that better. I still agree with you that it's reasonable, but others may hold a different opinion.
Heh - you're saying that the SIGKILL recovery on start is cheaper than the SIGTERM cleanup on quit. OK. I'll agree, that's strange. One would probably have expected to have to dot and cross the same number of I's and T's either way, and not for SIGKILL recovery to actually be faster.
This paper measures the time difference between clean and unclean reboot across various systems. Another important point is that many servers never shut down intentionally, making unclean shutdown the norm in lots of deployments.
Thanks for the hard numbers. That also makes sense.
It should be easy to rationalize that SIGKILL recovery is slower than SIGTERM shutdown for such a database. If your in-memory cache is empty, it won't be providing any speedups, right? Thus you'll need to go back to disk for everything.
No, actually I wanted to say the opposite: Killing the process with SIGKILL and then doing recovery at startup is much more expensive than letting the process shut down properly with SIGTERM.
> If a process doesn't respond reasonably to SIGTERM, then you should consider removing it from your filesystem.
I currently have a vim process that is running in the background and not responding to SIGTERM (so I'll probably SIGKILL it). Does this mean I need to find a new editor?
I was also thinking that criterion might prune the filesystem a bit too much. Alternatively, if you're a vim hater: "Another good reason to get rid of vim!"
Vim responds to SIGTERM while in the foreground, but if I suspend it (e.g. C-z), it does not respond. If I send a SIGTERM to a suspended Vim process, it does respond immediately after I foreground the process.
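That matches how job control works: a stopped process can't run its signal handlers, so a caught SIGTERM stays pending until the process is continued. (A SIGTERM left at its default disposition can kill even a stopped process on Linux, which is why this sketch uses a child that installs a handler, like vim does.)

```python
import os
import signal
import subprocess
import sys
import time

# Child that, like vim, catches SIGTERM instead of dying by default.
child_code = r"""
import signal, sys, time
signal.signal(signal.SIGTERM, lambda s, f: sys.exit(3))
print("ready", flush=True)
while True:
    time.sleep(1)
"""

child = subprocess.Popen([sys.executable, "-c", child_code],
                         stdout=subprocess.PIPE)
child.stdout.readline()              # wait until the handler is installed

os.kill(child.pid, signal.SIGSTOP)   # roughly what C-z does
os.kill(child.pid, signal.SIGTERM)   # handler can't run while stopped
time.sleep(0.5)
assert child.poll() is None          # still alive; SIGTERM merely pending

os.kill(child.pid, signal.SIGCONT)   # roughly what `fg` does
assert child.wait(timeout=5) == 3    # pending SIGTERM finally handled
```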
It might not be that simple, as ops folks often have to support specific versions of stuff on apps that use db-specific features. In that case, it might be worth a prof services contract with a db shop.
(I've supported clustered Oracle on AWS handling massive numbers of micropayment transactions and seen weird shit in prod where it "kind-of failed" according to our ops dba. Classes of survivable bugs range from "apply a vendor hotfix" to edge cases not worth the downtime.)