Everything will probably be just fine if you kill-9 something. If a program fails permanently and dramatically when it's kill-9'd, you should "remove it from your filesystem", because it can't handle other, unavoidable abrupt failures either.
> Everything will probably be just fine if you kill-9 something. If a program fails permanently and dramatically when it's kill-9'd, you should "remove it from your filesystem", because it can't handle other, unavoidable abrupt failures either.
There is a lot of middle ground between failing "permanently and dramatically" and shutting down cleanly.
What about a program that can recover after an abrupt termination, but only after a time-consuming recovery from a journal? What about a program that can recover after an abrupt termination, but only after you manually remove a lock file? These are cases where it's not "just fine" to use SIGKILL, but not so bad as to warrant removing the program.
Generally your life will be easier if you don't take the sledgehammer approach to killing processes, and at least try a non-SIGKILL first.
Your reply is the real answer to the linked question. SIGTERM serves a real purpose, and almost always it works to end the process cleanly and promptly.
If a process doesn't respond reasonably to SIGTERM, then you should consider removing it from your filesystem.
"If a process doesn't respond reasonably to SIGTERM, then you should consider removing it from your filesystem."
That's an overreaction. I've seen programs fail to die on SIGKILL for reasons out of their control, like getting stuck in an NFS transaction or some other "unusual" filesystem operation, leaving the kernel unable to deliver the SIGKILL. "Listing a directory" is not exactly a crazy thing to do.
And no wiggling out by saying "well, that's the kernel, not the process": the process never gets a chance to "handle" SIGKILL, so it can't really screw it up, either. Arguably, all such failures are the kernel's fault; the kernel should not expose any sequence of calls that causes SIGKILL to fail, and over time such bugs tend to be fixed (I haven't seen this on my modern Linux machine in a long time, even playing with some funny stuff), but it has happened and will probably continue to happen as new stuff comes out.
>And no wiggling out by saying "well, that's the kernel, not the process", the process never gets a chance to "handle" SIGKILL, so it can't really screw it up, either.
Technically, you are correct: the process doesn't even know it's been SIGKILL'd. However, there are other things it could do to gracefully handle the scenario upon the next start.
Is there an already-existing lock file? Prompt for its removal.
An already existing PID file? Again, ask what to do. Include tools to fix records that may have been left in an inconsistent state.
So on and so forth. For instance, something like the sketch below.
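A rough sketch of that kind of startup check in shell (the lock file path and messages are made up, and a real tool might prompt for removal instead of deleting the stale file automatically):

    #!/bin/sh
    # On startup, deal with a PID/lock file a previous instance may have left behind.
    LOCKFILE=/var/run/myapp.pid        # hypothetical path
    if [ -e "$LOCKFILE" ]; then
        oldpid=$(cat "$LOCKFILE")
        if kill -0 "$oldpid" 2>/dev/null; then
            echo "another instance (pid $oldpid) seems to be running, refusing to start" >&2
            exit 1
        fi
        echo "removing stale lock file left behind by pid $oldpid"
        rm -f "$LOCKFILE"
    fi
    echo $$ > "$LOCKFILE"
    # ... normal startup continues here ...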
That said, I've never had to do a SIGKILL. However, I still expect my programs to sanely recover from power outages, clumsy interns, RAID failures, and other "acts of god" that may suddenly cause a program to end before it has a chance to clean up after itself. It's part of making robust programs.
It's even worse than that -- kill is just a command that sends signals to a program, some of which happen to (gracefully or not) stop the program. You can even define your own!
> If a process doesn't respond reasonably to SIGTERM, then you should consider removing it from your filesystem.
What counts as "reasonably"? MySQL/InnoDB will start consolidating the journal and buffer pool in preparation for a shutdown. On machines with a large amount of RAM allocated to the buffer pool that will take quite some time. Is that still reasonable?
Absolutely. Those are healthy parts of a clean shutdown, which is what SIGTERM was really asking for, isn't it?
InnoDB is designed to recover gracefully from a SIGKILL, too. The journal task is "saved" for later. That's why they tell you in DBA school, "don't use MyISAM tables. They're not safe." Because if they received a SIGKILL or power outage at the wrong moment, writes could have been lost. Amiright?
What you don't expect is for a process that receives SIGTERM to fail hard, and require an expensive journal recovery or suffer some unrecoverable data loss as a result.
Well, actually: Recovering from a SIGKILL will (or at least used to) take much longer at startup than shutting down with SIGTERM, so it could handle that better. I still agree with you that it's reasonable, but others may hold a different opinion.
Heh - you're saying that the SIGKILL recovery on start is cheaper than the SIGTERM cleanup on quit. OK. I'll agree, that's strange. One would probably have expected to have to dot and cross the same number of I's and T's either way, and not for SIGKILL recovery to actually be faster.
This paper measures the time difference between clean and unclean reboot across various systems. Another important point is that many servers never shut down intentionally, making unclean shutdown the norm in lots of deployments.
Thanks for the hard numbers. That also makes sense.
It should be easy to rationalize that SIGKILL recovery is slower than SIGTERM shutdown for such a database. If your in-memory cache is empty, it won't be providing any speedups, right? Thus you'll need to go back to disk for everything.
No, actually I wanted to say the opposite: Killing the process with SIGKILL and then doing recovery at startup is much more expensive than letting the process shut down properly with SIGTERM.
> If a process doesn't respond reasonably to SIGTERM, then you should consider removing it from your filesystem.
I currently have a vim process that is running in the background and not responding to SIGTERM (so I'll probably SIGKILL it). Does this mean I need to find a new editor?
I was also thinking that criteria might prune the filesystem a bit too much. Alternately, if you're a vim hater, "Another good reason to get rid of vim!"
Vim responds to SIGTERM while in the foreground, but if I suspend it (e.g. C-z), it does not respond. If I send a SIGTERM to a suspended Vim process, it does respond immediately after I foreground the process.
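You can reproduce it with something along these lines (the %1 job spec assumes this is the only background job in your shell):

    vim somefile          # then press Ctrl-Z to suspend it
    kill -TERM %1         # vim is stopped, so nothing visible happens yet
    fg                    # the pending SIGTERM is handled as soon as vim resumes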
It might not be that simple, as ops folks often have to support specific versions of stuff on apps that use db-specific features. In that case, it might be worth a professional services contract with a db shop.
(I've supported clustered Oracle on AWS handling massive numbers of micropayment transactions and seen weird shit in prod where it "kind-of failed" according to our ops dba. Classes of survivable bugs range from "apply a vendor hotfix" to edge cases not worth the downtime.)
I agree for the most part. There are a few exceptions, however, and no-one seems to have listed them. Most notably, when a process is 'kill -9'ed, some shared system resources stay open, named semaphores in particular. This will sometimes cause the process to fail when it is restarted. You can use ipcs and ipcrm to remove those.
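For example (ipcs/ipcrm are the standard tools for this; the semaphore id below is made up):

    ipcs -s               # list System V semaphores, including ones the dead process left behind
    ipcrm -s 98304        # remove a stale one by its id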
One thing I always liked about couchdb, aside from its other merits or faults, is that it was designed to be a crash only service. Meaning that the normal, correct way to shut it down is just an abrupt halt of the process. This means that functionally there's no difference between a normal shutdown and a crash or a killed process. This is quite valuable for a service where data integrity is important.
> If a program fails permanently and dramatically when it's kill-9'd
There are several kinds of bad state that can be left over:
(1) space leaks (disk, memory (sysv shared memory))
(2) inconsistent data (data / config files)
(3) locks/semaphores (used to protect (2) from happening in normal operation)
(We're talking about those that are under the control of the program itself, not the control of another program or the kernel (the latter can and generally will be cleaned up automatically).)
Programs can be written to clean those up when they are restarted, which is the reason for the saying that if some program does not do that, then it's not worth being left installed.
But it takes a certain amount of work to write code that recovers from an abandoned lock and the possibly inconsistent state left behind, so much so that even some widespread system libraries don't do it. Of course you're free to remove those system libraries from your system, but you may have to write a replacement.
Also, in the case of such libraries, you may end up locking out programs other than the one you SIGKILL'ed (any that happen to use the same libraries), without being aware of the source of the issue.
In particular I'm thinking of ALSA. If you SIGKILL a program while it uses [a "hw:" device in] ALSA, it leaves behind a semaphore that prevents other programs from accessing the same device indefinitely (it seems there's no recovery code in the ALSA libraries). It took me quite some time to figure this out; when I did, I wrote a utility to clean up the semaphores explicitly[1]. Every now and then I still need to reach for it.
Well, since you can't catch -9 (SIGKILL), any task in your program that shouldn't be interrupted... will be interrupted. That's a good rule in general, but programs aren't usually SIGKILLed; SIGTERM is the one you're supposed to plan for.
Permanent and dramatic failure caused by SIGKILL presents a candidate for deletion, but it depends on your definitions of permanent and dramatic.
It wouldn't be unexpected for a SIGKILLed process to fail to flush a cache, or to exit leaving persistent state inconsistent. That can be permanent (changes lost) and dramatic.
You can wrap operations in transactional overhead to greatly reduce the chance of data loss, but you can't eliminate it entirely without assistance from peers.
kill -9 is the equivalent of unplugging the machine in the middle of operations. It might be fine, and with careful design it might be fine almost every time.
That doesn't change when you add peers, though, so it's kinda beside the point. Oh, and a normal shutdown doesn't prevent flipped bits due to cosmic rays either ...
Unless you're talking about filesystems with specific support for atomicity, and databases with custom storage drivers for those filesystems, that's really not fair.
It would be far too easy to kill any program mid-write().
That is what journaling is for. A database that corrupts data when you kill -9 it is garbage. A database has to survive without corruption when power fails unexpectedly, and that's even harsher than kill -9 in terms of what can go wrong if your code is sloppy.
> Unless you're talking about filesystems with specific support for atomicity, and databases with custom storage drivers for those filesystems, that's really not fair.
Unless you're talking about a database not intended to be used in production, that's really not fair.
What if the power goes out? The UPS blows up? There's an earthquake and the ceiling comes down? Is it OK for a DBMS to corrupt its data then, too?
> Unless you're talking about filesystems with specific support for atomicity, and databases with custom storage drivers for those filesystems, that's really not fair.
Yes it is. The DB should not unrecoverably corrupt itself in case of hard power loss, and a kill -9 is significantly less traumatic than a hard power loss (which includes PSU melting/explosions or the UPS going down).
Isn't that the whole point of journaling? Of course databases can handle kill -9; they are designed to. That doesn't mean it's friendly to do that, because you'll pay a price replaying the journal.
Also, -9 can leave IPC objects hanging around, so some ipcs/ipcrm cleanup might be needed.
A database should not lose any committed data on `kill -9` in the default configuration. This is why PostgreSQL waits on fsync on the write-ahead log before completing a commit. This can be disabled with synchronous_commit off, in which case you will indeed lose data on a crash.
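For reference, this is easy to check from a shell; synchronous_commit is a real PostgreSQL setting and 'on' is its default, as described above:

    psql -c "SHOW synchronous_commit;"    # 'on' means COMMIT waits for the WAL to be fsync'd
    # Setting it to 'off' (e.g. in postgresql.conf) makes commits faster, but a crash or
    # kill -9 can then lose the most recent transactions, which is exactly the trade-off above.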
I said it already elsewhere in this thread, but either the writer has other persistent storage, in which case it should just keep the data, retry later, and nothing will be lost; or the writer doesn't have other storage, in which case you would also lose uncommitted data with a normal shutdown, since you can't even try to commit data while the database is down.
Actually, it can. Killing it with -9 prevents the process from writing the buffer pool etc. to disk, so the server needs to recover that state from the journal/log files. No data will be lost, though. So you're basically trading time at shutdown vs. time at start-up.
> So you're basically trading time at shutdown vs. time at start-up.
I'm being a bit pedantic, but trading time at shutdown vs. time at start-up + potential data loss (if the journal was in the middle of being written while the -9 was sent, you probably lose the last journal entry).
If that counts as data loss, then a "normal shutdown" also causes data loss. When the journal write gets interrupted, the database client won't get a success return, so it should keep the data it was trying to commit and retry the transaction at a later time; nothing is lost at all. When the client is some entity without reliable persistent state that cannot store the data in order to commit it later, that entity will be unable to connect to the database once the "normal shutdown" is complete, and thus will face the exact same problem, as the journal entry is "lost" before the write even starts.
> If that counts as data loss, then a "normal shutdown" also causes data loss.
I disagree. If a journal write was in progress on SIGTERM, you either fail or complete the write, returning the result to the client. In either event, you should end up with a clean journal on SIGTERM, assuming your implementation was sane (i.e. how I'd write it). If the client chooses to ignore the failure, that's not on the server and that's the whole point.
A normal shutdown should never require data loss. Somehow you're conflating SIGTERM with SIGKILL and I'm not understanding why.
A client that relies on receiving an explicit failure response is broken. When the server is killed by the OOM killer or when power of the server fails or the network connection between the client and the server fails (for too long) while the commit is in progress, the client also doesn't get an explicit response, and thus also doesn't know whether the commit succeeded or not, so it has to deal with that case anyhow - and if it doesn't lose data in that case, it won't lose data on SIGKILL either.
And the whole point of a journal is that it doesn't need to be clean. If your database requires the journal to be clean on startup as a condition for its correctness, then the journal is useless, it could just as well just require the data itself to be consistent. The only reason why there is a journal is so that correctness is not affected when writes are interrupted as any point.
Also, for that matter, a SIGTERM might be able to guarantee that the journal is clean, but it would be highly broken if it tried to guarantee that the client gets to know about the result, as that could take an arbitrary amount of time that might be dependent on the behaviour of remote systems, which would be terrible shutdown behaviour indeed.
Really, the only reason why a database should even try and catch SIGTERM is when checkpointing on shutdown is a lot cheaper than log recovery on startup - other than that, it only makes the code more complicated without providing any benefits.
I remember learning about kill as a kid, then kill -9 in college, and after college re-learning it via Monzy[1].
I had sort of forgotten the importance of the signal argument. Such an incredibly powerful command that is nothing more than a representation of a very simple decision or architecture [2]:
Some of the more commonly used signals:
1 HUP (hang up)
2 INT (interrupt)
3 QUIT (quit)
6 ABRT (abort)
9 KILL (non-catchable, non-ignorable kill)
14 ALRM (alarm clock)
15 TERM (software termination signal)
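And all of these are just different spellings of the same request (the pid is made up):

    kill -15 1234        # by number
    kill -TERM 1234      # by name
    kill -s TERM 1234    # POSIX-style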
Heh, I went to try it and sure enough it was in the manpage, but it didn't work on the command line because... the manpage was for /bin/kill, while on the command line I was using the builtin kill. WOW!
% kill -L
kill: unknown signal: SIGV
kill: type kill -l for a list of signals
% which kill
kill: shell built-in command
% /bin/kill -L
1 HUP 2 INT 3 QUIT 4 ILL 5 TRAP 6 ABRT 7 BUS
8 FPE 9 KILL 10 USR1 11 SEGV 12 USR2 13 PIPE 14 ALRM
15 TERM 16 STKFLT 17 CHLD 18 CONT 19 STOP 20 TSTP 21 TTIN
22 TTOU 23 URG 24 XCPU 25 XFSZ 26 VTALRM 27 PROF 28 WINCH
29 POLL 30 PWR 31 SYS
Several of the answers on stackexchange and comments here are absurd.
SIGKILL shouldn't be your first resort but sometimes is necessary as a last resort. If the sky falls and you can't deal with it, there's something wrong in a lot of places which have nothing to do with signals.
But 'nuclear option' attaches too much meaning to it; if you're in that position, you'll run into plenty of circumstances where SIGKILL is necessary. It's a perfectly fine tool to use and deserves no extremist opinions.
Right... In Windows and OSX you have to do the same thing fairly often. It's called "force quit" or something like that. Otherwise you'd sometimes be waiting an eternity for programs to get themselves unstuck. If a program leaves a messy state behind when you `kill -9` it, a decently written one will automatically clean up the mess the next time it runs. If it doesn't, then don't bother using it, because it's extremely poorly written. (If there are orphaned child processes left behind, I kill them manually.)
Usually it's just an obviously good idea to send a milder signal first, because it's less likely to leave orphaned child processes and `kill 345` is just plain easier to type than `kill -9 345`. Also, trying a TERM signal first gives you some feedback on how fubar the process really is.
Yeah, Windows has "Force Quit" baked into their regular shutdown process now - if any program is blocking the shutdown for more than a second or three, the user will get the option to kill everything. Under windows, there is now no excuse not to properly handle a forced abrupt termination, because the layman users will do it.
I always looked at it like: I have tried to stop/shut down this program in the documented way, then I tried a kill -15, but this thing won't go, so kill -9. I then think very hard about why I allow such a thing on a system.
Sometimes you have to kill -9, but you shouldn't do it unless you tried other signals and they didn't work, and you know what the consequences are.
For example, postgresql forks a process for every connection. What you may not know is, if you kill one of these processes, it needs to clean up its use of the shared memory pool. If you kill -9 any of postgresql's child processes, the other processes will see that a peer died uncleanly, and the postmaster will terminate every other backend and go through crash recovery rather than risk shared-memory corruption.
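If you just need one PostgreSQL backend gone, a gentler option is to ask the server to do it for you; pg_cancel_backend and pg_terminate_backend are real functions, the pid below is illustrative:

    psql -c "SELECT pg_cancel_backend(12345);"     # cancels that backend's current query (like SIGINT)
    psql -c "SELECT pg_terminate_backend(12345);"  # terminates the backend cleanly (like SIGTERM)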
This is a Postgres thing tho', you can kill Oracle shadow processes willy-nilly with no consequences. Oracle has another process PMON that will clear up after them. If you kill PMON (or SMON) however the DB will shut down. However no data will be lost; at one company I worked at kill -9 on SMON was the normal way to shut the production DB down!
Some databases allow local shared memory connections. So, if you kill -9 a regular client that connects to the database that way, the whole database goes down. Fun times!
Android's OOM killer won't run until after an activity's onPause() or onStop() method has been called, which gives applications a chance to save their state. A foreground activity is typically considered unkillable.
Background services have weaker lifecycle guarantees, but the system can be asked to automatically restart your service and re-deliver any Intents that were being processed.
It's not like the system is regularly killing apps without any recovery options. Dealing with these lifecycle events is a key part of Android development, so apps are designed to deal with them.
The process is considered killable after onPause(); there's no requirement that onStop() be called. And, in practice, onStop()/onDestroy() are rarely called outside of an activity calling finish() on itself.
But that call only gives that activity a chance to save state, not the process as a whole. You don't run around stopping all your services and such in onPause(). There is no SIGTERM equivalent where you can go around doing actual cleanup work for the entire process.
As for Services, you'll note they're considered killable at any given point; there are largely no guarantees other than that they won't be killed in the middle of executing code in onStart()/onStop(). But during the bulk of their actual work they're totally up for random killing.
And fwiw foreground processes are totally considered killable. They are the last in the queue, yes, but they are still in the queue.
The difference between SIGKILL (is Android actually using SIGKILL?) and onPause() then SIGKILL is that the process still has time to save state (the most important part) and the code itself is not resumed until onResume() is called.
On standard UNIX there's no onPause() or anything similar, so the process cannot react to this in any way.
The state save is important in terms of being able to rebuild the UI quickly. It's not important in the context of "heavy" resources like files, sockets, etc... None of those are ever cleaned up in onPause. The worst thing that will happen if you SIGKILL an Android app without calling onPause first is that the next time it's launched it won't resume from where you left off, it will be as if you rebooted the device.
Also to be clear the onPause and SIGKILL are not tied together. You could get an onPause and it be minutes or hours or even days before you get SIGKILL'd, during which you are completely free to keep running code in the background.
And depending on how your process was started there might not even have been an Activity to be onPause'd in the first place. Consider an app that started doing work in response to a broadcast or content provider query.
iOS may use SIGKILL when it's out of memory or your app is killed in the background (in most cases it sends a memory warning or an applicationWillTerminate: message to the app delegate first, and gives you up to 5 seconds before the SIGKILL).
You do need to be a little careful as an app writer. All your file writes should be atomic (either SQLite/CoreData or using the atomic write methods in CoreFoundation and Objective-C) or you must be able to detect and delete partially written files and recover in situ.
> All your file writes should be atomic (either SQLite/CoreData or using the atomic write methods in CoreFoundation and Objective-C) or you must be able to detect and delete partially written files and recover in situ.
You should be doing that regardless, there's nothing special about SIGKILL in this regard. Dirty pages will still get flushed to disk, so SIGKILL is safer than, say, sudden power loss. Which isn't exactly unheard of on battery powered devices, after all.
I wouldn't say it's totally fine when the OOM killer kills mysqld and corrupts tables, for example, when a clean shutdown would have been the better alternative. Unfortunately the oom-killer is not that smart.
I really hope you are wrong about killing mysqld corrupting tables, because if it does then I would not want mysql near my data. Databases should use a journal for crash safety.
"I use kill -9 in much the same way that I throw kitchen implements in the dishwasher: if a kitchen implement is ruined by the dishwasher then I don't want it."
Somebody please write a bash script `murder' that sends 15, 2, 1 and then 9, with a slight delay between each signal. I'd do it myself, but I'm not very bash-proficient and have to run to work now ;).
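Something along these lines ought to do it (untested sketch; the delay and the exact signal order are easy to tweak):

    #!/bin/bash
    # murder: escalate through progressively less polite signals: 15, 2, 1, then 9.
    # Usage: murder <pid>
    pid="$1"
    for sig in TERM INT HUP KILL; do
        kill -s "$sig" "$pid" 2>/dev/null || exit 0   # if kill fails, assume it's already gone
        sleep 2                                       # give it a moment to react
        kill -0 "$pid" 2>/dev/null || exit 0          # gone now? then we're done
    done
    echo "murder: $pid survived even SIGKILL (probably stuck in the kernel)" >&2
    exit 1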
So there are two answers, one saying to try without -9 first (which is obvious), and the other suggesting you uninstall any program where -9 is necessary.
Was this submitted to HN for more opinions? So people could see the disagreeing answers? Something else?
When you care enough to send the very best, kill -15. When you want evidence (core dump), kill -11. For all other (most) purposes, kill -9 never hurt anyone but a process you wanted shot down anyway.
I can give a real practical reason from experience - I was writing a script that needed to pull data from the database. I noticed I was doing something stupid in the SQL query, so I wanted to kill it and rerun it with the change.
I kill'd the script running it, and then kill -9'd it when that didn't work. Two weeks later someone asked about my query that was still running on the database.
And now I'm the one who warns people not to kill -9 scripts without understanding why it's stuck and how to clean it up properly.
You will want to SIGKILL when you need to regain control, and/or when the process is not fundamental. E.g. some ancillary script stuck at 100% CPU or making the server swap crazy that isn't responding to SIGTERM. Anything more fundamental is likely to respond well to SIGTERM or not have the CPU/swapping problem in the first place.
If you're not a server admin, kill -9 probably won't mess up your own machine in a significant way. I'm by no means a veteran, but I've used Linux as my primary workstation for 4 years now and haven't yet had reason to regret using kill -9. And given the number of times I have used it, I prefer a single kill -9 as opposed to 2-3 more kills beforehand.
This just about sums it up:
"Don't use kill -9. Don't bring out the combine harvester just to tidy up the flower pot."
(from this answer http://unix.stackexchange.com/a/8927)
Not always. The kill system call can be used to send any signal, it's just got a name that implies you're sending something like SIGKILL or SIGTERM. I have written C programs that use kill for harmless inter-process communication.
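You can do the same kind of harmless signalling from the shell, too; a toy sketch where SIGUSR1 is just a "poke" and nothing dies:

    #!/bin/bash
    # receiver: treat SIGUSR1 as a notification instead of a reason to exit
    trap 'echo "got SIGUSR1, doing some work"' USR1
    echo "pid $$ waiting to be poked"
    while true; do sleep 1; done

    # from another terminal:  kill -USR1 <that pid>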
It's SIGINFO on the BSD's, and works with just about any process (though obviously not everything has a customized handler). Also handily mapped to Ctrl-T.
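dd on the BSDs / OS X is the classic example; it reports its progress when it gets SIGINFO:

    dd if=/dev/zero of=/dev/null bs=1m &
    kill -INFO %1         # or just press Ctrl-T in dd's terminal; prints records in/out so far
    kill %1               # then end it with a plain SIGTERM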
No because only SIGQUIT, SIGABRT, SIGKILL, SIGTERM and often SIGHUP are supposed to do that. All of the other signals have wildly varying meanings. See man 7 signal: http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7
If your application's child processes are orphaned when the parent dies unexpectedly (for more than a few milliseconds), that is a bug in your program.
That's not what is normally meant by "multiprocessing" - multiprocessing is when your application forks multiple processes of itself in order to get some concurrency/parallelism, and those forked processes should monitor their parent and exit immediately when the parent disappears.
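A crude sketch of that monitoring in shell (a real implementation would more likely use something like prctl(PR_SET_PDEATHSIG) on Linux, but the idea is the same):

    #!/bin/bash
    # worker: exit as soon as the parent that started us disappears
    parent=$PPID
    while kill -0 "$parent" 2>/dev/null; do
        # do_some_work     # placeholder for the worker's real job
        sleep 1
    done
    echo "parent $parent is gone, exiting" >&2
    exit 0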