Do not mount NFS with "soft" unless you really know what you're doing. NFS's tendency to hang is not "stupidity" - it's actually one of the best things about NFS. Applications do not deal well with failed reads/writes. If there's a brief network interruption or the NFS server goes down, it's WAY safer to make applications hang until the server comes back. Since NFS is a stateless protocol, when the server comes back, I/O resumes as if nothing ever happened. This helps make using NFS feel more like using a local filesystem. Otherwise it becomes a very leaky abstraction.
Not being able to kill processes stuck on NFS I/O is annoying though, so you can mount with the "intr" option and that makes such processes killable. However, since Linux 2.6.25, you don't even need this and SIGKILL can always kill applications stuck in NFS I/O.
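For anyone following along at home, all of this is just mount options. A minimal sketch (server name and paths are made up, and on anything newer than 2.6.25 "intr"/"nointr" are accepted but ignored):

# "hard" is the default: stuck I/O waits until the server comes back
mount -t nfs -o hard,intr server:/export /mnt/data
# "soft": fail the request after retrans retransmissions and return an
# error (EIO) to the application - only sane if the app genuinely handles that
mount -t nfs -o soft,timeo=100,retrans=3 server:/export /mnt/data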
Expecting NFS (or any other remote filesystem) to behave as if it were local is a fundamental error.
Time, and the speed of light, ultimately matter. If you need assurance, find a way of getting reliability into your system through redundancy and locality. Distinguish between "task has been delegated" and "task has been confirmed completed". Down any other path lies pain, and anyone who tells you otherwise is selling something.
You're going to have to compromise: whole systems (or clusters) going titsup because your NFS heads had a fart, or lost commits. Neither is very attractive when shit's on the line.
Your database relying on NFS is a fundamental error you'll have to design around.
NFS can be faster than local storage. Compare a very slow tape drive versus a very fast 10Gbps network connection to an NFS server with a huge in-memory cache.
The problem is expecting any filesystem to be reliable.
> Expecting NFS (or any other remote filesystem) to behave as if it were local is a fundamental error.
Any storage medium is subject to failure; the issue isn't specific to remote file systems. In fact I personally consider the Plan 9 approach to file systems to be one of the greatest ideas it has imparted to Linux and Unix. So I'm all in favour of local file systems, services, and remote objects and file systems transparently behaving as one.
> Time, and speed of light, ultimately matter.
People often cite the speed of light when talking about electronics when actually electrons don't travel at the speed of light (they have mass). The difference may be small, but using precise scientific terms imprecisely is one of my pet hates.
> You're going to have to compromise: whole systems (or clusters) going titsup because your NFS heads had a fart, or lost commits. Neither is very attractive when shit's on the line.
If this is a serious issue then you should be looking into iSCSI. A bad workman blames his tools, a good workman finds the best tool for the job.
> People often cite the speed of light when talking about electronics when actually electrons don't travel at the speed of light (they have mass)
Actual electrons move pretty slowly - of the order of millimetres per second down to millimetres per hour, depending on the current. That speed has nothing to do with the speed the signal propagates down the wire.
The speed at which the signal propagates is the speed at which the electromagnetic wavefront moves along the wire, which is mostly limited by the dielectric constant of the wire's insulator. (That actually is related to the speed of light in the insulator - they both depend on its permittivity.) It's not related to the speed electrons move in the wire, which depends on its cross-sectional area and the current. In particular, the fact that electrons have mass isn't relevant to the wavefront propagation speed.
(C.f. fibre optic cables, which will have a wavefront propagation speed not dissimilar to copper wire (i.e. both will be a substantial fraction of c), even though, unlike copper wire, their carriers are massless and actually do move at the wavefront speed).
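To put a rough number on it (illustrative figures, not from anyone upthread): the wavefront speed is

v = c/\sqrt{\varepsilon_r} \approx c/\sqrt{2.3} \approx 0.66\,c

for a typical polyethylene dielectric with \varepsilon_r \approx 2.3 - which happens to be close to the speed of light in the glass of a fibre as well.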
Analogy: imagine pushing the end of a very long, rigid broomstick. The actual wood in the broomstick moves pretty slowly (maybe you move it a cm in a second). But the person at the other end feels their end of the broomstick move almost immediately, limited only by a speed of light delay (or a little more if the broomstick isn't as rigid as possible).
Damn, you beat me to it. I think you did a better job explaining than I would have, too. An upvote for you, sir!
For any skeptics or laymen who got lost at 'electromagnetic wavefront', or who are wondering why electrons would seem to move so slowly when we think of electricity (and even free electrons) as moving so fast, the broomstick analogy really does not do it justice, since you can't see the broomstick oscillating very quickly the way you might have guessed the electrons must to satisfy Heisenberg:
The important concept to understand is that information does not move trapped inside of individual electrons, and thus is not at all limited by electron drift velocity.
Fair point. For the analogy I was going for a hypothetical rigid broomstick, made of stuff that's as incompressible as physically possible. The speed of sound in that would be c. Yeah, that isn't very realistic, but I don't think that hurts the point it was illustrating.
My experience with NFS failures is that they're an order or two of magnitude higher (and much more binary) than local storage issues.
Speed of light matters in that it sets an absolute lower limit on the latency of connections between two points, particularly if they're outside of a single datacenter. While non-local storage can be faster (and more reliable) than local storage within a well-architected datacenter environment, even a link of a few tens of miles, let alone thousands, will introduce SOL latencies, carrier inconsistencies, and switching and other equipment idiosyncrasies. Your point that actual propagation rarely hits SOL speeds only compounds the problem.
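To put rough numbers on that floor (illustrative only, assuming propagation at about two-thirds of c in fibre or copper):

t_{\mathrm{RTT}} \approx \frac{2d}{0.67c} \quad\Rightarrow\quad d = 100\,\mathrm{km}: \; t_{\mathrm{RTT}} \approx 1\,\mathrm{ms}; \qquad d = 4000\,\mathrm{km}: \; t_{\mathrm{RTT}} \approx 40\,\mathrm{ms}

and that's before any switching or queueing delay.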
I've looked at iSCSI. NFS, for its warts, allows multiple concurrent access, iSCSI binds storage to a single system. Management/diagnosis tools, vendor sophistication, documentation, and support are sorely lacking (or were on my last exposure). NFS may suck, but we can quantify, analyze, diagnose, and understand its suckage.
My own preference is for distributed network/application based storage: caching for repeated static queries; direct remote database access (if your network and remote storage are fast enough for NFS, they're fast enough to put the DB server directly on hot storage); Hadoop/MapReduce type solutions for higher-speed access; git or other change-and-push state for less aggressive changes. Getting to my earlier point: this requires architecting your application to take account of remote storage capabilities and limitations. It need not be specific to a given technology (and ideally you'd abstract out to the level of "a cache", "a database", "a keypair storage", "a versioning system"), but you're also not just expecting local filesystem semantics and reliability.
> My experience with NFS failures is that they're an order or two of magnitude higher (and much more binary) than local storage issues.
Usually that's the fault of something lower down in the chain than NFS itself. If you're building a home-brew SAN using NFS, then there are a lot of kernel tweaks and the like that you can apply. The Linux kernel is a great all-rounder, but if you're building your own enterprise-grade appliances then there's a lot more work required than just vi'ing /etc/exports.
But anyone using NFS to run databases or for caches needs their head examining. It's purely a protocol for file system sharing; caching should be on a ramdisk, databases should be using the dedicated TCP/IP channels preferred by the RDBMS in question, and so on. Using NFS as a blanket solution like that is a little like running websites over NFS and wondering why PHP isn't compiling properly. In fact some versioning systems do use HTTP as their method for syncing.
I'm referring to an enterprise implementation comprising over 5000 nodes and somewhere in the neighborhood of 300-400 NFS mounts under typical use (~60-70 at boot, additional mounts via autofs under typical workloads).
Issues: ad-hocery, former Sun shop, growth-through-acquisition (including acquiring the application infrastructure of said firms), 20 years of legacy, much staff attrition (no more than normal, but even 5-10% means virtually complete turnover within that period), geographically distributed network (3-4 continents), etc.
Pretty much a worst pathological case, except that it's very standard in many, many established firms.
Will some kind of distributed file system work here? If you can cluster your NFS servers you'd (hopefully) end up with more localised nodes for your clients to connect to (a little bit like how a CDN works). For what it's worth, I wouldn't even run 5000 simultaneous HTTP requests on a single node in any of my web farms, let alone on a single NFS server.
Also, have you looked into kernel-level tweaks? I'm guessing you're running Solaris (being a former Sun shop), and while I am a sysadmin for Solaris, I've not needed to get this low-level before; certainly on Linux, there are a lot of optimisations that can be made to the TCP/IP stack that would improve performance in this specific scenario.
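For what it's worth, this is the kind of thing I mean - a sketch with illustrative names and values only, not copy-paste advice for your environment:

# bigger TCP socket buffers for high-bandwidth / high-RTT paths
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# allow more concurrent RPC slots per mount on the Linux NFS client
echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
# plus larger rsize/wsize on the mounts themselves, e.g. rsize=1048576,wsize=1048576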
I do agree with you that 400-5000 active NFS connections does push the bounds of practical use, but I don't think that dismisses NFS entirely; it still outperforms all other network-mounted file systems.
In my experience NFS is fine as long as the traffic does not need to traverse a WAN link. The problem, as I see it, is that once something is available via NFS everyone wants to mount it from everywhere. NFS simply does not deal well with high RTT connections.
This would be a good time to re-read Waldo's "A Note on Distributed Computing", which points out how remote filesystems will never act like local filesystems. http://labs.oracle.com/techrep/1994/smli_tr-94-29.pdf
I consider any uninterruptible sleep in the kernel a bug. There's no technical reason a process waiting for a resource (e.g. disk I/O) couldn't be killed on the spot, leaving the resource to sort itself out. If it can't be, it just means it hasn't been implemented in the kernel.
It's a bug motivated by compatibility. In the original 70's implementations of Unix, file system I/O mostly led to a busy wait in the kernel and thus was not interruptible, because interrupting it was simply not possible, and there were applications that relied on this behavior. On Unix, a signal received during a system call generally causes the kernel to abort whatever it was doing and requires the application to deal with that situation and restart the operation. Implementations of stdio in libc generally do the right thing, but most applications that do filesystem I/O directly do not (and a surprisingly large number of commonly used network services behave erratically when a network write(2) is interrupted by a signal). And even applications that handle -EINTR from all I/O still have places where it is not handled (allowing interruptible disk I/O would cause things like stat(2) to return EINTR).
Allowing SIGKILL to work and not any other signal is an ugly special case, and while generally reasonable it is still a special case that is relevant for things like NFS (the modern Linux NFS client lets you disable this behavior) and broken hardware (and then trying to recover the situation with anything other than a kernel-level debugger is mostly meaningless, with power cycling being the real solution when you can do that; incidentally, we currently have a similar issue on one backend server where power-cycling is not an option).
> implementations of stdio in libc generally do the right thing
Tell me, what is the "right thing" for stdio to do when it sees EINTR? It strikes me that this can't really be solved at the library level. There are times when you'll want to retry and there are times when you'll want to drop your work and surface the error to the caller. Doesn't seem to me like a library can decide which is which. Which is probably why the I/O syscalls need to surface it in the first place. (I'd argue if a library like stdio, which does nothing but wrap syscalls and buffer stuff, can decide it, then there's no need for EINTR to exist at all because the syscall could theoretically make the same decisions.)
The right thing is almost always to retry the syscall. Syscalls on Unix return EINTR because it makes the kernel simpler, which was a key design goal in Unix[1]. If you need to do something when a signal fires, you do it in a signal handler (relying on EINTR instead of a signal handler is error-prone because if the signal fires between syscalls you lose it).
That's the theory anyways - in practice it's really hard to use signal handlers to do stuff because of things like threads and async-signal safety. There are newer syscalls like pselect() which let you atomically unblock signals, execute the syscall, and re-block the signals, meaning EINTR can be used reliably, and then there's the even newer Linux-only signalfd() syscall which lets you receive signals via a file descriptor. But stdio is still very much oriented for the older signal handler approach.
Yes, I'm aware, "almost always". Not always, though. I was thinking specifically of an application that might want to use signals to cancel blocking I/O and continue running.
(PS: When I wrote my reply I was also already familiar with your linked article, the challenges of signal safety, and the signalfd() syscall. Surely an interesting set of topics but I still maintain that a library doesn't really have a "good" way to deal with EINTR, especially if all it does is wrap read or write.)
The simplest solution would be for libc to export some flag that could be set in a signal handler to signify that the I/O operation should be aborted.
As for the simpler kernel: I think the Windows NT/VMS approach, where user code has to explicitly block on I/O completion, is simpler kernel-wise, but it leads to unnecessary complexity in applications (which is abstracted away by the winapi, though it's sometimes a leaky abstraction). On the other hand, the most common reason for interrupting syscalls is timeouts, and then killing the thread is most often what you want.
In all, EINTR is not a way to find out that there was a signal during a syscall but a hack to get the process into a meaningful state the easiest possible way when the signal handler runs. By the way, for some syscalls post-2.6 Linux does something reasonably similar to ITS's PCLSRing transparently, without returning EINTR.
SIGKILL is not really a "special case", because signals are not equivalent, they have assigned semantics. SIGSTOP also can't be caught. The fact that other signals can be caught and these can't just follows from their intended purpose. If these two signals could be caught, they wouldn't be serving their purpose, and another way of forcibly killing a process would have to be devised.
Could you please elaborate on how fixing all uninterruptible sleeps would break existing software? Note that I'm not talking about adding any new I/O timeouts to NFS or anything else during normal operation - only that when a process receives SIGKILL it will die immediately, as opposed to when the I/O completes. But if you're sending SIGKILL you surely want the process to die.
So having a system that cannot be safely shut down because someone has opened a file handle on a remote fs that became unavailable is preferable to special-casing one signal?
But by all means, if the signal thing bothers you - fine, I just really want some way to stop the process. Let's add two new system calls, something like "terminate with extreme prejudice" (abbr. twep) and "unmount without mercy" (abbr. uwm).
> [Fixing the bug] was simply not possible and there were applications that relied on this behavior [so now that it's technically possible to fix it, it would break compatibility so much we actually can't fix it anyway.]
I'm beginning to wonder if it's possible to design a platform complex enough to be usable without running into this problem.
Can anyone explain the "Why is a process wedged?" part? I do understand that piping from /dev/random to /dev/null is going to run forever, but I do not understand the gdb-output, nor what that has to do with the rest of the text.
strace attaches a tracer to the process to see what system calls it's making
gdb ${CMD} ${PID}
attaches gdb to the process (gdb needs the program name and can also be given a pid to attach to)
(gdb) disassemble
prints the actual (assembler) code being run. In this case, I think all we get from the output is that it's in fact in the middle of some syscall - you'd have to check registers and syscall tables to determine which.
As others have mentioned, much of this is less useful than implied in the face of an actual wedged process.
Tangentially, using gdb to attach to running processes is a very powerful technique - I've been able to get line numbers out of running bash scripts.
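Roughly the routine I use - a sketch only (the PID is a placeholder, and the gdb attach can itself hang if the process is truly wedged):

PID=12345                            # whatever pgrep/ps gives you
grep State /proc/$PID/status         # 'D' means uninterruptible sleep
cat /proc/$PID/wchan; echo           # kernel function it's sleeping in
cat /proc/$PID/stack                 # kernel stack trace (root only, if compiled in)
gdb -p $PID -batch -ex bt            # userspace backtrace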
So that whole section doesn't make a lot of sense to me. Running strace on an unkillable process tends to produce no output (if the process were actually making new system calls, it wouldn't be wedged). And worse, my experience is that attaching to a wedged process with ptrace() usually does nothing at all except also hang the attaching process. This also applies to gdb.
And finally, even if attaching worked, getting a disassembly of the current PC (in this case the syscall trampoline) would tell nothing useful about what's going on.
I don't understand it either. The only syscalls going on there are reads from /dev/random and writes to /dev/null - both of those can be interrupted (in fact writes to /dev/null should be instantaneous).
I think the author may be conflating applications blocked in system calls (since reads from /dev/random will block if the system lacks entropy) with applications blocked in uninterruptible system calls.
There are no reads from /dev/random successfully happening in the example (not after the first few blocks, anyway).
/dev/random reads from an entropy pool that is quickly exhausted and slowly filled. The kernel will lock a process in D state while waiting for entropy.
If you're expecting a stream of pseudo-random data then you can get that by directing /dev/urandom to /dev/null.
> The kernel will lock a process in D state while waiting for entropy.
No it doesn't, at least not in Linux 3.0.57 or any other kernel I can remember for the last many years. It blocks if it needs entropy, but it's interruptible meaning it's in the S state, not the D state, and can be killed.
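Easy enough to check for yourself - a quick sketch (the exact wchan shown will vary by kernel):

cat /dev/random > /dev/null &        # blocks once the entropy pool drains
sleep 2
ps -o pid,stat,wchan:25,cmd -p $!    # STAT column shows 'S', not 'D'
kill $!                              # and a plain SIGTERM kills it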
I think one of the reasons the kernel can't kill a process is that one of the process's threads is blocked inside a kernel call (not completely sure about this). He's using ps's 'wchan' option to get the address of the kernel function that the process is currently blocked or sleeping in. After he gets the wchan address, he uses gdb to map the address to a function name.
nwchan   WCHAN   address of the kernel function where the process is
                 sleeping (use wchan if you want the kernel function name).
                 Running tasks will display a dash ('-') in this column.
/proc/{pid}/wchan contains the function name, not its address.
In this example he's just printing the syscall that his example process is running.
The calls to strace and gdb are disjoint, and he's cut out a lot of stuff here, which is probably why it's confusing. With the commands he's used, strace would dump all the syscall details to the console and gdb would just attach to the process and enter its normal console (presumably he's determined it's stuck in a loop, so he's just using disassemble).
I believe the author is using that as an example of a command that will 'wedge' if you tried to kill -9 it.
The rest of it is an example of how to dump the cause of a process that is wedged.
"/proc/{pid}/wchan" contains the name of the current syscall a process is executing.
Basically you should be able to use everything but the first line of that section as a shell script(with some modifications) to determine the cause of a wedged process.
If kill -9 does not work, it's a bug. The kernel needs to be able to end processes no matter what the process is doing; by definition this should not depend on how the misbehaving process was implemented. I imagine practical considerations are keeping these bugs in there, e.g. the effort of making all processes killable would stand in no relation to the gains - it's hard to do, and rare to occur.
The number of times I have spent 10 minutes staring at the output of 'pgrep processname', when I had attached gdb to the process in another terminal session... Urgh!! :-/
sshfs used to have this problem and it was enough to bring nautilus and a lot of other applications to their knees as they tried to stat() my homedir and failed on the hung sshfs mount point.
Same for the cifs.kext (or was it smbfs.kext) in early versions of OS X. Putting your laptop to sleep with a mounted Samba share was enough to slowly grind the system to a halt when you woke it up.