Are there people running servers that actually swap without it being a disaster? For regular 'cattle not pets' servers I just decline to create a swap partition and set vm.panic_on_oom = 1.
Java's bloated heap, for example, has easy-to-use startup options that make the Java process use a deterministic amount of RAM, as does any other sane program that uses nontrivial amounts of memory.
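A sketch of what that looks like (the sizes and the jar name are made up; note the heap isn't the JVM's whole footprint -- stacks, code cache and direct buffers come on top):

    # pin the heap so the JVM's RAM use is predictable from startup
    java -Xms4g -Xmx4g -XX:MaxMetaspaceSize=256m -jar myservice.jar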
OOMing (or swapping) a server almost always indicates an error that needs human intervention, so cleanly killing/rebooting the faulting system and raising the alarm in your monitoring system is the right thing to do.
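Concretely, that policy is two sysctls (a sketch of the idea; kernel.panic is the reboot delay in seconds):

    # fail fast instead of thrashing: panic on OOM, then reboot
    vm.panic_on_oom = 1
    kernel.panic = 10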
"Swapping" does not imply that the system is only swapping. Yes, if the system is only swapping - or, more accurately, if kswapd is what's pegging your CPUs at 100% - then you have problems. But the kernel will try to swap out unused pages to make room for the disk cache. If you turn this off, it won't do that, and you could end up hitting the disk more.
Modern systems are usually teetering near the edge of using all physical RAM. This is by design; unused RAM is wasted. You want to use as much RAM as possible to avoid going to disk. What you call "OOMing" is when the applications on the system require more physical RAM than what exists. This is independent of the swappiness ratio.
The goal of virtual memory is to apportion physical memory to the things that need it most -- keep frequently used data in RAM, and page out things which aren't frequently used. This makes things go faster. When you disable swap, you constrain the VM's ability to do this -- now it must keep ALL anonymous memory in RAM, and it will page out file-backed memory instead, even if that file-backed memory is much hotter.
I came to this conclusion by following the kernel community, starting with the "why swap at all" flamewar on LKML. See this response from Nick Piggin (http://marc.info/?t=108555368800003&r=1&w=2), who is a fairly prominent kernel developer. Nothing I've read from the horse's mouth has refuted this since then. This is true even on systems with gobs of memory.
You're worried about systems which grind to a halt under memory pressure, which is unquestionably a concern. The thing is, disabling swap doesn't fix this. As soon as you're paging out important file-backed pages (like libc), your system is going to grind to a halt anyway, and disabling swap can't prevent this (same point from a VM developer here: http://marc.info/?l=linux-kernel&m=108557438107853&w=2). To really give your prod systems a safety net, you need to (a) lock important pages (like, say, SSH and libc) into memory or (b) constrain processes which hog memory with (e.g.) memory cgroups, as in the sketch below. IMHO cgroups/containers are a better solution.
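For (b), a minimal cgroup v2 sketch (assumes a standard /sys/fs/cgroup mount; the group name, limit, and $PID are made up):

    # cap a memory-hungry job so the OOM killer takes it out before
    # it can push hot file-backed pages (libc, sshd) out of RAM
    mkdir /sys/fs/cgroup/hog
    echo 4G   > /sys/fs/cgroup/hog/memory.max     # hard limit
    echo $PID > /sys/fs/cgroup/hog/cgroup.procs   # move the job in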
I understand the idea, but I don't have a lot of faith in the kernel to make the right decision under memory pressure--and neither do server application developers, hence the proliferation of massive userland disk caches in processes. The amount of allocated memory that the kernel can find to discard on a no-swap system is fairly small, so I'm somewhat relying on a rogue program blowing through the caches and going all the way to panicking the system.
Ideal tuning would probably also reserve a decent amount of space for the file cache and slab, but I'm not aware of any setting that does that.
It's an entirely different story on laptops and development servers, where the workload varies widely and may contain large idle heap allocations worth swapping out, and where manually configuring memory usage isn't practical.
>I understand the idea, but I don't have a lot of faith in the kernel to make the right decision under memory pressure--and neither do server application developers, hence the proliferation of massive userland disk caches in processes.
There is a school of thought which holds that server devs, especially, should just trust the operating system in this regard. One notable member of that school is phk of FreeBSD and Varnish fame: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
I don't disagree with swap being helpful in some (maybe most?) cases.
But there are cases where it clearly does more harm than good. I had a PostgreSQL database server with a lot of load on it. The server had loads of RAM, more than what PostgreSQL had been configured to use plus the actual database size on disk. Even so, Linux one day decided to swap out parts of the database's memory, I assume because that part was very rarely used and the kernel decided something else would be more useful to have in memory. When queries came in that touched that part of the database, they had huge latency compared to what was expected.
Maybe I'm misremembering, and maybe there was some way of preventing that from happening while still having swap enabled on the server...
vm.swappiness [1] could be what you're looking for. I find the default value of 60 leaves my desktop more prone to swapping than I would like (but then Firefox consuming >15 GB of RAM leaves it little choice).
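e.g., to bias the kernel toward dropping page cache instead of swapping out anonymous pages (the exact value is a matter of taste):

    sysctl vm.swappiness=10    # persist it via /etc/sysctl.conf if it helps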
You reboot a machine every time it starts to use swap?? What the hell kind of applications are you running?
Even with java apps (which I grant you are difficult to estimate memory limits on without knowing the application's design) swapping can be a useful way to page out unneeded parts of memory - and more importantly, keep needed parts of memory intact. This can also mean keeping the Java processes in memory while paging out apps which are less crucial, which leaves more room for Java, etc.
Swap is, for lack of a better comparison, the canvas sheet you use to catch someone jumping off a building, or a New York City sewer. In the first case you can use it to save your applications/servers so you don't need to reboot them (the higher your availability requirements, the less you can stand random reboots). In the second case it's the place you send those inhabitants you don't deem worthy of RSS.
The other thing you have to consider is the idea of memory overcommit in application and kernel design. The system is built on a promise that it has way more memory than it physically does. Applications will happily reserve a huge chunk of address space and don't care that the system is lying to them about how much is really available. Without swap, when these apps attempt to use more of that reservation than physically exists, they crash. With swap, they survive.
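A toy C program shows the promise in action (a sketch; it assumes the box has at least 8 GiB of RAM+swap total so the default overcommit heuristic accepts the reservation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        size_t reserve = 8UL << 30;      /* ask for 8 GiB of address space */
        char *p = malloc(reserve);       /* succeeds under overcommit, even
                                            if that much isn't free */
        if (!p) { perror("malloc"); return 1; }

        /* Pages only become real when touched. On a swapless box, touch
           more than free RAM and the OOM killer (or a panic, with
           vm.panic_on_oom=1) ends the experiment. */
        memset(p, 1, 1UL << 30);         /* touch just the first 1 GiB */
        puts("touched 1 GiB of an 8 GiB reservation");
        free(p);
        return 0;
    }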
Then there are apps designed to rely on swap, like Varnish with its malloc and file storage methods, or database servers. Even if you disable swap there are still performance and stability problems related to the VM, and understanding this helps your apps run more efficiently. (http://blog.jcole.us/2012/04/16/a-brief-update-on-numa-and-m...)
That's another way of saying "I always overprovision my servers for their worst case scenario instead of their expected state". Not that there's anything inherently wrong with that, as long as the person signing the checks understands and agrees with the decision you've made for them.
Under what scenario will a server start swapping due to load, only to continue performing as it should? It's hard for me to come up with a real world scenario where performance doesn't drop from peak, leaving a backlog of requests rapidly growing, which then compounds the problem.
Either way you handle it, the server is toast. So the real solution is to start dropping requests or downgrading service in some way to make sure the server never reaches OOM.
It definitely is important that the person signing the cheques knows how much capacity they're paying for, but that goes two ways. If you underprovision, they must understand that if they ever get mentioned in the NYT, their site will likely go down. It's up to them to balance the risks.
Take something as simple as SSHD, NTPD, CROND or any other of the many background daemons that get used once or twice a day. They only need a few pages of themselves in active memory to sit waiting on a socket connection or on a timer. The rest can be paged out to make room for the actual application you're using the server for. It doesn't even matter if these services take a few milliseconds to get paged back in when needed, since things like SSHD are working over the network (where a round-trip is often as slow as, or slower than, a page-in from disk).
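You can watch this happening on a live box: /proc/<pid>/status reports how much of a process currently lives in swap:

    # how much of sshd has been paged out right now
    grep VmSwap /proc/$(pgrep -o sshd)/status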
There's basically no reason not to have at least a small amount of swap. There's never been a benchmark showing no-swap as faster.
It should also be mentioned that things like a writable mmap()+MAP_PRIVATE on a file can flat out fail if the file is larger than free_memory+swap, since the kernel must be able to back a private copy of every page you might dirty. It's a common, easy and fast way to work with large files where the portion of the file you're working on gets paged in and out as required. Turn off swap and you break this functionality for large files, as the sketch below shows. You're basically limiting what you can do on your system with no performance gains.
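A sketch of the pattern in C ('bigfile' is a stand-in name; error handling kept minimal). The private mapping is copy-on-write: pages are read in lazily, and any page you dirty becomes anonymous memory, which is exactly what swap exists to back:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("bigfile", O_RDONLY);   /* hypothetical large file */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* MAP_PRIVATE + PROT_WRITE is allowed on a read-only fd; writes go
           to private copy-on-write pages, never back to the file. */
        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] ^= 1;        /* dirty one page: it is now anonymous memory */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }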