Do we really need swap on modern systems? (redhat.com)
291 points by omnibrain on Feb 23, 2017 | 249 comments



My personal rules of thumb for Linux systems. YMMV.

* If you need a low-latency server or workstation and all of your processes are killable (i.e. they can be easily/automatically restarted without data loss): disable swap.

* If you need a low-latency server or workstation and some of your processes are not killable (e.g. databases): enable swap and set vm.swappiness to 0.

* SSD-backed desktops and other servers and workstations: enable swap and set vm.swappiness to 1 (for NAND flash longevity).

* Disk-backed desktops and other servers and workstations: accept the system/distro defaults, typically swap enabled with vm.swappiness set to 60. You can and likely should lower vm.swappiness to 10 or so if you have a ton of RAM relative to your workload. (A sketch for applying these settings follows this list.)

* If your server or workstation has a mix of killable and non-killable processes, use oom_score_adj to protect the non-killable processes.

* Monitor systems for swap (page-out) activity.
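
As a minimal sketch, on a distro with /etc/sysctl.d (the value 10 is illustrative -- pick one per the rules above):

  # sysctl -w vm.swappiness=10
  # echo "vm.swappiness = 10" > /etc/sysctl.d/99-swappiness.conf

The first command takes effect immediately; the second persists across reboots.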


For the curious (I was):

* vm.swappiness = 0 The kernel will swap only to avoid an out-of-memory condition, when free memory falls below the vm.min_free_kbytes limit.

* vm.swappiness = 1 Minimum amount of swapping without disabling it entirely.

* vm.swappiness = 60 The default value.

* vm.swappiness = 100 The kernel will swap aggressively.

https://en.wikipedia.org/wiki/Swappiness


> vm.swappiness = 0 The kernel will swap only to avoid an out-of-memory condition, when free memory falls below the vm.min_free_kbytes limit.

This is not the case.

It used to be the case, but this changed in kernel version 3.5-rc1 (circa 2012).

There was a discussion about this on HN a few weeks ago: https://news.ycombinator.com/item?id=13511086

And there's a blog post on the percona website about how this rather bizarre change bit them: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swap...

I call it bizarre because (as I wrote in that other HN thread) a) it changed the behaviour of lots of production systems in a surprising way, and b) if you want to ensure your processes never swap you already had the option to not have a swap file or partition.


If you are on the experimental side:

There is also zram (a compressed swap device kept entirely in memory, using lz4/lzo) and zswap (a compressed in-memory cache for swap pages before they hit disk, which needs a real swap device behind it but compresses pages beforehand).

I run zswap on my desktop and on a few servers; it gives you some more time before the OOM killer comes, and the system stays responsive a bit longer.

zram is a nice idea but quite a beast in practice (at least on MIPS with 32 MB RAM): sys time constantly at 100% when you actually need it, and other quirks. Maybe it got better or I did something wrong.

But if you need an in-memory compressed block device it's pretty great - you can just format it with ext4 and have an lz4-compressed tmpfs.
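
A minimal sketch of that use, as root (the size, algorithm, and mount point are illustrative):

  # modprobe zram num_devices=1
  # echo lz4 > /sys/block/zram0/comp_algorithm
  # echo 2G > /sys/block/zram0/disksize
  # mkfs.ext4 /dev/zram0
  # mount /dev/zram0 /mnt/scratch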


From what I understand, zram results in LRU cache inversion whereas zswap does not (as it intercepts calls to the kernel frontswap API). Although, if you have a workload that would benefit from MRU then I guess this is just a bonus :)

Zswap maintains default kernel memory allocation behaviour, with the tradeoff that it needs a backing swap device to push old pages out to (which is why zram tends to be used more often in embedded devices that only have a single volatile memory store, or devices with limited non-volatile storage).


I use zram rather than a regular swap partition on all my laptops (because I'd rather not swap on SSDs) and desktops (same reason and/or there is an absurd amount of RAM to begin with). I also hear that most chromebooks use zram too (you really don't want to be swapping on that eMMC memory).

I set it up with one zram device per CPU core for a total space of ~20% available RAM.

No performance issues w/ zram so far so I haven't felt the need to change the compression algorithm.


zram has worked fine on my chromebooks. This is with running multiple chroots - and I have hit the oom killer a number of times (when even zram swap wasn't enough).

Until you actually run out of memory, zram seems very much a set-and-forget type of thing. No babysitting required.

tl;dr: it does what it says on the tin, and ... with minimal CPU impact.


First I've heard of either. How would I set these up?


You can setup zram like this. Typically you'll want to make a service for it since it needs to run on every boot.

  # modprobe zram num_devices=1
  # echo 1G > /sys/block/zram0/disksize
  # mkswap -L zram0 /dev/zram0
  # swapon -p 100 /dev/zram0
Official documentation here: https://www.kernel.org/doc/Documentation/blockdev/zram.txt


Wasn't there a Debian/Ubuntu thing recently where vm.swappiness = 0 had a behavior change which increased the number of incidents of the OOM killer stomping on things like database processes?

(Maybe it wasn't so new... https://www.percona.com/blog/2014/04/28/oom-relation-vm-swap...)


Thank you for sharing this. There's an interesting conversation thread in the comments on that post. It's a little over my head, but my takeaway is that with the kernel change, in an OOM event, MySQL is unable to be swapped out due to the type(s) of memory pages it's using, so the kernel is forced to kill it (or itself). In practice, it's relatively straightforward to tune MySQL/MariaDB for a certain memory allocation, and if it's on a shared host, oom_score_adj can be set to protect it.


Can you not protect processes from the OOM killer? This is trivial and very useful on FreeBSD:

https://www.freebsd.org/cgi/man.cgi?query=protect&sektion=1


Yes, with oom_score_adj[0], which I've mentioned several times. Setting it to -1000 for a process protects it from OOM killing.

0. http://man7.org/linux/man-pages/man5/proc.5.html
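
On Linux the raw interface is a write to /proc; a sketch, as root (mysqld is just an illustrative process name):

  # echo -1000 > /proc/$(pidof -s mysqld)/oom_score_adj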


That looks painful to use. Do Linux distros let you automatically protect services? E.g., on FreeBSD:

  mysql_enable="YES"
  mysql_oomprotect="YES"

Now every time you start the MySQL service it's automatically protected


Define "automatically" and "services" first ;-) Normally you just set it in the systemd's unit file for each daemon you want to adjust. So for some definitions of the above the answer is yes. (OOMScoreAdjust in https://www.freedesktop.org/software/systemd/man/systemd.exe...)


Personally, I use one tool for both FreeBSD and Linux. Picking up the (imported) rc.conf variable for a service is a mere matter of

    oom-kill-protect fromenv
And the conversion from something like OOMScoreAdjust is quite straightforward. A PostgreSQL systemd unit file that reads OOMScoreAdjust=-625 becomes a run program that contains

    oom-kill-protect -- -625
* http://marc.info/?l=freebsd-hackers&m=145425153624976&w=2

* http://jdebp.eu./Softwares/nosh/guide/oom-kill-protect.html


Cool, thanks for sharing. I would hate to have to muck around in /proc manually to set this.


> * SSD-backed desktops and other servers and workstations: enable swap and set vm.swappiness to 1 (for NAND flash longevity).

Is this that big of a worry? I have a 5-year-old SSD in my daily driver laptop, on OS X, which loooves to swap out anything it can to gain memory for disk cache, and I'm still barely 15% into the SSD wearout.


How big is the OSX/macOS swap? It's a file and not a partition, right?


It uses a dynamically-sized swap file rather than a dedicated partition.


To elaborate, it uses a series of dynamically-sized swap files (something like 256 MB, then adding a 512 MB file, then 1 GB, then 2 GB, etc)


Nope, just a rule of thumb. :-)


Also... Linux. Not macOS.


Swapping should have disappeared years ago. At best, it gives the effect of twice as much memory, in exchange for much slower speed. It was invented when memory cost a million dollars a megabyte. Costs have declined since then. How much does doubling the memory cost today?

What seems to keep swap alive is that asking for more memory ("malloc") is a request that can't be refused. Very few application programs handle an out of memory condition well. Many modern languages don't handle it at all. Nor is it customary to check for a "memory tight" condition and have programs restrain themselves, perhaps by starting fewer tasks in parallel, opening fewer connections, keeping fewer browser tabs in memory, or something similar.

I've used QNX, the real-time OS, as a desktop system. It doesn't swap. This makes for very consistent performance. Real-time programs are usually written to be aware of their memory limits.

Most mobile devices don't swap. So, in that sense, swapping is on the way out.


> Nor is it customary to check for a "memory tight" condition and have programs restrain themselves, perhaps by starting fewer tasks in parallel, opening fewer connections, keeping fewer browser tabs in memory, or something similar.

These aren't mutually exclusive and are actually complementary with swap.

If you have more than enough memory then swap is unused and therefore harmless. The question is, what do you do when you run out? Making the system run slower is almost always better than killing processes at random.

And it gives processes more time to react to a low memory notification before low turns into none and the killing begins, because it's fine for "low memory" to mean low physical memory rather than low virtual memory.

It also does the same thing for the user. "Hmm, my system is running slow, maybe I should close some of these 917 browser tabs" is clearly better than having the OS kill the browser and then kill it again if you try to restore the previous session.


> Making the system run slower is almost always better than killing processes at random.

In practice, heavy swapping (back and forth) makes it impossible to even kill the culprit manually (because I can't open an xterm or whatever), while there is often no benefit to having the processes continue running that slowly.

Also, ideally programs should be written with the assumption that the machine could go down at any instant. Having a few more cases where the program is killed will have the effect that the program is better tested and debugged.


I cannot remember a single occasion where my desktop recovered once it started swapping. Every time, the whole system locks up and I need to reboot. Thus, better to kill some random processes instead of all of them.


Always, really? Perhaps I'm lucky but this happens quite frequently with my system (dev workstation, so browser with lots of tabs, IDE, my own app/server stuff, other "power/mem-hungry" dev tools...), and I always manage to keep it sane/healthy:

- notice that the system starts swapping (if you do not monitor that, to me it sounds as careless as driving on the highway in 2nd gear and ignoring the engine noise -- ideally the OS could proactively help here, but unfortunately I don't know a good "automated" tool)

- find out which process/app uses the most memory (Linux can even tell you which ones use the most swap space [1] -- a sketch follows below)

- decide which one you want to (gently|forcefully)-(quit|restart|whatever). Exercise judgment.

[1] http://stackoverflow.com/questions/479953/how-to-find-out-wh...
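
The approach in [1] boils down to reading the VmSwap field from /proc; a rough sketch:

  for f in /proc/[0-9]*/status; do
      awk '/^(Name|VmSwap)/ {printf "%s ", $2} END {print ""}' "$f"
  done 2>/dev/null | sort -k 2 -n -r | head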


> I cannot remember a single occasion, where my desktop recovered when it started swapping.

...which operating system is that?


Ubuntu


Sounds to me like your swap is not swapon'd. I get the same behaviour when I'm not running swap and memory is depleted.


Swap space is only partially related to virtual memory overcommit, and virtual memory overcommit is extremely common and almost unavoidable on most Unix machines. Part of this is a product of a deliberate trade-off in libraries between virtual address space and speed (for example, internally rounding up memory allocation sizes to powers of two), and part of this is due to Unix features that mean a process's theoretical peak RAM usage is often much higher than it will ever be in reality.

(For example, if a process forks, a great deal of memory is shared between the parent and child. In theory one process could dirty all of their writeable pages, forcing the kernel to allocate a second copy of each page. In practice, almost no process that forks will do that and reserving RAM (or swap) for that eventuality would require you to run significantly oversized systems.)


Plus mobile apps do get, and usually handle, a low-memory notification from the OS.


On iOS, too many low-memory warnings in a set amount of time (Apple won't tell developers how many, or in what time frame, in order to prevent them from gaming the system) will result in your app getting killed.


Until Apple stops soldering on memory, swap will still be alive on the desktop.


Years ago, about 80% of desktop machines were never opened during their life. It's probably higher today.


... for a small fraction of users.


Memory allocation is a non-market operation on (most? all?) operating systems. There's effectively no cost to processes allocating memory, and a fair cost to them not doing so.

I'm not sure whether turning this into a market-analogous operation (bidding some ... other scarce resource -- say, killability?) would make the situation better or worse. And the problem ultimately resides with developers. But as a thought experiment this might be an interesting place to go.


This idea was implemented in EROS, and we've been exploring it for Robigalia as well. Storage is a finite resource which can be transferred between processes, including an "auction" mechanism which allows two processes to examine a trade before agreeing to it.


Doesn't this already exist for processor scheduling?


There's a weighting in many such systems, but ultimately it's still just a queue, usually a FIFO one.

Niceness allows for higher-priority processes to preempt others, but doesn't address the problem of an overwhelmed queue.

And processor scheduling isn't memory allocation. Time is ultimately some percentage of wall-clock (and/or overcommitment). Memory is ... different.

There's also the question of such stuff as garbage collection and the scheduling of that. I had the opportunity to do some JVM tuning "ergonomics" (horrible name) a few years back. Turns out that you get far better behaviour in most cases by decreasing the sweep frequency and increasing the allocation chunks (terminology is escaping me), due to the fact that natural attrition deallocates memory, and running sweeps too frequently simply chews up massive amounts of CPU time with no return in freed memory.

We also identified processes which genuinely did require very large memory allocations, and allocated hardware specific to those.

Specific workflow and process understanding (always idiosyncratic to a particular work assignment) was necessary, and took time to acquire.


I hate swap. My experience with it is that once a disk-backed machine (as opposed to SSD) has started swapping, it's essentially unusable until you manually force all anonymous pages to be paged in by turning off swap ("sudo swapoff -a" on Linux) or reboot.

My hunch is that the OS is swapping stuff back in stupidly. Once memory is available, I'd like it to page everything back proactively, preferring stuff from swap and then from file-backed mmaps. But instead it seems to be purely reactive, each major page fault requiring a disk seek to page in what's needed with little if any readahead. Basically the whole VM space remains a minefield until you stumble over and detonate each mine in your normal operation. Much better to reboot and have a usable system again.

On my Linux systems, I've turned off swap.

On OS X...last I checked, I wasn't able to find a way to do this. I'd like to turn off swap entirely, or failing that, have some equivalent way to force all of swap to be paged in now so I don't have to reboot when I hit swap. Anyone know of a way?


> My experience with it is that once a disk-backed machine (as opposed to SSD) has started swapping, it's essentially unusable until you manually force all anonymous pages to be paged in by turning off swap ("sudo swapoff -a" on Linux) or reboot.

That depends. If your workload exceeds the amount of available memory, you will start "thrashing" the disk, and that can make a system unresponsive.

If you happen to launch a large application, or start working with a big file, unused pages will be evicted to disk to make room and, after some slowdown, the system should become perfectly usable again. YMMV

On OSX, I don't know a way, but I can't recall the last time I had to reboot due to RAM/swap issues, even when I was developing apps on a 4GB Macbook Air. I guess memory compression, which is enabled by default, helps here. Most OSX systems have very fast SSDs as well.


> If you happen to launch a large application, or start working with a big file, unused pages will be evicted to disk to make room and, after some slowdown, the system should become perfectly usable again. YMMV

What is an unused page? One that the foreground, memory-hungry application doesn't need? Okay, fine, but what happens when you switch back to some other application? My experience is that it needs the RAM that was paged out, and it doesn't get paged back in all at once. Every time you hit some 4 KiB of memory that happens to be paged out, you wait another 10 ms. I don't know how much beyond the 4 KiB gets paged in at the same time. Worst-case, there's no read-ahead at all. Let's say the application is using 1 GiB of RAM. Then this can happen 262,144 times, which means 44 minutes of waiting in small bursts as you're trying to use it, rather than the 10 seconds (at 100 MB/s) it'd take to read it all in one go. That's what I mean when I say the machine is unusable.


This is my experience too, and this thread has motivated me to "swapoff" all my desktop systems today, I think. There's no situation in which swap usage for me has ever not led to a reboot, due to the system going unresponsive the moment it starts swapping.


I will posit that your swap is too big then; I have 16GB of RAM and run with 1-2GB of swap. A runaway process hits the OOM killer in about 30s.


To be fair, this is probably ZFS-on-Linux having a bad swap interaction. But I have found that once I get to 95% RAM use, I start hitting the swap fairly frequently and as a side-effect my desktop starts stuttering (usually happens if I kick off an 8-core code compile).

The other culprit I suspect is too many processes potentially blocking Cinnamon's main thread, but I've never figured out a decent way to go after the problem.


Oh, ZoL is terrible with swap; I get OOM all the time when my ARC is still like 4GB.


On my 2015 MBP with 8GB & SSD, I am often stuck for 10-15 minutes unable to do anything while thrashing. And I am someone who has Activity Monitor handy. I do not have this on my much older and weaker Ubuntu X220s doing the same type of development. Not sure why that is.


If it's 3rd party SSD, have you enabled TRIM? I had to do that for my old Mac Mini, made big difference. (2015 MBP of course has factory-installed SSD, but maybe this helps someone else.)

http://osxdaily.com/2015/10/29/use-trimforce-trim-ssd-mac-os...


It's all original Apple hardware.


It can be Spotlight reindexing, or Time Machine creating a backup into /Volumes/MobileBackups (done once per hour). Especially taxing if you have lots of files on your disk.


It's kind of annoying that these processes aren't niced and ioniced (or the macOS equivalent). Especially when typing a password.


Look for hung mds (Spotlight indexer) threads. They can get hung up and fail ungracefully with some files.

I had a client whose app generated PDFs that would cause this to happen.


Memory (8GB) is just suddenly jumping to red, usually by some Safari tab (Gmail etc.) which suddenly jumps to 1GB. And that's it. No indexing or mds processes; just browser tabs which suddenly jump over a point and lock up everything because of memory use. It's like I'm back in the early '90s, when I first touched a non-Amiga and non-SunOS system and asked how people can work with that 'Windows/DOS stuff'. I am not sure why it misbehaves so much though...


Usually when people say 'swapping' they mean page faulting. It's nothing more than a slight annoyance on a single-user machine if you swap for 10 seconds, but on a busy server you are dead in the water.


Take this with a grain of salt since this anecdote is a few years old, from before I upgraded to an SSD to host my swap, and this is on Windows.

I'll always remember when I used to load a large piano instrument in a VST DAW on Windows 7, taking about 3-4GB out of 12GB of RAM. It played perfectly fine, but if I left the application open, invariably on the next day I'd get a barrage of audio dropouts when pressing any new piano key. One trick was to put my arms on the entire keyboard a few times to force the swapped pages back into memory. Another trick, which I ended up relying upon despite occasional low-memory warnings, was to disable the pagefile entirely - that sure fixed the problem.

I'm not sure how/if things have improved since with Windows 10 and SSDs, but I always felt there was something wrong with the algorithms, since even with GBs of memory free at all times, old memory content would tend to end up on disk, without any good reason I could see.

I assume the OS used time to prioritize various caching/pre-fetching techniques over actual application data, and/or once paged, never preemptively loaded data back to RAM even if plenty of memory was available.


"My experience with it is that once a disk-backed machine (as opposed to SSD) has started swapping, it's essentially unusable"

The OS should start swapping very early to avoid bursts of disk I/O rendering the system unusable. On Linux this is somewhat configurable, even if not user-friendly: a combination of swappiness and vfs_cache_pressure can turn it into a usable machine, taking care of inefficient memory usage, memory leaks, unnecessary vfs cache, etc.
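
A sketch of that combination (the values are illustrative, not recommendations):

  # sysctl -w vm.swappiness=10
  # sysctl -w vm.vfs_cache_pressure=50

Lower swappiness makes the kernel prefer reclaiming page cache over swapping anonymous pages out; vfs_cache_pressure below 100 makes it hold on to dentry/inode caches longer.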


> The OS should start swapping very early to avoid bursts of disk I/O rendering the system unusable.

I think you're talking about the I/O of paging dirty things out, but I'm talking about the fact that some memory location is no longer present in RAM, so accessing it will take 10 ms or more to page in.

The system is not only useless while actively swapping. It's useless after it has ever swapped, and you can only recover by disabling swap ("sudo swapoff -a") or rebooting.


Contrarian anecdote: I've recovered from swapping a few times on a desktop machine with an HDD (typical scenario: an ON clause was omitted from a join and Postgres is doing a cartesian join between two tables). I didn't find things unusable instantly; I was able to recover by SIGTERMing the relevant process and then running `swapoff --all` while maybe going for a tea break. YMMV.


On OSX it used to be that you had to disable the pager daemon, then remove the swap file. You probably need to disable System Integrity Protection for it to work, though.

  sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.dynamic_pager.plist

Then:

  sudo rm /var/vm/swapfile*

Disclaimer: haven't tried doing it this way on Sierra.


Something seems to be seriously wrong with the swap implementation on modern systems.

20 years ago on Windows 98 it just started swapping, but it was no big deal. If something became too slow to be usable, you could just press ctrl+alt+del and kill that swapped program and everything worked fine afterwards.

On the other hand, on my modern Linux laptop, it starts swapping, and it swaps and swaps, and you can do nothing, not even move the mouse, till 30 minutes later something crashes.


> on Windows 98 it just started swapping, but it was no big deal.

At that time, swapping out a 4k page reclaimed a meaningful share of memory: 4k of 16 MiB is 1/4096 of memory. Each page swapped out got back a relatively large amount of the memory the program needed. Now swap still works in 4k pages, but memory has expanded a thousandfold. Basically, swap is a thousand times worse today than it was in the time of Windows 98.

For hard drives, swap isn't used now to expand memory; it's used to evict initialization code and other 'dead' memory. Swap should be set to only a tiny fraction of the memory size for this reason, to prevent it from being used to handle actual out-of-memory conditions. But realistically, for most users it's not even worth enabling at all, because of the occasional memory that needs to be swapped back in from disk.

For SSDs the seek speed has improved to match the extra memory so swap can still be used like in the old days to expand the effective memory size. But memory is so large a swap file that's a fraction of memory size to offload 'dead' memory is enough unless there's a specific reason to actually use swap for out-of-memory.


I have been using various operating systems for a while.

I feel like Linux has, in general, from a UX point of view, the worst behaviour when swapping and the worst behaviour in general under memory pressure.

I feel like it has gotten worse over time, which might not be just the kernel but the general desktop ecosystem. If you require much more memory to move the mouse or show the task manager equivalent, then the system will be much less responsive when it thrashes itself.

Honestly, I'd much rather have Linux just crash and reboot; that'd be faster than its thrashing tantrums.

Luckily, there's earlyoom, which just rampages the town quickly if memory pressure approaches. Like a reboot (ie. damage was done), just faster.

In any case, it makes me sad (in a bad way) to see how bad the state of things is when it comes to the basics of computing, like managing memory.


Not an excuse for bad implementations, but since I run i3wm, my feelings of happiness increased rapidly. To such an extent that I do not want to ever run anything else; stability, speed, memory use... It solves (for me) the issues you have.


i3 is magnificent. The same display seems 10x bigger when using i3. As true for netbooks as for big desktops. My old x120 dual boots win7, which is unusably slow and unstable on it. Arch with i3 is still snappy. Unless I'm running a web browser. Web browsers have gone insane.


Use Noscript for browsing on old or resource limited hardware. The problem is the amount of code running on modern websites.


I couldn't figure out how to scale i3 to the high DPI on my Yoga 900, with Wayland on F25.


If you're on Wayland, use Sway instead. It feels so much like i3 that I often forget it's not i3. Hidpi works pretty well: https://github.com/SirCmpwn/sway/issues/797. I use this on a Dell Precision with the 4k display.


Thanks!


Use xrandr --dpi 192 (or whichever value you’d like to use) before starting i3, i.e. typically in your ~/.Xsession.

i3 ≥ v4.13 will pick up this value from the Xft.dpi resource in ~/.Xresources as well, which is the more common way of configuring DPI.

edit: haven’t tested this within Xwayland, though. Note that i3 is only supported on X11.


What do you mean by "scale i3"? Just the text drawn by i3 or also the managed windows?


I meant scale the entirety of the interface. I'll try Sway, as apparently i3 and Wayland isn't really a supported combination.


> Web browsers have gone insane.

Yep. I bought a second computer for full browsers. One for dev, another for 'full' browsing (Javascript on) and on my i3 dev machine, I only have NoScript browsing on for dev stuff.


That's what happens when you run everything through 100 layers of abstraction. Windows, for better or for worse, runs most things closer to the metal.


Because Windows 98 always kept enough resources available to show you the c-a-d dialog. On Linux, however, there is no "the shell must remain interactive at all times" requirement, so a daemon that gobbles memory and your rescue shell have the exact same priority. Modern Windows even has a graphics card watchdog and if any application issues a command to the GPU that takes too long, it's suspended and the user is asked if it should be killed. Probably not what you want on an HPC that does deep learning, but exactly what you want on an interactive desktop.

I suppose it might be possible to whip something up with cgroups and policy that will keep the VT, bash, X and a few select programs always resident in memory and give them ultimate I/O priority, but I haven't tried.


This is the exact opposite of my experience. Back in the Windows 9x days it was a fairly routine experience for the system to soft-lock with the HD grinding away and I'd sometimes end up just hard rebooting the computer after waiting a few minutes for the ctrl-alt-delete dialog to appear. On macOS with a SSD I don't even notice when my system is swapping heavily.


Isn't this related to this change on kernel 4.10? https://kernelnewbies.org/Linux_4.10#head-f6ecae920c0660b7f4...


Possibly, however since the writeback behavior is configurable I expect you could test that thesis by changing the aggressiveness of the writeback draining.


Could this be a reflection of the increasing gulf between RAM speed and HD speed? Even with NVMe drives, which one probably shouldn't be swapping to anyway, RAM is orders of magnitude faster.


I think, among other things, it has to do with the size of the swap space relative to the speed of the swap device. IME high disk i/o combined with large swap space means swap never fills up and the OOM killer doesn't kick in. On systems with less RAM and swap, OOM conditions were hit much sooner, even with slower disks.

Default settings for dirty ratio and dirty background ratio exacerbate the issue: more data is held onto before it is written, and once the dirty ratio is hit, any application writing to disk will block.
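
For reference, the knobs in question (the values shown are common defaults, but they vary by distro):

  # sysctl vm.dirty_background_ratio
  vm.dirty_background_ratio = 10
  # sysctl vm.dirty_ratio
  vm.dirty_ratio = 20

The first is the percentage of RAM that may be dirty before background writeback starts; the second is the point at which writing processes block.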


With SSDs, disk is not that slow.


SSDs are only ~4x faster than magnetic disks, last I checked. If RAM is 100ns per access, and HD access is down from, say, 1ms to 0.25ms, that's still a huge gap. 4x isn't even an order of magnitude.

EDIT: see comment below for more accurate numbers.


From the article:

>A typical reference to RAM is in the area of 100ns, accessing data on a SSD 150μs (so 1500 times of the RAM) and accessing data on a rotating disk 10ms (so 100.000 times the RAM).


reminded me of this...

Latency Numbers Every Programmer Should Know

https://gist.github.com/jboner/2841832


Thank you for the correction. I should have read more carefully. Still, we're talking 3 orders of magnitude for SSD vs RAM.


0. Possibly not true in all cases.

1. Modern systems are much more aggressive about enormous disk caches, which can ironically lead to IO storms when the OS swaps out your application to buffer writes, then has to flush the cache to swap the app back in.

2. Difference in working set size and number of background programs waking up.


I think that's more related to Linux and its prioritization of IO than anything else. Note that the latest kernel release (4.10) contains an IO throttle that should improve this experience.

https://kernelnewbies.org/Linux_4.10#head-f6ecae920c0660b7f4...


I feel you. X and some recovery-critical software should have their own reserved memory cgroup with some guaranteed, safe amount of physical memory and 0 swappiness. I speculate that on Windows this works so well because most of this stuff is in kernel space anyway.


If you have an SSD, try setting vm.swappiness to 1 (not 0).


Just type

  sudo swapoff -a
  sudo swapon -a


Can't type while it is thrashing. Otherwise the offending program could just be killed.


What I've always been specifically confused about is whether there's any point in giving a VM a swap partition inside its virtual disk, rather than just giving it a lot of regular virtual memory (even overcommitting compared to the host's amount of memory) and then letting the host swap out some of that RAM to its swap partition.

Personally, I've never given VMs swap. I'd rather have memory pressure trigger horizontal scaling (or perhaps vertical rescaling, for things like DBMS nodes) than let individual VMs struggle along under overloaded/degraded conditions.


Generally yes. In fact, this is why "balloon" drivers exist, to allow the host to create backpressure and make the guest swap. The guest knows more about which pages are interesting than the host. If you make the host do the swapping, it will pick silly things, like the guest's disk cache, to write to swap.


For clarification to other readers, "Generally yes" was the reply to the originally posed question, which means the above comment actually disagrees with the suggested solution. (I had to read both a few times to get this straight.)


Ah, this is a great idea. It'd also be easier to understand and see service degradation (ie. physical memory being used on the host) directly from something like vCenter instead of relying upon Solarwinds to tell me the host is out of memory.


But does the host actually know what is an appropriate thing to swap? It doesn't know what is contained in that chunk of memory it just swapped. Although ideally, you would just build the system to contain enough memory for each VM so they can each run at full capacity along with whatever else overhead it may need for your hypervisor. You wouldn't want the host swapping out anything related to your VMs because it's just going to kill any performance of the affected VM. Give each VM its own swap space and let the guest figure out what needs to be swapped.


One use of swap on modern systems: hibernation. If you need hibernation, a swap space must exist, either as a swapfile (pre-allocated, as uswsusp requires a fixed offset on the disk to resume) or as a partition.


I've been reading these stories for ten years. About 8 years ago I started taking them seriously and stopped using swap. Turns out not having swap works much better. I'm amazed how slowly the consensus seems to be moving though.


Systems are used for vastly different purposes. With different memory usages and expected operation.

There can be no consensus because there is no one answer.


We reached this same conclusion for our servers generally. The problem with swap is that it's unpredictable. It's better most of the time to have a system that's predictable. However much RAM is available to the system, you can deal with that, by making an appropriate choice of hardware type, or by scaling up, tuning software, etc. It's harder to deal with performance problems related to use of swap in my experience, since it's nondeterministic what will be swapped.


Yeah. I've had issues with this on some systems.

On Windows without swap, when you get even remotely low on RAM, things start going really poorly for some reason - random latency. So even with 16 GB of RAM I couldn't disable swap on Windows without some really strange performance characteristics. I run SSDs, so I really wanted it off, and I just stuffed more RAM in my box - with 32 GB it isn't a problem.

On Linux, however, you can pretty much turn it off and everything will run smoothly until you're actually out; then you lag badly briefly, Linux's oom-killer does its thing, and all is good again within the span of a few seconds.


I've noticed the same thing, Windows just becomes bizarrely cranky if you disable swap entirely. My solution was to instead leave it on, but limit it to just a couple of megabytes. That seems to avoid the VM subsystem freakouts thus far.


Sadly, trying to investigate this is quite hard, since people are outright hostile to questions about it.

If you ASK about swapping on Windows, you get people telling you that "Microsoft engineers are smart, don't disable swap and go <insert expletive here>", even if you asked something that is NOT about disabling swap.

So, I had this gamer laptop: i7, nVidia GPU, 8GB of RAM (when most machines had 2 or 4), but a stupidly slow 5k RPM HDD made for power saving and locked in a "noiseless mode", thus very slow seeks too (i.e. it moves the heads slowly to avoid making noise and for aerodynamic reasons).

I noticed that even right after booting up, RAM usage would jump to 6GB and the HDD would thrash endlessly and make the machine unusable... after some research I found some interesting posts by MS employees about it:

Windows can "preemptively" use swap, it will write on swap things it thinks you might need to swap out. Sounds good on paper.

Also, Windows has several caching systems, that will write to "RAM" random crap.

One day that was particularly bad, I noticed that when I booted, Windows would immediately attempt to copy to RAM a gigantic binary file (the sound files of a game I had played a lot recently). This caused thrashing due to reading the file; then it would attempt to load the other programs it had to, page them out immediately, and enter some crazy loop of thrashing the I/O forever... Every time I opened the task manager and looked at the graphs, disk I/O would be constantly maxed out at 100%...

Disabling the VM made the laptop behave better (despite all the bugs Windows has when you disable VM).

But what I really wanted was to change how the VM works... I wanted to keep the VM and the caching, but change the settings: for example, I would set it to NOT page out anything at all unless more than 80% of RAM was in use, and to never "cache" stuff unless the HDD was actually idle and a good amount of RAM was free. But sadly, it seems this can't be done; I got no useful answer on the Stack Exchange sites when I asked this (but I got a couple of personal messages and e-mails full of expletives in many places where I asked about it -- for some reason people get personally offended when the subject is virtual memory).


Java on Windows used to have a background service which touched the pages of the Java components to keep them in memory and make Java performance look better. This was active even if you hadn't run a Java program in weeks. OpenOffice once had a similar program. Enough things like that and you can't get anything done.


That program used to stall your startup something fierce too. It was really annoying.


Yeah, what you wanted to change is what Linux calls "swappiness", configurable in vm.swappiness. In Windows I can't find any such configuration option.


>One day that was particularly bad, I noticed that when I booted, Windows would immediately attempt to copy to RAM a gigantic binary file that was the sound files of a game I played a lot recently, this caused trashing due to reading the file, then, it would attempt to load other programs it had to

Oh, that's just SuperFetch; it's a service you can disable to reduce the idle thrashing a bit after the desktop has loaded.


On Windows, when you allocate, the OS guarantees it has the memory to fulfill the request at the time of the request. On Linux, no check is made until you try to use the memory.

Because of this, memory pressure will be higher on a Windows box. Paging helps paper over this, as the commit can be billed to the page file, not RAM. Windows is smart enough not to write anything to swap until you actually use the page, so in practice this is rarely a problem.

The benefit to this approach is you actually have a hope of recovering from OOM.


> On windows when you allocate it will gurantee it has the memory to fulfill the request at the time of the request. On Linux no check is made until you try to use the memory.

That's not true, you have to turn vm.overcommit_memory on in Linux for that to happen I believe. Which is off by default in most distros.


https://www.kernel.org/doc/Documentation/vm/overcommit-accou...

The default is to allow "sensible" overcommit, whatever that means. From my experience, whatever "sensible" is really is sensible, and I haven't had issues with it. You can also set it to allow all memory allocations, even "silly" ones (i.e. allocating 100GB on a system with 10GB RAM), or to refuse overcommitting memory.
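
Condensing that document: vm.overcommit_memory=0 is the heuristic "sensible" default, =1 always overcommits (even the "silly" allocations), and =2 refuses to overcommit beyond a ceiling controlled by vm.overcommit_ratio. As a sketch (the ratio value is illustrative):

  # sysctl -w vm.overcommit_memory=2
  # sysctl -w vm.overcommit_ratio=80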


Oh, interesting. Didn't know that. Thanks!


> Linux's oom-killer does its thing

Usually selecting sshd to kill, in my experience, rendering the server inaccessible.


Protect that service against the OOM killer; how to do so is mentioned elsewhere in this thread.


Two examples of why I have swap:

* On a laptop to hibernate, which results in zero power consumption vs suspend which will drain the battery in a day or so

* I use tmpfs for /tmp and using swap as the backing is far more performant than regular filesystems
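
For the /tmp case, the usual fstab line looks something like this (the size cap is illustrative):

  tmpfs  /tmp  tmpfs  size=50%,mode=1777  0  0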


> On a laptop to hibernate, which results in zero power consumption vs suspend which will drain the battery in a day or so

Swap is not strictly needed for this:

(it boils down to vm.swappiness=1)

https://wiki.debian.org/Hibernation/Hibernate_Without_Swap_P...


You still need swap, just not a swap partition. I suppose we can debate if there's a meaningful difference between the two.


I have an encrypted swap partition. Hibernation works well with that. I don't believe it is possible with a swap file since the containing filesystem would also be mounted by the hibernation image.


> I don't believe it is possible with a swap file since the containing filesystem would also be mounted by the hibernation image.

For sure it is, even for swap files inside an encrypted volume.

https://vadim-kirilchuk-linux.blogspot.com/2013/05/swap-file...

The crux is knowing the swap-file offset and passing that argument as resume_offset on boot.

Have a look. It's totally doable and you won't need to have an actual partition.
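
Roughly, the procedure from that post looks like this (paths and size illustrative; see the link for the details):

  # dd if=/dev/zero of=/swapfile bs=1M count=4096
  # chmod 600 /swapfile
  # mkswap /swapfile && swapon /swapfile
  # filefrag -v /swapfile | head -n 4

The physical offset of the file's first extent, as reported by filefrag, is what you pass as resume_offset= on the kernel command line, alongside resume= pointing at the containing device.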


I never realised the swap header on files also provides enough information to locate the other blocks without going through the filesystem. Logical, I guess. This is, however, more work than the "works out of the box" swap partitions.


I could swear there was a way around that but I don't remember the details and can't find them.

Still, it's possible to hibernate without having the drawbacks people have been complaining about in this thread.


> * I use tmpfs for /tmp and using swap as the backing is far more performant than regular filesystems

This seems absurd. You're running an in-memory filesystem backed by memory-on-disk? You weren't comparing to a journalled filesystem or something like that?


Since I do use journalled filesystems for my real data, your comment implies I should create yet another partition for /tmp using a filesystem that optimises performance over durability/integrity, eg without journalling.

I consider files in /tmp to be temporary, and do not expect them to survive a reboot. (Actually I prefer they don't - less administration and housekeeping.) They also have random lifetimes ranging from fractions of a second to several days. And random sizes from zero length to gigabytes (eg making an ISO image).

With tmpfs, RAM is used, which provides the best performance since the filesystem is trivial. Memory pressure will cause swap to be used as needed. Files not accessed will end up in swap, taking up no RAM.

By far the fastest I/O is the I/O you don't have to perform.


> Since I do use journalled filesystems for my real data, your comment implies I should create yet another partition for /tmp using a filesystem that optimises performance over durability/integrity, eg without journalling.

If you're using a swap partition just for the sake of /tmp then it's the same difference, no?


> If you're using a swap partition just for the sake of /tmp then it's the same difference, no?

No. The big difference is that regular filesystems try to do I/O to their backing device - heck that is their point, and what they do the vast majority of the time. tmpfs does not do any I/O. However I/O will happen when there is memory pressure by the swapper, but that is going to be rarer.

ie with tmpfs, swap is a spillover mechanism. With a regular filesystem, the underlying device is the primary mechanism.

Swap can also be used for actual swap on the occasions it is helpful.


> No. The big difference is that regular filesystems try to do I/O to their backing device - heck that is their point, and what they do the vast majority of the time. tmpfs does not do any I/O. However I/O will happen when there is memory pressure by the swapper, but that is going to be rarer.

Yes and no - aren't they just two ways of looking at the same decision? Regular filesystems will buffer, and when the system is low on memory it will flush buffers, using similar criteria to deciding whether to swap.

> Swap can also be used for actual swap on the occasions it is helpful.

Swap enables actual swap, sure. My experience is that it usually hurts more than it helps though.


> ... Regular filesystems will buffer, and when the system is low on memory it will flush buffers ...

That is the bit you are missing. Unwritten filesystem data is regularly flushed; the flush interval is often around 5 seconds. Look up "pdflush" to get the gist, although things have changed since then. Same with laptop mode.

Quite simply if a file is created and lives for at least N seconds then there will be disk activity irrespective of memory pressure. N is 5, perhaps up to 30 seconds in normal use.

Even if the file contents aren't fully flushed, metadata is.
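
Those intervals are visible as sysctls, in centiseconds (the values shown are common defaults):

  # sysctl vm.dirty_writeback_centisecs
  vm.dirty_writeback_centisecs = 500
  # sysctl vm.dirty_expire_centisecs
  vm.dirty_expire_centisecs = 3000

The first is how often the flusher threads wake up (500 = 5 seconds); the second is the age at which dirty data must be written out (3000 = 30 seconds).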


Reminds me of a youngster who thought he could beat the system by putting the swap file on a ramdisk.


> I've been reading these stories for ten years. About 8 years ago I started taking them seriously and stopped using swap.

Not sure what you're referring to here. This story doesn't recommend eliminating swap...


"Systems without swap can make sense and are supported by Red Hat - just be sure the behaviour of such a system under memory pressure is what you want"

So, it doesn't exclusively recommend it, but it concedes that there are use cases where it makes sense.


The quote I included made it sound like he was referring to stories that advocate getting rid of swap.


Yours might have been interpreted as they advocated never getting rid of swap :)


Ditto, and over that period memory has become even cheaper.

I sort of wonder if we'll see a 100% RAM, large memory laptop soon that boots from an SD-card or in a cryptographically secure fashion over 4G wireless networks, aggressively disables RAM for power saving and suspends well.


Aren't there legacy applications which expect swap, whereas with modern applications swap isn't necessary? Or at least that is my current (mis)understanding...


This is by far my biggest pet peeve in the space: the "rule of thumb" that you need 2x RAM as swap. Even 10 years ago this "rule" was ancient and useless, but it was always a constant challenge educating customers as to why, and that yes - we really did know better than your uncle Rob.

Once a server hits swap, it's dead. There is no recovering it other than for exceptional cases. If you are swapping out, you've already lost the battle.

I tend to configure servers with 512MB to 1GB swap simply so the kernel can swap out a couple hundred MB of pages it never uses - but that's really more to make people feel better than it really being useful at all.


Rules of thumb involving more swap than RAM probably date from decades ago, when Unix virtual memory systems were sufficiently primitive that the total amount of virtual memory you could use was just your swap space, not swap space plus (most of) RAM.

(The limitation came about because the simple way to handle swapping is to assign every potentially swappable page of virtual memory a swap address when you allocate it in the kernel. Then the kernel always knows that there's space for the page if it ever needs to swap it out and you're never faced with a situation where you need to swap out a page but there's no swap space left.)


2x RAM as swap is clearly bad, but I like having around 512MB to 1GB (on systems of basically any size); when you do start using more ram than you have, it gives you some buffer (as long as you actually alert on it). If you have a small memory leak, you can recover; if you have a large memory leak, you're going to run out of swap pretty quick anyway.


I wish we had taken the path of EROS [0] rather than "RAM and DISK are separate". A lot of problems stem from that incompatible viewpoint of computing. Computer science is about hiding complexity under layers of abstraction that continually provide safer states and constraints for the things built on top of them. Our abstraction that RAM and DISK are separate is not safer, nor does it provide constraints that are simple to navigate. Thinking about this the other way, where DISK is all you need and memory is just a write-through cache, is much safer in my opinion and leads to some really cool application design.

If RAM and DISK are the same, then writing a file system is just writing an in-memory tree. No need to pull data from the disk; just navigate the tree in your program's memory and pull the blob data out. Want to persist across reboots, protect against power outages, or save user settings? Just set a variable and it'll be there.

The benefits far outweigh the costs.

[0] - https://web.archive.org/web/20031029002231/http://www.eros-o...


The AS/400 (or whatever they call it now) had an approach like that. Everything was on disk and RAM was just a cache of disk. That also meant every "object" had an address and could be accessed by any process with suitable permissions. There are lots of other things they do, with a very different approach than Unix, Windows etc.

Frank Soltis' book is recommended reading: https://www.amazon.com/dp/1882419669/


AS/400 is really an amazing system, in many ways still ahead of its time. Persistent, single-level storage and capability security are ideas that still have yet to catch on in the mainstream—even though more research gets poured into NVRAM every year.

It's a shame hardly anyone knows about it. Those things are a joy to use. You can get a free (limited, but still useful) AS/400 user account to play around with at http://pub400.com/. I really recommend it.

(Disclaimer: I'm slowly working on a system that resembles AS/400 in many ways, but optimized for analyzing and reporting on very large timeseries databases. It's intended for business applications that require a combination of scheduled reports and fast ad hoc analysis of big timeseries data, initially the oil & gas industry (which is where I work in my “real job”).)


The challenge with this is that abstracting away disk in a way that isn't horribly leaky is incredibly hard, as long as one of them lets us manipulate individual bits while the other requires us to write whole sectors.

Note that EROS is not providing a write-through cache. It's providing a write-back cache using checkpointing coupled with a journalling capability and ability to explicitly sync data.

So it's leaky: your application needs to know to structure its writes to memory so that they will make sense if the system comes back up with some of the data missing, and it needs to know how to use the journalling functionality.

It can't just act as if it's running in RAM forever.


I don't know where you get the idea that you can't just pretend you're running in RAM forever. If you look at the main goal of EROS, you can see that is the point.

Check out http://wiki.c2.com/?TransparentPersistence


From the EROS website, which specifically describes the checkpointing mechanism and gives a short description of the journalling support.

If you just pretend you're running in RAM, and the system crashes, you will lose the data between the crash point and the last checkpoint unless you have explicitly used the journalling mechanism. Often that is acceptable. E.g. since you're restoring the program at the same point in time, if the changes are entirely based on data that were in the system at the point of the last checkpoint, it will just redo the work to calculate the changes.

But if there are side-effects, that is often not going to be acceptable. E.g. database updates that the system has said were committed will suddenly disappear.

To solve that, EROS has a journalling mechanism to allow you to give guarantees about specific data that changes in between checkpoints, but that requires applications to explicitly use it to tell the OS what needs to be saved when so that the application can guarantee that a given piece of data has been durably recorded when it promises a client it has been recorded, and that the writes get correctly ordered.

That's a sensible compromise - if you do it right, it only needs to touch the "boundaries" where the system does IO.


> If you just pretend you're running in RAM, and the system crashes, you will lose the data between the crash point and the last checkpoint unless you have explicitly used the journalling mechanism

> But if there are side-effects, that is often not going to be acceptable. E.g. database updates that the system has said were committed will suddenly disappear.

Yes, not every system will be forever recoverable. You can definitely crash at just the right moment to ruin your year. I'd still like the other safety constraints that this provides, because I still think that even if N (where N is the number of threads available on the CPU) processes have a chance of being corrupted, we're still going to be saving the rest of the processes on the system that aren't currently transacting with one another.


I absolutely like many of the ideas behind EROS, including checkpointing, though I think it does have some issues. E.g. we often treat reboots as a "clear all state to recover from weird situations", and so we still need something like that.

The point is more that it doesn't mean you can just treat things like we do RAM now. It ends up being closer to how you'd work in apps that use mmap'd files to back persistent data.

The capabilities model was interesting too.

I think my biggest problem with a persistence model like that, though, is that I suspect it would encourage not thinking seriously about state, and ideally I'd prefer a system where state is minimized. E.g. compare EROS with its virtual opposite, Android, where apps might find themselves killed at any time. Some apps handle it poorly and go through lengthy initialisation processes again when restarted, but many maintain the illusion of being fully persistent to the extent that users can rarely tell if they've started from scratch or not.

I'd love to see more work into allowing the illusion of a persistent process with little developer effort, though. Perhaps OS-level per application checkpointing support that the application can have control over (allowing the app to control exactly what gets checkpointed, and when to ditch the checkpointed state to reinitialise). So "cleanup" can occur by restarting processes transparently for users while hopefully providing most of the benefits of persisting state.

Perhaps coupled with OS-level checkpointing of the information required to bring said "virtually-persistent" apps back up in the same state after a reboot.


You might want to investigate Mumps:

https://en.wikipedia.org/wiki/MUMPS

Setting data in memory is the same as setting data on disk, the only difference is the name of the variable:

  s X=1   ; store 1 in variable named X, in memory.
  s ^X=X  ; store 1 in variable named X, on disk.
  s X=^X  ; load disk to memory.


How is this different than just memory mapped files? I guess it happens a little more automatically, but it doesn't seem to really solve a major problem that I can see.


Have you ever lost power and lost data from a document you were editing? Has a server ever crashed in a datacenter, its data corrupted, and now your company has lost a few hundred thousand to a few million? Have you ever had to wait a long time for processes to start again after fixing a hardware failure?

These are all problems that have been solved on EROS-based systems. They used to do demos where they would set up a system and have someone start working on some code or a text document; they'd pull out the power plug of the system, plug it back in, and the user would be right where they left off. No data loss, no corruption, just back to work.

None of that was handled in user space. That was all opaque and you didn't have to worry about it at all.


How is that not slow as balls when you're hammering memory? Guaranteeing atomic writes to disk for every memory access would seem to be problematic from a performance perspective.


How is paging not slow as balls? When you're done changing your data, or your time quantum is up, you are paged out and saved. The only difference now is that if the system dies for some reason, you come back right where you were when paged off.


So it's not right where you left off, and the program state is unknown?

I'm guessing the system must effectively "checkpoint" your work regularly and sync to disk to avoid partially saving a state and corrupting the data.

This isn't terribly different from working on a memory mapped file except that it also saves the ephemeral state of the running program so it can be restored. But I still don't understand how it's not going to be horribly slow when you start your program and the first thing it does is allocate 4GB of memory for its workspace. Synching all of that data to disk is a massive undertaking, and this isn't an uncommon use case, people start virtual machines all of the time.

And paging is slow as balls. That's what this whole article is about, modern machines are unusable when they start paging.


I've read somewhere that for BeOS demos they used to play a bunch of videos and music, then unplug/replug the machine, and after boot everything was playing again from where it left off. I guess they were using the same design for process persistence.


That was just the media player remembering where it left off and restarting from that place. There may have been some metadata support in Be's filesystem to help that, but it's not technically necessary. It's about as amazing as your web browser reloading your tabs when you restart it.


If "never loose data" isn't a great selling point then I don't really know what is.


It's a trivial problem if you're willing to run your system entirely off of the disk. I mean the performance will be unbearably slow, but you'll never lose your data.


Memory mapped files are incredibly hard to use for consistent, durable storage. I mean, so is POSIX I/O in general, but if you do MAP_SHARED you made your life even more complicated. (MAP_PRIVATE and rewrite-the-whole-thing-for-every-commit works, though, and can have some advantages).


Is iOS a modern system? Because iOS does not have swap.

> Although OS X supports a backing store, iOS does not. In iPhone applications, read-only data that is already on the disk (such as code pages) is simply removed from memory and reloaded from disk as needed. Writable data is never removed from memory by the operating system. Instead, if the amount of free memory drops below a certain threshold, the system asks the running applications to free up memory voluntarily to make room for new data. Applications that fail to free up enough memory are terminated.

https://developer.apple.com/library/content/documentation/Pe...


My desktop at work has 16G of RAM. I didn't bother setting up swap, and I find the old guidance (2x RAM) pretty absurd at this point. I've had the OOM-killer render the system unresponsive a couple of times, but only because I'd written a program that was leaking memory and I was pushing it to misbehave. If you really want virtual memory on purpose, you can still set up a memory-mapped file for your big data structure.


Putting spinning-rust-backed swap on a 16G system is absurd. By the time such a system is into swap, it probably isn't trying to swap three or four megabytes, it's probably trying to swap three or four gigabytes, and that can literally take hours. Simply writing that much data to a hard drive can take a non-trivial amount of time, and swap doesn't generally just cleanly run out to the hard drive with nothing else interfering, it's a lot messier. Given the speeds of everything else involved, a 16GB RAM system trying to swap to a hard drive, even a good one to say nothing of those slow-writing SMR hard drives [1], is basically a system that has completely failed and it might as well just start OOM-killing things.

A system backed by an SSD does degrade more nicely, though. The system visibly slows down but doesn't go to outright unresponsive like it does on a hard drive. You can make a case for letting that happen and having human intervention select the processes to kill, rather than letting the kernel do it. So, even though it still isn't really useful as an extension of RAM, it can still be useful in recovering from systems that you've run yourself out of memory on. Since putting an SSD in my systems I've actually gone back to running with some swap space. Though the fact I like hibernation sometimes is also a reason I run with swap in Linux on my laptop.

[1]: Swap will almost certainly completely blow out the buffers on those things and you'll be stuck with the raw hardware write speeds pretty quickly.


> I've had the OOM-killer render the system unresponsive a couple of times

Use earlyoom instead of relying on oom-killer.

https://github.com/rfjakob/earlyoom

To quote from the description:

> The oom-killer generally has a bad reputation among Linux users. This may be part of the reason Linux invokes it only when it has absolutely no other choice. It will swap out the desktop environment, drop the whole page cache and empty every buffer before it will ultimately kill a process. At least that's what I think it will do. I have yet to be patient enough to wait for it.

[...]

> This made people wonder if the oom-killer could be configured to step in earlier: superuser.com , unix.stackexchange.com.

> As it turns out, no, it can't. At least using the in-kernel oom killer.

And earlyoom exists to provide a better alternative to the oom-killer in userspace, one that's much more aggressive about maintaining responsiveness.


I don't have swap either. On 8GB it is pretty annoying, because a program I use often will frequently overcommit, and the system hangs.

Is there any way to tell the OOM killer which program to kill first?


> Is there any way to tell the OOM killer which program to kill first?

The fun OOM analogy [1] that comes up when people propose different OOM killer designs:

> An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions however the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if for example the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused.

[1] https://lwn.net/Articles/104185/


>Is there any way to tell the OOM killer which program to kill first?

From TFA:

>Without swap, the system will call the OOM when the memory is exhausted. You can prioritize which processes get killed first in configuring oom_adj_score.

The linked solution document is only available to registered RH users, though, and the name is actually oom_score_adj and not oom_adj_score.

`man 5 proc` has details, but tl;dr is set /proc/<pid>/oom_score_adj to -1000 to make a process OOM-killer-invincible.
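For illustration, a minimal sketch of both directions (assumes a single target process findable via pidof; mysqld is just a placeholder name):

    # Make a process invisible to the OOM killer
    echo -1000 | sudo tee /proc/$(pidof mysqld)/oom_score_adj

    # Or make it the preferred victim instead
    echo 1000 | sudo tee /proc/$(pidof mysqld)/oom_score_adj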


From the article: Without swap, the system will call the OOM when the memory is exhausted. You can prioritize which processes get killed first in configuring oom_adj_score.


Use earlyoom: https://github.com/rfjakob/earlyoom

By default, it'll start killing processes when free memory drops below 10%, though you can configure the threshold. I had the same problem for years, and then I started using earlyoom and I don't have to deal with it anymore.
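For example, a sketch of a more aggressive threshold (per earlyoom's README, -m sets the available-memory percentage below which it starts killing):

    # Kill the largest process once available memory drops below 5%
    earlyoom -m 5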


Use cgroups or `ulimit -m`.
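A sketch of the cgroup route via systemd (the program name is a placeholder; MemoryLimit maps onto the cgroup memory controller):

    # Run a command in a transient scope capped at 2 GB of RAM
    systemd-run --scope -p MemoryLimit=2G ./my-memory-hungry-program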


My new workstation has 128 GB of RAM. It also has 1 GB of swap (on NVMe) that, AFAICT, has never been touched. I use it as sort of a canary that something abnormal is happening if it starts being used.


I haven't used swap in years, and more recently I've accompanied that by using earlyoom [0] to start killing processes when RAM usage rises above 90%.

Both changes have made my computers much more usable. Systems should be designed to fail fast when memory is low instead of slowing down.

[0] https://github.com/rfjakob/earlyoom


One of the things we did at Blekko was use swap as a 'soft' indicator that something on the system had exceeded its footprint (our machines all had 96GB of RAM, so swapping meant something was using too much of it), and OOM-killer messages in the log were grounds for taking the machine out, rebooting it, and looking for a more serious problem (like sometimes things rebooted and had 32GB less RAM).

That said, the article's recommendation was spot on in terms of making a conscious decision about how you want your system to behave when it's coming close to running out of memory. Large swap spaces were originally how you got things that were too big to fit in memory to run at all, and now they are a way to essentially batch-process very large data sets.


If Linux has no swap, it doesn't quickly and efficiently kill processes when memory is exhausted. Instead, it first evicts executable code from RAM and re-reads it from disk whenever it's needed, because without swap, executable code is the only thing in RAM that is duplicated on disk and can be dropped. This leaves the system completely frozen and unusable.


This is my experience too. I used to run my desktop without swap, but found that the out-of-memory experience was even worse than with swap. There also seems to be enough infrequently-used memory that swap gives a bit more headroom (I will still manage to use up 32GB of RAM).


Last time I tried running a Linux system with zero swap, I ran into huge issues.

It would never actually hit the OOM killer; instead it would just lock up while it still technically had a few hundred MB of memory free.

From what I can tell, it was stuck in a loop evicting something from cache and then immediately pulling it back in from disk. Everything was technically still running, but the UI wasn't responsive enough for me to even kill a program.

Simply adding 200MB of swap changed the behaviour enough that the OOM killer would eventually run.


I never understood the rule of thumb where swap space was proportional to the amount of physical RAM. It seems to me it should be the size of your largest expected allocation (system-wide) minus the amount of physical RAM, or something like that. If you had a nicely configured system and took out half the RAM, it doesn't make sense that you'd want less swap space.


A system that has way more swap than RAM will run out of 'performance is acceptable' way before it runs out of memory.

That was different in the early days, but that was because people accepted worse performance (GC that stops the world for seconds can be better than no GC, even when running a GUI).

Certainly nowadays, if you take out half the RAM, you will want to take out half the processes, too.


But you choose the amount of RAM depending on maximum memory usage. Therefore swap space (being proportional to RAM) becomes dependent on the largest expected allocation as well. It wouldn't be wise to build a system with 2GB of RAM and 4GB of swap when you need 6GB of memory at peaks: such a system would be slooow. And it may not be wise to buy 8GB of RAM when 5GB is the maximum that might be needed.


I think it made a lot more sense in the mid-90s, where a system would have 32 MB of RAM and people read the RAM requirements of the software they'd buy. So the size of your largest expected allocation was proportional to the size of RAM only because if you had more RAM you'd run more RAM-intensive software.

Now, desktops can have 32 GB of RAM, but everyone just uses it to run Chrome.


> Now, desktops can have 32 GB of RAM, but everyone just uses it to run Chrome.

.... which will happily chew away your 32 GB of RAM if you let it run for enough time :)


Chrome, Slack, Spotify and Atom, which each bundle their own Chrome. On my computer right now Chrome is using about 1.5 GB of RAM (on Darwin, much of it is compressed). Slack is using 0.75 GB of RAM, Spotify is using another 0.5.

We write very memory-hungry software to make up for our copious amount of RAM.


It makes supporting hibernate a lot easier - your system can just dump the contents of memory to the swap partition.


The rule made sense when your system had fewer than 128-256 MB of ram. These days it doesn't.


My feeling on swap is this:

1) If you're ok with one machine dropping out of your system, you don't need swap.

2) You should never build a system where losing a single machine is a problem.

3) Therefore, you should never need swap.

4) Perhaps there is an exception for desktop machines, since they don't fit rule 2.


Tend to agree.

A bit of a side ramble: unfortunately, regarding rule 2, sometimes you already have a system where losing a single machine is a problem, and it will take time and resources to improve or replace it to the point where losing a single machine isn't a problem, so "in the meantime" you have to accept and support this.

Also, sometimes "the meantime" is very long. :-(

Also, by the time the system is improved to be more resilient, maybe you'll be working somewhere else or on something else, and, presto, you'll uncover some other horrible legacy system in your dependency chain that isn't resilient either. It seems as if at every organization that has had computers for long enough, there is an infinite supply of legacy systems.

Point being: unless you only work with brand-new things that themselves only work with brand-new things, you can't get out of getting decent at managing services that aren't properly "any single machine can disappear" resilient.


Sure, dealing with legacy systems might mean messing with swap.

However, as pointed out elsewhere, if you're hitting swap your performance will be so bad you might as well have lost the machine.


Doesn't that risk cascading failures?

A cluster of a few machines experiences a bunch of requests that trigger pathological memory usage. One machine OOMs, drops out. Now the rest of the cluster has to take more load, needs more memory, and increases the likelihood that the other machines also run out of memory.


> A cluster of a few machines experiences a bunch of requests that trigger pathological memory usage. One machine OOMs, drops out. Now the rest of the cluster has to take more load, needs more memory, and increases the likelihood that the other machines also run out of memory.

A performance cliff (as you'd inevitably see while swapping) also puts you at risk of cascading failure. It might actually be better to completely drop out if the restart time is reasonably low. This is similar to GC thrashing with Java servers: many people prefer to configure their servers to suicide when GC time is over some threshold rather than try to go on as long as possible. I'm one of those people.

Better ways to avoid cascading failure are overprovisioning (RAM is pretty cheap for servers) and load shedding / graceful degradation at the application layer, coupled with care in client-side retry logic. (Avoiding accidental capacity caches, using exponential backoff on any retry.)


How do you hibernate with no swap? Do you need a special hibernation partition to write to?


The way I've done it is to create a swap file and set vm.swappiness to 0 so nothing actually gets paged out to it in normal operation. Hibernation forces the writes, so it will get used on hibernate.
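Roughly, that setup looks like this (sizes and paths are illustrative; the file must not be sparse, which is why fallocate is used):

    # Create and enable an 8 GB swap file
    sudo fallocate -l 8G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile

    # Discourage paging to it during normal operation
    sudo sysctl vm.swappiness=0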


The main issue I have with not using swap on modern Linux is that it can leave the kernel busy for hours at a time. What happens is that, as the kernel runs low on RAM, it has to spend more time searching for smaller and smaller chunks of RAM to back each request (the smaller chunks are more numerous), and the "kswapd" kernel thread is responsible for this activity. As the system approaches zero free RAM, kswapd will also try to release less important pages, which takes more CPU time. Ultimately you get to the point where allocations take a really long time, and there are lots of allocations.


I recommend using swap together with zswap, and increase swappiness. Zswap is available in mainline kernel. It keeps compressed "swapped-out" pages in memory (so they are accessible quickly on page fault) and only uncompressible pages go to disk. Usually most of memory is compressible and overhead is small, so it is suitable for many workloads. See https://wiki.archlinux.org/index.php/Zswap .
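Enabling it is a couple of sysfs writes (a sketch; the values are illustrative and the chosen compressor must be available in your kernel):

    # Turn zswap on and cap the compressed pool at 25% of RAM
    echo 1   | sudo tee /sys/module/zswap/parameters/enabled
    echo lz4 | sudo tee /sys/module/zswap/parameters/compressor
    echo 25  | sudo tee /sys/module/zswap/parameters/max_pool_percent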


Many applications request memory, do some writes, and then don't use it in typical scenarios, so this memory is effectively wasted. If there's swap, a smart operating system will swap that memory out and use the physical memory for more important tasks, e.g. for disk caching. So using swap allows for more efficient memory usage. E.g. on a small server with 21 days of uptime I have 102/1024MB of memory used and 41MB in swap, which means I got about 5% of my memory almost for free.


My Ubuntu development laptop has been running without swap since I bought it in early 2014. It's got 16 GB of RAM and sometimes hits 11 GB of used memory. No problems whatsoever. If I start hitting the memory limit, I'll buy another 16 GB. I've replaced the HDD with an SSD, but I don't understand why I should use it as swap like in the old days of RAM scarcity.


It's not uncommon for us to buy rack machines with much more RAM than disk. The disk is almost uninteresting, except that we need a place for an OS to boot from, and some other legacy things.

I suspect I would be fine with much of our datacenter being diskless (and put disk -- ahem, I mean storage -- where it is needed). Local disk is a headache more often than not.


I have a somehow different view on swap.

The issue is not swap or swap utilisation; the problem is worst-case latency. Even for a database, an OOM kill is usually better than a latency hit that makes it unusably slow.

As a simple example an app might start allocating and use memory in an infinite loop. How long will that take? How long will your system be unresponsive?

If you have more swap than you can write in 30 seconds, you're most likely doing it wrong (your system can be unresponsive for 60+ seconds).

Another worst case would be allocating all memory, using it, and then performing random reads throughout the memory space. Your swap-to-RAM ratio determines how many misses, and thus how much I/O, you are doing instead of direct memory access. This should stay well below your I/O capacity.

As a result I usually try to use a small swap partition and monitor for swap-ins, not swap usage or swap out.

So that's my thinking around swap mainly due to the fact that I have seen too many servers causing issues due to swap related latency.
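The monitoring part can be as simple as watching vmstat (its si/so columns are swap-ins and swap-outs per second; sustained nonzero si is the red flag described above):

    # Print memory and swap activity once a second
    vmstat 1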


If you want to exec from a process using a large fraction of the physical address space on the machine, you need swap to maintain a reasonable amount of virtual address space. Needing swap and using swap are different things. How swap interacts with the process and memory subsystems is poorly understood.


I tried running systems without swap a few years ago. That wasn't a very good idea. Most applications are very generous in their memory usage (not to mention allocation, which is almost always insane), and normally those pages get swapped out never to be heard from again. So without swap, performance suffers, since fewer pages are available for cache. (And in virtual environments it gets even worse, since the balloon driver isn't really that great.) I didn't have time to see it through and abandoned it.

In light of that this recommendation from Red Hat is very interesting. Just a fifth of memory as swap is probably enough to get real world performance back, without getting completely stuck when something goes haywire. On large memory systems it should probably be even less.


> [Recommended amount of swap] depends on the desired behaviour of the system, but configuring an amount of 20% of the RAM as swap is usually a good idea.

This sounds like good advice compared to the classic "2x RAM" guideline. Back in the HDD era when we already had around 8GB RAM I started wondering how long it would take to actually fill 16GB of swap in terms of raw write speed.

On the other hand SSDs are fast enough that swap might actually make a low-memory system feel faster.

My current Linux laptop has around the same amount of swap as RAM. Am I mistaken in thinking that suspend-to-disk saves RAM contents on the swap partition?


What about to support hibernation? Is that possible via swap file now?


It depends upon what filesystem you're writing it to, but the answer is mostly yes.
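On Linux the usual recipe is roughly this sketch (the device name is a placeholder; the offset trick works on ext4 and some other filesystems):

    # Find the physical offset of the swap file's first extent
    sudo filefrag -v /swapfile | head

    # Then boot with kernel parameters along the lines of:
    #   resume=/dev/sdaX resume_offset=<first physical_offset printed above>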


Answer: yes


I'm not sure if it is still true on Win 10, but earlier versions started to perform terribly if you had no swap on the boot partition, regardless of how much core you had.


This is contrary to my experience in Windows 7. As under Linux, I run Win7 without swap on modern hardware, and I've had no trouble there. Which versions of Windows have you had trouble with?


XP and 7. My current rig had 32GB, though I left the boot pagefile at 1.5GB when 8 was installed, and likewise after the upgrade to 10. My understanding when researching this (long ago) was that Windows likes to purposely put stuff in the pagefile even if there is free RAM.

Even now, with not quite 9GB "in use", 22GB "standby" and 1GB "free", the paging file is at 1.5% use with a peak of about 3%. Granted, that is tiny on a fixed 1.5GB file, but for some reason Windows feels the need to drop about 20-50MB into the pagefile.


That article takes a system with 2GB of RAM as its example. For a modern system that is pretty unrealistic; even laptops have more. My system has 12.

I also miss any mention of zram. Zram can create compressed ramdisks, so it can provide a compressed swap device in RAM, basically making your RAM last longer in case you really run out of memory. In my experience that is a good alternative to keeping a bit of swap space as a reserve, as the article recommends. A rough setup is sketched below.
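The following is a sketch only (sizes and compressor are illustrative; comp_algorithm must be set before disksize):

    sudo modprobe zram
    echo lz4 | sudo tee /sys/block/zram0/comp_algorithm
    echo 4G  | sudo tee /sys/block/zram0/disksize
    sudo mkswap /dev/zram0
    sudo swapon -p 100 /dev/zram0   # high priority, so it's used before any disk swap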


Disk space is cheap. Add a few gigs of swap, or make a swap file! If you only have a few gigs of HDD space left, buy another disk! Having some swap is worth it.

One neat thing swap can be used for is taking sleeping processes that might use a lot of memory and putting that memory on disk, to free up RAM for other, active programs.

How much swap do you need? The square root of your RAM size (e.g. 4GB of swap for a 16GB machine).


I've disabled swap on macOS. It works fine with just 16 GB. That's with Spring STS and at least one 2GB VirtualBox VM running ArangoDB. I use Safari, so it uses less memory than Chrome would.

I shut it off because, in my opinion, OS X is pretty shitty at memory management. It swaps for no good reason: I've had 6 GB free and still 1.6 GB of swap in use. That shouldn't happen.


It swaps out unused memory to free up space for disk cache in order to improve performance. This is actually a good use of memory.


Not knowing a huge amount of OS kernel theory, I can still point to Solaris 10 and its handling of swap as clearly superior. I had a 16GB RAM server giving me problems; I logged in and it was swapping continuously with only 2MB (yes, 2048KB) free, yet my SSH session was not overly laggy.

Under Linux, if there is heavy swapping, forget it; nothing will work well.


I love how the article essentially says that each situation is different and then says 20% of RAM is a good rule of thumb.

In my experience, the most common swap configuration is minimal: perhaps 500MB, with vm.swappiness=1 (sketched below).

I'd say it's more rare to find a system that actually needs swap than one that can do without it.

I have yet to run into an application that for some reason needed swap to be around.
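That configuration is two commands (a sketch; the sysctl.d filename is just a convention):

    # Apply immediately
    sudo sysctl vm.swappiness=1
    # Persist across reboots
    echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/99-swappiness.conf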


Swap seems like a nice safety valve. Preferable, I think, to suddenly shutting down an important program in use because it's OOM.


It must depend on the use case, but I prefer a program that is about to dip into swap (usually one where I accidentally allocate a way-too-big buffer) to fail automatically, rather than having to try to kill it through the now-unresponsive system UI.


You're right, it depends. Building Firefox from source needs several gigs of RAM, while normal functioning of my system needs no more than 4GB. For a couple of years I used swap just for such big /usr/bin/ld processes. Now I have 8GB of RAM, and linking FF or LibreOffice is not an issue anymore.


I remember when it became impossible to build Firefox on our lab machines with 1GB of RAM and 2GB of swap. Even before that it took literally all day to build.

I got another taste of that lately when I had to build Wireshark from source on a Raspberry Pi model B thanks to broken packages in the repo. At least it was the version with 512MB of memory and overclocked. For the most part Wireshark isn't that bad to build, certainly better than Firefox, but there are a couple of dissectors that have unreasonably large source files.


It never works for me on Windows.

It just slows my system down to a crawl, requiring me to force a reboot.

It probably depends on your hardware.

And if I disable the pagefile, Windows Update stops working, and at 75% memory usage it starts panicking and closing programs.


With an SSD it only slows down my system enough that I know I need to free up some memory.

But it allows me to save, exit a program, or close a tab, without losing any work.


Swap isn't the problem. Swapping is.

Question is: is there a way to identify who the culprit is and take appropriate vengeful action in the instance of swapping? Generally it's "foo large application", though there's also a very strong tendency for "foo large application" to be a critical system element -- either OS or application level.


So, the argument the article makes is:

1. Swap is slow

2. If using swap, your system starts to thrash

3. If thrashing, you can't close programs to free memory

4. If you can't close programs, you have to wait until the task is killed by the OS

5. If you have no swap (or very little), you don't have to wait.

Except with an SSD, swap isn't slow enough to cause that issue. So really this article only seems to apply to servers, not desktops.


It seems like the OS developer's opinion these days is that RAM is cheap, don't swap. In the old days people cared a lot about swap performance because RAM was so tight that you were virtually guaranteed to swap at some point. These days you get 16GB sticks in boxes of Crackerjacks so why would you ever swap?

Of course the trend of making notebooks thinner by ditching the SODIMM slots and soldering insufficient amounts of memory to the mobo may reverse this.


Not entirely true; I've had swap thrashing with SSDs on desktops too.

Though it tends to mean you're boned, or going to be waiting a while as all I/O is dedicated to swapping for minutes at a time.


Do current operating systems ever page out memory to swap when not necessary, in order to make room for more disk cache in RAM?


If OS's stopped doing this at some point, I'd be shocked: this is why having memory and swap is valuable.


You'll be shocked, then. Linux under the default settings won't swap unless there is pressure to do so.

Windows will.

It's a tradeoff: if you swap something out while not under pressure, that could be the very thing you need next, resulting in it just getting swapped back in. Or maybe not, and the extra cache is useful. (But if you're not under pressure, maybe letting go of older cache instead of swapping something out is a better trade: dropping old cache doesn't require writing anything out, since it's by definition already backed by disk.)


Windows' approach is painful if you run it on slow HDDs like I did 15 years ago before switching to linux, full time.


macOS does this aggressively. I'm currently at 8 GB used/16 GB, and still have 5 GB swapped out.


Windows does this very aggressively. It's built in to the I/O subsystem, actually, and (IIRC—it's been a bit since I've messed with this) IOCP calls execute immediately (on the same thread, avoiding any context switches) when requested data is already cached.


In designing Kubernetes, for instance, we document and (mostly) enforce swap off for many of the reasons laid out here. Kubernetes takes over the management of host overcommitment, and being able to react correctly to OOM and near-OOM depends to some degree on having a clear understanding of the actual memory use on the system.


The real issue is not the amount of swap but thrashing.

E.g., several large processes sleeping in memory on a desktop would be fine if only one or two are used at the same time. OTOH, clustered nodes well tuned for a single task may not need swap at all.

In any case, it is a metric for thrashing that should be used to initiate culling.


I have 16 GB of memory with swap off yet still sometimes get lag and freezes due to low memory. An aggressive OOM killer or a performance watchdog should be considered a mandatory feature. On desktop I'd much rather have my programs shut down than get any lag.


Just a note for people playing with vm.swappiness (= 0, for example): what it does differs depending on the distro and kernel version. With one version, = 0 meant "absolutely no swapping"; with another it may mean "try not to swap".


On a modern system with a lot of RAM, I found myself doing the opposite of swap to speed up development: mount a disk in RAM using tmpfs [1] and then change the ccache [2] directory to that RAM disk (see the sketch after the links). With that setup I obviously don't want swap to kick in :) It can make the compilation of C++ programs much faster.

[1] http://ubuntublog.org/tutorials/how-to-create-ramdisk-linux....

[2] https://linux.die.net/man/1/ccache
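A minimal version of that setup (paths and size are illustrative; CCACHE_DIR is ccache's documented way to relocate its cache):

    # Mount an 8 GB RAM-backed filesystem and point ccache at it
    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
    export CCACHE_DIR=/mnt/ramdisk/ccache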


I have had /tmp as tmpfs for a couple of years, since Arch switched to this layout. I ended up having the opposite problem, probably because I never had enough RAM.

I have only ever enabled swap when compiling a few projects where "-j8" would take up all the memory. Using fewer threads would usually end up being slower than more threads + swap.


Swap media is closing in on RAM speed at a fast clip. RAM latency has been stalled for 30 years, while storage latencies are improving rapidly thanks to flash. So from a hardware POV, it should be a good time for swapping.


Personally, I have a 100GB swap partition on my system... why? Because I'm a filthy tab hoarder. I just put them in the background; if I need them, sure, it takes a couple of seconds, but I don't have to close 'em.

/shrug


Me too (typing this in my 277th tab).

Using Firefox on Bunsen Linux (Debian Jessie derived).

It's only using half of 4GB of RAM, but there's 20MB in swap.


"The Great Suspender" (browser extension) is your friend.


All the laptops at my workplace have the minimum storage (Apple...). It becomes frustrating when I open Photoshop and almost my entire free space suddenly vanishes, while my 16GB of RAM isn't even 20% utilised.


I've had 32 gigs of memory in my desktop rig for more than a decade. Needless to say, no swap. Some decent points about mitigation like memory priority in the article. Good read.


I have 32G in one machine and 16G in another. I recently moved over to the 16G machine to do my dev work, and I run a few VMs on it.

I've found myself wanting to upgrade it to 32G of RAM, but honestly that's about the only use case (besides production servers) where I would ever consider swap, and at that point I consider it a problem of not enough memory rather than swap being necessary.


Can you change swap settings on macOS? I haven't dug in deeply, but I couldn't find an option anywhere, and I feel that the default is faaaaar too aggressive.


I haven't used swap in years on an 8GiB machine. I might hit the OOM killer once a year. It does its job and the system keeps running.


I once swore that if I ever had at least 200MB of RAM I would turn swap off. Decades and gigabytes of RAM later, that still hasn't happened.


These days the primary use of swap is to get the efficiency of overcommitting RAM without compromising reliability.


Is it possible to hibernate without swap? That's one of the use cases when it's actually useful.


Are these rules also applicable to FreeBSD or other BSD-like systems?


FreeBSD doesn't overcommit the available memory (RAM + swap) by default.

Don't use swap files on FreeBSD, because the filesystem write paths of UFS and ZFS can allocate memory. Both geom_mirror (software RAID-1) and geom_eli (disk encryption) are fine, and I would recommend using GELI to create one-time keys for mirrored swap partitions at boot.

Another good habit to get into is to limit the resources available to your services to some generous upper bound you expect them to require. The most flexible way to enforce such restrictions in FreeBSD is hierarchical resource limits. Use them and monitor resource consumption; that way you get an early warning before a rogue process drives the system into massive swapping.
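A sketch of both ideas (the rule and the fstab line are illustrative; rctl requires kern.racct.enable=1 in /boot/loader.conf):

    # Cap everything run by the www user at 2 GB of resident memory
    rctl -a user:www:memoryuse:deny=2g

    # /etc/fstab: the .eli suffix gives GELI-encrypted swap with a one-time key
    /dev/mirror/swap.eli  none  swap  sw  0  0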


I've found you definitely need swap if you don't have 8GB of memory. I personally have nowhere near the amount of patience required to wait for the OOM killer.


What about OS X? Is it good to disable swap?


Over the past 5 years swapping has always been due to a memory leak.


(For me, I mean.) I should really learn how to disable it.


Just put your swapfile on a ramdisk.



