In defence of swap: common misconceptions

gok · on June 9, 2020

> On SSDs, swapping out anonymous pages and reclaiming file pages are essentially equivalent in terms of performance/latency

…but very much not equivalent in terms of how much you are wearing out your SSD.

> Swap can make a system slower to OOM kill

This is almost always actually a bad thing. Turning on swap means your system goes from randomly killing processes to suddenly running 100x slower than it used to and then randomly killing processes.

pmoriarty · on June 9, 2020

Yeah, but in the meantime you might actually notice that your system is slowing down, investigate and possibly fix the issue or intelligently kill the right process.

mrob · on June 9, 2020

In practice, everything slows to an unusable crawl, and you hit the reset button because it's the fastest way of regaining control. IMO, earlyoom or similar is essential for general desktop use where memory load is unpredictable. Better to lose one process than lose all of them.

"The oom-killer generally has a bad reputation among Linux users. This may be part of the reason Linux invokes it only when it has absolutely no other choice. It will swap out the desktop environment, drop the whole page cache and empty every buffer before it will ultimately kill a process. At least that's what I think that it will do. I have yet to be patient enough to wait for it, sitting in front of an unresponsive system."

https://github.com/rfjakob/earlyoom

kalleboo · on June 9, 2020

I guess Linux is different from macOS since I just had this situation last week. I wrote a quick app to generate a graphic and let it run in the background, noticed my machine was feeling a bit sluggish (but still totally usable for web browsing etc) and noticed my app had a memory leak and was using 100 GB of RAM and climbing on a 32 GB system. I could easily quit the app using the regular GUI.

NVMe SSDs are fantastic.

gok · on June 9, 2020

macOS has also has Jetsam (like the Linux OOM killer but much more aggressive) and swap-to-compression, so you're not as often swapping to disk as you'd think.

kalleboo · on June 9, 2020

My app was leaking uncompressed 12 MP photo buffers and the swap file was over 70 GB, so in this case it was definitely swapping to hell

steerablesafe · on June 9, 2020

> In practice, everything slows to an unusable crawl

This is my experience as well. Maybe this could be mitigated with cgroup v2 by setting some processes to never swap so the user doesn't lose interaction with the system.

tsimionescu · on June 9, 2020

Unfortunately, that can happen to a much worse extent if swap is turned off. Linux will swap out all file-backed pages before invoking the OOM Killer, and since all user-space code is sitting in memory mapped pages, it will start swapping out code. And if you think waiting for disk to do a memory read is slow, imagine what waiting for disk to see what is the next available instruction can do to the system.

luckylion · on June 9, 2020

Is there a trick to reserve a percentage of actual RAM to be used for investigations and fixes? Because unless you already have a working shell open on the machine etc, are you going to wait 10 minutes for ssh to spawn a child, then 10 minutes for the shell to load and give it another 5 minutes for your ps to show?

Arnt · on June 9, 2020

BTDT, a lot of times. In the meantime I notice the system is slowing down, but I can't kill the guilty process because the system is too slow to ssh in. The ssh client times out while the ssh server is trying to obtain the pages it needs.

Swap is what turns the deficient OOM killer from a minor to a major problem.

sophiebits · on June 9, 2020

I think the author clearly acknowledges that is a bad thing but argues that other advantages outweigh it.

clarry · on June 9, 2020

I say pfft to this. The article makes a poor case. None of this matters if you have enough memory. Hence, calling swap "emergency memory" (or poor man's ram) is perfectly justified.

If you can honestly answer "How much swap do I need, then?", and the answer is not infinite, then you can just add that amount of RAM instead of swap and be done with it, no need for swap, unless you want to prepare for emergency.

"Ideally you should have enough swap to make your system operate optimally at normal and peak (memory) load." Ideally you should have enough memory to make your system operate optimally at normal and peak memory load. In a perfect world, everything would fit in CPU's SRAM and you wouldn't even need DRAM..

cesaref · on June 9, 2020

Agreed, if you can size it, you can specify the right memory for a host for an app. There are plenty of situations where you've got dedicated hardware and a suitable budget and you can do size the hardware correctly.

If however you have something like kubernetes, your apps move across hosts, you have memory scaling with use (e.g connections) and you have no idea what the load on the app will be, it becomes basically impossible to right-size the memory so other strategies come into play. Monitoring becomes king at that point.

throwaway2048 · on June 9, 2020

If your workload cant fit in memory, without special measures such as optane swap storage you will have such crushing unresponsiveness even with something like SSDs that it is entirely not worth the hassle in 99% of cases, no matter what your budget is.

clarry · on June 9, 2020

And if you can't size it, you can't size swap either so yeah.. monitor and then do what you need to do.

de_watcher · on June 9, 2020

Swap is purely psychological. It's like a slower RAM that stops you so you can think about what you've done without actually losing your work.

otabdeveloper4 · on June 9, 2020

> ...unless you want to prepare for emergency.

On a modern desktop system 'emergency' is every day, thanks to modern Javascript and the modern web.

clarry · on June 9, 2020

Swap doesn't save you from that :(

Those things need infinite ram.

pmoriarty · on June 9, 2020

I have a pretty old, slow laptop on which I've been running Gentoo for a long time, so I'm frequently recompiling a lot of software. Over the years, as software has gone from being bloated to being insanely bloated (qt-webkit, firefox, and rust: I'm looking at you!), the "only" 8 gigs of memory that my system has just sometimes isn't enough, so adding swap has let me stretch the life of my system.

Sure, huge compiles are much slower when they hit swap, but at least it's possible to run them with swap, while it would be impossible for me to do so without.. at least until I can afford to get a new computer which can fit more memory.

gen220 · on June 9, 2020

Surely, the solution is to acquire a whole bunch of old, slow laptops and hook them together as a build cluster! :)

Old thinkpads (2012ish) run for a hundred bucks and some change, and some of them support up to 32 GB of ram, IIRC. The battery is a generic and replaceable component, too. They can last a lifetime! Or at least until FF et al renders 32 GB insufficient.

capableweb · on June 9, 2020

Now I don't know pmoriarty's situation at all and don't want to assume anything either, but, if someone has a "pretty old, slow laptop on which I've been running Gentoo for a long time" without upgrading, there is surely a reason for that. Suggesting that "hundred bucks" is something you can spend just to get faster build times might work for a well paid developer in the 1st world but many developers are not well paid so cannot afford the luxury of upgrading their machine willy nilly like that.

While upgrading for money works for some, less bloated software would work for everyone + no need to spend money to get acceptable build times + even if you have a super computer, you compilation times gets a lot faster. So aiming for less bloated software feels like a more noble goal than trying to convince people to upgrade their hardware.

gen220 · on June 9, 2020

Hey, I'm sorry, I didn't mean to come off making assumptions or prescribing advice. My intent was to illustrate that "new computer" doesn't have to mean "shiny, expensive, newly-manufactured computer". On reflection, my words do reflect my biases that investing one-to-two hundred dollars in such a setup wouldn't be a cause of much concern for me, and I regret that I assumed likewise for the readers.

I totally agree with you that less bloated software should be our goal. Although, realistically, if you're compiling a modern web browser you don't have much of a choice in the matter (unfortunately). I don't think that reducing bloat by an order of magnitude (i.e. a difference that would impact compile times) is on the radar of Chromium or FF.

psexec · on June 9, 2020

Since you're a gentoo user, you probably already know that, but have you tried using compressed memory like zswap? I find that I very rarely swap out to disk since I enabled it (and tuned to my needs):

https://wiki.archlinux.org/index.php/Zswap

kevin_nisbet · on June 9, 2020

Another aspect that plays into this, can be interactions with GC'ed languages, and particularly longer than expected pauses that appear to be caused by page faults and thrashing when in a critical GC mark phase. IIRC this can even become apparent in non-blocking sections of the GC, because the GC runs out of available segments while still running the mark and thrashing paged out memory. I do believe all of the systems I've personally observed this behaviour on predate kernel 4.0 though, and didn't have SSDs, so I'm not aware of whether it's less pathological with GCed runtimes post kernel 4.0 and whether SSDs would be fast enough to avoid the blocking. It probably also depends on the allocation rate of the application.

So I have recommended disabling swap on systems in the past, but inline with the article, these are on systems that are largely dependent on application memory, and don't benefit from the IO cache prioritizing any application pages. As the article is pointing out, this isn't a hard and fast rule, but a tuning that needs to be done depending on the application, and using cgroups to fine tune this may be a better approach depending on the use case.

altmind · on June 9, 2020

While I find the question puzzling, i find the description lacking. The whole explanation of anon maps is redundant, since the behavior of anon maps seems to be the same as the behavior of plain memory - its evicted on low memory conditions with LRU behavior and there is no anon maps specifics mentioned.

To recap: swap makes anon memory pages reclaimable, by offloading it to disk. This does not explain anything. Its basically "swap allows anon maps to be swapped"; so what?

>> You need to opportunistically handle the situation yourself before ever thinking about the OOM killer.

Or, just rely on OOM killer, minimal swap(so the application dont agony in low memory conditions) and restart policies. There are no reason to do the kernel work handling memory over-commit.

lostdog · on June 9, 2020

As soon as I figure out how to get the OOM killer to just kill Chrome every time, I might never have to hard reset my system again!

cbsks · on June 9, 2020

You are looking for /proc/<pid>/oom_score_adj:

https://github.com/torvalds/linux/blob/15a2bc4dbb9cfed1c661a...

jwilk · on June 9, 2020

Curiously, in https://bugs.chromium.org/p/chromium/issues/detail?id=333617 people are complaining about the opposite problem:

Linux kills a Chromium process, leaving bigger memory hogs roaming around.

z991 · on June 9, 2020

On Ubuntu, earlyoom basically does that. It's great.

  sudo apt install earlyoom

lostdog · on June 10, 2020

Looks brilliant, thank you!

hinoki · on June 9, 2020

I’ve case you are serious, replace chrome with:

  #!/bin/bash
  echo 1000 > /proc/$$/oom_score_adj
  path_to_real_chrome

mappu · on June 9, 2020

What's the maintainable way of doing this, that can survive chrome package updates and also handle http URIs opened from other applications?

steerablesafe · on June 9, 2020

How do you typically start chrome? I guess you could just change the .desktop file somewhere in $HOME, system updates don't touch that.

zbentley · on June 10, 2020

Won't that just OOM-shield the shell launching chrome and not the chrome process itself?

I think you want "exec path_to_real_chrome".

DecoPerson · on June 9, 2020

That was a lot of words to say “Swap allows us to push infrequently used pages to disk during memory usage spikes to avoid triggering the dropping of useful file pages, or worse, the OOM killer.”

jpitz · on June 9, 2020

Tfa even bolds the initial takeaway: "Swap is primarily a mechanism for equality of reclamation, not for emergency "extra memory". Swap is not what makes your application slow – entering overall memory contention is what makes your application slow."

It is a mechanism for making more page types reclaimable.

danw1979 · on June 9, 2020

It's an absolute brain dump of an article. I went in really interested but couldn't make it past the first few paragraphs.

Thanks for summarising...

colejohnson66 · on June 9, 2020

I’m still confused on swap. What’s the practical difference (besides speed) between 16 GB or RAM and 16 GB of swap versus 32 GB of RAM and 0 swap?

eurekin · on June 9, 2020

I'm in the same boat with that. What makes 16GB RAM + 16GB swap acceptable, when 32GB of just RAM is the same size and faster? Why, 32GB of RAM only needs extra swap?

NikolaeVarius · on June 9, 2020

16 GB of SSD space is cheaper

colejohnson66 · on June 9, 2020

But if I upgrade to 32 GB or RAM, would I even need the swap? I guess my question is: why is swap not dynamic like pagefile.sys on Windows? The pagefile grows and shrinks as it’s needed. But a swap partition doesn’t; it’s size is set at creation.

bb88 · on June 9, 2020

What's your workload like, do you run a lot of heavy servers for development? Or do you mostly do browsing?

A growing/shrinking swap file doesn't really add anything, especially if you're thinking a 1:1 relationship with memory and swap. To me it seems like 16GB of HD space isn't that big a deal.

You can also play around with swap performance if you want temporarily.

https://linuxize.com/post/create-a-linux-swap-file/

Yes, it's going to be slower since you're using a file in a filesystem rather than using a dedicated partition, but it would give you a way to test it out. Some workloads will do better since the unused pages will get sent to disk.

SomeoneFromCA · on June 9, 2020

Yo do not need a partition, you can use a file.

dTal · on June 9, 2020

True. It's still a fixed size.

(It's also slower. The reason we use partitions is to avoid invoking a whole heap of filesystem code every time we want to access some memory.)

westmeal · on June 9, 2020

One of the scenarios not mentioned is if you have a hilarious amount of RAM and little diskspace. I run with 1 GB of swap, VM swappiness set to 0 and 16gb of ram. I don't think I've ever gotten close to exhausting mem or even invoking oom. What real difference is made here? Is it just something I don't understand?

kevin_nisbet · on June 9, 2020

It really depends on the application and use case, but with lots of headroom there likely isn't a material difference (IE the application only uses 128MB of RAM, you're just not going to see it).

Where it would come into play in your use case, is say you have an application, that serves as an application, and can also serve files from disk, like a web app. Say the running application uses 8GB of RAM and the files on disk is 10GB. On your system with 16GB of RAM, the IO subsystem in the kernel will use leftover RAM as a cache on disk access. So even though your web server sends files from disk, in most cases it'll leave a copy in RAM, and then can just serve the file from RAM without waiting on the disks which are relatively slow. Our total dataset is 18GB though, so everything we're doing doesn't really fit in RAM with application + files usage.

This is where swap comes into play. Even though the application uses 8GB of RAM, maybe some of it doesn't get used very often. Where as the files on disk, the total 10GB get used almost constantly. What the kernel can do, is say, 2GB of the application, even though it's in use, we don't see the application ever using it. Maybe it's just some buffer or house keeping data, so it's not used frequently. We'll swap that out to disk, and our IO cache is now a little bit bigger, allowing it to speed up access to that 10GB of files always being used on the disk, speeding things up.

When the application does need that memory, the kernel will bring it back into main RAM, using some free memory that's maintained, or reclaiming some memory, which is slow. Instead of the RAM being instantly available, it has to wait for the kernel to do its work, which leads to engineers discovering in certain limited cases, if they disable swap, their applications get faster, since it doesn't block waiting for the kernel to do it's work in those cases.

If you don't have resource contention on the size of the application in RAM, and the amount of frequently accessed files / IO rate, you're not going to see any performance difference.

* The defaults swappiness setting on many distros can also be a little aggressive, which I suspect as lead to alot of these disable swaps. Although, as the article points out, this changed linux 4.0+, but I'm not familiar with the changes being referred to here.

westmeal · on June 9, 2020

Thanks so much that helped me

toast0 · on June 9, 2020

I've run a lot of systems with gobs of ram and like 512M swap on FreeBSD. Linux is different than FreeBSD of course, but determining how much memory is being used / if you're running out is difficult. Watching swap stats is a better gauge.

If your swap i/o rate is high, you're probably thrashing, even if the usage is low; you're right on the edge, and need more ram (or fix bugs).

If your swap usage is growing rapidly, you've got an urgent problem; sometimes a little bit of swap lets you get in and fix something or at least shut it down nicer than OOM killer, sometimes it just gives you a more obvious record that you ran out of memory.

If your swap usage gradually climbs to say 50% over time, you might have a slow leak, or something to look at anyway.

jcelerier · on June 9, 2020

Heh, I had to go from 16G to 64G because of how often I was hitting OOM when compiling big stuff (a few times per week at least). And that was in 2016...

raincom · on June 9, 2020

I had a production server with 512G memory with 0 swap. It was running some ldap server. Every week, this ldap server needs a restart due to some slownesss. Adding swap has cured it.

nineteen999 · on June 9, 2020

512G for an LDAP server?? Surely you mean 512M.

raincom · on June 9, 2020

Axsuul · on June 9, 2020

This is fine as long as the tradeoff in performance is acceptable.

nwmcsween · on June 9, 2020

IMO swap should be a found using fio with a rw test using 4-16k (Linux does 16k reads on swap req due to vm.page-cluster) threaded to the number of processors on the system * the amount of time you're willing to let the OOM thrash around. The swap should also use zswap to compress in ram and shunt uncompressable to swap.

SomeoneFromCA · on June 9, 2020

No one mentioned zswap. It is awesome, if you have at least semimodern CPU. It feels like you have 15% more ram than you actually do.

dTal · on June 9, 2020

Not just on a semi-modern CPU. I used it on a Zipit Z2 (truly ancient PXA270 Arm SoC with 32 mb of RAM) that I was using as a Debian PDA, and it was a godsend. Probably a little slower, but when you only have 32 megabytes of RAM, that's really the bottleneck you care about.

(I don't think Debian will run with so little memory anymore.)

rcthompson · on June 9, 2020

One thing I would love to see is a discussion of how to tune the swap system (e.g. what value to use for vm.swappiness) in cases where zram/zswap are enabled.

bsder · on June 9, 2020

My biggest gripe is that nobody designs for no swap.

With flash file systems, swap is a Bad Idea(tm).

However, whenever I disable swap, all manner of Linux systems freak out because they are used to never having to actually deal with out of memory (having swap on Linux means your entire system becomes totally unresponsive long before anything actually flags out of memory). Linux doesn't help this by overcommitting memory that it doesn't actually have.

mappu · on June 9, 2020

> Linux doesn't help this by overcommitting memory that it doesn't actually have.

As long as the POSIX fork/exec pattern is implemented with a CoW address space, you can just write all over memory to cause new physical allocations. There's no malloc ENOMEM return to check in this case.

(You can do this on Windows too via RtlCloneUserProcess and friends.)

You could "solve" this problem by implementing fork/exec to use copying instead of CoW (like Cygwin and PDP-11 Unix), but I don't think anyone wants that, especially because you'd usually throw away the work with exec() anyway.

wtallis · on June 9, 2020

I think he's suggesting that the fork itself should fail if the subsequent overwrite pass to defeat CoW would cause problems. So you would have trouble actually directly using all of your RAM if you had some processes that don't follow a fork with an exec—but that leftover physical RAM should still be able to be used for the disk cache.

duskwuff · on June 9, 2020

> I think he's suggesting that the fork itself should fail if the subsequent overwrite pass to defeat CoW would cause problems.

That'd put you in the unusual position of forbidding any sufficiently large process from ever creating children. The kernel has no way of knowing whether a fork() will be followed by exec().

hyperman1 · on June 9, 2020

If posix_spawn is a syscall, it would be a decent replacement.