The nice thing about virtual memory is that it's, well, virtual. It costs you almost nothing until you've touched it. (Fun exercise for the reader: measure the kernel overhead for an unused 1 TiB VMA.) But creating huge spaces--that terabyte mmap wasn't theoretical--that stay untouched is hugely algorithmically useful, especially for things like malloc implementations.
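For the curious, here's a minimal sketch of that exercise, assuming 64-bit Linux; the 1 TiB size and MAP_NORESERVE flag are just illustrative choices, and strict overcommit (vm.overcommit_memory=2) may refuse the mapping outright:

    /* Map 1 TiB, touch none of it, and compare VmSize vs VmRSS. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1ULL << 40;  /* 1 TiB of address space */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* VmSize jumps by ~1 TiB; VmRSS barely moves, since no page
           has been faulted in yet. */
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        while (fgets(line, sizeof line, f))
            if (!strncmp(line, "VmSize", 6) || !strncmp(line, "VmRSS", 5))
                fputs(line, stdout);
        fclose(f);

        munmap(p, len);
        return 0;
    }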
Why does it bother people? Two reasons. First is mlock to avoid swap. This is solvable in much better ways--I'm a fan of disabling swap in many cases anyway. Second is that, absent cgroups, it's difficult to put hard limits on memory usage in Linux. So people, looking under the streetlight, put limits on virtual usage, even though that's not what they care about limiting! Then they get angry when you break it. My refrain here, as in many cases (see for example measuring process CPU time spent in kernel mode): "X is impossible" doesn't justify Y unless Y correctly solves the problem X does.
(I spent years in charge of a major memory allocator so this is a battle I've fought too many times.)
Thirdly, the kernel will kill you if your overcommit ratio is too high. I had this argument with the Go folks several years ago (when Docker would crash after starting 1000 containers because the Go runtime had allocated 8 GB of virtual memory while only tens of MB were actually in use, and the kernel freaked out).
You're right that it doesn't cost anything, other than the risk that a process can cripple your machine using its overcommitted memory mapping. And so the kernel has protections against this, which should deter language runtime developers from doing this.
And let's not forget that MADV_DONTNEED is both implemented incorrectly on Linux and ridiculously expensive compared to freeing memory and reallocating it when you need it. Bryan Cantrill ranted about this for a solid half an hour in a podcast a year or two ago.
So… does that mean the Linux kernel will blow a gasket if I mmap genuinely large files to play with them but have almost no resident memory? That doesn't seem reasonable.
> What do you mean by “free” memory? Actually unmap it?
Sorry, I didn't phrase it well. MADV_DONTNEED is significantly more expensive than most ways that memory allocators would "free" memory. This includes just zeroing it out in userspace when necessary (so no need for a TLB modification), or simply unmapping it and remapping it when needed.
> Also, I assume the crippling you’re talking about here is just the ability to rapidly apply memory pressure?
Right, and if the memory is overcommitted then you can cause OOM very trivially, because you already have more mapped pages than there is physical memory -- writing a byte in each page will cause intense memory pressure. Now, this doesn't mean it would panic the machine, it just means it would cause issues (the OOM killer would figure out which process is the culprit fairly easily).
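To make that concrete, a small sketch (sizes arbitrary, assuming 64-bit Linux): each write below demand-faults a fresh physical page, so RSS climbs toward the mapping size even though the mmap itself cost essentially nothing:

    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t len = 8ULL << 30;   /* 8 GiB; pick something larger than RAM to see real pressure */
        long page = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        for (size_t off = 0; off < len; off += page)
            p[off] = 1;            /* one byte per page: faults in the whole range */

        return 0;
    }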
This is why vm.overcommit_ratio exists (which is what I was talking about when it comes to killing a machine) -- though I just figured out that not all Linux machines ship with vm.overcommit_memory=2 (which I'm pretty sure is what SUSE and maybe some other distros ship because this is definitely an issue we've had for several years...).
There's also RLIMIT_AS, which applies regardless of overcommit_memory.
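As a sketch of that (the 1 GiB cap and 2 GiB request are arbitrary): once RLIMIT_AS is set, further address-space allocations fail with ENOMEM no matter what the overcommit sysctls say:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit lim = { .rlim_cur = 1ULL << 30, .rlim_max = 1ULL << 30 };
        if (setrlimit(RLIMIT_AS, &lim) != 0) { perror("setrlimit"); return 1; }

        /* A 2 GiB anonymous mapping now exceeds the address-space cap. */
        void *p = mmap(NULL, 2ULL << 30, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            printf("mmap failed as expected: %s\n", strerror(errno));
        return 0;
    }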
Right. I’m very familiar with all these mechanisms, I guess I just don’t agree that the ability to cause an OOM, particularly if applications are isolated in cgroups appropriately, is a big deal. On balance, not allowing applications to use virtual memory for useful things (such as the Go case of future heap reservation) or underutilizing physical memory seems worse.
As an aside, it seems like an apples and oranges comparison to compare “freeing” by zeroing (which doesn’t free at all) to MADV_DONTNEED. I’m also pretty sure that munmap will be much slower than MADV_DONTNEED, or at least way less scalable, given that it needs to acquire a write lock on mmap_sem, which tends to be a bottleneck. It does seem like there’s a lot of opportunity for a better interface than MADV_DONTNEED though (e.g. something asynchronous, so you can batch the TLB flush and avoid the synchronous kernel transition).
> particularly if applications are isolated in cgroups appropriately
Once the cgroup OOM bugs get fixed, amirite? :P
> It does seem like there’s a lot of opportunity for a better interface than MADV_DONTNEED though (e.g. something asynchronous, so you can batch the TLB flush and avoid the synchronous kernel transition).
The original MADV_DONTNEED interface, as implemented on Solaris and FreeBSD and basically every other Unix-like does exactly this -- it tells the operating system that it is free to free it whenever it likes. Linux is the only modern operating system that does the "FREE THIS RIGHT NOW" interface (and it's arguably a bug or a misunderstanding of the semantics -- or it was copied from some really fruity Unix flavour).
In fact, when jemalloc was ported to Solaris it would crash, because it had been written against Linux's incorrect implementation and assumed that MADV_DONTNEED would always zero out the pages -- which is not the case outside Linux.
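A tiny sketch of the semantic split being described (this is Linux-specific behaviour; on the BSD/Solaris reading of the same call, the old contents may well survive):

    #define _DEFAULT_SOURCE
    #include <assert.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        p[0] = 42;                          /* dirty the page */
        madvise(p, len, MADV_DONTNEED);     /* Linux drops the page immediately */
        assert(p[0] == 0);                  /* the refault hands back a fresh zero page */

        munmap(p, len);
        return 0;
    }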
> As an aside, it seems like an apples and oranges comparison to compare “freeing” by zeroing (which doesn’t free at all) to MADV_DONTNEED. [...] I’m also pretty sure that munmap will be much slower than MADV_DONTNEED.
This is fair, I was sort of alluding to writing a memory allocator where you would prefer to have a memory pool rather than constantly doing MADV_DONTNEED (which is sort of what Go does -- or at least used to do). If you're using a memory pool, then zeroing out the memory on-"allocation" in userspace is probably quite a bit cheaper than MADV_DONTNEED.
But you're right that it's not really an apt comparison -- I was pointing out that there are better memory management setups than just spamming MADV_DONTNEED.
The thing is, people want a way to measure and control the amount of memory that a process uses or is likely to use. Resident memory is one way to measure actually used memory, but according to man 3 vlimit, RLIMIT_RSS is only available on Linux 2.4.x, x < 30, which nobody in their right mind is still running. So we have RLIMIT_AS, which limits virtual memory, or we have the default policy of hoping the OOM killer kills the right thing when you run out of RAM.
That you have to keep fighting this battle is an indication that people's needs (or desires) aren't being well met.
There's a third reason: trying to allocate too much virtual memory on machines with limited physical memory will fail on Linux with the default setting vm.overcommit_memory=0. See for instance https://bugs.chromium.org/p/webm/issues/detail?id=78
Great points. A third reason is core files: That 1 TB of unused virtual memory will be written out to the core file, which will take forever and/or run out of disk. This is part of the problem of running with the address sanitizer: you don't get core files on crashing, because they'd be too big.
Not sure whether anyone is writing iOS apps in go, but iOS refuses to allocate more than a relatively small amount of address space to each process (a few gigs, even on 64-bit devices).
I used to think this. Then I deployed on Windows. Virtual memory can't exceed total physical memory (+ pagefile) or else malloc() will fail. I am currently having an issue where memory is "allocated" but not used, causing software crashes. Actual used memory is 60% of that.
The page file in Windows can grow and the max size, I believe, is 3 times the amount of physical memory in the machine. So, if you're trying to commit more than [Physical Memory x 4] bytes, then yes, it will fail. But, more than likely, you'll get malloc failures long before that due to address space fragmentation (unless you're doing one huge chunk).
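For anyone hitting this, the key distinction is between reserving and committing; a hedged Windows sketch (the 64 GiB figure is arbitrary): MEM_RESERVE only takes address space, while MEM_COMMIT is charged against the commit limit (RAM + pagefile) and is the step that fails:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        SIZE_T len = (SIZE_T)64 << 30;   /* 64 GiB of address space */

        /* Reservation succeeds even if RAM + pagefile is far smaller. */
        void *p = VirtualAlloc(NULL, len, MEM_RESERVE, PAGE_NOACCESS);
        if (!p)
            return 1;
        printf("reserve: %p\n", p);

        /* Committing the whole range is what hits the commit limit. */
        void *q = VirtualAlloc(p, len, MEM_COMMIT, PAGE_READWRITE);
        printf("commit:  %p (NULL means the commit charge is exhausted)\n", q);

        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }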
Check out GHC 8.0+ where the same program will allocate a terabyte of virtual memory ;)
That said, I don't find it unreasonable at all. Just reserving some bits in the address space isn't unreasonable. It makes the real allocation code simpler.
> The significant increase in virtual memory usage is usually not an issue, however security sensitive programs often lock their memory, causing a far greater performance degradation on low-spec computer hosts.
Operating systems usually cap the amount of locked pages to prevent the system from being DoS'd; on Linux it can be quite low (16 KiB).
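A quick way to see the cap in question, assuming Linux/glibc; the limit reported here is RLIMIT_MEMLOCK, the per-process bound on mlock()ed bytes:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit lim;
        if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
            return 1;
        /* Defaults are often just tens of KiB, which is why locked
           arenas have to stay tiny. */
        printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
               (unsigned long long)lim.rlim_cur,
               (unsigned long long)lim.rlim_max);
        return 0;
    }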
In all the code I've written using locked memory, these limitations have forced me to use separate arenas for the locked/sensitive memory because of its scarcity. In general, it would be incompatible with Go's garbage-collected heap unless the GC heap had the concept of "sensitive" objects and pooled them in locked memory (which, AFAICT, it doesn't), or the heap limited itself to the locked memory limit (impractical).
If you allocate and maintain unmanaged locked memory yourself in Go it shouldn't matter if the 1.11 runtime uses more virtual memory since you've separated yourself from the problem by going your own route.
what does locking memory mean in this context?
For what purpose do security sensitive programs lock their memory?
what is the performance degradation that happens with low-spec computers?
I think the person is referring to the mlock() and mlockall() functions (or equivalents on other OS), which keep pages resident / prevents pages from being paged out. Forces them to remain in RAM.
It can be used to, e.g., prevent an encryption key or password from being swapped out to disk, where it might then be recoverable. (Personally, this is why I encrypt swap.)
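A minimal sketch of that pattern, assuming Linux/glibc (explicit_bzero is a glibc/BSD extension; the page-sized buffer is arbitrary): lock the page before the secret lands in it, wipe it before unlocking:

    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4096;                 /* one page for the secret */
        unsigned char *key = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (key == MAP_FAILED || mlock(key, len) != 0)
            return 1;                      /* over RLIMIT_MEMLOCK, or no permission */

        /* ... fill `key` with secret material and use it ... */

        explicit_bzero(key, len);          /* wipe before the page can go anywhere else */
        munlock(key, len);
        munmap(key, len);
        return 0;
    }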
> what is the performance degradation that happens with low-spec computers?
Locking a larger portion of RAM means less room for the OS to page out unused pages and free up the space for other programs.
While one can try to selectively lock buffers with sensitive data with mlock(), you have to be sure they aren't copied into other buffers that aren't locked (and could thus be subsequently paged out). If you're writing a UI program that displays or receives those in a widget, this might be harder (you might not have access to the internal buffer of the widget, as it is an "implementation detail" of your library), and locking the entire process might be a simpler solution (albeit a bigger hammer).
My understanding was that mlock doesn't really keep the page from being paged out, since it can't even begin to do that in the hibernation case, or if you're running in a VM whose guest RAM wasn't mlocked.
What it does is keep a canonical version of the page in memory. That's useful for being able to deterministically touch a piece of memory, but it doesn't really help you as far as making sure the page never touches disk.
I hadn't considered hibernation, and indeed, a deeper reading of the manual confirms that hibernation doesn't count, which is rather interesting (given the implications of hitting disk), but I don't really see a good way around it, short of aborting the hibernation, or providing a mechanism to inform the program that those pages were lost. The man page (later, annoyingly, after its initial description) notes this:
> Memory locking has two main applications: real-time algorithms and high-security data processing. Real-time applications require deterministic timing, and, like scheduling, paging is one major cause of unexpected program execution delays. Real-time applications will usually also switch to a real-time scheduler with sched_setscheduler(2). Cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transferred onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated. (But be aware that the suspend mode on laptops and some desktop computers will save a copy of the system's RAM to disk, regardless of memory locks.)
I would be rather disappointed if a hypervisor swapped out my guest (at least, in a context like AWS; I suppose if you're just running qemu on your laptop, that's a different matter), but I hadn't considered that either, and it is certainly possible.
IMO, the right answer is to better define your threat model. Are you concerned about someone pulling the HDD and reading the swap? Use an FDE scheme that covers your swap too. Are you concerned about someone getting access to the swap file programmatically? At that point they have so many other ways of slurping the memory out of your process that it's a lost cause.
Good explanations of what and why one might lock memory.
But, in the context of this overall discussion, I think it's important to keep in mind that when a process allocates a large amount of virtual memory, it does not automatically allocate any physical memory. So a process allocating a lot of virtual memory up front should not impact other processes which have locked some of their memory into physical memory.
> I think it's important to keep in mind that when a process allocates a large amount of virtual memory, it does not automatically allocate any physical memory.
It will, if you mlock() it, I believe. The manual page notes "real-time processes" as a main user of mlock() (the other being the cryptographic uses I hinted at); it cites their use case as locking the page to avoid delays due to paging during critical sections. In order for that to work, the OS would need to bring the pages in, at the time of locking; so at that point, a large virtual allocation becomes equivalent to a physical one.
It tells the OS to never swap memory allocated to the process to disk.
You should always lock memory if you're going to be storing crypto keys, etc. since once the pages are swapped to disk you're vulnerable to someone pulling the swap partition out and reading it.
1. You don't have to lock all your memory (although it may be hard to capture all the intermediate buffers if you don't).
2. You still need to clear the memory buffers after they're no longer going to be used, otherwise other processes can read /proc/kcore, etc. (or cool your RAM, extract it, and put it in another system)
3. It is possible to encrypt the swap partition with a randomly generated key at boot
One locks memory to keep it from being swapped to disk. You can imagine if you have sensitive data, you want to keep it as ephemeral as possible. Do even security-focused go programs typically lock _all_ of their virtual memory, or just the sensitive pages? I don’t know. But if a bunch of programs hogged hundreds of megabytes of unused RAM each, that’s the problem alluded to on low resource systems.
I assume this is referring to some mechanism like `mlockall(2)` or using cgroups to disable paging. Is the "excess" memory dirtied? It's not clear why having the pages mapped but entirely unaccessed would cause a performance issue.
> All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
So it doesn't matter if the memory is dirty, as long as it's marked as locked.
But since the memory is never actually backed by anything, it doesn't have a real effect except in the accounting of the process. No additional memory is used by locking it.
No, because the pages aren't mapped to anything -- not disk, RAM, or an external paging device. Allocating a chunk of memory typically doesn't map those pages to anything immediately (since there's no benefit to doing that extra work).
The current Linux man page gives a bit more insight:
    mlockall() and munlockall()
        mlockall() locks all pages mapped into the address space of the
        calling process.  This includes the pages of the code, data and
        stack segment, as well as shared libraries, user space kernel
        data, shared memory, and memory-mapped files.  All mapped pages
        are guaranteed to be resident in RAM when the call returns
        successfully; the pages are guaranteed to stay in RAM until
        later unlocked.

        The flags argument is constructed as the bitwise OR of one or
        more of the following constants:

        MCL_CURRENT  Lock all pages which are currently mapped into the
                     address space of the process.

        MCL_FUTURE   Lock all pages which will become mapped into the
                     address space of the process in the future.  These
                     could be, for instance, new pages required by a
                     growing heap and stack as well as new memory-mapped
                     files or shared memory regions.
After some experimenting: it will fault in all the mapped pages, causing the VM subsystem to try to back them. In Linux 4.4+ there is a flag to `mlockall(2)` called MCL_ONFAULT, which does not do that and instead locks pages as they become backed.
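A hedged sketch of the difference (Linux 4.4+ for MCL_ONFAULT): plain MCL_CURRENT faults in and locks every mapped page up front, while adding MCL_ONFAULT only locks pages once they're actually touched, so a big untouched mapping stays cheap:

    #define _GNU_SOURCE
    #include <sys/mman.h>

    int main(void) {
        /* Lock current and future mappings, but only as they are faulted in. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) != 0)
            return 1;   /* needs CAP_IPC_LOCK or a generous RLIMIT_MEMLOCK */

        /* ... any page this process touches from here on stays resident ... */
        return 0;
    }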
I do this in some programs; it makes many pieces of coding much easier if I know I will never have to move some objects as they grow.
The ideal for me would be a function that marked some memory as "these addresses are taken, do not give them out to malloc, or anything else", but which still required me to actually "ask" for the memory before using it, so it didn't look like I was using 64gb of memory at startup. Is that possible?
> Instead of a mapping, create a guard of the specified size. Guards allow a process to create reservations in its address space, which can later be replaced by actual mappings. mmap will not create mappings in the address range of a guard unless the request specifies MAP_FIXED. Guards can be destroyed with munmap(2). Any memory access by a thread to the guarded range results in the delivery of a SIGSEGV signal to that thread.
So it seems like a good fit for your use case, but I've never used this.
> The ideal for me would be a function that marked some memory as "these addresses are taken, do not give them out to malloc, or anything else", but which still required me to actually "ask" for the memory before using it, so it didn't look like I was using 64gb of memory at startup. Is that possible?
So you want to reserve memory, basically the way malloc does, depriving any other process of that memory, but you then want to reserve it again, and also somehow hide that the memory has been reserved?
What exactly is the benefit except making it not look like you're using the memory? Actually, what's the benefit of that in itself?
No, I want, in my address space, to say to the kernel "when you decide which bit of my memory map to return from mmap and friends, don't use these addresses." That shouldn't cost anything, except storing the blocks I want reserved.
This is surprisingly hard to do in user space. You can tell mmap where to put allocations, but not where not to put them. Also, it's hard to control what your libc's malloc will do.
So like allocating a large virtual address space and managing that on your own? That already exists, then, and the only remaining issue is making it look like you haven't, for some unclear reason.
No, he wants to separate virtual address space allocation from physical memory backing allocation, with both still happening manually. It's a very reasonable request, and the benefit is that you can be guaranteed to have the address space while still only using a smaller max amount of physical memory.
I don't interpret that as the request. I interpret it as wanting virtual memory "allocated" but not counted against the process's virtual memory allocation, and then getting a secondary mechanism to allocate the sorta-but-not-quite-allocated memory. I also don't see the value in that.
(I don't see any request for different physical memory allocation, so I assume that would still be handled by page faulting in the kernel.)
The value would be to be able to grow arrays and similar without relocation.
People do it all the time in real life, with area codes, zip codes, case numbers, etc.
For example, a long street full of strip malls in California will often have street numbers 50 apart to allow them to remain sequential even after new developments.
No one would complain that this is a wasteful use of precious street numbers, or that it deprives other streets of those numbers.
Imagine the nightmare if you periodically had to renumber all the buildings instead, like computers routinely have to.
How would this proposed mechanism allow that, and what about the current workings of virtual memory disallow it?
One can already do this with realloc() (https://linux.die.net/man/3/realloc). And if you don't want to use the malloc family, but instead want to manage your virtual memory yourself, you can easily allocate large sections of virtual memory and then manage it yourself - which would include enabling behavior such as growing arrays. (Even though you're really just re-implementing realloc-like behavior yourself.)
And to make sure we're on the same page (ha), I want to reiterate that there is a big difference between virtual and physical memory. You can allocate virtual memory without allocating physical memory.
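A sketch of that reserve-then-commit approach (sizes illustrative; note the reservation still shows up as virtual memory, which is the whole visibility complaint): grab a big PROT_NONE range up front, then mprotect() chunks to read/write as the array actually grows, so addresses never move:

    #include <sys/mman.h>

    #define RESERVE (64ULL << 30)   /* 64 GiB of address space */
    #define CHUNK   (1ULL << 20)    /* "commit" 1 MiB at a time */

    int main(void) {
        /* Reserve: the range is ours, but unusable and unbacked. */
        char *base = mmap(NULL, RESERVE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Make the first chunk usable once the array actually needs it. */
        if (mprotect(base, CHUNK, PROT_READ | PROT_WRITE) != 0)
            return 1;
        base[0] = 1;   /* physical backing arrives on this fault, not before */

        /* Growing later is just another mprotect(); the addresses never move. */
        return 0;
    }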
I misunderstood what it was that you didn't see the value of.
The original point was entirely to avoid scaring newbies who can't tell virtual from physical, while still being liberal with your virtual usage.
Haskell with GHC has a great solution: just allocate 1TB up front. A newbie does not need improved tooling, a reworked memory system or any education to realize that this can't possibly be RAM.
This is kind of the point of this whole post :) reserving lots of virtual memory causes posts like this, where people complain about how much memory you are "using".
I think that just kicks the ball further down the court, as it's yet another thing to track. I think it's better to just explain that (for the most part) high virtual memory usage isn't a problem. But as someone pointed out above, FreeBSD does have a facility for it, which is interesting.
My first thought on unix was mmap() as well, but one of the requirements was that it doesn't "look like" the process is using that memory, which I interpret as "I don't want to allocate this as virtual memory." I don't think mmap() allows that, as its entire purpose is to allocate virtual memory and return a pointer to that memory space.
Personally, my feeling is: why does it matter if it "looks like" a process is using a lot of memory? That is, why does it matter if a process allocates a lot of virtual memory up front? It's not consuming physical memory, it's just updating some bookkeeping in the kernel. I know people feel uneasy about seeing large values for virtual memory, but... they shouldn't.
This is what Linux does by default for anonymous memory (it's called overcommitment). Go makes very liberal use of this, but it has other issues (namely Linux will kill a process if it goes over a certain overcommitment threshold).
Reading that, I think overcommitment is to determine the kernel's behavior when you try to allocate more virtual memory than physical memory that is present on the system. That is a different (but related) concern from the fact that mmap() will allocate virtual memory but not physical memory. (That is, mmap() reserves locations in the address space, but you don't have any physical memory backing it until you use that memory.)
Yeah, sorry -- mmap()'s lazy allocation of pages is a separate but related concept to overcommit (I was writing a tirade about that in a separate thread and my wires got crossed).