Programs compiled by Go 1.11 allocate an unreasonable amount of virtual memory (github.com/golang)
137 points by networkimprov on Oct 10, 2018 | 76 comments



Unreasonable according to you. :)

The nice thing about virtual memory is that it's, well, virtual. It costs you almost nothing until you've touched it. (Fun exercise for the reader: measure the kernel overhead for an unused 1 TiB VMA.) But creating huge spaces--that terabyte mmap wasn't theoretical--that stay untouched is hugely algorithmically useful, especially for things like malloc implementations.
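A minimal sketch of that kind of reservation (assuming 64-bit Linux with the default overcommit heuristic; MAP_NORESERVE keeps the region out of commit accounting, and under vm.overcommit_memory=2 a request this large would simply be refused):

    #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1ULL << 40;  /* reserve 1 TiB of address space */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        memset(p, 1, 4096);  /* touch a single page; only it becomes resident */
        printf("mapped %zu bytes at %p -- compare VSZ vs RSS in ps/top\n",
               len, (void *)p);
        getchar();           /* pause so the numbers can be inspected */
        return 0;
    }

The process shows up with a terabyte of virtual size while its resident set stays at a few pages.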

Why does it bother people? Two reasons. First is mlock to avoid swap. This is solvable in much better ways--I'm a fan of disabling swap in many cases anyway. Second is that, absent cgroups, it's difficult to put hard limits on memory usage in Linux. So people, looking under the streetlight, put limits on virtual usage, even though that's not what they care about limiting! Then they get angry when you break it. My refrain here, as in many cases (see for example measuring process CPU time spent in kernel mode): "X is impossible" doesn't justify Y unless Y correctly solves the problem X does.

(I spent years in charge of a major memory allocator so this is a battle I've fought too many times.)


Thirdly, the kernel will kill you if your overcommit ratio is too high. I had this argument with the Go folks several years ago (when Docker would crash after starting 1000 containers because the Go runtime had allocated 8GB of virtual memory while only tens of MB were in use, and the kernel freaked out).

You're right that it doesn't cost anything, other than the risk that a process can cripple your machine using its overcommitted memory mapping. And so the kernel has protections against this, which should deter language runtime developers from doing this.

And let's not forget that MADV_DONTNEED is both incorrectly implemented on Linux and ridiculously expensive compared to freeing memory and reallocating it when you need it. Bryan Cantrill ranted about this for a solid half an hour in a podcast a year or two ago.


So… does that mean the Linux kernel will blow a gasket if I mmap genuinely large files to play with them but keep almost no resident memory? That doesn't seem reasonable.


This is only for anonymous memory; obviously it doesn't affect file-backed memory.


What do you mean by “free” memory? Actually unmap it?

Also, I assume the crippling you’re talking about here is just the ability to rapidly apply memory pressure? Otherwise I’m very confused.


> What do you mean by “free” memory? Actually unmap it?

Sorry, I didn't phrase it well. MADV_DONTNEED is significantly more expensive than most ways that memory allocators would "free" memory. This includes just zeroing it out in userspace when necessary (so no need for a TLB modification), or simply unmapping it and remapping it when needed.
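Roughly, the two "free" strategies being compared look like this (a sketch, not a benchmark; error handling trimmed):

    #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS / MADV_* */
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1 << 20;  /* a 1 MiB arena */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        memset(buf, 0xAA, len);              /* dirty the pages */

        /* Option 1: give the pages back to the kernel. On Linux this is a
           synchronous call with TLB work, and the next touch page-faults. */
        madvise(buf, len, MADV_DONTNEED);

        /* Option 2: keep the pages mapped and recycle them in userspace;
           zero on re-"allocation" and skip the kernel round trip entirely. */
        memset(buf, 0, len);
        return 0;
    }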

> Also, I assume the crippling you’re talking about here is just the ability to rapidly apply memory pressure?

Right, and if the memory is overcommitted then you can cause an OOM very trivially, because you already have more mapped pages than there is physical memory -- writing a byte in each page will cause intense memory pressure. Now, this doesn't mean that it would kernel panic the machine, it just means it would cause issues (the OOM killer would figure out which process is the culprit fairly easily).

This is why vm.overcommit_ratio exists (which is what I was talking about when it comes to killing a machine) -- though I just realized that not all Linux machines ship with vm.overcommit_memory=2 (which I'm pretty sure is what SUSE and maybe some other distros ship, because this is definitely an issue we've had for several years...).

There's also RLIMIT_AS, which applies regardless of overcommit_memory.
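For reference, RLIMIT_AS caps the address space itself, which is exactly the limit that large virtual reservations trip over. A minimal sketch (the 1 GiB and 2 GiB numbers are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void) {
        /* Cap total address space at 1 GiB; any mmap/brk beyond that fails. */
        struct rlimit lim = { .rlim_cur = 1UL << 30, .rlim_max = 1UL << 30 };
        if (setrlimit(RLIMIT_AS, &lim) != 0) { perror("setrlimit"); return 1; }

        void *p = malloc(2UL << 30);         /* a 2 GiB request now fails... */
        printf("malloc(2 GiB) -> %p\n", p);  /* ...so this prints (nil) */
        free(p);
        return 0;
    }

This is also why a Go binary run under a tight RLIMIT_AS can fail even though its resident usage is tiny: the runtime's virtual reservation counts against the limit.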


Right. I'm very familiar with all these mechanisms; I guess I just don't agree that the ability to cause an OOM, particularly if applications are isolated in cgroups appropriately, is a big deal. On balance, not allowing applications to use virtual memory for useful things (such as the Go case of future heap reservation) or underutilizing physical memory seems worse.

As an aside, it seems like an apples and oranges comparison to compare “freeing” by zeroing (which doesn’t free at all) to MADV_DONTNEED. I’m also pretty sure that munmap will be much slower than MADV_DONTNEED, or at least way less scalable, given that it needs to acquire a write lock on mmap_sem, which tends to be a bottleneck. It does seem like there’s a lot of opportunity for a better interface than MADV_DONTNEED though (e.g. something asynchronous, so you can batch the TLB flush and avoid the synchronous kernel transition).
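For what it's worth, newer kernels do have a lazier knob: MADV_FREE (Linux 4.5+) only marks the pages reclaimable, and the actual reclaim happens later, under memory pressure, if at all. A minimal sketch with a fallback for older kernels (error handling trimmed):

    #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS / MADV_* */
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1 << 20;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        memset(buf, 0xAA, len);

        /* MADV_FREE: pages stay mapped and are only reclaimed under memory
           pressure, so a prompt reuse of the buffer is essentially free. */
        if (madvise(buf, len, MADV_FREE) != 0)
            madvise(buf, len, MADV_DONTNEED);  /* older kernels: eager drop */
        return 0;
    }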


> particularly if applications are isolated in cgroups appropriately

Once the cgroup OOM bugs get fixed, amirite? :P

> It does seem like there’s a lot of opportunity for a better interface than MADV_DONTNEED though (e.g. something asynchronous, so you can batch the TLB flush and avoid the synchronous kernel transition).

The original MADV_DONTNEED interface, as implemented on Solaris, FreeBSD, and basically every other Unix-like, does exactly this -- it tells the operating system that it is free to free the memory whenever it likes. Linux is the only modern operating system with a "FREE THIS RIGHT NOW" interface (and it's arguably a bug or a misunderstanding of the semantics -- or it was copied from some really fruity Unix flavour).

In fact, when jemalloc was ported to Solaris it would crash because MADV_DONTNEED was incorrectly implemented on Linux (and jemalloc assumed that MADV_DONTNEED would always zero out the pages -- which is not the case outside Linux).

> As an aside, it seems like an apples and oranges comparison to compare “freeing” by zeroing (which doesn’t free at all) to MADV_DONTNEED. [...] I’m also pretty sure that munmap will be much slower than MADV_DONTNEED.

This is fair, I was sort of alluding to writing a memory allocator where you would prefer to have a memory pool rather than constantly doing MADV_DONTNEED (which is sort of what Go does -- or at least used to do). If you're using a memory pool, then zeroing out the memory on-"allocation" in userspace is probably quite a bit cheaper than MADV_DONTNEED.

But you're right that it's not really an apt comparison -- I was pointing out that there are better memory management setups than just spamming MADV_DONTNEED.


Do you have a link to the podcast?



Thanks!


I just had the time to find the timestamp; it starts about 57-58 minutes into that second video (The Cantrill Strikes Back).


The thing is, people want a way to measure and control the amount of memory that a process uses or is likely to use. Resident memory is one way to measure actually used memory, but according to man 3 vlimit, RLIMIT_RSS is only available on Linux 2.4.x, x < 30, which nobody in their right mind is still running. So we have RLIMIT_AS, which limits virtual memory, or we have the default policy of hoping the OOM killer kills the right thing when you run out of RAM.

That you have to keep fighting this battle is an indication that people's needs (or desires) aren't being well met.


There's a third reason: trying to allocate too much virtual memory on machines with limited physical memory will fail on Linux with the default setting vm.overcommit_memory=0. See for instance https://bugs.chromium.org/p/webm/issues/detail?id=78


And those processes with large VM usage also have a problem doing an exec (fork/execvp pair) because of address space exhaustion.


Great points. A third reason is core files: That 1 TB of unused virtual memory will be written out to the core file, which will take forever and/or run out of disk. This is part of the problem of running with the address sanitizer: you don't get core files on crashing, because they'd be too big.


Shouldn't that be a sparse file?


Not sure whether anyone is writing iOS apps in Go, but iOS refuses to allocate more than a relatively small amount of address space to each process (a few gigs, even on 64-bit devices).


I used to think this. Then I deployed on Windows. Virtual memory can't exceed total physical memory (plus the pagefile) or else malloc() will fail. I am currently having an issue where memory is "allocated" but not used, causing software crashes. Actual used memory is 60% of that.


The page file in Windows can grow and the max size, I believe, is 3 times the amount of physical memory in the machine. So, if you're trying to commit more than [Physical Memory x 4] bytes, then yes, it will fail. But, more than likely, you'll get malloc failures long before that due to address space fragmentation (unless you're doing one huge chunk).


The automatic size management doesn't go over 3x RAM, but manual configuration allows for a maximum page file size of 16TB. https://blogs.technet.microsoft.com/markrussinovich/2008/11/...


Re "unreasonable" I simply pasted the title of the Github issue :-)


You cannot possibly be serious:

    package main

    func main() { for { } }

Using 100 MB+ of memory?


It is not "using" but allocating the addresses. The committed memory will be vastly smaller. This is how virtual memory works.


> Using 100mb+ of memory?

It's not using them.

On modern OS X, processes get something like 2.4 GB of vmem by default, even if they do nothing.

    #include <unistd.h>

    int main() {
      for(;;) {
        sleep(10);
      }
    }
is reported as 2377M VMEM by top/htop, on 10.11.


To quote the parent: `The nice thing about virtual memory is that it's, well, virtual. It costs you almost nothing until you've touched it.`


Virtual memory is different from resident memory / RSS ...


Check out GHC 8.0+ where the same program will allocate a terabyte of virtual memory ;)

That said, I don't find it unreasonable at all. Just reserving some bits in the address space isn't unreasonable. It makes the real allocation code simpler.


Yes, we are running a Haskell based HTTPS redirector and people noticed this.


Impact:

> The significant increase in virtual memory usage is usually not an issue, however security sensitive programs often lock their memory, causing a far greater performance degradation on low-spec computer hosts.


Is this really a problem?

Operating systems usually cap the amount of locked pages to prevent the system from being DoS'd; on Linux it can be quite low (16kb).

In all the code I've written using locked memory, these limitations have forced me to use separate arenas for the locked/sensitive memory because of its scarcity. In general, it would be incompatible with Go's garbage-collected heap unless the GC heap had a concept of "sensitive" objects and pooled them in locked memory (which, AFAICT, it doesn't), or the heap limited itself to the locked memory limit (impractical).

If you allocate and maintain unmanaged locked memory yourself in Go it shouldn't matter if the 1.11 runtime uses more virtual memory since you've separated yourself from the problem by going your own route.


I'd love some further explanation here:

What does locking memory mean in this context? For what purpose do security-sensitive programs lock their memory? What is the performance degradation that happens on low-spec computers?


> what does locking memory mean in this context?

I think the person is referring to the mlock() and mlockall() functions (or equivalents on other OSes), which keep pages resident and prevent them from being paged out, forcing them to remain in RAM.

It can be used to, e.g., prevent an encryption key or password from being swapped out to disk, where it might then be recoverable. (Personally, this is why I encrypt swap.)

> what is the performance degradation that happens with low-spec computers?

Locking a larger portion of RAM means less room for the OS to page out unused pages and free up the space for other programs.

While one can try to selectively lock buffers with sensitive data with mlock(), you have to be sure they aren't copied into other buffers that aren't locked (and could thus be subsequently paged out). If you're writing a UI program that displays or receives those in a widget, this might be harder (you might not have access to the internal buffer of the widget, as it is an "implementation detail" of your library), and locking the entire process might be a simpler solution (albeit being a bigger hammer).
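A minimal sketch of the selective approach (assuming the secret fits in a page and the default RLIMIT_MEMLOCK allows locking it; error paths trimmed):

    #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS */
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);
        /* Page-aligned buffer for the secret, so the lock covers whole pages. */
        char *secret = mmap(NULL, page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (secret == MAP_FAILED) return 1;
        if (mlock(secret, page) != 0) return 1;  /* pinned: never swapped out */

        /* ... derive and use the key in place, taking care not to copy it
           into unlocked buffers elsewhere ... */
        strcpy(secret, "correct horse battery staple");

        memset(secret, 0, page);  /* wipe before unlocking/unmapping */
        munlock(secret, page);
        munmap(secret, page);
        return 0;
    }

Whole-process locking with mlockall(MCL_CURRENT | MCL_FUTURE) is the bigger hammer mentioned above.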


My understanding was that mlock doesn't really keep the page from being paged out, since it can't even begin to do that in the hibernation case, or if you're running as a VM whose guest RAM wasn't mlocked.

What it does is keep a canonical version of the page in memory. That's useful for being able to deterministically touch a piece of memory, but it doesn't really help you as far as making sure the page never touches disk.


I hadn't considered hibernation, and indeed, a deeper reading of the manual confirms that hibernation doesn't count, which is rather interesting (given the implications of hitting disk), but I don't really see a good way around it, short of aborting the hibernation, or providing a mechanism to inform the program that those pages were lost. The man page (later, annoyingly, after its initial description) notes this:

> Memory locking has two main applications: real-time algorithms and high-security data processing. Real-time applications require deterministic timing, and, like scheduling, paging is one major cause of unexpected program execution delays. Real-time applications will usually also switch to a real-time scheduler with sched_setscheduler(2). Cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transferred onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated. (But be aware that the suspend mode on laptops and some desktop computers will save a copy of the system's RAM to disk, regardless of memory locks.)

I would be rather disappointed if a hypervisor swapped out my guest (at least, in a context like AWS; I suppose if you're just running qemu on your laptop, that's a different matter), but I hadn't considered that either, and it is certainly possible.


IMO, the right answer is to better define your threat model. Are you concerned about someone pulling the HDD and reading the swap? Use an FDE scheme that covers your swap too. Are you concerned about someone getting access to the swap file programmatically? At that point they have so many other ways of slurping the memory out of your process that it's a lost cause.


Good explanations of what and why one might lock memory.

But, in the context of this overall discussion, I think it's important to keep in mind that when a process allocates a large amount of virtual memory, it does not automatically allocate any physical memory. So a process allocating a lot of virtual memory up front should not impact other processes which have locked some of their memory into physical memory.


> I think it's important to keep in mind that when a process allocates a large amount of virtual memory, it does not automatically allocate any physical memory.

It will, if you mlock() it, I believe. The manual page notes "real-time processes" as a main user of mlock() (the other being the cryptographic uses I hinted at); it cites their use case as locking the page to avoid delays due to paging during critical sections. In order for that to work, the OS would need to bring the pages in, at the time of locking; so at that point, a large virtual allocation becomes equivalent to a physical one.


Correct, but my point was that a process which allocates a lot of physical memory will not interfere with the other processes that have locked pages.


Linux cgroups and Solaris containers, at least, also provide a way to avoid paging to disk without modifying the program.


It tells the OS to never swap memory allocated to the process to disk.

You should always lock memory if you're going to be storing crypto keys, etc., since once the pages are swapped to disk you're vulnerable to someone pulling the swap partition out and reading it.


It is significant to note that:

1. You don't have to lock all your memory (although it may be hard to capture all the intermediate buffers if you don't).

2. You still need to clear the memory buffers once they're no longer going to be used, otherwise other processes can read them via /proc/kcore, etc. (or someone can cool your RAM, extract it, and put it in another system); see the sketch after this list.

3. It is possible to encrypt the swap partition with a randomly generated key at boot.
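On point 2, a minimal sketch of a wipe the compiler can't optimize away (explicit_bzero is glibc 2.25+ and the BSDs; older systems need memset_s or a volatile-pointer loop):

    #define _DEFAULT_SOURCE   /* for explicit_bzero on glibc */
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *key = malloc(32);
        if (!key) return 1;

        /* ... key material lives here while it's needed ... */

        /* A plain memset right before free() is often deleted as a "dead
           store" by the optimizer; explicit_bzero is guaranteed to run. */
        explicit_bzero(key, 32);
        free(key);
        return 0;
    }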


One locks memory to keep it from being swapped to disk. You can imagine if you have sensitive data, you want to keep it as ephemeral as possible. Do even security-focused Go programs typically lock _all_ of their virtual memory, or just the sensitive pages? I don't know. But if a bunch of programs each hogged hundreds of megabytes of unused RAM, that's the problem alluded to on low-resource systems.


I assume this is referring to some mechanism like `mlockall(2)` or using cgroups to disable paging. Is the "excess" memory dirtied? It's not clear why having the pages mapped but entirely unaccessed would cause a performance issue.


From the manpage of mlockall

> All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.

So it doesn't matter if the memory is dirty, as long as it's marked as locked.


But since the memory is never actually backed by anything, it doesn't have a real effect except in the accounting of the process. No additional memory is used by locking it.


Doesn't "guaranteed to be resident in RAM" mean the memory gets backing?


No, because the pages aren't mapped to anything -- not disk, RAM, or an external paging device. Allocating a chunk of memory typically doesn't map those pages to anything immediately (since there's no benefit to doing that extra work).

The current Linux man page gives a bit more insight:

    mlockall() and munlockall()
        mlockall() locks all pages mapped into the address space of the
        calling process. This includes the pages of the code, data and stack
        segment, as well as shared libraries, user space kernel data, shared
        memory, and memory-mapped files. All mapped pages are guaranteed to
        be resident in RAM when the call returns successfully; the pages are
        guaranteed to stay in RAM until later unlocked.

        The flags argument is constructed as the bitwise OR of one or more
        of the following constants:

        MCL_CURRENT  Lock all pages which are currently mapped into the
                     address space of the process.

        MCL_FUTURE   Lock all pages which will become mapped into the
                     address space of the process in the future. These could
                     be, for instance, new pages required by a growing heap
                     and stack as well as new memory-mapped files or shared
                     memory regions.


Ah, I see--it's all mapped pages, not all pages as irishsultan's comment says. That makes a lot more sense, thanks!


After some experimenting, it will fault in all the mapped pages, causing the VM subsystem to try to back them. In Linux 4.4+ there is a flag to `mlockall(2)` called MCL_ONFAULT which does not do that and instead locks the pages as they become backed.

    $ cat wheres-the-ram.c
    #! /home/rkeene/bin/c
    #include <sys/mman.h>
    #include <stdlib.h>
    #include <stdio.h>
    
    int main(int argc, char **argv) {
            unsigned char *buffer;
            int mla_ret;
    
            mla_ret = mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
            if (mla_ret != 0) {
                    perror("mlockall");
    
                    return(1);
            }
    
            buffer = malloc(1024LLU * 1024LLU * 1024LLU * 32LLU);
            if (!buffer) {
                    perror("malloc");
    
                    return(1);
            }
    
            buffer[0] = 1;
            buffer[1] = buffer[0];
            buffer[2] = buffer[3];
    
            puts("Success !");
    
            return(0);
    }
    $ sudo ./wheres-the-ram.c
    Success !
    $ free -g
                  total        used        free      shared  buff/cache   available
    Mem:             14           1           5           0           7          13
    Swap:             0           0           0
    $


Never seen someone "execute C" before. You are never done learning apparently obvious things...


The canonical shebang line to do this is '/usr/bin/tcc -run' - the Tiny C Compiler in "run" mode.


I do this in some programs; it makes many pieces of coding much easier if I know I will never have to move some objects as they grow.

The ideal for me would be a function that marked some memory as "these addresses are taken, do not give them out to malloc, or anything else", but which still required me to actually "ask" for the memory before using it, so it didn't look like I was using 64gb of memory at startup. Is that possible?


On FreeBSD there's MAP_GUARD.

From the documentation:

> Instead of a mapping, create a guard of the specified size. Guards allow a process to create reservations in its address space, which can later be replaced by actual mappings. mmap will not create mappings in the address range of a guard unless the request specifies MAP_FIXED. Guards can be destroyed with munmap(2). Any memory access by a thread to the guarded range results in the delivery of a SIGSEGV signal to that thread.

So it seems like a good fit for your use case, but I've never used this.
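Going only from that documented behaviour (untested), the reserve-then-commit pattern on FreeBSD 11.1+ would look something like this sketch, where pieces of the guard are later replaced with real mappings via MAP_FIXED:

    #include <sys/mman.h>

    int main(void) {
        size_t reserve = 1UL << 36;   /* reserve 64 GiB of address space */
        size_t chunk = 1UL << 20;     /* later commit 1 MiB of it */

        /* Create a guard: no mapping, but nothing else can land in this range. */
        char *base = mmap(NULL, reserve, PROT_NONE, MAP_GUARD, -1, 0);
        if (base == MAP_FAILED) return 1;

        /* "Ask" for the front of it by replacing that part of the guard. */
        char *p = mmap(base, chunk, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED) return 1;
        p[0] = 1;

        munmap(base, reserve);        /* destroys the mapping and the guard */
        return 0;
    }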


That is exactly what I want! I wonder if it will get added to Linux.


> The ideal for me would be a function that marked some memory as "these addresses are taken, do not give them out to malloc, or anything else", but which still required me to actually "ask" for the memory before using it, so it didn't look like I was using 64gb of memory at startup. Is that possible?

So you want to reserve memory, basically the way malloc does, depriving any other process of that memory, but you then want to reserve it again, and also somehow hide that the memory has been reserved?

What exactly is the benefit except making it not look like you're using the memory? Actually, what's the benefit of that in itself?


No, I want, in my address space, to say to the kernel "when you decide what bit of my memory map to return from mmap and friends, don't use these addresses". That shouldn't cost anything, except storing the blocks I want reserved.

This is surprisingly hard to do in user space. You can tell mmap where to put allocations, but not where not to put them. Also, it's hard to control what your libc's malloc will do.


You can do this with mmap(). Just use PROT_NONE, you'll get a mapping but you can't read, write, or execute it.

Sure, it will show up in the "virtual memory" usage of your process. That's just how the virtual memory accounting works.
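A minimal sketch of that reserve-then-commit pattern (assuming 64-bit Linux; the reservation shows up in virtual size, but nothing is readable, writable, or backed until it's committed):

    #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS */
    #include <stddef.h>
    #include <sys/mman.h>

    int main(void) {
        size_t reserve = 1UL << 36;   /* reserve 64 GiB of address space */
        char *base = mmap(NULL, reserve, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) return 1;

        /* Later, "ask" for the first 16 MiB before actually using it. */
        size_t commit = 16UL << 20;
        if (mprotect(base, commit, PROT_READ | PROT_WRITE) != 0) return 1;
        base[0] = 1;                  /* only now do pages become resident */

        munmap(base, reserve);
        return 0;
    }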


What operating system are you working on that has a shared address space between multiple unrelated processes?

No system that I know of works that way besides some L4 kernels, and they only use it as an IPC optimization strategy.


What magic computer are you using where reserving a chunk of memory doesn't necessarily mean that no other process can reserve that chunk of memory?


The process isn't reserving memory, it's reserving ranges in its virtual address space:

https://en.wikipedia.org/wiki/Virtual_address_space


So like allocating a large virtual address space and managing that on your own? That already exists, then, and the only remaining issue is making it look like you haven't, for some unclear reason.


The reason is the thread you are currently in :) It is common for people to complain that you are "using" the memory.


Each process has its own virtual address space. That’s half the fun of virtual addressing.

(and unfortunately, at least according to the L4 people, almost half the cost of a context switch - fixing the TLB)


No, he wants to separate virtual address space allocation from physical memory backing allocation, with both still happening manually. It's a very reasonable request, and the benefit is that you can be guaranteed to have the address space while still only using a smaller max amount of physical memory.


I don't interpret that as the request. I interpret it as wanting virtual memory "allocated" but not counted against the process's virtual memory allocation, and then getting a secondary mechanism to allocate the sorta-but-not-quite-allocated memory. I also don't see the value in that.

(I don't see any request for different physical memory allocation, so I assume that would still be handled by page faulting in the kernel.)


The value would be to be able to grow arrays and similar without relocation.

People do it all the time in real life, with area codes, zip codes, case numbers, etc.

For example, a long street full of strip malls in California will often have street numbers 50 apart to allow them to remain sequential even after new developments.

No one would complain that this is a wasteful use of precious street numbers, or that it deprives other streets of those numbers.

Imagine the nightmare if you periodically had to renumber all the buildings instead, like computers routinely have to.


How would this proposed mechanism allow that, and what about the current workings of virtual memory disallow it?

One can already do this with realloc() (https://linux.die.net/man/3/realloc). And if you don't want to use the malloc family, but instead want to manage your virtual memory yourself, you can easily allocate large sections of virtual memory and then manage it yourself - which would include enabling behavior such as growing arrays. (Even though you're really just re-implementing realloc-like behavior yourself.)

And to make sure we're on the same page (ha), I want to reiterate that there is a big difference between virtual and physical memory. You can allocate virtual memory without allocating physical memory.


I misunderstood what it was that you didn't see the value of.

The original point was entirely to avoid scaring newbies who can't tell virtual from physical, while still being liberal with your virtual usage.

Haskell with GHC has a great solution: just allocate 1TB up front. A newbie does not need improved tooling, a reworked memory system or any education to realize that this can't possibly be RAM.


This is kind of the point of this whole post :) Reserving lots of virtual memory causes posts like this, where people complain about how much memory you are "using".


I think that just kicks the ball further down the court, as it's yet another thing to track. I think it's better to just explain that (for the most part) high virtual memory usage isn't a problem. But as someone pointed out above, FreeBSD does have a facility for it, which is interesting.


On Windows that's VirtualAlloc with MEM_RESERVE:

> Reserves a range of the process's virtual address space without allocating any actual physical storage in memory or in the paging file on disk.

https://msdn.microsoft.com/en-us/library/Aa366887(v=VS.85).a...

Perhaps mmap() can achieve the same on Unices?


My first thought on unix was mmap() as well, but one of the requirements was that it doesn't "look like" the process is using that memory, which I interpret as "I don't want to allocate this as virtual memory." I don't think mmap() allows that, as its entire purpose is to allocate virtual memory and return a pointer to that memory space.

Personally, my feeling is: why does it matter if it "looks like" a process is using a lot of memory? That is, why does it matter if a process allocates a lot of virtual memory up front? It's not consuming physical memory, it's just updating some bookkeeping in the kernel. I know people feel uneasy about seeing large values for virtual memory, but... they shouldn't.


This is what Linux does by default for anonymous memory (it's called overcommitment). Go makes very liberal use of this, but it has other issues (namely Linux will kill a process if it goes over a certain overcommitment threshold).


I'm not sure overcommitment is the same thing: https://www.kernel.org/doc/Documentation/vm/overcommit-accou...

Reading that, I think overcommitment is to determine the kernel's behavior when you try to allocate more virtual memory than physical memory that is present on the system. That is a different (but related) concern from the fact that mmap() will allocate virtual memory but not physical memory. (That is, mmap() reserves locations in the address space, but you don't have any physical memory backing it until you use that memory.)


Yeah, sorry -- mmap()'s lazy allocation of pages is a separate but related concept to overcommit (I was writing a tirade about that in a separate thread and my wires got crossed).



