This reminds me of a horrible hack I once put into an internal Ruby (Sinatra) REST API server that had to aggregate data from a ton of SOAP servers in a pretty large environment and return some computed results as JSON. The amount of XML that had to be parsed into objects to do the reporting was massive, and the web workers (being Ruby) never gave the memory back to the OS. Running the code inside a Unicorn process would basically cause it to start using several GB of RAM, and over time all the Unicorn workers would become bloated and the service would have to be restarted.
I eventually decided to just fork() the Ruby process, do the processing, compute the JSON results (which were comparatively small), and hand the results back as a string over a socket to the parent process, which would cache them and respond. It pretty much solved the memory issue, since the forked process would just exit() and give all the memory back.
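For anyone who hasn't done this before, the shape of the trick looks roughly like this, sketched in C rather than the Ruby it was actually written in (names, the pipe instead of a socket, and the minimal error handling are purely illustrative):

    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) != 0)
            return 1;

        pid_t pid = fork();
        if (pid < 0)
            return 1;

        if (pid == 0) {
            /* Child: do the memory-hungry work, write the small result back,
             * then exit -- the OS reclaims everything the child allocated. */
            close(fds[0]);
            const char *json = "{\"report\": \"...\"}";   /* stands in for the heavy computation */
            write(fds[1], json, strlen(json));
            close(fds[1]);
            _exit(0);
        }

        /* Parent: read the result, cache it, respond; its own heap never grows. */
        close(fds[1]);
        char buf[4096];
        ssize_t n = read(fds[0], buf, sizeof buf - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("child result: %s\n", buf);
        }
        close(fds[0]);
        waitpid(pid, NULL, 0);
        return 0;
    }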
In fact, I think I ended up doing a GC.disable() inside the forked process (because I was going to exit anyway) and it sped the process up from taking >2 minutes to just a few seconds. Ruby's GC was utterly terrible back then.
It was an internal tool that was never touched by the general internet, which is why we could "afford" to have it do such intensive work behind a single request, but I always wonder if there was a better approach that wasn't so awful.
Fun fact: Erlang is built to assume that most processes are "missiles" in this sense. GC is quite slow (though only blocks the individual actor being collected), but most actors never perform enough allocations+deallocations across their lifetime to ever incur a GC collection.
One of the main ways to tune Erlang systems for speed is to figure out the amount of memory each type of actor will need, and pass an option to spawn the actor with an arena pre-allocated to that size, so that it can just quietly do its job and quit, never having asked to allocate or deallocate at all.
For the third type of "leak" (GC frees memory from the program, but never returns it to the OS) the article says that this is often an issue for Python and Ruby... Does this happen in other languages with a GC like Java or Go, or are Ruby and Python the only ones/especially bad?
It happens in other VMs. It can happen without a VM if whoever wrote your platform’s memory allocation system hasn’t read anything from the last thirty years (“first fit” versus “best fit”).
I remember hearing about a server program that worked around this problem by creating two memory arenas. All allocations came from, say, arena A until it was full, then they came from arena B. That’s the whole strategy. Things were still deallocated normally. The basic idea was that by the time arena B was full, everything in arena A would have been cleaned up normally and the memory allocator would have coalesced adjacent free memory; in other words, the strategy was to guarantee the allocator had a chance to clean up fragmentation (usually the allocator doesn’t need the help, but this approach guaranteed correct behavior as long as everything was eventually cleaned up; i.e., there were no classic leaks).
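As a toy illustration of the idea (not the program in question; a bump allocator plus a reset-when-empty rule stands in for a real free-list with coalescing, and alignment is ignored):

    #include <assert.h>
    #include <stddef.h>

    #define ARENA_SIZE (1 << 20)

    typedef struct {
        unsigned char buf[ARENA_SIZE];
        size_t used;   /* bump pointer */
        size_t live;   /* number of outstanding allocations */
    } Arena;

    static Arena arenas[2];
    static int current = 0;   /* which arena new allocations come from */

    static void *arena_alloc(size_t size) {
        Arena *a = &arenas[current];
        if (a->used + size > ARENA_SIZE) {
            /* Current arena is full: switch to the other one.  By now the
             * previous tenant of that arena should have freed everything,
             * letting us reset (i.e. "defragment") it wholesale. */
            current ^= 1;
            a = &arenas[current];
            assert(a->live == 0 && "previous generation not fully freed yet");
            a->used = 0;
        }
        void *p = a->buf + a->used;
        a->used += size;
        a->live++;
        return p;
    }

    static void arena_free(void *p) {
        /* Figure out which arena the pointer came from and mark it freed. */
        for (int i = 0; i < 2; i++) {
            Arena *a = &arenas[i];
            if ((unsigned char *)p >= a->buf && (unsigned char *)p < a->buf + ARENA_SIZE) {
                a->live--;
                if (a->live == 0 && i != current)
                    a->used = 0;   /* arena fully drained: reclaim it in one go */
                return;
            }
        }
    }

    int main(void) {
        void *p = arena_alloc(100);
        void *q = arena_alloc(200);
        arena_free(p);
        arena_free(q);
        return 0;
    }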
Some systems forego memory freeing for performance gains, especially in environments with fixed allocations per process, e.g. Kubernetes. TCMalloc [1] intentionally doesn't release memory to the OS, keeping your max memory size forever.
In C# it cannot happen: the garbage collector is compacting.
In the past you could have had this problem with the large object heap (which stores objects larger than 85 KB), because the GC there is not compacting.
From .NET 4.5 or thereabouts they added an option to trigger a compacting GC on the large object heap, so this problem is solved.
The article is missing maybe the biggest leak in GC languages: resources that are not properly closed. A classic example is not unsubscribing from an event in .NET, which can cause massive leaks if you have a main window whose ViewModel listens to events from every child ViewModel but never unsubscribes when the child window is closed.
As a result, the more child windows you open and close, the more memory leaks.
I'm surprised the article didn't point out that this applies to C and C++, considering that these are usually thought of as being more efficient than higher-level languages. On Unix-like operating systems (and Windows too I believe), all of a process's data is in a single contiguous block. You can grow or shrink that block by calling the brk() [1] system call, but this allows you to return memory to the operating system only if it's at the end of that block – there's no mechanism to free pages in the middle.
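For the curious, the mechanism itself is easy to poke at through sbrk(), the library wrapper around brk() (illustration only; application code shouldn't normally touch the break directly, and the 64 KiB figure is arbitrary):

    #define _DEFAULT_SOURCE   /* for sbrk() on glibc */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        void *before = sbrk(0);       /* current program break (end of the data segment) */
        sbrk(64 * 1024);              /* grow the segment by 64 KiB */
        void *after = sbrk(0);
        printf("break grew from %p to %p\n", before, after);
        sbrk(-(64 * 1024));           /* only memory at the end can be handed back like this */
        return 0;
    }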
The big problem for these languages is that you can't move objects around in memory, because other objects may have pointers to them and there's not enough metadata at runtime to figure out where those pointers are and fix them up to point at a new location. This means that if you have a large amount of free memory with just one little object allocated after it, there's no way, even in principle, to return that free memory to the operating system. In practice I don't think any C allocators even try to return memory to the OS.
There are a couple of mitigating factors to this though:
* If you allocate a huge amount of memory and then free it and don't use much afterwards, eventually the OS will move those unused virtual memory pages to disc in the page file. Your process will still appear to be using a lot of memory, but in practical terms the system will behave as if you had freed it.
* You can allocate memory using memory-mapped files [2], a term that gets stretched here to include memory allocated by this mechanism but not necessarily backed by an actual file. These mappings are not in the contiguous block of process data, and their memory pages really are returned to the OS when the file is unmapped. This requires quite a lot of manual effort on the part of the developer, but for some applications it's worth it (a sketch follows below).
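Roughly, the anonymous-mmap flavor of this on Linux looks like the following (a sketch; the size and flags are just for illustration):

    #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on some libcs */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64 * 1024 * 1024;            /* 64 MiB, chosen arbitrarily */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(p, 0, len);                        /* touch it so it's really resident */
        munmap(p, len);                           /* pages go straight back to the OS */
        return 0;
    }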
Note that some allocators (including the standard glibc allocator) do use mmap internally for some allocation sizes.
In that case the memory can be returned to the OS when the whole block is freed.
So, the general point about fragmentation, non-moving memory management, and process heaps versus what can actually be returned to the OS is true and common, but your “single contiguous block” model is somewhat off:
On at least modern GNU/Linux, malloc can and does still use brk/sbrk, but will also use mmap for particularly large allocations, as can (on my system) be shown by stracing a C program along these lines:
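    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        /* Small allocation: served from the brk/sbrk heap on glibc. */
        char *small = malloc(1024);
        memset(small, 1, 1024);

        /* Large allocation, well above glibc's default mmap threshold
         * (128 KiB): strace shows an mmap for this one instead. */
        char *big = malloc(64 * 1024 * 1024);
        memset(big, 1, 64 * 1024 * 1024);

        free(big);      /* handed straight back to the OS via munmap */
        free(small);
        return 0;
    }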
On Windows, HeapAlloc is at least presumed to use VirtualAlloc-or-equivalent underneath, and there is no brk/sbrk.
Regarding the other two factors you mention: firstly, if you make a large number of allocations on a brk-based heap and then free most of them, a page from that heap can still only be paged out if none of the live allocations intersects it. (If you make a single large allocation, see above re automatic use of mmap.) If the heap is highly fragmented, you may not be able to page out many pages. If the heap is a brk-based heap and is not highly fragmented after the free phase, it's more likely there would be a bunch of space near the top, and then malloc could shrink the heap at the OS level too.
One allocation pattern I can imagine where what you say would have a higher chance of happening, though, is "allocate many times for intermediary data during an operation, then allocate for the result, then free all the intermediary data", in which case the result might well be stuck at the top of the heap with a bunch of dead pages underneath. But you still take lazy disk I/O doing the paging, and you still take a reservation across system virtual memory as a whole. On the whole I'm not all that convinced that this is a "mitigation" that's worth paying attention to from a developer perspective; it might be more useful from a sysadmin perspective.
For the second, terminology-wise, I've never heard “memory-mapped file” used to refer to anonymous mmap; I think I've only ever heard it called mmap in a Unixy context. On Windows, VirtualAlloc and MapViewOfFile are separate API functions, and VirtualAlloc is the equivalent of anonymous mmap.
Windows allows the creation of additional heaps for HeapAlloc such that a heap can be destroyed all at once, and I presume these use their own VirtualAlloc regions underneath. There are various C libraries that allow arena-based allocation on Unixy systems as well. These can even both reduce manual effort and increase performance if your memory usage patterns involve a large number of allocations for some operation followed by freeing everything, because then you don't have to free every block individually; you just throw away the entire arena, like an in-process equivalent of what exit() does to whole-process memory resources. Your main option in C if malloc is causing fragmentation/performance problems for you isn't usually to jump to raw mmap unless you're writing your own allocator on top of that; it's to jump to picking a more advanced malloc and maybe giving it some extra context about what you're doing with the memory.
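A minimal sketch of the Windows private-heap pattern (sizes arbitrary, error handling trimmed):

    #include <windows.h>

    int main(void) {
        /* Private, growable heap: 1 MiB initial size, no maximum. */
        HANDLE heap = HeapCreate(0, 1 << 20, 0);
        if (!heap)
            return 1;

        for (int i = 0; i < 1000; i++) {
            void *p = HeapAlloc(heap, 0, 4096);
            (void)p;   /* ... use the block for the duration of the operation ... */
        }

        /* No per-block HeapFree needed: destroying the heap gives all of its
         * memory back to the OS in one call. */
        HeapDestroy(heap);
        return 0;
    }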
Thanks for correcting my misunderstandings, this is really interesting. There's quite a bit of subtlety in allocations compared to my (old) mental model of malloc calling brk().
I think you're right that I accidentally made up the usage of "memory-mapped file" for mapped memory that doesn't correspond to any file. But it is possible to create such mappings on Windows with MapViewOfFile; this function takes a handle returned from CreateFileMapping, which can accept a file handle of INVALID_HANDLE_VALUE [1].
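Something like this, for the pagefile-backed case (size arbitrary, error handling trimmed):

    #include <string.h>
    #include <windows.h>

    int main(void) {
        const DWORD size = 64 * 1024 * 1024;   /* 64 MiB */

        /* INVALID_HANDLE_VALUE means "no file": the mapping is backed by the
         * pagefile rather than by anything on disk. */
        HANDLE mapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                           PAGE_READWRITE, 0, size, NULL);
        if (!mapping)
            return 1;

        void *view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, size);
        if (view) {
            memset(view, 0, size);        /* use the memory ... */
            UnmapViewOfFile(view);        /* ... then the pages really do go back */
        }
        CloseHandle(mapping);
        return 0;
    }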
IIRC the Go GC has similar behavior, because it is a non-moving GC, which keeps it very fast and C-friendly.
IIRC Java and C# have a moving GC, so the GC may decide to move an object and will update all the references to make the move transparent. This is slower, and you can't pass pointers to an external function that is not aware of the GC. But the memory is compacted, which avoids the #3 type of leak, and allocating new memory is also faster because the allocator has a big block of free memory instead of having to look up a good place in the fragmented free memory.
Without compaction this is often very difficult to do, because you don't have large chunks of completely empty memory pages that are convenient to return to the OS; often you end up with a single object, or just a free block header, sitting on an otherwise empty page. If your malloc library coalesces free regions (or doesn't store its metadata in the heap at all) then this may be a bit easier.
In any case, the GC or malloc library can use madvise() with MADV_DONTNEED (or the portable posix_madvise() equivalent). This tells the OS that one or more pages are now junk and can be forgotten, but it leaves the pages mapped. If the process touches them again, the kernel will take a page fault and allocate a fresh page of physical memory; otherwise the physical memory is available for the OS to reuse without swapping anything to storage.
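A minimal sketch of what that looks like from the allocator's point of view (here against a plain anonymous mapping rather than a real malloc heap; sizes arbitrary):

    #define _DEFAULT_SOURCE   /* for madvise() and MAP_ANONYMOUS on glibc */
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 16 * 1024 * 1024;
        char *heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (heap == MAP_FAILED)
            return 1;

        memset(heap, 1, len);              /* fill it: pages are now resident */

        /* Tell the kernel the contents are junk.  The mapping stays valid;
         * touching the pages later just faults in fresh zero pages. */
        madvise(heap, len, MADV_DONTNEED);

        heap[0] = 2;                       /* still legal -- faults in a new page */
        return 0;
    }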
Modern, well-designed virtual memory systems will defer acting on this advice until the memory is actually needed. If the process touches the pages before they are reclaimed, very little actual work is performed.
Compacting GCs are another story; they can always wait some time period after a new high-water mark is reached and then madvise() away the empty heap areas with little penalty.
I'll also caution people to remember that free memory is wasted memory. If a program isn't using it then the file cache should be.
I encountered a similar issue while learning Objective-C, with autorelease pools on macOS (probably iOS too): if you allocate and autorelease ObjC objects on a thread without an autorelease pool, or never drain the pool, those allocations will still be billed to your application and not available to the OS.
That may (or may not) be true of the GC heap, but the GC heap is only one consumer of JVM memory allocations. I don't know enough to confirm one way or another.
Speaking from experience, OpenJ9's JIT compiler aggressively returns transient compilation memory back to the operating system.
I’m surprised I don’t see reference cycles listed. Strictly speaking I suppose they’re a subset of “unreachable allocations”, but they have a different feel, particularly as they can occur even with GC.
Garbage collection can handle reference cycles -- that's why it is better than reference counting (at handling the eventual freeing of arbitrary data).
I've encountered type 2 leaks while working on a popular web browser. It wasn't a type 1 leak, so it got overlooked.
One example: on pages that self-refreshed at regular intervals, the same page URL was added to the history list each time. That ended up leaking a ton of memory if you left the page open over several days.
Sort of. I don't know if I'd call it irony, exactly—the physical metaphor would perhaps be that the garbage truck comes by once a week but I still have to remember to take the bins out to the curb. GC definitely gets overhyped for beginners, though. Most docs/tutorials "forget" (see what I did there?) to mention memory management until the situation becomes so complex that it's unavoidable, by which time the concept is unnecessarily difficult to map onto established patterns.
> If you’re writing in C or C++, especially C++ without ubiquitous use of smart pointers to manage memory lifetimes, this is your first guess.
What absolute rubbish. Ever heard of a constructor/destructor (aka RAII)? You can go really far without using any new/delete or a smart pointer, and this probably should be the case all the way up to the OS boundary.
Responding to a suggestion to use smart pointers with a suggestion to use RAII instead doesn't make sense, because smart pointers are themselves an example of RAII. I assume what you meant was using stack allocations to avoid heap allocations, except (presumably) via library containers like the STL.
This is a good idea where it can be done, and it's worth noting that move semantics replace some uses where a unique pointer would have been necessary pre-C++11. But there are plenty of applications where it's really not possible. A big one is that if you're using inheritance and virtual methods (out of fashion though they are), then you'll almost certainly need a pointer. Plenty of large, complex applications cannot avoid objects with complex lifetime rules, which shared pointers track nicely, usually because objects are shared between threads or network requests. For example, if you're using ASIO (the basis of the proposed standard library networking), it's hard to imagine how you'd avoid using shared pointers.
It was a reference to Jonathan Blow's presentation on ideas for a language for game programming, where he goes in depth on RAII, C/C++, Go, Rust, and other modern language developments.