Hidden Costs of Memory Allocation

gargantian · on Dec 10, 2014

I'm actually surprised virtualized page-zeroing isn't a memory controller feature at this point. With all of the things modern CPUs do or you, it seems crazy that everyone is running CPU threads that waste time and voltage to pump a zero page queue.

peterfirefly · on Dec 10, 2014

The Mill, if it ever gets built, does something like that.

http://millcomputing.com/wiki/Memory#Implicit_Zero_and_Virtu...

"The big gain for this is that the OS doesn't have to explicitly zero out new pages, which would be a lot of bandwidth and time, and accesses to uninitialized memory only take the time of the cache and TLB lookups instead of having to do memory round trips."

adwn · on Dec 10, 2014

The Mill [1] does/will do that: When memory is allocated, it is served directly be the cache, which implicitly zeroes the respective cache line (without involving the main memory).

A similar mechanism zeroes a stackframe upon a function call (and function return, I think), which eliminates most attacks that exploit buffer-overflows.

[1] http://millcomputing.com/docs/memory/

oso2k · on Dec 10, 2014

So a Mill CPU can never do shared-memory process parallelization? Two or more processes accessing the same memory could be a hazard. Seems like it would suffer the same issue IA-64 has with parallelism. However, the conveyor belt analogy simplifies register spilling & related issues.

adwn · on Dec 11, 2014

> So a Mill CPU can never do shared-memory process parallelization?

Why not? The caches will still be coherent.

crest · on Dec 11, 2014

Only if you are crazy enough to share stacks in which case you deserve what you get.

stephencanon · on Dec 10, 2014

It's a little bit silly, but zeroing a page is an extremely cheap operation (far cheaper than just about anything you are reasonably going do with the page once it's zeroed -- on the order of a hundred cycles is pretty typical these days). That said, yes, it is a cost, and it's not crazy to want to address it with HW.

FWIW, both powerpc and arm64 have a "just zero the damn cacheline" instruction. That's not quite what you want, but it is quite useful.

brucedawson · on Dec 11, 2014

A hundred cycles to zero a page? If 4 KB can be written in a hundred cycles then, assuming a 3 GHz processor, that's 30 million pages per second or ~120 GB/s. That's pretty fast memory. On x86/x64 processors, which lack a zero-the-cacheline instruction, the memory will also end up being read, so you need 240 GB/s to clear pages that quickly. This ignores the cost of TLB misses, which require further memory accesses.

stephencanon · on Dec 11, 2014

The zeros don't need to get pushed to memory immediately. They go to cache, where they will typically be overwritten with your real data long before they are pushed out to memory. That push of your real data would have needed to happen anyway, so there (usually) minimal extra cost associated with the zeroing.

There are, of course, pathological cases where you touch one byte on a new page, and then don't write anything else before the whole thing gets pushed out, but they are relatively rare in performance-critical contexts.

renox · on Dec 11, 2014

> but zeroing a page is an extremely cheap operation

Uh? Anything which touch main memory is NOT 'an extremely cheap operation': that's why we have registers, L1, L2 (even L3) caches!

stephencanon · on Dec 11, 2014

The whole point of cache is that you don't need to go to main memory every time you do an operation. If you don't write anything else to the page, those zeros will be written out to main memory eventually, but if you weren't going to write anything, there was no need to allocate the page to start with. So the most common scenario is the zeros get written to L1, which is fast, and then your real data overwrites it in L1 before it's pushed out to lower levels of the cache or memory. The initial write may require a read-for-ownership from memory, depending on the exact cache semantics, but if that's required, it would be required for you to write your data anyway.

tehwalrus · on Dec 10, 2014

I know that in C it is relatively easy to define your own memory management functions (I have written my own, complete with very primitive garbage collector, in an attempt to track down memory leaks before I had heard about Valgrind).

I believe Python essentially does this under the hood by allocating buffers and managing its own memory requests on them (allocating more when it needs more, obviously).

I wonder if it is possible to do this kind of thing in C++? It occurs to me that it might be hard to write platform-independent code which mimics the behaviour of the `new` keyword on a preallocated buffer. A quick Google reveals some syntax[1] that I'd not seen before which allows one to do this, although it looks like your call would look something like:

    MyType* instance = new (my_malloc(sizeof(MyType))) MyType();

which is hardly concise. Perhaps a macro called my_new would allow you to text-transform your way there without as many parens or repeats of MyType.

In any case, asking for all your RAM at the beginning, and then re-blanking and re-using it, is certainly much faster for certain problem types (3D games with physics emulation, some scientific problems.)

[1] http://stackoverflow.com/questions/8301043/create-objects-in...

disclaimer: This post is about reinventing too many wheels, and should be taken with a pinch of salt! Moving malloc calls out of loops (and replacing them with hand-written zeroing functions) is probably the most useful tip in practise.

pjc50 · on Dec 10, 2014

In C++ you're supposed to do this with std::allocator and alternatives. This allows you to change the allocator used by vector<>, map<>, etc. Or you use the placement new syntax you found with operator overloading (yes, you can overload "new MyType()")

You're really not supposed to do it with macros.

scott_s · on Dec 10, 2014

To be clear, there are four options you mentioned:

1. Changing the allocator used by the standard library by passing in a special allocator to instantiations of std::vector, std::map, etc.

2. Allocating raw memory elsewhere, and instantiating an object with that memory using the placement syntax for new.

3. Overriding operator new on just a particular class. This does not change the global new; it just means that objects of this class will be allocated with this version of new.

4. Overriding the global definition of new. All objects whose classes do not define their own operator new will be allocated with your own, global new.

More details here: http://en.cppreference.com/w/cpp/memory/new/operator_new

tehwalrus · on Dec 10, 2014

std:allocator and overloading new sound like much better alternatives, thank you.

Someone · on Dec 10, 2014

In C++, apart from the placement new that others mentioned, it is possible to globally replace new and delete by writing the 4 functions

    void* operator new  ( std::size_t count );
    void* operator new[]( std::size_t count );
    void operator delete  ( void* ptr );
    void operator delete[]( void* ptr );

(from http://en.cppreference.com/w/cpp/memory/new/operator_new and http://en.cppreference.com/w/cpp/memory/new/operator_delete)

If you do it that way, you do not need to change any of the code that calls new and delete.

And I disagree about the 'relatively' in "I know that in C it is relatively easy to define your own memory management functions"

It is easy to piggy-back on an existing memory allocator, and it is fairly easy to make an implementation that works from a single thread, or an implementation that performs horribly from multiple threads, but it is not easy to make one that performs well from multiple threads and is useful.

[That "and is useful" is from the hacker in me. It is trivial to make what I think is a conforming implementation that multi-threads exceptionally well along the lines of (I am guessing at the prototypes):

    void * malloc(size_t t)            { return NULL;}
    void * calloc(int n, size_t t)     { return NULL;}
    void   free(void *m)               {}
    void * realloc( void *m, size_t t) { return NULL;}

but that would not be a very useful implementation]

tehwalrus · on Dec 10, 2014

I should clarify, I wasn't actually making my own malloc, I was only macro-switching it so that I could keep a list of all the malloc call returns. I was also interrupting free, so I could see which objects had been malloc'd but not free'd. I think I mentioned it was to track down memory leaks.

The example I quoted in Python's implementation, though, seems to hold truer - rather than calling malloc whenever it needs more space it allocates buffers and hands out memory from them, reducing the frequency of calls to malloc, and meaning that its garbage collector sometimes can't free as much RAM as you hoped (since there are some nearly-empty buffers that can't be released.) This makes running python instructions which need a little more RAM much faster, at the expense of complexity when RAM is short, of course. And overall the program should be faster, since it has fewer system calls (assuming that the code which hands out RAM is well optimised, which should be possible since it can be much simpler than the system malloc.)

kabdib · on Dec 10, 2014

... as long as you understand what's going on with dynamically loaded code, and you try to free something that was allocated by a module that uses a totally different allocator.

C++ standard meets modern runtimes. Wheee! :-)

maxlybbert · on Dec 10, 2014

The allocators that appeared in the first C++ Standard were designed by the Committee and turned out to not be as useful as expected. Both Electronic Arts ( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n227... , https://github.com/paulhodge/EASTL/ ) and Bloomberg ( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n185... , https://github.com/bloomberg/bde ) proposed a different approach, which allow what you want. C++11 includes stateful allocators and scoped allocators.

cdevfr · on Dec 10, 2014

Sure it is possible. Just pass the constructor your type:

my_type* t = New( MyType ) ;

Where MyType passes information to New() of what to create.

...

I don't know what I'm downvoted.

This clearly works. The naive solution without macros would have MyType be a simple integer and New() has a switch statement. But it is easy to avoid that.

pmr_ · on Dec 10, 2014

You are being downvoted because the language comes with well-defined and specifically designed ways to customize allocation/deallocation. Your approach choses to ignore those facilities and even adds a runtime decision to code that absolutely has no need for it. You are reinventing the wheel and it is not even round.

pjc50 · on Dec 10, 2014

Because what you've written doesn't make much sense? You can't pass a type as a function parameter to a user defined function, and if you mean the builtin "new" (no capital) then that uses the normal allocator.

fla · on Dec 10, 2014

I think he meant that MyType is an enum and New is a user-defined function with a huge switch that does allocate (with placement new?) the correct type based on the value of MyType. Yea works, but introduces runtime overhead and is the gate to maintenance hell.

cdevfr · on Dec 10, 2014

It is basic OOP in C. Look it up.

bluetomcat · on Dec 10, 2014

Does Windows really keep a pool of zero pages? With COW, a single zero page is sufficient -- any copying would be done when a write is attempted over the page.

ars · on Dec 10, 2014

COW isn't free. You still have to spend the time doing the copy.

Better to zero memory ahead of time.

fulafel · on Dec 10, 2014

Windows-style ahead of time zeroing thread will typically take the cache misses and memory bandwidth for that page twice. (Assuming your code subsequently puts its own data on the pages, instead of just ordering up zeroed pages to sit on)

kevingadd · on Dec 10, 2014

Are you certain it's not bypassing cache on the writes? It's been relatively straightforward to do this on x86 for ages now, especially if you're using SSE to fill many bytes at once. I would be shocked if the page zeroing thread did cached reads or writes.

brucedawson · on Dec 11, 2014

The page-zeroing code may be able to minimize its effect on the cache, but it will then necessarily consume memory bandwidth -- 4 KB of bandwidth per page zeroed as it writes each page. So, it still affects overall performance.

And it guarantees that when a process goes to use the pages they will not be in the cache. Ah, tradeoffs.

ars · on Dec 11, 2014

It does it when the system is otherwise idle.

> And it guarantees that when a process goes to use the pages they will not be in the cache.

There isn't much point in caching a page full of zeros. Let the cache fill when there is actual data.

BBlarat · on Dec 10, 2014

From what I read in the linked msdn blog link (http://blogs.msdn.com/b/tims/archive/2010/10/29/pdc10-myster...) it sure seems like it's a list.

bartwe · on Dec 10, 2014

Which makes it an ondemand cost with added latency instead of something that can be done ahead of time.

on Dec 10, 2014

[deleted]

cbd1984 · on Dec 10, 2014

> The article states that "relational databases" arose to conserve "rare" and "expensive" disk space.

It seems that navigational databases, which relational databases largely replaced, would be even more sparing of disk space, which makes me think that claim is rather suspect.

How about another explanation:

Relational databases are to data what structured programming (using for loops and while loops and functions and local variables instead of, you know, not) is to code: It reduces the potential for error by, first, encouraging single-point-of-truth via normalization, and, second, encouraging people to make as many tables as possible automatically via queries which use relational algebra.

In short, every table knows about one kind of thing. Everything your database knows about that kind of thing is in that one table. If you want to know more than one kind of thing at a time, you create a new table automatically with a query.

Compare: Every function does one kind of thing. Everything your program knows about doing that kind of thing is in that function. If you want to do more than one kind of thing, you compose functions and let the details of flow control be handled automatically by the runtime.