Virtual Memory Tricks (ourmachinery.com)
164 points by signa11 on Nov 7, 2017 | 43 comments



It might be worth pointing out that the OpenBSD developers replaced their malloc(3) implementation with one based on mmap(2) ... quite a while ago.

IIRC, it attempts to use end-of-page allocation by default for blocks of a certain size. It is also possible to enable guard pages on a per process basis or globally.

Thanks to this, memory corruption is much easier to detect and debug on OpenBSD.


Windows also has this with Page Heap mode.

https://msdn.microsoft.com/en-us/library/ms220938(v=vs.90).a...


Isn’t malloc based on mmap (or some similar equivalent like mach_vm_allocate) in all systems?


I think so in practice, but you could also use brk and sbrk.


Yeah, but brk amounts to the same thing, just worse in every way.


glibc malloc uses brk for small allocations (<128k) on Linux
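For reference, glibc exposes that threshold via mallopt(3); a small glibc-specific sketch (note the threshold is also adjusted dynamically by default, so this pins it):

    /* Sketch: tune glibc's brk-vs-mmap crossover with mallopt(3). */
    #include <malloc.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Ask glibc to serve anything >= 64 KiB via mmap instead of the brk heap. */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);

        void *small = malloc(1024);        /* comes from the brk-managed heap        */
        void *large = malloc(256 * 1024);  /* mmap'd directly, returned to the OS on free */

        free(small);
        free(large);
        return 0;
    }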


Yes, we know that OpenBSD is famous for ignoring performance. A syscall for every tiny malloc, where in normal systems the threshold is 128K.


Except it doesn't use a syscall for every malloc. That would be crazy. It's a conventional block allocator.


Not sure why you are getting downvoted.

IIRC, OpenBSD is slightly more sophisticated than issuing one syscall per call to malloc(3) and friends. Just slightly, but still. ;-)

Still, I do not like expressing disagreement by downvoting. Sorry about that.


Yeah, I was a bit wrong. It does a syscall for every 4K. Normal systems use a slow syscall only for malloc > 128K. I see no other advantage than extremism in security, and not giving a damn about performance.


Shared virtual memory: when a CPU allocates a region of memory, and then the GPU synchronizes to the same region of memory.

Any reads / writes to that virtual memory address work in both CPU and GPU code. So the CPU can do highly sequential tasks like balancing a binary tree (with all of the pointers involved), while the GPU can do highly parallel tasks like traversing the tree exhaustively. Pointers inside of this shared region remain legitimate, because each address is correctly identified between the CPU and GPU.

See this for example: https://software.intel.com/en-us/articles/opencl-20-shared-v...

I'm describing "Fine Grained Shared Virtual Memory" btw, in case anyone wants to know what to search for.

In effect, it works because the CPU does an mmap, and then the GPU does an mmap, and then the PCIe drivers work to keep the mmap'd region in sync. With modern acquire/release memory fence semantics + atomics properly defined, it's even possible to write "heterogeneous" code where the CPU and GPU work simultaneously on the same memory space.

In theory. I haven't tested these features out yet, and it's quite possible that current implementations are buggy. But the features exist and it's totally possible (to attempt) to do this sort of stuff on hardware today.
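For the curious, a rough sketch of what this looks like through the OpenCL 2.0 API (assumes a device reporting fine-grained buffer SVM via CL_DEVICE_SVM_CAPABILITIES; error handling and the actual kernel are omitted):

    #define CL_TARGET_OPENCL_VERSION 200
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id   device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

        /* Allocate 1 MiB of fine-grained shared virtual memory. The CPU can
           read/write it directly; a kernel receives the same address via
           clSetKernelArgSVMPointer. */
        int *shared = clSVMAlloc(ctx, CL_MEM_READ_WRITE |
                                      CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                 1 << 20, 0);

        shared[0] = 42;   /* plain CPU write, no map/unmap needed */
        /* ... build a pointer-based structure here, then enqueue a kernel with
           clSetKernelArgSVMPointer(kernel, 0, shared) to walk it on the GPU ... */

        clSVMFree(ctx, shared);
        clReleaseContext(ctx);
        return 0;
    }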


Hm I hadn't heard of this, that's interesting. That seems fragile but I'm not an expert in this area.

I found this recent CppCon video pretty interesting. It's about running vanilla C++ on (nvidia) GPUs and what the implications are for the C++ memory model:

https://www.youtube.com/watch?v=86seb-iZCnI


I like the article and the tricks. I recall seeing a presentation by a game developer who mentioned something similar: intermixing actual data pages with something else (read-only ones, or even unusable ones?) to trigger an error in case of an out-of-bounds read/write that reached far enough (which apparently happened often enough for this to be helpful).

The address space is not actually 64-bit, and it's not just the OS limiting it. The CPU/MMU/IDK (I'm not low-level enough to get all this stuff) does too in some ways. I assume these values mean that I can have 2^39 bytes of actual RAM and 2^48 bytes of virtual address space:

>[frex@localhost ~]$ cat /proc/cpuinfo | grep address

>address sizes : 39 bits physical, 48 bits virtual

>address sizes : 39 bits physical, 48 bits virtual

>address sizes : 39 bits physical, 48 bits virtual

>address sizes : 39 bits physical, 48 bits virtual

This also came up back when I was looking through the PUC Lua 5.1 code for a paper and (out of curiosity) looking into the code of PUC Lua 5.2 and 5.3 to see the changes, and into the code of LuaJIT for comparison.

LuaJIT uses top bits of a pointer to store type info in case of light userdata pointers: https://www.freelists.org/post/luajit/Few-questions-about-Lu...

This reminds me I should probably finally do my write-up, in English, of how the PUC Lua 5.1 VM works and implements its features. If someone's interested in any way, I'd appreciate a word letting me know there's a market for that, so I don't fear I'll be writing down stuff I already grok for no one to read and benefit from.



Yes, this presentation. I mistakenly thought it had to be a GDC one because it was about games.


> I assume these values mean that I can have 2^39 bytes of actual RAM and 2^48 bytes of virtual address space:

Yes. There’s no point having 64 physical address lines if 25 of them aren’t going to be used by anything.


Amiga programming manuals were very clear about not (ab)using the top 8 bits, even though the A1000/500/2000 were only equipped with 24 physical address pins (limiting memory to 16M). Of course Microsoft Basic for Amiga used the upper bits anyway, and when the 68020+ cards arrived it broke, since the upper 8 bits actually pointed somewhere.

What is inconceivable today will be reality at some point.


On amd64 (and IIRC also Alpha) the hardware checks that the unused bits of an address are all equal to the highest used bit (or IIRC all zero on Alpha) to discourage this kind of trick.

And I vaguely remember that ESA/390 is actually a 32-bit address architecture in hardware, and its 31-bitness comes from using the highest bit of an address as a flag that such tricks are taking place.


The same happened for Mac OS.


One use is tagged pointers, using the top unused address bits for e.g. type information (https://en.wikipedia.org/wiki/Tagged_pointer)
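A hedged sketch of what that can look like in C on x86-64 (assumes 48-bit virtual addresses; the names and helpers here are illustrative, not from any particular runtime):

    #include <stdint.h>

    #define TAG_SHIFT 48
    #define ADDR_MASK ((1ULL << TAG_SHIFT) - 1)

    /* Pack a 16-bit type tag into the unused top bits of a pointer. */
    static inline void *tag_ptr(void *p, uint16_t tag)
    {
        return (void *)(((uintptr_t)p & ADDR_MASK) | ((uintptr_t)tag << TAG_SHIFT));
    }

    static inline uint16_t ptr_tag(void *p)
    {
        return (uint16_t)((uintptr_t)p >> TAG_SHIFT);
    }

    /* Strip the tag before dereferencing; as noted elsewhere in the thread,
       bit 47 must be sign-extended again so the address stays canonical. */
    static inline void *untag_ptr(void *p)
    {
        uintptr_t v = (uintptr_t)p & ADDR_MASK;
        if (v & (1ULL << 47))
            v |= ~ADDR_MASK;
        return (void *)v;
    }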


We're using this technique to stuff the file descriptor in with the accompanying object for the epoll registration, so one can check whether the (reusable) object was reused for something else while registered with another file descriptor in the epoll: https://github.com/sociomantic-tsunami/ocean/blob/ee4f78d527...


LuaJIT is where I originally ran into this. It uses both NaN tagging with C doubles and tagged pointers with 64-bit pointers.


Where does the number 21 come from?


My brain said 64 - 39 is 21... my bad


> Also, it is possible that another set of pages could be allocated directly after ours in memory, in which case we wouldn’t detect overwrites beyond the last page. [...] For the second issue, we could reserve an extra page after our pages, but not commit that page. No other object could then claim those addresses and writing to them would still cause an access violation. (Note: This only works on Windows, where reserve and commit are separate operations.)

On Linux (and POSIX in general), you can do the same thing by allocating the guard page as described, then setting its protection to PROT_NONE using mprotect(2). Any access to the page will cause an error.
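A minimal POSIX sketch of that, assuming you're fine rounding the allocation up to whole pages (names are illustrative, error handling abbreviated):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void *alloc_with_guard(size_t size)
    {
        long   page   = sysconf(_SC_PAGESIZE);
        size_t usable = (size + page - 1) / page * page;   /* round up to pages */

        char *base = mmap(NULL, usable + page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;

        /* The last page becomes the guard: any read or write to it faults. */
        mprotect(base + usable, page, PROT_NONE);
        return base;
    }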


http://www.triplefault.io/2017/08/detecting-debuggers-by-abu...

>Since Windows will incorrectly assume that our int 3 exception was generated from the single-byte variant, it is possible to confuse the debugger into reading "extra" memory. We leverage this inconsistency to trip a "guard page" of sorts.


My favorite virtual memory trick, in the context of operating system kernel development, is to make one of the top-level page table entries point to the physical location of the page table itself. This makes it easy to manipulate the page table in the virtual address space, without turning off paging. Some articles on the topic:

* http://wiki.osdev.org/Page_Tables#Recursive_mapping

* https://medium.com/@connorstack/recursive-page-tables-ad1e03...
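For illustration, a hedged 32-bit x86 sketch of how the recursive slot is used once it's installed at page-directory index 1023 (kernel-context code; the constants follow the osdev.org convention linked above):

    #include <stdint.h>

    /* With PDE 1023 pointing at the page directory itself, all page tables
       appear as a 4 MiB window at 0xFFC00000 and the page directory itself
       shows up at 0xFFFFF000. */
    #define PAGE_DIR              ((volatile uint32_t *)0xFFFFF000)
    #define PAGE_TABLE(pd_index)  ((volatile uint32_t *)(0xFFC00000 + ((pd_index) << 12)))

    /* Return a pointer to the PTE mapping virtual address `vaddr`, without
       ever leaving the virtual address space or disabling paging. */
    static volatile uint32_t *pte_for(uintptr_t vaddr)
    {
        uint32_t pd_index = vaddr >> 22;            /* top 10 bits: directory slot */
        uint32_t pt_index = (vaddr >> 12) & 0x3FF;  /* next 10 bits: table slot    */
        return &PAGE_TABLE(pd_index)[pt_index];
    }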


Wouldn't using those tricks cause significant performance loss due to kernel syscall and page table updates needed for each virtual_alloc call?


I don't know for sure for Windows, but for Linux the answer for most of these is no. Until the physical memory is actually allocated, the page tables will not be touched. All that happens when you call `mmap` is that an entry is added to the internal table in the Linux kernel that describes your process's memory layout, which takes basically no time.

When you attempt to access the memory at one of those locations for the first time, though, you will take a performance hit: your program will page fault because the mapping for that address doesn't actually exist yet. The kernel will then allocate and map a new page and return to your program so you can execute the instruction again. After that first access there are no other performance penalties.

The unique ID thing may get expensive if you need tons of them, but since each address within a page is its own unique ID, you get a few thousand unique IDs per syscall, which should cut down on the number of syscalls you have to do. You could also allocate larger blocks of addresses if you wanted to, rather than just a single page.
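As a rough illustration of that scheme (single-threaded sketch, assuming a 4K page size; the function name is made up and error handling is omitted):

    #include <stdint.h>
    #include <sys/mman.h>

    static uintptr_t next_id;
    static size_t    ids_left;

    uintptr_t unique_id(void)
    {
        if (ids_left == 0) {
            /* PROT_NONE: we never touch the memory, we only want the addresses. */
            void *p = mmap(NULL, 4096, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            next_id  = (uintptr_t)p;
            ids_left = 4096;
        }
        ids_left--;
        return next_id++;   /* each byte address in the page is one ID */
    }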


Yeah, using a syscall to get a unique ID is kind of controversial...


You don't do a syscall for _every_ uniqueId, only once every PAGE_SIZE. It's not that expensive.

Not every object needs a unique identifier. Objects that need them tend to be larger and have indeterminate lifecycles. Otherwise the overhead of storing a unique id would by itself be egregious, regardless of the process of generating the id. If you can afford to store the id then you can afford the overhead of an occasional syscall.


There are secondary effects that also make this not a great idea.

You may wind up slowly leaking VMAs (the kernel needs to track what’s been mapped, even if it doesn’t allocate physical pages). This makes future mmap calls and page faults slower.

Generating a uniqueID is trivial and super fast (one atomic operation). What’s the benefit here?
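For comparison, the atomic-counter version (a C11 sketch):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t id_counter;

    /* One relaxed fetch-and-add per ID, no syscall, no address-space cost. */
    uint64_t unique_id(void)
    {
        return atomic_fetch_add_explicit(&id_counter, 1, memory_order_relaxed);
    }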


yes...yes they would.


The author's virtual_alloc() function is mostly redundant, as most malloc() and calloc() implementations will use mmap() internally if the allocation is large (the result being that physical pages won't be allocated until you touch the memory).

also I think that double mapped ring buffer implementation is bogus without a volatile keyword in there somewhere


> also I think that double mapped ring buffer implementation is bogus without a volatile keyword in there somewhere

I agree the ring buffer is problematic. The `read` function for the `ring_buffer_t` is a bad design - it returns a pointer into the actual ring_buffer_t buffer, but also increments the read pointer. There's no guarantee for how long the returned pointer is actually valid - eventually writers will overwrite it and you will have no idea. I suppose if you're only looking at a single-threaded environment then it's not as big of a deal, but for the normal implementation `read` should just do a `memcpy` like `write` does.

As for `volatile`, making the `data` pointer `volatile` should make this valid - though you can't use `memcpy` after that, so you have to do the loop yourself unfortunately.

That said, I think you could "cheat" (IE. don't do this) and ensure the correct behavior by changing `ring_buffer_t` to have two `data` pointers into the buffer instead of one - one for reading, and one for writing. The compiler would have to assume the two pointers can alias each other at any point within the buffer, which should force it to produce correct code even with multiple reads and writes in the same function (because when you do a write, it has to assume data it previously read may have changed). I don't think there's any guarantee, though - a sufficiently smart compiler could make inferences that would break this. In particular, since C assumes a linear memory layout within an object, the compiler could reason "if address 1 aliases the first byte of the other buffer, then address 4097 doesn't", which would obviously not be true here. Unless you hard-coded the addresses, I don't think any current compiler would make such an optimization - but again, nobody should do this, and I don't know for sure. It's more of a curiosity than a good idea.
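For what it's worth, here is a hedged sketch of one common Linux way to set up the double mapping itself, independent of the article's Windows code (assumes memfd_create(2), i.e. Linux 3.17+ / glibc 2.27+; error handling abbreviated):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map the same physical pages twice, back to back, so a read or write
       that runs off the end of the first mapping continues in the second.
       `size` must be a multiple of the page size. */
    static uint8_t *ring_create(size_t size)
    {
        int fd = memfd_create("ring", 0);
        if (fd < 0 || ftruncate(fd, size) < 0)
            return NULL;

        /* Reserve 2*size of contiguous address space, then overlay the same
           file over both halves. */
        uint8_t *base = mmap(NULL, 2 * size, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;

        mmap(base,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        close(fd);
        return base;   /* base[i] and base[i + size] now alias the same byte */
    }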


realloc() is very efficient on most POSIX systems (Linux included) because the C library (e.g. glibc) uses mmap techniques. In contrast, on Windows the realloc() provided by the Microsoft C library is slow on large arrays because it does not use such techniques (as far as I know, the only option is to use the VirtualAlloc() function and tune things by hand), involving a memory copy on every resize if the allocator has no adjacent memory reserved. So this post is interesting for Windows users working with large arrays and realloc() calls. (The typical way of reducing the impact of a non-optimal realloc(), e.g. on systems without an MMU, is geometric space allocation, which reduces the number of calls to the re-allocator: when growing, instead of adding space for one element, grow the array by 25% of its size.)
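A small sketch of that geometric-growth idea for readers who haven't seen it (the names and the 25% factor are illustrative):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char  *data;
        size_t len, cap;   /* bytes used / bytes allocated */
    } buffer_t;

    static int buffer_append(buffer_t *b, const void *src, size_t n)
    {
        if (b->len + n > b->cap) {
            size_t new_cap = b->cap + b->cap / 4;            /* grow by ~25% ...      */
            if (new_cap < b->len + n) new_cap = b->len + n;  /* ...but at least enough */
            if (new_cap < 64) new_cap = 64;
            char *p = realloc(b->data, new_cap);             /* O(log n) reallocs overall */
            if (!p) return -1;
            b->data = p;
            b->cap  = new_cap;
        }
        memcpy(b->data + b->len, src, n);
        b->len += n;
        return 0;
    }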


On Linux, one used to be able to execute mmap() on /proc/self/mem. I used this to implement bank switching for an emulator. Despite Torvalds' recent rant about preserving backward compatibility for userspace, this feature was removed at some point.


Did you ever complain on the list, or look into why it was removed?

I'd be really curious why it was removed myself.


I didn't complain. At some point I upgraded my kernel and the code stopped working. This was over 10 years ago. I seem to remember going to the list and seeing that it had been removed due to a security problem. I also seem to remember a Torvalds line about only crazy people wanting to use such a feature. I had lost interest in maintaining the code, so I didn't pursue it.


why is that necessary in this case vs some other use of mmap?


It allowed one to map the zero page to a region and then quickly swap it for another region. This benefited code that emulated 16-bit address spaces.


so, it made the equivalent portable code easier to write, am I correct?


No, if I remember correctly, there was no other way to do what I was doing without using some file on the filesystem as the chunk of memory to be swapped. Calling mmap() on /proc/self/mem allowed one to make a block of memory show up at the zero page and swap it with another with a single system call, instead of copying the block back and forth. It has been at least a decade since I touched that code, so I may be misremembering.



