It might be worth pointing out that the OpenBSD developers replaced their malloc(3) implementation with one based on mmap(2) ... quite a while ago.
IIRC, it attempts to use end-of-page allocation by default for blocks of a certain size. It is also possible to enable guard pages on a per process basis or globally.
Thanks to this, memory corruption is much easier to detect and debug on OpenBSD.
Yeah, I was a bit wrong. It does a syscall for every 4K.
Normal systems use a slow syscall only for malloc > 128K.
I see no advantage here other than security extremism and not giving a damn about performance.
Shared virtual memory: when a CPU allocates a region of memory, and then the GPU synchronizes to the same region of memory.
Any reads / writes to that virtual memory address work in both CPU and GPU code. So the CPU can do highly sequential tasks like balancing a binary tree (with all of the pointers involved), while the GPU can do highly parallel tasks like traversing the tree exhaustively. Pointers inside of this shared region remain legitimate, because each address is correctly identified between the CPU and GPU.
I'm describing "Fine Grained Shared Virtual Memory" btw, in case anyone wants to know what to search for.
In effect, it works because the CPU does an mmap, and then the GPU does an mmap, and then the PCIe drivers work to keep the mmap'd region in sync. With modern acquire / release memory fence semantics + atomics properly defined, it's even possible to write "Heterogeneous" code where the CPU and GPU work simultaneously on the same memory space.
In theory. I haven't tested these features out yet, and it's quite possible that current implementations are buggy. But the features exist and it's totally possible (to attempt) to do this sort of stuff on hardware today.
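For anyone who wants to poke at it, here's a rough sketch of what this looks like through OpenCL 2.0's fine-grained SVM API. This is my own illustration, not anything from the article: it assumes a driver/device that actually supports CL_DEVICE_SVM_FINE_GRAIN_BUFFER, assumes the host and device struct layouts match, and leaves out all error checking.

    #define CL_TARGET_OPENCL_VERSION 200
    #include <CL/cl.h>
    #include <stdio.h>

    /* Host-side node; assumed to have the same layout as the kernel's
       struct (4-byte int + padding + 8-byte pointer on a 64-bit device). */
    typedef struct node { int value; struct node *next; } node;

    static const char *src =
        "typedef struct node { int value; global struct node *next; } node;\n"
        "kernel void sum(global node *head, global int *out) {\n"
        "    int s = 0;\n"
        "    for (global node *n = head; n; n = n->next) s += n->value;\n"
        "    *out = s;\n"
        "}\n";

    int main(void) {
        cl_platform_id plat; cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "sum", NULL);

        /* Fine-grained SVM: plain host pointers stay valid on the device. */
        node *nodes = clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                 3 * sizeof(node), 0);
        int *out = clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                              sizeof(int), 0);

        for (int i = 0; i < 3; i++) {            /* CPU builds the list with ordinary stores */
            nodes[i].value = i + 1;
            nodes[i].next  = (i < 2) ? &nodes[i + 1] : NULL;
        }

        clSetKernelArgSVMPointer(k, 0, nodes);   /* hand the raw pointer to the GPU */
        clSetKernelArgSVMPointer(k, 1, out);
        size_t one = 1;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &one, NULL, 0, NULL, NULL);
        clFinish(q);

        printf("sum = %d\n", *out);              /* expect 6 */
        clSVMFree(ctx, nodes);
        clSVMFree(ctx, out);
        return 0;
    }

The interesting part is that nodes[i].next is written with ordinary host stores and the kernel chases exactly the same pointer values; with coarse-grained SVM you would instead have to bracket the host accesses with clEnqueueSVMMap / clEnqueueSVMUnmap.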
Hm I hadn't heard of this, that's interesting. That seems fragile but I'm not an expert in this area.
I found this recent CppCon video pretty interesting. It's about running vanilla C++ on (nvidia) GPUs and what the implications are for the C++ memory model:
I like the article and the tricks. I recall seeing a presentation by a game developer who mentioned something similar: intermixing actual data pages with something else (read-only pages, or even inaccessible ones?) to trigger an error when an out-of-bounds read/write strayed far enough (which apparently happened often enough for this to be helpful).
The address space is not actually 64 bit and it's not just the OS limiting it. The CPU/MMU/IDK (I'm not low level enough to get all this stuff) does too in some ways. I assume these values mean that I can have 2^39 bytes of actual RAM and 2^48 bytes of virtual address space:
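Those limits do come straight from the CPU; if you're curious, you can ask it directly. A small sketch for x86-64 with GCC/Clang's <cpuid.h> (leaf 0x80000008 reports the implemented physical and virtual address widths):

    /* Sketch: query how many physical/virtual address bits the CPU
       actually implements (x86-64, GCC/Clang). */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) return 1;
        printf("physical address bits: %u\n", eax & 0xff);        /* e.g. 39 */
        printf("virtual  address bits: %u\n", (eax >> 8) & 0xff); /* e.g. 48 */
        return 0;
    }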
This also came up back when I was looking through the PUC Lua 5.1 code for a paper and (out of curiosity) looking into the code of PUC 5.2 and 5.3 to see the changes, and into the LuaJIT code for comparison.
This reminds me that I should finally write up, in English, how the PUC Lua 5.1 VM works and implements its features. If someone's interested in any way, I'd appreciate a word letting me know there's a market for it, so I don't fear I'll be writing down stuff I already grok for no one to read and benefit from.
Amiga programming manuals were very clear about not (ab)using the top 8 bits, even though the A1000/500/2000 only had 24 physical address pins (limiting them to 16 MB). So of course Microsoft BASIC for Amiga used the upper bits, and when the 68020+ cards arrived it broke, since the upper 8 bits now actually pointed somewhere.
What is inconceivable today will be reality at some point.
On amd64 (and IIRC also Alpha), the hardware checks that the unused bits of an address are all equal to the highest used bit (or, IIRC, all zero on Alpha) to discourage this kind of trick.
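For the curious, the tag-in-the-top-bits trick looks roughly like this on today's x86-64 (my own sketch, not from the article). It relies on the CPU only implementing 48-bit virtual addresses and requiring "canonical" pointers, so the tag has to be stripped and the sign bit re-extended before dereferencing; like the Amiga case, it breaks the moment hardware hands out wider addresses (5-level paging, LAM, etc.).

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void *tag_ptr(void *p, uint16_t tag) {
        /* keep the low 48 bits, stash the tag in the top 16 */
        return (void *)(((uintptr_t)p & 0x0000FFFFFFFFFFFFull) | ((uintptr_t)tag << 48));
    }

    static uint16_t ptr_tag(void *p) {
        return (uint16_t)((uintptr_t)p >> 48);
    }

    static void *untag_ptr(void *p) {
        /* restore canonical form: sign-extend bit 47 back into the top bits
           (arithmetic right shift of a signed value is implementation-defined,
           but does the expected thing on GCC/Clang) */
        intptr_t v = (intptr_t)((uintptr_t)p << 16);
        return (void *)(v >> 16);
    }

    int main(void) {
        int *x = malloc(sizeof *x);
        *x = 42;
        int *t = tag_ptr(x, 0xBEEF);   /* non-canonical: must not be dereferenced as-is */
        printf("tag=%x value=%d\n", ptr_tag(t), *(int *)untag_ptr(t));
        free(x);
        return 0;
    }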
And I vaguely remember that ESA/390 is actually a 32-bit address architecture in hardware, and its 31-bitness comes from using the highest bit of an address as a flag that such tricks are taking place.
We're using this technique to stuff the file descriptor and the accompanying object together for the epoll registration, so one can check whether the (reusable) object was reused for something else and is now associated with another file descriptor while still sitting in the epoll set: https://github.com/sociomantic-tsunami/ocean/blob/ee4f78d527...
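Not the ocean code itself, but a rough sketch of the general idea using epoll_event.data.u64 (the struct name and the "fd fits in 16 bits" assumption are mine, purely for illustration; real code would have to check that):

    #include <stdint.h>
    #include <sys/epoll.h>

    struct conn { int fd; /* ... */ };

    /* pack the object pointer (low 48 bits on x86-64) and the fd it was
       registered with into the single u64 epoll gives back to us */
    static uint64_t pack(struct conn *c) {
        return ((uint64_t)(uint16_t)c->fd << 48)
             | ((uint64_t)(uintptr_t)c & 0x0000FFFFFFFFFFFFull);
    }

    static struct conn *unpack(uint64_t v, int *registered_fd) {
        *registered_fd = (int)(v >> 48);
        return (struct conn *)(uintptr_t)(v & 0x0000FFFFFFFFFFFFull);
    }

    static int register_conn(int epfd, struct conn *c) {
        struct epoll_event ev = { .events = EPOLLIN, .data.u64 = pack(c) };
        return epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
    }

    /* on wakeup: false if the object was recycled for another fd */
    static int still_same_registration(struct epoll_event *ev) {
        int fd;
        struct conn *c = unpack(ev->data.u64, &fd);
        return c->fd == fd;
    }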
> Also, it is possible that another set of pages could be allocated directly after ours in memory, in which case we wouldn’t detect overwrites beyond the last page. [...] For the second issue, we could reserve an extra page after our pages, but not commit that page. No other object could then claim those addresses and writing to them would still cause an access violation. (Note: This only works on Windows, where reserve and commit are separate operations.)
On Linux (and POSIX in general), you can do the same thing by allocating the guard page as described, then setting its protection to PROT_NONE using mprotect(2). Any access to the page will cause an error.
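A minimal sketch of that (Linux/POSIX, anonymous mapping, error handling mostly elided):

    /* N usable pages followed by one PROT_NONE guard page, so any
       overrun past the end faults immediately. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t n = 4;                               /* usable pages */
        char *base = mmap(NULL, (n + 1) * page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return 1;
        if (mprotect(base + n * page, page, PROT_NONE) != 0) return 1;

        memset(base, 0xAB, n * page);               /* fine */
        printf("about to overrun...\n");
        base[n * page] = 1;                         /* SIGSEGV: hits the guard page */
        return 0;
    }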
>Since Windows will incorrectly assume that our int 3 exception was generated from the single-byte variant, it is possible to confuse the debugger into reading "extra" memory. We leverage this inconsistency to trip a "guard page" of sorts.
My favorite virtual memory trick, in the context of operating system kernel development, is to make one of the top-level page table entries point to the physical location of the page table itself. This makes it easy to manipulate the page table in the virtual address space, without turning off paging. Some articles on the topic:
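Links aside, the address arithmetic behind the trick is compact enough to sketch. A rough user-space illustration, assuming x86-64 4-level paging and an arbitrarily chosen recursive PML4 slot of 510 (in a real kernel you would dereference the computed address instead of printing it):

    #include <stdint.h>
    #include <stdio.h>

    #define RECURSIVE_SLOT 510ull

    /* With the recursive slot R in the PML4, the PTE mapping virtual
       address V becomes visible at an address built by shifting V's index
       fields down one level and putting R in the top slot. */
    static uint64_t pte_vaddr(uint64_t v) {
        uint64_t addr = (RECURSIVE_SLOT << 39)        /* PML4 index = recursive slot */
                      | ((v >> 9) & 0x7FFFFFFFF8ull); /* bits 38..3: indices shifted down, *8 per entry */
        if (RECURSIVE_SLOT >= 256)                    /* keep the result canonical */
            addr |= 0xFFFF000000000000ull;
        return addr;
    }

    int main(void) {
        uint64_t v = 0x00007f1234567000ull;
        printf("PTE for %#llx is mapped at %#llx\n",
               (unsigned long long)v, (unsigned long long)pte_vaddr(v));
        return 0;
    }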
I don't know for sure for Windows, but for Linux the answer for most of these is no. Until the physical memory is actually allocated, the page tables will not be touched. All that happens when you call `mmap` is that an entry is added to the internal table in the Linux kernel that describes your process's memory layout, which takes basically no time.
When you attempt to access the memory at one of those locations for the first time, though, you will experience a performance hit: your program will page fault because the mapping for that address doesn't actually exist yet. The kernel will then allocate and map a new page and return to your program so you can execute the instruction again. After that first access there are no further performance penalties.
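You can watch this happen by counting minor page faults around the first touch (a quick Linux sketch):

    /* A large anonymous mmap costs almost nothing up front; pages are only
       materialized by page faults on first touch. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    static long minor_faults(void) {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    int main(void) {
        size_t size = 1ull << 30;                      /* reserve 1 GiB of addresses */
        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;

        long before = minor_faults();
        for (size_t i = 0; i < 100 * 4096; i += 4096)  /* touch the first 100 pages */
            p[i] = 1;
        printf("minor faults from touching 100 pages: %ld\n",
               minor_faults() - before);

        munmap(p, size);
        return 0;
    }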
The unique ID thing may get expensive if you need tons of them, but since each address within a page is its own unique ID, you get a few thousand unique IDs per syscall, which should cut down on the number of syscalls you have to do. You could also allocate larger blocks of addresses if you wanted to, rather than just a single page.
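A sketch of what that allocator could look like (my own toy version, not the article's code; the pages are mapped PROT_NONE and never unmapped, so the handed-out addresses stay unique):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static uintptr_t next_id, ids_left;

    static uintptr_t new_unique_id(void) {
        if (ids_left == 0) {
            /* PROT_NONE: the page can never be touched, we only want its addresses */
            void *page = mmap(NULL, 4096, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (page == MAP_FAILED) return 0;
            next_id = (uintptr_t)page;
            ids_left = 4096;
        }
        ids_left--;
        return next_id++;      /* ~4096 IDs per mmap() call */
    }

    int main(void) {
        printf("%#lx %#lx %#lx\n",
               (unsigned long)new_unique_id(),
               (unsigned long)new_unique_id(),
               (unsigned long)new_unique_id());
        return 0;
    }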
You don't do a syscall for _every_ uniqueId, only once every PAGE_SIZE. It's not that expensive.
Not every object needs a unique identifier. Objects that need them tend to be larger and have indeterminate lifecycles. Otherwise the overhead of storing a unique id would by itself be egregious, regardless of the process of generating the id. If you can afford to store the id then you can afford the overhead of an occasional syscall.
There are secondary effects that also make this not a great idea.
You may wind up slowly leaking VMAs (the kernel needs to track what's been mapped, even if it doesn't allocate physical pages). This makes future mmap calls and page faults slower.
Generating a uniqueID is trivial and super fast (one atomic operation). What’s the benefit here?
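i.e. the conventional version is just (C11):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static _Atomic uint64_t counter;

    static uint64_t new_unique_id(void) {
        return atomic_fetch_add(&counter, 1);   /* one atomic add, no syscall */
    }

    int main(void) {
        printf("%llu %llu %llu\n",
               (unsigned long long)new_unique_id(),
               (unsigned long long)new_unique_id(),
               (unsigned long long)new_unique_id());
        return 0;
    }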
The author's virtual_alloc() function is mostly redundant, as most malloc() and calloc() implementations will use mmap() internally if the allocation is large (the result being that physical pages won't be allocated until you touch the memory).
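For glibc specifically, the threshold is even tunable (a glibc-specific sketch; M_MMAP_THRESHOLD defaults to about 128 KiB and adapts dynamically):

    /* Allocations above the mmap threshold are served by mmap() directly,
       so pages are not populated until first touch. */
    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* anything >= 1 MiB goes straight to mmap (default is ~128 KiB) */
        mallopt(M_MMAP_THRESHOLD, 1 << 20);

        char *big = malloc(64u << 20);   /* 64 MiB: mmap-backed, lazily populated */
        if (!big) return 1;
        big[0] = 1;                      /* only now does a physical page appear */
        printf("allocated and touched first byte\n");
        free(big);                       /* mmap-backed blocks are munmap()ed immediately */
        return 0;
    }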
also I think that double mapped ring buffer implementation is bogus without a volatile keyword in there somewhere
> also I think that double mapped ring buffer implementation is bogus without a volatile keyword in there somewhere
I agree the ring buffer is problematic. The `read` function for the `ring_buffer_t` is a bad design - it returns a pointer into the actual ring_buffer_t buffer, but also increments the read pointer. There's no guarantee for how long the returned pointer is actually valid - eventually writers will overwrite it and you will have no idea. I suppose if you're only looking at a single-threaded environment then it's not as big of a deal, but for a normal implementation `read` should just do a `memcpy` like `write` does.
As for `volatile`, making the `data` pointer `volatile` should make this valid - though you can't use `memcpy` after that, so you have to do the loop yourself unfortunately.
That said, I think you could "cheat" (i.e. don't do this) and ensure the correct behavior by changing `ring_buffer_t` to have two `data` pointers to the buffer instead of one - one for reading, and one for writing. The compiler would have to assume the two pointers can alias each other at any point within the buffer, and thus it should be forced to produce correct code even with multiple reads and writes in the same function (because when you do a write, it has to assume data it previously read may have changed). I don't think there's any guarantee though - a sufficiently smart compiler could possibly make inferences that would break this. In particular, the C standard says that the C memory layout is completely linear, so the compiler could make an assumption like "if address 1 aliases with the first byte of the other buffer, then address 4097 doesn't", and that would obviously not be true here. Unless you hard-coded the addresses, though, I don't think any current compiler would ever make such an optimization - but again, nobody should do this, and I don't know for sure. It's more of a curiosity than a good idea.
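For reference, here's a minimal Linux sketch of the double mapping itself (my own version using memfd_create, not the article's Windows code), just to show why wrap-around turns into plain linear access: one physical buffer mapped twice back-to-back.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t size = (size_t)sysconf(_SC_PAGESIZE);   /* must be page-aligned */

        int fd = memfd_create("ring", 0);
        if (fd < 0 || ftruncate(fd, size) != 0) return 1;

        /* reserve 2*size of contiguous address space, then map the same
           file into both halves */
        char *base = mmap(NULL, 2 * size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return 1;
        if (mmap(base, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) return 1;
        if (mmap(base + size, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) return 1;

        /* a write that straddles the end shows up at the start */
        memcpy(base + size - 3, "wrap!", 5);
        printf("%.2s\n", base);                        /* prints "p!" */
        return 0;
    }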
realloc() is very efficient on most POSIX systems (Linux included) because the C library (e.g. glibc) uses mmap techniques. By contrast, on Windows the realloc() provided by the Microsoft C library is slow on large arrays because it does not use mmap techniques (as far as I know, the only option is to use the VirtualAlloc() function and tune it by hand), so it involves a memory copy on every resize if the allocator has no adjacent memory reserved. This post is therefore interesting for Windows users calling realloc() on large arrays. (The typical way of reducing the impact of a non-optimal realloc(), e.g. on systems without an MMU, is geometric space allocation: when growing, instead of adding space for one element, grow the array by something like 25%, to reduce the number of calls to the re-allocator.)
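The geometric-growth workaround looks like this in practice (a generic sketch; the names are mine):

    /* Amortize realloc() cost by growing capacity ~25% at a time instead
       of one element at a time. */
    #include <stdlib.h>

    typedef struct {
        int   *data;
        size_t len, cap;
    } int_vec;

    static int vec_push(int_vec *v, int x) {
        if (v->len == v->cap) {
            size_t new_cap = v->cap ? v->cap + v->cap / 4 + 1 : 16;  /* grow ~25% */
            int *p = realloc(v->data, new_cap * sizeof *p);
            if (!p) return -1;             /* old block still valid on failure */
            v->data = p;
            v->cap  = new_cap;
        }
        v->data[v->len++] = x;
        return 0;
    }

    int main(void) {
        int_vec v = {0};
        for (int i = 0; i < 1000000; i++)
            if (vec_push(&v, i) != 0) return 1;
        free(v.data);
        return 0;
    }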
On Linux, one used to be able to execute mmap() on /proc/self/mem. I used this to implement bank switching for an emulator. Despite Torvalds' recent rant about preserving backward compatibility for userspace, this feature was removed at some point.
I didn't complain. At some point I upgraded my kernel and the code stopped working. This was over 10 years ago. I seem to remember going to the list and seeing that it had been removed due to a security problem. I also seem to remember a Torvalds line about only crazy people wanting to use such a feature. I had lost interest in maintaining the code, so I didn't pursue it.
No, if I remember correctly, there was no other way to do what I was doing without using some file on the filesystem as the chunk of memory to be swapped. Calling mmap() on /proc/self/mem allowed one to make a block of memory show up at the zero page and swap it with another with a single system call, instead of copying the block back and forth. It has been at least a decade since I touched that code, so I may be misremembering.