Am I misreading the bitmask code? It looks like (in addition to a few other ideas) it's using the old "stick a few extra bits in an aligned pointer", but it seems to be only manipulating high bits, whereas aligned pointer zeroes are low-order bits.
I'd suggest a heavier investment in testing infrastructure.
64-bit architectures don't actually have 64-bit address spaces: both AMD64 and ARM64 have 48-bit address spaces by default (some CPUs have extensions you can enable to request larger address spaces, e.g. LVA on ARM64 and 5LP / LA57 on AMD64, but those are opt-in on a per-process basis).
So while you have 3 bits available at the bottom of the pointer, there are 16 at the top. That's a lot more payload you can smuggle. There are even CPU extensions which tell the processor to ignore some of that (Linear Address Masking for Intel, Upper Address Ignore for AMD, and Top Byte Ignore for ARM).
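Here's a rough sketch of the idea in plain C (illustrative helper names, assuming a 48-bit address space with none of those ignore-the-top-bits extensions enabled):

    /* Smuggle a 16-bit tag in the otherwise-unused top bits of a
       64-bit user-space pointer. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void *pack_tag(void *p, uint16_t tag)
    {
        return (void *)(((uintptr_t)p & 0x0000FFFFFFFFFFFFull) |
                        ((uintptr_t)tag << 48));
    }

    static uint16_t get_tag(void *p)
    {
        return (uint16_t)((uintptr_t)p >> 48);
    }

    static void *strip_tag(void *p)
    {
        /* Re-canonicalize by sign-extending from bit 47; for a
           user-space pointer this just clears the top 16 bits. */
        return (void *)(((intptr_t)((uintptr_t)p << 16)) >> 16);
    }

    int main(void)
    {
        int *x = malloc(sizeof *x);
        *x = 42;
        void *t = pack_tag(x, 0xBEEF);
        printf("tag=%#x value=%d\n", (unsigned)get_tag(t), *(int *)strip_tag(t));
        free(x);
        return 0;
    }

Without one of those extensions you have to strip before every dereference; with one of them enabled, some or all of the tag bits can be left in place for ordinary loads and stores (how many depends on the extension).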
Small correction. That's true for 4-level paging where virtual memory is capped at 256 TiB (LAM48). There's a 5-level extension that reduces the wasted space to 7 bits (LAM57) to allow for 128 PiB of virtual space.
I have no idea what the purpose of the extension is, given that I don't believe you can get that much secondary storage attached to a CPU unless you go to tape, which is pointless for virtual mapping.
I literally mentioned five-level paging in my comment.
But it's an opt-in extension: you need the kernel to support it, and then you need to ask the kernel to enable it on your behalf. So it's not much of an issue, you just have to avoid this sort of extension (it's not the only one; ARM64 also has a similar one, though it only goes up to 52-bit address spaces) if you're relying on 16~19 bits of tagging.
You do need to opt-in when compiling the kernel, but on Linux it doesn't take anything particularly special on the program's side to enable it. The rule is that non-fixed mmap() will only ever yield 48-bit addresses, until the program specifies a 57-bit address as a hint, after which it can yield 57-bit addresses at will. (This rule has lots of curious consequences for the location of the vDSO, when 5-level paging is enabled and an ELF program uses big addresses for its segments.)
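If anyone wants to poke at it, here's a tiny Linux-only experiment along those lines (it only produces a high address on a kernel built with 5-level paging on LA57-capable hardware; otherwise both mappings come back low):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* No hint: the kernel keeps this below the 48-bit boundary. */
        void *low = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Hint above the boundary: asks for an address from the
           larger space, if the kernel and hardware support it. */
        void *high = mmap((void *)(1ull << 48), 4096,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (low == MAP_FAILED || high == MAP_FAILED)
            return 1;
        printf("no hint:   %p\n", low);
        printf("high hint: %p\n", high);
        return 0;
    }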
Any idea what the use case is for such large addresses? Is it RDMA or something else? Even with RDMA I find it hard to believe that companies have > 256TiB of RAM installed behind a single switch.
I recall hearing one use case where you have a whole lot of threads, and most of them will only use a little memory, but a few will need a whole lot of memory. So you assign each thread a very large segment of the address space upfront, so that they never have to coordinate with other threads at runtime. At 1 GiB of space allocated to each thread, it takes only 262k threads to use up the whole 256 TiB address space.
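Something like this sketch of that scheme (sizes and helper names made up for illustration): each worker grabs a big PROT_NONE slice of address space up front, which costs no physical memory, and later commits pages out of its own slice without coordinating with any other thread.

    #include <stdio.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #define SLICE_SIZE (1ull << 30)   /* 1 GiB of address space per worker */

    /* Reserve address space only; nothing is backed by RAM yet. */
    static void *reserve_slice(void)
    {
        return mmap(NULL, SLICE_SIZE, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }

    /* Commit the first `bytes` of the slice once the worker needs them. */
    static int commit(void *slice, size_t bytes)
    {
        return mprotect(slice, bytes, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        void *slice = reserve_slice();
        if (slice == MAP_FAILED)
            return 1;
        if (commit(slice, 4096) == 0)
            ((char *)slice)[0] = 1;   /* most workers touch only a little */
        printf("slice at %p\n", slice);
        return 0;
    }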
Good question. I may also be missing something, since as of today, and to my knowledge, top-of-the-line Intel Xeons support up to 4 TB per socket. That means the largest amount of physical RAM in an 8-socket system would be 32 TB ... which is not even close to the addressable (virtual) memory even on 48-bit systems (256 TB).
Why? Alignment causes some low-order bits to be zero. Shifting the pointer to the right drops those zeros on the floor, leaving high-order bits zero (or sign extended or whatever). Then you can put your tag up there instead. Shifting the value left by the same amount drops the tag and recovers the same pointer for use.
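A tiny sketch of that round trip, assuming user-space pointers with 8-byte alignment and keeping the tag to exactly those 3 spare bits, so a single left shift really does drop it:

    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define ALIGN_BITS 3   /* log2 of the 8-byte alignment */

    /* Shift the aligned pointer right (its low 3 bits are zero anyway)
       and park a 3-bit tag in the now-vacant top bits. */
    static uint64_t encode(void *p, uint64_t tag)
    {
        return ((uintptr_t)p >> ALIGN_BITS) | (tag << (64 - ALIGN_BITS));
    }

    static uint64_t tag_of(uint64_t v)
    {
        return v >> (64 - ALIGN_BITS);
    }

    /* Shifting left by the same amount pushes the tag off the top and
       restores the original pointer. */
    static void *decode(uint64_t v)
    {
        return (void *)(uintptr_t)(v << ALIGN_BITS);
    }

    int main(void)
    {
        void *p = aligned_alloc(8, 64);
        uint64_t v = encode(p, 5);
        assert(tag_of(v) == 5);
        assert(decode(v) == p);
        free(p);
        return 0;
    }

If you want the bigger payload from upthread (the 16 free top bits plus these 3), the decode needs a mask or sign-extension step on top of the shift, since shifting left by 3 only pushes out 3 bits of tag.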