Stealing bits from pointers makes sense sometimes, but in most cases an easier and less error-prone way to increase memory efficiency is to just use indices or relative pointers. Instead of shaving individual bits off a pointer, take advantage of the fact that most pointers in a program are almost exactly alike.
Instead of storing 64-bit prev and next pointers for an item in a linked list, just store e.g. two 16-bit offsets. Then next_ptr = (this_ptr & base_mask) + offset * alignment.
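In C that might look something like this (a sketch only; the arena size, alignment, and node layout are made-up assumptions, and the arena is assumed to be allocated at an address aligned to its own size so the base can be recovered by masking):

    #include <stdint.h>

    #define ARENA_SIZE ((uintptr_t)1 << 20)   /* 1 MiB arena, power of two */
    #define BASE_MASK  (~(ARENA_SIZE - 1))    /* clears the within-arena bits */
    #define ALIGNMENT  16                     /* every node is 16-byte aligned */

    typedef struct Node {
        uint16_t prev;   /* offsets from the arena base, in ALIGNMENT units */
        uint16_t next;
        /* ... payload ... */
    } Node;

    static Node *next_node(Node *this_ptr) {
        uintptr_t base = (uintptr_t)this_ptr & BASE_MASK;
        return (Node *)(base + (uintptr_t)this_ptr->next * ALIGNMENT);
    }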
Most of the time, wasting memory by storing full 64-bit pointers is no big deal. But when it is, you can go a lot further than just shaving off a few bits.
If you want to be extra clever you can encode values in the base address of your pointers by taking advantage of how virtual memory is allocated. Every process has its own virtual memory pages and you can request memory pages at specific address ranges. For instance, if you align all your memory at a 32 GB boundary then you can use 32-bit pointers plus a base pointer without even having to mask. You can also use virtual memory allocation tricks to convey information about alignment, typing, whether a pointer is allowed to be shared between threads, and so on. And again, without having to butcher the pointer itself.
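A sketch of that base-pointer scheme on Linux (the address, size, and helper names are assumptions, not a prescription; MAP_FIXED_NOREPLACE is used so an already occupied range fails instead of being clobbered):

    #define _GNU_SOURCE              /* for MAP_FIXED_NOREPLACE on glibc */
    #include <stdint.h>
    #include <sys/mman.h>

    #define HEAP_BASE ((void *)(32ull << 30))   /* an address on a 32 GB boundary */
    #define HEAP_SIZE (4ull << 30)              /* 4 GB reachable via 32-bit offsets */

    static char *heap_base;

    static int heap_init(void) {
        /* ask for the region at a fixed, highly aligned address; fail if taken */
        heap_base = mmap(HEAP_BASE, HEAP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
        return heap_base == MAP_FAILED ? -1 : 0;
    }

    /* a 32-bit "pointer" is just an offset from the base; no masking needed */
    static inline void    *ptr32_decode(uint32_t off) { return heap_base + off; }
    static inline uint32_t ptr32_encode(void *p)      { return (uint32_t)((char *)p - heap_base); }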
Considering how common it is to use indices in lieu of pointers in Rust, it would be nice if it had native NonMax integer types similar to the NonZero types it already has. NonZero gives you the niche-filling optimization (e.g. Option<NonZero> and NonZero are the same size), but with indices you want zero to be a valid value, so it would be natural to use the maximum as the illegal value instead. You can kind of do it by offsetting the index by 1, or by inverting the bits and putting the result in a NonZero, but that's a small extra runtime cost on every use.
You can write a handy struct wrapper over a private NonZero that xors the given value with the prohibited value (max in this case) in the new constructor, and xors it back in the get method. Xoring is very cheap. That's my favorite way of storing indices/links for nodes, since you can wrap them in Option for free.
That's really not the application for this; it's tags, which are endlessly useful for characterizing the pointer. For instance you can use the MSBs to establish the type of what's pointed to, or for reference counting. I use the LSB to indicate whether the value is a pointer or an unsigned integer, so you can store either an (integer) key to a record in a database, or a pointer to the record data in memory. You could use another bit in the pointer to indicate that the record is dirty.
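Not the poster's actual code, but the scheme described might look roughly like this in C (names are made up; records are assumed to be at least 4-byte aligned so two low bits are free):

    #include <assert.h>
    #include <stdint.h>

    #define TAG_IS_KEY ((uintptr_t)1 << 0)   /* LSB: set = integer key, clear = pointer */
    #define TAG_DIRTY  ((uintptr_t)1 << 1)   /* only meaningful for pointers */

    typedef uintptr_t handle_t;

    static handle_t handle_from_key(uint64_t key) { return ((handle_t)key << 1) | TAG_IS_KEY; }

    static handle_t handle_from_ptr(void *record) {
        assert(((uintptr_t)record & 3) == 0);   /* alignment gives us the two low bits */
        return (handle_t)record;
    }

    static int      handle_is_key(handle_t h)    { return (int)(h & TAG_IS_KEY); }
    static uint64_t handle_key(handle_t h)       { return (uint64_t)(h >> 1); }
    static void    *handle_ptr(handle_t h)       { return (void *)(h & ~TAG_DIRTY); }
    static handle_t handle_set_dirty(handle_t h) { return h | TAG_DIRTY; }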
Using unused bits in pointers can make your code more efficient and simpler.
> Pointer Authentication is a mechanism by which certain pointers are signed. When a pointer gets signed, a cryptographic hash of its value and other values (pepper and salt) is stored in unused bits of that pointer.
> Before the pointer is used, it needs to be authenticated, i.e., have its signature checked. This prevents pointer values of unknown origin from being used to replace the signed pointer value.
ECC can't solve for that because it would only have one key for all processes.
Given a presumption of side channel writes, pointer auth is another layer of defense that may or may not be worth the loss in performance.
This requires using arenas extensively in your program or perhaps playing tricks with virtual memory, because your system allocator will give you arbitrary pointers by default and you'll often want to point to an object from a different allocation.
Right. Which is why I favor doing your own memory management entirely (which has some huge advantages) or not worrying about memory at all and trusting a garbage collector. I don't think there are many situations left where memory management is too important to leave to a gc but not important enough to do from scratch.
Definitely indices are a great option, but then you need a base pointer and a way to allocate off of it. That can add significant complexity. So it is all a tradeoff.
I was just thinking this. You'd either need to special-case your debuggers or be able to use something that provided bespoke hooks for pointer-following.
- The game used a tricky pointer hack: basically, on the PSX accessing a pointer with the highest-24-bit set means read it from the CPU cache, otherwise not (or maybe the other way around). This was used for example to indicate whether the C4 bomb was planted on the ground or on the wall, instead of keeping a boolean/bit flag for it. It was possibly used for some more things. Since it was 24-bit, that meant 16 MB.
Are there ways on Linux to make sure that all user-visible pointers in a process are allocated within some subset of the overall address space? For instance, imagine writing a 64-bit program such that all userspace pointers are allocated within the first 32 bits. This would allow for representing them in memory using four bytes, as is commonly done on 32-bit architectures--without adding more extensive or ad-hoc changes such as a new "x32" binary format. Except that you could also limit to 24-bit (giving 16M of address space), 16-bit (for a "tiny" code model) or anything else. Of course, this would require some amount of support in the linker/loader, userspace libraries and the kernel itself. But the effort might be worthwhile if it can allow for replacing more ad-hoc approaches.
Linux mmap (since 2.6) has a MAP_32BIT flag, but only for x86-64:
Put the mapping into the first 2 Gigabytes of the process address space. This flag is supported only on x86-64, for 64-bit programs. It was added to allow thread stacks to be allocated somewhere in the first 2 GB of memory, so as to improve context-switch performance on some early 64-bit processors. Modern x86-64 processors no longer have this performance problem, so use of this flag is not required on those systems. The MAP_32BIT flag is ignored when MAP_FIXED is set.
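A hedged usage sketch (Linux/x86-64 only; the size and error handling are placeholders):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* returns an address that fits in 32 bits, or 0 on failure */
    static uint32_t alloc_low(size_t size) {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
        if (p == MAP_FAILED)
            return 0;
        return (uint32_t)(uintptr_t)p;   /* the mapping lies in the first 2 GB */
    }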
I never had the problem of too many pointers, but if I did I'd try the "arena"[0] approach.
Preallocate a huge area of memory and use smaller indexes relative to the start instead, provided a pointer is bigger than an int. You might also be able to use this in that approach and keep partial pointers instead.
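A rough sketch of that arena/index idea (the item type, sizes, and names are made up):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { float x, y, z; uint32_t next; } Item;   /* 'next' is an index, not a pointer */

    typedef struct {
        Item    *items;
        uint32_t used, capacity;
    } Arena;

    static int arena_init(Arena *a, uint32_t capacity) {
        a->items    = malloc((size_t)capacity * sizeof *a->items);
        a->used     = 0;
        a->capacity = capacity;
        return a->items ? 0 : -1;
    }

    static uint32_t arena_alloc(Arena *a) {            /* hands out an index */
        return a->used < a->capacity ? a->used++ : UINT32_MAX;
    }

    static Item *arena_get(Arena *a, uint32_t idx) {   /* index -> pointer */
        return &a->items[idx];
    }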
> What do you call a pointer with the high bits stolen? An ointer!
What do you call a pointer with the low bits stolen? Machines with strict word addressing will ignore the bottom two bits, so you can safely store what you want there.
Not sure I've used, much less programmed a machine with that strict constraint in a while though. These days CPUs are all designed to implement the C abstract machine. At least the GPU folks are not subject to this constraint.
Am I misreading the bitmask code? It looks like (in addition to a few other ideas) it's using the old "stick a few extra bits in an aligned pointer", but it seems to be only manipulating high bits, whereas aligned pointer zeroes are low-order bits.
I'd suggest a heavier investment in testing infrastructure.
64-bit architectures don't actually have 64-bit address spaces; both AMD64 and ARM64 have 48-bit address spaces by default (some CPUs have extensions you can enable to request larger address spaces, e.g. LVA on ARM64 and 5LP / LA57 on AMD64, but that's opt-in on a per-process basis).
So while you have 3 bits available at the bottom of the pointer, there are 16 at the top. That's a lot more payload you can smuggle. There are even CPU extensions which tell the processor to ignore some of those top bits (Linear Address Masking for Intel, Upper Address Ignore for AMD, and Top Byte Ignore for ARM).
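A sketch of stashing a 16-bit tag in the top of a 48-bit-canonical pointer (the helper names are made up; the tag must be stripped before dereferencing unless one of those ignore-the-top-bits extensions is enabled):

    #include <stdint.h>

    #define ADDR_MASK ((uintptr_t)0x0000FFFFFFFFFFFF)   /* low 48 bits */

    static uintptr_t tag_pack(void *p, uint16_t tag) {
        return ((uintptr_t)p & ADDR_MASK) | ((uintptr_t)tag << 48);
    }

    static uint16_t tag_get(uintptr_t tagged) {
        return (uint16_t)(tagged >> 48);
    }

    static void *tag_strip(uintptr_t tagged) {
        /* restore canonical form by sign-extending bit 47 back into the top bits
           (relies on arithmetic right shift, which mainstream compilers provide) */
        return (void *)((intptr_t)(tagged << 16) >> 16);
    }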
Small correction. That's true for 4-level paging where virtual memory is capped at 256 TiB (LAM48). There's a 5-level extension that reduces the wasted space to 7 bits (LAM57) to allow for 128 PiB of virtual space.
I have no idea what the purpose of the extension is, when I don't believe you can get that much secondary storage attached to a CPU unless you go to tape, which is pointless for virtual mapping.
I literally mentioned five-level paging in my comment.
But it's an opt-in extension: you need the kernel to support it, and then you need to ask the kernel to enable it on your behalf. So it's not much of an issue, you just have to not use this sort of extension (it's not the only one, ARM64 also has a similar one though it only goes up to 52-bit address spaces) if you want 16~19 bits of tagging.
You do need to opt-in when compiling the kernel, but on Linux it doesn't take anything particularly special on the program's side to enable it. The rule is that non-fixed mmap() will only ever yield 48-bit addresses, until the program specifies a 57-bit address as a hint, after which it can yield 57-bit addresses at will. (This rule has lots of curious consequences for the location of the vDSO, when 5-level paging is enabled and an ELF program uses big addresses for its segments.)
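A sketch of that rule in practice (assuming a Linux kernel built with 5-level paging; the 1<<56 hint is just one example of "a 57-bit address"):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* with a NULL hint the kernel stays below the 48-bit boundary */
        void *lo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* a hint above the 48-bit boundary opts in to 57-bit addresses, if available */
        void *hi = mmap((void *)(1ull << 56), 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("lo = %p, hi = %p\n", lo, hi);
        return 0;
    }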
Any idea what the use case is for such large addresses? Is it RDMA or something else? Even with RDMA I find it hard to believe that companies have > 256TiB of RAM installed behind a single switch.
I recall hearing one use case where you have a whole lot of threads, and most of them will only use a little memory, but a few will need a whole lot of memory. So you assign each thread a very large segment of the address space upfront, so that they never have to coordinate with other threads at runtime. At 1 GiB of space allocated to each thread, it takes only 262k threads to use up the whole 256 TiB address space.
Good question. I also may not understand it, since as of today, and to my knowledge, top-of-the-line Intel Xeons can support up to 4 TB per socket. This means that for an 8-socket system, the largest amount of physical RAM would equate to 32 TB ... which is not even close to the addressable (virtual) memory on even 48-bit systems (256 TB).
Why? Alignment causes some low–order bits to be zero. Shifting the pointer to the right drops those zeros on the floor, leaving high–order bits zero (or sign extended or whatever). Then you can put your tag up there instead. Shifting the value left by the same amount drops the tag and recovers the same pointer for use.
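A sketch of that shift-based variant (assuming 8-byte alignment, so 3 spare bits; the names are made up):

    #include <stdint.h>

    #define TAG_BITS 3   /* matches 8-byte alignment: 3 low bits are always zero */

    static uintptr_t shift_pack(void *p, unsigned tag) {   /* tag < 8 */
        /* move the pointer down, then park the tag in the freed-up top bits */
        return ((uintptr_t)p >> TAG_BITS)
             | ((uintptr_t)tag << (sizeof(uintptr_t) * 8 - TAG_BITS));
    }

    static unsigned shift_tag(uintptr_t packed) {
        return (unsigned)(packed >> (sizeof(uintptr_t) * 8 - TAG_BITS));
    }

    static void *shift_unpack(uintptr_t packed) {
        /* shifting left pushes the tag off the top and restores the low zero bits */
        return (void *)(packed << TAG_BITS);
    }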
And it's not like Apple isn't aware of the dangers, either; they used top-byte tagging on the 68000, and it took them a lot of time and effort to migrate away from that to support 16+ MB systems.
IIRC the issue was that 68000 would transparently mask the top bits, so code could get away with avoiding the explicit masking, which of course breaks when you move to a larger address space.
More modern cpus enforce specific bit patterns for the MSBs, so broken code would be caught immediately.
It's common for libraries to use the 2 lowest bits in tree structures e.g. RB tree. It's not a problem at all since the data structure is controlled by the library. Even if tree nodes are allocated/provided by the user, having a "4-byte aligned" requirement in the API contract isn't a problem at all -- in fact you need to work quite hard to allocate a structure with a pointer that isn't at least 4 byte aligned (i386), or 8 byte aligned (x64).
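For illustration (not any particular library's API), the classic trick of keeping a red-black node's color in the low bit of its parent pointer looks roughly like this:

    #include <stdint.h>

    struct rb_node {
        uintptr_t       parent_color;    /* parent pointer with the color in bit 0 */
        struct rb_node *left, *right;
    };

    enum { RB_RED = 0, RB_BLACK = 1 };

    static struct rb_node *rb_parent(const struct rb_node *n) {
        return (struct rb_node *)(n->parent_color & ~(uintptr_t)1);
    }
    static int rb_color(const struct rb_node *n) {
        return (int)(n->parent_color & 1);
    }
    static void rb_set_parent(struct rb_node *n, struct rb_node *p) {
        n->parent_color = (uintptr_t)p | (n->parent_color & 1);
    }
    static void rb_set_color(struct rb_node *n, int color) {
        n->parent_color = (n->parent_color & ~(uintptr_t)1) | (uintptr_t)color;
    }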
Well, no it isn't, since malloc returns a void*, and when incrementing a void* the compiler can't figure out by what offset it needs to increment the address. For that operation to succeed/compile you need to cast the result of malloc to p* first. This means that the resulting alignment will still be correct.
The actual danger is in reinterpreting pointers, or some arbitrary chunk of memory, as a type that doesn't necessarily satisfy the alignment requirements for that given memory address.
gcc will compile this code and advance the pointer address by 1 byte. Yes, it's not standard-conformant.
My point though is that this is unlikely to happen by accident. The code above is clearly deliberately constructed. It's more work to type it than the correct version, and the code makes no sense at all: 1 byte is wasted, and it's in general a bad idea to allocate unaligned pointers.
And when you, for example, provide your own allocator for nodes, you'll probably do it like this:
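(The original snippet isn't shown here; a minimal sketch of what such a node allocator might look like, with a placeholder Node type:)

    #include <stdlib.h>

    typedef struct Node {
        struct Node *left, *right;
        int key;
    } Node;

    static Node *alloc_node(void) {
        /* malloc returns memory suitably aligned for any object type */
        return malloc(sizeof(Node));
    }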
And the memory will be aligned properly. Even when you use a plain malloc (no type or alignment information given to the allocator) you will get memory that is at least pointer-aligned.
<source>: In function 'void foo(int)':
<source>:4:28: warning: pointer of type 'void *' used in arithmetic [-Wpointer-arith]
    4 |     int * p = malloc(size) + 1;
      |               ~~~~~~~~~~~~~^~~
<source>:4:28: error: invalid conversion from 'void*' to 'int*' [-fpermissive]
    4 |     int * p = malloc(size) + 1;
      |               ~~~~~~~~~~~~~^~~
      |                            |
      |                            void*
Compiler returned: 1
First, yes, it did accept the pointer arithmetic even though it issued a warning.
Second, the error about implicit conversion from void-pointer to typed-pointer is in C++ only -- not in C, where you won't even get a warning. The error happens because you didn't cast the void pointer to the target type, and has nothing to do with the fact that gcc _does_ let you (as a GCC extension) do arithmetic on void pointers.
If you remove the "+ 1" you'll still see the same error in C++. To make it compile in C++, change the target pointer to a void pointer:
void *ptr = malloc(42) + 1;
or, alternatively, cast the RHS expression properly (it doesn't matter whether the "+ 1" is there or not).
I don't think there's a need to explain to me how pointers work, or what to do to make this error/warning go away, since it should be pretty obvious from my response that this is something I know very well.
I was only disputing the claim that you can trivially end up with an unaligned ptr from the given line of code, by providing an explanation, and a counter-example, of why this is not possible. At least not in C++, because that's the language I had in mind.
What you're saying is that it's possible to do void pointer arithmetic in C by using the GNU extension - fine, I can't disagree with that.
You corrected me and I corrected you. No need to take personal offense. I don't care what you know -- I obviously took the compiler output that you presented above as a refutation of what I said earlier, and I pointed out why it's not a valid refutation.
But those pointers still have bits in them that indicate the type of the object that they point to. It is merely the case that the tag bits are chosen at run time so that they correspond to allocated memory regions, rather than being chosen ahead of time.
None of the bits in the pointer have been "stolen" though, which is the point of the referenced article, they are all still available to access the full address space of the CPU.
That doesn't mean that they aren’t still there, and don't tell you the type. It’s still a tagged pointer, and because all objects of the same type must be allocated out of the same page (or collection of pages), it still limits how much memory can be dedicated to objects of that type.
No. Their reputation for poor performance mostly arises because the default data structure is a singly–linked list. Although the language makes these lists extremely convenient for the programmer to use, linked lists invariably result in terrible cache locality. High–performance systems that are written in lisp (and they do exist), achieve that performance by avoiding lists and using vectors and arenas instead.
Of course, it must be said that this performance penalty rarely matters in practice. For most programs, the limiting factor is not performance but the cost of implementation and maintenance. Lisp has many powerful and convenient features that make implementation faster and surer than in any other language, and which usually greatly lower the cost of maintenance as well. Thus by writing a new program in Lisp you can get it off the ground and earning you money faster than you could in almost any other language. You can also add features more cheaply as time goes on, allowing you to keep up with your competitors. It is only when your system is nearing completion, and all necessary features have been invented and implemented, that you should think about rewriting select parts of that system in a systems language where more efficiency is possible. Not only will you understand the system only after it is written, but you can measure the actual performance of the system to figure out which parts you need to rewrite, and which parts have acceptable performance already and thus do not need to be rewritten.
Cache locality is improved if the Lisp has a compacting garbage collector, but you still need extra memory for the cdr links. Lisp Machines had cdr-coding that allowed those to be avoided but I don't think that's practical on stock architectures.
True, but don’t forget that one of the _other_ causes of performance problems in lisp programs is the creation and subsequent collection of garbage. If you know ahead of time which of your data needs to be accessed or created by the most performance–sensitive parts of your program, then you can put that data into vectors to start with and avoid all of the extra work.
Long-lived things end up in older generations and aren't traversed much during GC, which is the moral equivalent of being statically allocated.
There's overhead if you mutate old objects in a GCed system due to card marking.
Lisp vectors are vectors of pointers, so there's still an overhead of dereferencing through it. Presumably the objects being pointed to end up compacted together, eventually.
I should have been more clear; I meant using vectors as arena allocators, so that all of your data ends up stored contiguously with no extra indirection.
> linked lists invariably result in terrible cache locality
With a compacting GC (ie most of the relevant ones) lists often provide better memory locality than arrays. They only lose at a large scale due to having +8 bytes of memory, which makes arrays occasionally cheaper for large sequences.
Even with perfect cache locality, a linked list will be significantly slower to traverse than an array because of worse pipelining and parallelization. Partially unrolled lists can help, but at the cost of additional complexity.
Of course not all container accesses require traversal.
Why do lists provide better memory locality than arrays? The array already stores its elements next to each other in memory. The compacting GC helps arrange the elements of the list close to each other too, right? Then wouldn’t the compacting GC at best even out the performance difference for sequential traversal?
Success, yes. It need not be direct commercial success; lots of programs are written purely for internal use within some larger organization. And even when there are known performance requirements, Lisp is often still a good choice because the performance is handled by specialized hardware. Consider games, for example. Time to market and rapid iteration are both paramount, while performance is taken care of not by writing a fast renderer but by sending everything to the GPU to be rendered on specialized hardware.
It's not. Two counterexamples: LuaJIT's interpreter/VM, and Wren's, both use NaN boxing, and are among the fastest VMs out there (we're ignoring JITs as out of scope).
It isn't tagging pointers that makes things (some Lisps are 'things', some are not) slow: it's pervasive abstraction and indirection. Doing some masking on a pointer is pipelined, out-of-order, one to three cycles, and is absolutely dwarfed by cache misses and conditional mispredictions, let alone the sort of pointer chasing which is common in idiomatic Lisp (or indeed Python) code.
I don't know what you mean. Maybe in the early early 80s that was the case, but PCs were still 16-bit back then, and would've been a poor fit for Lisp anyway.
One of the reasons why the Lisp machines died out is that round about the mid-80s or so, compiler technology improved for generic 32-bit processors, and it became possible to run Lisp software on a VAX, 68k, or RISC CPU faster than it ran on the Lisp machines' bespoke architecture. Back during the first AI hypecycle, the makers of Golden Common Lisp introduced the Hummingboard to the market, billed as an inexpensive solution to have a "Lisp machine in a PC". It was just a 386 on an ISA card (predating Compaq's 386 PC by about a year) with gobs of memory on board; a special build of Golden Common Lisp allowed code to be run on that CPU rather than the main one.
Symbolics had its own PC implementation of Lisp: CLOE.
From the Lisp FAQ: "CLOE (Common Lisp Operating Environment) is a cross-development environment for IBM PCs (MSDOS) and Symbolics Genera. It includes CLOS, condition error system, generational garbage collection, incremental compilation, code time/space profiling, and a stack-frame debugger. It costs from $625 to $4000 and requires 4-8mn RAM and a 386 processor."
Later they also ported Genera to DEC Alpha. Currently I have access to an implementation which runs on ARM64 and Intel x64.
>One of the reasons why the Lisp machines died out is that round about the mid-80s or so, compiler technology improved for generic 32-bit processors, and it became possible to run Lisp software on a VAX, 68k, or RISC CPU faster than it ran on the Lisp machines' bespoke architecture.
I'd say Lisp machines died out because Lisp died out (commercially). Other languages got popular, the second AI winter didn't help at all either.
If Lisp itself had fared better (even if it was on generic hardware), Lisp machines could have been saved too, they could still use a VAX, 68k, or RISC CPU underneath, and optimize for the developer experience. Or they'd have turned Lisp machines from hardware into a "Lisp machine" OS or IDE/REPL/etc environment for other OSes succeeding. But none of that took off.
> Or they'd have turned Lisp machines from hardware into a "Lisp machine" OS or IDE/REPL/etc environment for other OSes succeeding.
You mean like Allegro CL? LispWorks? Genera on DEC Ultrix?
With the exception of Genera, these solutions are available, maintained, and supported today. Even Genera hung on for a good few years. Lisp machines started to wane a few years before the AI winter hit in full force, because the idea of dedicated hardware to run Lisp programs made sense when your alternative was Maclisp struggling on a PDP-10, but became expensive, slow, and fiddly compared to the generic boxes running fast 32-bit processors with, again, much improved compiler tech. Genera on DEC Alpha, even as an interpreted VM, was so much faster than any of Symbolics's bespoke CPU architectures that Symbolics just quit making hardware and declared the Alpha version of Genera the official upgrade path.
One road not taken would have been to port the MIT/LMI/TI Lisp Machine software to stock hardware, it was 32-bit until the end so didn't need an Alpha as the host. A fast 68020 with a custom MMU would have been fine. There was only around 12k of "assembler" to rewrite to provide the VM and simple RTOS.
>You mean like Allegro CL? LispWorks? Genera on DEC Ultrix?
None of these cover my whole description, namely the last part:
"Or they'd have turned Lisp machines from hardware into a "Lisp machine" OS or IDE/REPL/etc environment for other OSes SUCCEEDING".
I wasn't talking about the mere existence of those.
My argument was: "if the issue was not Lisp falling in adoption in general, but merely Lisp Machines being dedicated hardware (as the parent claimed), then Lisp OSes for generic hardware and Lisp IDEs/envs for popular OSes would have succeeded."
> Or they'd have turned Lisp machines from hardware into a "Lisp machine" OS or IDE/REPL/etc environment for other OSes succeeding.
That was a very different business model, a very different offering into a different market. One could not easily move that business model, in an existing company, to a different market. There were offices, factories, contracts, ... -> costs to get rid of. The thing collapsed before anything useful could be scaled down.
For example the Symbolics graphics products were very expensive, just the software. A new owner ported it to SGI and Windows NT machines. To be a technically viable product it used a commercial Lisp vendor. It survived a while in that market (modelling/animation/game tools for game studios, animation studios, ...), changed owners again and then died.
Lisp (the forbidden word during/after the AI Winter) was a part of the problem, but generally a new business model and customers for it weren't found, or even looked for. For example TI just closed its AI business and never cared about it from then on.
Something like Genera was a huge pile of Lisp code written during 1.5 decades. During its best times the OS and maintenance upgrades were already more expensive than a PC.
Applications from the Lisp Machine were ported away. One no longer needed the OS and no longer had the OS. Some applications died, some survived, some died later.
Some applications (or the development environment) survived for some years on emulators (-> Interlisp-D was ported to SUNs and PCs, Genera was ported to DEC Alpha).
> Something like Genera was a huge pile of Lisp code written during 1.5 decades. During its best times the OS and maintenance upgrades were already more expensive than a PC.
Genera's biggest contemporary problem is that John C. Mallery seems to want to just sit on it rather than make it available to people.
Likely future revenues must be close to nil, so why not open source it? Apparently he talked about it years ago but never actually has.
Yes, a lot of the code is really dated, but if it were open source, maybe some people might freshen some of it up.
Franz and LispWorks are still in business through several boom-bust cycles. You never specified, what are your criteria for success? Some market share percentage? My point was, the second AI winter happened, but interest in Lisp machines waned first, because Lisp on generic processors reached a point where they could run the same code faster for less money.
> Maybe in the early early 80s that was the case, but PCs were still 16-bit back then, and would've been a poor fit for Lisp anyway.
There were LISPs for 16 bit MS-DOS. The most famous was arguably XLISP, which was used as the basis for AutoCAD’s AutoLISP extension language, and also the XLISPSTAT statistics package. Another was muLISP, which was resold by Microsoft as Microsoft LISP, and also used as the basis of the muMATH computer algebra system, and also its successor Derive.
I'm familiar with Xlisp, Microsoft Lisp, and PC-Scheme. What I'm not sure about is how performant they were, or whether large systems could be built with them.
No, the reason for the less-than-stellar performance of Lisp on 32-bit x86 CPUs was things like too few registers.
On 64bit Intel CPUs this is much less of a problem and Lisp runs roughly twice as fast in 64bit mode. I would think that this even holds without taking advantage of wider data words.
On other architectures this was less of a problem; 32-bit RISC CPUs and some other CISC CPUs had more registers.
If you build your whole app with it, yeah. If you use it tactically for a certain data type it can be very nice. Just don't try to dissimulate its specialness.
Besides making some lock-free algorithms feasible at all and others easier to implement, yes, it can also be used to reduce memory pressure at the cost of extra CPU cycles, if you know that the metadata you want to track fits in those 16+ bits.