I started out as a Lisp hacker on machines designed for it (PDP-10 and CADR, later D-machines) so I was very much in the camp you describe. They had hardware / microcode support for tagging, unboxing, fundamental Lisp opcodes, and for the Lispms specifically, things like a GC barrier and transporter support. When I looked at implementations like VAXLisp, the extra cycles needed to implement these things seemed like a burden to me.
Of course those general-purpose machines did lots of other things as well, and so were subject to a lot of evolutionary pressure that the research machines never faced.
The shocker that changed my mind was the idea of using the TLB to implement the write barrier. Yes, doing all that extra work cost cycles, but you were doing it on a machine that had evolved lots of extra capabilities that could ameliorate some of the burden. Plus the underlying hardware just kept getting faster, faster (i.e. the second derivative was higher).
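For the curious, here's a rough sketch of the trick, assuming a POSIX system where mprotect plus a SIGSEGV handler stand in for direct TLB/MMU manipulation (the layout and names are purely illustrative, not any real collector's):

    /* Card-marking write barrier via page protection: protect the old
       generation read-only; the first store into a page faults, the handler
       records the page as dirty and unprotects it so the store can retry. */
    #define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HEAP_PAGES 16
    static uint8_t *heap;
    static size_t   page_size;
    static int      dirty[HEAP_PAGES];   /* remembered set, one flag per page */

    static void barrier(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        uint8_t *addr = (uint8_t *)info->si_addr;
        if (addr < heap || addr >= heap + HEAP_PAGES * page_size)
            _exit(1);                     /* a real fault, not our barrier */
        size_t page = (size_t)(addr - heap) / page_size;
        dirty[page] = 1;                                  /* log mutated page */
        mprotect(heap + page * page_size, page_size,      /* let store retry  */
                 PROT_READ | PROT_WRITE);
    }

    int main(void) {
        page_size = (size_t)sysconf(_SC_PAGESIZE);
        heap = mmap(NULL, HEAP_PAGES * page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (heap == MAP_FAILED) return 1;

        struct sigaction sa = {0};
        sa.sa_sigaction = barrier;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        /* "Tenure" the heap: any later store trips the barrier exactly once. */
        mprotect(heap, HEAP_PAGES * page_size, PROT_READ);

        heap[3 * page_size + 8] = 42;     /* mutator store -> fault -> logged */
        printf("page 3 dirty? %d\n", dirty[3]);
        return 0;
    }

The point is that the mutator pays nothing on the fast path; the MMU does the bookkeeping, and only the first write to a clean page takes the slow path.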
Meanwhile, the more dedicated architectures were burning valuable real estate on these features and couldn’t keep up elsewhere. You saw this in the article when the author wrote about gates that could have been used elsewhere.
Finally, some decisions box you in — the 64KB object size limitation being an example in the 432. Sure, you can work around it, but then the support for these objects becomes a deadweight (part of the RISC argument).
You see this also in the use of GPUs as huge parallel machines, even though the original programming abstraction was triangles.
Going back to my first sentence about “at the margins”: optimize at the end. Apple famously added a “jvm” instruction — must have been the fruit of a lot of metering! Note that they didn’t have to do this for Objective-C: some extremely clever programming made dispatch cheap.
Tagging/unboxing can be supported in a variety of (relatively) inexpensive ways: by using ALU circuitry otherwise idle during address calculation, or (more likely these days) by implementing a couple of in-demand ops. Either way it's pretty cheap.
Finally, we do have a return to and flourishing of separate, specialized functional units (image processors, “learning” units and such, like, say, the database hardware of old). They aren’t generally fully programmable (even if they have processors embedded in them), but the key factor is that they don’t interfere (except via some DMA) with the core processing operations.
“Going back to my first sentence about “at the margins”: optimize at the end. Apple famously added a “jvm” instruction — must have been the fruit of a lot of metering! Note that they didn’t have to do this for Objective-C: some extremely clever programming made dispatch cheap.”
I’m struggling to think of what you are referring to here. ARM added opcodes for running JVM bytecode on the processor itself, but I think those instructions were dropped a long time ago. ARM also added an instruction (floating-point convert to fixed-point, rounding towards zero) as it became such a common operation in JS code. There have also been various GC-related instructions and features added to POWER, but I think all that was well after Apple had abandoned the architecture.
GP probably meant a “JS” instruction rather than a “JVM” one: FJCVTZS, “Floating-point Javascript Convert to Signed fixed-point rounding towards Zero”[1,2], introduced in ARMv8.3 at Apple’s behest (or so it is said). Apparently the point is that ARM float-to-integer conversions normally saturate on overflow while x86 reduces the integer mod 2^width, and JavaScript baked the x86 behaviour into the language.
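For concreteness, here's roughly what a JS engine has to do without the instruction, assuming the ECMAScript ToInt32 definition (js_toint32 is just an illustrative helper, not any engine's actual code):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* ECMAScript ToInt32: truncate toward zero, reduce mod 2^32, reinterpret
       as a signed 32-bit value.  FJCVTZS does this in one instruction; a
       plain ARM FCVTZS would saturate instead (2^32 + 5 -> INT32_MAX). */
    static int32_t js_toint32(double d) {
        if (!isfinite(d)) return 0;           /* NaN, +/-Inf -> 0 in JS */
        double t = trunc(d);                  /* round toward zero */
        double m = fmod(t, 4294967296.0);     /* reduce mod 2^32 ...   */
        if (m < 0) m += 4294967296.0;         /* ... into [0, 2^32)    */
        uint32_t u = (uint32_t)m;
        int32_t r;
        memcpy(&r, &u, sizeof r);             /* reinterpret as two's complement */
        return r;
    }

    int main(void) {
        printf("%d\n", js_toint32(4294967296.0 + 5.0));   /* 5, the JS/x86 answer */
        printf("%d\n", js_toint32(-1.5));                 /* -1 */
        return 0;
    }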
Not adding tagging is basically a negligence crime. That feature isn't that expensive, and it could have saved most of the security issues that have happened in the last 20+ years.
I think you are talking about different types of tagging. Tagging on Lisps and other language VMs is where some bits in a pointer are reserved to indicate the type. So with a single tag bit integers might be marked as type 0 (so you don’t need to shift things when doing arithmetic) and other objects would be type 1. This provides no real protection against malicious code at all. There are other types of pointer tagging that do provide security, and we are starting to see some support in hardware.
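A quick sketch of that scheme in C, assuming a low tag bit of 0 for fixnums and 1 for heap pointers (the names are made up for illustration): because the fixnum tag is all zeros, add and subtract are just the machine's own add and subtract, with no shifting or masking.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef intptr_t value;               /* one machine word; low bit is the tag */

    #define TAG_FIXNUM  0
    #define TAG_POINTER 1

    static value    box_fixnum(intptr_t n) { return (value)(n * 2) | TAG_FIXNUM; }
    static intptr_t unbox_fixnum(value v)  { return v >> 1; }  /* relies on arithmetic right shift */
    static int      is_fixnum(value v)     { return (v & 1) == TAG_FIXNUM; }

    static value fixnum_add(value a, value b) {
        assert(is_fixnum(a) && is_fixnum(b));   /* the check hardware could hide */
        return a + b;                           /* tags are 0, so no untag/retag */
    }

    int main(void) {
        value x = box_fixnum(20), y = box_fixnum(22);
        printf("%ld\n", (long)unbox_fixnum(fixnum_add(x, y)));   /* 42 */
        return 0;
    }

The only overhead left is the tag check itself, which is exactly the sort of thing the cheap hardware support mentioned upthread would hide.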
It depends - at the time capability machines were all the rage. The idea there is that you add EXTRA bits to memory (say 33-bit or 34-bit memory; Burroughs large systems were 48-bit machines with 3-bit tags, plus parity), with careful rules about how pointers could be made, so that they could not just be created from integers.
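In software terms the rule looks something like this sketch (purely illustrative; real capability machines enforced it in hardware or microcode, not with a struct):

    #include <stdint.h>
    #include <stdio.h>

    /* Every word carries a tag that ordinary code cannot write.  Arithmetic
       only ever produces DATA words, and dereferencing demands a POINTER
       word, so an integer can never be turned into an address. */
    typedef enum { TAG_DATA, TAG_POINTER } tag_t;

    typedef struct {
        tag_t    tag;        /* the "extra bits" alongside the word */
        uint64_t bits;
    } word_t;

    static word_t make_data(uint64_t n) { return (word_t){ TAG_DATA, n }; }

    /* Only the "system" (allocator, loader) gets to mint pointers. */
    static word_t system_make_pointer(void *p) {
        return (word_t){ TAG_POINTER, (uint64_t)(uintptr_t)p };
    }

    static word_t add(word_t a, word_t b) {        /* user-level ALU op */
        return make_data(a.bits + b.bits);         /* result is always DATA */
    }

    static uint64_t load(word_t w) {               /* user-level dereference */
        if (w.tag != TAG_POINTER) {
            fprintf(stderr, "tag trap: not a pointer\n");
            return 0;                              /* hardware would fault here */
        }
        return *(uint64_t *)(uintptr_t)w.bits;
    }

    int main(void) {
        uint64_t cell = 42;
        word_t ok  = system_make_pointer(&cell);
        word_t bad = add(make_data(0x1000), make_data(0x234));  /* forged "address" */
        printf("%llu\n", (unsigned long long)load(ok));   /* 42 */
        load(bad);                                        /* trapped: tag is DATA */
        return 0;
    }

Since user-level arithmetic can only ever produce DATA words, and only the system mints POINTER words, there is no way to conjure an address out of an integer.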
At the time memory was really expensive, so blowing lots of bits on tags was a real issue (late 70s we bought 1.5Mb of actual core for our Burroughs B6700 for $1M+). Plus, as memory moved onto chips, powers of two became important, getting someone to make you a 34-bit DRAM would be hard, much less getting a second source as well.
"powers of two became important, getting someone to make you a 34-bit DRAM would be hard, much less getting a second source as well".
I have several pounds of 9-bit memory. It's 'parity' memory, which was fairly common back in the day, and adding another bit just means respinning the SIMM carrier to add a pad for another chip. Of course, I don't know if anyone is making 1- or 4-bit DRAMs any more, so you might be stuck adding an 8-bit or larger additional chip. Memory is probably cheap enough now that if you wanted 34 bits to play with and were somehow tied to a larger power of two, you could just go to 40 or more bits and do better ECC or more tag space, or just call the top 6 'reserved' or something. It's a solvable problem.
I have built CPUs with 9-bit bytes (because of subtle MPEG details); they made sense at the time, and RAM with 9-bit bytes was available at a reasonably low premium. That probably wasn't true when the 432 was a thing - RAM was so much more expensive back then.
You could get special-sized DRAM made for you, but it wouldn't be cheap, and getting a second source would be even more expensive - you'd have to be an Intel, IBM, or someone of that size to guarantee large enough volumes to get DRAM manufacturers to bite.
If you'll notice, memory is often made on a carrier (e.g. a SIMM module). So you don't have to find someone to make you an x9 or x34 or whatever-bit-wide chip; you find someone to make a carrier out of off-the-shelf parts with enough chips for your word width (possibly burning some bits). Early 9-bit SIMMs had two 4-bit-wide DRAMs and one 1-bit-wide DRAM. You just need a memory controller that makes sense of it.
(I design memory controllers ...) You can sort of do that, depending on where your byte boundaries are (and whether your architecture needs to be able to do single-byte writes to memory) - though more I was trying to point out that historically just 'burning some bits' was not something you could practically do cost-wise (it's why we built a 9/72/81-bit CPU in the 90s rather than a 16/125/128 one - the system cost of effectively doubling the memory size would not have made sense).
These days (and actually in those days too) the memory transfer size isn't really the width of the memory bus; it's often a power-of-two multiple of it. Those 9-bit RAMBUS DRAMs we were using really moved data on both edges of a faster clock, so our basic memory transfer unit was 8 clock edges x 9 == 72 bits per core clock. As a designer, with even 1 DRAM out there that's the minimum amount you can deal with, and you'd best design to make the most of it.
My point was there was more than one way of solving the problem (economically optimized or not), and having custom width memory silicon wasn't the only answer. But sure, if you move the goalpost around enough you get to be right.
That's an incredibly bad argument, because these are the same computers the NSA and the US government itself use, and they are exposing themselves. Not to mention all the US companies who are subject to intellectual property theft and so on.