I bought an Alpha for a project that needed a lot of directly addressable memory; it was the first 64-bit architecture that was affordable, and I ran RedHat on it. That box paid for itself within the first week.
It was also the first (non-research) processor I'm aware of that was designed from the ground up to be 64-bit, without a 32-bit addressing mode. Of course, as long as you could get the OS to allocate only within a given 4 GB range, you could emulate 32-bit pointers by storing only 32-bit offsets.
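A minimal sketch of that trick in C, assuming a single 4 GB arena whose base address is known (the names here are mine, not from any particular system):

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Hypothetical example: one 4 GB arena, "pointers" stored as 32-bit offsets. */
static char *arena_base;            /* set once, e.g. by mmap()ing a 4 GB region */

typedef uint32_t ref32;             /* a compressed pointer into the arena       */

static inline ref32 to_ref(void *p) {
    ptrdiff_t off = (char *)p - arena_base;
    assert(off >= 0 && off <= (ptrdiff_t)UINT32_MAX);  /* must stay in the window */
    return (ref32)off;
}

static inline void *from_ref(ref32 r) {
    return arena_base + r;          /* one add to widen back into a real pointer */
}
```

It's essentially the same idea as the compressed pointers ("compressed oops") that 64-bit JVMs use today.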
The processor's firmware (PALcode) was essentially a single-tenant hypervisor, and the OS kernel made upcalls to the firmware in order to execute any privileged operations. Had the architecture survived longer, this would have been handy for virtualization. Modern OS kernels have special cases for upcalls when running on top of hypervisors in order to avoid some of the overhead of the trap-and-emulate code in the hypervisor.
The designers were brutal in only including instructions that could show a performance improvement in simulations. The first versions of the processor didn't have single-byte loads or stores, presuming that the standard library string functions would load and store 64-bit words at a time and perform any necessary bit manipulation in registers. They later relented and included an instruction set extension for single-byte operations.
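Roughly what a single-byte store had to turn into before that extension, written as portable C rather than the actual Alpha byte-manipulation instruction sequence (EXTBL/INSBL/MSKBL):

```c
#include <stdint.h>

/* Read the containing aligned 64-bit word, splice the byte in with shifts
   and masks, write the whole word back.  Note the read-modify-write is not
   atomic; little-endian byte numbering assumed. */
static void store_byte(uint8_t *addr, uint8_t value) {
    uintptr_t a     = (uintptr_t)addr;
    uint64_t *word  = (uint64_t *)(a & ~(uintptr_t)7);   /* containing aligned word */
    unsigned  shift = (unsigned)(a & 7) * 8;             /* target byte position    */
    uint64_t  w     = *word;
    w &= ~((uint64_t)0xFF << shift);       /* clear the target byte */
    w |=  (uint64_t)value << shift;        /* insert the new byte   */
    *word = w;
}
```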
They were also famously brutal in their memory model, leaving as much leeway as possible for the hardware to reorder operations. As long as you're correctly using mutexes to protect shared state, the mutex acquire and release code will properly synchronize all of your memory operations. However, if you're implementing lock-free data structures, the Alpha is particularly liberal in its read ordering, and you need read fences on the reader side of lock-free structures, which is unusual. Experience has shown that for most code the potential performance improvements aren't very significant, especially considering the increased potential for concurrency bugs.
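A sketch of the classic publish/consume pattern in C11 atomics that illustrates the reader-side issue (the struct and names are just for illustration):

```c
#include <stdatomic.h>
#include <stddef.h>

struct msg { int payload; };

static _Atomic(struct msg *) shared = NULL;

/* Writer: initialize the object, then publish the pointer with release. */
void publish(struct msg *m) {
    m->payload = 42;
    atomic_store_explicit(&shared, m, memory_order_release);
}

/* Reader: on Alpha the dependent read m->payload may be satisfied from a
   stale cache line even though the pointer itself is new, so the pointer
   load needs acquire (or consume) semantics.  A relaxed load plus the data
   dependency is not enough there, unlike on x86, ARM or POWER, which is
   why things like Linux's smp_read_barrier_depends() exist. */
int consume(void) {
    struct msg *m = atomic_load_explicit(&shared, memory_order_acquire);
    return m ? m->payload : -1;
}
```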
I loved that box. It worked for many years, and when we finally shut it down it really felt like the end of an era. This was when I emigrated to Canada, where I stayed until 2007; it would have been nice to take it along, but we were shipping enough stuff across the Atlantic as it was.
I'm pretty sure that if you had dropped that from the 10th floor of a random office building you'd be fined for damage to the pavement, but the machine would have still worked ;) It also took two people to lift it.
I had a couple of Alphas in my home lab up until about 9-10 years ago (a largish DEC3000, a 'generic' 164PC, a DS20). Even as elderly boxes they were astonishingly well built, performant enough to do real work on, and gave a useful 'not an x86' check when I was testing for portability and such. However, I figured out that they used a significant fraction of all the power consumed in the lab and generated heat like furnaces. When I did a tech refresh they regrettably had to go. The 3000 in particular seemed like it was designed to go into combat.
Ah yes, the power consumption... don't get me started on that one. That box alone probably took more power than whatever else was living in that rack :)
And now your average phone has more CPU power and more storage...
I switched to 100% solar power here a few months ago and have powered down all of the more beefy stuff; power consumption went from 30 kWh/day to < 10...
To be fair, much of the potential of such a memory model is only unlocked under fairly specific conditions. The interesting property is that a data dependency between loads (pointer chasing, or anything else where an earlier load's result is used to compute a later load's address) does not force the loads to execute, i.e. pull their data out of the cache and into registers, in matching order. That only pays off once you have substantial contention, or some contention plus latency between the cache and actual memory (say across PCIe or some other NUMA fabric, or when accessing Optane memory), combined with load-address speculation, which can happen automatically in hardware or through explicit prefetch instructions.
In the absence of a read fence, you can speculate the address of a second load, execute this second load in parallel with the first load, and ignore a cache invalidation hitting the second load (or just not wait until you can rule out an asynchronously transmitted one), as long as the address (computed from the first load's data) was correctly predicted.
It hits even harder when the predicted address was a cache hit and the first load experienced a cache miss, because now you can speculate execution using the second load's data (delivered from the cache) and retain/confirm/retire the results of the speculated computation as soon as the first load's data returns and the computation of the second load's address confirms the speculated one.
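For the explicit-prefetch flavour of this, a small sketch (the hardware speculation described above happens on its own; __builtin_prefetch is the GCC/Clang hint for the software-driven version, and the struct here is purely illustrative):

```c
#include <stddef.h>

struct node { struct node *next; long value; };

/* Walk a linked list, hinting the next node's cache line in while we still
   work on the current one, so the dependent load can overlap with useful
   work instead of serializing behind it.  The prefetch is only a hint and
   never faults, so passing NULL at the end of the list is fine. */
long sum_list(struct node *n) {
    long total = 0;
    while (n) {
        __builtin_prefetch(n->next);
        total += n->value;
        n = n->next;
    }
    return total;
}
```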
An example is read-only access to data structures with pointer chasing while a different core performs copying garbage collection. Because the data is (semantically) read-only, the old copy and the new copy are both equally valid, and as long as you don't accidentally read the new space before the copy was written into it, you can pointer-chase freely through these structures reading the next e.g. linked-list entry from either the old or the new place (if you speculate correctly).
Critically, this could get by with invalidating only cached data for the copy target range, ensuring readers don't get the uninitialized data, without invalidating their cache of the copy source range. Of course that would require sufficiently targeted invalidation.
Other cases, like typical union-find / disjoint-set data structures, work just fine with standard fence-free Alpha memory accesses, at the slight cost of `union` operations not coherently affecting the outcomes of `find` operations. That's often not a problem, though, as parallel applications already have to cope with the `union` racing the "subsequent" `find` operations (and ending up with the `union` happening last).
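For example, here is a hedged C11 sketch of `find` with path halving where every access is relaxed, i.e. a plain fence-free load or store on Alpha; the linking rule and names are my own simplifications, not any particular library's API:

```c
#include <stdatomic.h>

#define N 1024

/* parent[i] == i means i is a root.  Union links always point from a
   smaller index to a larger one, and path halving only shortcuts to
   ancestors, so parent chains strictly increase and can never cycle. */
static _Atomic int parent[N];

void init(void) {
    for (int i = 0; i < N; i++)
        atomic_store_explicit(&parent[i], i, memory_order_relaxed);
}

/* find with path halving; a racing union may make the answer "a root that
   was correct a moment ago", which is the trade-off described above. */
int find(int x) {
    for (;;) {
        int p = atomic_load_explicit(&parent[x], memory_order_relaxed);
        if (p == x)
            return x;
        int gp = atomic_load_explicit(&parent[p], memory_order_relaxed);
        /* point x at its grandparent; losing this race to another find is fine */
        atomic_store_explicit(&parent[x], gp, memory_order_relaxed);
        x = p;
    }
}

/* union: link the smaller root under the larger one with a CAS, retrying if
   someone else linked it first (no ranks or sizes, illustrative only). */
void unite(int a, int b) {
    for (;;) {
        int ra = find(a), rb = find(b);
        if (ra == rb)
            return;
        if (ra > rb) { int t = ra; ra = rb; rb = t; }
        int expected = ra;
        if (atomic_compare_exchange_strong_explicit(&parent[ra], &expected, rb,
                memory_order_relaxed, memory_order_relaxed))
            return;
    }
}
```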
One thing that I found pretty interesting is the handling of exceptions: you could basically delay dealing with them and then, at the end of a block, check whether anything had happened.
I'm not an architect and don't know enough about the topic, but I thought that might be something interesting for RISC-V. I'd love to read about the advantages and disadvantages of that.
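The same "run the block, then ask once whether anything went wrong" style exists in portable form in ISO C's floating-point environment, which gives a feel for it even though it deals in FP status flags rather than Alpha's imprecise arithmetic traps (which software fenced with the TRAPB barrier instruction when it needed precision):

```c
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

/* Run a whole block of arithmetic without reacting to individual faults,
   then check once at the end whether anything misbehaved. */
double sum_reciprocals(const double *v, int n) {
    feclearexcept(FE_ALL_EXCEPT);

    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += 1.0 / v[i];            /* may divide by zero, overflow, ... */

    if (fetestexcept(FE_DIVBYZERO | FE_OVERFLOW | FE_INVALID))
        fprintf(stderr, "something went wrong in this block\n");
    return total;
}
```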
It's quite simple, actually: a customer needed a very large database (> 4 GB), and one proposal was to build some kind of sharding mechanism, because every way they could think of to keep it in RAM required a fairly large number of machines. That complicated updates, queries and keeping everything synchronized, besides requiring a rack full of hardware.
The Alpha made all of that moot because in one fell swoop it increased the amount of RAM that could be addressed directly to the point where the whole thing could happen in memory, without any cluster communications overhead. It was still an expensive machine, but it cost a fraction of the setup that it replaced and performed very well. A nice example of how vertical scaling can be a very viable option. The 64-bit file system also allowed for much larger files, which helped the project in other ways.
One downside was that spare hardware was difficult to obtain, but the system was built like a tank and ran for many years, until there were plenty of other suppliers of 64-bit systems.
It was way ahead of anything else in the 'affordable' range of computers, though it still cost as much as a nice car fully decked out; the RAM in particular was quite expensive.
Having a 64-bit system at that time could also have really helped with implementing super-fast virtual machines. You have so much space to store information in pointers.
Azul later realized some of these ideas for Java. Building a virtual machine, and even a language, designed from the ground up to take advantage of that would have been cool.
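As an illustration of what "information in pointers" can mean, here's a toy tagged-value scheme of the sort dynamic-language VMs use on 64-bit machines; this is not Azul's or any specific VM's layout, just the general idea:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy tagged-value scheme: keep small integers directly in the word and use
   the low bit (free because aligned objects have it clear) to say which case
   we're in.  Assumes the usual arithmetic right shift for signed values. */
typedef uint64_t value;

static value   box_int(int64_t i)  { return ((uint64_t)i << 1) | 1u; }
static value   box_ptr(void *p)    { return (uint64_t)(uintptr_t)p; }

static int     is_int(value v)     { return (v & 1u) != 0; }
static int64_t unbox_int(value v)  { return (int64_t)v >> 1; }
static void   *unbox_ptr(value v)  { return (void *)(uintptr_t)v; }

int main(void) {
    static double heap_obj;                 /* stand-in for an aligned heap object */
    value a = box_int(-7);
    value b = box_ptr(&heap_obj);

    printf("a is %s, value %lld\n", is_int(a) ? "int" : "ptr",
           (long long)unbox_int(a));
    printf("b is %s\n", is_int(b) ? "int" : "ptr");
    (void)unbox_ptr(b);
    return 0;
}
```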