A little comment regarding the Z80 vs 6502 register count: I always consider the 6502 to have 256 registers (zero page) and an accumulator and two index registers. Never a shortage if you use it like that...
On a related note: I wish that more CPUs had an explicit cache, where data has to be explicitly loaded into the cache by the program.
Modern CPUs are effectively NUMA. Don't treat memory as uniform random-access memory any more, because that's not true.
The biggest problem with this is that not all CPUs have the same amount of cache. But you can get around this by treating the cache as the low area of RAM, with an instruction to query the amount of cache available, especially if the cache is also paged.
The other issue with this is context switches, but this is conceptually no different from paging RAM to disk when required.
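To make the "query the amount of cache, then work within it" idea concrete, here is a rough sketch in Python. Nothing here corresponds to a real instruction set: `scratch_bytes` stands in for whatever a hypothetical "how much cache do I have" instruction would return, and the `bytearray` copy stands in for an explicit load into cache. The point is only that the same program gives the same answer whatever size it is told.

```python
def tile_sizes(total_bytes, scratch_bytes):
    """Split a working set into chunks that each fit in the scratchpad."""
    if scratch_bytes <= 0:
        raise ValueError("scratchpad size must be positive")
    full, rest = divmod(total_bytes, scratch_bytes)
    return [scratch_bytes] * full + ([rest] if rest else [])


def process(data, scratch_bytes):
    """Sum the bytes of data, staging one scratchpad-sized chunk at a time.

    scratch_bytes is a hypothetical value, as if reported by the CPU.
    """
    total = 0
    offset = 0
    for size in tile_sizes(len(data), scratch_bytes):
        # Stand-in for an explicit "load this block into cache" operation.
        scratch = bytearray(data[offset:offset + size])
        total += sum(scratch)
        offset += size
    return total
```

A machine with a 4-byte scratchpad and one with a 64-byte scratchpad would run the same code; only the number of staging steps changes.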
I'm not too familiar with many other architectures, but MIPS has dcache "fill", "flush", and "lock" operations, so the user can either do a preemptive fill operation, or even lock data in the cache so it won't be evicted.
I haven't seen many people actually use these ops, because it's actually pretty hard to do better than the built-in cache allocation policies for most applications, especially if you take into account that your app is going to get swapped out consistently by the operating system task switches.
There is a distinction between allowing said operations and being designed for such operations. It is possible even in x86, although difficult, and requires privileged operations. (A user-mode program can request that something be prefetched or flushed, and can do non-temporal loads (and stores?), but in order to get "true" scratchpad memory you have to play with the MTRRs, and even then the processor doesn't support hardware paging of cache the way it does with, for example, RAM.)
> I haven't seen many people actually use these ops, because it's actually pretty hard to do better than the built-in cache allocation policies for most applications, especially if you take into account that your app is going to get swapped out consistently by the operating system task switches.
And again, this is largely because the cache is implicit to the OS. There's no way to tell the processor "this is the stuff that was cached the last time this process had control; when you can, reload it", because you can't tell what in cache is "owned" by what in anything like an efficient manner. And even if you could, the moment you start executing a context switch you've overwritten random bits of cache.
It's like if the processor was set up to directly talk to the hard drive to do paging on demand, to the point that the OS wasn't even aware of it. In theory it's a good idea, but the more you look at it the more flaws emerge.
If your PPC is using a Discovery PHC you can map half or all of your L2 to a block of PAs and then map it wherever you want with VM. I'm sure this was because of the experience that Genesis had with MIPS. It's a nifty feature.
In comparison with the Z80 it's roughly equivalent. It took about the same time to load/store indirectly in/from zero-page as it took the Z80 to load/store directly from registers. The 6502 ran at roughly 2 cycles per instruction (average), while the Z80 took about 3 M-cycles (3~6 T-cycles per M-cycle) per instruction on average. The 6510 (6502-compatible but with some extra I/O) in the Commodore 64 could out-pace the Z80 in the Spectrum 48, which ran at more than triple the clock speed, except maybe in some carefully crafted proof of concept.
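As a back-of-the-envelope check on those figures, here is the arithmetic in Python. The clock speeds are my assumption (about 1 MHz for the C64 and 3.5 MHz for the Spectrum, consistent with "more than triple"); the cycle counts are the rough averages quoted above, and instructions on the two CPUs don't do the same amount of work, so treat this as a sketch rather than a benchmark.

```python
def instructions_per_second(clock_hz, cycles_per_instruction):
    """Rough throughput given a clock rate and an average cycle count."""
    return clock_hz / cycles_per_instruction


# Assumed clock speeds, rounded; not taken from the post above.
C64_CLOCK_HZ = 1_000_000
SPECTRUM_CLOCK_HZ = 3_500_000

# 6502/6510: about 2 clock cycles per instruction.
mos6502 = instructions_per_second(C64_CLOCK_HZ, 2)

# Z80: about 3 M-cycles per instruction, each M-cycle 3 to 6 T-cycles.
z80_best = instructions_per_second(SPECTRUM_CLOCK_HZ, 3 * 3)
z80_worst = instructions_per_second(SPECTRUM_CLOCK_HZ, 3 * 6)

print(f"6502: {mos6502:,.0f} instr/s")
print(f"Z80:  {z80_worst:,.0f} to {z80_best:,.0f} instr/s")
```

With these numbers the 6502 comes out ahead of even the best-case Z80 estimate, despite the much lower clock.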
Conversions to C64 from Spectrum were routinely done by hand, routine by routine, as the C64 could "emulate" the Speccy this way. For instance, games like The Great Escape ( http://www.crashonline.org.uk/35/greatescape.htm ) were converted this way, as were other games (it was a bit disappointing when this happened, as it basically meant not using the C64 graphic sprites and its tricks).
It'd certainly be interesting to write a C compiler codegen target for the 6502 that treated the zero-page as registers (and a lot easier to map modern compiler intrinsics to than the tiny base register set.)
Some years ago I worked on a C compiler for a processor, which, like the 6502, had no general-purpose registers and a ‘fast page’, and we did exactly that.
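For anyone wondering what "zero-page as registers" might look like in a codegen, here is a toy sketch in Python. It is not how any real compiler does it: the `ZeroPageAllocator` class, the `emit_add` helper and the choice of $10 as the first free slot are all made up for illustration. It just hands each variable a zero-page address and emits ordinary 6502 zero-page instructions for an 8-bit add.

```python
class ZeroPageAllocator:
    """Hands out zero-page addresses as if they were registers (hypothetical)."""

    def __init__(self, first=0x10, last=0xFF):
        self.next_free = first
        self.last = last
        self.slots = {}

    def slot(self, name):
        """Return the zero-page address for name, allocating on first use."""
        if name not in self.slots:
            if self.next_free > self.last:
                raise MemoryError("out of zero-page slots")
            self.slots[name] = self.next_free
            self.next_free += 1
        return self.slots[name]


def emit_add(alloc, dest, lhs, rhs):
    """Emit 6502 assembly for the 8-bit statement dest = lhs + rhs."""
    return [
        "CLC",                           # clear carry before ADC
        f"LDA ${alloc.slot(lhs):02X}",   # accumulator <- lhs
        f"ADC ${alloc.slot(rhs):02X}",   # accumulator += rhs
        f"STA ${alloc.slot(dest):02X}",  # dest <- accumulator
    ]


alloc = ZeroPageAllocator()
code = emit_add(alloc, "c", "a", "b")
print("\n".join(code))
```

The accumulator ends up as a pure scratch value and the "registers" the allocator reasons about all live in zero page, which is the arrangement described above.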