The easiest way to shift bits isn't "adder like" with carry propagation across data lines (sorta), although that could be done. The easiest way is just to phase distort the bus (sorta), so what you called bit X is now bit X-4 for all bits on the bus, and use a 2:1 mux to select between the two. Then cascade the rollers in binary: you don't need a roller for every possible distance if you can cascade 16-, 8-, 4-, 2-, and 1-position stages, activating only some of them. If you can rotate, you can shift by adding one more stage at the end that does bit-oriented ops to clear the MSBs or LSBs. Also, if your roller is capable of rolling all bits on the bus, you don't need two rollers, one for each direction. Latency does add up, of course, with all those cascaded stages.
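The cascaded-stage idea above can be sketched in a few lines. This is just an illustrative software model (names and the 32-bit width are my assumptions, not from the comment): each fixed-distance rotate stage is gated by one bit of the shift amount, like a 2:1 mux per stage, and a shift is a rotate plus one final masking stage.

```python
# Sketch of the cascaded-rotator ("barrel shifter") idea: log2(width)
# fixed-distance rotate stages, each one gated by a single bit of the
# shift amount. Illustrative model only; width and names are assumptions.

WIDTH = 32

def rotate_right(value, amount, width=WIDTH):
    """Rotate right by cascading fixed stages of 16, 8, 4, 2, 1 bits.
    Each stage acts like a 2:1 mux: pass the bus through unchanged, or
    take the pre-wired rotated version of it."""
    mask = (1 << width) - 1
    for stage in (16, 8, 4, 2, 1):
        if amount & stage:  # this stage's mux select line
            value = ((value >> stage) | (value << (width - stage))) & mask
    return value

def shift_right_logical(value, amount, width=WIDTH):
    """If you can rotate, a shift is one more stage at the end that
    clears the bits that wrapped around (the MSBs, for a right shift)."""
    rotated = rotate_right(value, amount, width)
    keep = ((1 << width) - 1) >> amount
    return rotated & keep
```

Note that only five stages cover every rotate distance from 0 to 31, which is where the sublinear (log2) latency mentioned further down comes from.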
Also, I saw Rodney Zaks' Z80 book referenced in the notes. I learned assembly from that book, back when it was new. I enjoyed that book. I also looked at the PDF scan of someone's beat-up copy... remember when textbooks were only $10.95 each?
I still have my original copy of Zaks' book in my collection. I obtained it in 1981 from a friend. I did so much z80 assembly on the TRS-80 that I could "see" instructions in a hex dump before asking zbug to disassemble them.
The Z80 was quite the chip for its time, and I enjoy reading these articles as they look back into the deeper inner workings of the chip.
One could master the Z80; I'm not sure many modern processors could be mastered at that level by a single programmer. They seem to have become so complicated that they're beyond comprehension in a single person's mind. I could be wrong, but that's the general feeling I get. Sure, you can be really good at them, and maybe an expert in many applications of a modern complex processor, but I doubt many people have a singular understanding of everything each chip has to offer.
The 8086, which descends from the same 8080 lineage as the Z80, is not that much more complicated (even the opcode structure is similar, being an octal format), and the same goes for the 80186. Things started becoming a bit more complex with the '286 and its protected mode, and then the '386+ with 32-bit and its various protected-mode enhancements really made it harder for a single person to master fully, but I think it was still mostly possible; the really hairy stuff started with 64-bit mode and all the extensions (e.g. virtualisation) that have been made to that since.
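For anyone unfamiliar with the "octal format" remark: Z80/8080 one-byte opcodes split cleanly into three octal digits (bits 7-6, 5-3, 2-0), and whole instruction groups fall out of that split. A quick sketch of the well-documented LD r,r' group (the helper names here are mine):

```python
# Z80/8080 opcodes decompose into three octal fields x-y-z
# (bits 7-6, 5-3, 2-0). The 8-bit register-to-register loads are
# simply x=1, y=destination, z=source.

REGS = ['B', 'C', 'D', 'E', 'H', 'L', '(HL)', 'A']

def decode_octal(opcode):
    x = (opcode >> 6) & 0b11
    y = (opcode >> 3) & 0b111
    z = opcode & 0b111
    return x, y, z

def decode_ld_rr(opcode):
    """Decode an opcode from the LD r,r' block (0x40..0x7F, except
    0x76, which would be LD (HL),(HL) and is HALT instead)."""
    x, y, z = decode_octal(opcode)
    assert x == 1 and opcode != 0x76
    return f"LD {REGS[y]},{REGS[z]}"

print(decode_ld_rr(0x78))  # 0x78 = octal 170 -> LD A,B
```

The 8086 reuses the same three-octal-digit idea in its mod-reg-r/m byte, which is part of why the two instruction encodings feel related.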
On the non-x86 side, modern ARM SoCs like the ones used in smartphones are not all that much simpler despite a less complex CPU core - they're still thousands of pages of documentation in total.
Blimey, I'd forgotten all about Rodney Zaks. I had a copy of his "Programming the 6502" book back in the early and mid 80's when I used to hack out BBC Micro sideways ROM odds and sods. What a blast from the past.
Meh, at least it's sublinear latency with the number of bits. (I'm assuming, based on the description. Unless there's something I'm missing, you've got log2(d) stages of O(1) latency.)
The nasty operations are the ones that are linear (or worse!).
A little comment regarding the Z80 vs 6502 register count: I always consider the 6502 to have 256 registers (zero page) and an accumulator and two index registers. Never a shortage if you use it like that...
On a related note: I wish that more CPUs had an explicit cache. So data has to be explicitly loaded into cache, etc.
Modern CPUs are effectively NUMA. Don't treat memory as uniform random access any more, because it isn't.
Biggest problem with this is that not all CPUs have the same amount of cache. But you can get around this by treating the cache as the low area of RAM, with instructions to get the amount of cache available. Especially if cache is also paged.
Other issue with this is context switches, but this is conceptually no different than paging RAM to disk when required.
I'm not too familiar with many other architectures, but MIPS has dcache "fill", "flush", and "lock" operations, so the user can both do a premature fill operation, or even lock data in the cache so it won't be evicted.
I haven't seen many people actually use these ops, because it's actually pretty hard to do better than the built-in cache allocation policies for most applications, especially if you take into account that your app is going to get swapped out constantly by operating system task switches.
There is a distinction between allowing said operations and being designed for such operations. It is possible even in x86, although difficult, and requires privileged operations. (A user-mode program can request that something be prefetched or flushed, and can do non-temporal loads (and stores?), but in order to get "true" scratchpad memory you have to play with the MTRR, and even then the processor doesn't support hardware paging of cache, like it does with, for example, RAM)
> I haven't seen many people actually use these ops, because it's actually pretty hard to do better than the built-in cache allocation policies for most applications, especially if you take into account that your app is going to get swapped out constantly by operating system task switches.
And again, this is largely because the cache is invisible to the OS. There's no way to tell the processor "this is the stuff that was cached the last time this process had control; when you can, reload it", because you can't track which cache lines are "owned" by which process in anything like an efficient manner - and even if you could, the moment you start executing a context switch you've overwritten random bits of cache.
It's like if the processor was set up to directly talk to the hard drive to do paging on demand, to the point that the OS wasn't even aware of it. In theory it's a good idea, but the more you look at it the more flaws emerge.
If your PPC is using a Discovery PHC, you can map half or all of your L2 to a block of PAs and then map it wherever you want with VM. I'm sure this came out of the experience that Genesis had with MIPS. It's a nifty feature.
In comparison with the Z80 it's roughly equivalent. It took the 6502 about the same time to load/store indirectly through zero page as it took the Z80 to load/store directly through its registers. The 6502 ran at roughly 2 cycles per instruction (average), while the Z80 took about 3 M-cycles (3-6 T-cycles per M-cycle) per instruction. The 6510 (6502-compatible, but with some extra I/O) in the Commodore 64 could out-pace the Z80 in the Spectrum 48, which ran at more than triple the clock speed, except maybe in some carefully crafted proof of concept.
Conversions from Spectrum to C64 were routinely done by hand, routine by routine, as the C64 could "emulate" the Speccy this way. For instance, games like The Great Escape ( http://www.crashonline.org.uk/35/greatescape.htm ) were converted this way (it was a bit disappointing when this happened, as it basically meant not using the C64's hardware sprites and other tricks).
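Taking the rough averages from the comment above at face value, the arithmetic does come out in the C64's favour despite the clock gap (the specific clock figures below are my assumptions for PAL machines, not from the comment):

```python
# Back-of-the-envelope check using the averages quoted above:
# ~2 cycles/instruction on the 6502, ~3 M-cycles of ~4 T-states
# each on the Z80. Clock figures are assumed PAL values.
C64_CLOCK = 985_248        # PAL C64 6510 clock, Hz
ZX_CLOCK = 3_500_000       # Spectrum 48K Z80 clock, Hz

instr_6502 = C64_CLOCK / 2        # ~2 cycles per instruction
instr_z80 = ZX_CLOCK / (3 * 4)    # ~3 M-cycles * ~4 T-states each

print(f"6502: ~{instr_6502:,.0f} instructions/s")
print(f"Z80:  ~{instr_z80:,.0f} instructions/s")
```

By those (admittedly rough) numbers the 1 MHz 6510 retires more instructions per second than the 3.5 MHz Z80, which matches the claim.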
It'd certainly be interesting to write a C compiler codegen target for the 6502 that treated the zero page as registers (and a lot easier to map modern compiler intrinsics to than the tiny base register set).
Some years ago I worked on a C compiler for a processor, which, like the 6502, had no general-purpose registers and a ‘fast page’, and we did exactly that.
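The core of that approach is simple enough to sketch: the code generator hands out zero-page addresses as if they were register names. This is a toy illustration only (the reserved range, one-byte slots, and all names are my assumptions, not a description of the compiler mentioned above):

```python
# Toy "zero page as register file" allocator: each variable gets a
# zero-page address that the codegen then uses as its register.
# Purely illustrative; layout and names are assumptions.

ZP_BASE = 0x10   # assume $00-$0F is reserved for the runtime
ZP_TOP = 0x100   # zero page ends at $FF

class ZeroPageAllocator:
    def __init__(self):
        self.next = ZP_BASE
        self.slots = {}

    def reg(self, var):
        """Return the zero-page address serving as `var`'s register."""
        if var not in self.slots:
            if self.next >= ZP_TOP:
                raise RuntimeError("out of zero page: spill needed")
            self.slots[var] = self.next
            self.next += 1  # one byte per 8-bit virtual register
        return self.slots[var]

# Emitting `x += y` against zero-page "registers":
alloc = ZeroPageAllocator()
print(f"LDA ${alloc.reg('x'):02X}")
print("CLC")
print(f"ADC ${alloc.reg('y'):02X}")
print(f"STA ${alloc.reg('x'):02X}")
```

The payoff is that zero-page addressing modes are both shorter and faster than absolute ones, so these pseudo-registers genuinely behave more like registers than like memory.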
Interesting that the flags are stored like the other registers. I had always assumed that the flag bits would be "spread out" across parts of the ALU, and only combined with the accumulator into a virtual register when you read the AF register. In fact they are stored in the register file and copied to and from the ALU on every(?) operation.
I don't know if the term "register renaming" was even around at the time since the other instructions (e.g. MOVs) don't "rename" registers, but they certainly had the right idea - "why move the data around when you can just change the register selection bits?" It's also how I expected the exchange instructions to work when I first saw the instruction set and the timings, so it's probably a rather natural and obvious way of doing it.
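The "change the selection bits instead of moving data" idea can be modelled in a few lines. This is a toy model of my own, not derived from the die analysis: two physical banks exist, and EXX / EX AF,AF' just toggle which bank the register names currently select.

```python
# Toy model of exchange-by-selector-flip: two physical register banks,
# and the exchange instructions toggle a selection bit rather than
# copying any data. Illustrative only; structure is an assumption.

class Z80Regs:
    def __init__(self):
        self.banks = [dict(BC=0, DE=0, HL=0), dict(BC=0, DE=0, HL=0)]
        self.af_banks = [0, 0]
        self.main = 0     # which bank the names BC/DE/HL select
        self.af_sel = 0   # which bank the name AF selects

    def __getitem__(self, name):
        if name == 'AF':
            return self.af_banks[self.af_sel]
        return self.banks[self.main][name]

    def __setitem__(self, name, value):
        if name == 'AF':
            self.af_banks[self.af_sel] = value
        else:
            self.banks[self.main][name] = value

    def exx(self):     # EXX: swap BC/DE/HL with the shadows - one bit flip
        self.main ^= 1

    def ex_af(self):   # EX AF,AF': likewise just a selector toggle
        self.af_sel ^= 1

r = Z80Regs()
r['HL'] = 0x1234
r.exx()               # shadow set now visible
r['HL'] = 0xBEEF
r.exx()               # flip back
print(hex(r['HL']))   # 0x1234 again - no data was ever moved
```

Because the exchange is just a selector flip, it takes the same time regardless of how many register bits are "swapped", which matches the suspiciously fast timings of EXX and EX AF,AF'.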
I did it the same way when I designed a Z80-compatible CPU in a (graphical) logic simulator, although I used a dual-ported register file.
http://en.wikipedia.org/wiki/Barrel_shifter