I always thought this was quite elegant; in the original ARM ISA most ALU operations and register moves (and even some load/store indexing) could pass through the barrel shifter at no extra cost, so a 'ROR' was just a move from a register to itself with a pass through the shifter. This made up (somewhat) for the low code density implied by fixed-length instructions and uniform load/store architecture. AArch64 removes this capability from most of the arithmetic operations I believe.
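A minimal sketch in classic (pre-UAL) ARM syntax, where the rotate really is spelled as a MOV with a shifted operand:

mov r0, r0, ror #4    @ 'ROR r0, #4': a register-to-register MOV through the shifter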
The design of the original ARM ISA is very interesting historically, mainly informed by many man-years of hand-optimising 6502 assembly. As such it was quite idiosyncratic: nice to write by hand, and rather awkward for compilers...
> could pass through the barrel shifter at no extra cost
There was always a cost. You had to spend those bits in the instruction encoding that could have been used for more registers, more instructions, more operands, or (the x86 choice) the ability to fit more (smaller) instructions in the instruction cache. The existence of this extra thing in the instruction data path meant you needed an extra cycle or two in the execute pipeline for every instruction, not just the ones with shifts. You also had to implement the single-cycle barrel shifter in hardware (this is something that smaller microcontrollers used to skip).
In fact that weird ARM shift field is broadly held to have been a mistake. Note that A64 skips it.
Well, yeah, run-time cost is what I meant. Originally this was nil (unless you had a register-specified shift) on the 'classic' 3-stage pipeline of the original ARMs (ARM1/2/3/6/7). As the pipeline got deeper this became more problematic, as you say, and higher frequencies make the silicon implementation awkward too. But even in A64, the barrel shifter is available on the second register operand of the non-arithmetic data-processing operations.
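A hedged A64 sketch of what survives (GNU syntax; worth double-checking the alias against the ARMv8 manual):

orr x0, xzr, x1, ror #4    // x0 = x1 rotated right by 4 (MOV itself is just ORR with xzr)
ror x0, x1, #4             // an alias: assembles to extr x0, x1, x1, #4 -- another non-1-to-1 mapping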
To let everyone in on a little !secret, the 'objections' to these encodings are satire, something that was laid on a bit thick at the end of this post. There have been no real objections to ARM so far. x86 is a different story, the AAD/AAM instructions being the biggest example: there, you can do something at the machine level that the assembly level abstracts away (conversions in bases other than 10). Regardless of any kind of usefulness, any non-1-to-1 mappings between abstractions highly interest me.
The BCD instructions aren't "too high level"; these are[1] real hardware operations that had real utility for real problems. In the late '70s, the modular math required to format decimal numbers for display could be a big chunk of your ROM budget, and these instructions eliminated the problem.
This is like saying SSE is "too high level" because you could just do all the operations independently with scalar math.
[1] Were, anyway. They're surely microcoded on modern processors.
I have no objection to the utility of something like AAD. What I'm saying is that this very same instruction can do more at the machine level. AAD assembles to D5 0A; D5 is the part that refers to AAD, while 0A is hardcoded for base 10. One could hand-encode something like D5 08 (to get base-8 conversions). You can use just about any base. Even the Intel manual states you can do this; you just have to do it at the machine-code level, you can't do it with the assembly-level AAD mnemonic (it's too high level, or abstracted). This is all I meant by my comment.
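A minimal NASM-style sketch of the trick (16-bit code assumed, since AAD is invalid in 64-bit mode; the db line is the whole point):

mov ax, 0x0107    ; AH=1, AL=7: the unpacked digits of octal 17 (15 decimal)
db 0xD5, 0x08     ; hand-encoded AAD with imm8=8: AL = AH*8 + AL = 15, AH = 0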
XlogicX wasn’t complaining about the existence of the AAD and AAM instructions. He was complaining that they had useful variants that couldn’t be expressed in assembly.
I see now. I probably didn't interpret it as satire because the post was pretty incoherent (and it's very hard for me to pay close attention to things that I perceive as incoherent), and I had no interest in reading the rest of your blog to see if the writing was in similar style.
edit: Also, I'm probably unfamiliar with the type of blog post you're satirizing because I don't read many programming blogs these days. I'd rather just read the documentation (and source code), and any other questions I have are best answered by just asking the computer.
Good satire, like all worthwhile endeavors, is hard. Don't let people like me who don't immediately get it stop you from practicing, though :)
You say this: "Looking at instruction encodings, ‘ROR r0, #0’ should be the same as ‘RRX r0, r0’." But don't you mean ROR r0, #1, since RRX is a shift of 1?
lol "isn't permitted", "unsupported", "undefined", are all trigger words for me; when I see them, it's the only thing I can think of doing. I get that more than 90% of the time something fucky is going to happen, doesn't stop me from wanting to know exactly what will happen. And sometimes, rarely, something really cool happens. In this case, doing ROR r0, #0 is just useless (as documented), and my 'objections' to it are satire. With that context, my 'rant' at the end of the blog should be more clear. And here I thought my satire was obvious. That said, can't say I don't love the serious technical discussion in these comments ;)
I 100% prefer text/book learning to lecture/video. But I guess not everyone is like us, so I tried experimenting with a dual format. Regarding the font, I never thought to change the ugly gray small font. I changed the CSS to make all 'paragraph' text 18pt (across the blog), for that specific post, it is now black. But be warned, if someone complains that it's too big, I will make it 32pt Comic Sans.
can't really comprehend why anyone would prefer a video to a well laid out and illustrated blog post
It's alright to have your own preferences, but so do other people. Some people (like me) prefer audiobooks instead of paper books, some people are militantly the other way around. It is what it is :-) Kudos, certainly, to people who share their work in multiple media.
I have ADD and video has its place, just not in this case.
A technical article such as this one is ideally disseminated using a regular web page. Pictures, text, code all in easy view and scrolled conveniently at will.
Compare that to a video, which in essence is an auto-scrolling page that can only be paused or slightly slowed down. You wind up pausing, rewinding, and skipping parts. Annoying.
I don't quite understand the objection here; this is fairly elegant in my opinion: you have a single opcode used to encode several operations by way of "special cases". In my experience it's rather common in RISC ISAs.
If the author doesn't like this they shouldn't look into MIPS, because it goes well beyond that. You see, MIPS has a special "R0" register that's always 0 (AArch64 has one as well, by the way) so you can always use it as a placeholder in other instructions.
As such, there's no real MOVE instruction; it's just an assembler mnemonic that assembles down to `OR $target, $src, $R0`. NOP? It's by convention `SLL $R0, $R0, 0` (which has the nice property of being the instruction encoded as "0x00000000"). You want to negate a number? `SUB $target, $R0, $src`.
Since all instructions are 32 bits wide you can't load a 32-bit immediate value in a single instruction; instead, the assembler's "LI" mnemonic generates a pair of instructions (LUI/ORI) for large immediate values (ARM prefers PC-relative loads).
You have a whole bunch of mnemonics in MIPS that are just aliases around other instructions. I always thought it was pretty clever.
In summary you can have this assembler listing:
1:
sll $0, $0, 0
or $t0, $t1, $0
li $t0, 0xabcdef
sub $t0, $0, $t1
j 1b
That will disassemble to:
nop
move t0,t1
lui t0,0xab
ori t0,t0,0xcdef
j 0x0
neg t0,t1
The only operation here that I would qualify as "high level" is the reordering of the "neg" instruction into the delay slot (note that j is no longer the last instruction). Everything else is very straightforward substitution, and if the assembler didn't support these mnemonics we could implement them with very trivial macros.
Note that even x86 assemblers do that to some extent; for instance "nop" assembles down to an instruction with no side effect (typically the encoding of `xchg eax, eax`). Furthermore there are a bunch of mnemonics for the same encoding, for instance JAE (jump if above or equal), JNB (jump if not below) and JNC (jump if not carry). Overall instruction encoding is also massively more complicated in x86 (and even more so for amd64), so the assembler needs to handle many more corner cases than the simple substitutions of ARM and MIPS. As a brain teaser, consider the following similar-looking amd64 instructions, which store the 32-bit register %eax to the address held in another register (the only difference is that the first one dereferences the pointer in %rax, the second in %r12):
mov %eax, (%rax) ; assembles to 89 00
mov %eax, (%r12) ; assembles to 41 89 04 24
I can't even be bothered to walk you through this, but basically it has to do with the fact that %r12 happens to be encoded as %rsp + 8 (because registers r8 to r15 are effectively a hack, since x86 only supported 8 GPRs) and %rsp has special semantics in this addressing mode which mandate a different, longer encoding, otherwise you'd end up with an ambiguous instruction.

Yeah, I think in retrospect we can give ARM a pass for their ROR shenanigans.
I love x86 complications (ARM is way more elegant in comparison), and I WILL be bothered to walk through mov %eax, (%r12), or 41 89 04 24
----
41 - the amd64 REX prefix with the B bit set, which 'unlocks' the extended registers; here it's what makes the base that would otherwise decode as %rsp decode as %r12
----
89 - still the same MOV instruction as the first one in mov %eax, (%rax)
----
04 - The ModR/M byte that specifies the %eax part, and [--][--] as the 'Effective Address', which is another way of saying that we need a SIB byte because (%r12) doesn't have a simple encoding with the ModR/M byte alone (the same would be needed even for a plain (%rsp)).
----
24 - A 'Scaled Index' of none and as simias stated, RSP (but R12 in the context of the 41 prefix for this instruction).
----
All of this is easier to visualize when using the ModR/M and SIB tables in Volume II of the Intel manual. In my copy of the manual it's pretty early on in Chapter 2 (2.1.5 Addressing-Mode Encoding of ModR/M and SIB Bytes).
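For reference, here's my rough bit-level reading of those four bytes against the tables (worth double-checking in the manual):

41    ; REX prefix: 0100WRXB with B=1, extending the SIB base field
89    ; MOV r/m32, r32
04    ; ModR/M = 00 000 100: mod=00, reg=000 (%eax), rm=100 (SIB byte follows)
24    ; SIB = 00 100 100: scale=00, index=100 (none), base=100 (%rsp; %r12 with REX.B)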
Flattening the ROR and RRX special cases seems pretty straightforward to me as well. That's why I thought the punch line was at the end of the post, though:
ror r0, #0
gets assembled to
mov r0, r0.
I see how both are nops and assume they affect state congruently, but why is one nop encoding preferred over another? Does the architecture do something desirable when encountering the MOV incarnation vs. any other?
The point is that the binary encoding is the same for ROR and MOV, and only the disassembly of the binary is special cased: if it has a shift, it's disassembled into the ROR text with the shift otherwise it is displayed as a MOV. RRX is a special case to ROR as ROR is to MOV.
The ARM Instruction Set PDF I'm looking at only lists MOV as a real instruction -- one with a distinct opcode -- out of the above three.
It's MOV and LSL that have the same binary encoding; everything is identical with the exception of the imm5 field vs. the hardcoded 0's. MOV and LSL share the same op2 field of '00'. ROR and RRX share the op2 field of '11'. ROR and MOV have a similar binary encoding, but the op2 field is distinctly different in this case.
As a note, I'm basing this off of the v7-A and v7-R manual.
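To make the shared layout concrete, a rough sketch of the A32 data-processing fields as I read that manual (worth double-checking against the real encoding diagrams):

cond 000 1101 S 0000 Rd imm5 op2 0 Rm
  op2=00: LSL, and imm5=0 decodes as MOV
  op2=11: ROR, and imm5=0 decodes as RRX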
Well, there's the problem: the v7-A and v7-R manual.
That manual uses unified assembly syntax, meaning that old ARM and Thumb are described together. Because of irregularities in Thumb, the old ARM instructions end up being described in a way that is needlessly verbose. You lose the insight into how the opcodes are actually decoded.
Look at an older manual. ARMv5 will do nicely. There, you can see that the MOV instruction is mostly described by 2 bits (one condition code is stolen). In an even older ARM, such as ARMv4 I think, it really is just 2 bits.
Unless I'm missing something, the encodings are different at bits 6 and 7. Looking at xlogicx's post, MOV has them set to 0 while ROR has them at 1.
On a higher level, this post got me to realize that there could be ops (in this case nops) that have equivalent effects but are encoded as different instructions. Even though registers might state-change equivalently, I can see that the internal processor state might mutate differently. Maybe one nop encoding is faster, maybe the caches get hit differently, etc. As an assembler author, what reasons might there be to prefer one encoding over another?
When designing an architecture, I can imagine putting a "fast nop" in the instruction decoder that essentially just short-circuits around it. Is this something that's done in practice?
> Since all instructions are 32bit wide you can't load a 32bit immediate value in a single instruction, instead the assembler's "LI" mnemonic generates a pair of instructions (LUI/ORI) for large immediate values (ARM prefers PC-relative loads).
IMO it really is odd that some instructions given to the assembler are not direct translations. In theory it can cause some issues if the developer makes irresponsible assumptions based on the program text. The mnemonics make life easier for folks who are familiar with the assembler, but harder for someone poking around doing some detective work.
> I don't quite understand the objection here, this is fairly elegant in my opinion, you have a single opcode used to encode several operations by using "special cases".
I don't think this characterizes an instruction set as CISC at all. In any case, having those "special cases" means that if an operation can be subsumed by another operation, the former is just an alias of the latter on the instruction encoding level, thereby reducing the actual number of instructions. Think of it as syntactic sugar.
IMO RISC is more a philosophy than a technical term; your definition is something that was created post facto to try and come up with a definition. It's more like "all currently accepted RISC ISAs have the following characteristics", but I disagree that they're an appropriate definition. For instance:
>Have 1 size of instruction in an instruction stream and that size is 4 bytes
So that means that Thumb isn't RISC because it has 16-bit instructions and a few double-width opcodes? Even though its instruction set is effectively even more restricted than ARM's? That doesn't make sense to me.
>Do NOT support arbitrary alignment of data for loads/stores
MIPS has SWL/SWR and LWL/LWR; does that count? I suppose you could say that RISC has no support for arbitrary alignment in regular load and store instructions, but again, is that really enough to disqualify an ISA? What if I made a tweaked MIPS CPU with an identical instruction set, the only difference being that unaligned LW/SW would work as intended instead of raising an exception: would it stop being RISC?
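For context, the unaligned-load idiom those instructions enable, as a sketch (big-endian MIPS assumed; the offsets swap on little-endian):

lwl $t0, 0($a0)    # load the left/high bytes of the unaligned word at $a0
lwr $t0, 3($a0)    # load the right/low bytes; together, one unaligned 32-bit load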
>Have >= 5 bits per integer register specifier, Have >= 4 bits per FP register specifier
That actually disqualifies ARM32 as far as I can tell, since it only has 16 GPRs encoded using 4 bits. I fail to see how this small encoding detail is relevant to RISC anyway. Maybe it just means that you need at least 32 GPRs?
Wikipedia has a much broader (and IMO more reasonable) definition of RISC:
>Various suggestions have been made regarding a precise definition of RISC, but the general concept is that such a computer has a small set of simple and general instructions, rather than a large set of complex and specialized instructions.
By this definition an instruction such as "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero" is very much un-risc-y.
Yes, but by that classic definition ARM is certainly one of the CISCiest of the RISC architectures with its condition codes, barrel shifter, auto-increment loads and stores, and load/store multiple. And x86 is one of the RISCiest of the CISCs with its lack of indirect memory access. All of which might represent a sort of semantic happy medium.
CISC has different encodings for MOV and OR (and often subtle differences, e.g. OR updates the flags and MOV doesn't). On RISC processors, which have MOV as a special case of OR, the assembler accepts MOV at the source code level but the processor does not have to implement a separate MOV instruction at the binary level. Therefore the instruction set is indeed reduced.
ARM's peculiarity is that the fundamental ALU operation is "R1 op (R2 shiftop R3)" or "R1 op (R2 shiftop #nn)". But it's still not a CISC design; it's just that the barrel shifter is at a different place in the ALU, and that shows in the instruction encoding. Apart from this quirk the ideas from the previous paragraph apply just as well to ARM.
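A quick sketch of that operand shape in ARM32 syntax (both flavours; the register choices are arbitrary):

add r0, r1, r2, lsl #3    @ r0 = r1 + (r2 << 3), shift amount as an immediate
add r0, r1, r2, lsl r3    @ r0 = r1 + (r2 << r3), shift amount from a register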
The basic idea of RISC was to reduce the complexity of individual instructions. Rather than having one instruction, you had several, but each of those would be easier to decode and execute. And, as history has shown, it was also quite a bit faster to execute such simplified instructions.
Hence, assembling a single mnemonic into two or more instructions is fairly common on RISC architectures. But the instruction set at the machine level stays simple and direct, even if the assembler expands things.
As a fun fact, modern x86 will often just microcode complex instructions inside the CPU into several simple micro-ops and then execute those. In effect, the hardware is doing the same work as the assembler. It is necessary for backwards compatibility, but it is hardly elegant.
I don't understand how you reach this conclusion; if anything it's the opposite of CISC: instead of having highly specialized instructions you have a single generic instruction that can be used in all sorts of contexts. Note that this is very different from x86 "overloaded" encoding, where a single mnemonic can have a million different encodings doing wildly different things depending on the operands, prefixes, lunar phase and/or operating mode[1]. In this case there's no additional complexity to the "OR" implementation besides the adjunction of an R0 register (which is not specific to this opcode). You can still describe the encoding and functionality in a single line[2].
It sounds like the exact opposite: you're reducing the actual instruction set, just providing convenient shorthands for common cases at the assembly level.
But why does anyone care where the boundary is drawn, especially based on entirely subjective opinions about whether one thing is more complex than another?
Would I be right in thinking 'Assembly is Too High Level' is a kind of title or catchphrase for a series of blog posts, and that the actual article is just an analysis of how those instructions work?
The main (actual) theme of the series is non-1-to-1 mappings, for any reason (useful or not). It's a thing that fascinates me. The phrase "$x is too high level" is mostly satirical, a phrase I've used for more than just assembly language ($x = [asm, regex, scapy, inflate/deflate, zip, elf, burritos, etc...]).
Taking a look at the blog's main page, it looks like your intuition is pretty much right. There are a lot of posts with that catchphrase in the title, but it looks more like a mini-genre than a serial collection.