I always thought this was quite elegant; in the original ARM ISA most ALU operations and register moves (and even some load/store indexing) could pass through the barrel shifter at no extra cost, so a 'ROR' was just a move from a register to itself with a pass through the shifter. This made up (somewhat) for the low code density implied by fixed-length instructions and uniform load/store architecture. AArch64 removes this capability from most of the arithmetic operations I believe.
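A minimal sketch in classic (pre-UAL) ARM syntax, where the rotate really is spelled as a MOV with a shifted operand:

mov r0, r0, ror #4    @ 'ROR r0, #4': a register-to-register MOV through the shifter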
The design of the original ARM ISA is very interesting historically, mainly informed by many man-years of hand-optimising 6502 assembly. As such it was quite idiosyncratic: nice to write by hand, and rather awkward for compilers...
> could pass through the barrel shifter at no extra cost
There was always a cost. You had to spend those bits in the instruction encoding that could have been used for more registers, more instructions, more operands, or (the x86 choice) the ability to fit more (smaller) instructions in the instruction cache. The existence of this extra thing in the instruction data path meant you needed an extra cycle or two in the execute pipeline for every instruction, not just the ones with shifts. You also had to implement the single-cycle barrel shifter in hardware (this is something that smaller microcontrollers used to skip).
In fact that weird ARM shift field is broadly held to have been a mistake. Note that A64 skips it.
Well, yeah, run-time cost is what I meant. Originally this was nil (unless you had a register-specified shift) on the 'classic' 3-stage pipeline of the original ARMs (ARM1/2/3/6/7). As the pipeline got deeper this became more problematic, as you say, and higher frequencies make the silicon implementation awkward too. But even in A64, the barrel shifter is available on the second register operand of the non-arithmetic data-processing operations.
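A hedged A64 sketch of what survives (GNU syntax; worth double-checking the alias against the ARMv8 manual):

orr x0, xzr, x1, ror #4    // x0 = x1 rotated right by 4 (MOV itself is just ORR with xzr)
ror x0, x1, #4             // an alias: assembles to extr x0, x1, x1, #4 -- another non-1-to-1 mapping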
To let everyone in on a little !secret, the 'objections' to these encodings are satire, something that was laid on a bit thick at the end of this post. There have been no real objections to ARM so far. x86 is a different story, the AAD/AAM instructions being the biggest example: there, you can do something at the machine level that the assembly level abstracts away (conversions in bases other than 10). Regardless of any kind of usefulness, any non-1-to-1 mappings between abstractions highly interest me.
The BCD instructions aren't "too high level"; these are[1] real hardware operations that had real utility for real problems. In the late '70s, the modular math required to format decimal numbers for display could be a big chunk of your ROM budget, and these instructions eliminated the problem.
This is like saying SSE is "too high level" because you could just do all the operations independently with scalar math.
[1] Were, anyway. They're surely microcoded on modern processors.
I have no objection to the utility of something like AAD. What I'm saying is that this very same instruction can do more at the machine level. AAD assembles to D5 0A; D5 is the part that refers to AAD, while 0A is hardcoded for base 10. One could hand-encode something like D5 08 (to get base-8 conversions). You can use just about any base. Even the Intel manual states you can do this; you just have to do it at the machine-code level, you can't do it with the assembly-level AAD mnemonic (it's too high level, or abstracted). This is all I meant by my comment.
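A minimal NASM-style sketch of the trick (16-bit code assumed, since AAD is invalid in 64-bit mode; the db line is the whole point):

mov ax, 0x0107    ; AH=1, AL=7: the unpacked digits of octal 17 (15 decimal)
db 0xD5, 0x08     ; hand-encoded AAD with imm8=8: AL = AH*8 + AL = 15, AH = 0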
XlogicX wasn’t complaining about the existence of the AAD and AAM instructions. He was complaining that they had useful variants that couldn’t be expressed in assembly.
I see now. I probably didn't interpret it as satire because the post was pretty incoherent (and it's very hard for me to pay close attention to things that I perceive as incoherent), and I had no interest in reading the rest of your blog to see if the writing was in similar style.
edit: Also, I'm probably unfamiliar with the type of blog post you're satirizing because I don't read many programming blogs these days. I'd rather just read the documentation (and source code), and any other questions I have are best answered by just asking the computer.
Good satire, like all worthwhile endeavors, is hard. Don't let people like me who don't immediately get it stop you from practicing, though :)
You say this: "Looking at instruction encodings, ‘ROR r0, #0’ should be the same as ‘RRX r0, r0’." But don't you mean ROR r0, #1, since RRX is a shift of 1?
lol "isn't permitted", "unsupported", "undefined", are all trigger words for me; when I see them, it's the only thing I can think of doing. I get that more than 90% of the time something fucky is going to happen, doesn't stop me from wanting to know exactly what will happen. And sometimes, rarely, something really cool happens. In this case, doing ROR r0, #0 is just useless (as documented), and my 'objections' to it are satire. With that context, my 'rant' at the end of the blog should be more clear. And here I thought my satire was obvious. That said, can't say I don't love the serious technical discussion in these comments ;)
I 100% prefer text/book learning to lecture/video. But I guess not everyone is like us, so I tried experimenting with a dual format. Regarding the font, I never thought to change the ugly gray small font. I changed the CSS to make all 'paragraph' text 18pt (across the blog), for that specific post, it is now black. But be warned, if someone complains that it's too big, I will make it 32pt Comic Sans.
can't really comprehend why anyone would prefer a video to a well laid out and illustrated blog post
It's alright to have your own preferences, but so do other people. Some people (like me) prefer audiobooks instead of paper books, some people are militantly the other way around. It is what it is :-) Kudos, certainly, to people who share their work in multiple media.
I have ADD and video has its place, just not in this case.
A technical article such as this one is ideally disseminated using a regular web page. Pictures, text, code all in easy view and scrolled conveniently at will.
Compare that to a video, which in essence is an auto-scrolling page that can only be paused or slightly slowed down. You wind up pausing, rewinding, and skipping parts. Annoying.
I don't quite understand the objection here; this is fairly elegant in my opinion: you have a single opcode used to encode several operations by way of "special cases". In my experience it's rather common in RISC ISAs.
If the author doesn't like this they shouldn't look into MIPS, because it goes well beyond that. You see, MIPS has a special "R0" register that's always 0 (AArch64 has one as well, by the way) so you can always use it as a placeholder in other instructions.
As such, there's no real MOVE instruction; it's just an assembler mnemonic that assembles down to `OR $target, $src, $R0`. NOP? It's by convention `SLL $R0, $R0, 0` (which has the nice property of being the instruction encoded as "0x00000000"). You want to negate a number? `SUB $target, $R0, $src`.
Since all instructions are 32 bits wide you can't load a 32-bit immediate value in a single instruction; instead, the assembler's "LI" mnemonic generates a pair of instructions (LUI/ORI) for large immediate values (ARM prefers PC-relative loads).
You have a whole bunch of mnemonics in MIPS that are just aliases around other instructions. I always thought it was pretty clever.
In summary you can have this assembler listing:
1:
sll $0, $0, 0
or $t0, $t1, $0
li $t0, 0xabcdef
sub $t0, $0, $t1
j 1b
That will disassemble to:
nop
move t0,t1
lui t0,0xab
ori t0,t0,0xcdef
j 0x0
neg t0,t1
The only operation here that I would qualify as "high level" is the reordering of the "neg" instruction into the delay slot (note that j is no longer the last instruction). Everything else is very straightforward substitution, and if the assembler didn't support these mnemonics we could implement them with very trivial macros.
Note that even x86 assemblers do that to some extent; for instance "nop" assembles down to an instruction with no side effect (typically the encoding of `xchg eax, eax`). Furthermore there are a bunch of mnemonics for the same encoding, for instance JAE (jump if above or equal), JNB (jump if not below) and JNC (jump if not carry). Overall instruction encoding is also massively more complicated in x86 (and even more so for amd64), so the assembler needs to handle many more corner cases than the simple substitutions of ARM and MIPS. As a brain teaser, consider the following similar-looking amd64 instructions, which store the 32-bit register %eax to the address held in another register (the only difference is that the first one dereferences the pointer in %rax, the second in %r12):
mov %eax, (%rax) ; assembles to 89 00
mov %eax, (%r12) ; assembles to 41 89 04 24
I can't even be bothered to walk you through this, but basically it has to do with the fact that %r12 happens to be encoded as %rsp + 8 (because registers r8 to r15 are effectively a hack, since x86 only supported 8 GPRs) and %rsp has special semantics in this addressing mode which mandate a different, longer encoding, otherwise you'd end up with an ambiguous instruction.

Yeah, I think in retrospect we can give ARM a pass for their ROR shenanigans.
I love x86 complications (ARM is way more elegant in comparison), and I WILL be bothered to walk through mov %eax, (%r12), or 41 89 04 24
----
41 - the amd64 REX prefix with the B bit set, which 'unlocks' the extended registers; here it's what makes the base that would otherwise decode as %rsp decode as %r12
----
89 - still the same MOV instruction as the first one in mov %eax, (%rax)
----
04 - The ModR/M byte that specifies the %eax part, and [--][--] as the 'Effective Address', which is another way of saying that we need a SIB byte because (%r12) doesn't have a simple encoding with the ModR/M byte alone (the same would be needed even for a plain (%rsp)).
----
24 - A 'Scaled Index' of none and as simias stated, RSP (but R12 in the context of the 41 prefix for this instruction).
----
All of this is easier to visualize when using the ModR/M and SIB tables in Volume II of the Intel manual. In my copy of the manual it's pretty early on in Chapter 2 (2.1.5 Addressing-Mode Encoding of ModR/M and SIB Bytes).
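For reference, here's my rough bit-level reading of those four bytes against the tables (worth double-checking in the manual):

41    ; REX prefix: 0100WRXB with B=1, extending the SIB base field
89    ; MOV r/m32, r32
04    ; ModR/M = 00 000 100: mod=00, reg=000 (%eax), rm=100 (SIB byte follows)
24    ; SIB = 00 100 100: scale=00, index=100 (none), base=100 (%rsp; %r12 with REX.B)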
Flattening the ROR and RRX special cases seems pretty straightforward to me as well. That's why I thought the punch line was at the end of the post, though:
ror r0, #0
gets assembled to
mov r0, r0.
I see how both are nops and assume they affect state congruently, but why is one nop encoding preferred over another? Does the architecture do something desirable when encountering the MOV incarnation vs. any other?
The point is that the binary encoding is the same for ROR and MOV, and only the disassembly of the binary is special cased: if it has a shift, it's disassembled into the ROR text with the shift otherwise it is displayed as a MOV. RRX is a special case to ROR as ROR is to MOV.
The ARM Instruction Set PDF I'm looking at only lists MOV as a real instruction -- one with a distinct opcode -- out of the above three.
It's MOV and LSL that have the same binary encoding; everything is identical with the exception of the imm5 field vs. the hardcoded 0's. MOV and LSL share the same op2 field of '00'. ROR and RRX share the op2 field of '11'. ROR and MOV have a similar binary encoding, but the op2 field is distinctly different in this case.
As a note, I'm basing this off of the v7-A and v7-R manual.
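To make the shared layout concrete, a rough sketch of the A32 data-processing fields as I read that manual (worth double-checking against the real encoding diagrams):

cond 000 1101 S 0000 Rd imm5 op2 0 Rm
  op2=00: LSL, and imm5=0 decodes as MOV
  op2=11: ROR, and imm5=0 decodes as RRX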
Well, there's the problem: the v7-A and v7-R manual.
That manual uses unified assembly syntax, meaning that old ARM and Thumb are described together. Because of irregularities in Thumb, the old ARM instructions end up being described in a way that is needlessly verbose. You lose the insight into how the opcodes are actually decoded.
Look at an older manual. ARMv5 will do nicely. There, you can see that the MOV instruction is mostly described by 2 bits (one condition code is stolen). In an even older ARM, such as ARMv4 I think, it really is just 2 bits.
Unless I'm missing something, the encodings are different at bits 6 and 7. Looking at xlogicx's post, MOV has them set to 0 while ROR has them at 1.
On a higher level, this post got me to realize that there could be ops (in this case nops) that have equivalent effects but are encoded as different instructions. Even though registers might state-change equivalently, I can see that the internal processor state might mutate differently. Maybe one nop encoding is faster, maybe the caches get hit differently, etc. As an assembler author, what reasons might there be to prefer one encoding over another?
When designing an architecture, I can imagine putting a "fast nop" in the instruction decoder that essentially just short-circuits around it. Is this something that's done in practice?
> Since all instructions are 32bit wide you can't load a 32bit immediate value in a single instruction, instead the assembler's "LI" mnemonic generates a pair of instructions (LUI/ORI) for large immediate values (ARM prefers PC-relative loads).
IMO it really is odd that some instructions given to the assembler are not direct translations. In theory it can cause some issues if the developer makes irresponsible assumptions based on the program text. The mnemonics make life easier for folks who are familiar with the assembler, but harder for someone poking around doing some detective work.
> I don't quite understand the objection here, this is fairly elegant in my opinion, you have a single opcode used to encode several operations by using "special cases".
I don't think this characterizes an instruction set as CISC at all. In any case, having those "special cases" means that if an operation can be subsumed by another operation, the former is just an alias of the latter on the instruction encoding level, thereby reducing the actual number of instructions. Think of it as syntactic sugar.
IMO RISC is more a philosophy than a technical term; your definition is something that was created post facto to try and come up with a definition. It's more like "all currently accepted RISC ISAs have the following characteristics", but I disagree that they're an appropriate definition. For instance:
>Have 1 size of instruction in an instruction stream and that size is 4 bytes
So that means that Thumb isn't RISC because it has 16-bit instructions and a few double-width opcodes? Even though its instruction set is effectively even more restricted than ARM's? That doesn't make sense to me.
>Do NOT support arbitrary alignment of data for loads/stores
MIPS has SWL/SWR and LWL/LWR; does that count? I suppose you could say that RISC has no support for arbitrary alignment in regular load and store instructions, but again, is that really enough to disqualify an ISA? What if I made a tweaked MIPS CPU with an identical instruction set, the only difference being that unaligned LW/SW would work as intended instead of raising an exception: would it stop being RISC?
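For context, the unaligned-load idiom those instructions enable, as a sketch (big-endian MIPS assumed; the offsets swap on little-endian):

lwl $t0, 0($a0)    # load the left/high bytes of the unaligned word at $a0
lwr $t0, 3($a0)    # load the right/low bytes; together, one unaligned 32-bit load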
>Have >= 5 bits per integer register specifier, Have >= 4 bits per FP register specifier
That actually disqualifies ARM32 as far as I can tell, since it only has 16 GPRs encoded using 4 bits. I fail to see how this small encoding detail is relevant to RISC anyway. Maybe it just means that you need at least 32 GPRs?
Wikipedia has a much broader (and IMO more reasonable) definition of RISC:
>Various suggestions have been made regarding a precise definition of RISC, but the general concept is that such a computer has a small set of simple and general instructions, rather than a large set of complex and specialized instructions.
By this definition an instruction such as "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero" is very much un-risc-y.
Yes, but by that classic definition ARM is certainly one of the CISCiest of the RISC architectures with its condition codes, barrel shifter, auto-increment loads and stores, and load/store multiple. And x86 is one of the RISCiest of the CISCs with its lack of indirect memory access. All of which might represent a sort of semantic happy medium.
CISC has different encodings for MOV and OR (and often subtle differences, e.g. OR updates the flags and MOV doesn't). On RISC processors, which have MOV as a special case of OR, the assembler accepts MOV at the source code level but the processor does not have to implement a separate MOV instruction at the binary level. Therefore the instruction set is indeed reduced.
ARM's peculiarity is that the fundamental ALU operation is "R1 op (R2 shiftop R3)" or "R1 op (R2 shiftop #nn)". But it's still not a CISC design; it's just that the barrel shifter is at a different place in the ALU, and that shows in the instruction encoding. Apart from this quirk the ideas from the previous paragraph apply just as well to ARM.
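A quick sketch of that operand shape in ARM32 syntax (both flavours; the register choices are arbitrary):

add r0, r1, r2, lsl #3    @ r0 = r1 + (r2 << 3), shift amount as an immediate
add r0, r1, r2, lsl r3    @ r0 = r1 + (r2 << r3), shift amount from a register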
The basic idea of RISC was to reduce the complexity of individual instructions. Rather than having one instruction, you had several, but each of those would be easier to decode and execute. And, as history has shown, it was also quite a bit faster to execute such simplified instructions.
Hence, assembling a single mnemonic into two or more instructions is fairly common on RISC architectures. But the instruction set at the machine level stays simple and direct, even if the assembler expands things.
As a fun fact, modern x86 will often just microcode complex instructions inside the CPU into several simple micro-ops and then execute those. In effect, the hardware is doing the same work as the assembler. It is necessary for backwards compatibility, but it is hardly elegant.
I don't understand how you reach this conclusion; if anything it's the opposite of CISC: instead of having highly specialized instructions you have a single generic instruction that can be used in all sorts of contexts. Note that this is very different from x86 "overloaded" encoding, where a single mnemonic can have a million different encodings doing wildly different things depending on the operands, prefixes, lunar phase and/or operating mode[1]. In this case there's no additional complexity to the "OR" implementation besides the adjunction of an R0 register (which is not specific to this opcode). You can still describe the encoding and functionality in a single line[2].
It sounds like the exact opposite: you're reducing the actual instruction set, just providing convenient shorthands for common cases at the assembly level.
But why does anyone care where the boundary is drawn, especially based on entirely subjective opinions about whether one thing is more complex than another?
Would I be right in thinking 'Assembly is Too High Level' is a kind of title or catchphrase for a series of blog posts, and that the actual article is just an analysis of how those instructions work?
The main (actual) theme of the series is non-1-to-1 mappings, for any reason (useful or not). It's a thing that fascinates me. The phrase "$x is too high level" is mostly satirical, a phrase I've used for more than just assembly language ($x = [asm, regex, scapy, inflate/deflate, zip, elf, burritos, etc...]).
Taking a look at the blog's main page, it looks like your intuition is pretty much right. There are a lot of posts with that catchphrase in the title, but it looks more like a mini-genre than a serial collection.