Loading an x64 register, how hard could it be?

tbirdz · on Sept 20, 2015

Is this really that crazy? A 64-bit immediate value takes 8 bytes, so 8 of the 10 bytes in that instruction are the value to load into the register. Similarly, a 32-bit immediate value takes 4 bytes, so 4 bytes of each of the other instructions are the 32-bit immediate. Taking this into account, the non-immediate overhead is 1 byte for movl, 3 bytes for movq (REX prefix, opcode, and ModRM byte), and 2 bytes for movabsq.

I don't really think this is as crazy as the article is implying.

0xcde4c3db · on Sept 20, 2015

It's not crazy at all. This sort of shuffling is pretty standard for RISC ISAs, where it's typical to have fixed-length instructions that are no larger than the registers.

For example, MIPS has no actual "load immediate" instruction. Instead, say you want to load the value 0x12345678 into register t0. You can go ahead and tell your MIPS assembler:

    li $t0, 0x12345678

but it will actually emit something like:

    lui $t0, 0x1234
    ori $t0, $t0, 0x5678

"lui" here is the "load upper immediate" instruction, which loads the target register with the 16-bit immediate, left-shifted by 16 bits. "ori $t0, $t0, 0x5678" then performs a bitwise OR with its immediate to get the lower 16 bits into the register. This can be done in a single instruction if the immediate is 16 bits or smaller by using the dedicated zero register (a read-only register that always contains 0x00000000, which turns out to be extremely handy for minimizing an instruction set):

    ori $t0, $zero, 0xabcd
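
To make the split concrete, here is a rough Python model of the expansion described above. It is only a sketch: `expand_li` and `run` are hypothetical helper names, and a real assembler may pick different sequences (for example `addiu` for small negative values).

```python
def lui(imm16):
    """Model of lui: immediate goes in the upper 16 bits, lower 16 bits are zero."""
    return (imm16 & 0xFFFF) << 16


def ori(reg_value, imm16):
    """Model of ori: bitwise OR with a zero-extended 16-bit immediate."""
    return (reg_value | (imm16 & 0xFFFF)) & 0xFFFFFFFF


def expand_li(value):
    """Return the (mnemonic, immediate) steps a hypothetical assembler could emit for li."""
    value &= 0xFFFFFFFF
    upper, lower = value >> 16, value & 0xFFFF
    if upper == 0:
        # Fits in 16 bits: a single ori against $zero is enough.
        return [("ori", lower)]
    steps = [("lui", upper)]
    if lower:
        steps.append(("ori", lower))
    return steps


def run(steps):
    """Execute the steps on one register that starts out as $zero (0)."""
    reg = 0
    for op, imm in steps:
        reg = lui(imm) if op == "lui" else ori(reg, imm)
    return reg


if __name__ == "__main__":
    for constant in (0x12345678, 0xABCD, 0x12340000):
        steps = expand_li(constant)
        print(hex(constant), steps, hex(run(steps)))
```

Under this model, 0x12345678 needs both instructions, while 0xabcd and 0x12340000 each need only one.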

JoshTriplett · on Sept 20, 2015

SPARC does something similar, with a special "sethi" instruction to set the high 22 bits of a register, using more space for an argument than any other instruction does.

Some architectures manage to do something more clever than that, as well. See http://alisdair.mcdiarmid.org/arm-immediate-value-encoding/ for instance.
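
As a rough illustration of the ARM trick, here is a Python sketch of the classic 32-bit ARM data-processing immediate: an 8-bit value rotated right by an even amount. The function names are made up for this example, and it ignores Thumb-2 and AArch64, which use different schemes.

```python
def rotate_left32(value, amount):
    """Rotate a 32-bit value left by amount bits."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF


def encode_arm_immediate(value):
    """Return (rot, imm8) with imm8 rotated right by 2*rot equal to value, or None."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        # Undo a right rotation by 2*rot; if 8 bits remain, the value is encodable.
        imm8 = rotate_left32(value, 2 * rot)
        if imm8 < 0x100:
            return rot, imm8
    return None


if __name__ == "__main__":
    for constant in (0xFF, 0xFF000000, 0x3FC, 0x101, 0x12345678):
        print(hex(constant), encode_arm_immediate(constant))
```

Values such as 0xFF000000 fit in one instruction, while 0x101 or 0x12345678 do not and have to be built in several steps or loaded from memory.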

pm215 · on Sept 20, 2015

The SPARC call instruction beats that with a 30 bit immediate, sacrificing a huge chunk of the encoding space for being able to reach any target in a 32 bit address space with a single instruction.
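
A quick Python sketch of why 30 bits is enough, assuming the 32-bit SPARC layout where the top two bits of the word are the opcode (01) and the rest is a word displacement relative to the PC. `encode_call` and `decode_call` are illustrative names, not part of any real tool.

```python
MASK32 = 0xFFFFFFFF


def encode_call(pc, target):
    """Build the 32-bit call word: opcode bits 01 followed by a 30-bit word displacement."""
    disp = (target - pc) & MASK32
    if disp & 3:
        raise ValueError("instructions are 4-byte aligned, so the displacement must be too")
    return 0x40000000 | (disp >> 2)


def decode_call(pc, word):
    """Recover the target address from a call word at address pc."""
    disp30 = word & 0x3FFFFFFF
    return (pc + (disp30 << 2)) & MASK32


if __name__ == "__main__":
    word = encode_call(0x1000, 0xFFFF0000)
    print(hex(word), hex(decode_call(0x1000, word)))
```

Because instructions are word-aligned, the low two bits of the displacement are always zero, so 30 stored bits cover the whole 32-bit address space, forward or backward with wraparound.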

userbinator · on Sept 20, 2015

I think the author is saying what's crazy is the lack of 64-bit immediates in a machine that contains 64-bit registers. There's an instruction to add a 32-bit immediate (and I believe 8 and 16 as well), but not a 64-bit one.

devit · on Sept 20, 2015

It's rare for programs to contain integer constants that don't fit in 32 bits, and they aren't needed for 64-bit code/global memory references due to the introduction of addressing relative to the instruction pointer.

Also, there is a 15-byte instruction length limit that would have to be raised if 64-bit immediates were allowed on all instructions.

hrydgard · on Sept 20, 2015

No processor out there has 64-bit arithmetic immediates - it's not useful enough in practice that you'd want it, and most other architectures have fixed-length 4-byte instructions anyway so it's not practical.

pcwalton · on Sept 20, 2015

It's not crazy: it's just standard operating procedure for RISC-like architectures. In fact, anything that brings the x86 architecture closer to RISC is nice in my book.

TheCondor · on Sept 20, 2015

Alpha had all sorts of exotic load-immediate and shift instructions to help with this. With 64-bit registers and 32-bit instructions, I forget the exact number, but you only had 20ish bits of immediate value.

It seemed like slightly more of a chore then; with x86-64 you can use memory operands with a lot of instructions.

gsg · on Sept 20, 2015

Populating x86-64 floating point registers is also an amusing subject.

The obvious instruction for loading a (64-bit) float into an xmm register is movsd. With a memory source operand, the higher part of the register is zeroed, which is what you want. No problem.

Now the fun part: if the source is not memory but another xmm register, the higher part of the register is not zeroed. This induces a false dependency on the previous value of the destination register that can cause performance issues. To avoid this problem, such register-register copies should be done with a packed move instruction. (Or vmovsd, but that was added much later.)

The obvious packed move instruction for 64-bit floats is movapd, but we can do better than that by using movaps - it is still a float domain instruction but is a byte smaller.

So the optimal way to move a single double from one register to another is to use a vector move of the wrong type.
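
Here is a small Python model of the behavior described above, treating an xmm register as two 64-bit lanes. It is only a sketch of the architectural semantics (the `Xmm` type and function names are made up), and it says nothing about how any particular CPU handles the dependency internally. The byte sequences are the register-to-register encodings with xmm0 as destination and xmm1 as source.

```python
from collections import namedtuple

# A 128-bit xmm register modeled as two 64-bit lanes (illustrative only).
Xmm = namedtuple("Xmm", ["low", "high"])


def movsd_from_memory(dst, mem_value):
    """movsd xmm, m64: low lane is loaded, high lane is zeroed; old dst is not read."""
    return Xmm(mem_value, 0)


def movsd_from_register(dst, src):
    """movsd xmm, xmm: low lane is copied, high lane of dst is kept, so old dst is read."""
    return Xmm(src.low, dst.high)


def movaps(dst, src):
    """movaps xmm, xmm: all 128 bits are copied; old dst is not read."""
    return Xmm(src.low, src.high)


ENCODINGS = {
    "movsd xmm0, xmm1": bytes([0xF2, 0x0F, 0x10, 0xC1]),
    "movapd xmm0, xmm1": bytes([0x66, 0x0F, 0x28, 0xC1]),
    "movaps xmm0, xmm1": bytes([0x0F, 0x28, 0xC1]),
}

if __name__ == "__main__":
    old_dst, src = Xmm(1, 99), Xmm(7, 0)
    print(movsd_from_register(old_dst, src))  # high lane still comes from old_dst
    print(movaps(old_dst, src))               # result depends only on src
    for text, code in ENCODINGS.items():
        print(f"{text}: {len(code)} bytes")
```

The register form of movsd is the only one whose result depends on the old destination, and movaps is the shortest of the three encodings.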

waterhouse · on Sept 20, 2015

> "it is impossible to add 2^33 to rax using one instruction only."

This can in fact be done, with a memory operand. I'm not sure about the performance compared to a 64-bit load immediate followed by an add, but this will do it (NASM syntax):

          add rax, [rel the_constant]
          ...
  
  the_constant:
          dq 8589934592
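
For anyone wondering why the constant has to live in memory (or another register) at all, here is a short Python check, assuming the usual x86-64 rule that add takes at most a 32-bit immediate that is sign-extended to 64 bits. The helper names are just for illustration.

```python
import struct


def fits_sign_extended_imm32(value):
    """True if value can be written as a 32-bit immediate that is sign-extended to 64 bits."""
    return -(1 << 31) <= value < (1 << 31)


def constant_as_dq(value):
    """Bytes an assembler would emit for a dq directive: 8 bytes, little-endian."""
    return struct.pack("<Q", value & 0xFFFFFFFFFFFFFFFF)


if __name__ == "__main__":
    print(fits_sign_extended_imm32(2**31 - 1))  # largest positive value that fits
    print(fits_sign_extended_imm32(2**33))      # too large for an add immediate
    print(constant_as_dq(2**33).hex())
```

2^33 falls outside the sign-extended 32-bit range, which is why it ends up as an 8-byte data item that the add reads through a RIP-relative address.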

raverbashing · on Sept 20, 2015

This is the right answer.

This is what ARM does a lot of the time (its 'immediate' value ops only let you pick an 8-bit number and rotate it a bit).

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

justin_ · on Sept 20, 2015

For those wondering what he meant by the "data dependency" issue that zeroing out the upper 32 bits avoids, the first answer to this SO question does a decent job of explaining it: https://stackoverflow.com/questions/11177137/why-do-most-x64...

WalterBright · on Sept 20, 2015

It's actually 7 bytes, not 6, to load a sign extended value into a register.

        48 C7 C0 FF FF FF FF    mov     RAX,0FFFFFFFFh
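
For comparison, here is a Python sketch that assembles the three load-immediate forms by hand, assuming the destination is always eax/rax so the register bits are fixed. The function names are made up; the opcodes are the standard B8+r and REX.W C7 /0 forms.

```python
import struct


def mov_eax_imm32(imm):
    """mov eax, imm32 (B8 id): 5 bytes; writing eax zeroes the upper half of rax."""
    return bytes([0xB8]) + struct.pack("<I", imm & 0xFFFFFFFF)


def mov_rax_simm32(imm):
    """mov rax, imm32 (REX.W C7 /0 id): 7 bytes; the immediate is sign-extended."""
    if not -(1 << 31) <= imm < (1 << 31):
        raise ValueError("immediate does not fit in a sign-extended 32 bits")
    return bytes([0x48, 0xC7, 0xC0]) + struct.pack("<i", imm)


def movabs_rax_imm64(imm):
    """mov rax, imm64 (REX.W B8 io): 10 bytes; a full 64-bit immediate."""
    return bytes([0x48, 0xB8]) + struct.pack("<Q", imm & 0xFFFFFFFFFFFFFFFF)


if __name__ == "__main__":
    for code in (mov_eax_imm32(1), mov_rax_simm32(-1), movabs_rax_imm64(2**33)):
        print(len(code), code.hex())
```

With -1 as the argument, the middle form produces the same seven bytes as the listing above; the stored FF FF FF FF is sign-extended, so rax ends up with all 64 bits set.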

mpu · on Sept 20, 2015

Thank you, I fixed that. In my head I knew a REX prefix was necessary and thought it was sufficient, but i386 has shortcuts to load an immediate into a register that are not available in the 64-bit version!

Thanks for mentioning that.

transfire · on Sept 20, 2015

Makes one miss the 6502.

tacos · on Sept 20, 2015

A better example is the 6809. LDA #$FF takes two bytes and two cycles. LDX #$FFFF takes three bytes and three cycles. "Isn't that crazy?" Sigh.