Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: What are some examples of beautiful x86 assembly code?
278 points by xelxebar on April 27, 2018 | hide | past | favorite | 87 comments
I've been bumping into a lot of x86 assembly lately and, unrelated, recently read about the stellar reputation of SQLite's code base.

This got me thinking: what are some examples of high-quality and/or beautiful x86 assembly? In fact, what about for other processor families as well?




I bought a nice little booklet of small x86 gems a few years back[1][2] and this one is my favorite:

  .loop:
      xadd     rax,rdx
      loop     .loop
Fibonacci with two instructions.

[1] https://www.amazon.com/dp/1502958082

[2] https://www.xorpd.net/pages/xchg_rax/snip_00.html


I’d like to add to this comment to say that the author has a fantastic assembly course available on udemy - very thorough and practical (teaching the basics in a win32 environment)

https://www.udemy.com/x86-asm-foundations/


Is that series endorsed by xorpd or an account just called xorpd? Udemy is known for allowing anyone to steal, upload, and profit from video series.

EDIT: Looks like it is. https://www.xorpd.net/pages/x86_adventures.html


Yes, I made the videos a few years ago. This video series is mostly for beginners as it goes pretty slowly.

The exercises are open source and can be found here: https://github.com/xorpd/asm_prog_ex I think that more experienced developers can take the instruction set and bash through the exercises.


I think it’s created by the man himself - having been through it it’s a very thorough coverage of the basic instruction set without delving too much into the windows specifics (it uses FASM syntax from what I recall)


Having barely any assembly experience, and looking up what xadd means[0] it really is quite a nice little snippet of code.

[0] https://stackoverflow.com/a/30131285


Can someone more versed in x86 explain to me how this works?

It's been a long time since I've done any assembly, and that was MIPS, but I don't see how there's any sort of exit condition or anything. I'm guessing there's more to xadd than `x + y = z`?


I don't know much Assembly, but at what point does Fibonacci end? Shouldn't it just keep going?


No, loop decrements rcx until it reaches zero and then stops


But rcx has not been initialised.


This is correct, from the snippet given - what the other commentators are failing to say is that they're assuming the registers are initialised "offstage" by some other part of the program.


If all functions took (rax, rbx, rcx, ...) and you had loop statement that defrements rcx and jumps unless zero, then how would you write fib-function’s body?


You raise a good question. At one time you would have only one loop running through the system as per my understanding. I am a little lost here.


In a single-core cpu, the operating system's scheduler manages the register state for each process: basically, when switching from one context to another, it dumps the old process' registers to memory, and loads the register state to the new one. From the user's point of view, the register state will appear unaffected by different processes: your loop register will not changed by other processes and threads. There is no parallelism, so only one program and register state is active at once, but there is concurrency (if the OS supports it). On a multi-core processor, each core has its own set of registers, so the scheduler could theoretically run a multiple processes uninterupted on one core per process.


Answer is you don’t have to initialize rcx, since it is already a prerequisite for a looping subroutine. There is no for (i=0; i<n; i++). Argument n is a counter and it is rcx itself. It is ‘discharged’ by a loop. If caller wants to save it, then it does that by his own means. This contradicts all regular languages’ rules, but this is low-level, far below all the safety.

On your second question about different loops. First, you can loop through any register via (dec <reg>; jnz <label>), thus having nested loops. Second, all valuable registers are pushed onto the stack and popped before/after a call or an interrupt, so their contents don’t change. Some registers have to be saved by the caller and some by the callee, depending on a calling convention and an actual usage. The stack is always available (rsp points to its top), so you can offload temporary values and concentrate on current evaluation, loading values on demand later.


I think this is the biggest problem with learning x86 assembly (or ARM or anything else) on modern systems (or more specifically modern operating systems).

It’s sometimes difficult to think about the assembly code in situ when you start to think about the operating system doing a ton of context switching and paging etc. in the background, which can distract your thought process from what’s right in front of you (as well as the operating system’s software interrupts / system calls on top of the basic ISA, which is another abstraction!)

Older systems had the currently running program as the entire context of the system at that point in time - in a similar way to embedded programming, which is imho a much easier realm to learn assembly in once you’ve got a bit of basic electronics under your belt!


The whole point of how interrupt handling works is that it returns back to the same state of the program already in progress when finished. The abstraction is such that the interrupted program doesn't need to care.

Even in those "old" systems of single address spaces and no protection, you're constantly getting timer interrupts, interrupts for I/O, etc., which your application might not have installed its own handler for.


I agree - my main point is that an OS is ‘just a program’ as well

I suspect we’re both making a similar point in a roundabout way - the operating system is both another layer of abstraction on top of the Instruction Set, while also making the programming process for that chipset somewhat easier (providing software interrupts etc. at the expense of bare metal understanding).

My argument is loosely that modern (x64) assembly is not so much targeting hardware as it is programming into a software abstraction (the operating system).


The code above I believe is the point to jump to when the register has been assigned some value for x. Think along the same lines as a function call once the pre-amble is out of the way (or a ‘goto’ by someone who knows when to use it!)


It's a hardware register, so it has always has a value (in 64-bit mode, at least), whether or not the program explicitly sets it.


But it has a value regardless


"xadd r1,r2" sets r2=r1 and r1=r1+r2 at the same time.


I was confused until I noticed the "x" in front of "add".


interesting but i prefer Binet's formula


My interest, save for occasional tweak in ARM boot code, is mostly historical. Grew up on Z-80 assembly but really enjoy PDP-11 code for didactic purposes. You can easily see that the rise of optimising C compilers was helped by divergence from PDP instruction set. Not ever since C idioms map down so nicely.

strlen():

  LEN:   MOV #-1, R0
  1⊙     INC R0
         TSTB (R1)+
         BNE 1⊙
or take strcpy():

  COPY:  MOVB (R1)+, (R0)+
         BNE COPY
versus the C version:

  void strcpy(char *s, char *t) 
  {
    while (*s++ = *t++)
      ;
  }
It translates to the optimal machine code verbatim. It does not require any smarts of the compiler, at the expense of understanding on the side of the programmer. This is made possible by orthogonality of the instruction set, which in less well thought out designs was hacked around with special instructions (like LDIR on Z80). The price for that is you have to make compiler optimize these situations.


That's remarkably close to 68k str-functions.

Page 133 http://www.atarimania.com/documents/Asm_Lang_Prog_68K_Family...

  * STRLEN - RETURNS LENGTH OP NULL TERMINATED STRING IN D0
  * A0 -> STRING
  STRLEN: MOVE.L A0,-(SP) SAVE REG
          CLR.L  D0       INITIALIZE
  STRLENI:TST.B  (A0)+    NULL?
          BEQ    STRLENR  YES, RETURN
          ADDQ.L #1, D0   BUMB COUNT
          BRA    STRLENI  LOOP
  STRLENR:MOVE.L (SP)+,A0 RESTORE REG
          RTS

  We might also want to copy a string:
  
  * STRCPY - COPY A NULL TERMINATED STRING
  * A0 -> SOURCE STRING
  * A1 -> DESTINATION STRING
  STRCPY: MOVEM.L A0-A1,-(SP) SAVE REGS
  STRCPY1:MOVE.B  (A0)+,(A1)+ MOVE A BYTE
          BNE     STRCPY1     GET ANOTHER IF NOT NULL
          MOVEM.L (SP)+/A0-A1 RESTORE REGS
          RTS
  
  Next, we will want to compare two strings:
  
  * STRCMP - COMPARE TWO NULL TERMINATED STRINGS
  * A0 -> STRING 1
  * A1 -> STRING 2
  STRCMP: MOVEM.L A0-A1,-(SP)  SAVE REGS
  STRCMP1:CMPM.B  (A0)+,(A1)+  COMPARE BYTES
          BNE     STRRET       RETURN IF DIFFERENT
          TST.B   -1(A0)       HAVE WE HIT A NULL?
          BNE     STRCMP1      NOW MORE BYTES LEFT
  STRRET: MOVEM.L (SP)+,A0-A1  RESTORE REGS
          RTS
Although I guess I'd implement strlen more like this (not sure if it works, been a long time since I last wrote anything for 68k):

  strlen:   movem.l a0-a1,-(sp)
            move.l  a0, a1   ; copy a0 to a1
  slenloop: tst.b   (a0)+
            bne     slenloop
            addq.l  #1, a1
            sub.l   a0, a1
            move.l  a1, d0
            movem.l (sp)+,a0-a1
            rts
Just two instructions in the inner loop. But not sure whether address registers supported addq etc.


Wow it's close indeed, even stack manipulation looks the same. Would be

  MOV  R0, -(SP)
  ....
  MOV  (SP)+, R0
for the PDP code.


My faster (?, see [0]) strlen seems to compile (using MIT 68k syntax):

  > cat > test.asm
  strlen:   movem.l %a0-%a1,-(%sp)
            movl    %a0, %a1     ;# copy a0 to a1
  slenloop: tst.b   (%a0)+
            bne.b   slenloop
            addq.l  #1, %a1
            sub.l   %a0, %a1
            move.l  %a1, %d0
            movem.l (%sp)+,%a0-%a1
            rts
  ^D

  > m68k-linux-gnu-as test.asm && m68k-linux-gnu-objdump -d a.out
  a.out:     file format elf32-m68k
  
  Disassembly of section .text:
  
  00000000 <strlen>:
     0:	48e7 00c0      	moveml %a0-%a1,%sp@-
     4:	2248           	moveal %a0,%a1
  
  00000006 <slenloop>:
     6:	4a18           	tstb %a0@+
     8:	66fc           	bnes 6 <slenloop>
     a:	5289           	addql #1,%a1
     c:	93c8           	subal %a0,%a1
     e:	2009           	movel %a1,%d0
    10:	4cdf 0300      	moveml %sp@+,%a0-%a1
    14:	4e75           	rts
I wonder whether it actually works... I think same idea should be translatable to PDP-11 as well.

[0]: It runs slower for short strings, though. Not sure where the break even point is.


Previous one didn't work, because I mixed up a0 and a1 pointers when subtracting.

But that gave me a new idea: copy pointer to d0 and complement the difference instead of addq, saving a1 register (+stack operations) and a move-instruction:

  strlen:   move.l  %a0,-(%sp)
            movl    %a0, %d0   ;# d0 = a0;
  slenloop: tst.b   (%a0)+     ;# test *a0, post incr
            bneb    slenloop   ;# loop if non-zero
            sub.l   %a0, %d0   ;# d0 = d0 - a0;
            not.l   %d0        ;# d0 = ~d0;
            move.l  (%sp)+,%a0
            rts                ;# d0 is the return value
Equivalent C-code (http://cpp.sh/2oecd):

  #include <stdio.h>
  #include <stdint.h>
  
  size_t strlen68k(char* a0) {
      int zero_flag;
      uintptr_t d0 = (uintptr_t) a0;
      do {
          // tstb (%a0)+
          if (*a0 == 0) 
              zero_flag = 1;
          else
              zero_flag = 0;
          a0++; // (%a0)+ (post increment)
      } while(!zero_flag); // bneb slenloop
      d0 = d0 - (uintptr_t) a0; // sub.l %a0, %d0
      d0 = ~d0;  // not.l %d0
      return d0;
  }
  
  void strlen_test(char* str) {
      printf("str \"%s\" length is %lu\n", str, strlen68k(str));
  }
  
  int main(void) {
      strlen_test("");
      strlen_test("a");
      strlen_test("abcd");
      return 0;
  }
Again, untested, but I think it probably works. I guess this is also faster for all string lengths >0 as well.

This idea might not work on PDP-11 anymore, depending on how binary arithmetic is implemented.


68xx is another case that got the conditions which set flags _just_ right. IMHO, obv


The source code of AGC (Apollo Guidance Computer) written by Margaret Hamilton and her team. Clean and obviously it worked well. Comments are also interesting to read. https://github.com/chrislgarry/Apollo-11


That's certainly not x86 code, but loved the link anyway, thanks.


What are some particularly interesting parts of this poke around in?


Here's a wiki dedicated to very small x86 programs: http://www.sizecoding.org/wiki/Main_Page

Lots of fun stuff there, like a scrolling Matrix screen saver in 8 bytes (!!) that jumps into the middle of an instruction: http://www.sizecoding.org/wiki/M8trix_8b


That is a great resource. Another great example is a 16 byte paint program http://www.sizecoding.org/wiki/Paint16b


If you're interested in demoscene, this explains how to make x86 programs in 256b or less: http://sizecoding.org/

...

and yes, I consider it beautiful x86 code :)


Jones Forth: http://git.annexia.org/?p=jonesforth.git;a=blob_plain;f=jone...

There was some discussion of it here a while ago: https://news.ycombinator.com/item?id=942684 (though unfortunately the site is now dead)


In my work book club we've been reading the book "zero bugs" by Kate Thompson.

The first half is some fun writing with stock good advice for software development, and the second half is a bunch of code examples the author finds interesting, many of which are in various flavors of assembly.

(I have no affiliation with the author, publisher, nor any other ulterior motive. Office consensus is that this is the best book we've done a book club around - highly recommended as just plain fun to read, if you're already writing software.)


It's out of date now, but back in the day Michael Abrash's 8-cycle per pixel texture mapper was considered a thing of magic:

http://www.jagregory.com/abrash-black-book/#thats-nicebut-it...


Come on, the movfuscator has to be the best x86 I've seen:

https://github.com/xoreaxeaxeax/movfuscator


I was going to ask why? But then I read the single faq at the end: because I thought it would be funny...


In addition to being funny, it's appealing to me to have the option of generating assembly to perform a task in such a way that a human looking at the assembly would have enormous difficulty in determining what is performed. As suggested by its name, it serves as a nice obfuscator.

It also fully avoids all branches.


No movfuscator post is complete without mentioning the demovfuscator [1]

https://github.com/kirschju/demovfuscator


This injection thunk I wrote in order to do facilitate remote code and context (file handles, event handles, shared memory maps etc) injection: https://github.com/tpn/tracer/blob/master/Asm/InjectionThunk...


IBM PC x86 BIOS code:

https://github.com/kaneton/appendix-bios

OT: This brings back memories of tinkering with the MS-DOS boot process. Back then, the BIOS would read the MBR and copy its contents to 0x7C00 and start execution from there. So you could assemble your own code (using no less than MS-DOS debug) and plonk it into the MBR. I remember doing things like fooling the boot loader into thinking there's less ram than there actually was (639kB instead of 640kB) and using the unaccounted 1kB for placing your own code that could be triggered by a captured interrupt... Fun times!


What was the advantage of this over 'Terminate and Stay Resident'?


So yes, this was a TSR, except that it resided in unaccounted for memory. Also, since you put it into the MBR, it would get loaded into ram automatically with every boot, and no fiddling with autoexec.bat or anything where it could be easily discovered. All this of course, in trying to understand the fascinating world of viruses.


Maybe stealthiness? TSRs could be viewed with the right tools, IIRC. This seems like a technique to hide code?


Well, not written in assembly but uses x86 intrinsics:

https://github.com/leni536/fast_hilbert_curve

I'm pretty proud of it. It calculates the coordinates of the nth point on the hilbert curve. No loops, no branches. It uses the pext, pdep and popcount intrinsics creatively.


Interesting. Have you tried generalizing that for higher dimensions?


Duff's device, as emitted by GCC[0], is a bit on the verbose side but still quite neat. In particular the single-instruction computed goto that uses a look-up table made up of 8 quad-words, filled in by the linker.

Note the '.section .rodata' directive which actually places the quads pointers, seemingly interleaved with code, in a read-only data section.

Note also the dec/test/jle instructions implementing the while loop occur before the last of the eight copy operations, and interleaved with the next-to-last copy operation.

  duff:
  .LFB0:
          .cfi_startproc
          lea     eax, [rdi+7]
          mov     r8d, 8
          mov     rcx, rdx
          cdq
          idiv    r8d
          mov     r9d, eax
          mov     eax, edi
          cdq
          idiv    r8d
          cmp     edx, 7
          ja      .L2
          mov     edx, edx
          jmp     [QWORD PTR .L4[0+rdx*8]]
          .section        .rodata
          .align 8
          .align 4
  .L4:
          .quad   .L3
          .quad   .L5
          .quad   .L6
          .quad   .L7
          .quad   .L8
          .quad   .L9
          .quad   .L10
          .quad   .L11
          .text
  .L11:
          mov     al, BYTE PTR [rsi]
          inc     rsi
          mov     BYTE PTR [rcx], al
  .L10:
          mov     al, BYTE PTR [rsi]
          inc     rsi
          mov     BYTE PTR [rcx], al
  .L9:
          mov     al, BYTE PTR [rsi]
          inc     rsi
          mov     BYTE PTR [rcx], al
  .L8:
          mov     al, BYTE PTR [rsi]
          inc     rsi
          mov     BYTE PTR [rcx], al
  .L7:
          mov     al, BYTE PTR [rsi]
          inc     rsi
          mov     BYTE PTR [rcx], al
  .L6:
          mov     al, BYTE PTR [rsi]
          inc     rsi
          mov     BYTE PTR [rcx], al
  .L5:
          mov     al, BYTE PTR [rsi]
          dec     r9d
          inc     rsi
          test    r9d, r9d
          mov     BYTE PTR [rcx], al
          jle     .L2
  .L3:
          mov     al, BYTE PTR [rsi]
          inc     rsi
          mov     BYTE PTR [rcx], al
          jmp     .L11
  .L2:
          ret
          .cfi_endproc
___

edit: formatting, sigh

[0] v 7.3.0 64bit; gcc -S -Os -masm=intel


> Note the '.section .rodata' directive which places ..., seemingly interleaved with code, in a read-only data section.

It only specifies which section that part of "code" goes into, the linker pools it all up in a binary image and fills in the address for the tags when linking as directed by the linker script[0].

[0]: https://sourceware.org/binutils/docs/ld/Simple-Example.html


I think that Duff device is more interesting in C :)

https://en.wikipedia.org/wiki/Duff%27s_device


I've always found this 16 byte bubble sort implementation to be absolutely beautiful.

https://gist.github.com/jibsen/8afc36995aadb896b649


Pokémon! GBA (ARM7) is even being used in some undergrad level OS classes. Although you may get more mileage out of RP3 development.

disassembly of Pokémon Red/Blue

https://github.com/pret/pokered

TONC GBA Programming Principles

https://www.coranac.com/tonc/text/toc.htm


Red and blue are gbc (z80-like, not ARM).


Well, Z80 shares a common ancestor with x86 and resembles it quite a bit [1]. The assembly syntax is just different because Zilog was afraid of lawsuits…

[1]: https://en.wikipedia.org/wiki/Zilog_Z80#Datapoint_2200_and_I...


GB was Sharp LR35902. Similar but not quite an z80 or x80.

https://realboyemulator.wordpress.com/2013/01/02/the-nintend...




[xchg rax,rax] has a collection of interesting snippets. You can either try to reason what they're doing yourself, or there are a variety of writeups out there for each of the snippets.


This is not x86 but certainly an optimised code https://www.pagetable.com/?p=774


Pretty incredible that such world changing code fits on one page.

It’s ineresting that when tools are new, like ASM in 1978, they give high leverage to the first to use them. Microsoft was able to leverage a small amount of code into a world changing platform. Now it would be nearly impossible to do the same with a team the same size.

But in 2018, the nascent state of ML tools looks similar to the nascent state of programming tools in 1978. And indeed we are seeing entire companies built around relatively basic AI in the scheme of things. As first movers these companies have the same kind of leverage with respect to AI that Microsoft did to Software in the 1980s.

Perhaps in 2058 someone will share a link to a Tensorflow script and we will all marvel at its terseness and apparent simplicity.


> Pretty incredible that such world changing code fits on one page.

What code are you looking at? I see "Microsoft BASIC for 6502 Original Source Code" which runs for 6955 lines. By my reasoning that is more than a hundred pages.


Ok, “one page” may not be a technically fair description (though it does fit on a web page). Still, I think most programmers would agree that nowadays, few world changing technologies can be expressed in 6955 lines of code. That’s what I mean by high leverage.


The current time on the command line in words - inspired by the original by Jim Button. Nothing fancy, but I remember at the time it was a personal challenge to shave as many bytes from the code as possible:

https://github.com/linker3000/Historic-code-PC-Pascal-and-AS...


The very first programming book I read was a programmed instruction text on Machine Language. This would have been in 1965, and the book probably dated from a few years earlier. It was another four years before I could actually run a program on a real computer (an Algol program on a Burroughs B5500, input via punched cards).

I tried to find the book a few years back, for nostalgic reasons, but couldn't.



Someone got the source code for Super Mario Bros and commented it up: https://gist.github.com/1wErt3r/4048722 (not x86 though...)


My favourite was a function to swap either a byte, word or dword with a single ret. The function would have three entry points, calling itself recursively like hanoi-towers.


Isn't LuaJIT assembler DYNASM often referred to as poetry

https://luajit.org/dynasm.html


I remember I wrote a 6 bytes code to destroy information on the whole hard drive.

Kind of opposite of what you're asking for :)


This Defcon 18 talk called 'Trolling with Math' about trying to defeat reverse engineering with assembly tricks https://youtu.be/y124L75ZKAc


Well what we were doing for pe encryptors (will try to dig out some code from tapes when I get back into civilization :D) was generating a code that was driving execution flow using SEH (structured exception handling), sometimes the code executed as intended, sometimes the exception was trigered (with faulty code, again generated) to continue at completely different part of code. This was the complete pain in the "neck" to debug (timing the code execution is a great way to tell if someone is debugging it, and you just continue into some other generated code that is missleading the reverser), and completely impossible to solve on paper (well, with huge amount of knowlidge and a lot of time, you could do it :D). That's why I am laughing with todays overuse of try/catch/throws, we were using it to obfuscate execution flow and today it is used in normal coding with same devastating effect :D


Check out these rather old examples for inspiration,

http://www.win32assembly.programminghorizon.com/source.html


The following snippet swaps two values (rax, rcx) only if they are out of order (i.e. rax > rcx). It does so without using any branches.

  mov rdx, rax
  cmp rax, rcx
  cmovg rax, rcx
  cmovg rcx, rdx


not quite the assembly, but you may look at small intros at https://www.pouet.net/prodlist.php of sizes 32 bytes and so on.

many intros are provided with source codes, and others could easily be viewed in disassemblers.

especially look at intros by such brilliant people like Digimind and Řrřola.


I was quite impressed with the Wolf3D rendering engine when somebody released the code years back.


Michael Abrash's assembly optimizations for the Quake engine.


how du a hack


I really love this one, boot sector chess :)

https://gist.githubusercontent.com/jwieder/7e7e643cc71c81f63...


There were some pretty good Win32 API tutorials back in the day. It was similar to this, if not it:

http://win32assembly.programminghorizon.com/

It was quite enlightening to see common program constructs done only in assembly.


What about something here: http://z0mbie.daemonlab.org/

This guy was a genius (those were the times you didn't do it for profit, it was a game). Google for zmist...


The coolest one liner is xor eax,eax :) 1 cycle for setting register to 0.


Yeah, but with the side effect of clearing the flag registers. So can't be used between operations that rely on flags. Alternatively it is great to break false flag dependency.


Truly beautiful, I was so ecstatic when this clicked with me when I was first learning assembly. So much depth in one "simple" instruction.


Linux boot.S




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: