Say Hello to X64 Assembly Part 3 (0xax.blogspot.com)
134 points by valarauca1 on Sept 19, 2014 | 51 comments



I think something isn't quite right with int_to_str/print; in int_to_str, each converted char is pushed onto the stack as 8 bytes, and while the total length is calculated correctly in print, the result is that 7 null bytes get written out with each char as well. What you see will depend on how your terminal interprets them, but they will definitely be there if the output is redirected into a file.

There's also an extra "add rdx, 0x0" in int_to_str, a puzzling multiplication by 1 in print, and a confusion between the standard input (0) and output (1) fds.
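
(For illustration, one way to avoid the stray nulls is to build the digits in a byte buffer instead of pushing whole qwords; a minimal NASM sketch with my own labels, taking the value in rax - not the article's code:)

    ; convert the unsigned integer in rax to ASCII, filling a byte
    ; buffer backwards so only single bytes are ever stored.
    ; returns rsi -> first digit, rdx = length (ready for sys_write)
    section .bss
    buf:    resb 32

    section .text
    itoa64:
            lea     rsi, [buf + 32]     ; start just past the end
            mov     rcx, 10             ; divisor, set up once
    .next:
            xor     rdx, rdx
            div     rcx                 ; rax /= 10, rdx = remainder
            add     dl, '0'
            dec     rsi
            mov     [rsi], dl           ; store one byte, not eight
            test    rax, rax
            jnz     .next
            lea     rdx, [buf + 32]
            sub     rdx, rsi            ; length = end - start
            ret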


Thankfully the article uses Intel syntax.

I had to port a code generation module from Intel syntax to AT&T. What a pain!

Gas is so limited compared with what PC macro assemblers are capable of.
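
For readers who haven't seen the two side by side, here are the same operations in both syntaxes (an illustrative fragment only):

    Intel syntax (nasm) - destination first, bare register names:
        mov eax, dword [rbx + rcx*4 + 8]
        add rsp, 16

    AT&T syntax (gas default) - source first, % prefixes, size suffixes:
        movl 8(%rbx,%rcx,4), %eax
        addq $16, %rsp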


I wrote a blog post a while back showing how Intel syntax can be used in gas, along with a number of examples.

http://madscientistlabs.blogspot.ca/2013/07/gas-problems.htm...
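
(The key directive is a one-liner; a minimal sketch assuming 64-bit Linux gas, with an invented function name:)

    .intel_syntax noprefix          # make gas accept Intel operand order
    .globl add_two
    add_two:                        # int add_two(int a, int b)
        lea eax, [rdi + rsi]        # no % prefixes, destination first
        ret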


If you check a sibling thread you'll see I also had some issues with Gas macro capabilities.

Has Gas's understanding of Intel syntax improved in the meantime? It had a few bugs when I did this.


I ported the five examples from an IBM Developerworks article on x86 assembler [0] and they all worked fine, but didn't fool around much beyond that.

[0] http://www.ibm.com/developerworks/library/l-gas-nasm/


I find AT&T syntax to be completely impractical to work with, and I'm surprised people put up with it.

But, as a side note, all PC assemblers are crap. They could do so much more. I don't know why I'm expected to handle register allocation, or think about inter-unit dependencies that cause stalls.


Personally I wouldn't like to have an Assembler re-arrange my code, especially since the issues you mention can even change between firmware revisions.

On the x86, starting with the 486 and Pentium, it was no longer possible to keep all the variations in one's head.

At least I can't.

The porting I mentioned above was to make a compiler depend only on binutils.

Other than that, my last 100% Assembly application was around 1995.


yasm -p gas?


That would mean still using the AT&T syntax.

I started coding x86 Assembly back when MS-DOS 3.3 was modern in the PC world, so I never got to love AT&T syntax.

Especially after my porting experiment.

I needed to change a compiler backend to depend only on binutils (gas & ld).

So the existing NASM code had to be ported.

NASM macros couldn't be mapped to what Gas offers and I also had some issues getting my head around how to write the addressing modes.
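
(For flavour, the basic forms of the two macro systems, using a trivial invented macro; the thornier NASM preprocessor features that caused trouble aren't shown here:)

    NASM - counted, numbered parameters:
        %macro save2 2
            push %1
            push %2
        %endmacro

    gas - named parameters with backslash substitution:
        .macro save2 a, b
            push \a
            push \b
        .endm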

I was using a common trick for code generation.

Generate bytecodes that are actually Assembler macros.

A simple way to get static binaries, even if the performance is not optimal.
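
(A toy version of that trick in NASM, with invented op names: each "bytecode" is a macro, so the backend emits a flat list of ops and the assembler expands them into real instructions.)

    %macro OP_PUSH 1
            push qword %1
    %endmacro
    %macro OP_ADD 0
            pop rax
            add [rsp], rax
    %endmacro

    ; the "bytecode" for 2 + 3:
    OP_PUSH 2
    OP_PUSH 3
    OP_ADD                          ; leaves 5 on top of the stack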

In the end I gave up and made the backend generate the respective Assembly code directly.


I'm a little confused - why is ASM still an issue these days? Sure I can understand some in-line ASM for hardcore speed-critical code but beyond that...why bother? Even interpreted langs seem fast enough these days, so compiled should def be fast enough and resorting to ASM should imo be unnecessary.

NB the above is a personal view & I'm not a programmer by profession...so if I missed something - no offence intended.


As someone who reverse-engineers and has seen a lot of compiler-generated code as a result, I'm even more convinced than before that the whole "compilers are better at generating code" mantra is a myth. The only thing they're good at in practice is generating lots of code quickly; there are instances when the output of a compiler manages to impress me (Intel's is particularly good at that), but they're still quite isolated instances and the rest of the code continues to have this "compiler-generated" look to it, i.e. much could be improved.

It is certainly not hard to beat a compiler on speed or size (often both), and I believe the only ones who can't are those who learnt Asm from the stupid way compilers generate code rather than from how the machine really works. E.g. it's commonly taught that x86 has 6 general-purpose registers (reserving eBP/eSP), but in reality eBP-based stack frames and that style of stack use are nothing more than a compiler-generated artificial construct. Even eSP can be used for something else if you really need one more register[1]!

Compilers follow the rules of their source language and impose strict, often unnecessary conventions on their output. Asm follows the rules of the machine, which are far richer and more expressive than the abstracted simplicity of any HLL. That being said, they have improved significantly over the years - the days when compilers would push/pop every register on entry/exit to a function regardless of whether it was used (or its caller needed the value preserved), or when the start of every function could be identified by a distinctive 55 89 E5 (push bp; mov bp, sp) in the binary are fortunately mostly history.

[1] http://www.virtualdub.org/blog/pivot/entry.php?id=85
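
(A small illustration of the eBP point: a leaf routine is free to treat it as just another callee-saved register and build no frame at all. Invented example:)

    sum_bytes:                      ; rdi = buffer, rsi = length
            push rbp                ; only obligation: preserve it for the caller
            xor  eax, eax           ; running sum
            test rsi, rsi
            jz   .done
    .loop:
            movzx ebp, byte [rdi]   ; rbp is plain scratch here, no frame
            add  rax, rbp
            inc  rdi
            dec  rsi
            jnz  .loop
    .done:
            pop  rbp
            ret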


As for modern processors, I doubt many humans are able to keep in their head what each model and microcode firmware release does to their micro-ops.


That actually matters far less than most people think, because humans writing Asm do not (and should not) generate code like a compiler does.

The article has a great example of this in the int_to_str code. While it's far from optimal code (e.g. the "mov rbx, 10" should really be outside of the loop, the "add rdx, 0x0" is useless, etc.), it's using the stack in a way that a compiler would probably never do, and something that likely isn't possible to express at all in a HLL. The C version would involve at least allocating an array with some pointers into it to act as a stack.
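
(Roughly the idiom being described - a sketch, not the article's exact code: the hardware stack itself is the scratch array, and each digit is written out as a single byte straight from the top slot.)

    print_uint:                     ; value to print arrives in rax
            mov  rbx, 10
            xor  r8, r8             ; digit count
    .divide:
            xor  rdx, rdx
            div  rbx                ; rdx = rax % 10, rax /= 10
            add  rdx, '0'
            push rdx                ; one ASCII digit per stack slot
            inc  r8
            test rax, rax
            jnz  .divide
    .emit:
            mov  rax, 1             ; sys_write
            mov  rdi, 1             ; stdout
            mov  rsi, rsp           ; the digit is the low byte of the top slot
            mov  rdx, 1             ; write exactly one byte - no padding
            syscall
            add  rsp, 8             ; drop the slot
            dec  r8
            jnz  .emit
            ret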

It's in these approaches - fully exploiting the features of the machine - where human-written Asm really shines. In some ways, the compiler "has its hands tied" when it comes to generating code, since it is limited to generating code that represents ideas in the corresponding HLL; it can make use of extensive knowledge of microarchitecture to schedule instructions/uops, sophisticated algorithms for register allocation and instruction selection, and generate code that is faster on a particular model of CPU than if it did not, but this falls short of the human who realises that neither a separately allocated array, nor such instructions to manipulate it, are actually needed.

The human can apply even more extreme optimisation making use of microarchitectural details if she so chooses; but the fact is that even without this, the code is probably already better than what a compiler can do. In practice, for general-purpose-code (i.e. excluding additional features) most models of x86 are all quite similar since the P6, and Intel has been doing a great job at making existing code faster with each new generation, so there is little need to go to that extreme; the optimal sequence of instructions for one model is often the same or close to optimal for the next one. (The odd ones out are the P4, and possibly some of the early Atoms.)

Another item that appeared on HN a while ago illustrates this "think outside the HLL/compiler" idea too: http://davidad.github.io/blog/2014/02/25/overkilling-the-8-q... ( https://news.ycombinator.com/item?id=7301481 )


But sometimes you are faced with squeezing all the performance you can out of a specific processor.


And then comes a firmware update to the microcode...


And then you apply it in testing, note the regression, and either don't apply it in production or apply it as you push out an update.


Do you control all the processors your customers use?


If the software will be running on the machines of "customers" and you do not "control all the processors" they use, then you're not in the "sometimes" I was discussing above.


That was what I was trying to say, somehow badly I guess.


Yeah, I certainly didn't want to give the impression it was a common situation, just that it totally does happen.


Do you have an example of this happening?


You miss my point 100%. I know ASM code is technically superior.

However, the advances in tech have outpaced the advantages of ASM. Yes, your code is 10x more efficient than mine, but I can throw 1000x as much processing power at this via the cloud & a scripting language.

Yes you've seen more "compiler-generated code" than I have but realistically...who is going to win this p!ssing contest? You writing ASM code or me scaling a scripting language across clouds? I bet I can deploy cloud processing power faster than you can

    MOV AX,CS
    MOV DS,AX

Also - I'm not in the cloud game - the above is a simple example/playing devil's advocate.


It depends tremendously on context. For instance, it doesn't matter how many cloud resources you can throw at the problem if you need sufficiently small latencies.

Note that it's exceedingly rare that one should be writing an entire program in assembly. But knowing assembly allows you, as you bump up against performance constraints, to inspect what code is being generated and understand what's slowing things down, which can help guide you in adjusting the source so it will be faster, applying a better set of optimization options, hand-tweaking the generated assembly, or ultimately rewriting sections while having a reference implementation.


>>if you need sufficiently small latencies.

I'll concede that you'll need ASM for core speed critical applications like stock exchanges...then again I conceded that point before we started debating...

>Note that it's exceedingly rare that one should be writing an entire program in assembly.

My mention of in-line ASM should have made it obvious that I'm not arguing for pure ASM.

>applying a better set of optimization options, hand-tweaking

You hand tweak your ASM, I deploy additional cloud clusters. We'll see which scales better (pro-tip: optimization is notoriously prone to diminishing returns).


">Note that it's exceedingly rare that one should be writing an entire program in assembly.

My mention of in-line ASM should have made it obvious that I'm not arguing for pure ASM."

That wasn't me arguing with you, it was an attempt to clarify my position on the matter in light of what else I said.

"You hand tweak your ASM, I deploy additional cloud clusters. We'll see which scales better"

That depends tremendously on, again, the context - including where one already stands on the various curves and on how parallel the problem is.

"(pro-tip: optimization is notoriously prone to diminishing returns)."

As is parallelism: http://en.wikipedia.org/wiki/Amdahl%27s_law


> but I can throw 1000x as much processing power at this via cloud & a scripting languages

This kind of stuff is why I stopped teaching and mentoring. It's like trying to convince creationists with facts.

Also, it's common knowledge that programmers spend the vast majority of their time debugging. The other day I wasted an hour because a popular high-level language doesn't warn users when comparing apples to oranges. I've also, on occasion, spent a lot of time chasing bugs in cloud services that make debugging close to impossible. There's no free lunch.


The definition of "fast enough" varies.

A while ago, I was doing some exploratory work — basically trying out various approaches to a problem. There was a particular piece of code that was quite expensive to execute and that was executed a lot. The net result was that a single calculation of the required matrix took about two months.

It took more than a month of hard, full-time work to fit the problem into fixed-point registers and the SIMD model used by Intel SSE. Over two weeks of that month were spent on debugging, even though I went through multiple versions in C, each one closer to the target assembly. But the resulting speedup was well worth it: about 600x.
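
(To give a flavour of what "fixed point plus SSE" looks like, an invented minimal kernel - eight Q15 multiplies at once - assuming SSSE3 and the SysV x86-64 calling convention; not the poster's actual code:)

    ; void q15_mul8(int16_t *dst, const int16_t *a, const int16_t *b)
    global q15_mul8
    q15_mul8:
            movdqu   xmm0, [rsi]    ; eight signed Q15 values from a
            movdqu   xmm1, [rdx]    ; eight from b
            pmulhrsw xmm0, xmm1     ; rounded (a*b) >> 15 in all eight lanes
            movdqu   [rdi], xmm0
            ret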

Now think about it. 600x. We went from "a single try/iteration takes two months" to "we can try six variations overnight".

So yes, assembly matters, for many reasons of which I only outlined one.

As a side note, people tend to focus on big-O algorithm performance, neglecting the constant. But the constant matters in practice, more than most people think.


>The definition of "fast enough" varies.

You list one specific example where ASM provided a tangible speed benefit in a research-oriented, processing-heavy environment. Indeed. Now take a good look at my original comment:

>>Sure I can understand some in-line ASM for hardcore speed-critical code


Er, do you think only a select few people need to know assembler then?


Even though I don't advocate application programmers using assembly, I strongly disagree with your statement "Even interpreted langs seem fast enough these days". This is probably true for a small subset of programmers who are mostly into web programming (using the logic that the network, not the CPU, is the bottleneck, which is plain nonsense), but not for programmers who create useful applications used by many companies around the world. To give some examples, not necessarily related to interpreted languages, but to show how software keeps getting worse:

1. My Windows 2000 start-up time on a 1GB machine was probably only slightly higher (4-5 seconds) than the Windows 8 start-up time on an 8GB machine. It looks like improvements in hardware have not made any difference to the user experience. Similar for Linux, though that may be driven by distributions bundling more things at start-up.

2. In Excel 2003, I could right-click on a cell and choose Format - the formatting dialogue would open up instantly. In Excel 2010, I need to wait 2-3 seconds for the dialogue to open.

3. I regularly use software like TOAD and SAS at work. Given that both are often used to examine data, I need them to work very smoothly after they have retrieved the necessary data from the back-end database. TOAD is ok, but SAS has gotten progressively worse over the years. I don't care whether your developer time is costlier than compiler time; I need the software to work as fast as it can, constrained only by the specifications of my machine.

4. My laptop at work is encrypted using Symantec products. The software is so bad that the hard disk light on the laptop just keeps on "glowing", indicating HD activity, continuously without EVER stopping. All applications are slowed down too because of this activity.

I could just go on and on about how software is getting worse day by day. Even John Carmack recently tweeted about this - how smartphones with massive processing power and small resolutions can't display graphics decently as compared to very old machines. Anyway, thanks for letting me vent out my frustration which has been building up for quite a few days now.


I don't care whether your developer time is costlier than compiler time; I need the software to work as fast as it can, constrained only by the specifications of my machine.

I can't agree more. Many programmers are traditionally taught that optimisation is some sort of "last moment" effort for when things are seriously bloated and slow, and that programmer time is more expensive than machine time, and it seems all this does is breed a lazy, selfish attitude; yes, it's very likely not worth expending an hour or even 10 minutes to optimise a throwaway script that's used only by yourself and isn't a bottleneck in your process, but when your software is used by hundreds, thousands, or even millions of users worldwide, who are often paying users, it doesn't seem nice at all to make them all wait or force them to buy newer hardware just so you can "save" a bit of work that you should be doing anyway.

It's like http://xkcd.com/1205/ , multiplied by the size of the userbase.


I feel your pain, though the first two things you mention will never be improved by vigorous application of assembly code in software, and actually I'm a big fan of interpreted/scripting languages.

E.g. the start-up time of Win 8.1 compared to Win 2000 on the two quoted machines probably would not change by more than +/- a few percent if you went from completely unoptimized compiled C to hyper-unrolled, hand-tuned SSE3/AVX assembly.

Also the quoted few-second delay of MS-Office popping up the "Cell Format" dialog probably has more to do with the software asking (I'm making that up) the OS for a list of 1000 fonts and their properties, and less with MS Excel being written in optimized assembly, C, VBS or python, for that matter.

In my experience, some programmers like to discuss endlessly the merits of replacing "i*4" with "i<<2", or other stupid micro-optimizations[#], and are often completely oblivious to the "high-level" design, such as using proper data structures or scheduling their work efficiently across multiple cores. And for the "high-level" issues, scripting languages might even help in coding larger systems quickly, so that the real optimization effort can be spent on the actual compute-intensive or resource-hungry task.

[#] THIS IS JUST A STUPID EXAMPLE! DON'T DISCUSS i<<2 <-> i*4 HERE, PLEASE!


It is the difference between passion and education. Someone who is passionate about technology is really interested in all the bits and crevices; someone who has simply learned about technology classifies knowledge into 'useful' and 'not useful' parts. A common example is 'car people' who go through and re-design the fuel injectors for their car. It isn't "required", but they really like everything about all the pieces. So they persevere to understand it all.


This is a very well-thought-out statement. I was just thinking this morning how great it is to understand low-level logic and coding, but when discussing it with a co-worker, it was clear that there was no practical day-to-day benefit from knowing about the underpinnings of the higher-level languages we use.

That said, I am able to better overcome technical challenges when they arise due to my curiosity and passion for the details.


I hear you. Though in the modern world I feel everyone is forced to triage on the useful/not useful front. There is too much info out there and not enough time.


In a past life, I worked in an industry where just about all components of our large software system were written in C++.

Lots of multithreading, memory allocations, deallocations, etc. If there is one thing C++ developers know better than the language itself, it's debuggers and memory dumps. We used tools like windbg and gdb almost as much as the IDE.

Often, crash dumps from our release binaries didn't have enough built-in information to find the cause of the crash. We would then have to do things like walking the stack manually. To do that effectively, it was necessary to understand x86 internals and figure out what the disassembled code was doing. This is probably the likeliest way most native application developers run into assembly language. It always helps to know it well.
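
(For reference, the classic rbp chain that a manual stack walk follows, provided the code was built with frame pointers; a generic sketch, not our actual binaries:)

    some_function:
            push rbp                ; [rsp]   = caller's rbp
            mov  rbp, rsp           ; [rbp]   = saved rbp  -> previous frame
                                    ; [rbp+8] = return address into the caller
            sub  rsp, 32            ; locals live below rbp
            ; ... body ...
            leave                   ; mov rsp, rbp / pop rbp
            ret
    ; walking the stack by hand means repeatedly following the saved-rbp chain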


When you are profiling code that's been generated by an optimizing compiler, you often have to manually reverse-translate from the compiler generated assembler back into the C/C++ source to understand the profile results. The reason you have to do it manually is because one line of C or C++ can easily be compiled into a dozen or more instructions which may not even be adjacent to each other. I don't write much assembler but I find myself reading assembler at least once a week.


ASM is great for learning how CPUs work. Also it's central for:

  * building compilers, especially JIT
  * debugging binaries
  * the lowest level of kernels
  * understanding security exploits like overflows, format string, JIT spraying
  * hardcore performance


Assembly is part of how a computer computes, and there are people who want to understand every step of the process. Such people are not satisfied until they can pull back the covers, and say to themselves "Aha! I get it, this little thing here feeds into this thing here... which goes there... and out comes my answer. Cool."

And it's a good thing such people exist, because they build all of the foundational technology that you take for granted.


Someone needs to write bootloaders, OS drivers, compilers, interpreters, image processing algorithms, audio processing filters, ...


There are other uses for ASM aside from getting more speed. Learning ASM will show you how things work at a low level, which is helpful when you try to understand C pointers and other details. Reading assembly also lets you (try to) understand the programs you don't have the source for - device drivers¹, malware¹, copy-protection, regular third-party applications ...

I spent several years disassembling a game (C&C Red Alert 2), understanding the code, fixing the shitty parts and glitches left in it, and extending it. It was fascinating.

¹These examples are certainly far from trivial, I know, but that is half the fun.


If you're writing a desktop app, you may need to debug frameworks and libraries you don't have the source for. If you author a framework or library, you may need to debug a client application you don't have source for. In both cases, you'll likely need to debug assembly code.


Well, there are always people who just think it's fun. Plus, you learn A LOT about the processor you are working with.

But the group of programmers who need to work in assembly is not very large, I think. Most of them are probably writing compiler backends or working in embedded systems.

But learning assembly teaches you a lot about how the processor works, so it might be time well spent even if you never actually write assembly.

(NB I don't really know assembly, I have written like 20 instructions of inline assembly (and 3 inline binary instructions - what a rush that was!), so I am not an expert on the matter, but I can see the appeal. ;-)


It's still a need to know if you work in the reverse engineering, security, vulnerability analysis, etc. realms.


It's probably a pet project for fun & learning about the metal.


I think I just found the perfect use for the Intel Edison (a tiny Atom (x86) + BLE + Wi-Fi module by Intel): https://www.sparkfun.com/products/13024


Edison is 32-bit x86.


Is there something equivalent for nasm on OS X?


Yes, as someone already commented, you can use Homebrew. I think Xcode might include nasm too, but I'm not sure if that is still true.

There are some hello-world examples for 32-bit and 64-bit nasm on OS X here:

https://gist.github.com/desertmonad/36da2e83569bc8b120e0


     brew install nasm


Nowadays yasm is quite a bit better and is backwards compatible with nasm.



