I translated a simple C program to x86_64 and it was slower (ecc-comp.blogspot.com)
214 points by spiffytech on April 25, 2020 | 126 comments



The assembly is not bad for a beginner but far from optimal. A few things that stand out from a quick glance:

1) Try to load things sequentially; don't load from offset 32, then 0, then 16.

2) Don’t use the loop instruction

3) this code:

  dec  n
  cmp  n, 0
  jne  1b
Can be replaced with:

  sub n, 1
  jnz 1b
Which will actually be executed by the processor as a single instruction (macro-op fusion).

4) Interleave your expensive instructions with less expensive ones. Try to interleave multiple dependency chains to let the processor see more of the parallelism (see the C sketch after this list).

Your two divisions one after another will be limited by available execution units capable of doing the divide.

5) Lastly, align your loop labels to 16-byte offsets. The assembler will do this for you with the ALIGN directive.
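To illustrate 4) at the C level (a rough, made-up sketch, not the article's code): splitting one long dependency chain into independent ones lets the CPU overlap latencies, because each partial result no longer has to wait on the previous one.

  #include <stddef.h>

  /* One accumulator = one dependency chain: every add waits on the last. */
  double sum_single(const double *a, size_t n) {
      double s = 0.0;
      for (size_t i = 0; i < n; i++)
          s += a[i];
      return s;
  }

  /* Two independent accumulators can be in flight at once, so the CPU
     overlaps their latencies. The same idea applies to the divisions
     mentioned above. (Note this reassociates the FP sum slightly.) */
  double sum_pair(const double *a, size_t n) {
      double s0 = 0.0, s1 = 0.0;
      size_t i;
      for (i = 0; i + 1 < n; i += 2) {
          s0 += a[i];
          s1 += a[i + 1];
      }
      if (i < n)
          s0 += a[i];
      return s0 + s1;
  }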


Doing most of your changes made essentially no difference X)

I suspect the assembly generated by the compiler is pretty fancy.

Edit: Just briefly looking at the objdump again, gcc is using a lot of scalar variants of instructions, instead of vector. Maybe this is giving the CPU better hints?

Maybe my code would run better on a more modern CPU (while still constraining the original C program to SSE3 only)?


Strength reduction and algorithmic changes will make the most difference, but that requires a much deeper look at the code. I wouldn’t be surprised if the compiler found a way to reduce the number of divisions.

If you have a lot of practice, the compiler is not difficult to beat. But it definitely takes some effort and study of the optimization manuals.


I don't know why you're using objdump instead of -S to have gcc generate assembly directly.

You get named constants and named labels that way, and you can even choose the sane Intel syntax with -masm=intel, which you can also do with objdump.


I like -Wa,-adhlns=foo.lst -g to see the C, asm, and machine code all together.


Hand written 294 lines vs. gcc -O3 526 lines. https://godbolt.org/z/yEcMKf

At a first glance, it seems the compiler version is better at hiding the latency of some of the div instructions. It might be hiding memory access latency, too. But that's more involved analysis.


For x86-64, compilers will choose instruction sequences that automatically optimize for that processor. Unless you're really good at knowing the processor, you're almost always better off doing it like this. There's almost no benefit to writing assembly these days because of these optimizations.


Are there linters that will overlay compiler-tier optimization hints on top of handwritten assembly?


Huh, yeah, that'd be neat. Specifically just ripping out the peephole optimizer from a backend just for matching would be a great start with a decent amount of value and not a god tier amount of work.

That being said, you're not going to see the really crazy optimizations that compilers can do, because they need a bit of a higher level view of what you're trying to do.


not precisely what you're asking, but there are two tools that annotate machine instructions with results from throughput and latency analysis, which helps with loop microoptimisation.

the original from intel, iaca: https://software.intel.com/en-us/articles/intel-architecture...

newer clone, shipped with llvm: https://llvm.org/docs/CommandGuide/llvm-mca.html


The closest thing would be the Godbolt Compiler Explorer:

https://godbolt.org/


It is a nice tool for online exploration, but it's not VTune.


You could try to use one of the C decompilers to get C source code back and then ask gcc/clang.


This would be cool. I usually write up some C snippets and compare what different compilers do with what I have. The compiler explorer makes this really easy


Don't know of anything that works on the source code level, but static analysis in profilers does that for the object code (the ones I use are proprietary, but I imagine something like VTune does this as well).


I have not looked at it in many years, but Intel's VTune tool was designed to help coders analyze and hand-tune code.


Shouldn't a superscalar CPU figure out concurrent execution chains by itself with register renaming?


It will do a bit of that, but remember it has to work in real time and can only look ahead so far. Give it a fighting chance by helping out whenever you can.


Tangent, but this reminds me of this great talk by Matt Godbolt from CppCon 2017: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”

https://youtu.be/bSkpMdDe4g4

And the follow-up: https://youtu.be/nAbCKa0FzjQ


It's comments like these that make me realize I have a lot to learn about computers. I understand those words separately, but together they sound like a line from Star Trek.


CPUs can execute multiple instructions per clock cycle (recent x86_64 can do 4-5). However, instructions can take 1 or (commonly) several clock cycles to complete before their results are available for any instructions depending on that result to start executing. Such dependencies are the bane of superscalar parallelism.

But sometimes things that look like dependencies are fake.

  XMM0 = XMM1 + XMM2
  XMM3 = XMM0 + XMM4
  XMM0 = XMM5 + XMM6
  XMM7 = XMM0 + XMM8

At first glance it may look like you have to calculate these instructions serially. But, by renaming the last two `XMM0` you eliminate the dependency on the specific register, and can calculate instructions 1 and 3 in parallel, followed by 2 and 4 in parallel.


CPUs are designed with compiler optimizer experts in the loop. If the compiler can do it, the CPU won't try. Instead the CPU does things that the optimizer can't do. Note that this goes both ways: if an optimizer won't use something, the CPU won't do it; if the optimizer wants something, the CPU will do it (obviously within the limits of what is possible, and all the other trade-offs).

Register renaming is designed assuming the optimizer has already done its part.


Except for all those times it fails miserably, like doing this for AVX SIMD.


Point 4 is an interesting fact I didn't know. I learned that on modern hardware the simple data structure actually runs faster (linked nodes vs. contiguous storage), but at the assembly level I guess it's quite the opposite.


Under what circumstances do non-contiguous data structures run faster? I know of circumstances where structures with pointers make it easier to get better asymptotic behavior with a large amount of data, but none where a linked structure out-performs the analogous contiguous one on moderate amounts (a few cache lines) of data.


I wondered when he learned that. I learned the same thing in 1996, but modern CPUs were not yet (to the knowledge of the class; I'm sure they were doing it...) doing prediction in the cache to load the next required memory before it was required. This optimization changed things such that linear search is often faster than binary search (unless n is very large, of course).


Where could I read about advanced assembly optimization techniques like this?



In the data sheets for your CPU. Note that this is CPU specific; even when the instruction set is shared, that doesn't mean the same optimization applies.


In any other language it would be something like if ((x - 1) != 0).


You have five bodies and 16 SSE registers. The entire state of your simulation can fit into register space, and you don't need to ever access memory during the stepping part of your code. You can unroll all gravity interactions so you end up with one large branchless, memoryless block of code. Now that it's completely inline, you can rearrange your dependencies based on expected latency and throughput of operations (https://software.intel.com/sites/landingpage/IntrinsicsGuide...)

Then after that you can merge the operations where you can (for SSE4 the most you are going to get is 2x, because you are using doubles).

You may think the full inlining is cheating, but the compiler has the same information, as your bodies list is entirely constant (since your dt and your masses are constant, they can also potentially be folded).
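A minimal C-level sketch of that idea (the struct layout and NBODIES constant here are assumptions, not the benchmark's exact code): with compile-time-constant bounds, the compiler can fully unroll the pair loop and keep the state in registers.

  #include <math.h>

  #define NBODIES 5

  struct body { double x, y, z, vx, vy, vz, mass; };  /* assumed layout */

  void advance(struct body b[NBODIES], double dt) {
      for (int i = 0; i < NBODIES; i++) {
          for (int j = i + 1; j < NBODIES; j++) {
              double dx = b[i].x - b[j].x;
              double dy = b[i].y - b[j].y;
              double dz = b[i].z - b[j].z;
              double d2 = dx * dx + dy * dy + dz * dz;
              double mag = dt / (d2 * sqrt(d2));   /* one div, one sqrt per pair */
              b[i].vx -= dx * b[j].mass * mag;  b[j].vx += dx * b[i].mass * mag;
              b[i].vy -= dy * b[j].mass * mag;  b[j].vy += dy * b[i].mass * mag;
              b[i].vz -= dz * b[j].mass * mag;  b[j].vz += dz * b[i].mass * mag;
          }
      }
      for (int i = 0; i < NBODIES; i++) {          /* position update */
          b[i].x += dt * b[i].vx;
          b[i].y += dt * b[i].vy;
          b[i].z += dt * b[i].vz;
      }
  }

Both loop bounds are constants, so -O3 (plus -funroll-loops if needed) can flatten all 10 pairwise interactions into one straight-line block, as described above.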


Full inlining might be slower if the branch is easy to predict


This. You should avoid memory accesses and branches like the plague.


This is the n-body problem from the Computer Language Benchmarks Game with the attribution to the original author removed.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

The faster programs usually use SIMD.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


Really sorry about that one. I added the link a few minutes ago. I just did not expect this to be notable at all to anyone. Thanks for pointing it out!


Any idea what's going on with the mem values in the listing in your second link?

Wouldn't expect a C++ implementation to use 200x the memory of a C implementation. Different algorithms presumably? Or a significant compiler optimisation being missed?


Maybe because the C++ version pulls in the C++ standard library on top of stdio?


Ah, of course. So the problem doesn't involve any real allocations, and it's just the language overhead that we're seeing.


This is off-topic, but I really wish that they re-added the graphs in the Benchmarks Game.



This is quite useless compared to the old ones (it used to have a benchmark on each program and language and you could compare them)


You don't say which graphs you are complaining about, and you don't say why you think some other graphs were useful.

Those graphs show the same as they did in 2013:

https://web.archive.org/web/20130411161423/http://benchmarks...

Do you mean these:

https://web.archive.org/web/20130520174141im_/http://benchma...

Do you mean these:

https://web.archive.org/web/20140608064549im_/http://benchma...


I meant the last one. I did say exactly which graph I was complaining about.


So why do you "really wish that they re-added" that graph ?

What problem would that solve for you ?


What does that graph let you do that you cannot do now?


Easily compare, in a cute format, how one language does against another for a given problem in terms of memory usage and time.


https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Not "cute" enough?

(More people complained about those charts than thought them cute.)


Sometime around 2000, I tried to hand optimise an image processing routine in x86 assembly. Previously I'd only done 6502.

My first attempt was massively slower than the compiled code. I had to get into the whole pipelining that was going on and other hardware optimisations.

When I discovered putting a NOP in a tight loop sped up the code, I realised I wasn't going to beat the compiler most of the time!


I also had that thing with the NOP happen to me once, with the program with the extra NOP running 2x as fast! Took a couple of days until I finally figured out what was going on.

After much investigation, what I found out was that the original code without the NOP was actually running at only 1/2 the speed that it should. Due to very bad luck, the addresses of the jump targets in the inner loop were placed in a configuration where the branch predictor failed to predict the jumps (perhaps because of collisions in the internal "hash tables" used by the jump predictor). Any nudge to the executable would get the program out of the pathological configuration. Using a different compiler version, a different OS, or a different CPU model all did the trick. But the most fun of course was that adding or removing a NOP also made the difference :)


Raymond Chen has an entire article on the use of NOP in Windows 95 code. In one case, they had to fix a bug with a 32-bit NOP, because using a 16-bit NOP would introduce a different bug!

https://devblogs.microsoft.com/oldnewthing/20110112-00/?p=11...


Was it instruction alignment or something more mysterious?

My anecdote from hand optimization is that I didn't even know some instructions existed.


Really insightful comment.

It's true. C compilers are so optimized, the days of rewriting in assembly have long passed for most of us.


It is the dance between compilers and chip designs that unmakes hand work.


I think the point was adequately made. You can, of course, write better assembly, and you can of course write the exact same assembly a compiler would. But if all of the suggestions for optimization are things compilers would do anyways, you are better off coding in a higher level, portable language in the first place, and only dropping down to asm when it is sufficiently beneficial.

All nitpicking about the quality of the assembly is valid of course, but it does inadvertently help prove the point, especially since most or all of these things are things compilers do today.


This talk proves this point exactly:

https://youtu.be/zBkNBP00wJE

Watch from 27:00 where he talks about using const wherever possible so the compiler can optimise. It reduces a full function call to work out a colour in 3D space from thousands of instructions to one.
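A tiny illustration of the kind of thing he means (a made-up example, not the one from the talk): marking data static and const gives the compiler licence to fold a whole call away.

  /* The colour table is static and const, so the compiler may fold lookups. */
  static const unsigned colours[4] = { 0xff0000u, 0x00ff00u, 0x0000ffu, 0xffffffu };

  static unsigned colour_of(int i) { return colours[i & 3]; }

  unsigned sky(void) {
      return colour_of(2);   /* at -O2 this typically compiles to "return 0x0000ff" */
  }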

After watching this and a few others, particularly the one on dynamic dispatch in Swift, I find myself adding as many hints as I can and applying the most restriction/scoping possible: final, const, private, etc.

https://developer.apple.com/swift/blog/?id=27

Even though actually it can work this out in your module!


Well, you have to also consider that most of us write a lot of high level code and very little, if any, assembly code, so we shouldn't expect to be good at writing high performance assembly without practicing. I've seen some rather amazing high performance hand-written assembly code, by people who write it often, but I certainly wouldn't be able to do it, without a lot of study and practice.

Of course, for most people and most uses, it's not worth the effort.


I frequently make this point so I may sound like a broken record, but I believe this is more or less a fallacy.

Again, the point isn’t that better assembly couldn’t be written. It’s that it most likely wouldn't be significantly better than the compiler because all of the suggestions are things compilers would be doing anyways. There are some cases where this isn’t true, especially when dealing with vectorization, but those are mostly just exceptions (and intrinsics often offer easier ways to do such optimizations...)

But here’s the point that I feel is often ignored when it comes to programming language debates in general: just because you are experienced with and aware of advanced usages of the environment you’re programming in, does not mean the complexity and especially cognitive overhead of said complexity has disappeared. Looking at the C version, it doesn’t really look especially optimized, which is not really something that you would see in assembler, at least not in my opinion. Complexity adds up over time; abstractions are the antidote to that problem.

On top of that, assembly language is obviously not portable, which IMO is even more reason to use a high level language and drop to asm only when needed; you can easily swap implementations and have a fallback for architectures that aren’t specifically optimized.


If you don’t study and practice it, how can you possibly know how to write good code (in any language)? In assembly, if you’ve never been exposed to eg the simd instructions, or aligned memory access or cache or branch prediction or instruction level parallelism, how can you expect to write performant assembly code? Experience and knowledge doesn’t just appear.

I’m not arguing that it’s worth it or that it’s easy to beat the compiler. I certainly am not going to bother writing assembly (maybe some intrinsic for SIMD but certainly not raw assembly, outside of embedded systems, although even then it’s not really worth it usually).

I’m simply saying that you can’t expect to be good at something unless you practice it.

But that doesn't mean people shouldn't try to learn. For example, somebody has to implement the optimisations in the compiler, and that person needs to have a great understanding of how to produce high-performance assembly code. Plus, learning new things is always worthwhile if you have the time.


In Assembly, even if you manage to beat the compiler, it might be a Pyrrhic victory, because the win might be lost when trying the same benchmark on another CPU or after getting a microcode update.

During the '80s and early '90s it was a different matter, because CPUs were dumb, hardware was relatively static (especially on 8- and 16-bit consumer systems), and high level optimizers were pretty dumb given the resource constraints of those platforms.


I'm not debating whether or not it's a worthy endeavour, though; I'm only saying that you can't expect good performance out of assembly code unless you practice writing high performance assembly code. Most of us have a lot of experience with high level languages, so it makes sense that we can write well-performing high level code, but we shouldn't expect that we can just "drop down to assembly" and get a performance boost. That also doesn't mean it's never possible, for the people who do actually do this a lot (e.g. the x264 people writing hand-crafted SSE/AVX code).


It is a herculean effort trying to master modern Assembly.

Back in the day, you could easily know all opcodes for a given CPU, and their clock cycle timings.

This is the SIMD guide for the Intel CPUs,

https://software.intel.com/sites/landingpage/IntrinsicsGuide...

Which is only a tiny subset of all the opcodes that a modern Intel CPU is able to understand, let alone what AMD also offers.

You need tools like VTune from each CPU vendor to actually understand the CPU clock timings of each opcode in micro-ops (microcode execution unit).

While you can master a specific subset, like knowing the AVX instructions, mastering Assembly back to back like in the old days only really happens when writing Assembly for stuff like small PIC microcontrollers.

Trying to master a language like C++ is easier, which says a lot about what modern CPUs look like.


These days, it’s not only compilers that are optimizing for the particulars of the CPU, but CPUs optimizing for compiled code. My favorite examples of this are the inconsistent and sometimes downright crappy performance of “rep” string instructions and the “loop” instruction. Both seem like ideal easy things to write for handwritten assembly, yet the performance of both constructs can be quite awful on certain modern CPUs compared to the naive loops that compilers output. Much of this can be blamed on the fact that compilers rarely use either construct, so chipmakers had no reason to make either of them efficient (and, indeed, seem to have actually pessimized them on some chips!).


REP MOVS is still extremely fast --- and also small. It'll copy cacheline-sized chunks if the size is large enough.

LOOP is a bit of a weird case. I've seen it benchmark both slower and faster than dec/jnz depending on the surrounding instructions.


loopXX instructions do not use the CPU's LSD (Loop Stream Detector), while the cmp/jnz construct takes advantage of it. This speeds up some small loops. Also, there are some rules in the Intel manuals for instructions within a cmp/jnz loop, like no mismatched push/pop, etc.

Now, does anyone know why?


> like no mismatched push/pop, etc.

My guess is virtual stack pointer update prediction latency.

To expand on that, Intel's CPUs have had for a long time a separate piece of hardware dedicated to a "virtual" stack which speeds up push/pop instructions. If pushes and pops are not mismatched, then all stack operations can stay entirely within that and there's no need to update the "real" stack pointer nor stack entries upon leaving the loop.


Thank you for your answer! Any idea why loops would not use the LSD when programmed using loopXX instructions, but would when cmp/jnX is used?


Author really needs to put the assembly under a profiler. Others have mentioned things like poor instruction selection (loop).

In my experience it's good to first attempt a rewrite in C using intrinsics. This gets you thinking about data and register layout at a high level and lets you better identify mid-level optimizations before committing to assembly.


lol, I did not expect this. I submitted this myself yesterday and it got 2 upvotes.

To the people with suggestions, especially jdsully: thank you! I will probably give these a shot.

To the people who say "blah blah someone who doesn't know x giving an opinion on y blah blah": ok, what is wrong with sharing an experience? I'm not sure how to interpret this other than that you take what I write way too seriously. It's an explorative piece...

Edit: I wanted to mention too that it's really cool reading others' stories around this topic. Thank you for sharing!


The author picked a case in which compilers can be faster, because it turns out to be the sort of code that compilers are usually optimising for: code that contains plenty of scientific/numerical computations. Thus the result is not so surprising.

On the other hand, with branchy "business logic"/general-purpose code and algorithms that can't really be vectorised, it's pretty easy to beat a compiler with handwritten Asm --- on speed, size, or often both.


Please show us a proof - I have yet to see any non-toy program / routine with complex control flow written in assembly beating a modern optimizing compiler.

You can optimize a small routine. Anything that's not a toy rapidly becomes impossible to optimize by hand.

Just the register allocation itself can be a gigantic pain to do by hand, even with a limited register set of x86, let alone other RISCs with much larger register files.

EDIT: Added clarification in the wording.


http://menuetos.net/

> Just the register allocation itself can be a gigantic pain to do by hand, even with a limited register set of x86,

Actually, that's one of the main advantages of using Asm --- compilers are relatively horrible at register allocation, because they don't take a more "holistic" view of the program, and end up shuffling values between registers or registers and memory far more often. This is why handwritten Asm has a unique "texture" to it.

> let alone other RISCs with much larger register files

There's much less room for optimisation with handwritten Asm on a RISC, because with x86 the opportunities are precisely those the compiler can't easily see, unlike a very uniform and boring RISC.


You say those things. Please show me an example of a real world, general purpose, non toy sized program or routine that is hand optimized and shown to perform better than the optimizing compiler.


I realize you believe what you are saying, which makes me feel bad about downvoting and moving on, so I'll try a more complete response.

First, you are setting up a classic "no true Scotsman". By accepting that "toy" routines can be optimized, you grant a response where (by your definition) if it can be optimized it must be a "toy". Such arguments usually aren't very satisfying to participate in.

But perhaps we can start from there. Assume we have a complex program that relies on a numerical computation that takes 90% of the runtime. While this isn't the norm for a modern GUI program used at home, I'll assert from authority and experience that such programs do exist within the realm of scientific computing.

I'll then go on to assert from authority and experience that you can often optimize such routines with vectorized assembly to run 2x as fast on a particular modern processor. In theory, a compiler could do the same. In practice, they don't --- or at least, they don't unless you are willing to write performance-fragile and non-portable code that's actually more fragile than clear assembly

As for a specific example, how about this one: http://chemfp.com. By using an approach that was possible with assembly but not with C, I was able to give about a 40% overall boost in speed to this program when used for a particular task. I optimized it in assembly, it's not a toy, and it's faster as a result.

It's also true that I was able to get about a 30% boost while sticking with compiler vector intrinsics and a particular version of a particular compiler, but the performance was so fragile to version changes that it seemed safer to go with inline assembly. And the only reason I was able to get this boost was that I already knew the assembly I was trying to generate.

I think instead the right lessons are:

1) It's often possible for an experienced programmer to make things run significantly faster on a specific generation of processors by optimizing in assembly.

2) Blind translation to assembly usually won't achieve much of an improvement, and might even hurt. It requires deeper knowledge of the internals of how processors work, roughly comparable to the level of understanding a compiler designer might need.

3) Outside of special cases where extreme performance is necessary, it's rarely worth pursuing this sort of optimization in one-off programs. Usually, the effort is better spent in libraries and compilers that will allow greater leverage.


OpenBLAS


GEMM is a fairly special case, but the case is unproven as far as I can see. The OpenBLAS kernels are certainly bigger than they need be, but it's not clear to me they satisfy the request (especially addressing "holistic" treatment of registers), and "OpenBLAS" isn't a performance measurement.

Since I'm interested in this, I'd like to see the case to reproduce with an analysis of the performance difference. BLAS developers have made statements about GCC optimization failures (at least vectorization) that aren't correct in my experience with at all recent GCC. BLIS' generic C kernel already runs at ~60% (I forget the exact number) of the speed of the Haswell assembler code for large DGEMM without any tuning attempt, perhaps with GCC extensions. (I didn't check whether the blocking is right for Haswell or pursue an analysis with MAQAO or something.)


Not too long ago while optimizing some x86 code on my i7-4900k, I discovered that at least for this code it was appreciably faster to do

    mov [mem], reg
    div [mem]
rather than simply

    div reg
I'm no asm guru, but I expected the L1 cache to be slower than a register, and certainly not a round-trip via L1 to be several percent faster.

The reason I stumbled upon this was that the compiler emitted the first variant and I tried to optimize it to the second, but lost speed. That's when I decided I'll stick to optimizing my high-level code...

FWIW this div was in a somewhat tight loop, but sadly I've totally forgotten the rest of details. I do recall adding nops here and there to play with loop alignment, it wasn't that.


Modern processors are so complex that changes sometimes have paradoxical consequences, simply in a kind of chaotic way. That seems to be the case in what you describe. Those are typically not robust effects though, and changing seemingly completely unrelated things might put the perf right back at what you expect.


Indeed. Having slept on it, I do recall someone mentioning that my performance discrepancy could be related to busy load/store ports. If the issue is that it doesn't have free reg-reg load/store capacity but reg/mem and mem/reg is otherwise available, then it makes kinda sense I guess.

But in that case the performance could totally change again if just one more instruction is added.


Since the compiler took advantage of it, doesn't it mean that the speedup is a well-known effect in the circumstances the compiler encountered? I'm nitpicking slightly on "paradoxical" and "chaotic". The behavior may be documented but to a human that does not remember every little rule, it may seem random.


This comparison is a bit unreasonable because the Benchmarks Game has a heavy selection effect going on; if the compiler generates bad code for a program, people won't submit it.

There are plenty of instances where compilers go haywire and generate code far worse than even a half-competent novice would, because compilers don't know the same things about your code that you do.


> if the compiler generates bad code for a program, people won't submit it

Seems just to be speculation.

Might we just as easily speculate that, if the compiler generates bad code for a program, it still looks a lot better than the comparison interpreters, so people will submit it anyway?


@fallat:

I started to take a look at your problem, but I wasn't able to run the code that you posted to test.

For the C code, the "#include" statements at the top are missing their <header.h>. Easy to solve now that you posted the link to the original, but maybe you could fix?

For the assembly, you are relying on "list.macro", which I couldn't find. Could you post this, or even better link to a repository that can be used?

Separately, it would probably be helpful to post the exact model number of your processor. Speed doesn't (shouldn't?) really matter, but the "generation" is essential.

Also, I don't think you mention the number of iterations you are running for your timing. [I see now in the assembly that you seem to be using 5000000?]

Post a bit more info, and someone here (maybe me, maybe not) will figure out what's causing the slowdown.


Oh whoops, I'll add it in. EDIT: It's a damn blogger encoding error.

Yep, it's 5 million iterations.

I will update the post with list.macros immediately!

This is my CPU: Intel(R) Pentium(R) CPU N3700 @ 1.60GHz

Edit: it is updated. :) list.macros has been inlined in the snippet.


Great, I'll take a look in a bit, although it might take me until tomorrow to have time to do much with it.

In the meantime, I'll mention that my first quick discovery is that clang seems to be significantly faster than gcc on the standard C code. The ratio changes with different versions and compilation options, but on Skylake with "-Ofast -march=native" I find clang-6.0 to be almost twice as fast as gcc-8. So if you have clang installed, check and see if it might be a better baseline.

Also, what system are you running? This shouldn't make a difference with execution speed, but will make it easier to make tool suggestions. If you are running some sort of Linux, now would be a good time to get familiar with 'perf record'!

Edit:

> Intel(R) Pentium(R) CPU N3700 @ 1.60GHz

Hmm, that's a "Braswell" part, which unfortunately isn't covered in Agner's standard guide to instruction timings (https://www.agner.org/optimize/microarchitecture.pdf), and I'm not familiar with its characteristics. This might make profiling a little more approximate.


Debian buster. And if you're going to make a comparison (just saying this in case), you must force it to only use SSE3 and no later. :)

I added the build instruction for the assembly but I'll add it here too: gcc nbodies.s -no-pie -o nbodies.out

Shoot me an email if you get around to it! :D

I'll check out perf record.


OK, I'm now able to test your code!

The preliminary results I get disagree with what you are seeing. I'm not sure if this is my error, your error, or just genuine differences between processors. Specifically, on Skylake, I get your assembly to be much faster than GCC, although slightly slower than Clang. And that's without trying to use options to limit the compiler:

  gcc-8 -Ofast -march=native bodies.c -Wall -lm -o bodies_gcc_Ofast_native_c
  perf stat bodies_gcc_Ofast_native_c 5000000
  2,031,645,589 cycles       #    3.691 GHz
  2,181,200,813 instructions #    1.07  insns per cycle

  gcc bodies.S -o bodies_S
  perf stat bodies_S
  1,293,433,853 cycles         #    3.691 GHz
  2,641,011,827 instructions   #    2.04  insns per cycle

  clang-6.0 -Ofast -march=native bodies.c -Wall -lm -o bodies_clang_Ofast_native_c
  perf stat bodies_clang_Ofast_native_c 5000000
  1,158,569,067 cycles       #    3.691 GHz
  2,331,116,659 instructions #    2.01  insns per cycle
Not having understood the assembly yet, my guess from skimming is that the compiler isn't able to vectorize this code well, and thus the SSE/AVX distinction isn't going to matter much. "-Ofast" should be comparable to "-O3" here, although I don't recall the exact differences. I didn't use "-no-pie" with your assembly, but don't think this matters.

Are you able to do a similar comparison and report results with "perf stat"? Cycles and "instructions per cycle" are going to be better metrics to compare than clock time. No hurry. I'm East Coast US, and done for the night.

Edit: A quick glance at "perf record" and "perf report" suggests that clang is slightly faster than you (on Skylake, using these options) because it's making use of fused multiply-adds. Which slightly contradicts what I said about the SSE/AVX distinction not mattering, although it's only a minor effect. For both routines, the majority of the time is spent in the chain starting with the division. I'm wondering if there is some major difference in architecture with Braswell --- perhaps it only has a single floating point multiplication unit or something? Or one of the operations is relatively _much_ slower.

Edit 2: Looking at https://en.wikichip.org/wiki/intel/cores/braswell, I now see that Braswell is the name of the "system on a chip", and the CPU microarchitecture is Airmont. This isn't in Agner either, but at least I've heard of it! https://en.wikichip.org/wiki/intel/microarchitectures/airmon... says Airmont is basically the same as Silvermont, which Agner does cover, so we should be able to figure out timings. See page 320 here: https://www.agner.org/optimize/instruction_tables.pdf.

Edit 3: I hadn't actually looked at your code yet. So you are trying to vectorize, it's just that you are limited in doing so because you can only do 2 doubles at a time. But since you are working in 3D, this doesn't fit evenly, so you do [x,y] then [z,null]. Thus you expected it to be something like 1.5x faster on your processor, and instead it comes out 2x slower. I think the issue might be that your processor doesn't really have full vector units --- instead it's doing some sort of emulation. Look closely at the timings on Page 338 of the instruction table I linked in Edit 2. Note that DIVPD takes just about twice as long as DIVSD. Then check Skylake on Page 252 -- same time for packed and single XMM. While this isn't the full answer, I think it's a strong hint at the issue. The quickest fix (if you actually want to make this faster on your processor) is to use the single instruction for the [z,null] case. This isn't going to help much overall, since it takes twice as long for the packed, but it might at least get them back to parity with the compiler! If you actually want higher speed from your vectorization, you may have to switch to a processor that has better vectorized speeds.
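For reference, the fix suggested in Edit 3 could look roughly like this with SSE2 intrinsics (a hypothetical sketch, untested on Airmont/Braswell): keep the packed divide for the [x,y] lanes but use the scalar form for the lone z lane.

  #include <emmintrin.h>   /* SSE2, so well within the SSE3 constraint */

  /* xy = [x, y], z = [z, unused], d = [d, d] */
  static inline void div3(__m128d xy, __m128d z, __m128d d,
                          __m128d *out_xy, __m128d *out_z) {
      *out_xy = _mm_div_pd(xy, d);   /* packed: both lanes divided */
      *out_z  = _mm_div_sd(z, d);    /* scalar: only the low lane; the upper lane of z is passed through */
  }

On a core where DIVPD costs roughly twice DIVSD, this avoids paying the packed price for a lane whose result is thrown away.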


I just want to say: wow! You are showing me something I bet many others are really not aware of: degraded SSE performance in these types of processors! Your DIVSD vs DIVPD comment makes a lot of sense too. Man I feel this HN thread has been just a gold mine of this kind of information.

What is the speed with restriction to SSE3? That would finalize the tests.

Do you mind if I directly quote you in a follow up post? This is really good stuff.


> What is the speed with restriction to SSE3? That would finalize the tests.

I think this is the right incantation to restrict to SSE3:

  clang-6.0 -O3 bodies.c -msse3 -Wall -lm -o bodies_clang_O3_sse3_c
  perf stat bodies_clang_O3_sse3_c 5000000
  1,470,870,787 cycles       #    3.691 GHz
  3,391,153,749 instructions #    2.31  insns per cycle

  gcc-8 -O3 bodies.c -msse3 -Wall -lm -o bodies_gcc_O3_sse3_c
  perf stat bodies_gcc_O3_sse3_c 5000000
  2,256,550,525 cycles       #    3.691 GHz
  3,306,361,186 instructions #    1.47  insns per cycle
It would be interesting to analyze the difference between GCC and Clang here. Just glancing without comprehension, it looks like one big difference is that Clang might be calling out to a different square root routine rather than using the assembly builtin. Hmm, although it makes me wonder if maybe that library routine is using some more advanced instruction set?

> Do you mind if I directly quote you in a follow up post?

Sure, but realize that I'm speculating here. Essentially, I'm an expert (or at least was a couple years ago) on integer vector operations on modern Intel server processors. But I'm not familiar with their consumer models, and I'm much less fluent in floating point. The result is that I know where to look for answers, but don't actually know them off hand. So quote the primary sources instead when you can.

Please do write a followup, and send me an email when you post it. My address is in my HN profile (click on username). Also, I might be able to provide you remote access to testing machines if it would help you test.


OK, I might have a lead on the GCC versus Clang gap. At the least, this obscure command line option closes the gap:

  gcc-8 -fno-math-errno -O3 bodies.c -msse3 -Wall -lm -o bodies_gcc_O3_sse3_c
  perf stat bodies_gcc_O3_sse3_c 5000000
  1,398,805,282    cycles       #    3.691 GHz
     3,181,041,562 instructions #    2.27  insns per cycle
I haven't thought about it much, but more info here: https://stackoverflow.com/questions/37117809/why-cant-gcc-op.... While the question is about C++, it hints that the spec might require sqrt() to set errno if given a negative number. Clang special cases this by adding a never taken branch, but gcc does not (unless told it doesn't need to care).
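To make the effect concrete (a hypothetical snippet in the same spirit, not the benchmark's exact source): under the default rules gcc has to preserve the possibility that sqrt() sets errno, which keeps it from reducing the call to a bare sqrtsd in some cases.

  #include <math.h>

  double inv_dist(double dx, double dy, double dz) {
      double d2 = dx * dx + dy * dy + dz * dz;
      return 1.0 / (d2 * sqrt(d2));
  }
  /* gcc -O3:                  must keep a path that calls the library sqrt()
                               so errno can be set for negative input
     gcc -O3 -fno-math-errno:  free to emit just the sqrtsd instruction   */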


I would compile the C code with the -S option to output your C code as assembler, and then compare them to see what decisions were made by the (pre)compiler that made the compiled C code run faster than your hand-wrung code. The code is short enough that you should be able to relate the different parts and easily see the "improvements".

This is how I learned to program assembler for Linux. I already knew TASM / MASM / a86 assembler languages for DOS/Windows - all similar but different in their own regards - but gas is a different un-backward beast, and of course the system calls are totally different between Linux and DOS/Windows.

For example, here are 6 different ways to program helloworld on an ARM cpu (Raspberry Pi) running Raspbian Linux: https://github.com/ksaj/helloworld I kept each of them as close to the same as possible, so that it is easy enough to see the differences in how each call is set up before execution.

I'm sure if you added timer code before and after the example code, you'd very quickly discover which methods are more optimal and then begin to dissect why that would be, and how to use that knowledge in future programs.

Keep in mind that the order of opcodes will impact execution speed, because of the many techniques modern cpus use to speed things up (branching, prediction, encoding techniques, entirely rewriting opcodes to something else with the same functionality - eg, most compilers understand that mov ax,0, sub ax,ax and xor ax,ax all do the same thing), etc.
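A minimal timing harness of the kind suggested above might look like this (a generic sketch; POSIX clock_gettime assumed, not tied to the linked examples):

  #include <stdio.h>
  #include <time.h>

  static double now_seconds(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

  int main(void) {
      double t0 = now_seconds();
      /* ... code under test, repeated enough times to dominate the overhead ... */
      double t1 = now_seconds();
      printf("elapsed: %.6f s\n", t1 - t0);
      return 0;
  }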


I have a good real-world counterexample: the game RollerCoaster Tycoon was written in assembly and ran at full speed on a Pentium 1.

OpenRCT2 is a code conversion to C++. You can turn off all the visual tweaks they have implemented on top of the original game and it still requires significantly more CPU time than the original. And that's with a compiler being able to take advantage of modern CPU instructions, while the original game can't do that.


Maybe the slowness is more due to poor high-level design for the problem (if you want a highly optimized result)? E.g. AoS everywhere instead of SoA where needed. I don't know, I have not checked the source code, but it is a possibility.


I came to the same result three years ago by translating the exact same problem into x86-64 asm. I guess I didn't think of it as something worth sharing.

https://github.com/franciscouzo/nbody-asm


Unfortunately the links in the README file in that repo no longer resolve (https://benchmarksgame.alioth.debian.org/)

Might you or anyone else have the updated link(s) to the nbody problem?


https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Looks like the subdomain was changed at some point.


All this usually proves is that compilers are smarter than you think.

You can get some gains from SIMD code (i.e. autovectorization is hard), e.g. when you've laid out data deliberately but the compiler doesn't quite see it, but modern CPUs are so complicated even executing scalar instructions that I wouldn't bother. I think optimizing memory is more productive half the time anyway; most programs don't spend that much time number crunching.


This program is a special case because the whole state fits in registers; unless the compiler detects this, a human should be able to beat the compiler. Even unrolled, the code would fit in the L1 instruction cache.


Some optimisations can only be performed after the fact (inasmuch as a profile of a real run against real data informs branch prediction and the choices in speculative execution and ordering).

Some execution fails: hardware has pathological instruction sequences which force suboptimal choices to be made.

Alternative algorithms cannot always be identified from an instruction sequence. What if a higher order function identified better seeds and led to faster detection in a large field of candidate solutions?


What is the best book for learning the nuances of x86 assembly?


It is now about 20 years old, but you can start with "Zen of Assembly Language" and carry on with "Graphics Programming Black Book".

https://github.com/jagregory/abrash-zen-of-asm

https://github.com/jagregory/abrash-black-book


I recently had some fun writing arm64 assembly. My experience is that hand-written assembly is faster if you apply the same cheats that the compiler uses. Three things to keep in mind: 1) make use of the registers; 2) avoid branches; 3) get familiar with the instruction set and make good use of it.

https://litchie.com/2020/03/arm64-fib
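As a small example of point 2 ("avoid branches"), written in C rather than arm64 for brevity (an assumed illustration, not code from the post): a branch-free select compiles to csel on arm64 or cmov on x86, so there is nothing for the branch predictor to miss.

  /* branchless max: no conditional jump in the generated code */
  long max_branchless(long a, long b) {
      long mask = -(long)(a < b);        /* all ones if a < b, else 0 */
      return (a & ~mask) | (b & mask);
  }

(A plain `a < b ? b : a` usually gets the same treatment from the compiler; the manual version just makes the intent explicit.)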


I once found some book on MC68000 programming on the Macintosh. In a chapter on graphics, the book presented an assembly routine for drawing a filled circle and bragged about how machine code gets it down to just .25 seconds. I went, "what???" and read the code: it was calling an integer square root subroutine for every scan line of the circle. Not even a good integer square root routine.


I implemented the circle/ellipse drawing code in Microsoft Windows way back. Bresenham's algorithm is the way to go. The line drawing algorithm is better-known, but it can be extended to a circle. The idea is you calculate the coordinates incrementally, keeping track of the pixel error for each line. You scale the error so everything is simple integer arithmetic, no floating point or square roots needed.
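In integer-only form the circle walk looks roughly like this (a sketch of the midpoint/Bresenham idea, not the Windows code; plot() is a hypothetical pixel-setting callback):

  void draw_circle(int cx, int cy, int r, void (*plot)(int x, int y)) {
      int x = r, y = 0;
      int err = 1 - r;                      /* scaled decision variable */
      while (x >= y) {
          /* one step covers all eight symmetric octants */
          plot(cx + x, cy + y);  plot(cx - x, cy + y);
          plot(cx + x, cy - y);  plot(cx - x, cy - y);
          plot(cx + y, cy + x);  plot(cx - y, cy + x);
          plot(cx + y, cy - x);  plot(cx - y, cy - x);
          y++;
          if (err < 0) {
              err += 2 * y + 1;             /* midpoint inside: keep x */
          } else {
              x--;
              err += 2 * (y - x) + 1;       /* midpoint outside: step x inward */
          }
      }
  }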


Wikipedia article for the curious:

https://en.wikipedia.org/wiki/Midpoint_circle_algorithm

The non-English articles have code.



Interesting, but it left me wondering what his fast routine for drawing rounded rects was. The squarical formula?


Here’s my guess: the Mac’s drawing routines all did clipping in software. That was accomplished by having drawing of rectangles, circles, ovals, polygons and rounded rectangles (for various operations such as erasing or filling them, drawing a frame along the outside, inverting the bits inside) all come down to:

- compute the set of bits (called a ‘region’) to operate on

- call the appropriate function to erase/fill/… that region

I would guess drawing rounded rectangles was done this way:

- create a region for a circle with radius equal to that of the radius of the corners.

- insert horizontal parts to each row of bits in the region to ‘stretch’ the region horizontally into a rounded rectangle that has the correct width, but is only as high as the circle.

- insert vertical parts to each column of bits in the region to ‘stretch’ the region vertically into a rounded rectangle that has the correct width and height.

The first step was identical to the code for drawing circles; the region data structure made the last two operations cheap; they did not require any memory allocations. You had to walk the entire data structure, but for small corner radii, it wasn't that large. Also, one could probably optimize for speed by doing that while creating the region for the circle.

I can’t find a good description of the region data structure, but https://www.folklore.org/StoryView.py?project=Macintosh&stor... might be enough for some to figure out how it, conceptually, worked.


My guess is they didn't even need to be fast, just render once when the window is configured and keep a cache of rounded corners in the window system complete with the masks. They're tiny, and symmetrical.


Computing once when the window is created or resized is what happened, yes, but that doesn't work when using them for buttons (unlike MS Windows, where every control is a 'Window', on Mac OS controls inside a window didn't have their own drawing context), or when using them for window content (e.g. in a drawing program).

And remember: the original Mac had about 28 kilobytes of RAM free for applications. The system unloaded icons, code and fonts that were available on disk all the time. Few objects were ‘tiny’ at the time.


Probably the fastest way to calculate it is the Bresenham circle algorithm. But if there are a lot of roundrects with the same few small corner sizes (likely in a window system), a table for those sizes might be better.


I always wonder about those round rects.


That makes no sense, considering how highly optimized QuickDraw was.


Yes; well that made no sense, considering anything.


“I haven't looked particularly close at it but I'm sure someone could spot why mine is almost 2x as slow.”

Isn’t that the whole point?


Maybe aligning functions would help. The compiler likely does that for the C version.


Interesting, but personally I'd be more interested in an "X programming language with GC translated to a C program, and the C version was slower" type of article.


Compiler beats novice assembly programmer ¯\_(ツ)_/¯


Maybe I missed it - but did the author list compiler options? I wonder if the handwritten assembler was slower than a compile with optimizations off, too.


From the article: “The gcc options for the C code used was -O3 -lm, because my processor doesn't support AVX“.


One thing I was told in college is that compilers are much better at assembler than people.


"Let me quickly set the scene for when I started: I had zero exposure to x86_64 before this."

So, to sum it up: a guy who has no clue writes worse code than that generated by a decent compiler.


> The saying that "processors are optimizing for C" is totally correct.

I thought that the latest consensus on HN was that C is nowhere close to the hardware.


C is close to the interface the hardware presents you – that is, the instruction set – but far from what the hardware actually does.

Caching, out-of-order execution, branch prediction, speculation... all of those things are just as opaque to assembly as they are to C.


> C is close to the interface the hardware presents you – that is, the instruction set – but far from what the hardware actually does.

C isn't even close to that anymore, now that so many processors have SIMD ISAs which C can't usefully model.


Two things are both true.

1. Processors no longer look all that much like C's idea of what a processor is.

2. Processors are optimized to be able to execute C code very well. But an optimizing compiler can make it much better.

There's a fair amount of magic in compilers to make this work.


Any good optimizing compiler is able to achieve this; there is nothing special about C here.



