
I don't mean this as a slight because I love the project and other light fast OSes... but with modern compilers, what does assembly get you over just using C or another language and having it handle the translation? I was under the impression compilers nowadays have lots of optimizations that would take a lot of work for a human to do by hand, as well as creating less readable/maintainable code.



> but with modern compilers, what does assembly get you over just using C or another language [...] I was under the impression compilers nowadays have lots of optimizations that would take a lot of work for a human to do by hand

dav1d, the open source AV1 decoder, now has more asm than C code. It's one of the most recent open source projects with significant asm work ongoing.

The asm version outperforms the C version (full optimizations enabled) by 4,5x on AVX2.

We have similar results in SSSE3 (3,5x) and ARM64 (4x).

We're not talking about a few percents, we're talking about multiple times faster.

And AV1 is a standard, so there are no algo shortcuts that can be made: it is either compliant or it is not.

Sure, in most cases, it is not needed to write asm; but there are cases, notably for multimedia, where writing asm by hand is a lot faster, and that includes codecs and game engines.


Well, video decoding is very well suited to ASM, I'd say. You can use vectorisation, you can optimize pipelined execution, etc. These are things where compilers may not be so good (my experience is a bit dated). You also optimize only a few tight loops, so you can really invest time in them with good ROI.

I'm not sure that operating systems are so suited to ASM optimization (in the sense that you may not reap so many benefits). Maybe one could optimize for size (so you can make super tiny OS) ?


I think one element of writing ASM by hand is the programmer subliminally changes the algorithm to be 'simpler' to write.

In Java, that might involve adding more classes and abstractions to make things conceptually simpler. In C, that might involve using structs to keep relevant data together. In C++, it might involve using a std::set to keep a list of 3 constants, Etc.

It turns out that when you have to write the asm by hand and you see that double pointer dereference is a pain, you avoid data structures with double dereferences (as C data structures often end up with). You avoid classes and abstractions (as are common in Java), because they involve a lot of boilerplate. You use compile time constants for that set of three things rather than any complex hashing that C++ would do, etc.

There are lots of small things like this, and it turns out the effect adds up significantly.
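
To make the double-dereference point concrete, here is a minimal C sketch. The `particle` struct and both function names are made up for illustration. The first function walks an array of pointers (two loads per element), the second walks one contiguous array (one load per element). It says nothing about how large the difference is on any given machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type, used only for this illustration. */
struct particle {
    float x, y;
    int alive;
};

/* Array of pointers: each element costs a pointer load plus a field load. */
int count_alive_indirect(struct particle *const *items, size_t n) {
    assert(items != NULL || n == 0);
    int count = 0;
    for (size_t i = 0; i < n; i++)
        count += items[i]->alive != 0;
    return count;
}

/* Contiguous array: one load per element, sequential memory access. */
int count_alive_flat(const struct particle *items, size_t n) {
    assert(items != NULL || n == 0);
    int count = 0;
    for (size_t i = 0; i < n; i++)
        count += items[i].alive != 0;
    return count;
}
```

Both functions return the same count; only the memory layout they expect differs.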


>> I think one element of writing ASM by hand is the programmer subliminally changes the algorithm to be 'simpler' to write.

I strongly disagree :-)

My experience is that optimizing assembler leads you to reorganize your code, or even your algorithm, into a form that your CPU will be most efficient at executing, which almost always translates to unbelievably intricate, super hard to modify code. I've done stuff on 6502, 80x86 and MMX, and the assembly parts, optimized for speed, ended up as impossibly tricky code. Worse, sometimes I have to adapt my data structures to allow optimization, which becomes even tougher. So I prefer to leave assembly code at the "only if necessary" level.

But now, to be honest, optimizing assembly code is incredibly satisfying to me :-) I get the feeling of using a CPU to its maximum capacity. Also, using assembly comes after thorough algorithmic study. So once I'm at the ASM level, I've maxed out my own capabilities ! How happy I am !

So if you have the chance to do that, just give it a try !


I suspect that this occurs for you exactly because you're aiming for optimization. This can happen in any language, really.

I'd assume that when you maintain larger codebases (i.e. full OS and application suite), that you start writing practical and maintainable code instead.


ahhh I'm biased. Your analysis is right :-)

But even so, I'd say that ASM doesn't help. Clumsy code in ASM is worse than clumsy code in high-level language :-)


If you write assembler in a maintainable way, do you still gain lots of benefits over a compiler?


>I think one element of writing ASM by hand is the programmer subliminally changes the algorithm to be 'simpler' to write.

It's not subliminal. It's packing a backpack that someone else has to carry, vs packing a backpack that _you're_ going to carry.


> I'm not sure that operating systems are so suited to ASM optimization

I do not know, to be honest.

I was just pointing out that multimedia and game engines can benefit a lot from hand-written ASM.


I’m wondering if the time spent writing asm for av1 could also be spent writing an optimising pass for the C compiler? That way any optimisations could be applied by the compiler while the readability of the C language is retained.


The compiler knows nothing about your program and its data layout, which is where all of that speedup comes from.

There are some SIMD constructs that you can efficiently express in C by writing intrinsics, but that isn't substantially different from writing ASM.

Then there are some trivial cases where a loop can be unrolled and packed into SIMD instructions automagically, which retains readability, but greatly limits what you can write. You'll need to read the generated machine code to make sure you didn't mess up.

The benefit of just writing ASM is to not have such a translation layer between you and the processor. That same code that you carefully wrote to make it through the optimization passes for one compiler will likely get messed up in another compiler.
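
For reference, this is roughly what one of those "trivial" auto-vectorizable loops looks like. It is a sketch, not code from dav1d: `avg_u8` is a hypothetical name, and the rounded byte average is just a typical multimedia kernel. Whether a given compiler actually turns it into SIMD instructions depends on the compiler, the target and the flags, so the generated code still has to be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounded average of two byte arrays: dst[i] = (a[i] + b[i] + 1) / 2.
 * The restrict qualifiers promise that the arrays do not overlap, which
 * removes one common obstacle to auto-vectorization. */
void avg_u8(uint8_t *restrict dst, const uint8_t *restrict a,
            const uint8_t *restrict b, size_t n) {
    assert(dst != a && dst != b);
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

The loop body has no branches and no dependency between iterations, which is the shape compilers handle best. Anything with data-dependent control flow or shuffles across lanes tends to fall outside it.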


> I’m wondering if the time spent writing asm for av1 could also be spent writing an optimising pass for the C compiler?

Frankly, no.

Improving the compiler could gain a few dozen percent, which is huge already. But 300%+, no.


You mean like intrinsics? Sure.


> Maybe one could optimize for size (so you can make super tiny OS) ?

Yes, and remember... optimizing for code size is optimizing for speed. =)

Smaller code and data = more CPU cache hits, which are orders of magnitude faster than fetching from RAM. So even if "all" you do is make code smaller, you can get more speed...


> Yes, and remember... optimizing for code size is optimizing for speed. =)

Not always. I think loop unrolling is a common perf-optimisation technique. Even if the perf gain is positive in this case, I highly suspect that it doesn't outweigh the cost of maintaining asm code (vs C/other higher level code).

edit: formatting


Yep. I remember loop unrolling the skeletal animation code in libgdx way back resulted in a significant performance increase. And that was in Java
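
A minimal C sketch of manual loop unrolling, for anyone who hasn't seen it written out. This is not the libgdx code; `sum_unrolled4` is a hypothetical name, and whether it beats the plain loop depends on the compiler and CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums n values, four per iteration, using independent accumulators.
 * Unsigned arithmetic wraps, so the result matches the simple loop exactly. */
uint32_t sum_unrolled4(const uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)  /* leftover 0..3 elements */
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}
```

The unrolled body is several times larger than the plain loop, which is exactly the size-versus-speed trade-off being argued about here.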


More so inlining. Much more than loop unrolling.


Optimizing for size is a local maximum in performance.


Less code doesn't imply faster code. GNU coreutils have blasted through OpenBSD coreutils for years.

Now we have CPUs with 256KB instruction cache and up. You can compile your whole OS at insane optimizations.


There are some LLVM optimization talks that show otherwise, where longer vectorized code actually runs faster than the smaller version, though.


...in microbenchmarks.

That's what created multi-kilobyte memcpy() implementations, which barely beat REP MOVSB but cause huge icache bloat.


Compare a 512 byte boot sector game [1] to a ~1Gigabyte unity 'hello world'.

Until the day the compiler can smartly trim down all the unity framework to just 512 bytes because it notices I'm not using most of it, hand coding in asm will always work out smaller and faster, if one puts in enough effort.

[1]: https://www.youtube.com/watch?v=1UzTf0Qo37A


Unity isn't.. what? And the problem there is using an ill-suited framework, not the language.

If you compile hello world in C, it's not going to be very big either.

> if one puts in enough effort

"enough effort" is a cheat. I can replace almost any program with a smaller javascript version, if I put in "enough effort". It's not a reasonable way to compare anything.


>If you compile hello world in C, it's not going to be very big either.

You should actually try that, and compare sizes with the asm hello world.


The problem is not in the language per se, but in the linker settings. By tweaking them, you can achieve almost the same size as the assembler version.


You say that, but I still haven't seen any compilers link in only parts of libc. Use any of it and it all comes in.

Hello world in C becomes anywhere from 120KB to 1.2MB even if just using puts. Even setting the entry point to a function with no arguments doesn't help.

A tiny hello world ends up needing to ignore the standard library and use a console output function from the OS.


Just checked on an Ubuntu machine. Hello world takes 6 KB. Uses printf.


That's because it links dynamically to the standard library.


Is that a problem? It's there whether you use it or not, and it's not like the asm is running in a vacuum either.

If you go to an embedded point of view where there is nothing already supplied, you can get a hello world, even with a basic printf, down to that size.


Especially since you use external libraries to do the bulk of the work, 6KB is a lot.

It is an order of magnitude bigger than the assembly version.


After some manipulations with the linker, I was able to cut the size down to 750 bytes. You cannot go below this, without hacking the ELF file format.


I got my asm one down to 268 without even trying.

It prints (via write syscall) "hello world.\n" and then calls exit(0).

Assembled with fasm, linked with ld -n, then stripped.
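
For comparison, the C side of "print via the write syscall" looks roughly like this. It is a sketch under assumptions: `write_hello` is a hypothetical name, it uses the POSIX `write` wrapper from libc rather than a raw syscall, and getting a small binary out of it would still require dropping the normal C startup code, which is not shown.

```c
#include <assert.h>
#include <unistd.h>

/* Writes the fixed message to fd with a single write(2) call.
 * Returns the number of bytes written, or -1 on error. */
long write_hello(int fd) {
    static const char msg[] = "hello world.\n";
    assert(fd >= 0);
    return (long)write(fd, msg, sizeof msg - 1);  /* exclude the trailing NUL */
}
```

Calling `write_hello(1)` sends the 13-byte message to standard output. The function itself compiles to a handful of instructions; the size differences discussed in this thread come from what gets linked around it.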


Ok fine, I've used the tool called sstrip from a suite called ELFKickers. Now my hello_world.cpp (yes, c++) is 380 bytes long. I could not make it any shorter, so yeah, it is 112 bytes (about 42%) bigger than yours.


Regarding sstrip, I am familiar with it. Be careful that you know what it actually does.


So, not order of magnitude then?


Yes, just look into how much effort you had to put vs my five minutes attempt.

The difference is between just using asm, or going out of the way to try to coerce a C toolchain to minimize the bloat.


Yea, large effort (objcopy, strip, sstrip) that I need to do only once; after that I already know how to do this and can just put it in a shell script. Now, I would like to know how quickly you'll be able to write a quicksort implementation that would outperform my C code while being at least 25% smaller. Yeah, and how big your binary will become once you link libc or opengl or whatever to make it able to do actual work.


95KB using MinGW's GCC on Windows using default compilation.

When I tell it to emit the assembly with "-S" I get 28 lines of code, though I have a feeling hand-written assembly would be much shorter.


I agree that Unity is bloated and should take way less constant storage, but that's an invalid comparison. They are not the same domains, they don't use the same APIs, they don't run on the same environment, they don't talk to the same hardware, they don't have the same functionality, etc.


All of the performance wins you’re referencing come from vectorization, you typically can’t vectorize OS code since it isn’t ALU bound, which makes your point moot.


> you typically can’t vectorize OS code since it isn’t ALU bound

The answer and parent were in no way only talking about OS.


The parent was definitely talking about the applicability of using assembly in the context of this project. Vectorization is irrelevant here.


> by 4,5x on AVX2

I would expect a much smaller gap, as AVX2 is also available through intrinsics in C ?!


When using Clang or GCC (possibly other compilers, too), you don’t need to use intrinsics. These compilers have some support for vector types (https://clang.llvm.org/docs/LanguageExtensions.html#vectors-..., https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html)

One can, for example, write:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si a, b, c;
  […]
  b += 3; // add 3 to each of b’s elements
  c = a + b; // pairwise addition
Using this gives up some control over assembly; it doesn’t guarantee vector instructions get used, but it makes it easier for the compiler to generate them.

Also, I’m fairly sure there are vector instructions not covered by this extension.
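
A self-contained version of the snippet above, as a sketch that assumes GCC or Clang (the `vector_size` attribute is a compiler extension, not standard C). The function name `add3_then_sum` is made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Four 32-bit ints in one 16-byte vector (GCC/Clang extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Computes out[i] = a[i] + (b[i] + 3) for i = 0..3 using vector types. */
void add3_then_sum(const int a[4], const int b[4], int out[4]) {
    assert(sizeof(v4si) == 4 * sizeof(int));
    v4si va, vb, vc;
    memcpy(&va, a, sizeof va);   /* load without alignment assumptions */
    memcpy(&vb, b, sizeof vb);
    vb += 3;                     /* add 3 to each of vb's elements */
    vc = va + vb;                /* element-wise addition */
    memcpy(out, &vc, sizeof vc);
}
```

With `a = {1, 2, 3, 4}` and `b = {10, 20, 30, 40}` this yields `{14, 25, 36, 47}`. As noted above, nothing here forces the compiler to emit vector instructions.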


CPU intrinsics are just another way of doing inline Assembly.


IANA C/C++/ASM expert, but one thing I've been thinking of would greatly benefit from type-punning (overlaying 2 kinds of data and interpreting one as the other) because it can use a 64-bit register to parallelise 4 ops at once where speed really, really matters.

That's trivial in ASM I believe, but don't know how well it's possible in C/C++, if at all.


I fail to see why not just use a __m64 variable and do the conversion.

https://software.intel.com/sites/landingpage/IntrinsicsGuide...

You can also try to make a union whose one field is a __m64, but beware not to trigger UB conditions.


Type punning in general would be somewhat difficult to do here in a way that vectorized but was still defined by the standard. You’d need to put a branch with undefined behavior in it when alignment wasn’t satisfied, then do a memcpy into a vector type, and hope that the compiler understands what you’re doing and doesn’t deoptimize it.
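
One way to get "4 ops at once in a 64-bit register" in standard C is the classic SWAR trick, sketched below. This is an assumption about what the question is after, not the poster's actual use case, and the helper names are hypothetical. `memcpy` does the type punning without undefined behavior, and compilers typically reduce it to a plain register move.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterprets four 16-bit lanes as one 64-bit word (defined behavior). */
uint64_t pack4(const uint16_t lanes[4]) {
    assert(lanes != NULL);
    uint64_t w;
    memcpy(&w, lanes, sizeof w);
    return w;
}

/* Inverse of pack4. */
void unpack4(uint64_t w, uint16_t lanes[4]) {
    assert(lanes != NULL);
    memcpy(lanes, &w, sizeof w);
}

/* Adds four 16-bit lanes at once, each wrapping modulo 2^16.
 * The top bit of every lane is masked off before the add so that no
 * carry crosses into the neighbouring lane, then restored with XOR. */
uint64_t add4x16(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8000800080008000ULL;
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}
```

Because packing and unpacking use the same byte order, the result does not depend on endianness: lane i of the sum is always lane i of `a` plus lane i of `b`.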


False or vacuously true


That is still writing asm. You need to write it, the compiler cannot do it by itself.


Video coding is one of the few outliers where compilers are not very good at the kind of optimisations that really help. Any experience with that kind of code does not generalise to other types of code.


How much would something like Halide help with writing optimized code for AV1? It's originally designed for image manipulation, but I imagine encoding/decoding to be a little bit more demanding of expressiveness.

https://halide-lang.org/


When I worked with Halide, on a thing that is relatively close to the image decoding (transform a series of angle-amplitude pairs (radioastronomy) to the image, a lot of sprite-with-opaqueness painting), I was perplexed to find out how hard it is to work in Halide with non-constant offsets. The functionality was essentially non-existent back then (2015, I assume).

In fact, any scatter-gather operations were non-existent in Halide.

From what I remember these operations were introduced at some point, but we moved away from Halide.

Also, it was not quite simple to transform a loop that draws sprites over the entire image into a loop that draw sprites over part of image and draws many parts of image in parallel (change nesting). Hand-written CUDA version of the algorithm ended up with exactly that.

Thus, if you need some partial-derivatives-numerical-kernel, Halide is good for you. If you are working on the video decoding, Halide is not that good for you. If you are working on video encoding, Halide will be more of a nuisance than a helping hand (early exits from loops, computable access ranges, etc).


That is all very interesting. If you were drawing sprites, why not use straight openGL?

Do you think it would have worked to organize tiles and threads outside of halide and use halide for isolated parts that are already organized into arrays?


We would like to be relatively target-agnostic. OpenGL could be one of the targets, but not the only one. We also would like to work on regular and/or GPU-equipped cluster machines, etc.

On the suggestion in your second part: why use Halide then? Should it be responsibility of Halide to work out the best loop nesting and best use of threads?

Again, Halide was put aside and we used CUDA for final version, exactly because of inability of Halide to do good work in our case.


> Should it be responsibility of Halide to work out the best loop nesting and best use of threads?

I don't know about "should", but it seems to me that it would still be valuable, even if you have to work out the threading and the organization of the data into an array yourself.

> Again, Halide was put aside and we used CUDA for final version, exactly because of inability of Halide to do good work in our case

I didn't say anything about that. I'm not sure why you are restating it.


I suppose Halide was mainly involved for this part:

> transform a series of angle-amplitude pairs (radioastronomy) to the image


> We're not talking about a few percents, we're talking about multiple times faster.

I wonder how long it will take for compilers to close that gap. I assume compilers will eventually produce assembly code that outperforms anything written by a human.


These aren't necessarily the typical use cases that compiler writers are trying to optimize, either. So, compilers for general purpose languages may never close that gap.


Have you seen a non asm low level language that could fit hardware better than C ? Not starting a flamewar, just that it seems that you're in a context where you may have reviewed things we might not know of.


I hear tell that a lot of the places where modern Fortran shows up are contexts where faster-than-C performance is desirable.


A company I used to work for has spent 30 years trying to migrate FORTRAN code to C. I was not involved in the project, so I don't know the details, but the gist of what I was told is that there are edge cases where FORTRAN stomps C in every performance metric. So you can make decent progress for a while, but you'll eventually hit a major roadblock that halts progress.


I think of the BLAS algos as being very Fortran friendly, and the Fortran references never _out_perform the C implementations. (The asm implementations are of course the best.)

Be curious to know what you're thinking of here.


I've no personal experience, but people historically point to pointer aliasing as the primary thing that speeds up Fortran code vs C.

https://en.wikipedia.org/wiki/Pointer_aliasing


No personal experience either but discussing with people invested in high perf C, they said that `restrict` was enough to avoid aliasing issues.
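
For readers who haven't used it, this is what `restrict` looks like in practice. A minimal sketch with a hypothetical function name: the qualifier is a promise from the programmer that the two arrays do not overlap, which is the guarantee Fortran gives for array arguments by default. Whether it closes the gap in a given program is a separate question.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] += k * src[i]. With restrict, the compiler may assume that
 * writing dst[i] never changes any src[j], so it can keep values in
 * registers and reorder or vectorize the loop. Passing overlapping
 * arrays would be undefined behavior. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n) {
    assert(dst != src);
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Without the qualifiers the compiler has to assume `dst` and `src` might alias, and either reload `src[i]` after every store or emit a runtime overlap check.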


But that's plausibly the compiler doing its magic. Or is there anything in Fortran language/semantics that maps close to vectorized ISAs ?


I've done some bare-metal development in Ada. It was originally designed for embedded systems, so it has features that fit this niche very well. It's definitely not as simple as C, it has a much more modern design however.


ISPC in this case is exactly that. It makes vectorization fairly easy and very fast.


Why is this getting downvoted? Writing programs in pure assembly does offer the possibility of implementing very fine-grained optimisations. However, assembly is not a magic bullet. Modern optimising compilers are extremely efficient. The most efficient assembly is quite often not the most legible or maintainable assembly. I've written lots of bare-metal assembly and if I'm given the choice between writing more efficient assembly and writing straightforward assembly, I'll pick the latter every time.

Edit: I have a little bit of experience in the area of operating-system development in assembly. This is purely a hobby project that I work on half-heartedly. Nevertheless, it does demonstrate what I know about bare-metal x86 assembler.

https://github.com/ajxs/hxos


The biggest benefit is usually that you can forgo calling conventions in your own code because everything is visible to you. Of course the code starts looking like spaghetti.


This is correct. You can forego all aspects of your platform's calling conventions if you so desire. Omitting setting up stack frames is a really simple optimisation that can be done in hand-written assembly. You've already touched on what the stakes of doing such things are.


Compilers have been able to do the same for a few years now.


Yes, but they rarely take a whole-program approach. Reducing a complicated sequence of function calls to a few goto’s is not an easy task. Probably NP-hard?


Doing it on a whole program basis is also unlikely to give much benefit. Function calls are extremely fast, they only show their overhead in tight loops.


Yeah, reviewing the repository there’s a lot of high level language code now, so that probably removes most of the benefit of assembly. You’re still calling into the OS a lot.

Some of the other low level systems (like the Mac) had a trap system that wasn’t so far away in cycle count from user code. But in these days of needing 10,000 cycles to bridge a system call it’s best to do whatever you can to avoid calling the OS.


Function inlining can't avoid system calls


Exactly my point


> Why is this getting downvoted?

You might as well ask why dogs bark or puzzle with furrowed brow over the croaking of frogs.


haha


I was thinking the same thing.

Then I sort of turned the problem around in my mind.

When is a compiler prevented from optimizing? Maybe pointer aliasing?

Another thing I wondered. I've written hello.c and it's about 84 bytes of C and 8.3k as an executable. Would hand-coded assembler be that large?

maybe it's that compilers CAN do well, but because of requirements, they can't do some things well and necessarily create a lot of "boilerplate infrastructure".


> I've written hello.c and it's about 84 bytes of C and 8.3k as an executable. Would hand-coded assembler be that large?

The answer is generally no. A lot of that overhead is due to the C std library. If memory serves from a blog post I read long ago, with very very aggressive tuning, you can get a hello world binary down under 50 bytes (which is smaller than the ELF header). A “normal” ASM coded “hello world” binary could easily be in the range of a few hundred bytes without anything special.


You can use the `-S` switch with gcc to produce the intermediate assembler representation of your C program instead of binary output. From there you can see for yourself what optimisation has been performed and what hand optimisations are possible.

I just compiled my own 'hello.c' on Linux, with the optimisation flag set for 'code size', and the binary weighed in around 8.3k. I then removed the `printf` call and the inclusion of `stdio.h`. The resulting binary was 8.1k. This is exactly as expected. Why would the inclusion of a call to a dynamic library bloat the final binary size?


> Why would the inclusion of a call to a dynamic library bloat the final binary size?

Here’s two good links to read on it:

https://www.muppetlabs.com/~breadbox/software/tiny/teensy.ht...

http://timelessname.com/elfbin/


I looked a little further.

hello.c: 4 loc, 84 bytes

hello.s: 30 loc, 517 bytes

hello: 8.3k

so it must be data structures for linking.

oh, and if I use -static, hello becomes 844704 bytes. :)


Just curious, is pure assembly harder to test for correctness? Are there higher level abstractions like functions?


Programs don't have to be particularly large or complex to become relatively unwieldy and difficult to understand in assembly. I find it takes me much more time to comprehend large portions of written assembly than higher-level languages. Your mileage may vary. Other engineers might be more talented than I am in this regard, however I think this is more or less everyone's experience.

Abstractions like functions are still present, just not in the way that you might think of them in a higher-level language. You use 'branching' instructions to jump from one part of the code to another. Either by referencing labeled sections of the code to perform 'absolute' jumps, or by jumping 'relative' to the current instruction. This forms the building blocks that can be used to implement more abstract constructs such as loops, functions and conditionals. This is a gross simplification, I hope it helps answer your question though.
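
The same idea can be shown in C by writing a loop the way assembly expresses it, with labels and jumps instead of a `for` statement. This is only an analogy (real assembly would use compare and conditional-branch instructions), and `sum_to` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of 1..n, built from labels and jumps only. */
uint32_t sum_to(uint32_t n) {
    assert(n <= 65535u);        /* keeps the total within 32 bits */
    uint32_t total = 0;
    uint32_t i = 1;
loop_top:
    if (i > n)
        goto loop_done;         /* conditional branch out of the loop */
    total += i;
    i++;
    goto loop_top;              /* unconditional jump back to the label */
loop_done:
    return total;
}
```

`sum_to(10)` returns 55. A function call works along the same lines: a jump that also records where to return to.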


Yes and no.

You can write tests like in any language, they just happen to be more tedious.

If you make use of powerful macro assemblers, like MASM and TASM, you can write what looks like high level functions.

You can have a look here for a couple of MASM examples,

https://de.wikipedia.org/wiki/Microsoft_Macro_Assembler


You can code anything you want in ASM, so higher level abstractions as well. However while doing that you will lose some of the performance gain that ASM gives you.

Optimised ASM can still be tested but one problem is that when the code doesn't work, it can simply crash. This makes it a pain to debug.


"We choose to write an OS in asm, not because it is easy, but because it is hard"


[flagged]


The greater our knowledge increases, the greater our ignorance unfolds.


Well, C still doesn't offer the same macro capabilities as 90's macro assemblers were capable of.

Thankfully in C++ we have constexpr for that.


Can you elaborate on which capabilities, for those of us interested to know?


Yes, here is the manual for MASM.

https://docs.microsoft.com/en-us/cpp/assembler/masm/directiv...

Basically already back in the Amiga/MS-DOS days, you could define:

- procedures and functions like higher level languages

- structures

- make use of looping constructs (repeat/while) and conditional logic (if/else)

- create hygienic labels for variables and jump targets

TASM and the FOSS clones, yasm and fasm do support similar macro capabilities.

On the Amiga side, Devpac was the goto Macro Assembler,

https://www.amigagamedev.com/Downloads/DevPac_v3.00.Manual.p...

Only the UNIX Assemblers have been traditionally quite poor in macro capabilities, as they have been mostly used as yet another stage for generating code from C compilers.

A good example of how to take advantage of such macros is to implement a poor man's compiler, or the first stage of a bootstrapped compiler.

Generate bytecodes that can be easily mapped into macros, and then just by having your macro library for the target platform you get the compiler very quickly up and running.
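
A rough analogy of that bytecode-to-macro approach, using the C preprocessor instead of a macro assembler. Everything here is hypothetical (the `OP_*` names and the tiny stack machine); in the real thing each macro would expand to target assembly rather than C statements.

```c
#include <assert.h>

enum { STACK_MAX = 16 };

/* One macro per bytecode. Porting the "compiler" to another target
 * would mean rewriting only these definitions. */
#define OP_PUSH(v) do { assert(sp < STACK_MAX); stack[sp++] = (v); } while (0)
#define OP_ADD()   do { assert(sp >= 2); sp--; stack[sp - 1] += stack[sp]; } while (0)
#define OP_MUL()   do { assert(sp >= 2); sp--; stack[sp - 1] *= stack[sp]; } while (0)

/* The "generated" program: (2 + 3) * 4 as a bytecode sequence. */
int run_program(void) {
    int stack[STACK_MAX];
    int sp = 0;
    OP_PUSH(2);
    OP_PUSH(3);
    OP_ADD();
    OP_PUSH(4);
    OP_MUL();
    assert(sp == 1);
    return stack[0];
}
```

`run_program()` returns 20. The front end only has to emit the sequence of macro invocations; the macro library does the rest.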


In the Amiga demo scene, I would dare say that ASM-One [1] was the default assembler.

[1] https://en.m.wikipedia.org/wiki/ASM-One_Macro_Assembler


AsmTwo[0] is a decent currently maintained fork of that.

[0]: http://coppershade.org/articles/Code/Tools/AsmTwo/


I was into the Portuguese Amiga/PC demoscene and never heard of it, then again it was back in the days when we exchanged floppies via the post.


> Thankfully in C++ we have constexpr for that.

The only C++ feature I'm thankful for is the option of not using it


I on the other hand appreciate that since 1993, C++ has allowed me to avoid C as much as possible.


As noted above, the creators admitted that writing an OS in Assembly was a challenge for them. However, I totally agree about the optimizing capabilities of modern compilers. An interesting fact on the topic: a friend of mine, a Principal Developer at a well-known tech giant, has a practice of picking out candidates who mention "fluent Assembly" in their CV and challenging them to write a small, highly optimized application. He then compiles the same application in C++ with maximum optimization, and in 90% of cases the generated code is better than the hand-written one. However, before becoming all sceptical about Assembly being an obsolete skill, I would point to the remaining 10% of his candidates.


Do they mention fluent x86-64 or ARM assembly? Because there are many ISAs where no C++ compiler will be able to generate better code than hand-optimized code by a fluent programmer.


Exactly, when the only compiler available for an ISA is gcc 2.9... manually writing the assembly doesn't sound that bad anymore. C++11? There isn't even a C++03 compiler for those..


I think, since he's working on internet search and knows the target architecture in advance, he meant x86_64 assembly, probably used for platform-specific fine-tuning.


As part of a university project we optimized a simple string to uppercase function via assembler analysis and asm. Directly optimizing the C code to produce more optimized assembly (notably removing branching) was nearly an order of magnitude faster.

  // uppercase without if
  *c -= (*c-'a'<26U)<<5;
Adding loop unrolling to that C-optimized version via asm provided another 20% performance boost. And I didn't even add vectorization using AVX.

While compilers provide a lot of optimization directly, there are huge performance gains in simple functionality by helping the compiler with easier to optimize code or by using asm directly.

Also notably the highest GCC optimization level I tested (-O3) reverted the optimization of the optimized C code, while -O2 kept them, resulting in -O2 being much faster. So only using asm guaranteed the performance gain.
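
For anyone wanting to try this, here is the one-liner above expanded into two complete functions. It is a sketch with hypothetical function names, assuming ASCII and NUL-terminated strings; it makes no claim about which version is faster with a particular compiler or optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Written with an explicit if; the compiler may or may not emit a branch. */
void upper_branchy(char *s) {
    assert(s != NULL);
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 32;
    }
}

/* Uppercase without if: the unsigned comparison is 1 only for 'a'..'z'
 * (anything below 'a' wraps to a huge unsigned value), and shifting
 * that 0/1 left by 5 gives 0 or 32, the ASCII case offset. */
void upper_branchless(char *s) {
    assert(s != NULL);
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        s[i] -= (s[i] - 'a' < 26U) << 5;
}
```

Both functions turn `"Hello, World!"` into `"HELLO, WORLD!"` and leave non-letters untouched.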


Been 30 years since I've written assembler but back in the day I considered myself part of Abrash's army. What we tend to forget about writing in assembler is each typical formatted line has a one-to-one mapping to a machine instruction unless you're using macros. It really is the most wysiwyg programming experience you can have, so it's simply not possible to get the same optimization and space usage with a higher level language because your only choice for optimizing higher level is via groups of instructions. So C optimizers are inlining groups of instructions and making best assumptions about execution flows for you while maintaining intent. So if you want to enjoy the abstractions of higher languages, and the speed at which you can develop, the only obvious run-time optimization is faster hardware to compensate for abstractions that might decompile to massive amounts of machine instructions.

Oh yeah and I'd just like to mention as an aside that assembler statements make more sense to me than bootstrap css codes - a sad statement to the art of reasonable brevity.


From my experience in this area, it's still possible to produce a lot more efficient and compact machine code in ASM.


You just have to be careful, because in a lot of cases good algorithms have a much bigger effect than raw performance.

Case in point: I saw a web forum once, written entirely in assembly language. The author claimed high performance, but looking at the sources, I saw that it uses no buffering, and even a simple webpage results in thousands of write syscalls. Whatever they saved in more efficient function prologues, they lost back in wasted context switches.
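
The missing piece there is only a few lines in any language. A minimal C sketch of output buffering, with hypothetical names (`out_buf`, `out_put`, `out_flush`) and simplified error handling: small pieces are collected in memory and handed to `write` in large chunks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

enum { OUT_CAP = 4096 };

struct out_buf {
    int fd;                  /* destination file descriptor */
    size_t len;              /* bytes currently buffered */
    unsigned long syscalls;  /* number of write(2) calls issued so far */
    char data[OUT_CAP];
};

/* Writes everything buffered so far. Returns 0 on success, -1 on error. */
int out_flush(struct out_buf *o) {
    size_t off = 0;
    while (off < o->len) {
        ssize_t n = write(o->fd, o->data + off, o->len - off);
        o->syscalls++;
        if (n < 0)
            return -1;       /* simplified: no retry on EINTR */
        off += (size_t)n;
    }
    o->len = 0;
    return 0;
}

/* Appends n bytes, flushing first only when the buffer would overflow. */
int out_put(struct out_buf *o, const char *s, size_t n) {
    assert(n <= OUT_CAP);
    if (o->len + n > OUT_CAP && out_flush(o) != 0)
        return -1;
    memcpy(o->data + o->len, s, n);
    o->len += n;
    return 0;
}
```

Emitting a thousand 5-byte fragments through `out_put` and then calling `out_flush` once costs two `write` calls with this buffer size, instead of a thousand.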


I'd also be really concerned about security in that application. Web applications tend to do a lot of string processing, and naïve assembly implementations of string processing routines are likely to be vulnerable to all the same errors as a naïve C implementation -- buffer overflows, off-by-one errors, null byte injection...


Exactly. Compile a C or C++ application with -Os and the highest size optimisation settings, and you'll almost certainly still end up with a binary that's several multiples if not an order of magnitude or more larger than if you had written it in Asm with the equivalent functionality. Compilers can do SIMD tricks and such to get close in speed, but they are still pretty horrible at size optimisation.

You could try to "decompile"(!) this project into a higher level language and compile the result if you were really curious and wanted to try this exercise in the opposite direction. I suspect a lot of the things done in this code aren't even representable in a HLL or something a C compiler could be coerced to generate (without cheating and using inline Asm.)


I disagree with almost every aspect of this post. I've never seen any evidence of your claim about binary size being the case. It's a spurious claim. There's always the possibility that a user could write slightly more efficient assembler by hand, but orders of magnitude? Maybe in the 80s this could have been correct.

Your claim about compilers using "SIMD tricks" to "get close in speed" makes no sense either. If the HLL is using SIMD instructions and this magic assembler that you're talking about isn't, how would their performance ever be equivalent?


http://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_dens...

See figure 7.

> how would their performance ever be equivalent?

The majority of general-purpose code can't use SIMD because it's really branchy, mundane "business logic" type of code, and that's where handwritten Asm's code density really shows an advantage.

I see that you claim to have "written lots of bare-metal assembly" and have posted a link to your OS in Asm, so I looked at the code...

You are not using Asm the way Asm is supposed to be written. You are writing code like a compiler, which totally misses the point of using Asm. Now it is obvious why you don't understand --- because you've never seen what "real Asm" looks like.

To elaborate, one thing that stands out is lots of stack manipulation, barely using the registers at all. Putting everything on the stack is what stupid compilers do. This is good for neither speed nor size.

I have been reading and writing Asm for a few decades. See some of my other comments if you'd like to learn more... here's a quick sampling:

https://news.ycombinator.com/item?id=8248172

https://news.ycombinator.com/item?id=12332672

https://news.ycombinator.com/item?id=12356450

https://news.ycombinator.com/item?id=15721322

https://news.ycombinator.com/item?id=17948475


I think you are underestimating the ability of humans.

I remember back in the day, when I was following a small amateur indie game scene (1998-2002), a guy released a game he'd written entirely in assembler, just for fun. He could easily have written it in C instead, but he didn't.

The binary was significantly smaller and the game really smooth.

You don't write assembler the same way you write C or C++.

A compiler is working under various constraints, for instance regarding interoperability. When you craft everything by hand, you really have no constraints other than your imagination. You can benchmark stuff and learn and adapt. Compared to that a compiler is a sophisticated idiot.

That doesn't mean all of us should start working in assembler. But don't look down on people who do - instead find inspiration, and perhaps try to make the compilers less dumb.


That was twenty years ago. Compiler technology has not exactly stood still in all those years.


There are some (x86) demos with sound, graphics and input whose size is measured in a few hundred bytes (https://www.youtube.com/watch?v=pcdHY-eJVIY for a 256-byte example, or anything here really: http://olivier.poudade.free.fr). Is any C compiler able to generate this little code?


No, but those are extreme outliers, and not relevant at all to discussion about large scale programs.

You absolutely cannot write a program much larger than 256 bytes in the style you use to write those. It would be utterly incomprehensible and unmaintainable. The small size actually works in your favour here, and allows you to use incredibly questionable tricks.


How much of an effect would linking shared / static libraries have on binary size?

Would someone writing assembly tend to avoid using external libraries unlike with HLL?


Linking a static library into your program will increase your binary size, linking to a shared library will not. Smaller binary size is one of the benefits that shared libraries offer.

That decision is entirely up to the developer and the project requirements. If you're not writing a bare-metal program, there's no reason not to use external libraries. It's still perfectly possible.


Program bloat is not due to poor compiler optimization. It’s due to runtime size and a lack of effort to make runtimes smaller. In particular the C runtime.


Here's someone who wrote a load of library routines in asm and claimed they were faster.

https://old.reddit.com/r/programming/comments/dygsvm/heavyth...

I was skeptical, so I looked into his zlib implementation. It was 18% faster than gzip v1.6. The author said, "all I did was hand-compile the reference implementation, an 18% reduction in user-space time is a big advantage IMO."

I am not skeptical any more! More details of the benchmarking are in the linked thread.


I suspect a big win is that usually, programs from hand-written assembly will be smaller than compiler-generated binaries. For example, see [1], where a 32-bit GCC-generated binary (2.6 KB) is compared with a hand-written assembly program (45 bytes, although this is not representative). For reference, when I compile an empty program with GCC 8.3 on my 64-bit system, I get a 16KB binary.

[1] https://www.muppetlabs.com/~breadbox/software/tiny/teensy.ht...


The constant overhead for compiler generated binaries is larger because the compiler doesn't optimize for the case of nearly empty programs. I'm not so sure that the benefits are that large once you have a decently sized piece of code.


That's true. The overhead on relatively small systems with a lot of simple binaries might still be considerable. But it's probably not the main reason for the difference in snappiness.


I love the idea and I'm glad someone was able to create it but long term maintenance is the problem with assembler. Few people have the know-how to help. Even if there were a large number of programmers that could maintain it, it's better to write the OS in a high-level language and use assembler to optimize it.

Nice job anyhow...


It gives you a very close understanding of your CPU's instruction set and features. It's not necessary to write performant code, since compilers can do this very well.


Modern compilers are pretty good, but it's still quite possible for a good human coder to exceed their performance even for normal code.

For really tight algorithms I've still seen humans beat compilers by a lot, though it takes a lot of skill and some domain knowledge of the CPU.

For code that benefits a lot from vectorization, human coders still beat the crap out of compilers. I am not aware of any vectorizing compiler or JIT that can even approach what a human coder can do with SSE, AVX, or NEON (ARM's equivalent). I think these CPU extensions are just too complex for current generation compilers to effectively deal with. They require too much abstract understanding of what's actually happening in the code and the CPU to use really effectively.

I'm a bit surprised that there's been so little attention paid to the opportunities for applying deep learning and other advanced techniques to compiler optimization. It seems like this area is ripe for a new wave of innovation, but it's not happening. I do believe that many companies would pay for an advanced compiler capable of generating code that was significantly faster than stock compilers, but the speedup would have to be more than a few percent to justify spending money on it.


SIMD-like code: auto-vectorization has come a long way, but it still isn't as fast as doing it by hand.

IoT devices with memory measured in single digits KB.

Plus, someone needs to write the Assembly that those compilers generate.


It depends. If your target is x86/x86_64 and the compiler has an LLVM or Intel backend, I've found that beating the compiler is almost impossible and largely futile. It took me a long time to accept this, since I know asm and the x86 instructions well.


That is my point of view as well, but then again the kind of code I write also does quite well with bounds checking enabled.

So I never had to deep dive into stuff like writing AVX by hand; I'm just basing my remark on some comments that I occasionally read.


Compiler auto-vectorization is fairly easy to beat as it won’t even try when loop invariants are mildly complicated.


Why do mimes not speak?

Some things, one should just accept.


Assembly language gets you a detailed definition of behavior that isn't a moving target.


> compilers nowadays have lots of optimizations

Whatever, the Kolibri ISO is 64 MB. Nuff said.


64 MB is absolutely gigantic if we are talking about assembly code. So that is not "nuff said" by any measure; it is a completely irrelevant factoid that tells us nothing at all.


>64 MB is absolutely gigantic if we are talking about assembly code.

Likely due to the inclusion of source code and some non-core stuff actually written in C.


64MB is gigantic, period. TinyCore Linux is what, 11MB?


Does that include a window manager?


Assembly gives you minimal resource use by default. Some higher level languages make it possible as an edge case.

Generally, higher level languages are usually built upon non-zero-cost abstraction overheads, library bindings, memory management and runtimes.


Not always. Sometimes it’s annoying to do something performant in assembly but trivial in a higher-level language: hashmaps, for example.


The main advantage of Asm is the tiny constant factor. A linear or quadratic algorithm can beat a logarithmic or even (amortised) constant-time one on the sizes of data used, if the latter has a much larger constant factor.

Just like in a HLL, if you really need it you can still write reusable data structure libraries, but Asm's really tiny constant factor and effort involved in adding complexity forces you to think about whether you really need it first.

In other words: in the amount of cycles spent initialising a hashmap and inserting a few dozen or hundred items (perhaps involving memory allocations, etc.) just so you can get (once again, a relatively large factor each time) asymptotically constant-time lookup in an HLL, you could've gone through the whole set many times already in a tight loop of less than a dozen instructions, that entirely fits in the L1 cache along with the data too.
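
The "tight loop" alternative is roughly the following, sketched in C instead of Asm. `small_set_find` is a name invented for this illustration, and whether it actually beats a hash map depends on the set size and the hardware, so measure before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Linear scan over a small contiguous array of keys.
 * Returns the index of key in keys[0..n), or -1 if it is absent.
 * No allocation, no hashing, and the data is read sequentially. */
int small_set_find(const uint32_t *keys, size_t n, uint32_t key) {
    for (size_t i = 0; i < n; i++) {
        if (keys[i] == key) {
            return (int)i;
        }
    }
    return -1;
}
```

For a few dozen 32-bit keys the whole array spans only a handful of cache lines, which is where the small constant factor comes from.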


What are you talking about? You're talking about implementation. This has nothing to do with assembly at all. Assembly is not a magic language that offers instant efficiency. 'Constant factor' is entirely irrelevant to this discussion. There's nothing stopping an engineer from implementing equally inefficient data structures/algorithms in assembler as they would in a higher-level language. It is true that if the engineer understands the problem domain very well, and the project requirements are amenable, the possibility certainly exists for them to solve the problem with highly efficient assembly. This is true for almost anything. However, it exists at the cost of extra development effort with no real guaranteed benefit.


> There's nothing stopping an engineer from implementing equally inefficient data structures/algorithms in assembler as they would in a higher-level language.

...except the additional complexity of doing so? If you have to write every single instruction, you start thinking more about whether you have to write each one.

> 'Constant factor' is entirely irrelevant to this discussion.

It's entirely relevant to the real world.


Division by constants is another example.

C compilers will convert constant divides into reciprocal multiplication. All the handwritten assembly I have seen uses the divide instruction.
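
For anyone who hasn't seen the trick, here is a hand-written sketch of it for unsigned 32-bit division by 10. The function name `div10` is invented for this example; the constant is ceil(2^35 / 10), and the exact instruction sequence a given compiler emits may differ.

```c
#include <stdint.h>

/* Computes x / 10 without a divide instruction.
 * 0xCCCCCCCD is ceil(2^35 / 10); multiplying in 64 bits and shifting
 * right by 35 gives the exact quotient for every 32-bit unsigned x. */
uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

A multiply plus a shift is typically much cheaper than a hardware divide, which is why compilers apply this automatically when the divisor is a compile-time constant.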


Dunno why you are getting downvoted, because you are exactly right.


Assembly doesn't have undefined behavior.


Actually it does.

BSF (Bit Scan Forward) is the most famous example: it returns an undefined result if your operand is 0.

Many instructions leave the state of bits in the FLAGS register undefined. E.g. AAA will leave OF, SF, ZF, and PF undefined.

Then of course there are the “reserved bits”, but those are easier to avoid.
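
The usual way around the BSF case is to test for zero before scanning. Here is a portable C sketch of the idea; `ctz32_defined` is a made-up name, and returning 32 for a zero input is a choice made here (it matches what TZCNT does for a 32-bit operand), not something BSF guarantees.

```c
#include <stdint.h>

/* Index of the lowest set bit of x, with the zero case pinned down.
 * Returns 32 when x == 0 instead of leaving the result undefined. */
unsigned ctz32_defined(uint32_t x) {
    if (x == 0) {
        return 32;  /* explicit choice for the case BSF leaves undefined */
    }
    unsigned n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}
```

In hand-written Asm the equivalent is a test and branch (or a conditional move) around the BSF, which costs an instruction or two but removes the dependence on whatever a particular CPU happens to leave in the destination.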


Here's a 6502 instruction that has undefined behavior, because the microsequence connects circuits together in an invalid way causing analog effects that can change even run to run of the same processor.

http://visual6502.org/wiki/index.php?title=6502_Opcode_8B_%2...


It looks like that’s not undefined behaviour, but using a non-existent and undefined opcode, which happens to “do something” because the cpu attempts to execute the bits anyway (it doesn’t mask the unused opcodes into no ops or errors). Later variants of the 6502, like the WDC 65C816, did away with this (I believe, either using the opcodes for something defined, or making them no ops, although I’m unsure)


"Of all the unsupported opcodes" - I'm not sure it's a good example.


It does, what it lacks is an optimizer that will rewrite your code to win micro-benchmarks, just because it happened to be UB.


It sure does.

Some is inherent to the ISA itself.

And some undefined behavior is errata that differs from CPU to CPU. Check out the errata documentation for any processor released in the last few decades. For example, x86. You can trigger things that happen on one x86 CPU that won't happen on another.


Not entirely true. Look up 'Silicon Errata' and you will find some examples. It's actually interesting stuff.



