The surprising cleverness of modern compilers (lemire.me)
144 points by kiyanwang on May 24, 2016 | 83 comments



There is an often-missed elephant in the compiler optimization living room in the form of poor language design leading to a tremendous amount of wasted human effort. First, effort is demanded of the programmer to write the clever optimized C code in the first place, and then effort is demanded of the compiler writers to recognize the clever optimized C code and compile it down to the native instruction that does the Right Thing.

All of this effort could be eliminated simply by adding a function to the standard library called bitcount. This would make life easier for both the programmer, who could now count bits by calling bitcount, and the compiler writers, who no longer have to write and maintain code that recognizes clever hacks.

This (counting bits) is not an isolated incident either. There is an entire field of academic endeavor devoted to this sort of thing, focusing particularly on recognizing common idioms involved in vector and array processing. All of this could be eliminated at a stroke through better language design. Instead, an entire academic field dons the ball-and-chain of a poorly designed language and then pats itself on the back for getting to the finish line panting and sweating, and people like me who say, "Um, why not just take off the extra weight and use a better designed language?" get marginalized at best and vilified at worst.


Rust does exactly this in many cases. All the integer types have methods like 'count_ones' for popcount and 'trailing_zeros' for ctz, things that are intrinsics in other languages. See http://doc.rust-lang.org/1.8.0/std/primitive.u64.html#method...


Though note that these methods are really just thin abstractions over intrinsics, since Rust doesn't want to stabilize intrinsics directly for reasons of compiler backend portability. For example, the `count_ones` method currently resolves directly to LLVM's `ctpop` intrinsic, defined at https://github.com/rust-lang/rust/blob/master/src/libcore/in... .


FYI, when you're on a GitHub page you can press the 'y' key to have the URL changed to one that includes a particular blob reference. It helps prevent bit rot. Currently, your linked line doesn't point to the `ctpop` intrinsic. I think you wanted https://github.com/rust-lang/rust/blob/34fd68668152530dcd1d0...


One big problem I see is that these tricks make the compiler unpredictable in terms of performance. Add a line somewhere in the middle, and the compiler might no longer recognize the construct, and suddenly your code runs 10x slower. Try debugging that :)


Yeah, you can't print that halfway through and get the killer performance (not that printf is fast)


Your point on vectors and array processing is not a good one: Some significant numerical algorithms can be easily expressed with nested loops, and are a pain to express with array operations. Fortran has both, and some of the best compilers turn array operations into loops, to ensure that they get both forms right.

Maybe you could spend more time understanding other points of view, rather than feeling marginalized or vilified?


> Some significant numerical algorithms can be easily expressed with nested loops

Like for example?

Just to clarify, I'm (obviously) not suggesting that nested loops be eliminated from the language, or that common optimizations on nested loops be eliminated from compilers. But having specialized code to recognize a clever idiom for counting bits is IMHO going way too far.

It's true that I'm not intimately familiar with numerical algorithms, so I could be wrong about this, but my impression is that the number of "significant numerical algorithms" that can be easily expressed as nested loops is not that large: dozens, not hundreds. Unless I'm wrong about that, I stand by my position that the best way to handle these is to enumerate them and then allow them to be expressed directly.


Such as any code that applies a 1D operator against each of 3 dimensions in turn. That's done in the commonly-used PPM hydrodynamics algorithm, which is used in astrophysics and nuclear weapon simulations.


gcc and clang both provide a builtin for this, __builtin_popcount. There are a bunch of other useful things like that too:

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Of course, that doesn't save you from writing the long version if you need portable code, but it's a step in the direction you describe.
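
For instance, a common way to get the best of both (a sketch, assuming __GNUC__ is only defined by GCC-compatible compilers and that unsigned int is 32 bits):

    #include <stdint.h>

    static inline int popcount32(uint32_t x) {
    #if defined(__GNUC__)
        return __builtin_popcount(x);   /* compiles to a popcnt instruction where available */
    #else
        int n = 0;                      /* portable fallback: the "long version" */
        while (x) { x &= x - 1; n++; }
        return n;
    #endif
    }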


Make sure to use __builtin_popcountl to count bits in a uint64_t, or you may miss half of them (yes it happened to me).


I love this example. Even though you are aware how error-prone this is, you still get it wrong.

long is 32 bits with 32-bit GCC, so it will still give the wrong result with `uint64_t`. You should be using __builtin_popcountll.
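
A quick illustration of the width issue (a sketch; the exact results assume the usual ILP32/LP64 data models):

    #include <stdint.h>

    void demo(void) {
        uint64_t b = 0xFFFFFFFFFFFFFFFFull;
        int y = __builtin_popcountl(b);   /* unsigned long: 32 bits on ILP32, so b is truncated and y == 32 there */
        int z = __builtin_popcountll(b);  /* unsigned long long: at least 64 bits everywhere, so z == 64 */
        (void)y; (void)z;
    }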

(don't get me wrong - you're not stupid, it happened to me too: https://github.com/orlp/libop/issues/1)


Good catch, thanks. (It has been a while since I've been on a 32bit platform.)


> it's a step in the direction you describe.

Well, sort of. This is not about any one particular optimization. This is about the general pattern of having one human take a high-level concept (like counting bits) and reduce it to a low-level implementation, and then having a second human write a compiler which tries to reverse-engineer the work that the first human did in order to figure out that they were trying to count bits, so that it can emit code that actually does the Right Thing.

So yes, __builtin_popcount does improve the situation a little bit, but it perpetuates a whole host of other problems. For example:

1. It has the wrong name. The right name for a function that counts bits is "bitcount", not "popcount" (and certainly not __builtin_popcount!) The __builtin is there because C doesn't have a proper namespacing system, and so using __builtin_popcount perpetuates that problem in the same way that clever compiler optimizations perpetuate other bad aspects of C's design.

2. __builtin_popcount only works on unsigned ints. If you want to count bits in anything else (a string, say, or a bignum) you're back to square 1.

To really fix this problem you need a general mechanism for expressing high-level concepts that a compiler can know about and optimize directly. For that, C is completely hopeless. You need a new language.


> a general mechanism for expressing high-level concepts

functions

> that a compiler can now about and optimize directly. For that, C is completely hopeless.

This has been done with C for generations: e.g. memcpy, strlen, sqrt, and even printf are recognized by the C compiler and custom code is generated.
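
For example, all of these are transformations that GCC and Clang typically perform at normal optimization levels (a sketch; the exact output depends on the compiler and flags):

    #include <stdio.h>
    #include <string.h>

    void builtins_demo(char *dst, const char *src) {
        size_t n = strlen("hello");   /* typically constant-folded to 5 */
        memcpy(dst, src, 8);          /* typically inlined to a single 8-byte load/store */
        printf("hello\n");            /* typically rewritten as puts("hello") */
        (void)n;
    }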

The only problem is the C community has been slow to standardize on additional functions.


No, that's not the only problem, it's one of many problems. But even by itself it's a pretty frickin' substantial problem.


It means a new language is not required. Even without a Standard update, the various C compiler vendors could get together and agree on a set of them, making a "technical report".

Anyhow, what other problems?


> the various C compiler vendors could get together and agree on a set of them, making a "technical report".

Yes, they could. But they don't.

> Anyhow, what other problems?

The lack of a real multi-dimensional array type is probably the biggest problem with regards to the goal of making code run fast. The fact that strings are required to be zero-terminated arrays of signed 8-bit integers is another huge problem. It means that you can't have native unicode strings, and that to create a substring you have to make a copy. The lack of exceptions and automatic memory management means that you need weird calling conventions if you want to return composite types, or if a function can fail. The lack of generics and namespacing means that you need to use weird naming conventions to avoid collisions, and there is a significant cognitive load placed on the programmer to figure out the right operation to use for a particular set of operand types (except for native arithmetic, but that can't be extended to any non-native types).
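
On the substring point, for example: because the only end marker is the terminating '\0' inside the buffer, a substring can't be referenced in place as a C string, so you have to copy (a sketch):

    #include <string.h>

    void substring_demo(void) {
        const char *s = "hello";
        char sub[4];
        memcpy(sub, s + 1, 3);   /* copy the three characters "ell"... */
        sub[3] = '\0';           /* ...and terminate the copy; the original can't express this view */
    }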


I'm aware of C's language problems, but I don't think the ones you mention impede encapsulation of algorithms so the compiler can do a better optimization job.

> strings are required to be zero-terminated arrays of signed 8-bit integers

This is incorrect. C's char type is optionally signed, not required to be signed. The implementations I know of make them unsigned, because signed characters make no sense.


> I don't think the ones you mention impede encapsulation of algorithms so the compiler can do a better optimization job.

Unless you can produce an actual counter-argument I guess we'll just have to agree to disagree about that.

> signed characters make no sense

I could not agree more. And yet...

    [ron@mighty:~] gcc -v
    Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
    Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn)
    Target: x86_64-apple-darwin13.4.0
    Thread model: posix
    [ron@mighty:~] cat test.c
    
    int main(int argc, char** argv) {
      unsigned char* x = "baz";
    }
    [ron@mighty:~] gcc test.c
    test.c:3:18: warning: initializing 'unsigned char *' with an expression of type
          'char [4]' converts between pointers to integer types with different sign
          [-Wpointer-sign]
      unsigned char* x = "baz";
                     ^   ~~~~~
    1 warning generated.


Dang, you're right. D's chars are unsigned, and I had gotten so used to that I forgot that C compilers often default to signed. Thanks for the correction.


The ARM C compiler I used over 20 years ago had unsigned chars. I don't know if that is still true for modern ARM.


> Unless you can produce an actual counter-argument

There's no problem having a function called popcnt(), and having the C compiler recognize it just like it recognizes memcpy(), strlen(), sqrt(), etc., and generate optimal code for it.


Sure, that is not in dispute. What is in dispute is whether the same can be done in C for matrix_multiply (or, even better, m1*m2 where m1 and m2 are matrices) or string.split(string) or strings.join(sequence_of_strings) or dict.get(key) or...


Aren't any of those negatively affected by the possibility of aliasing?


None of those I mentioned are, though C has the 'restrict' type modifier to deal with that.
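
For reference, a minimal sketch of what that looks like:

    /* 'restrict' promises the compiler that out, a and b never overlap,
       so it is free to reorder and vectorize the loads and stores. */
    void vec_add(float *restrict out, const float *restrict a,
                 const float *restrict b, int n) {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }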


> The right name for a function that counts bits is "bitcount", not "popcount"

"Population count" is what this function is typically called.

> (and certainly not __builtin_popcount!)

C doesn't have namespacing and naming the function "bitcount" would almost certainly collide with user code.

> 2. __builtin_popcount only works on unsigned ints. If you want to count bits in anything else (a string, say, or a bignum) you're back to square 1.

Not really? Just use __builtin_popcount in a loop.


> "Population count" is what this function is typically called.

I didn't know that. Thanks.

> C doesn't have namespacing

That I did know. I even pointed that out in the comment you're responding to :-)

> Just use __builtin_popcount in a loop.

But that just brings you back to the original problem when some day someone builds a machine that has hardware support for popcount on multiple words.


> But that just brings you back to the original problem when some day someone builds a machine that has hardware support for popcount on multiple words.

Is that really a problem? We've already seen that compilers are clever enough to recognize these patterns. Meanwhile any loop using __builtin_popcount will look more or less like this:

    for (int i = X; i < Y; i++) {
        total += __builtin_popcount(a[i]);
    }
... which is a substantially easier pattern to recognize than the one that clang is already using to recognize popcount.


> Is that really a problem?

I think it is if you consider wasted effort a problem. But apparently not everyone does.

> compilers are clever enough

Yes, but only because someone did a lot of work to make them clever. And they only had to do that work because someone else did a lot of work to write a clever algorithm. If we as an industry had just done the Right Thing to begin with both of those people could have used that time to do something more productive.


>I think it is if you consider wasted effort a problem. But apparently not everyone does.

You haven't considered 2 things: marginal returns and opportunity cost.

Your solution to this "wasted effort" is to build a better C (or a better language). But this overlooks the millions of hours spent in C systems, C code, C compilers, learning, and all that would have to be done once again...

If the better language already existed, and was a drop in replacement (meaning everything just works) then your argument would indeed mean an end to wasted effort if adopted.

But as it is, it means tons of EXTRA effort, and it's not clear that is better than the current effort wasted because of C's deficiencies.


> You haven't considered 2 things: marginal returns and opportunity cost.

Yes, I have.

> this overlooks the millions of hours spent in C systems

No, it doesn't. You are assuming that I am advocating re-writing existing C code in this hypothetical new language. I'm not. The hypothetical new language should be backwards-compatible with C, either a superset of C (like C++) or with a C (and C++) FFI (just about every other language out there today).

But even that is not strictly necessary. It is already common practice for programs written in different languages to interact with each other via language-independent mechanisms like pipes and sockets.

> If the better language already existed, and was a drop in replacement (meaning everything just works) then your argument would indeed mean an end to wasted effort if adopted.

Many such languages exist. The one I like to use, Common Lisp (and specifically Clozure Common Lisp) not only has a C FFI, it has an Objective-C FFI, which allows me to easily write Cocoa applications and leverage all of the power of both C and ObjC libraries on OS X with all of the efficiency of C. No wheel-reinventing required. And yes, everything Just Works. (With QuickLisp it really Just Works!)

> But as it is, it means tons of EXTRA effort

No, it doesn't. If you think it does, then you're doing it wrong.


> Many such languages exist. The one I like to use, Common Lisp (and specifically Clozure Common Lisp) not only has a C FFI, it has an Objective-C FFI, which allows me to easily write Cocoa applications and leverage all of the power of both C and ObjC libraries on OS X with all of the efficiency of C. No wheel-reinventing required. And yes, everything Just Works. (With QuickLisp it really Just Works!)

Not sure in what parallel universe starting to write software in CL, and getting hundreds of thousands of C programmers to adopt it doesn't mean "tons of EXTRA effort".

Heck, even the re-training required is millions of man-hours.

It's even worse for the other proposed solution: having legacy programs (that is, the 99.9999% of available software that's not in the "NEW LANGUAGE"), written in different languages, interact with each other via language-independent mechanisms like pipes and sockets.

And all that to gain what?

Supposedly it was to have a language where compilers could take advantage of the superior design to be even faster and more optimized than C, and you propose CL, which, after 30+ years, is slower or at best equal-ish to C.


I was not suggesting that all C programmers switch to Common Lisp (though I don't think that would be a bad thing). I was just citing that as an existence proof that it is possible to introduce a new language into the software development mix without abandoning all of the legacy code written in C.

> And all that to gain what?

The ability to write future software with less effort (and that is more reliable and secure) than past software.

> equal-ish to C.

Equal-ish is good enough given the big win in security, reliability and development effort.


>I was just citing that as an existence proof that it is possible to introduce a new language into the software development mix without abandoning all of the legacy code written in C.

I don't think anybody doubted that. At least not anybody who ever heard of the C calling conventions and FFIs.

But you started this thread advocating for something much stronger, which I read (correct me if I'm wrong) as: working around C is wasting so much effort, people should adopt a better language design for better compiler optimizations, etc.

Unless this adoption is actually a mass phenomenon (even a replacement), then instead of improving things it only splits the effort and worsens things.

>The ability to write future software with less effort (and that is more reliable and secure) than past software (...) Equal-ish is good enough given the big win in security, reliability and development effort.

At the start of this thread you were all about compiler optimizations for speed -- not about more reliability and security. Now "equal-ish" to C is good enough?


> I don't think anybody doubted that

Well, you wrote:

> this overlooks the millions of hours spent in C systems, C code, C compilers, learning, and all that would have to be done once again...

and you are wrong -- on two counts. I didn't overlook these things, and they would not have to be done again. It's a little unclear to me how you could possibly think that they would have to be done again if you didn't doubt that this problem could be solved with an FFI.

> But you started this thread advocating for something much stronger, which I read (correct me if I'm wrong) as: working around C is wasting so much effort, people should adopt a better language design for better compiler optimizations, etc.

Yes, that's right.

> Unless this adoption is actually a mass phenomenon (even a replacement), then instead of improving things, it only split all the effort and worsened things.

No, that's not right. Different tools are good for different tasks. There's nothing wrong with split efforts.

Furthermore, even if you're right and there should be only the One True Language, don't you think it would be better if it were actually a well designed one?

> At the start of this thread you were all about compiler optimizations for speed

Not quite. I was lamenting the way in which compilers are evolving: people write clever algorithms in C, and compiler writers then write compilers to try to reverse-engineer those clever algorithms, and neither of those efforts would be necessary if we had a language that could directly express the things that people want to run fast and compiler writers could just directly compile those things to run fast. Life would be better for everyone.

The problem with C is that it's a low-level language masquerading as a high-level one. It's a macro assembler for a machine that was common in the 1970s but is no longer common today. The general idea of having a high-level assembler is a good one, but it needs to be an assembler for the actual target machine, otherwise you waste a lot of effort dealing with the impedance mismatch.


I would expect `bitcount` to mean sizeof(x)*8. Population count or Hamming weight is the canonical name for this operation.


Erm - `sizeof x * CHAR_BIT` ;)


A bit is a binary digit, with a value of 0 or 1. Popcount counts the number of 1-bits; and, as said, population count actually is the canonical name.

You can apply __builtin_popcount in a loop if you want to count bits in multiple words. The point of exposing this particular operation for an unsigned int is that a recent CPU will likely have an efficient operation for it. Operating on larger units is something better left to your own code or a library. You're definitely not "back to square 1", because the intrinsic offers you an operation that would otherwise only be accessible with inline assembly.


That's the whole point: even from the long version, clang figured out that it was doing a popcount and emitted a single popcntq instruction. Nice!


Pop count returns the number of set bits, so it is most useful for figuring out if a number is a power of two: if there is more than one set bit, then it is NOT a power of two. However, a non-portable way of determining how many bits are in an object, assuming 8 bits in a byte, would be the expression `sizeof(object) * 8`.


Detecting a power of two is easy with just (x != 0 && (x & x-1) == 0).

A portable way of determining how many bits are in an object is sizeof(object) * CHAR_BIT. No reason to use the 8 version.


There are most likely many ways to check if a number is a power of two. I'm not sure if my solution is more efficient. It seems your expression would create more code than mine.
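
For comparison, the two checks side by side (a sketch; which one produces less code depends on whether the target has a popcount instruction):

    #include <stdbool.h>
    #include <stdint.h>

    bool is_pow2_popcount(uint32_t x) {   /* power of two <=> exactly one set bit */
        return __builtin_popcount(x) == 1;
    }

    bool is_pow2_mask(uint32_t x) {       /* power of two <=> clearing the lowest set bit leaves zero */
        return x != 0 && (x & (x - 1)) == 0;
    }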


There's a proposal to add them to the C++ standard, but I don't know its current status. “A constexpr bitwise operations library for C++”: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n386...


Better language design unfortunately doesn't eliminate the other elephant in the room: legacy code. Compilers do provide intrinsics for some of these hacks, counting bits for sure. But old code doesn't use them and yet the compiler should create optimal instructions.


Trying to make legacy code run faster through horrible compiler hacks rather than fixing the legacy code is a deal with the devil. The net result of that strategy is a horrible pile of hacks, many of which are there for no reason other than to undo the deleterious effects of previous hacks. The long-term cost of maintaining all that complexity is far greater than the cost of just doing the Right Thing in the first place. Nowadays there is absolutely no excuse for continuing to use languages with built-in design compromises that were put in simply because the compiler had to fit in 64k of RAM. People keep doing it because doing the Right Thing front-loads the cost. But another name for "front loading costs" is "making an investment", which is generally considered a Good Thing in other walks of life. But for some reason, not in software.


Legacy code is working code. Arguably time is better spent making old code go faster than optimizing yet to be written code.

And this is not confined to compilers. Most general purpose high performance CPUs are actually optimized to make existing code go fast and do not require recompilation to take advantage of many optimizations. Some go as far as recognizing common sequences of instructions and internally converting them to more optimized sequences (for example POWER8 conditional conversion of short jumps to conditional execution).


Option A: Maybe it's acceptable for legacy code to be compiled less-optimally? If it was fast enough on a Pentium 4, it should run fine on modern chips even without some optimizations

Option B: How often is legacy code recompiled, anyway?

Option C: Why not compile legacy code with legacy compilers? Sure the hack was removed upstream, but that doesn't stop you from using a copy of gcc circa 1995

Option D: If it needs to be maximally fast, and is routinely recompiled on a wide range of systems with a variety of compilers (e.g. Linux kernel), maybe it's worth rewriting.


Great thinking! So that way, if you need to leftpad a string, you could just import the well-maintained leftpad function ...


The principle is sound. Would you rather there be 500 non standard implementations of left-pad or should there be one fast and optimized implementation of left pad?

The only issue that leftpad demonstrated was that there was NO method for dependency resolution of removed libraries.


Oh, agreed, just calling back to the outrage over "dude, why would you need a library for that?"


>Please don't bait other users by inviting them to downvote you or announce that you expect to get downvoted.

https://news.ycombinator.com/newsguidelines.html


Ironically, I see you're getting downvoted, but you shouldn't be because you're right: I violated HN guidelines, so I edited my comment to remove the offending passage. Thanks for pointing this out.


> What does that mean? It means that C is a high-level language.

For this specific example, isn't the reasoning backwards? The machine instruction is clearly the "high level" here, if you disregard the fact that it is, in fact, a machine instruction, thus there can be no "lower level" (ignoring microcode, an implementation detail).

Of course, you cannot say that, in general, C is lower level than X64 assembly. However, just to illustrate, replace the popcntq instruction with some Scheme function, and try arguing that it is at a "lower" level. The argument won't work. In fact, people would think you were talking about a "decompiler", instead.


One of the comments on that article links to this https://llvm.org/bugs/show_bug.cgi?id=1488 which rather suggests that this surprisingly specific optimization was added in order to speed up some calculations in one of the SPEC benchmarks.

(The benchmark in question is code from a chess program called Crafty, which represents a chess position as a bunch of 64-bit bitmaps. Unsurprisingly, 64-bit popcount is quite often useful for this.)


Compiler vendors have been doing stuff like this (recognizing specific code patterns from common benchmarks) for 30 years, back in the days when there was real competition amongst C compilers (Borland vs. Zortech, for example.)


Yup. But the HN discussion of this one seemed to be treating it as an instance of compilers being terribly clever to benefit their actual users, rather than yet another benchmark hack.


Just a side-comment: I never actually needed to count bits, but the technique illustrated in the post is quite clever! I had to stop for a second to understand what it actually does: essentially, each iteration removes the least significant set bit from the number (the right-most set bit, if you write the number left to right), and continues to do so until there are no set bits left (i.e. the number is zero)!

It's so simple, but to be honest, I don't think I would come up with it without putting in some mental effort.
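
Spelled out, the loop being discussed looks something like this (a sketch reconstructed from the description, not the article's exact code):

    #include <stdint.h>

    int count_bits(uint64_t x) {
        int n = 0;
        while (x != 0) {     /* keep going until no set bits remain */
            x &= x - 1;      /* clears the least significant set bit */
            n++;
        }
        return n;            /* the pattern clang recognizes and turns into popcnt */
    }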


There is a classical page full of such things [1]. Not so long ago researchers looked into how to come up with such tricks automagically using program synthesis [2].

[1] https://graphics.stanford.edu/~seander/bithacks.html

[2] http://www.eecs.berkeley.edu/~sseshia/pubdir/synth-icse10.pd...


I teach Data Structures at UC Berkeley, and this is one of our most challenging problems each year: find a one-line expression to get the least significant bit of a number. x & (x-1)! It exploits some very interesting magic of binary numbers.


Clarification: clear (not get) the least significant bit.
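
i.e., the difference between clearing and isolating the lowest set bit (a quick sketch):

    void lsb_demo(void) {
        unsigned x = 0x2C;                /* binary 101100 */
        unsigned cleared  = x & (x - 1);  /* 0x28 = 101000: lowest set bit cleared  */
        unsigned isolated = x & -x;       /* 0x04 = 000100: lowest set bit isolated */
        (void)cleared; (void)isolated;
    }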


Doesn't that give you numbers which are a power of two? ie, x&(x-1) is zero if x is a power of two?


Why not (x & 1) ?


This just results in the "first" bit, essentially showing whether x is odd (1) or even (0).

Example: 6 (decimal) = 110 (binary) and 1 (decimal) = 001 (binary)

(x&1) = 110 & 001 = 000 -> even number

I think the x&(x-1) is wrong though. I thought (x & -x) extracts the LSB:

010&110 = 010

110&010 = 010

001&111 = 001

111&001 = 001

001 010 100 & 110 101 100 = 000 000 100

This happens because a negative integer in two's complement is just the positive number with all bits negated, plus 1. If it were only negated, the result would be 0, as every bit would differ between the two numbers. Now, everything to the right of the LSB of the argument is by definition zero, and in the negated version by definition one. Finally, two's complement adds a one, which turns all those bits back to zero and the position where the LSB is to one. This is now the only place in the two binary numbers where both bits are one, and therefore the only place that "survives" the '&' operation and remains one. All other bits become 0.

Disclaimer: Still studying and not teaching (yet) ;)

edit: changed order of examples to clarify, that it works from positive to negative and the other way round.


Is that a common definition of LSB? I previously learned that the definition of the LSB is the "first" bit [0]. Would it be more clear to say that this algorithm finds the least-significant bit that's set?

[0] https://en.wikipedia.org/wiki/Least_significant_bit


You're right. Least significant _set_ bit is (x & -x).

Got confused by the (x & (x-1)) and thought rosstex must have meant something more complicated, because it doesn't seem to result in the usual LSB either :/


> We test people’s intelligence in a room, disconnected from the Internet, with only a pencil. But my tools should get as much or even more credit than my brain for most of my achievements. Left alone in a room with a pencil, I’d be a mediocre programmer, a mediocre scientist. I’d be no programmer at all. And this is good news. It is hard to expand or repair the brain, but we have a knack for building better tools.

Reminds me of what Kasparov said in his essay analyzing the rise of "centaurs" in chess competitions:

http://www.nybooks.com/articles/2010/02/11/the-chess-master-...

> Programming yourself by analyzing your decision-making outcomes and processes can improve results much the way that a smarter chess algorithm will play better than another running on the same computer. We might not be able to change our hardware, but we can definitely upgrade our software.

edit: fixed the errant double-copy-paste


Hey, thanks for linking this. What an entertaining, insightful piece of writing!


Did you mean to quote the same passage twice?


Seems to be built in for an extremely specific use case: https://github.com/llvm-mirror/llvm/blob/ae889d36724efb174e8...

I suppose it makes sense, given: 1) popcount is very common in bit-manipulating programs, 2) there is no standard popcount in ISO C, 3) it potentially saves 100+ instructions.

(Edit: I see a comment on Lemire's blog already points to that.)


I was not aware of the popcnt[1] instruction. It's quite amazing how LLVM can recognize that code, thanks for the link!

[1] - https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT


Yes tools are a big part of augmenting intelligence. However!! The biggest thing that computers have over humans is the replicability of programs.

A human, somewhere, figures out a correct solution or a better algorithm, then it gets propagated and replicated.

A self-driving car "learns" something in its experience on the road. That information is propagated using a pre-existing system. Now all self-driving cars eventually improve and outperform typical humans.

Watson collects disease incidence rates and diagnoses from around the world, so the "big data" eventually allows it to make decisions that a single doctor would not be able to make.

It's the easy copying of bits that makes this possible.

A photo in the 1960s could be imperfectly reproduced on a newfangled "copy machine". Before that, photos were set up and printed using huge printing presses. The more copies are made of a photo, the more likely it is to survive the destruction of the physical media it is embedded in.

But in the past, making perfect copies was very expensive. Today, the copying of bits is what has brought down the price of copying information. And this is the key to the intelligence revolution.

Humans were able to zoom ahead because of language. It provided ways of propagating information and copying it. Although the copies were imperfect, it is what made every generation smarter than the last.

Anyway, the "easy copying" is a feature of a platform. The humans are still the ones banging away at the platform. And that's why open source beat the proprietary software. The same reason why science made progress -- anyone anywhere can make an incremental improvement, which is then recognized and propagated according to some political system in place.

And that is what produces the explosion in intelligence.

The things we can improve now are the political systems that govern which information is propagated. Should they be centralized? Decentralized? Who determines what goes in? How to deal with versions and dependencies?

The fact that we keep reinventing the same things (NoSQL), going in circles (thin client / thick client, JS framework fads), locking out users (Apple, MS Secure Boot) and bloating our computer languages (C++, JS, PHP, etc.) is just a symptom that we haven't figured out the best ways to keep moving forward.


Indeed, Clang/LLVM did a stellar job here. It'd be interesting to know how (in part to understand in what situations one can expect this sort of thing to happen; for instance, I'd be surprised if IEEE floating point emulation code were compiled to a hardware floating point instruction - though perhaps I underestimate the optimization passes at work here!)


I bet there is a specific optimization that recognizes common patterns for counting the number of set bits and replaces it with popcntq. There is no popcnt() function built into C, so everyone who needs it uses one of a few common implementations.


Yes. Many compilers, including LLVM, have an "idiom recognizer" (in LLVM parlance) that detects memsets, memcpys, population counts, etc. and turns them into calls to the optimized implementations.
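
For instance, a plain clearing loop like this is typically recognized by LLVM's loop idiom recognition and turned into a memset (a sketch; whether it fires depends on the optimization level):

    void clear_bytes(char *p, int n) {
        for (int i = 0; i < n; i++)
            p[i] = 0;    /* typically replaced by memset(p, 0, n) at -O2 */
    }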



LoopIdiomRecognize::recognizePopcount? Then I think it's more about "the surprising ability of modern compilers to recognize popcount" than general "cleverness."

(I suspected it was this kind of thing, but I thought maybe it was something more generic.)


Yes, most other loops that produce the same result wouldn't be recognized, just this very specific one; note the steps:

// step 1: Check if the loop-back branch is in desirable form.

// step 2: detect instructions corresponding to "x2 = x1 & (x1 - 1)"

// step 3: Check the recurrence of variable X

// step 4: Find the instruction which count the population: cnt2 = cnt1 + 1

It's very, very specific. But they allow that the loop has something more, see the resulting transformation at the end:

   // After this step, this loop (conceptually) would look like following:
   // newcnt = __builtin_ctpop(x);
   // t = newcnt;
   // if (x)
   //  do { cnt++; x &= x-1; t--) } while (t > 0);
This can potentially be optimized further later; in the given simplest case, the rest is removed by the subsequent optimization steps.


This sounds the most plausible.


> It'd be interesting to know how

Do you mean how these things are implemented in general, or how it came to be that this particular function was reduced to one instruction? The answer to the latter was given by gjm11 here: https://news.ycombinator.com/item?id=11764800 (it was to make LLVM good at some benchmark!)

The answer to the former is: once the source code is transformed to some intermediary form, a lot of optimizations simply look for the "best match" for different patterns and then replace the matched intermediary-form sequence with the replacement sequence. Once you have that infrastructure in your compiler, it's typically not so hard to add one more "replace this with that" to get this result and win the benchmark.


The problem with this kind of idiom recognition is that it must be conservative. That rules out lots of cases involving floating-point arithmetic and memory references.

In the particular example of population count: this has been a hardware instruction on many processors since the early 60's. It is a crying shame that it's not part of the modern C language or standard library.
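
A typical example of that conservatism (a sketch): a floating-point reduction like this can't be reordered or vectorized under strict IEEE semantics, because reassociating the additions changes the rounding, unless you opt in with something like -ffast-math:

    float sum(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];   /* the additions must happen in exactly this order
                            unless the compiler is told reassociation is OK */
        return s;
    }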


Site is slow, here is the bug report:

https://llvm.org/bugs/show_bug.cgi?id=1488

Too bad it won't optimize in the face of clever code like this:

http://www.hackersdelight.org/hdcodetxt/pop.c.txt


> it gets hard to benchmark “algorithms”

The cleverest of all the C compilers that I know of do not even attempt to rewrite an algorithm into a "smarter" one (they're not there... yet). All this does is translate a popcount algorithm into a hardware instruction, which one would presume affects the performance only linearly.



