The surprising cleverness of modern compilers (lemire.me)
144 points by kiyanwang on May 24, 2016 | 83 comments



There is an often-missed elephant in the compiler optimization living room in the form of poor language design leading to a tremendous amount of wasted human effort. First, effort is demanded of the programmer to write the clever optimized C code in the first place, and then effort is demanded of the compiler writers to recognize the clever optimized C code and compile it down to the native instruction that does the Right Thing.

All of this effort could be eliminated simply by adding a function to the standard library called bitcount. This would make life easier for both the programmer, who could now count bits by calling bitcount, and the compiler writers, who no longer have to write and maintain code that recognizes clever hacks.

This (counting bits) is not an isolated incident either. There is an entire field of academic endeavor devoted to this sort of thing, focusing particularly on recognizing common idioms involved in vector and array processing. All of this could be eliminated at a stroke through better language design. Instead, an entire academic field dons the ball-and-chain of a poorly designed language and then pats itself on the back for getting to the finish line panting and sweating, and people like me who say, "Um, why not just take off the extra weight and use a better designed language?" get marginalized at best and vilified at worst.


Rust does exactly this in many cases. All the integer types have methods like 'count_ones' for popcount and 'trailing_zeros' for ctz, things that are intrinsics in other languages. See http://doc.rust-lang.org/1.8.0/std/primitive.u64.html#method...


Though note that these methods are really just thin abstractions over intrinsics, since Rust doesn't want to stabilize intrinsics directly for reasons of compiler backend portability. For example, the `count_ones` method currently resolves directly to LLVM's `ctpop` intrinsic, defined at https://github.com/rust-lang/rust/blob/master/src/libcore/in... .


FYI, when you're on a GitHub page you can press the 'y' key to have the URL changed to one that includes a particular blob reference. It helps prevent bit rot. Currently, your linked line doesn't point to the `ctpop` intrinsic. I think you wanted https://github.com/rust-lang/rust/blob/34fd68668152530dcd1d0...


One big problem I see is that these tricks make the compiler unpredictable in terms of performance. Add a line somewhere in the middle, and the compiler might no longer recognize the construct, and suddenly your code runs 10x slower. Try debugging that :)


Yeah, you can't print that halfway through and get the killer performance (not that printf is fast)


Your point on vectors and array processing is not a good one: Some significant numerical algorithms can be easily expressed with nested loops, and are a pain to express with array operations. Fortran has both, and some of the best compilers turn array operations into loops, to ensure that they get both forms right.

Maybe you could spend more time understanding other points of view, rather than feeling marginalized or vilified?


> Some significant numerical algorithms can be easily expressed with nested loops

Like for example?

Just to clarify, I'm (obviously) not suggesting that nested loops be eliminated from the language, or that common optimizations on nested loops be eliminated from compilers. But having specialized code to recognize a clever idiom for counting bits is IMHO going way too far.

It's true that I'm not intimately familiar with numerical algorithms, so I could be wrong about this, but my impression is that the number of "significant numerical algorithms" that can be easily expressed as nested loops is not that large: dozens, not hundreds. Unless I'm wrong about that, I stand by my position that the best way to handle these is to enumerate them and then allow them to be expressed directly.


Such as any code that applies a 1D operator against each of 3 dimensions in turn. That's done in the commonly-used PPM hydrodynamics algorithm, which is used in astrophysics and nuclear weapon simulations.


gcc and clang both provide a builtin for this, __builtin_popcount. There are a bunch of other useful things like that too:

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Of course, that doesn't save you from writing the long version if you need portable code, but it's a step in the direction you describe.
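
For instance, a common way to get the best of both (a sketch, assuming __GNUC__ is only defined by GCC-compatible compilers and that unsigned int is 32 bits):

    #include <stdint.h>

    static inline int popcount32(uint32_t x) {
    #if defined(__GNUC__)
        return __builtin_popcount(x);   /* compiles to a popcnt instruction where available */
    #else
        int n = 0;                      /* portable fallback: the "long version" */
        while (x) { x &= x - 1; n++; }
        return n;
    #endif
    }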


Make sure to use __builtin_popcountl to count bits in a uint64_t, or you may miss half of them (yes it happened to me).


I love this example. Even though you are aware how error-prone this is, you still get it wrong.

long is 32 bits with 32-bit GCC, so it will still give the wrong result with `uint64_t`. You should be using __builtin_popcountll.
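
A quick illustration of the width issue (a sketch; the exact results assume the usual ILP32/LP64 data models):

    #include <stdint.h>

    void demo(void) {
        uint64_t b = 0xFFFFFFFFFFFFFFFFull;
        int y = __builtin_popcountl(b);   /* unsigned long: 32 bits on ILP32, so b is truncated and y == 32 there */
        int z = __builtin_popcountll(b);  /* unsigned long long: at least 64 bits everywhere, so z == 64 */
        (void)y; (void)z;
    }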

(don't get me wrong - you're not stupid, it happened to me too: https://github.com/orlp/libop/issues/1)


Good catch, thanks. (It has been a while since I've been on a 32bit platform.)


> it's a step in the direction you describe.

Well, sort of. This is not about any one particular optimization. This is about the general pattern of having one human take a high-level concept (like counting bits) and reduce it to a low-level implementation, and then having a second human write a compiler which tries to reverse-engineer the work that the first human did in order to figure out that they were trying to count bits, so that it can emit code that actually does the Right Thing.

So yes, __builtin_popcount does improve the situation a little bit, but it perpetuates a whole host of other problems. For example:

1. It has the wrong name. The right name for a function that counts bits is "bitcount", not "popcount" (and certainly not __builtin_popcount!) The __builtin is there because C doesn't have a proper namespacing system, and so using __builtin_popcount perpetuates that problem in the same way that clever compiler optimizations perpetuate other bad aspects of C's design.

2. __builtin_popcount only works on unsigned ints. If you want to count bits in anything else (a string, say, or a bignum) you're back to square 1.

To really fix this problem you need a general mechanism for expressing high-level concepts that a compiler can know about and optimize directly. For that, C is completely hopeless. You need a new language.


> a general mechanism for expressing high-level concepts

functions

> that a compiler can now about and optimize directly. For that, C is completely hopeless.

This has been done with C for generations: e.g. memcpy, strlen, sqrt, and even printf are recognized by the C compiler and custom code is generated.
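
For example, all of these are transformations that GCC and Clang typically perform at normal optimization levels (a sketch; the exact output depends on the compiler and flags):

    #include <stdio.h>
    #include <string.h>

    void builtins_demo(char *dst, const char *src) {
        size_t n = strlen("hello");   /* typically constant-folded to 5 */
        memcpy(dst, src, 8);          /* typically inlined to a single 8-byte load/store */
        printf("hello\n");            /* typically rewritten as puts("hello") */
        (void)n;
    }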

The only problem is the C community has been slow to standardize on additional functions.


No, that's not the only problem, it's one of many problems. But even by itself it's a pretty frickin' substantial problem.


It means a new language is not required. Even without a Standard update, the various C compiler vendors could get together and agree on a set of them, making a "technical report".

Anyhow, what other problems?


> the various C compiler vendors could get together and agree on a set of them, making a "technical report".

Yes, they could. But they don't.

> Anyhow, what other problems?

The lack of a real multi-dimensional array type is probably the biggest problem with regards to the goal of making code run fast. The fact that strings are required to be zero-terminated arrays of signed 8-bit integers is another huge problem. It means that you can't have native unicode strings, and that to create a substring you have to make a copy. The lack of exceptions and automatic memory management means that you need weird calling conventions if you want to return composite types, or if a function can fail. The lack of generics and namespacing means that you need to use weird naming conventions to avoid collisions, and there is a significant cognitive load placed on the programmer to figure out the right operation to use for a particular set of operand types (except for native arithmetic, but that can't be extended to any non-native types).
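
On the substring point, for example: because the only end marker is the terminating '\0' inside the buffer, a substring can't be referenced in place as a C string, so you have to copy (a sketch):

    #include <string.h>

    void substring_demo(void) {
        const char *s = "hello";
        char sub[4];
        memcpy(sub, s + 1, 3);   /* copy the three characters "ell"... */
        sub[3] = '\0';           /* ...and terminate the copy; the original can't express this view */
    }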


I'm aware of C's language problems, but I don't think the ones you mention impede encapsulation of algorithms so the compiler can do a better optimization job.

> strings are required to be zero-terminated arrays of signed 8-bit integers

This is incorrect. C's char type is optionally signed, not required to be signed. The implementations I know of make them unsigned, because signed characters make no sense.


> I don't think the ones you mention impede encapsulation of algorithms so the compiler can do a better optimization job.

Unless you can produce an actual counter-argument I guess we'll just have to agree to disagree about that.

> signed characters make no sense

I could not agree more. And yet...

    [ron@mighty:~] gcc -v
    Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
    Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn)
    Target: x86_64-apple-darwin13.4.0
    Thread model: posix
    [ron@mighty:~] cat test.c
    
    int main(int argc, char** argv) {
      unsigned char* x = "baz";
    }
    [ron@mighty:~] gcc test.c
    test.c:3:18: warning: initializing 'unsigned char *' with an expression of type
          'char [4]' converts between pointers to integer types with different sign
          [-Wpointer-sign]
      unsigned char* x = "baz";
                     ^   ~~~~~
    1 warning generated.


Dang, you're right. D's chars are unsigned, and I had gotten so used to that I forgot that C compilers often default to signed. Thanks for the correction.


The ARM C compiler I used over 20 years ago had unsigned chars. I don't know if that is still true for modern ARM.


> Unless you can produce an actual counter-argument

There's no problem having a function called popcnt(), and having the C compiler recognize it just like it recognizes memcpy(), strlen(), sqrt(), etc., and generate optimal code for it.


Sure, that is not in dispute. What is in dispute is whether the same can be done in C for matrix_multiply (or, even better, m1*m2 where m1 and m2 are matrices) or string.split(string) or strings.join(sequence_of_strings) or dict.get(key) or...


Aren't any of those negatively affected by the possibility of aliasing?


None of those I mentioned are, though C has the 'restrict' type modifier to deal with that.
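
For reference, a minimal sketch of what that looks like:

    /* 'restrict' promises the compiler that out, a and b never overlap,
       so it is free to reorder and vectorize the loads and stores. */
    void vec_add(float *restrict out, const float *restrict a,
                 const float *restrict b, int n) {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }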


> The right name for a function that counts bits is "bitcount", not "popcount"

"Population count" is what this function is typically called.

> (and certainly not __builtin_popcount!)

C doesn't have namespacing and naming the function "bitcount" would almost certainly collide with user code.

> 2. __builtin_popcount only works on unsigned ints. If you want to count bits in anything else (a string, say, or a bignum) you're back to square 1.

Not really? Just use __builtin_popcount in a loop.


> "Population count" is what this function is typically called.

I didn't know that. Thanks.

> C doesn't have namespacing

That I did know. I even pointed that out in the comment you're responding to :-)

> Just use __builtin_popcount in a loop.

But that just brings you back to the original problem when some day someone builds a machine that has hardware support for popcount on multiple words.


> But that just brings you back to the original problem when some day someone builds a machine that has hardware support for popcount on multiple words.

Is that really a problem? We've already seen that compilers are clever enough to recognize these patterns. Meanwhile any loop using __builtin_popcount will look more or less like this:

    for (int i = X; i < Y; i++) {
        total += __builtin_popcount(a[i]);
    }
... which is a substantially easier pattern to recognize than the one that clang is already using to recognize popcount.


> Is that really a problem?

I think it is if you consider wasted effort a problem. But apparently not everyone does.

> compilers are clever enough

Yes, but only because someone did a lot of work to make them clever. And they only had to do that work because someone else did a lot of work to write a clever algorithm. If we as an industry had just done the Right Thing to begin with both of those people could have used that time to do something more productive.


>I think it is if you consider wasted effort a problem. But apparently not everyone does.

You haven't considered 2 things: marginal returns and opportunity cost.

Your solution to this "wasted effort" is to build a better C (or a better language). But this overlooks the millions of hours spent in C systems, C code, C compilers, learning, and all that would have to be done once again...

If the better language already existed, and was a drop in replacement (meaning everything just works) then your argument would indeed mean an end to wasted effort if adopted.

But as it is, it means tons of EXTRA effort, and it's not clear that is better than the current effort wasted because of C's deficiencies.


> You haven't considered 2 things: marginal returns and opportunity cost.

Yes, I have.

> this overlooks the millions of hours spent in C systems

No, it doesn't. You are assuming that I am advocating re-writing existing C code in this hypothetical new language. I'm not. The hypothetical new language should be backwards-compatible with C, either a superset of C (like C++) or with a C (and C++) FFI (just about every other language out there today).

But even that is not strictly necessary. It is already common practice for programs written in different languages to interact with each other via language-independent mechanisms like pipes and sockets.

> If the better language already existed, and was a drop in replacement (meaning everything just works) then your argument would indeed mean an end to wasted effort if adopted.

Many such languages exist. The one I like to use, Common Lisp (and specifically Clozure Common Lisp) not only has a C FFI, it has an Objective-C FFI, which allows me to easily write Cocoa applications and leverage all of the power of both C and ObjC libraries on OS X with all of the efficiency of C. No wheel-reinventing required. And yes, everything Just Works. (With QuickLisp it really Just Works!)

> But as it is, it means tons of EXTRA effort

No, it doesn't. If you think it does, then you're doing it wrong.


> Many such languages exist. The one I like to use, Common Lisp (and specifically Clozure Common Lisp) not only has a C FFI, it has an Objective-C FFI, which allows me to easily write Cocoa applications and leverage all of the power of both C and ObjC libraries on OS X with all of the efficiency of C. No wheel-reinventing required. And yes, everything Just Works. (With QuickLisp it really Just Works!)

Not sure in what parallel universe starting to write software in CL, and getting hundreds of thousands of C programmers to adopt it doesn't mean "tons of EXTRA effort".

Heck, even the re-training required is millions of man-hours.

It's even worse for the other proposed solution: having legacy programs (that is, the 99.9999% of available software that's not in the "NEW LANGUAGE"), written in different languages, interact with each other via language-independent mechanisms like pipes and sockets.

And all that to gain what?

Supposedly it was to have a language where compilers could take advantage of the superior design to be even faster and more optimized than C, and you propose CL, which, after 30+ years, is slower or at best equal-ish to C.


I was not suggesting that all C programmers switch to Common Lisp (though I don't think that would be a bad thing). I was just citing that as an existence proof that it is possible to introduce a new language into the software development mix without abandoning all of the legacy code written in C.

> And all that to gain what?

The ability to write future software with less effort (and that is more reliable and secure) than past software.

> equal-ish to C.

Equal-ish is good enough given the big win in security, reliability and development effort.


>I was just citing that as an existence proof that it is possible to introduce a new language into the software development mix without abandoning all of the legacy code written in C.

I don't think anybody doubted that. At least not anybody who ever heard of the C calling conventions and FFIs.

But you started this thread advocating for something much stronger, which I read (correct me if I'm wrong) as: working around C is wasting so much effort, people should adopt a better language design for better compiler optimizations, etc.

Unless this adoption is actually a mass phenomenon (even a replacement), then instead of improving things it only splits the effort and worsens things.

>The ability to write future software with less effort (and that is more reliable and secure) than past software (...) Equal-ish is good enough given the big win in security, reliability and development effort.

At the start of this thread you were all about compiler optimizations for speed -- not about more reliability and security. Now "equal-ish" to C is good enough?


> I don't think anybody doubted that

Well, you wrote:

> this overlooks the millions of hours spent in C systems, C code, C compilers, learning, and all that would have to be done once again...

and you are wrong -- on two counts. I didn't overlook these things, and they would not have to be done again. It's a little unclear to me how you could possibly think that they would have to be done again if you didn't doubt that this problem could be solved with an FFI.

> But you started this thread advocating for something much stronger, which I read (correct me if I'm wrong) as: working around C is wasting so much effort, people should adopt a better language design for better compiler optimizations, etc.

Yes, that's right.

> Unless this adoption is actually a mass phenomenon (even a replacement), then instead of improving things, it only split all the effort and worsened things.

No, that's not right. Different tools are good for different tasks. There's nothing wrong with split efforts.

Furthermore, even if you're right and there should be only the One True Language, don't you think it would be better if it were actually a well designed one?

> At the start of this thread you were all about compiler optimizations for speed

Not quite. I was lamenting the way in which compilers are evolving: people write clever algorithms in C, and compiler writers then write compilers to try to reverse-engineer those clever algorithms, and neither of those efforts would be necessary if we had a language that could directly express the things that people want to run fast and compiler writers could just directly compile those things to run fast. Life would be better for everyone.

The problem with C is that it's a low-level language masquerading as a high-level one. It's a macro assembler for a machine that was common in the 1970s but is no longer common today. The general idea of having a high-level assembler is a good one, but it needs to be an assembler for the actual target machine, otherwise you waste a lot of effort dealing with the impedance mismatch.


I would expect `bitcount` to mean sizeof(x)*8. Population count or Hamming weight is the canonical name for this operation.


Erm - `sizeof x * CHAR_BIT` ;)


A bit is a binary digit, with a value of 0 or 1. Popcount counts the number of 1-bits; and, as said, population count actually is the canonical name.

You can apply __builtin_popcount in a loop if you want to count bits in multiple words. The point of exposing this particular operation for an unsigned int is that a recent CPU will likely have an efficient operation for it. Operating on larger units is something better left to your own code or a library. You're definitely not "back to square 1", because the intrinsic offers you an operation that would otherwise only be accessible with inline assembly.


That's the whole point: even from the long version, clang figured out that it was doing a popcount and emitted a single popcntq instruction. Nice!


Pop count returns the number of set bits, so it is most useful for figuring out if a number is a power of two: if there is more than one set bit, then it is NOT a power of two. However, a non-portable way of determining how many bits are in an object, assuming 8 bits in a byte, would be the expression `sizeof(object) * 8`.


Detecting a power of two is easy with just (x != 0 && (x & x-1) == 0).

A portable way of determining how many bits are in an object is sizeof(object) * CHAR_BIT. No reason to use the 8 version.


There are most likely many ways to check if a number is a power of two. I'm not sure if my solution is more efficient. It seems your expression would create more code than mine.
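
For comparison, the two checks side by side (a sketch; which one produces less code depends on whether the target has a popcount instruction):

    #include <stdbool.h>
    #include <stdint.h>

    bool is_pow2_popcount(uint32_t x) {   /* power of two <=> exactly one set bit */
        return __builtin_popcount(x) == 1;
    }

    bool is_pow2_mask(uint32_t x) {       /* power of two <=> clearing the lowest set bit leaves zero */
        return x != 0 && (x & (x - 1)) == 0;
    }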


There's a proposal to add them to the C++ standard, but I don't know its current status. “A constexpr bitwise operations library for C++”: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n386...


Better language design unfortunately doesn't eliminate the other elephant in the room: legacy code. Compilers do provide intrinsics for some of these hacks, counting bits for sure. But old code doesn't use them and yet the compiler should create optimal instructions.


Trying to make legacy code run faster through horrible compiler hacks rather than fixing the legacy code is a deal with the devil. The net result of that strategy is a horrible pile of hacks, many of which are there for no reason other than to undo the deleterious effects of previous hacks. The long-term cost of maintaining all that complexity is far greater than the cost of just doing the Right Thing in the first place. Nowadays there is absolutely no excuse for continuing to use languages with built-in design compromises that were put in simply because the compiler had to fit in 64k of RAM. People keep doing it because doing the Right Thing front-loads the cost. But another name for "front loading costs" is "making an investment", which is generally considered a Good Thing in other walks of life. But for some reason, not in software.


Legacy code is working code. Arguably time is better spent making old code go faster than optimizing yet to be written code.

And this is not confined to compilers. Most general purpose high performance CPUs are actually optimized to make existing code go fast and do not require recompilation to take advantage of many optimizations. Some go as far as recognizing common sequences of instructions and internally converting them to more optimized sequences (for example POWER8 conditional conversion of short jumps to conditional execution).


Option A: Maybe it's acceptable for legacy code to be compiled less-optimally? If it was fast enough on a Pentium 4, it should run fine on modern chips even without some optimizations

Option B: How often is legacy code recompiled, anyway?

Option C: Why not compile legacy code with legacy compilers? Sure the hack was removed upstream, but that doesn't stop you from using a copy of gcc circa 1995

Option D: If it needs to be maximally fast, and is routinely recompiled on a wide range of systems with a variety of compilers (e.g. Linux kernel), maybe it's worth rewriting.


Great thinking! So that way, if you need to leftpad a string, you could just import the well-maintained leftpad function ...


The principle is sound. Would you rather there be 500 non standard implementations of left-pad or should there be one fast and optimized implementation of left pad?

The only issue that leftpad demonstrated was that there was NO method for dependency resolution of removed libraries.


Oh, agreed, just calling back to the outrage over "dude, why would you need a library for that?"


>Please don't bait other users by inviting them to downvote you or announce that you expect to get downvoted.

https://news.ycombinator.com/newsguidelines.html


Ironically, I see you're getting downvoted, but you shouldn't be because you're right: I violated HN guidelines, so I edited my comment to remove the offending passage. Thanks for pointing this out.


> What does that mean? It means that C is a high-level language.

For this specific example, isn't the reasoning backwards? The machine instruction is clearly the "high level" here, if you disregard the fact that it is, in fact, a machine instruction, thus there can be no "lower level" (ignoring microcode, an implementation detail).

Of course, you cannot say that, in general, C is lower level than X64 assembly. However, just to illustrate, replace the popcntq instruction with some Scheme function, and try arguing that it is at a "lower" level. The argument won't work. In fact, people would think you were talking about a "decompiler", instead.


One of the comments on that article links to this https://llvm.org/bugs/show_bug.cgi?id=1488 which rather suggests that this surprisingly specific optimization was added in order to speed up some calculations in one of the SPEC benchmarks.

(The benchmark in question is code from a chess program called Crafty, which represents a chess position as a bunch of 64-bit bitmaps. Unsurprisingly, 64-bit popcount is quite often useful for this.)


Compiler vendors have been doing stuff like this (recognizing specific code patterns from common benchmarks) for 30 years, back in the days when there was real competition amongst C compilers (Borland vs. Zortech, for example.)


Yup. But the HN discussion of this one seemed to be treating it as an instance of compilers being terribly clever to benefit their actual users, rather than yet another benchmark hack.


Just a side-comment: I never actually needed to count bits, but the technique illustrated in the post is quite clever! I had to stop for a second to understand what it actually does: essentially, each iteration removes the least significant set bit from the number (the right-most set bit, if you write the number left to right), and continues to do so until there are no set bits left (i.e. the number is zero)!

It's so simple, but to be honest, I don't think I would come up with it without putting in some mental effort.
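
Spelled out, the loop being discussed looks something like this (a sketch reconstructed from the description, not the article's exact code):

    #include <stdint.h>

    int count_bits(uint64_t x) {
        int n = 0;
        while (x != 0) {     /* keep going until no set bits remain */
            x &= x - 1;      /* clears the least significant set bit */
            n++;
        }
        return n;            /* the pattern clang recognizes and turns into popcnt */
    }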


There is a classical page full of such things [1]. Not so long ago researchers looked into how to come up with such tricks automagically using program synthesis [2].

[1] https://graphics.stanford.edu/~seander/bithacks.html

[2] http://www.eecs.berkeley.edu/~sseshia/pubdir/synth-icse10.pd...


I teach Data Structures at UC Berkeley, and this is one of our most challenging problems each year: find a one-line expression to get the least significant bit of a number. x & (x-1)! It exploits some very interesting magic of binary numbers.


Clarification: clear (not get) the least significant bit.
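
i.e., the difference between clearing and isolating the lowest set bit (a quick sketch):

    void lsb_demo(void) {
        unsigned x = 0x2C;                /* binary 101100 */
        unsigned cleared  = x & (x - 1);  /* 0x28 = 101000: lowest set bit cleared  */
        unsigned isolated = x & -x;       /* 0x04 = 000100: lowest set bit isolated */
        (void)cleared; (void)isolated;
    }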


Doesn't that give you numbers which are a power of two? ie, x&(x-1) is zero if x is a power of two?


Why not (x & 1) ?


This just results in the "first" bit, essentially showing whether x is odd (1) or even (0).

Example: 6 (decimal) = 110 (binary) and 1 (decimal) = 001 (binary)

(x&1) = 110 & 001 = 000 -> even number

I think the x&(x-1) is wrong though. I thought (x & -x) extracts the LSB:

010&110 = 010

110&010 = 010

001&111 = 001

111&001 = 001

001 010 100 & 110 101 100 = 000 000 100

This happens because a negative integer in two's complement is just the positive number with all bits negated, plus 1. If it were only negated, the result would be 0, as every bit would differ between the two numbers. Now, everything to the right of the LSB of the argument is by definition zero, and in the negated version by definition one. Finally, two's complement adds a one, which turns all those bits back to zero and the position where the LSB is to one. This is now the only place in the two binary numbers where both bits are one, and therefore the only place that "survives" the '&' operation and remains one. All other bits become 0.

Disclaimer: Still studying and not teaching (yet) ;)

edit: changed order of examples to clarify, that it works from positive to negative and the other way round.


Is that a common definition of LSB? I previously learned that the definition of the LSB is the "first" bit [0]. Would it be more clear to say that this algorithm finds the least-significant bit that's set?

[0] https://en.wikipedia.org/wiki/Least_significant_bit


You're right. Least significant _set_ bit is (x & -x).

Got confused by the (x & (x-1)) and thought rosstex must have meant something more complicated, because it doesn't seem to result in the usual LSB either :/


> We test people’s intelligence in a room, disconnected from the Internet, with only a pencil. But my tools should get as much or even more credit than my brain for most of my achievements. Left alone in a room with a pencil, I’d be a mediocre programmer, a mediocre scientist. I’d be no programmer at all. And this is good news. It is hard to expand or repair the brain, but we have a knack for building better tools.

Reminds me of what Kasparov said in his essay analyzing the rise of "centaurs" in chess competitions:

http://www.nybooks.com/articles/2010/02/11/the-chess-master-...

> Programming yourself by analyzing your decision-making outcomes and processes can improve results much the way that a smarter chess algorithm will play better than another running on the same computer. We might not be able to change our hardware, but we can definitely upgrade our software.

edit: fixed the errant double-copy-paste


Hey, thanks for linking this. What an entertaining, insightful piece of writing!


Did you mean to quote the same passage twice?


Seems to be built in for an extremely specific use case: https://github.com/llvm-mirror/llvm/blob/ae889d36724efb174e8...

I suppose it makes sense, given: 1) popcount is very common in bit-manipulating programs, 2) there is no standard popcount in ISO C, 3) it potentially saves 100+ instructions.

(Edit: I see a comment on Lemire's blog already points to that.)


I was not aware of the popcnt[1] instruction. It's quite amazing how LLVM can recognize that code, thanks for the link!

[1] - https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT


Yes tools are a big part of augmenting intelligence. However!! The biggest thing that computers have over humans is the replicability of programs.

A human, somewhere, figures out a correct solution or a better algorithm, then it gets propagated and replicated.

A self-driving car "learns" something in its experience on the road. That information is propagated using a pre-existing system. Now all self-driving cars eventually improve and outperform typical humans.

Watson collects disease incidence rates and diagnoses from around the world, so the "big data" eventually allows it to make decisions that a single doctor would not be able to make.

It's the easy copying of bits that makes this possible.

A photo in the 1960s could be imperfectly reproduced on a newfangled "copy machine". Before that, photos were set up and printed using huge printing presses. The more copies are made of a photo, the more likely it is to survive the destruction of the physical media it is embedded in.

But in the past, making perfect copies was very expensive. Today, the copying of bits is what has brought down the price of copying information. And this is the key to the intelligence revolution.

Humans were able to zoom ahead because of language. It provided ways of propagating information and copying it. Although the copies were imperfect, it is what made every generation smarter than the last.

Anyway, the "easy copying" is a feature of a platform. The humans are still the ones banging away at the platform. And that's why open source beat the proprietary software. The same reason why science made progress -- anyone anywhere can make an incremental improvement, which is then recognized and propagated according to some political system in place.

And that is what produces the explosion in intelligence.

The things we can improve now are the political systems that govern which information is propagated. Should they be centralized? Decentralized? Who determines what goes in? How to deal with versions and dependencies?

The fact that we keep reinventing the same things (NoSQL), going in circles (thin client / thick client, JS framework fads), locking out users (Apple, MS Secure Boot) and bloating our computer languages (C++, JS, PHP, etc.) is just a symptom that we haven't figured out the best ways to keep moving forward.


Indeed, Clang/LLVM did a stellar job here. It'd be interesting to know how (in part to understand in what situations one can expect this sort of thing to happen; for instance, I'd be surprised if IEEE floating point emulation code were compiled to a hardware floating point instruction - though perhaps I underestimate the optimization passes at work here!)


I bet there is a specific optimization that recognizes common patterns for counting the number of set bits and replaces it with popcntq. There is no popcnt() function built into C, so everyone who needs it uses one of a few common implementations.


Yes. Many compilers, including LLVM, have an "idiom recognizer" (in LLVM parlance) that detects memsets, memcpys, population counts, etc. and turns them into calls to the optimized implementations.
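
For instance, a plain clearing loop like this is typically recognized by LLVM's loop idiom recognition and turned into a memset (a sketch; whether it fires depends on the optimization level):

    void clear_bytes(char *p, int n) {
        for (int i = 0; i < n; i++)
            p[i] = 0;    /* typically replaced by memset(p, 0, n) at -O2 */
    }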



LoopIdiomRecognize::recognizePopcount? Then I think it's more about "the surprising ability of modern compilers to recognize popcount" than general "cleverness."

(I suspected it was this kind of thing, but I thought maybe it was something more generic.)


Yes, most other loops that produce the same result wouldn't be recognized, just this very specific one; note the steps:

// step 1: Check if the loop-back branch is in desirable form.

// step 2: detect instructions corresponding to "x2 = x1 & (x1 - 1)"

// step 3: Check the recurrence of variable X

// step 4: Find the instruction which count the population: cnt2 = cnt1 + 1

It's very, very specific. But they allow that the loop has something more, see the resulting transformation at the end:

   // After this step, this loop (conceptually) would look like following:
   // newcnt = __builtin_ctpop(x);
   // t = newcnt;
   // if (x)
   //  do { cnt++; x &= x-1; t--) } while (t > 0);
This can potentially be optimized further later; in the given simplest case, the rest is removed by the subsequent optimization steps.


This sounds the most plausible.


> It'd be interesting to know how

Do you mean how these things are implemented in general, or how it came to be that this particular function was reduced to one instruction? The answer to the latter was given by gjm11 here: https://news.ycombinator.com/item?id=11764800 (it was to make LLVM good at some benchmark!)

The answer to the former is: once the source code is transformed to some intermediary form, a lot of optimizations simply look for the "best match" for different patterns and then replace the matched intermediary-form sequence with the replacement sequence. Once you have that infrastructure in your compiler, it's typically not so hard to add one more "replace this with that" to get this result and win the benchmark.


The problem with this kind of idiom recognition is that it must be conservative. That rules out lots of cases involving floating-point arithmetic and memory references.

In the particular example of population count: this has been a hardware instruction on many processors since the early 60's. It is a crying shame that it's not part of the modern C language or standard library.
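
A typical example of that conservatism (a sketch): a floating-point reduction like this can't be reordered or vectorized under strict IEEE semantics, because reassociating the additions changes the rounding, unless you opt in with something like -ffast-math:

    float sum(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];   /* the additions must happen in exactly this order
                            unless the compiler is told reassociation is OK */
        return s;
    }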


Site is slow, here is the bug report:

https://llvm.org/bugs/show_bug.cgi?id=1488

Too bad it won't optimize in the face of clever code like this:

http://www.hackersdelight.org/hdcodetxt/pop.c.txt


> it gets hard to benchmark “algorithms”

The cleverest of all the C compilers that I know of do not even attempt to rewrite an algorithm into a "smarter" one (they're not there... yet). All this does is translate a popcount algorithm into a hardware instruction, which one would presume affects the performance only linearly.



