I remember learning about compiler optimization at university. Sadly, all my colleagues knew (and were taught) was Java, and the Sun JVM we used at the time didn't even do most of the stuff we learned, not even tail-recursion optimization. The opinion of most of the students was therefore that the professor should stop babbling about things that have no application in real life.
This was the one time I was on the side of academia...
Thankfully, since then the HotSpot JVM has become one of the most advanced VMs in terms of JIT optimization, doing things so wild and crazy that other JITs can only dream of them.
Quibbles: the strlen() example is poorly explained. It doesn't hoist it because it's "built-in"; it hoists it because strlen() is marked "pure" (i.e. without side effects) in the declaration. Any function can get that attribute.
Actually I think this is a glibc/BSD difference. You're right that any pure function would be hoisted in the same way, but in fact strlen() is not marked pure in my OS X header (which looks to be the same as in FreeBSD). Compiling with -fno-builtin defeats the optimization, which shows that it is due to strlen being built-in on those platforms.
On my Linux machine, on the other hand, strlen() is marked pure, and there the optimization persists even with -fno-builtin.
There must be more to it than that. For example, if something in the loop body writes to s, then the result of "strlen" could change and the optimisation will be wrong.
So, it must know a bit about what strlen() does, mustn't it -- otherwise it wouldn't know that it has to ensure that \0 isn't written to s inside the loop body.
The compiler can see the loop, so it can optimize it. Hoisting constant expressions out of loops is a long-standing optimization in all compilers. Here, the expression "s" is unchanged* inside the loop, so the compiler can hoist it. If the expression were "s[0]" it could do the same. But can it do it for strlen(s)? It can, if it knows that strlen() has no side effects. So gcc extends C's syntax to add a "pure" attribute to tell the compiler exactly that.
* Pedants will point out that all the usual subtleties of C pointer aliasing rules apply here. But for the purpose of explanation, assume that "nothing assigns through s" means "s is constant".
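Roughly what that looks like in source, as a sketch (the names my_strlen and count_spaces are made up for illustration, not from the article):

#include <stddef.h>

/* Hypothetical strlen stand-in: "pure" promises the compiler the function
   only reads memory and has no side effects, so two calls with the same
   argument can be merged. */
size_t my_strlen(const char *s) __attribute__((pure));

int count_spaces(const char *s) {
    int n = 0;
    /* Nothing in the body writes through s, so the compiler may compute
       my_strlen(s) once and hoist it out of the loop condition. */
    for (size_t i = 0; i < my_strlen(s); i++)
        if (s[i] == ' ')
            n++;
    return n;
}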
GCC can statically prove that `s` isn't changed at all in the loop body, just accessed. Thus the analysis goes: `strlen()` is pure, and its argument never changes, so its result never changes, so I can hoist it.
Note that C99 has added "restrict" for such cases, which essentially promises that the restricted pointer (and any pointers derived from it) is the only pointer to that chunk of memory.
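For instance (just a sketch, not from the article):

/* restrict promises that dst and src don't alias, so the compiler doesn't
   have to assume a store to dst[i] might change what src points at. */
void scale(float *restrict dst, const float *restrict src, float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}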
Can GCC handle more complicated loops? For example, what if I go several functions deep in my loop with s? What if I call a function in a shared object? Does gcc do this analysis for all cases where a function is marked pure, or is strlen a specific case?
The compiler can't see inside functions in the general case (because they're opaque binary blobs in a library that hasn't been linked yet), so no. It's limited to optimizing stuff it can "see", which broadly means the current function and anything inlinable into it.
But yes, if all your functions are marked inline and visible to the compiler, there's no theoretical limit to GCC's ability to hoist constants out of the innermost parts of the loop. In practice, there are surely some performance limits to how much inlining will be expanded.
Thanks. If this helps anyone else, the realisation I was missing was that the compiler must treat any write based off a pointer as a write to the thing pointed at. Which is conservative if there is no aliasing.
Oh, I remember getting upset last time with this "test". For example:
3. Multiplication by 2 to addition - integer
Will GCC transform an integer multiplication by 2 to addition?
The expression x * 2 compiles to "shift x left by 1 bit" on almost all compilers. A shift has a lot fewer dependencies than ADD/LEA and has better reciprocal throughput. Meh.
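For example (illustrative only; which instruction GCC actually emits depends on the target and flags):

int times2(int x) {
    /* At -O1 and above GCC typically compiles this to a single shift or
       lea rather than an imul. */
    return x * 2;
}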
This surprises a lot of programmers. A barrel shifter is a big circuit for an ALU. Smaller than a multiplier obviously, but much larger than a ripple carry adder and somewhat larger even than the fancy pipelined adders used in fast CPUs.
It's the kind of thing that got skimped on in Atom. Lots of stuff runs surprisingly slow on Atom. Another one I find striking is that there are two pipes for ADD/SUB instructions, but ADC/SBB (the carry/borrow variants) are not just single-issue, but apparently unpipelined. Intel lists their throughput at 3 instead of 0.5!
> Intel lists their throughput at 3 instead of 0.5!
Agner Fog lists ADC/SBB on Atom as having latency 2 and throughput 1/2. I guess Intel could have made them run with ADD/SUB-level performance in cases where there are no intra-pair dependencies on the carry/borrow flag, but the extra interlocks for handling that were probably not worth the cost.
The comment about tail calls is wrong -- it's not an optimization, it's a guarantee about semantics of your program. Not optimizing tail calls is like making array dereference run in linear time instead of constant: it changes the programs you can write.
Irrespective of how it changes the semantics of (mostly functional) languages that rely on it for iteration, tail calling is also an optimization in languages that don't need to rely on it for this, and it is not wrong to describe it as such.
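For example, in C (a sketch; whether the call actually becomes a jump depends on the target and optimization level, since C gives no guarantee):

/* The recursive call is in tail position, so GCC at -O2 can usually turn
   it into a jump and run in constant stack space, even though the C
   standard doesn't require it to. */
unsigned long fact(unsigned long n, unsigned long acc) {
    if (n <= 1)
        return acc;
    return fact(n - 1, n * acc);
}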
There are two things here: an optimization, and a semantics guarantee. Obviously, if your language provides the guarantee, then implementing it correctly isn't just an optimization. But the original article claims that the transformation proves that proper tail calling is "just an optimization", which it doesn't prove at all. It's still an important semantic guarantee which increases the expressiveness of a language. This is true regardless of whether it is the only means of expressing iteration in a language.
What the factorial example proves is that the set of recursive functions that can run in constant space is larger than just tail-recursive functions.
You're right to say that TCO guarantees increase the expressiveness of a language. If we could provide the same guarantees for an even larger class of recursive functions, it stands to reason our expressiveness would increase even further!
In a language that requires TCO, you are right: it's not about optimization, it's about correctness. In a language (like C) that has no such semantics, it is 'just' an optimization. In Scheme, however, it is a correctness issue.
I have a little doubt about number 6. A switch() without a "default:" label has undefined behavior if x is negative or above 5, while nothing would happen in the cascading if's. Wouldn't a correct optimization of that be
void function(int x) {
    switch (x) {
    case 0: f0(); break;
    case 1: f1(); break;
    case 2: f2(); break;
    case 3: f3(); break;
    case 4: f4(); break;
    case 5: f5(); break;
    default: break;
    }
}
Why would it have undefined behavior? A switch statement with no matching case and without a default has defined behavior: the switch statement simply does nothing.