In their comparisons they explain that the existing Hot Cold Split creates new functions for the cold parts which brings overheads due to calling conventions and exception handling. Their new method avoids that by using simple jumps.
I wish compilers were better at optimizing across the boundaries of functions that are visible but not inlined, so they could optimize those function call overheads away automatically. There is no reason for a compiler to limit itself to the system ABI for completely internal functions.
Besides leaving values in the registers they are already in instead of moving them to where the ABI says they should be, this could enable passing C++ objects such as std::unique_ptr in registers where the current ABI forbids it. Or eliminating no-op destructors for objects that are always moved from in the called function.
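For instance, under the Itanium C++ ABI a type with a non-trivial destructor is passed by invisible reference, so even a single-pointer-wide unique_ptr ends up going through memory. A minimal sketch (hypothetical names):

#include <memory>

struct Widget { int v; };

void sink(std::unique_ptr<Widget> p);   // callee takes ownership

void caller() {
  // unique_ptr has a non-trivial destructor, so the ABI passes it via a
  // pointer to a stack temporary instead of in a register, even though it
  // holds nothing but one pointer.
  sink(std::make_unique<Widget>());
}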
In general, I think compilers should see function boundaries in the source code as a mere hint when it comes to code generation just like they already do with the register keyword.
Is there any existing compiler work to optimize function calls where the implementation and all uses are visible in the translation unit?
> There is no reason for a compiler to limit itself to the system ABI for completely internal functions.
As far as I understand it, this limitation is also the only thing preventing tail-call elimination in C/C++. Tail-recursive functions are essentially the same as 'while' loops (execute some instructions then conditionally go back to the start); they just have different scope and calling conventions. In particular, 'while' loops are always inline and can't be referenced from other functions, but tail-recursive functions can. Hence compilers can apply more aggressive optimisations to 'while' loops, e.g. re-using the same stack frame and iterating with a conditional jump.
However, as you say, functions-as-subroutines don't necessarily need to coincide with functions-as-interface. Tail-recursive functions which are only used internally can also be optimised to re-use the same stack frame and iterate with a conditional jump. If such functions are also exposed in the interface, it's pretty trivial to expose a dummy entry point which invokes the optimised version (if we want to); although cross-unit tail calls wouldn't be eliminated in that case.
Of course, this is more useful than just writing 'while' loops in a different style (although I personally prefer recursion https://news.ycombinator.com/item?id=11150346 ), since tail calls don't have to be recursive: we can split our code up into modular, composable, self-contained functions without having to worry about stack usage (or code bloat caused by inlining into multiple locations).
> As far as I understand it, this limitation is also the only thing preventing tail-call elimination in C/C++.
Nothing "prevents" tail-call elimination in C. Compilers do it all the time.
> Tail-recursive functions which are only used internally can also be optimised to re-use the same stack frame and iterate with a conditional jump. If such functions are also exposed in the interface, it's pretty trivial to expose a dummy entry point which invokes the optimised version (if we want to)
Can you illustrate with a concrete example what you mean? If the function doesn't need a stack frame, there will be no stack frame either way. If the function needs a stack frame, that stack frame must be allocated in the function's start block either way, and tail recursive calls will of course not jump to that start block but to its successor. I don't see how being "exposed in the interface" (in C speak, static vs. non-static) would make a difference here.
The issue is basically that tail-recursive functions need to have this information recorded in their signature (for external interfaces), but this is not recorded in C and thus you can't provide this guarantee at the interface boundaries.
This is not concrete in any way. What guarantee do you mean? What "this information" should be "recorded in their signature"? What does this murky "information" have to do with enabling tail recursion?
Here's a tail-recursive C function that is compiled to tail-recursive machine code without any problems: https://gcc.godbolt.org/z/dc886a
This function can be called from external compilation units without any problems. Whether or not it is compiled to tail recursive code is an implementation detail that the caller doesn't need to know about.
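A minimal sketch of the kind of function in question (not necessarily the exact code behind the godbolt link); at -O2, GCC turns the self tail call into a plain backwards jump:

unsigned gcd(unsigned a, unsigned b) {
  if (b == 0)
    return a;
  return gcd(b, a % b);   // tail call: reuses the current stack frame
}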
> As far as I understand it, this limitation is also the only thing preventing tail-call elimination in C/C++.
That is not the case. Guaranteed TCE requires deallocating the stack frame before jumping to the target, but that is not possible when an object is allocated in that frame and its address passed to the target function.
In C++ there is also the issue of destructors needing to run after the tail-call returns (in which case it is not really a tail-call).
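For instance, a minimal sketch of the destructor case (hypothetical names):

struct Guard { ~Guard(); };   // non-trivial destructor, defined elsewhere
int work(int x);

int guarded(int x) {
  Guard g;
  return work(x);   // not really a tail call: ~Guard() must run after work() returns
}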
C/C++ compilers can and do eliminate tail calls where possible, but there's no guarantee like you get from a compiler for a functional language.
> that is not possible when an object is allocated in that frame and its address passed to the target function
Ah of course, I hadn't thought about live aliases to stack-allocated data.
You're right about destructors; I agree that they're not really tail-calls (since we're processing results).
> C/C++ compilers can and do eliminate tail calls where possible, but there's no guarantee like you get from a compiler for a functional language.
Yes, I tend to say "tail-call optimisation" when it's opportunistic (e.g. GCC/Clang), and "tail-call elimination" when it's guaranteed (e.g. Scheme/ML).
What part of the detection is giving you trouble? Detecting whether a call is recursive, or whether it is a tail call that can be optimized? The first part is relatively easy, you typically have some sort of data structure representing calls and their targets, and you know what function you're currently compiling. If the function you're compiling is the same as a call's target, you have recursion.
The core of this detection is in findTRECandidate. It iterates backwards over a basic block, looking for call instructions whose call target is the current function F:
CallInst *CI = nullptr;
BasicBlock::iterator BBI(TI);
while (true) {
  CI = dyn_cast<CallInst>(BBI);
  if (CI && CI->getCalledFunction() == &F)
    break;

  if (BBI == BB->begin())
    return nullptr;   // Didn't find a potential tail call.
  --BBI;
}
The harder part is detecting whether calls are in fact tail calls that you can replace by jumps. There are all sorts of special cases, like whether alloca() might be called by the function.
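For example, a call that looks like a tail call in the source but can't be turned into a jump, because something in the current frame stays live across it (hypothetical names, just a sketch):

int consume(int *p);

int looks_like_a_tail_call(int x) {
  int local = x * 2;
  return consume(&local);   // 'local' lives in this frame, so it can't be freed before the call
}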
I can't say how LLVM does it, but in a compiler I built I detected tail recursion by scanning the IR: if the value of a recursive call (i.e. a call to self) was immediately returned, it was considered a candidate for TCE. If the value was consumed by the parent, it wasn't.
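In source terms, roughly this distinction (a sketch with hypothetical functions):

int count_down(int n) {
  if (n == 0) return 0;
  return count_down(n - 1);     // value returned unchanged: a TCE candidate
}

int sum_down(int n) {
  if (n == 0) return 0;
  return n + sum_down(n - 1);   // value consumed by the parent: not a tail call
}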
Reading this now, cheers! Is there anything in particular I should be looking for, or will it jump out at me (bearing in mind that I've never written anything more complex than an unoptimized Lisp interpreter) ?
As I said, tail-recursion is just a particular use of tail calls. Tail calls are very simple: anything of the form 'return foo(...)'; or, in languages without 'return', a function call which is at the end of the definition (i.e. the last step).
All we require is that the current function is calling some other function, and passing on its return value unchanged. For example, let's take a simple function:
int foo(int x) {
  return bar(x);
}
'return bar(x);' is a tail-call, since 'foo' isn't doing anything with the result; calling 'foo' is equivalent to calling 'bar'. In this case the compiler could actually replace every occurrence of 'foo' with 'bar' (e.g. inlining it), but we can't do that in general. However, when we call 'bar' we certainly don't need to keep any of 'foo's machinery around (stack frames, local variables, register contents, etc.); we can just start executing the code for 'bar' (i.e. a jump instruction).
More realistically, 'foo' and 'bar' might not be equivalent, e.g.
int foo(int x) {
  int y = baz(x);
  return bar(x + 3 * y);
}
Here we can't just replace calls to 'foo' with calls to 'bar', since they do different things. However, notice the order that things will run in: (1) 'baz(x)' will be called, (2) 'x + 3 * y' will be calculated, (3) 'bar' will be called. Even though these 'foo' and 'bar' functions are very different to begin with, once we reach step three (the call to 'bar') the remaining instructions are the same (since 'foo' is calling 'bar' and returning its result unchanged; i.e. the definition of a tail-call!). Hence we can still discard 'foo's machinery and just jump to 'bar' (e.g. we can throw away 'y'; and the compiler can ensure the result of (2) is put where 'bar' expects its argument).
If 'foo' and 'bar' are the same function then we happen to have tail recursion. For example:
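A minimal sketch, written accumulator-style since a plain 'return n * fac(n-1)' would not have the call in tail position:

unsigned fac(unsigned n, unsigned acc) {
  if (n < 2) return acc;         // base case: fac(5, 1) == 120
  return fac(n - 1, n * acc);    // tail call: the result is returned unchanged
}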
Here 'return fac(n-1, n * acc)' is a tail-call; it also happens to be tail-recursion, since it's calling the 'fac' function from within the 'fac' function. We don't need to keep the stack frame for fac(n, acc) around, we can just overwrite it with the stack frame of fac(n-1, n * acc), and so on.
We can also have mutually tail-recursive functions, e.g.
unsigned odd(unsigned n);       // forward declaration so 'even' can call it

unsigned even(unsigned n) {
  if (n) return odd(n - 1);     // mutual tail call
  return 1;
}

unsigned odd(unsigned n) {
  if (n) return even(n - 1);    // mutual tail call
  return 0;
}
This is harder to spot than direct tail-recursion, but it doesn't matter since tail-recursion isn't particularly special. As long as we implement tail-calls efficiently, then the benefits apply to tail-recursive functions (like 'fac') and mutually tail-recursive functions (like 'even' and 'odd'), as well as many other places which just-so-happen to use tail-calls.
Intuitively, any time we find ourselves using a "main loop", or "driver loop", etc. which just calls out to a sequence of functions, possibly passing the return values of one into arguments of the next, that's equivalent to having each of those functions make a tail-call to the next one (i.e. such loops are "trampolines").
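A minimal sketch of such a driver loop, with hypothetical names (each step returns the next step to run instead of tail-calling it directly; the individual step functions are not shown):

struct step;
typedef struct step (*step_fn)(int *state);
struct step { step_fn next; };   // next == NULL means "done"

void trampoline(step_fn start, int *state) {
  struct step s = { start };
  while (s.next)
    s = s.next(state);           // run one step, then the one it hands back
}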
> Tail calls are very simple: anything of the form 'return foo(...)'…
It's a bit more complicated than that since there may be destructors (in C++, or C with certain extensions) or stack-frame cleanup which occur between 'foo' returning and the 'return' statement itself. Tail call elimination is only possible when the stack frame can be freed before the final function call; if the address of any local variable escapes the current function then it might not be an option even if the source otherwise appears tail-recursive. TCE is simpler in garbage-collected languages which allocate their stack frames from the heap (and typically lack precise destructors) since these cleanup operations can be left to the GC.
I've seen a poster presentation at the student research competition of a PL conference where the student created a tool to do this after the fact on binaries. It looked quite promising, but I think preliminary results were single digit percentage speedup for most benchmarks. I'll have to see if I can find it again.
Single digits is a huge improvement. This is single digits across billions of programs with no change on their part. In the email Google claimed single digit improvements and I'm sure they are thrilled. 1% CPU savings across Google is many, many millions of dollars a year.
If you contribute a 1% CPU savings change to GCC or LLVM you would have a very hard time being carbon-positive over your lifespan.
Functions by default have external linkage, and as such can be called by anyone. So of course this could only apply to functions that are static or internal in some way. Once you have that down, you have to deal with the fact that the things you would want to optimize may differ at each call site (one may like to keep rdx free, one may be able to give up rdi) so I am unsure it would be worth the effort to not follow the ABI.
> So of course this could only apply to functions that are static or internal in some way.
Or with LTO / -fwhole-program: functions that have hidden visibility. Which is all of them for Windows executables and the default for DLLs (and -fvisibility=hidden is a sane default for other systems too).
> Once you have that down, you have to deal with the fact that the things you would want to optimize may differ at each call site (one may like to keep rdx free, one may be able to give up rdi) so I am unsure it would be worth the effort to not follow the ABI.
Sure, that makes things harder. But in many cases there will only be one call site. In others you can still figure out which optimizations you can apply to all of them - for example, passing std::unique_ptr and similar objects in a register should always be doable as long as you control all call sites. Further, the compiler has the option to clone functions in order to apply optimizations specific to a group of call sites if it decides that an optimization is worth the increased code size. And some optimizations will be purely on the caller side - for example not spilling caller-saved registers if you know they won't be clobbered by the callee.
Inlining and outlining are per se probably some of the least destructive optimizations for debugging: they don't affect the order or existence of instructions, compared to the naïve translation model, merely where the code is located within the executable.
(Admittedly, generating the debugging records to be able to refer to variables in the non-outlined portion of the function from the outlined function is probably not done, although it's probably doable with the current DWARF expression syntax. This is much like how most of the <optimized out> variables you see in gdb could be specified with DWARF information but aren't because the compiler doesn't maintain the information.)
Generally trying to debug optimised code is a nightmare anyway. Most debugging is performed on non-optimised builds, which wouldn’t be affected by the proposed optimisation.
There's actually a compiler that does all these things and allows debugging of optimised code whilst seeing the unoptimised form - Sulong.
Sulong is an LLVM JIT-compiling engine that runs on GraalVM. So it uses the same infrastructure of the JVM, but without JVM bytecode getting into the picture. Instead the JVM runs LLVM bitcode directly. But it does so in the standard way which gives many benefits:
• Splitting of hot/cold paths, as in this work.
• Cross-function optimisation and inlining regardless of visibility, as requested above.
• Ability to debug an optimised program, because the JVM can 'deoptimize' the parts being inspected or debugged.
• Run the Linux version of the bitcode cross-platform.
• In "Managed Sulong" (which is an enhanced payware version), all allocations are garbage collected and bounds checked. Yes even for C/C++ programs, just needing a recompile. It's just about possible to do this within the rules of well defined C/C++, as long as the program doesn't try to do non-portable or non-defined things like pass stack pointers between threads. Managed Sulong can also sandbox LLVM code modules, and redirect IO back into the JVM for mapping via the pluggable socket/filesystem layers.
• Cross-language interop, including recent experimental support for e.g. casting Java or Ruby or Python objects to C++ objects and calling methods on them.
• All the same compiler optimisations as LLVM does still apply.
It's a pretty amazing and under-appreciated project. On the other hand the code being run inherits the limits of the JVM, e.g. you shouldn't try and fork the process (of course, lots of runtimes don't like being forked and on Windows you can't do it at all).
Or at least any that supports gathering and utilizing profiling information. Last time I checked, that didn't include Rust yet for instance, but my information could be out of date there.
Can I create profile data by running the instrumented binaries through testcases? Perhaps integration test cases which represent typical end to end functionality of my application.
Note that LLVM already has a function outliner; you may have seen these cold paths already if you've ever looked inside a binary on macOS and noticed functions with ".cold." in their names, which contain the unlikely-to-execute code cut out of the functions they came from. This appears to be an improvement on that effort.
Right. The improvement that this work brings is that it performs the function split very late and at a low level.
Basically, while the previous outliner split a function into two functions (the hot one literally calling the cold one as needed) this new thing takes a single function and splits it into two parts connected to each other by jumps. The cold part of the function isn't really a function --- it's just a group of basic blocks that happen to be located far away from the other group of basic blocks.
By avoiding the call into the cold function and the return to the hot function, the generated code can be smaller and more register-efficient.
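For intuition, the kind of source shape whose unlikely branch ends up in the cold part; a sketch with hypothetical helpers, where the hint could come from profile data or, as here, from GCC/Clang's __builtin_expect:

void report_error(int x);   // hypothetical helpers
void do_work(int x);

void handle(int x) {
  if (__builtin_expect(x < 0, 0)) {   // marked unlikely
    report_error(x);                  // cold: placed far away from the hot code
    return;
  }
  do_work(x);                         // hot path stays small and contiguous
}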
That'll make it more stack-efficient too, since it doesn't have to go through the same dance as a function call. The old approach probably didn't make those calls with the full C ABI, but it would still push at least a return address and, I suspect, save some registers across the call.
There’s no need, though: the entry point and return address are unique; it’s literally code that is sliced out, jumped to, and it jumps back to the function it was cut out of. The only thing you’d need to save is a register or two if you can’t make the jump without doing some math.
x86 should always be able to jump directly, and ARM sets aside x16 and x17 just for this kind of math. But all these jumps should be PC-relative, so you shouldn't have to clobber anything anyway.
Might be confusing to debuggers if the address-space range of a single function is discontiguous. Does the cold portion get an independent symbol with derived name, like, e.g., "Blocks?"
I'm interested in reading some more on this concept. Wikipedia, which is usually my first stop for a broad overview and maybe some linked research, seems to suggest that profile-directed/guided and feedback-guided are the same thing. Is there anywhere I can read about the varying approaches?
They are the same thing. The GP makes a distinction which does not exist. Witness the fact that two different compilers refer to the exact same feedback technique as either AutoFDO or SamplePGO.
Ah, I definitely worked in a place where that distinction in terminology was used, but maybe it isn't widespread. In any case, whatever you call either one of them, sampling from production processes can have benefits over sampling from synthetic workloads.
Maybe so. I think it was a novel idea to me because my intuition around profiling was formed with an assumption of delivering binaries to users, not running server processes. (It's probably also a pre-internet bias of thinking it would be hard to get prod data, whereas profiling data generated in a horrible enormous compile&run&profile&recompile process at least doesn't need to "go" anywhere.)
There is a distinction, even if you don't like the OP's terminology for differentiating the two things:
- With traditional PGO (ie -fprofile-generate, then -fprofile-use) you first generate a separate instrumented binary that records profile information. You wouldn't generally want to use this binary in production because of the overhead this profile generation incurs.
- With sample driven PGO (ie -fprofile-sample-use) you use external tools to sample profile information from an uninstrumented binary - the same binary you'd use in production.
It always surprises me how many people are familiar with LLVM internals. Just looking at the 60 unique commenters at the time of posting this is blowing my mind.
I wouldn't have expected any functions to be 5KiB or larger. That seems like a lot of assembly to me, and so it makes sense to me that these functions would be "accidentally" generated by inlining.
I'm curious what percent of functions are this large?
Protocol Buffers message decoders tend to be extremely large, since each message has an auto-generated parse function that will tend to inline everything. My local build of protobuf contains 20 functions each larger than 5KB, including a 10KB-long specialization of std::sort.
That google paper has a nice observation on glibc’s memcmp:
“memcmp clearly stands out of the correlation between call frequency and function size in Figure 8. It is both extremely frequent, and, at almost 6 KiB of code, 10× larger than memcpy which is conceptually of similar complexity. Examining its layout and execution patterns (Figure 13) suggests that it does suffer from the high amount of fragmentation we observed fleetwide in the previous section. While covering 90% of executed instructions in memcmp only requires two cache lines, getting up to 99% coverage requires 41 lines, or 2.6 KiB of cache capacity. Not only is more than 50% of code cold, but it is also interspersed between the relatively hot regions, and likely unnecessarily brought in by prefetchers. Such bloat is costly – performance counter data collected by GWP indicates that 8.2% of all i-cache misses among the 100 hottest functions are from memcmp alone.
A closer look at the actual code from glibc can explain the execution patterns in Figure 13. It is hand-written in assembly and precompiled, with extensive manual loop unrolling, many conditional cases for the various alignments of the two source arrays, and large jump tables.
In our experience, code usually evolves into similar state from over-reliance on micro-optimization and micro-benchmarking. While writing in assembly can in rare cases be a necessary evil, it prevents the compiler from doing even the most basic feedback-directed code layout optimizations.
[…]
We tested this hypothesis by macro-benchmarking a version of memcmp that is specifically optimized for code size (only 2 i-cache lines) and locality. In short, it only special-cases very small string sizes (to aid the compiler in inlining very fast cases) and falls back to rep cmps for larger compares. Even though it achieves slightly lower throughput numbers than the glibc version in micro-benchmarks, this simple proof-of-concept showed an overall 0.5%-1% end-to-end performance improvement on large-footprint workloads like web search.”
I would guess the assembly version can be improved by moving those cold blocks out of the cache lines of the hot code.
Totally non-scientific measurement (I took a few applications I had lying around, got size of symbols with nm -S) suggests that you're looking at around 0.1-1% of all functions are larger than 5KiB.
Interestingly, the largest function in all cases seemed to be something on the order of "initialize a large static hashtable."
I've seen plenty of c/c++/java functions that exceeded 200 lines of high level code. It seems reasonable to me that such a function would be 5k bytes of machine code, especially on a 64-bit architecture.
Not that such functions are the right way to structure your program...
This, plus multi-return (pushing multiple code addresses in the stack frame), together mean, I think, that we can just use Result/Either and never unwind, while leaving no performance behind.
There are a ton of great books out there! If you're looking for a bit of the whole shebang, check out Compilers: Principles, Techniques, and Tools (aka, The Dragon Book). It's an older book, but a lot of what is described in there is still very relevant and good info. If you want a more high level overview at first, I'd recommend Crafting Interpreters [0] and Writing a Compiler/Interpreter in Go [1]. The latter two focus more on lexing, parsing, and intermediate code generation.
The LLVM way of doing C++ is basically a wet dream when it comes to C++-isms e.g. not many C++ projects do their own dynamic casts and get away with it.
LLVM is fairly well-structured, especially compared to GCC. I don't find it scary but it is a big fat fucker of a codebase.
> The LLVM logo is a stylized wyvern (a kind of dragon). Dragons have connotations of power, speed and intelligence, and can also be sleek, elegant, and modular (err, maybe not). In addition, there is a series of influential compiler books going back to 1977 which featured dragons on the cover.
Leave it to HN to jump to nitpick something as meaningless as a wyvern vs. a dragon, but not even take 5 seconds to check the primary source.
2% better perf isn't anything to sneeze at but also not super drastic. Will be curious to see if this unlocks more optimizations over time.
I'm also curious if there's been any work on organizing code layout so that it can be paged out efficiently to reduce memory pressure (so that you're using all the code in the pages you bring in) and to reduce paging thrash (so you're not evicting a code page only to bring it back for a few hundred bytes).
Given the greatly improved TLB hit rate, I sort of wonder if the test isn't abusing the code enough. Does this change the inflection point where throughput falls from trying to run too many tasks in parallel or sequentially? 2% response time won't do much, but 10% higher peak load would change quite a few priorities on a lot of backlogs.
It's not too surprising to me. If TLB misses cause your program to run 6% slower, wouldn't you expect a 30% reduction in them to be roughly a 2% overall improvement (0.3 × 6% ≈ 1.8%)? Why would it be more? If you're surprised that TLB misses are not so costly, consider that cache locality and CPU prediction largely hide the cost of misses (cache locality means the miss rate is low to begin with; prediction reduces the cost of a miss). TLBs are also super complex, involving a few layers of caching (the TLB itself, page-table entries cached in L1, then L2, L3 and finally RAM).
So to use this new optimization pass you have to specify it explicitly with an additional command line flag instead of it simply being turned on as an ordinary part of PGO?
It seems like most people who just want to squeeze the last drop of speed out of their code will probably never become aware of it then unless they are specifically seeking it out
It's an experimental feature. If it works well for a while, it may become part of the default. But it should definitely not be forced on everyone if it's not yet sure how well it works.
One issue with PGO is you are optimizing for the particular subset of use cases which are profiled, on the exact machine you profile on. The 2% quoted improvement will not necessarily generalize across all use cases for the software and all machines it runs on. In fact, you may often be taking that 2% or more away from other use cases.
> One issue with PGO is you are optimizing for the particular subset of use cases which are profiled, on the exact machine you profile on.
The exact machine part is not true. There's nothing about this particular optimization that's machine specific - e.g. as the original post explains, this optimization gives performance boost on Intel and AMD, on Intel due to reduction in iTLB misses, and on AMD due to reduction in L1 and L2 icache misses. i.e. this kind of "working-set" reduction translates to any platform.
> In fact, you may often be taking that 2% or more away from other use cases.
In general, it is correct that profile-guided optimization can theoretically reduce performance, as some aggressive optimizations are only done with a profile because of the inherent trade-offs they carry (e.g. aggressive inlining, which can hurt performance if the functions that are actually hot turn out to be entirely different).
However, empirically this is not true in most cases, unless you picked really bad training input and your code behaves very differently under different inputs. Moreover, nowadays with sampled profiles, which you can collect from real production runs, this is extremely unlikely to happen.
Yes, taking advantage of PGO is hard. But with the right infrastructure you get very accurate profiles. For example many large companies will gather the profiles from the previous version running in production. I have even seen cases where the new version was fed replayed requests then recompiled with the recorded profiles.
So yes, PGO is hard, but if you collect representative profiles it is a massive improvement.
The technique itself isn't new in clang either. This is about a new implementation of it, where the difference to the existing implementation is that this happens later in the process (it's deferred to the machine-specific code generation phase, whereas the existing implementation happens in the middle-end and is target-agnostic).
The other major piece is that the hot-cold split is more efficient. Rather than thunking out the cold code via a function call, it just jumps to the basic block, avoiding register spilling and function call overhead.
The function call overhead itself is irrelevant, because by definition these blocks are cold. The saving/restoring of callee-clobbered registers does affect the code size of the hot function though, so that's important.
This was a common practice at Microsoft in the mid 90s, maybe 95-97ish. The toolset involved was called BBT: "Basic Block Tools". Windows, SQL Server, and others were post-processed with profiling data. It also deduped basic blocks, reducing binary bloat from inlining even without profiling data; it just needed some additional info in the debug symbols to work.
Identical code folding still breaks debugging with modern llvm. It can make it seem like the call came from an impossible place in a stack sample. Is it something that Microsoft solved long ago?