In their comparisons they explain that the existing Hot Cold Split creates new functions for the cold parts which brings overheads due to calling conventions and exception handling. Their new method avoids that by using simple jumps.
I wish compilers were better at optimizing across the boundaries of functions that are visible but not inlined, so they could optimize those function call overheads away automatically. There is no reason for a compiler to limit itself to the system ABI for completely internal functions.
Besides leaving values in the registers they are already in instead of moving them to where the ABI says they should be, this could enable passing C++ objects such as std::unique_ptr in registers where the current ABI forbids it. Or eliminating no-op destructors for objects that are always moved from in the called function.
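For instance, under the Itanium C++ ABI a type with a non-trivial destructor is passed by invisible reference, so even a single-pointer-wide unique_ptr ends up going through memory. A minimal sketch (hypothetical names):

#include <memory>

struct Widget { int v; };

void sink(std::unique_ptr<Widget> p);   // callee takes ownership

void caller() {
  // unique_ptr has a non-trivial destructor, so the ABI passes it via a
  // pointer to a stack temporary instead of in a register, even though it
  // holds nothing but one pointer.
  sink(std::make_unique<Widget>());
}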
In general, I think compilers should see function boundaries in the source code as a mere hint when it comes to code generation just like they already do with the register keyword.
Is there any existing compiler work to optimize function calls where the implementation and all uses are visible in the translation unit?
> There is no reason for a compiler to limit itself to the system ABI for completely internal functions.
As far as I understand it, this limitation is also the only thing preventing tail-call elimination in C/C++. Tail-recursive functions are essentially the same as 'while' loops (execute some instructions then conditionally go back to the start); they just have different scope and calling conventions. In particular, 'while' loops are always inline and can't be referenced from other functions, but tail-recursive functions can. Hence compilers can apply more aggressive optimisations to 'while' loops, e.g. re-using the same stack frame and iterating with a conditional jump.
However, as you say, functions-as-subroutines don't necessarily need to coincide with functions-as-interface. Tail-recursive functions which are only used internally can also be optimised to re-use the same stack frame and iterate with a conditional jump. If such functions are also exposed in the interface, it's pretty trivial to expose a dummy entry point which invokes the optimised version (if we want to); although cross-unit tail calls wouldn't be eliminated in that case.
Of course, this is more useful than just writing 'while' loops in a different style (although I personally prefer recursion https://news.ycombinator.com/item?id=11150346 ), since tail calls don't have to be recursive: we can split our code up into modular, composable, self-contained functions without having to worry about stack usage (or code bloat caused by inlining into multiple locations).
> As far as I understand it, this limitation is also the only thing preventing tail-call elimination in C/C++.
Nothing "prevents" tail-call elimination in C. Compilers do it all the time.
> Tail-recursive functions which are only used internally can also be optimised to re-use the same stack frame and iterate with a conditional jump. If such functions are also exposed in the interface, it's pretty trivial to expose a dummy entry point which invokes the optimised version (if we want to)
Can you illustrate with a concrete example what you mean? If the function doesn't need a stack frame, there will be no stack frame either way. If the function needs a stack frame, that stack frame must be allocated in the function's start block either way, and tail recursive calls will of course not jump to that start block but to its successor. I don't see how being "exposed in the interface" (in C speak, static vs. non-static) would make a difference here.
The issue is basically that tail-recursive functions need to have this information recorded in their signature (for external interfaces), but this is not recorded in C and thus you can't provide this guarantee at the interface boundaries.
This is not concrete in any way. What guarantee do you mean? What "this information" should be "recorded in their signature"? What does this murky "information" have to do with enabling tail recursion?
Here's a tail-recursive C function that is compiled to tail-recursive machine code without any problems: https://gcc.godbolt.org/z/dc886a
This function can be called from external compilation units without any problems. Whether or not it is compiled to tail recursive code is an implementation detail that the caller doesn't need to know about.
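A minimal sketch of the kind of function in question (not necessarily the exact code behind the godbolt link); at -O2, GCC turns the self tail call into a plain backwards jump:

unsigned gcd(unsigned a, unsigned b) {
  if (b == 0)
    return a;
  return gcd(b, a % b);   // tail call: reuses the current stack frame
}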
> As far as I understand it, this limitation is also the only thing preventing tail-call elimination in C/C++.
That is not the case. Guaranteed TCE requires deallocating the stack frame before jumping to the target, but that is not possible when an object is allocated in that frame and its address passed to the target function.
In C++ there is also the issue of destructors needing to run after the tail-call returns (in which case it is not really a tail-call).
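For instance, a minimal sketch of the destructor case (hypothetical names):

struct Guard { ~Guard(); };   // non-trivial destructor, defined elsewhere
int work(int x);

int guarded(int x) {
  Guard g;
  return work(x);   // not really a tail call: ~Guard() must run after work() returns
}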
C/C++ compilers can and do eliminate tail calls where possible, but there's no guarantee like you get from a compiler for a functional language.
> that is not possible when an object is allocated in that frame and its address passed to the target function
Ah of course, I hadn't thought about live aliases to stack-allocated data.
You're right about destructors; I agree that they're not really tail-calls (since we're processing results).
> C/C++ compilers can and do eliminate tail calls where possible, but there's no guarantee like you get from a compiler for a functional language.
Yes, I tend to say "tail-call optimisation" when it's opportunistic (e.g. GCC/Clang), and "tail-call elimination" when it's guaranteed (e.g. Scheme/ML).
What part of the detection is giving you trouble? Detecting whether a call is recursive, or whether it is a tail call that can be optimized? The first part is relatively easy, you typically have some sort of data structure representing calls and their targets, and you know what function you're currently compiling. If the function you're compiling is the same as a call's target, you have recursion.
The core of this detection is in findTRECandidate. It iterates backwards over a basic block, looking for call instructions whose call target is the current function F:
CallInst *CI = nullptr;
BasicBlock::iterator BBI(TI);
while (true) {
  CI = dyn_cast<CallInst>(BBI);
  if (CI && CI->getCalledFunction() == &F)
    break;

  if (BBI == BB->begin())
    return nullptr;   // Didn't find a potential tail call.
  --BBI;
}
The harder part is detecting whether calls are in fact tail calls that you can replace by jumps. There are all sorts of special cases, like whether alloca() might be called by the function.
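For example, a call that looks like a tail call in the source but can't be turned into a jump, because something in the current frame stays live across it (hypothetical names, just a sketch):

int consume(int *p);

int looks_like_a_tail_call(int x) {
  int local = x * 2;
  return consume(&local);   // 'local' lives in this frame, so it can't be freed before the call
}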
I can't say how LLVM does it, but in a compiler I built I detected tail recursion by scanning the IR: if the value of a recursive call (i.e. a call to self) was immediately returned, it was considered a candidate for TCE. If the value was consumed by the parent, it wasn't.
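In source terms, roughly this distinction (a sketch with hypothetical functions):

int count_down(int n) {
  if (n == 0) return 0;
  return count_down(n - 1);     // value returned unchanged: a TCE candidate
}

int sum_down(int n) {
  if (n == 0) return 0;
  return n + sum_down(n - 1);   // value consumed by the parent: not a tail call
}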
Reading this now, cheers! Is there anything in particular I should be looking for, or will it jump out at me (bearing in mind that I've never written anything more complex than an unoptimized Lisp interpreter) ?
As I said, tail-recursion is just a particular use of tail calls. Tail calls are very simple: anything of the form 'return foo(...)'; or, in languages without 'return', a function call which is at the end of the definition (i.e. the last step).
All we require is that the current function is calling some other function, and passing on its return value unchanged. For example, let's take a simple function:
int foo(int x) {
  return bar(x);
}
'return bar(x);' is a tail-call, since 'foo' isn't doing anything with the result; calling 'foo' is equivalent to calling 'bar'. In this case the compiler could actually replace every occurrence of 'foo' with 'bar' (e.g. inlining it), but we can't do that in general. However, when we call 'bar' we certainly don't need to keep any of 'foo's machinery around (stack frames, local variables, register contents, etc.); we can just start executing the code for 'bar' (i.e. a jump instruction).
More realistically, 'foo' and 'bar' might not be equivalent, e.g.
int foo(int x) {
  int y = baz(x);
  return bar(x + 3 * y);
}
Here we can't just replace calls to 'foo' with calls to 'bar', since they do different things. However, notice the order that things will run in: (1) 'baz(x)' will be called, (2) 'x + 3 * y' will be calculated, (3) 'bar' will be called. Even though these 'foo' and 'bar' functions are very different to begin with, once we reach step three (the call to 'bar') the remaining instructions are the same (since 'foo' is calling 'bar' and returning its result unchanged; i.e. the definition of a tail-call!). Hence we can still discard 'foo's machinery and just jump to 'bar' (e.g. we can throw away 'y'; and the compiler can ensure the result of (2) is put where 'bar' expects its argument).
If 'foo' and 'bar' are the same function then we happen to have tail recursion. For example:
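A minimal sketch, written accumulator-style since a plain 'return n * fac(n-1)' would not have the call in tail position:

unsigned fac(unsigned n, unsigned acc) {
  if (n < 2) return acc;         // base case: fac(5, 1) == 120
  return fac(n - 1, n * acc);    // tail call: the result is returned unchanged
}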
Here 'return fac(n-1, n * acc)' is a tail-call; it also happens to be tail-recursion, since it's calling the 'fac' function from within the 'fac' function. We don't need to keep the stack frame for fac(n, acc) around, we can just overwrite it with the stack frame of fac(n-1, n * acc), and so on.
We can also have mutually tail-recursive functions, e.g.
unsigned odd(unsigned n);       // forward declaration so 'even' can call it

unsigned even(unsigned n) {
  if (n) return odd(n - 1);     // mutual tail call
  return 1;
}

unsigned odd(unsigned n) {
  if (n) return even(n - 1);    // mutual tail call
  return 0;
}
This is harder to spot than direct tail-recursion, but it doesn't matter since tail-recursion isn't particularly special. As long as we implement tail-calls efficiently, then the benefits apply to tail-recursive functions (like 'fac') and mutually tail-recursive functions (like 'even' and 'odd'), as well as many other places which just-so-happen to use tail-calls.
Intuitively, any time we find ourselves using a "main loop", or "driver loop", etc. which just calls out to a sequence of functions, possibly passing the return values of one into arguments of the next, that's equivalent to having each of those functions make a tail-call to the next one (i.e. such loops are "trampolines").
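A minimal sketch of such a driver loop, with hypothetical names (each step returns the next step to run instead of tail-calling it directly; the individual step functions are not shown):

struct step;
typedef struct step (*step_fn)(int *state);
struct step { step_fn next; };   // next == NULL means "done"

void trampoline(step_fn start, int *state) {
  struct step s = { start };
  while (s.next)
    s = s.next(state);           // run one step, then the one it hands back
}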
> Tail calls are very simple: anything of the form 'return foo(...)'…
It's a bit more complicated than that since there may be destructors (in C++, or C with certain extensions) or stack-frame cleanup which occur between 'foo' returning and the 'return' statement itself. Tail call elimination is only possible when the stack frame can be freed before the final function call; if the address of any local variable escapes the current function then it might not be an option even if the source otherwise appears tail-recursive. TCE is simpler in garbage-collected languages which allocate their stack frames from the heap (and typically lack precise destructors) since these cleanup operations can be left to the GC.
I've seen a poster presentation at the student research competition of a PL conference where the student created a tool to do this after the fact on binaries. It looked quite promising, but I think preliminary results were single digit percentage speedup for most benchmarks. I'll have to see if I can find it again.
Single digits is a huge improvement. This is single digits across billions of programs with no change on their part. In the email Google claimed single digit improvements and I'm sure they are thrilled. 1% CPU savings across Google is many, many millions of dollars a year.
If you contribute a 1% CPU savings change to GCC or LLVM you would have a very hard time being carbon-positive over your lifespan.
Functions by default have external linkage, and as such can be called by anyone. So of course this could only apply to functions that are static or internal in some way. Once you have that down, you have to deal with the fact that the things you would want to optimize may differ at each call site (one may like to keep rdx free, one may be able to give up rdi) so I am unsure it would be worth the effort to not follow the ABI.
> So of course this could only apply to functions that are static or internal in some way.
Or with LTO / -fwhole-program: functions that have hidden visibility. Which is all of them for Windows executables and the default for DLLs (and -fvisibility=hidden is a sane default for other systems too).
> Once you have that down, you have to deal with the fact that the things you would want to optimize may differ at each call site (one may like to keep rdx free, one may be able to give up rdi) so I am unsure it would be worth the effort to not follow the ABI.
Sure, that makes things harder. But in many cases there will only be one call site. In others you can still figure out which optimizations you can apply to all of them - for example, passing std::unique_ptr and similar objects in a register should always be doable as long as you control all call sites. Further, the compiler has the option to clone functions in order to apply optimizations specific to a group of call sites if it decides that an optimization is worth the increased code size. And some optimizations will be purely on the caller side - for example not spilling caller-saved registers if you know they won't be clobbered by the callee.
Inlining and outlining are per se probably some of the least destructive optimizations for debugging: they don't affect the order or existence of instructions, compared to the naïve translation model, merely where the code is located within the executable.
(Admittedly, generating the debugging records to be able to refer to variables in the non-outlined portion of the function from the outlined function is probably not done, although it's probably doable with the current DWARF expression syntax. This is much like how most of the <optimized out> variables you see in gdb could be specified with DWARF information but aren't because the compiler doesn't maintain the information.)
Generally trying to debug optimised code is a nightmare anyway. Most debugging is performed on non-optimised builds, which wouldn’t be affected by the proposed optimisation.
There's actually a compiler that does all these things and allows debugging of optimised code whilst seeing the unoptimised form - Sulong.
Sulong is an LLVM JIT-compiling engine that runs on GraalVM. So it uses the same infrastructure of the JVM, but without JVM bytecode getting into the picture. Instead the JVM runs LLVM bitcode directly. But it does so in the standard way which gives many benefits:
• Splitting of hot/cold paths, as in this work.
• Cross-function optimisation and inlining regardless of visibility, as requested above.
• Ability to debug an optimised program, because the JVM can 'deoptimize' the parts being inspected or debugged.
• Run the Linux version of the bitcode cross-platform.
• In "Managed Sulong" (which is an enhanced payware version), all allocations are garbage collected and bounds checked. Yes even for C/C++ programs, just needing a recompile. It's just about possible to do this within the rules of well defined C/C++, as long as the program doesn't try to do non-portable or non-defined things like pass stack pointers between threads. Managed Sulong can also sandbox LLVM code modules, and redirect IO back into the JVM for mapping via the pluggable socket/filesystem layers.
• Cross-language interop, including recent experimental support for e.g. casting Java or Ruby or Python objects to C++ objects and calling methods on them.
• All the same compiler optimisations as LLVM does still apply.
It's a pretty amazing and under-appreciated project. On the other hand the code being run inherits the limits of the JVM, e.g. you shouldn't try and fork the process (of course, lots of runtimes don't like being forked and on Windows you can't do it at all).
Or at least any that supports gathering and utilizing profiling information. Last time I checked, that didn't include Rust yet for instance, but my information could be out of date there.
Can I create profile data by running the instrumented binaries through testcases? Perhaps integration test cases which represent typical end to end functionality of my application.
Note that LLVM already has a function outliner; you may have seen these cold paths already if you've ever looked inside a binary on macOS and noticed functions with ".cold." in their names, which contain the unlikely-to-execute code cut out of the functions they came from. This appears to be an improvement on that effort.
Right. The improvement that this work brings is that it performs the function split very late and at a low level.
Basically, while the previous outliner split a function into two functions (the hot one literally calling the cold one as needed) this new thing takes a single function and splits it into two parts connected to each other by jumps. The cold part of the function isn't really a function --- it's just a group of basic blocks that happen to be located far away from the other group of basic blocks.
By avoiding the call into the cold function and the return to the hot function, the generated code can be smaller and more register-efficient.
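For intuition, the kind of source shape whose unlikely branch ends up in the cold part; a sketch with hypothetical helpers, where the hint could come from profile data or, as here, from GCC/Clang's __builtin_expect:

void report_error(int x);   // hypothetical helpers
void do_work(int x);

void handle(int x) {
  if (__builtin_expect(x < 0, 0)) {   // marked unlikely
    report_error(x);                  // cold: placed far away from the hot code
    return;
  }
  do_work(x);                         // hot path stays small and contiguous
}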
That'll make it more stack-efficient too, since it doesn't have to go through the same dance as a function call. The old approach probably didn't make those calls with the full C ABI, but it would still push at least a return address and, I suspect, save some registers across the call.
There’s no need, though: the entry point and return address are unique; it’s literally code that is sliced out, jumped to, and it jumps back to the function it was cut out of. The only thing you’d need to save is a register or two if you can’t make the jump without doing some math.
x86 should always be able to jump directly, and ARM sets aside x16 and x17 just for this kind of math. But all these jumps should be PC-relative, so you shouldn't have to clobber anything anyway.
Might be confusing to debuggers if the address-space range of a single function is discontiguous. Does the cold portion get an independent symbol with derived name, like, e.g., "Blocks?"
I'm interested in reading some more on this concept. Wikipedia, which is usually my first stop for a broad overview and maybe some linked research, seems to suggest that profile-directed/guided and feedback-guided are the same thing. Is there anywhere I can read about the varying approaches?
They are the same thing. The GP makes a distinction which does not exist. Witness the fact that two different compilers refer to the exact same feedback technique as either AutoFDO or SamplePGO.
Ah, I definitely worked in a place where that distinction in terminology was used, but maybe it isn't widespread. In any case, whatever you call either one of them, sampling from production processes can have benefits over sampling from synthetic workloads.
Maybe so. I think it was a novel idea to me because my intuition around profiling was formed with an assumption of delivering binaries to users, not running server processes. (It's probably also a pre-internet bias of thinking it would be hard to get prod data, whereas profiling data generated in a horrible enormous compile&run&profile&recompile process at least doesn't need to "go" anywhere.)
There is a distinction, even if you don't like the OP's terminology for differentiating the two things:
- With traditional PGO (ie -fprofile-generate, then -fprofile-use) you first generate a separate instrumented binary that records profile information. You wouldn't generally want to use this binary in production because of the overhead this profile generation incurs.
- With sample driven PGO (ie -fprofile-sample-use) you use external tools to sample profile information from an uninstrumented binary - the same binary you'd use in production.
It always surprises me how many people are familiar with LLVM internals. Just looking at the 60 unique commenters at the time of posting this is blowing my mind.
I wouldn't have expected any functions to be 5KiB or larger. That seems like a lot of assembly to me, and so it makes sense to me that these functions would be "accidentally" generated by inlining.
I'm curious what percent of functions are this large?
Protocol Buffers message decoders tend to be extremely large, since each message has an auto-generated parse function that will tend to inline everything. My local build of protobuf contains 20 functions each larger than 5KB, including a 10KB-long specialization of std::sort.
That google paper has a nice observation on glibc’s memcmp:
“memcmp clearly stands out of the correlation between call frequency and function size in Figure 8. It is both extremely frequent, and, at almost 6 KiB of code, 10× larger than memcpy which is conceptually of similar complexity. Examining its layout and execution patterns (Figure 13) suggests that it does suffer from the high amount of fragmentation we observed fleetwide in the previous section. While covering 90% of executed instructions in memcmp only requires two cache lines, getting up to 99% coverage requires 41 lines, or 2.6 KiB of cache capacity. Not only is more than 50% of code cold, but it is also interspersed between the relatively hot regions, and likely unnecessarily brought in by prefetchers. Such bloat is costly – performance counter data collected by GWP indicates that 8.2% of all i-cache misses among the 100 hottest functions are from memcmp alone.
A closer look at the actual code from glibc can explain the execution patterns in Figure 13. It is hand-written in assembly and precompiled, with extensive manual loop unrolling, many conditional cases for the various alignments of the two source arrays, and large jump tables.
In our experience, code usually evolves into similar state from over-reliance on micro-optimization and micro-benchmarking. While writing in assembly can in rare cases be a necessary evil, it prevents the compiler from doing even the most basic feedback-directed code layout optimizations.
[…]
We tested this hypothesis by macro-benchmarking a version of memcmp that is specifically optimized for code size (only 2 i-cache lines) and locality. In short, it only special-cases very small string sizes (to aid the compiler in inlining very fast cases) and falls back to rep cmps for larger compares. Even though it achieves slightly lower throughput numbers than the glibc version in micro-benchmarks, this simple proof-of-concept showed an overall 0.5%-1% end-to-end performance improvement on large-footprint workloads like web search.”
I would guess the assembly version can be improved by moving those cold blocks out of the cache lines of the hot code.
Totally non-scientific measurement (I took a few applications I had lying around, got size of symbols with nm -S) suggests that you're looking at around 0.1-1% of all functions are larger than 5KiB.
Interestingly, the largest function in all cases seemed to be something on the order of "initialize a large static hashtable."
I've seen plenty of c/c++/java functions that exceeded 200 lines of high level code. It seems reasonable to me that such a function would be 5k bytes of machine code, especially on a 64-bit architecture.
Not that such functions are the right way to structure your program...
This, plus multi-return (pushing multiple code addresses in the stack frame), together mean, I think, that we can just use Result/Either and never unwind, while leaving no performance behind.
There are a ton of great books out there! If you're looking for a bit of the whole shebang, check out Compilers: Principles, Techniques, and Tools (aka, The Dragon Book). It's an older book, but a lot of what is described in there is still very relevant and good info. If you want a more high level overview at first, I'd recommend Crafting Interpreters [0] and Writing a Compiler/Interpreter in Go [1]. The latter two focus more on lexing, parsing, and intermediate code generation.
The LLVM way of doing C++ is basically a wet dream when it comes to C++-isms e.g. not many C++ projects do their own dynamic casts and get away with it.
LLVM is fairly well-structured, especially compared to GCC. I don't find it scary but it is a big fat fucker of a codebase.
> The LLVM logo is a stylized wyvern (a kind of dragon). Dragons have connotations of power, speed and intelligence, and can also be sleek, elegant, and modular (err, maybe not). In addition, there is a series of influential compiler books going back to 1977 which featured dragons on the cover.
Leave it to HN to jump to nitpick something as meaningless as a wyvern vs. a dragon, but not even take 5 seconds to check the primary source.
2% better perf isn't anything to sneeze at but also not super drastic. Will be curious to see if this unlocks more optimizations over time.
I'm also curious if there's been any work on organizing code layout so that it can be paged out efficiently to reduce memory pressure (so that you're using all the code in the pages you bring in) and to reduce paging thrash (so you're not evicting a code page only to bring it back for a few hundred bytes).
Given the greatly improved TLB hit rate, I sort of wonder if the test isn't abusing the code enough. Does this change the inflection point where throughput falls from trying to run too many tasks in parallel or sequentially? 2% response time won't do much, but 10% higher peak load would change quite a few priorities on a lot of backlogs.
It's not too surprising to me. If TLB misses cause your program to run 6% slower, wouldn't you expect a 30% reduction in them to be roughly a 2% overall improvement (0.3 × 6% ≈ 1.8%)? Why would it be more? If you're surprised that TLB misses are not so costly, consider that cache locality and CPU prediction largely hide the cost of misses (cache locality means the miss rate is low to begin with; prediction reduces the cost of a miss). TLBs are also super complex, involving a few layers of caching (the TLB itself, page-table entries cached in L1, then L2, L3 and finally RAM).
So to use this new optimization pass you have to specify it explicitly with an additional command line flag instead of it simply being turned on as an ordinary part of PGO?
It seems like most people who just want to squeeze the last drop of speed out of their code will probably never become aware of it then unless they are specifically seeking it out
It's an experimental feature. If it works well for a while, it may become part of the default. But it should definitely not be forced on everyone if it's not yet sure how well it works.
One issue with PGO is you are optimizing for the particular subset of use cases which are profiled, on the exact machine you profile on. The 2% quoted improvement will not necessarily generalize across all use cases for the software and all machines it runs on. In fact, you may often be taking that 2% or more away from other use cases.
> One issue with PGO is you are optimizing for the particular subset of use cases which are profiled, on the exact machine you profile on.
The exact machine part is not true. There's nothing about this particular optimization that's machine specific - e.g. as the original post explains, this optimization gives performance boost on Intel and AMD, on Intel due to reduction in iTLB misses, and on AMD due to reduction in L1 and L2 icache misses. i.e. this kind of "working-set" reduction translates to any platform.
> In fact, you may often be taking that 2% or more away from other use cases.
In general, it is correct that profile-guided optimization can theoretically reduce performance, as some aggressive optimizations are only done with a profile because of the inherent trade-offs they carry (e.g. aggressive inlining, which can hurt performance if the functions that are actually hot turn out to be entirely different).
However, empirically this is not true in most cases, unless you picked really bad training input and your code behaves very differently under different inputs. Moreover, nowadays with sampled profiles, which you can collect from real production runs, this is extremely unlikely to happen.
Yes, taking advantage of PGO is hard. But with the right infrastructure you get very accurate profiles. For example many large companies will gather the profiles from the previous version running in production. I have even seen cases where the new version was fed replayed requests then recompiled with the recorded profiles.
So yes, PGO is hard, but if you collect representative profiles it is a massive improvement.
The technique itself isn't new in clang either. This is about a new implementation of it, where the difference to the existing implementation is that this happens later in the process (it's deferred to the machine-specific code generation phase, whereas the existing implementation happens in the middle-end and is target-agnostic).
The other major piece is that the hot-cold split is more efficient. Rather than thunking out the cold code via a function call, it just jumps to the basic block, avoiding register spilling and function call overhead.
The function call overhead itself is irrelevant, because by definition these blocks are cold. The saving/restoring of callee-clobbered registers does affect the code size of the hot function though, so that's important.
This was a common practice at Microsoft in the mid 90s, maybe 95-97ish. The toolset involved was called BBT: "Basic Block Tools". Windows, SQL Server, and others were post-processed with profiling data. It also deduped basic blocks, reducing binary bloat from inlining even without profiling data; it just needed some additional info in the debug symbols to work.
Identical code folding still breaks debugging with modern llvm. It can make it seem like the call came from an impossible place in a stack sample. Is it something that Microsoft solved long ago?