> Inlining heuristics used for -O3 are architecture-specific […]

Yes, the inlining strategy is target-dependent and does not yield the same results across different ISAs. Moreover, it does not even guarantee the same result within a single ISA if the code is compiled for a specific CPU submodel with a different instruction cache line size.

Especially on x86, where instructions are variable-length, it is common for an instruction to inadvertently straddle two I-cache lines, which yields a substantial penalty on a performance-critical code path. Therefore, a common technique for the optimiser was (I have not checked recently, though) to take the I-cache line size into account: if a fat code sequence were to cross a cache line boundary, the rest of the current line is filled with NOPs and the sequence is placed at the start of the next line. The problem is nearly non-existent for fixed-length RISC ISAs, although I would surmise that one has to watch out for tight loops anyway.

> I would expect the -O2 numbers to reflect better the actual ISA capabilities.

I would go on to add that today «-O3 -fno-inline-functions» would give a more accurate and faithful reflection of generic ISA capabilities. For a long time, «-O3» was no more than «-O2 -finline-functions»; since then, however, further optimisations have been added to «-O3» that are rather useful generic optimisations for modern CPUs (e.g. loop vectorisation and more). The article is especially lacklustre in this particular regard, as the author does not go beyond generic bloviations and makes no attempt to understand what hides behind «-O3».