> One issue with PGO is you are optimizing for the particular subset of use cases which are profiled, on the exact machine you profile on.
The "exact machine" part is not true. There is nothing machine-specific about this particular optimization: as the original post explains, it gives a performance boost on both Intel and AMD, on Intel due to a reduction in iTLB misses and on AMD due to a reduction in L1 and L2 icache misses. This kind of "working-set" reduction translates to any platform.
> In fact, you may often be taking that 2% or more away from other use cases.
In general, it is correct that profile-guided optimization can in theory reduce performance, because some of the aggressive optimizations are only done with a profile precisely because of the inherent trade-offs they carry (e.g. aggressive inlining, which can hurt performance if the functions that are hot in production are entirely different from the ones that were hot during training).
However, empirically this is rarely the case, unless you pick really bad training input and your code behaves very differently under different inputs. Moreover, nowadays with sampled profiles, which you can collect from real production runs, this is extremely unlikely to happen.
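To make the inlining trade-off above concrete, here is a minimal sketch. The function names and the 64-byte cutoff are hypothetical, not anything from the original post; the point is just to show the kind of branch a profile teaches the compiler about.

```cpp
// Illustrative sketch only: handle_small/handle_large and the size cutoff are
// made up. If the training input only ever takes the "small" path, a PGO build
// may aggressively inline handle_small() into process() and demote
// handle_large() to a cold section -- exactly the trade-off described above.
#include <cstdint>
#include <cstdio>
#include <vector>

// Cheap path; a profile dominated by small messages makes this the obvious
// candidate for inlining and hot placement.
static uint64_t handle_small(const std::vector<uint8_t>& msg) {
    uint64_t sum = 0;
    for (uint8_t b : msg) sum += b;
    return sum;
}

// Expensive path; if the profile says it is cold, the compiler may keep it
// out-of-line and place it far from the hot code.
static uint64_t handle_large(const std::vector<uint8_t>& msg) {
    uint64_t sum = 0;
    for (size_t i = 0; i < msg.size(); ++i) sum += msg[i] * (i + 1);
    return sum;
}

uint64_t process(const std::vector<uint8_t>& msg) {
    // PGO learns the real probability of this branch from the training run.
    // If production traffic is mostly large messages but the training input
    // was mostly small ones, the "optimized" inlining and layout decisions
    // penalize the common production case.
    if (msg.size() < 64) {
        return handle_small(msg);
    }
    return handle_large(msg);
}

int main() {
    std::vector<uint8_t> small(16, 1), large(4096, 1);
    std::printf("%llu %llu\n",
                (unsigned long long)process(small),
                (unsigned long long)process(large));
}
```

With sampled profiles this mismatch largely goes away, since the profile comes from production itself rather than from a synthetic training run; e.g. with LLVM a perf-derived sample profile can be fed back via -fprofile-sample-use (the exact collection pipeline varies by toolchain).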