Hacker News

This got longer than I intended, but that just illustrates that the rabbit hole is deeper than many people are willing to go.

The data can be wrong along several dimensions. In general, the code that gets called the most sometimes gets underreported. Look at invocation counts. Ask questions.

One, the hottest functions can evict the CPU cache, causing sibling functions to be overreported (one of many reasons that fixing a slow function may bring less of an improvement than expected). This shows up strongest when I've gotten a 10% runtime improvement from fixing a function reported as 5% of total time, or a 10x reduction in time from removing half the calls to a remote service.

Two, duplicate or idiomatic code will be scattered across different parts of the codebase, reducing the apparent magnitude of a pattern of logic below most people's threshold of attention. Code that lives in four places may collectively represent 5% of total time and never get looked at, even if it's an easy fix. The nickels and dimes add up quickly.
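A toy sketch of that aggregation, with invented function names and percentages: no single copy of the duplicated helper crosses a per-entry attention threshold, but the copies together do.

```python
from collections import defaultdict

# Hypothetical flat-profile rows: (function name, % of total self time).
rows = [
    ("render", 4.0),
    ("parse_date", 1.3),  # copy one
    ("query", 6.0),
    ("parse_date", 1.2),  # copy two, in another module
    ("parse_date", 1.4),  # copy three
    ("parse_date", 1.1),  # copy four
]

# Group the scattered entries by name and sum their shares.
totals = defaultdict(float)
for name, pct in rows:
    totals[name] += pct

# No single copy crosses a 2% attention threshold...
assert all(pct < 2.0 for name, pct in rows if name == "parse_date")
# ...but together the copies are 5% of total time.
assert round(totals["parse_date"], 1) == 5.0
```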

Three, functions that operate near the limits of the clock resolution will be counted properly, but their cumulative time gets rounded down, again putting them below most people's noise floor. I've made a lot of hay by re-sorting the results by invocation count and benchmarking changes to functions that never hit the top-20 list.
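In Python, for example, re-sorting a profile by call count rather than cumulative time is a one-line change to how the stats are printed (`tiny` and `hot_path` here are invented stand-ins for a cheap, high-frequency function and its caller):

```python
import cProfile
import io
import pstats

def tiny():
    # Cheap enough to sit near the profiler clock's resolution.
    return 1 + 1

def hot_path():
    for _ in range(100_000):
        tiny()

prof = cProfile.Profile()
prof.enable()
hot_path()
prof.disable()

# Sort by call count instead of cumulative time, so high-frequency,
# low-cost functions surface at the top of the report.
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("ncalls").print_stats(10)
report = buf.getvalue()
```

With `sort_stats("ncalls")`, `tiny` leads the listing with 100000 calls even if its cumulative time rounds to almost nothing.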

And last but not least, functions that create but don't destroy their own memory allocations end up shedding timing that gets picked up by the recipients, homogenizing the timing graph. In GCed languages especially, a function called in a loop that exhausts most but not all of the free heap will invariably stall out the next phase of a large calculation.

With all four of these, there are several failures of imagination I commonly saw. A flat timing graph gets taken as evidence that it is time to stop optimizing, even if the target improvement has not been met. When the "tent poles" are even, getting people to care how tall they are is challenging.

Few people will make six changes to the code to achieve a 10% speedup. The Rust compiler team may be the first people, other than me, I've witnessed bragging about a 1% performance gain. Most either don't do it, or they apologize for achieving so little. But 10% is 10%, and I don't care where you found it, as long as your code quality stays high enough.

And perhaps most importantly, speed improvements are multiplicative, not additive. Use a time budget the way game devs do. If an interaction takes 10x as long as your target, every method that takes 0.5% of current run time represents 5% of your target. This, in particular, is how you spot duplicated slow code. You ignore it at your peril.
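The budget arithmetic is worth writing out once. A toy sketch with invented numbers, assuming a game-dev-style fixed time budget:

```python
# Made-up numbers: the interaction currently takes 10x its budget.
target_ms = 100.0            # the interaction budget we want to hit
current_ms = 10 * target_ms  # where we actually are: 1000 ms

# A method at 0.5% of *current* run time...
method_ms = 0.005 * current_ms
# ...is 5% of the *target* budget, so it is worth chasing.
share_of_target = method_ms / target_ms
assert abs(share_of_target - 0.05) < 1e-12

# Speedups multiply rather than add: six changes of ~1.7% each...
overall = 1.0
for _ in range(6):
    overall *= 1.017
# ...compound to a bit more than 10% overall.
assert 1.10 < overall < 1.11
```

The same compounding is why many small wins, sustained release after release, outrun one heroic fix.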

Particularly in the era of QA cycles, using a "zone defense" (optimizing one module at a time instead of going after the tall tent poles) netted a higher rate of return per hour of labor and a much larger cumulative improvement, both on its own merits and by increasing the budget allotted to perf. Concentrating the changes in one area at a time seems to decrease optimization fatigue. On two projects I kept that initiative alive for more than two years, and I was the one who gave up, because I had cycled around the entire system and was finding few things left to correct (you only learn new tricks so fast, and the gains go asymptotic eventually). Everyone came to expect every release to be a bit faster than the previous one, and I got feedback that this narrative increased sales and manager buy-in. People will take a chance on you when they like where you're going, even if you haven't gotten there yet. Narrative matters, if they trust you, and usually you have to earn it.



Yes, I agree with you - I just don't see that as a failure of flame charts.

You still need to understand how the data was gathered and what it really means. It is also true that flame charts are more useful on projects that are not yet heavily optimized; they clearly show the parts that are very slow compared to others. And when you don't see anything else that can be optimized, it doesn't mean that everything is optimal - the limit is imagination (and knowledge), not the problem space.



