I guess you mean wrapping very small sections of the program with timing code and making sure the CPU executes the timing instructions in the right order. Maybe that works, but it sounds like a choice between poor accuracy and massive overhead. On Linux on x86 you can use perf to read the CPU's hardware performance counters (exposed through model-specific registers), which cover a large number of events, from cache hits to branch mispredictions to all kinds of crazy things. I’m not aware of an equivalent to perf on macOS, and I’m not sure the kernel provides appropriate APIs. Maybe Instruments is it, but I’m not sure that it can give access to those counters (they surely exist on the M1 CPUs) or that it can be run non-interactively.
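To make the accuracy-versus-overhead worry about wrapping small sections concrete, here is a rough Python sketch. It is only an illustration: `timer_overhead_ns`, `timed_section_ns` and `tiny_work` are names I made up, and it uses the wall-clock `time.perf_counter_ns()` rather than any hardware counter. The idea is to compare the cost of an empty pair of timer calls with the measured time of a very small piece of work.

```python
import time


def timer_overhead_ns(samples: int = 100_000) -> float:
    """Average time between two back-to-back timer reads, in nanoseconds."""
    total = 0
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        total += end - start
    return total / samples


def timed_section_ns(work, samples: int = 100_000) -> float:
    """Average measured time of work() when wrapped with timer reads."""
    total = 0
    for _ in range(samples):
        start = time.perf_counter_ns()
        work()
        end = time.perf_counter_ns()
        total += end - start
    return total / samples


def tiny_work() -> int:
    # Hypothetical stand-in for a "very small section" of a program.
    return 1 + 1


if __name__ == "__main__":
    overhead = timer_overhead_ns()
    measured = timed_section_ns(tiny_work)
    print(f"empty timer pair: {overhead:.1f} ns")
    print(f"wrapped tiny section: {measured:.1f} ns")
```

If the two numbers come out close to each other, most of what you measured is the timer itself, which is the problem with instrumenting sections this small. The exact values depend on the machine and the interpreter, so I would not read much into any single run.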