Trying to reduce high-end processor performance to "operation X takes Y cycles" probably confuses newcomers more than it helps once you get beyond "cache miss bad".

For the uninitiated, most high-performance CPUs of recent years:

- Are massively out-of-order. They will run any operation whose inputs are all ready in the next available slot of the right type.

- Have multiple functional units. A recent Apple CPU can and will run 5+ different integer ops, 3+ loads/stores and 3+ floating-point ops per cycle if it can feed them all. It may well also do register renames on the fly for free.

- Have pipelined functional units. You can throw one op into the front of the pipe each cycle, but the result may not be available for consumption until 3-20 cycles later (latency depends on the type of op and on whether the result can bypass straight into the next op executed).

- Speculate on branch results. If they get one wrong, they need to flush the pipeline and do the right thing.

- Are subject to assorted hazards, so the same op can take more or fewer cycles than it would in a different situation.
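
To make the latency-versus-throughput point concrete, here is a toy scheduling model in Python. It is only a sketch and not a model of any real core: the `simulate` function, the single uniform `latency`, and the `issue_width` parameter are all made up for illustration, and it ignores branches, caches, renaming and every other hazard mentioned above.

```python
def simulate(ops, latency, issue_width):
    """Toy out-of-order model (hypothetical, for illustration only).

    ops[i] is the list of earlier op indices whose results op i needs.
    Every op has the same latency; up to issue_width ops may start per cycle.
    Returns the cycle at which the last result becomes available.
    """
    ready_at = [None] * len(ops)      # cycle at which each result is usable
    pending = list(range(len(ops)))   # ops not yet issued, oldest first
    cycle = 0
    while pending:
        issued = 0
        still_pending = []
        for i in pending:
            inputs_ready = all(
                ready_at[d] is not None and ready_at[d] <= cycle
                for d in ops[i]
            )
            if inputs_ready and issued < issue_width:
                # Pipelined unit: the op starts now, result appears later.
                ready_at[i] = cycle + latency
                issued += 1
            else:
                still_pending.append(i)
        pending = still_pending
        cycle += 1
    return max(ready_at, default=0)


# Eight ops where each one needs the previous result (a dependency chain).
chain = [[]] + [[i - 1] for i in range(1, 8)]
# Eight ops with no dependencies between them.
independent = [[] for _ in range(8)]

print(simulate(chain, latency=4, issue_width=1))        # 32
print(simulate(independent, latency=4, issue_width=1))  # 11
print(simulate(independent, latency=4, issue_width=4))  # 5
print(simulate(chain, latency=4, issue_width=4))        # 32
```

The same "4-cycle" op costs anywhere from well under one cycle each (8 ops in 5 cycles) to a full 4 cycles each (8 ops in 32 cycles), depending purely on the dependencies around it and how wide the machine is. Extra functional units do nothing for the dependency chain.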