I assume that the mutex's memory is already warm in cache, so the processor doesn't have to go to RAM. It does, however, have to synchronize with any other processors/cores in the machine to prevent them from locking the mutex at the same time.

In an Intel Nehalem or Sandy Bridge system, this goes over the QuickPath Interconnect, which has a latency of ~20ns. HyperTransport fills the same role in AMD systems and probably has a similar latency, but I don't have numbers for that.
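
If you want a rough number for your own machine, a micro-benchmark along these lines may help. It is only a sketch: the function name `average_lock_unlock_ns` is made up for illustration, and it times an uncontended `std::mutex` from a single thread, so it shows the cost of the locked instruction on a warm cache line rather than the cross-socket hop described above. Contended or multi-socket results will likely be higher.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>

// Hypothetical helper: average cost, in nanoseconds, of one uncontended
// lock/unlock pair on a std::mutex, measured from a single thread.
double average_lock_unlock_ns(long iterations) {
    assert(iterations > 0);
    std::mutex m;

    // Warm-up: touch the mutex so its cache line is resident before timing.
    for (int i = 0; i < 1000; ++i) {
        m.lock();
        m.unlock();
    }

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        m.lock();
        m.unlock();
    }
    auto stop = std::chrono::steady_clock::now();

    // Convert the elapsed time to nanoseconds and average over all pairs.
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling `average_lock_unlock_ns(10000000)` from `main` and printing the result gives a per-pair average. On Linux you may need to build with `-pthread`, and the figure will vary with the compiler, the standard library, and the CPU.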

I'm basing this on the following presentation, especially the architecture diagram at 2m 40s:

http://www.infoq.com/presentations/Lock-free-Algorithms