I actually don't think that's true. My understanding is that on x86, some atomic instructions carry an implicit LOCK prefix. (Or you can make certain instructions atomic by adding a LOCK prefix yourself.) Such instructions lock the bus (or, on modern CPUs, just the affected cache line) and prevent other cores or SMT threads from accessing that memory while the operation completes. In that way, you can safely perform an atomic operation on a value in the cache.
Note that this implies that atomic operations slow down other cores and SMT threads.
Locks are often implemented using an xchg instruction, which is implicitly locked.
All processors' caches are committed/flushed for the affected cache line, so it's correct to say other processors are slowed down. But in that sense it also IS a main memory operation, just not yours.
Just because a lock is shared does not mean that it's contended.
For example, some multi-threading techniques attempt to access only CPU-local data but use locks purely to guard against the case where a process is moved across CPUs in the middle of an operation (thus defeating the best-effort CPU-locality).
Maybe I'm missing something, but if the cache line is only in use by one CPU, I don't see why the value would need to be immediately propagated to main memory or to any other CPU's cache until it is written as part of the normal cache write-back.
Correct. Typically the cache snoops the main memory bus. If a remote CPU starts a read on a cached memory location, the caching CPU sends a "stall" or "retry" signal to the reader, does a cache flush to main memory, and then lets the remote CPU proceed with the (now correct) main memory read.