I'm pretty sure it's because of the store forwarding hardware. Intel CPUs can satisfy a load from a recently stored address with low latency, basically by keeping pending stores (address and data) in a store buffer in the pipeline and forwarding the value to the load before the store actually commits to cache. But that means the issuing CPU sees its own store as done well before it becomes visible to the other CPUs, whose loads in the meantime can still be filled from their own caches with the old value. So two CPUs can each see their own store as ordered before the other's, and there's no way to preserve both this optimization and a single store order that every CPU agrees on.
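
To make the disagreement concrete, here's a toy model in Python. It's only a sketch: the `Core` class, the `run_litmus` function and the interleaving are all made up for illustration, and real hardware is far more involved. Each core gets a FIFO store buffer in front of a shared memory, and a load checks the core's own buffer first (the forwarding) before falling back to memory.

```python
from collections import deque


class Core:
    """Hypothetical model of one CPU with a private FIFO store buffer."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.store_buffer = deque()   # pending (address, value) pairs, oldest first

    def store(self, addr, value):
        # The store is only buffered here; other cores cannot see it yet.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Store forwarding: the newest buffered store to this address wins.
        for buffered_addr, value in reversed(self.store_buffer):
            if buffered_addr == addr:
                return value
        # Otherwise read the globally visible value.
        return self.memory[addr]

    def drain(self):
        # Commit buffered stores to shared memory in program order.
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value


def run_litmus():
    memory = {"x": 0, "y": 0}
    core0, core1 = Core(memory), Core(memory)

    core0.store("x", 1)           # core 0 writes x
    core1.store("y", 1)           # core 1 writes y

    r0_own = core0.load("x")      # forwarded from core 0's own buffer
    r0_other = core0.load("y")    # core 1's store has not committed yet
    r1_own = core1.load("y")      # forwarded from core 1's own buffer
    r1_other = core1.load("x")    # core 0's store has not committed yet

    core0.drain()
    core1.drain()
    return (r0_own, r0_other), (r1_own, r1_other), dict(memory)


if __name__ == "__main__":
    print(run_litmus())
```

In this interleaving core 0 reads x as 1 and y as 0, so from its point of view the store to x came first, while core 1 reads y as 1 and x as 0 and concludes the opposite. Both stores do land in memory eventually, but the two cores never agreed on which one happened first.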