The reason it doesn't help that much in performance is that the difference in how x86 and ARM manage memory is not that ARM reorders more and x86 reorders less, but that ARM reorders openly while x86 reorders behind your back and makes sure everything is where it should be if and when you look.
This is why the relatively primitive memory model of x86 has turned out to be so successful, despite the consensus in the field back in the '90s that it was way too restrictive for performance. It turns out that the x86 memory model is an easy target for a complex reordering backend to present as a "facade" between what is actually happening in the CPU's own L1 and buffers and what everything else thinks is happening.
The only real benefit of weaker memory orderings is that they don't need buffers and queues as large on the backend, since they can retire memory operations earlier. Except that if you just implement the ARM memory model as specified, without a similar facade, it will actually lose to a modern x86, because the x86 chip can use the freedom its machinery provides to reorder even more without any of it ever being visible to software. So you end up implementing a similar system to the x86's, with only very minor gains from the fact that you can sometimes retire accesses earlier and thus get a little more use out of your buffers.
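To make the "facade" point concrete, here is a minimal message-passing litmus test as a C++ sketch (relaxed atomics so the compiler emits plain loads and stores; the names and values are just illustrative):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<int> flag{0};

    void writer() {
        data.store(42, std::memory_order_relaxed);
        flag.store(1, std::memory_order_relaxed);  // on Arm, may become visible before data
    }

    void reader() {
        while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
        // Can print 0 on an Arm core (the reordering is visible); on x86
        // it always prints 42, whatever the backend reordered internally.
        std::printf("data = %d\n", data.load(std::memory_order_relaxed));
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }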
ARM, especially 64-bit ARM, has genuine advantages over x86. (Notably, decode.) The weak memory model is not one of them, outside the very weakest and tiniest cores (where it allows some reordering with a tiny load-store backend).
Yes. In particular, for a long time x86 had much cheaper memory barriers than the competition. Not only were store-release and load-acquire free, but sequentially consistent barriers were also fairly cheap. Intel had to add optimizations to make them fast much earlier, while weaker RISCs could get away with expensive barriers for much longer.
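At the source level that difference looks something like this (a sketch; the exact code generation varies by compiler and microarchitecture):

    #include <atomic>

    std::atomic<int> x{0};

    void release_store() {
        // On x86 an ordinary MOV store already has release semantics,
        // so this costs the same as a plain store.
        x.store(1, std::memory_order_release);
    }

    int acquire_load() {
        // Likewise, an ordinary MOV load already has acquire semantics.
        return x.load(std::memory_order_acquire);
    }

    void seq_cst_store() {
        // This one needs a full barrier: typically XCHG (or MOV + MFENCE)
        // on x86, which Intel made cheap early on; weakly ordered RISCs
        // historically paid much more for the equivalent full fence.
        x.store(1, std::memory_order_seq_cst);
    }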
Now apparently the M1 has very cheap atomics and barriers, so it is doing as much tracking and speculation as an x86 CPU. But it can probably get away with tracking fewer memory operations, which might improve both performance and power usage.
And on Arm cores you aren't obligated to provide a weak memory model. Some Arm vendors ship cores with stronger memory models: mainly TSO in Arm server land, and sequential consistency for NVIDIA's in-house CPUs.
Retiring memory operations earlier must have some benefit though? If data is being passed between threads, earlier retirement of a store means the data is available earlier to other CPUs. For a mutex, it means the mutex is released sooner, no?
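Sketching what I mean with a hypothetical spinlock:

    #include <atomic>

    // unlock() is a single release store: the earlier it retires and
    // drains out of the store buffer, the sooner another core can observe
    // locked == false and take the lock.
    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
        }
        void unlock() {
            locked.store(false, std::memory_order_release);
        }
    };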
32-bit fixed-size instructions that are aligned. That allows you to have 8-wide decoders, for example.

On x86 you can't really do this (and it hasn't happened), because instructions are 1 to 15 bytes long, and you need to decode the prior instruction at least in part to determine where the current one starts. (Currently 4-wide decoders are available for x86 uarches.)
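A sketch of why fixed width makes wide decode cheap: each decode slot can grab its bytes independently, with no serial length computation (illustrative code, not how real decoders are built):

    #include <cstdint>
    #include <cstring>

    // With fixed 4-byte instructions, decode slot i simply reads bytes
    // [4*i, 4*i + 4) of the fetch block -- all slots work in parallel.
    uint32_t fetch_slot(const uint8_t* fetch_block, int slot) {
        uint32_t insn;
        std::memcpy(&insn, fetch_block + 4 * slot, sizeof insn);
        return insn;
    }

    // With x86's 1-to-15-byte instructions, slot i can't know its start
    // offset until slots 0..i-1 have been at least partially decoded.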
Fixed-width instructions are much easier to decode at the same time as their neighbours.

Density is a tradeoff you have to think about, but x86 is not a clean ISA.

For a more alien example, the Mill CPU has very long variable-length instructions for density reasons, but because it is specifically designed around that and doesn't rely on any implicit parallelism, it can use tricks to find instruction boundaries much more easily.
Not having suffixes helps too. AFAIK, on x86, to find the length of an instruction you first have to decode the instruction itself (the decoding can vary depending on the prefixes) to know whether it is followed by a ModRM byte, then decode the ModRM byte to know whether it is followed by an immediate and how big that immediate is; only then do you know where the next instruction starts. The ModRM byte (and the SIB byte) can be thought of as a "suffix" of sorts to the instruction.

Contrast that with RISC-V, for instance, where the first byte of every instruction has everything necessary to determine the instruction length (and you only need 2 bits of that byte unless your core supports instructions longer than 4 bytes).
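As a sketch, for a core with only 16- and 32-bit encodings the length check is a two-bit test on the first halfword:

    #include <cstdint>

    // RISC-V: if the low two bits of the first 16-bit parcel are both 1,
    // the instruction is 32 bits (or longer); anything else means a 16-bit
    // compressed instruction. Assumes no encodings longer than 4 bytes.
    int rv_insn_length(uint16_t first_parcel) {
        return (first_parcel & 0x3) == 0x3 ? 4 : 2;
    }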