For applications where the performance is determined by array operations, which can leverage AVX-512 instructions, an AMD Zen 5 core has better performance per area and per power than any ARM-based core, with the possible exception of the Fujitsu custom cores.

The Apple cores themselves do not have great performance for array operations. When the CPU cores are considered together with the shared SME/AMX accelerator, the aggregate might have good performance per area and per power consumption, but that cannot be known with certainty, because Apple does not provide information usable for comparison purposes.

The comparison is easy only with the cores designed by Arm Holdings. For array operations, the best performance among the Arm-designed cores is obtained by Cortex-X4 a.k.a. Neoverse V3. Cortex-A720 and Cortex-A725 have half the number of SIMD pipelines but more than half of the area, while Cortex-X925 has only 50% more SIMD pipelines but double the area. Intel's Skymont a.k.a. Darkmont has the same area and the same number of SIMD pipelines as Cortex-X4, so like Cortex-X4 it is also more efficient than the much bigger Lion Cove core, which is faster on average for non-optimized programs but has the same maximum throughput for optimized programs.
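The ratios in the paragraph above can be turned into a rough "SIMD pipelines per unit area" figure, normalised so that Cortex-X4 is 1.0. This is only a sketch: the variable names are made up for illustration, and because the text says only that Cortex-A720/A725 take "more than half" of the area, the value 0.5 is used as an optimistic bound, so their real figure is below what is computed here.

```python
# SIMD pipelines and core area relative to Cortex-X4 (= 1.0 for both).
# The ratios come from the paragraph above. The A720/A725 area is only known
# to be "more than half", so 0.5 is a lower bound on area (an assumption that
# flatters those cores).
cores = {
    "Cortex-X4": {"pipes": 1.0, "area": 1.0},
    "Cortex-A720/A725 (upper bound)": {"pipes": 0.5, "area": 0.5},
    "Cortex-X925": {"pipes": 1.5, "area": 2.0},
}

# Pipelines per unit area, relative to Cortex-X4.
density = {name: c["pipes"] / c["area"] for name, c in cores.items()}

for name, value in density.items():
    print(f"{name}: {value:.2f}")
```

With these inputs Cortex-X925 comes out at three quarters of the Cortex-X4 figure, and Cortex-A720/A725 cannot exceed the Cortex-X4 figure, which is why Cortex-X4 is the best of the Arm-designed cores by this measure.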

When compared with Cortex-X4/Neoverse V3, a Zen 5 compact core has a throughput for array operations that can be up to double, while the area of a Zen 5 compact core is less than double the area of an Arm Cortex-X4. A high-clock-frequency Zen 5 core has more than double the area of a Cortex-X4, but due to the high clock frequency it still has better performance per area, although, unlike the Zen 5 compact cores, it no longer also has better performance per power consumption.

So the ISA advantage of AArch64, which results in a simpler and smaller CPU core frontend, is not enough to ensure better performance per area and per power consumption when the backend, i.e. the execution units, does not itself have good enough performance per area and per power consumption.

The area of Arm Cortex-X4 and of the very similar Intel Skymont core is about 1.7 square mm in a "3 nm" TSMC process (both including 1 MB of L2 cache memory). The area of a Zen 5 compact core in a "4 nm" TSMC process (with 1 MB of L2) is about 3 square mm (in Strix Point). The area of a Zen 5 compact core with full SIMD pipelines must be greater, but not by much, perhaps by 10%, and if it were made in the same "3 nm" process as Cortex-X4 and Skymont, the area would shrink, perhaps by 20% to 25% (depending on the fraction of the area occupied by SRAM). In any case there is little doubt that the area of a Zen 5 compact core with full 512-bit SIMD pipelines, in the same fabrication process, would be less than 3.4 square mm (= double Cortex-X4), leading to better performance per area and per power consumption than for either Cortex-X4 or Skymont (this considers only the maximum throughput for optimized programs, but for non-optimized programs the advantage could be even greater for Zen 5, which has a higher IPC on average).
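The arithmetic behind that estimate can be spelled out as a small sketch. All inputs are the approximate figures and guesses from the paragraph above (the 10% overhead and the 20% to 25% shrink are explicitly "perhaps" values), the 2x throughput ratio is the "up to double" upper bound, and the function and constant names are hypothetical, chosen only for this illustration.

```python
X4_AREA_MM2 = 1.7            # Cortex-X4 / Skymont, "3 nm", incl. 1 MB L2 (from the text)
ZEN5C_AREA_MM2 = 3.0         # Zen 5 compact in Strix Point, "4 nm", incl. 1 MB L2 (from the text)
FULL_SIMD_OVERHEAD = 0.10    # guess from the text: full 512-bit pipelines add ~10% area
SHRINK_RANGE = (0.20, 0.25)  # guess from the text: "4 nm" -> "3 nm" area reduction
THROUGHPUT_RATIO = 2.0       # upper bound from the text: "up to double" vs Cortex-X4


def zen5c_area_3nm(shrink: float) -> float:
    """Estimated area of a full-SIMD Zen 5 compact core after a process shrink."""
    return ZEN5C_AREA_MM2 * (1 + FULL_SIMD_OVERHEAD) * (1 - shrink)


def perf_per_area_advantage(area_mm2: float) -> float:
    """Peak-throughput-per-area ratio versus Cortex-X4 (values > 1 favour Zen 5)."""
    return THROUGHPUT_RATIO / (area_mm2 / X4_AREA_MM2)


for shrink in SHRINK_RANGE:
    area = zen5c_area_3nm(shrink)
    print(f"shrink {shrink:.0%}: area {area:.2f} mm^2, "
          f"below 2x Cortex-X4: {area < 2 * X4_AREA_MM2}, "
          f"perf/area advantage up to {perf_per_area_advantage(area):.2f}x")
```

Under these assumptions the estimated area lands at roughly 2.5 to 2.6 square mm, comfortably below the 3.4 square mm break-even point, so the conclusion does not depend on the exact shrink figure. The advantage printed is an upper bound, since it uses the best-case throughput ratio.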

Cores like Arm Cortex-X4/Neoverse V3 (also Intel Skymont/Darkmont) are optimal from the POV of performance per area and per power consumption only for applications that are dominated by irregular integer and pointer operations, which cannot be accelerated using array operations (e.g. the compilation of software projects). Until now, with the exception of the Fujitsu custom cores, which are inaccessible to most computer users, no Arm-based CPU core has been suitable for scientific/technical computing, because none has had enough performance per area and per power consumption when performing array operations. For a given socket, both the total die area inside the package and the total power consumption are limited, so the performance per area and per power consumption of a CPU core determines the performance per socket that can be achieved.