A credible but unconfirmed rumor I've read is that Intel didn't want to do it because of the 512-bit shuffles. The E-cores (including those using the upcoming Skymont microarchitecture) are natively 128-bit and already double-pump AVX2, so to quad-pump AVX-512 would result in many-uops 512-bit shuffle instructions made out of 128-bit uops.
This would render the shuffles unusable, because you'd unpredictably have them costing either 1 uop or cycle to taking 10-30 uops/cycles depending on which core you are on at the moment. A situation similar to PEXT/PDEP, which cost almost nothing on Intel and dozens of cycles on AMD until a couple generations ago.
Why does Zen 4 not have this problem? First, they're only double-pumping instead of quad-pumping. Secondly, while most of their AVX-512 implementation is double-pumped, there seems to be a full-length shuffle unit in there.
Interesting stuff! If a shuffle is just crossed wires without any logic (as far as I know shuffle is a completely predetermined bit permutation), wouldn't it fit on the E-cores easily enough?
AVX-512 allows arbitrary shuffles, e.g., shuffle the 64 bytes in zmm0 with indices from zmm1 into zmm2. Simple shuffles like unpacks etc aren't really an issue.
Worse yet (for wiring complexity or required uops, anyway), AVX-512 also has shuffles with two data inputs, i.e. each of the 64 bytes of result can come from any of 128 different input bytes, selected by another 64-byte register.
Those large shuffles are really powerful for things like lookup tables. Large tables are suddenly way more feasible in-register, letting you replace a costly gather with an in-register permute.
This would render the shuffles unusable, because you'd unpredictably have them costing either 1 uop or cycle to taking 10-30 uops/cycles depending on which core you are on at the moment. A situation similar to PEXT/PDEP, which cost almost nothing on Intel and dozens of cycles on AMD until a couple generations ago.
Why does Zen 4 not have this problem? First, they're only double-pumping instead of quad-pumping. Secondly, while most of their AVX-512 implementation is double-pumped, there seems to be a full-length shuffle unit in there.