I wrote an optimized FFT for fun a while ago and a lot of this is quite familiar. Optimized FFTs are a fascinating field with a long history. I wouldn't recommend writing one from scratch for production instead of using an existing library, but it's a good exercise.

Using a real-to-complex FFT makes a big difference for performance, and it's worth designing for from the start because it places some additional constraints on the main FFT. In particular, the butterfly needed in the r2c and c2r passes isn't very amenable to working in bit-reversed order, so the trick of processing the frequency domain in bit-reversed order doesn't necessarily work. It's also important for comparison against the Fast Hartley Transform, which looks good performance-wise against a complex FFT but not against a real FFT.
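To make the r2c constraint concrete, here is a minimal Python sketch of the common way to get a length-N real FFT out of a length-N/2 complex FFT. It is not my actual code: the function names are made up for illustration, and a naive O(N^2) DFT stands in for the optimized complex FFT. The thing to notice is that the post-processing butterfly combines bin k with bin N/2-k, which is a natural pairing in linear order but not in bit-reversed order.

```python
import cmath

def naive_dft(x):
    """Reference O(n^2) DFT; a stand-in for an optimized complex FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def rfft_via_half_length(x):
    """Real-to-complex FFT of even length n using one complex FFT of length n/2.

    Returns bins 0..n/2 inclusive (the rest follow by conjugate symmetry).
    """
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("length must be even and at least 2")
    m = n // 2
    # Pack even samples into the real part and odd samples into the imaginary part.
    z = [complex(x[2 * j], x[2 * j + 1]) for j in range(m)]
    Z = naive_dft(z)
    out = []
    for k in range(m + 1):
        zk = Z[k % m]
        zc = Z[(m - k) % m].conjugate()   # mirrored bin: this is the k <-> n/2-k pairing
        even = 0.5 * (zk + zc)            # spectrum of the even-indexed samples
        odd = -0.5j * (zk - zc)           # spectrum of the odd-indexed samples
        out.append(even + cmath.exp(-2j * cmath.pi * k / n) * odd)
    return out
```

In a real implementation the loop would handle k and N/2-k together in one butterfly and the twiddles would come from a table, but the data dependency is the same.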

I also found that radix-4 performed better than split-radix or conjugate pair FFT with SSE2/AVX SIMD. Both the instruction flow and the data flow are cleaner, and the CPU has an easier time flooding the FMA units with simple loops than with the more chaotic data flow of SRFFT/CPFFT. An FMA-based radix-4 loop can easily keep the FMA units at >95% utilization.
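For reference, the radix-4 butterfly itself is short. Below is a scalar, recursive decimation-in-time sketch in Python, written for readability rather than speed; the names are hypothetical and it says nothing about how the SIMD version is laid out. It assumes a power-of-two length and falls back to a radix-2 step when the length is 2.

```python
import cmath

def naive_dft(x):
    """Reference O(n^2) DFT used only for checking results."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_radix4(x):
    """Recursive radix-4 DIT FFT for power-of-two lengths."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        # Radix-2 base case for lengths that are not a power of four.
        return [complex(x[0] + x[1]), complex(x[0] - x[1])]
    q = n // 4
    # Four quarter-length sub-transforms over samples with index r mod 4.
    f = [fft_radix4(x[r::4]) for r in range(4)]
    out = [0j] * n
    for k in range(q):
        w1 = cmath.exp(-2j * cmath.pi * k / n)
        w2 = w1 * w1
        w3 = w2 * w1
        a = f[0][k]
        b = w1 * f[1][k]
        c = w2 * f[2][k]
        d = w3 * f[3][k]
        # The butterfly: after the twiddles, only adds and a multiply by -i.
        s0 = a + c
        s1 = a - c
        s2 = b + d
        s3 = -1j * (b - d)
        out[k] = s0 + s2
        out[k + q] = s1 + s3
        out[k + 2 * q] = s0 - s2
        out[k + 3 * q] = s1 - s3
    return out
```

Every output in the loop body follows the same pattern with no special cases, which is the kind of regularity that makes the loop easy to vectorize.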

For data ordering, the vector-interleaved format mentioned is indeed great for the main passes, but real/imag interleaved turns out to have some benefits for the smallest butterflies. What worked best in my case was to do the deinterleave/transpose as part of an initial radix-8 pass that also handled the bit reversal a cache line at a time.
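To illustrate the two layouts, here is a small Python sketch of the conversion. I'm assuming "vector-interleaved" means blocks of one SIMD vector's worth of real parts followed by the matching imaginary parts; the `width` parameter (lanes per vector) and the function names are made up for the example. It leaves out the bit reversal and the radix-8 pass entirely.

```python
def to_vector_interleaved(data, width=4):
    """[re0, im0, re1, im1, ...] -> [re0..re(w-1), im0..im(w-1), re(w).., ...]."""
    if len(data) % (2 * width):
        raise ValueError("length must be a multiple of 2 * width")
    out = []
    for base in range(0, len(data), 2 * width):
        block = data[base:base + 2 * width]
        out.extend(block[0::2])   # real parts of this block
        out.extend(block[1::2])   # imaginary parts of this block
    return out

def from_vector_interleaved(data, width=4):
    """Inverse of to_vector_interleaved."""
    if len(data) % (2 * width):
        raise ValueError("length must be a multiple of 2 * width")
    out = []
    for base in range(0, len(data), 2 * width):
        re = data[base:base + width]
        im = data[base + width:base + 2 * width]
        for r, i in zip(re, im):
            out.extend((r, i))
    return out
```

With `width=4`, the input `[0, 1, 2, ..., 7]` (read as four re/im pairs) becomes `[0, 2, 4, 6, 1, 3, 5, 7]`. In SIMD code this is a handful of shuffles per block, which is why it's cheap to fold into an early pass.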